Compute Pools

A compute pool determines the compute resources allocated to process your queries within a data plane. When you execute a query, the compute pool controls how much processing power is available and whether those resources are shared with other users or dedicated to your workload. Compute pools are one of the four dimensions of your execution context, alongside data plane, database, and schema.

Compute pool types

Dedicated

Dedicated compute pools provide isolated resources reserved for your workloads. Your queries don’t compete with other users for processing power, which results in more predictable performance. Use dedicated compute pools when:

Running production workloads where performance consistency matters
Processing large or complex queries that need guaranteed resources
Operating time-sensitive pipelines where latency must stay predictable

Shared

Shared compute pools use pooled resources across multiple users. This is more cost-effective but means your query performance may vary depending on current platform load. Use shared compute pools when:

Running exploratory queries or ad-hoc analysis
Developing and testing queries before promoting to production
Working with smaller datasets where performance variability is acceptable

Snowflake warehouse

On Snowflake-based data planes, each compute pool maps to a Snowflake virtual warehouse. When you register warehouses through the Snowflake Native App, each warehouse becomes a compute pool on your data plane. You can register multiple warehouses to separate workloads—for example, a smaller warehouse for exploratory queries and a larger one for production pipelines. Each Snowflake compute pool has a collaboration policy that controls which companies can use it, and one pool can be designated as the default for the data plane.

AWS EMR

On AWS-based data planes, compute pools map to Amazon EMR Spark clusters that the data plane operator provisions, reuses across jobs targeting the same pool, and terminates when idle. Each pool has a configured size that determines the cluster’s worker memory budget and vCPU count. The operator schedules NQL jobs (including materialize-view, nql-forecast, datasets_sample, and datasets_execute_dml) onto the appropriate cluster based on the compute pool selected for the workload.

Sizes mirror Snowflake’s warehouse vocabulary (x_small through 6x_large) and target the same worker memory budget as the equivalent Snowflake warehouse, so a workload that fits in a Snowflake warehouse of size N gets equivalent RAM on the EMR cluster of size N. vCPU counts won’t match Snowflake because EMR uses memory-optimized instances (8 GiB per vCPU), so the same memory budget delivers fewer vCPUs.

Size	Worker memory	vCPU (max)	Notes
`x_small`	~32 GiB	~4	Fixed size; rounded up from Snowflake’s 16 GiB minimum
`small`	~32 GiB	~4	Fixed size
`medium`	~64 GiB	~8	Fixed size
`large`	~128 GiB	~16	Fixed size
`x_large`	~256 GiB	~32	EMR Managed Scaling
`2x_large`	~512 GiB	~64	EMR Managed Scaling
`3x_large`	~1 TiB	~128	EMR Managed Scaling
`4x_large`	~2 TiB	~256	EMR Managed Scaling
`5x_large`	~4 TiB	~512	EMR Managed Scaling
`6x_large`	~8 TiB	~1024	EMR Managed Scaling

Sizes x_large and above use EMR Managed Scaling: the cluster boots small and expands toward the maximum based on YARN load, then contracts back down when idle. Sizes large and below run at a fixed instance count.
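The sizing rules above can be sketched as a small lookup. This is an illustrative model only, not platform code: the memory figures are copied from the table, the vCPU ceiling is derived from the stated 8 GiB per vCPU ratio, and the names `EMR_SIZES`, `WORKER_MEMORY_GIB`, and `describeSize` are hypothetical.

```typescript
// Size names in ascending order, as listed in the table above.
const EMR_SIZES = [
  'x_small', 'small', 'medium', 'large', 'x_large',
  '2x_large', '3x_large', '4x_large', '5x_large', '6x_large',
] as const;
type EmrSize = (typeof EMR_SIZES)[number];

// Approximate worker memory budgets in GiB, copied from the table.
const WORKER_MEMORY_GIB: Record<EmrSize, number> = {
  x_small: 32, small: 32, medium: 64, large: 128, x_large: 256,
  '2x_large': 512, '3x_large': 1024, '4x_large': 2048,
  '5x_large': 4096, '6x_large': 8192,
};

// Memory-optimized instances provide 8 GiB per vCPU.
const GIB_PER_VCPU = 8;

interface SizeDescription {
  workerMemoryGiB: number;
  maxVcpu: number;
  scaling: 'fixed' | 'managed';
}

// Hypothetical helper: derive the vCPU ceiling and scaling mode for a size.
function describeSize(size: EmrSize): SizeDescription {
  const workerMemoryGiB = WORKER_MEMORY_GIB[size];
  // x_large and above use EMR Managed Scaling; smaller sizes are fixed.
  const managed = EMR_SIZES.indexOf(size) >= EMR_SIZES.indexOf('x_large');
  return {
    workerMemoryGiB,
    maxVcpu: workerMemoryGiB / GIB_PER_VCPU,
    scaling: managed ? 'managed' : 'fixed',
  };
}
```

For example, `describeSize('medium')` yields 64 GiB, 8 vCPUs, and fixed scaling, matching the `medium` row.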

Instance-storage (`_storage`) variants

Every size has a sibling _storage variant (x_small_storage, small_storage, medium_storage, large_storage, x_large_storage, 2x_large_storage, 3x_large_storage, 4x_large_storage, 5x_large_storage, 6x_large_storage) with the same worker memory budget and node counts as its base size, but with core and task nodes running on Graviton instances with local NVMe SSD (r7gd/r8gd) instead of EBS-backed families. The master node stays on m6g because it does no shuffle.

Use a _storage variant when a job spills a large volume of shuffle or scratch data to disk and fails on an EBS-backed pool with No space left on device. Typical culprits are wide joins, large GROUP BY / DISTINCT aggregations, and materialized-view refreshes over very wide datasets. For jobs that are CPU- or memory-bound rather than shuffle-bound, stick with the base size, because the local-NVMe families cost more per node.

You switch between the base and _storage tiers the same way you change any other size: update the pool’s provider.size via PATCH /data-planes/{id}/compute-pools/{poolId}, or create a new pool sized for the workload and pin it on the affected job or dataset.
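As a rough sketch of the switch described above, the helpers below pick the _storage sibling when a failure message contains No space left on device and describe the matching PATCH call. The JSON body shape (`provider.type`, `provider.size`) is an assumption inferred from the field names on this page, and `toStorageVariant`, `suggestResize`, and `buildResizeRequest` are hypothetical names, not part of any SDK.

```typescript
const OUT_OF_DISK_MARKER = 'No space left on device';

// Map a base size to its _storage sibling (no-op if already a _storage size).
function toStorageVariant(size: string): string {
  return size.endsWith('_storage') ? size : `${size}_storage`;
}

// Suggest a new size only when the failure looks like disk exhaustion
// on a base (EBS-backed) size; otherwise return null.
function suggestResize(currentSize: string, failureMessage: string): string | null {
  const outOfDisk = failureMessage.includes(OUT_OF_DISK_MARKER);
  return outOfDisk && !currentSize.endsWith('_storage')
    ? toStorageVariant(currentSize)
    : null;
}

interface ResizeRequest {
  method: 'PATCH';
  path: string;
  // Assumed body shape. The provider block is replaced wholesale, so a real
  // request should also carry any non-default timeout fields.
  body: { provider: { type: 'aws_emr'; size: string } };
}

// Describe the PATCH call without sending it.
function buildResizeRequest(dataPlaneId: string, poolId: string, size: string): ResizeRequest {
  return {
    method: 'PATCH',
    path: `/data-planes/${encodeURIComponent(dataPlaneId)}/compute-pools/${encodeURIComponent(poolId)}`,
    body: { provider: { type: 'aws_emr', size } },
  };
}
```

Keeping the request as plain data leaves the choice of HTTP client, base URL, and authentication to the caller.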

Idle and job execution timeouts

EMR-backed pools expose two optional tunables that control cluster lifetime and per-job runtime. Both are set on the aws_emr provider when you create or update a pool, and both are validated server-side — invalid values return HTTP 400.

Field	Range	Default on create	What it controls
`idle_timeout_seconds`	`-1`, or `60`–`604800` (7 days)	`900` (15 minutes)	How long an EMR cluster sits idle before auto-termination. `-1` disables idle-termination entirely; the cluster runs until it is explicitly terminated or recycled.
`job_execution_timeout_seconds`	`60`–`604800` (7 days)	`14400` (4 hours)	Maximum time a single job’s EMR step may stay in `RUNNING`. When exceeded, the operator cancels the step and fails the job with an explanatory message.

When either field is omitted (or set to null) on create or update, the API fills in the default above before validating and persisting the pool. Explicit values, including the -1 disabled-idle sentinel, are always preserved. This matters when you update the pool’s size (for example, switching to a _storage variant): the provider block is replaced wholesale rather than merged, so a size-only update that omits the timeout fields is stored with the standard 4h/15m defaults rather than with no timeouts at all. If the pool uses non-default timeouts, resend them alongside the new size. Set idle_timeout_seconds to -1 when you want a long-lived cluster, for example a pool dedicated to back-to-back batch jobs where boot latency dominates run time.
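The default-filling and validation rules above can be approximated with a small function. This is a sketch of the documented behavior, not the server's implementation: `normalizeTimeouts` is a hypothetical name, the integer check is an assumption, and a thrown `RangeError` stands in for the HTTP 400 response.

```typescript
interface EmrTimeoutInput {
  idle_timeout_seconds?: number | null;
  job_execution_timeout_seconds?: number | null;
}

interface EmrTimeouts {
  idle_timeout_seconds: number;
  job_execution_timeout_seconds: number;
}

const MIN_SECONDS = 60;
const MAX_SECONDS = 604800; // 7 days
const DEFAULT_IDLE_SECONDS = 900; // 15 minutes
const DEFAULT_JOB_SECONDS = 14400; // 4 hours
const IDLE_DISABLED = -1;

// Assumption: values must be whole seconds within the documented range.
function inRange(value: number): boolean {
  return Number.isInteger(value) && value >= MIN_SECONDS && value <= MAX_SECONDS;
}

// Fill defaults for omitted or null fields, then validate.
function normalizeTimeouts(input: EmrTimeoutInput): EmrTimeouts {
  const idle = input.idle_timeout_seconds ?? DEFAULT_IDLE_SECONDS;
  const job = input.job_execution_timeout_seconds ?? DEFAULT_JOB_SECONDS;

  // -1 is only valid for the idle timeout.
  if (idle !== IDLE_DISABLED && !inRange(idle)) {
    throw new RangeError(`idle_timeout_seconds out of range: ${idle}`);
  }
  if (!inRange(job)) {
    throw new RangeError(`job_execution_timeout_seconds out of range: ${job}`);
  }
  return { idle_timeout_seconds: idle, job_execution_timeout_seconds: job };
}
```

Calling `normalizeTimeouts({})` mirrors a size-only update: both fields come back as the defaults, which is why custom values need to be resent.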

The shared Narrative compute pool is configured with a 1-hour job execution timeout and is intended for small jobs. Pools you provision for yourself get the 4-hour default on create, and can set their own (longer or shorter) limit.

Which compute pools are available

The compute pool options available to you depend on your data plane’s underlying provider:

Provider	Available compute pools	Notes
Snowflake	Snowflake warehouse	One compute pool per registered warehouse
Narrative (shared AWS)	Dedicated, Shared	Choose based on workload requirements
Customer AWS	AWS EMR, Dedicated, Shared	EMR-backed pools support sized Spark clusters for NQL workloads

You select your compute pool through the context selector in the platform’s top navigation.

Default compute pool resolution

When a job is created without an explicit compute pool, the platform resolves one through a four-level fallback chain. Resolution happens at job-creation time, so a job’s compute_pool_id is fixed before it lands in Pending. The first level present wins.

Level	Source	When it applies
1. Job-specific	The `computePoolId` passed in the request body (or a workflow task input)	A specific job needs a non-default pool
2. Dataset default	`dataset.computePoolConfig.defaultComputePoolId`	Dataset-scoped operations (refresh, sample, execute-DML, stats) when the request didn’t pin a pool
3. Company default (per data plane)	`companies.compute_pool_config.by_data_plane[<dataPlaneId>].default_compute_pool_id`	The company has a catch-all default for the job’s data plane and nothing earlier supplied a pool. Scoped per data plane because a company can own pools across multiple data planes. Covers the wider surface (model training, model inference, healthcheck) that has no dataset analog.
4. Data plane default	`dataPlane.defaultComputePoolId`	Catch-all when nothing else resolved
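The fallback chain in the table can be modeled as an ordered list of candidates where the first non-empty value wins. The sketch below is illustrative: the input shape and the names `ResolutionInput` and `resolveComputePool` are hypothetical and only loosely mirror the field paths listed above.

```typescript
type ResolutionLevel = 'job' | 'dataset' | 'company' | 'data_plane';

interface ResolutionInput {
  dataPlaneId: string;
  // Level 1: pool pinned on the request or workflow task input.
  jobComputePoolId?: string | null;
  // Level 2: dataset default, for dataset-scoped operations.
  datasetDefaultComputePoolId?: string | null;
  // Level 3: company defaults keyed by data plane id.
  companyDefaultsByDataPlane?: Record<string, string>;
  // Level 4: data plane default.
  dataPlaneDefaultComputePoolId?: string | null;
}

interface Resolution {
  level: ResolutionLevel;
  computePoolId: string;
}

// Walk the four levels in order; return null when nothing resolves.
function resolveComputePool(input: ResolutionInput): Resolution | null {
  const candidates: Array<[ResolutionLevel, string | null | undefined]> = [
    ['job', input.jobComputePoolId],
    ['dataset', input.datasetDefaultComputePoolId],
    ['company', input.companyDefaultsByDataPlane?.[input.dataPlaneId]],
    ['data_plane', input.dataPlaneDefaultComputePoolId],
  ];
  for (const [level, computePoolId] of candidates) {
    if (computePoolId) {
      return { level, computePoolId };
    }
  }
  return null;
}
```

Note that the company lookup is keyed by the job's data plane, so a company default set for one data plane never leaks into jobs on another.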

A company admin can set the level-3 default for a given data plane with:

PUT /company/{companyId}/data-planes/{dataPlaneId}/default-compute-pool/{computePoolId}

and clear it with:

DELETE /company/{companyId}/data-planes/{dataPlaneId}/default-compute-pool

Both endpoints require the Company Info write permission and return 204 No Content on success.
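For scripting against these two endpoints, a minimal sketch is to build the method and path as data and check for the documented 204 status. Only the paths and the status code come from this page; the helper names are hypothetical, and the base URL and authentication scheme are left to the caller because they are not specified here.

```typescript
interface ApiCall {
  method: 'PUT' | 'DELETE';
  path: string;
}

// Both endpoints return 204 No Content on success.
const EXPECTED_STATUS = 204;

// Shared prefix: /company/{companyId}/data-planes/{dataPlaneId}/default-compute-pool
function defaultPoolBase(companyId: string, dataPlaneId: string): string {
  return `/company/${encodeURIComponent(companyId)}/data-planes/${encodeURIComponent(dataPlaneId)}/default-compute-pool`;
}

// Set the level-3 (company) default for one data plane.
function setCompanyDefault(companyId: string, dataPlaneId: string, computePoolId: string): ApiCall {
  return {
    method: 'PUT',
    path: `${defaultPoolBase(companyId, dataPlaneId)}/${encodeURIComponent(computePoolId)}`,
  };
}

// Clear the level-3 default so jobs fall through to the data plane default.
function clearCompanyDefault(companyId: string, dataPlaneId: string): ApiCall {
  return { method: 'DELETE', path: defaultPoolBase(companyId, dataPlaneId) };
}

// Treat anything other than 204 as a failure.
function isSuccess(status: number): boolean {
  return status === EXPECTED_STATUS;
}
```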

Use a dataset default for “all my materialize-view refreshes on this dataset run on a large pool”, and a company default for the broader set of operations that have no dataset to attach to — model inference, healthchecks, or non-dataset NQL workloads. The two configs are intentionally separate so each can grow operation-specific knobs without polluting the other.

When to use each type

Scenario	Recommended pool	Why
Production data pipelines	Dedicated	Predictable performance, no resource contention
Ad-hoc data exploration	Shared	Cost-effective for variable, low-priority workloads
Testing queries before production	Shared	Saves dedicated resources for production use
Time-sensitive audience builds	Dedicated	Guaranteed resources ensure timely completion
Snowflake data planes	Snowflake warehouse	Register one or more warehouses sized for your workload

Archiving compute pools

Deleting a compute pool is a soft delete: the pool’s status changes to archived, the record stays in place, and it is filtered out of list responses but still resolvable by id. Archival behavior diverges by provider because each integration has different operational realities:

Reference path	Snowflake data planes	AWS data planes
Pool is the data plane default	The platform re-elects another active pool as the new default (or clears the default if none remain), then archives the pool	The archive is rejected; you must clear or change the data plane default before retrying
Pool is set as a dataset default	The platform clears the default on every affected dataset, then archives the pool. Affected datasets fall through to the rest of the chain on their next job	The archive is rejected; you must clear the dataset defaults before retrying
Pool is set as a company default (per data plane)	The platform clears the per-data-plane entry on every affected company, then archives the pool. Companies fall through to the data plane default on their next job	The archive is rejected; you must clear the company defaults before retrying
Pool has in-flight `Pending` or `Running` jobs	Archive succeeds; jobs are handled at runtime (see below)	Archive succeeds; jobs are handled at runtime (see below)

Snowflake users frequently mutate their warehouse list outside the platform — through the Snowflake UI, permissions, or infrastructure as code. When a backing warehouse disappears, the platform has to track reality, so the archive must succeed and dependent references are updated automatically. AWS compute pools are wholly managed inside the platform, so the API rejects an archive that would leave a dangling default and forces an explicit decision.
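The provider split in the table can be summarized as a decision function: on Snowflake the archive always succeeds and references are cleaned up automatically, while on AWS any default reference blocks the archive. The sketch below is a simplified model with hypothetical names (`PoolReferences`, `planArchive`); in-flight jobs are deliberately absent because they never block an archive on either provider.

```typescript
type DataPlaneProvider = 'snowflake' | 'aws';

interface PoolReferences {
  isDataPlaneDefault: boolean;
  datasetDefaults: number; // datasets using this pool as their default
  companyDefaults: number; // companies using this pool as their per-data-plane default
}

type ArchivePlan =
  | { outcome: 'archived'; automaticCleanup: string[] }
  | { outcome: 'rejected'; blockers: string[] };

function planArchive(provider: DataPlaneProvider, refs: PoolReferences): ArchivePlan {
  const referenced: string[] = [];
  if (refs.isDataPlaneDefault) referenced.push('data plane default');
  if (refs.datasetDefaults > 0) referenced.push('dataset defaults');
  if (refs.companyDefaults > 0) referenced.push('company defaults');

  // AWS: reject while any default still points at the pool.
  if (provider === 'aws' && referenced.length > 0) {
    return { outcome: 'rejected', blockers: referenced };
  }
  // Snowflake: archive and let the platform re-elect or clear the references.
  return {
    outcome: 'archived',
    automaticCleanup: provider === 'snowflake' ? referenced : [],
  };
}
```

On AWS, the `blockers` list doubles as a checklist of references to clear before retrying the archive.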

Runtime impact on in-flight jobs

Neither archive flow blocks in-flight jobs at archive time. Instead, the data plane operator fails any job whose compute pool resolves to a non-active status on its next polling iteration. The job reports as failed with an actionable message similar to:

compute pool '<id>' could not be resolved (it may have been archived or does not exist).
Resubmit the job with an active compute pool.

This is uniform across Pending, Running, and PendingCancellation jobs. Jobs in PendingCancellation are failed rather than cancelled because the missing pool is the actionable signal — a successful cancellation report would mask the root cause. See the Archive a compute pool guide for the API workflow and the order in which to clear references.
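A rough model of the operator's check on each polling iteration is shown below. It is a sketch under stated assumptions, not the operator's code: the `InFlightJob` shape, the pool-status lookup, and the `failureFor` name are hypothetical, and the message text follows the example above.

```typescript
type InFlightState = 'Pending' | 'Running' | 'PendingCancellation';
type PoolStatus = 'active' | 'archived';

interface InFlightJob {
  id: string;
  state: InFlightState;
  computePoolId: string;
}

// Returns the failure message for a job whose pool is not active,
// or null when the job may continue. The job's state does not matter:
// Pending, Running, and PendingCancellation are all treated the same.
function failureFor(
  job: InFlightJob,
  poolStatusById: Record<string, PoolStatus | undefined>,
): string | null {
  const status = poolStatusById[job.computePoolId];
  if (status === 'active') {
    return null;
  }
  // Covers both archived pools and ids that do not exist.
  return (
    `compute pool '${job.computePoolId}' could not be resolved ` +
    `(it may have been archived or does not exist).\n` +
    `Resubmit the job with an active compute pool.`
  );
}
```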

On AWS data planes, the first compute pool you create is not automatically promoted to the data plane default. This is the permanent behavior: jobs targeting an AWS data plane must either pin an explicit computePoolId, resolve to a dataset-level default, or have a data plane default set explicitly via PUT /data-planes/{id}/default-compute-pool/{poolId}. The intent is that AWS workloads make a deliberate choice about where they run rather than implicitly routing through a “first pool wins” default.

On Snowflake data planes, the first compute pool you create is currently auto-promoted to the data plane default for backward compatibility while the Snowflake-side migration off implicit defaults is in flight. This behavior is temporary and will be removed once every Snowflake workload pins its compute pool explicitly.

Every newly registered company is provisioned a private x_small AWS EMR compute pool on the Narrative data plane, and that pool is set as the company-level default for the same data plane. Jobs that don’t pin a pool explicitly resolve to this default through the pool resolution chain (job → dataset → company → data plane). You can rename, archive, or replace the default at any time.

`external_id` is optional for `aws_emr` providers

When creating a compute pool with the aws_emr provider via POST /data-planes/{id}/compute-pools, you can omit external_id from the request body — the server fills in the trivial {"type": "aws_emr"} payload automatically. Supplying it explicitly still works, and the type is validated against the provider. For the snowflake_warehouse provider, external_id is still required because it carries the warehouse name and alias that the platform uses to dispatch queries to the correct Snowflake warehouse.
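The defaulting rule for external_id can be sketched as follows. This models the documented behavior only: `resolveExternalId` is a hypothetical name, the open-ended `ExternalId` shape is an assumption (the warehouse fields for `snowflake_warehouse` are not spelled out on this page), and thrown errors stand in for the server's validation response.

```typescript
type ProviderType = 'aws_emr' | 'snowflake_warehouse';

// Assumed shape: a type tag plus provider-specific fields.
interface ExternalId {
  type: string;
  [key: string]: unknown;
}

function resolveExternalId(providerType: ProviderType, externalId?: ExternalId): ExternalId {
  if (externalId === undefined) {
    if (providerType === 'aws_emr') {
      // The server fills in the trivial payload for EMR pools.
      return { type: 'aws_emr' };
    }
    // Snowflake pools need the warehouse details to dispatch queries.
    throw new Error('external_id is required for snowflake_warehouse providers');
  }
  // An explicit external_id must match the provider type.
  if (externalId.type !== providerType) {
    throw new Error(
      `external_id type '${externalId.type}' does not match provider '${providerType}'`,
    );
  }
  return externalId;
}
```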

How compute pools relate to the SDK

When executing queries through the TypeScript SDK, the execution_cluster parameter maps to the compute pool concept:

const result = await api.executeNql({
  nql: 'SELECT _nio_id, _nio_updated_at FROM company_data."my_dataset" LIMIT 100',
  data_plane_id: null,
  execution_cluster: { type: 'dedicated' },
});

The execution_cluster.type accepts 'dedicated' or 'shared', corresponding directly to the Dedicated and Shared compute pool types. If execution_cluster is omitted, the data plane’s default compute pool is used; on Snowflake-based data planes, that is the warehouse you’ve designated as default.

Related content

Execution Context

How data plane, compute pool, database, and schema work together

Data Planes

Where your data lives and is processed

Executing NQL Queries

Run queries programmatically with the TypeScript SDK

Migrate to Compute Pools

Transition from a single Snowflake warehouse to compute pools

Archive a Compute Pool

Safely retire a compute pool and clear its references
