API endpoints
| Method | Path | Description |
|---|---|---|
| PUT | /datasets/{dataset_id}/statistics-configuration | Create or replace the statistics configuration |
| GET | /datasets/{dataset_id}/statistics-configuration | Retrieve the current configuration |
| DELETE | /datasets/{dataset_id}/statistics-configuration | Disable statistics (stop computing) |
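As a rough sketch, the three endpoints could be wrapped with Python's standard library as shown here. The host `https://api.example.com`, the bearer-token header, the dataset ID `ds_123`, and the helper name `build_request` are assumptions for illustration and are not part of the documented API; only the methods and the path come from the table above.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical host


def build_request(method, dataset_id, body=None, token="YOUR_TOKEN"):
    """Build (but do not send) a request for the statistics-configuration endpoint."""
    url = f"{BASE_URL}/datasets/{dataset_id}/statistics-configuration"
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(url, data=data, method=method)
    request.add_header("Authorization", f"Bearer {token}")  # assumed auth scheme
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


config = {
    "defaults": {"enabled_stats": ["value_count"]},
    "refresh": {"trigger": "manual"},
}
put_request = build_request("PUT", "ds_123", config)
get_request = build_request("GET", "ds_123")
delete_request = build_request("DELETE", "ds_123")
# Send with urllib.request.urlopen(put_request) once the host and token are real.
```

Building the request separately from sending it keeps the sketch runnable without network access.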
Configuration schema overview
A statistics configuration has four top-level fields:
defaults (required)
The global stat selection applied to every eligible field. Must include enabled_stats — there is no parent to inherit from at this level.
refresh (required)
Controls when statistics are recomputed. Exactly one trigger type:
| Trigger | Payload | Behavior |
|---|---|---|
| cron | { "trigger": "cron", "cron_expression": "0 0 * * *" } | Recompute on a UTC cron schedule |
| on_update | { "trigger": "on_update" } | Recompute whenever dataset data changes |
| manual | { "trigger": "manual" } | Only compute on demand |
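The trigger payloads above can be produced by a small helper. This is a sketch: the function name `make_refresh` is hypothetical, and rejecting `cron_expression` on non-cron triggers is an assumption about how strictly the API treats extra fields.

```python
VALID_TRIGGERS = ("cron", "on_update", "manual")


def make_refresh(trigger, cron_expression=None):
    """Build a refresh payload containing exactly one trigger type."""
    if trigger not in VALID_TRIGGERS:
        raise ValueError(f"unknown trigger: {trigger!r}")
    if trigger == "cron":
        if not cron_expression:
            raise ValueError("the cron trigger needs a cron_expression")
        return {"trigger": "cron", "cron_expression": cron_expression}
    if cron_expression is not None:
        # Assumption: cron_expression is only meaningful for the cron trigger.
        raise ValueError("cron_expression only applies to the cron trigger")
    return {"trigger": trigger}


daily = make_refresh("cron", "0 0 * * *")  # midnight UTC, every day
on_demand = make_refresh("manual")
```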
dataset and rosetta_stone (optional)
Namespace-level overrides. Each namespace has:
- scope: overrides defaults for all fields in this namespace
- fields: per-field overrides for individual columns/attributes

At least one of scope or fields must be provided if the namespace is present.
Supported statistics
Each statistic is computed per column. The Computation column shows the underlying SQL-level operation.
| Stat name | Description | Computation |
|---|---|---|
| value_count | Total number of non-null values in the column | COUNT(column) |
| null_value_count | Number of null values in the column | COUNT(*) - COUNT(column) |
| nan_value_count | Number of NaN (not-a-number) values (floating-point columns only) | COUNT_IF(IS_NAN(column)) |
| lower_bound | Minimum value in the column | MIN(column) |
| upper_bound | Maximum value in the column | MAX(column) |
| approx_count_distinct | Approximate number of distinct values, using HyperLogLog for efficiency on large datasets | APPROX_COUNT_DISTINCT(column) |
| count_distinct | Exact number of distinct values | COUNT(DISTINCT column) |
| histogram | Frequency distribution of values across distinct buckets | Aggregation over value frequencies |
| mean | Arithmetic mean (numeric columns only) | AVG(column) |
| standard_deviation | Population standard deviation (numeric columns only) | STDDEV(column) |
| completeness | Ratio of non-null values to total rows (0 to 1) | COUNT(column) / COUNT(*) |
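To make the definitions above concrete, the following sketch computes most of them for a small in-memory column. It is an illustration of the table's semantics in plain Python, not the platform's implementation: `column_stats` is a hypothetical helper, it assumes at least one non-null value, and `approx_count_distinct` is omitted because HyperLogLog needs a dedicated library.

```python
import math
import statistics
from collections import Counter


def column_stats(values):
    """Compute a subset of the documented statistics for one in-memory column."""
    non_null = [v for v in values if v is not None]
    return {
        "value_count": len(non_null),                      # COUNT(column)
        "null_value_count": len(values) - len(non_null),   # COUNT(*) - COUNT(column)
        "nan_value_count": sum(
            1 for v in non_null if isinstance(v, float) and math.isnan(v)
        ),
        "lower_bound": min(non_null),                      # MIN(column)
        "upper_bound": max(non_null),                      # MAX(column)
        "count_distinct": len(set(non_null)),              # COUNT(DISTINCT column)
        "histogram": dict(Counter(non_null)),              # value -> frequency
        "mean": statistics.fmean(non_null),                # AVG(column)
        "standard_deviation": statistics.pstdev(non_null), # population std dev
        "completeness": len(non_null) / len(values),       # COUNT(column) / COUNT(*)
    }


stats = column_stats([3.0, 1.0, None, 3.0, 5.0])
```

For this sample column, four of the five rows are non-null, so `completeness` is 0.8 and `null_value_count` is 1.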
Type compatibility matrix
Not all statistics apply to all data types. The table below shows which statistics are computed for each type.
| Type | Supported statistics |
|---|---|
| DoubleType | All 11: value_count, null_value_count, nan_value_count, approx_count_distinct, count_distinct, lower_bound, upper_bound, histogram, mean, standard_deviation, completeness |
| LongType | 10 (all except nan_value_count): value_count, null_value_count, approx_count_distinct, count_distinct, lower_bound, upper_bound, histogram, mean, standard_deviation, completeness |
| StringType | 8: value_count, null_value_count, approx_count_distinct, count_distinct, lower_bound, upper_bound, histogram, completeness |
| TimestampTzType | 8: value_count, null_value_count, approx_count_distinct, count_distinct, lower_bound, upper_bound, histogram, completeness |
| BooleanType | 6: value_count, null_value_count, approx_count_distinct, count_distinct, histogram, completeness |
| ArrayType | 2: value_count, null_value_count |
| ObjectType | 2: value_count, null_value_count |
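The matrix above can be expressed as a lookup table, which is handy for predicting which requested stats survive for a given column type. This is a client-side sketch; the names `SUPPORTED` and `applicable_stats` are hypothetical, and the server's own filtering logic may differ in detail.

```python
COMMON = [
    "value_count", "null_value_count", "approx_count_distinct",
    "count_distinct", "histogram", "completeness",
]
BOUNDS = ["lower_bound", "upper_bound"]   # ordered types only
NUMERIC = ["mean", "standard_deviation"]  # numeric types only

SUPPORTED = {
    "DoubleType": set(COMMON + BOUNDS + NUMERIC + ["nan_value_count"]),
    "LongType": set(COMMON + BOUNDS + NUMERIC),
    "StringType": set(COMMON + BOUNDS),
    "TimestampTzType": set(COMMON + BOUNDS),
    "BooleanType": set(COMMON),
    "ArrayType": {"value_count", "null_value_count"},
    "ObjectType": {"value_count", "null_value_count"},
}


def applicable_stats(requested, type_name):
    """Keep only the requested stats that the given type supports, in order."""
    supported = SUPPORTED[type_name]
    return [stat for stat in requested if stat in supported]


string_stats = applicable_stats(["value_count", "mean", "lower_bound"], "StringType")
```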
DoubleType is the only type that supports nan_value_count, since NaN is a floating-point concept. mean and standard_deviation are limited to numeric types (DoubleType and LongType). lower_bound and upper_bound require ordered types (numeric, timestamp, or string).
Inheritance model
The configuration uses a layered inheritance chain: defaults, then namespace scope, then individual fields. Two rules apply at each level:
1. Stat set (enabled_stats): all-or-nothing replacement
When a level provides enabled_stats, it completely replaces the parent’s stat set. There is no merging.
2. Stat options (stat_options): per-property inheritance
When a level provides only stat_options (without enabled_stats), it inherits the stat set from its parent and overrides only the specified option properties. Unspecified properties keep the value from the resolved parent — not from the original defaults.
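This means overrides accumulate through the chain: if defaults set max_bins=50 and scope overrides it to 200, a field that specifies only other options inherits max_bins=200 from scope, not max_bins=50 from defaults.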
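The two rules can be sketched as a small resolver that is applied level by level. This is an illustration of the documented behavior, not the server's code: the function name `resolve` is hypothetical, and the shape of stat_options (a mapping from stat name to option properties) is an assumption.

```python
def resolve(parent, override):
    """Resolve one level against its already-resolved parent."""
    override = override or {}
    # Rule 1: enabled_stats, when present, replaces the parent's set wholesale.
    stats = list(override.get("enabled_stats", parent["enabled_stats"]))
    # Rule 2: stat_options are merged per property on top of the resolved parent.
    options = {
        stat: dict(props) for stat, props in parent.get("stat_options", {}).items()
    }
    for stat, props in override.get("stat_options", {}).items():
        options.setdefault(stat, {}).update(props)
    return {"enabled_stats": stats, "stat_options": options}


defaults = {
    "enabled_stats": ["value_count", "histogram"],
    # Assumed option layout: one dict of properties per stat name.
    "stat_options": {"histogram": {"max_bins": 50, "overflow": "none"}},
}
scope = {"stat_options": {"histogram": {"max_bins": 200}}}
field = {"stat_options": {"histogram": {"overflow": "truncate"}}}

resolved_scope = resolve(defaults, scope)
resolved_field = resolve(resolved_scope, field)
```

Because each level resolves against the already-resolved parent, the field ends up with max_bins=200 from scope and its own overflow=truncate, while the stat set still comes from defaults.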
Primitive vs. non-primitive fields
Fields map to one of two node kinds based on their schema type:
| Node kind | Schema types | Available config fields |
|---|---|---|
| Primitive | string, integer, float, boolean, timestamp | enabled_stats, stat_options |
| Non-primitive | object, array | self, properties, items |
The non-primitive config fields are:
- self: stats on the container column itself (only null_value_count is supported)
- properties: explicit configs for named children of an object (opt-in, not inherited)
- items: config for array element nodes
Example 1: Simple defaults-only configuration
The simplest useful configuration: define defaults and a refresh trigger, and every eligible field gets the same stats.
Goal: Compute value_count, null_value_count, and completeness for all fields, refreshing daily at midnight UTC.
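A payload for this goal can be assembled from the schema described earlier. The sketch below builds it as a Python dict and serializes it to JSON; every field name comes from this page, and only the variable names are illustrative.

```python
import json

config = {
    # Applied to every eligible field; enabled_stats is required here.
    "defaults": {
        "enabled_stats": ["value_count", "null_value_count", "completeness"],
    },
    # Daily at midnight; cron schedules are evaluated in UTC.
    "refresh": {"trigger": "cron", "cron_expression": "0 0 * * *"},
}

payload = json.dumps(config, indent=2)
print(payload)
```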
Statistics that do not apply to a field's type (for example, mean on a string column) are automatically filtered.
Adding histogram with custom options
Extend the defaults to include a histogram with max_bins set to 200 and overflow set to truncate.
Disabling stats for a specific namespace
If you want stats on dataset columns but not Rosetta Stone attributes, set enabled_stats to an empty list in the rosetta_stone scope. enabled_stats: [] explicitly disables all stats for Rosetta Stone fields.
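A sketch of such a configuration follows. Placing the empty list under the namespace's scope is inferred from the schema overview (scope overrides defaults for every field in the namespace); the defaults and refresh values are carried over from Example 1 for illustration.

```python
import json

config = {
    "defaults": {
        "enabled_stats": ["value_count", "null_value_count", "completeness"],
    },
    "refresh": {"trigger": "cron", "cron_expression": "0 0 * * *"},
    # An explicit empty list replaces the inherited stat set with nothing.
    "rosetta_stone": {"scope": {"enabled_stats": []}},
}

payload = json.dumps(config, indent=2)
```

Dataset columns are unaffected because no dataset namespace override is present, so they resolve directly from defaults.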
Example 2: Complex configuration with nested structures
Scenario: A dataset with these columns:
- user_id (string): needs exact distinct count
- price (float): needs histogram with custom bins
- address (object with children city, zip, state): needs stats on city and zip only
- tags (array of strings): needs stats on array items, and null count on the array column itself
Step 1: Set defaults
Start with a broad set of defaults.
Step 2: Add namespace scope to tune histogram bins
Override histogram bins for all dataset fields to 200, while keeping the stat set inherited from defaults. Because scope provides only stat_options (no enabled_stats), it inherits the full stat set from defaults. Only max_bins is overridden; overflow retains the server default of none.
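Steps 1 and 2 together might look like the sketch below. The default stat set is inferred from the stats that price resolves to later in this example, the refresh trigger is an arbitrary choice, and the layout of stat_options (properties nested under the stat name) is an assumption.

```python
config = {
    # Step 1: broad defaults (stat set inferred, see note above).
    "defaults": {
        "enabled_stats": [
            "value_count", "null_value_count", "histogram", "completeness",
        ],
    },
    "refresh": {"trigger": "on_update"},  # arbitrary choice for this sketch
    # Step 2: namespace scope with stat_options only, so the stat set is inherited.
    "dataset": {
        "scope": {
            "stat_options": {"histogram": {"max_bins": 200}},  # assumed layout
        },
    },
}

scope = config["dataset"]["scope"]
inherits_stat_set = "enabled_stats" not in scope
```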
Step 3: Add per-field overrides
Now add the field-specific configurations.
What each field resolves to
| Field | Resolution | Effective stats |
|---|---|---|
| user_id | enabled_stats replaces inherited set | value_count, count_distinct |
| price | Inherits stat set from scope (which inherited from defaults); overrides overflow | value_count, null_value_count, histogram(max_bins=200, overflow=truncate), completeness |
| address (self) | Only null_value_count (the only stat supported for non-primitive self) | null_value_count |
| address.city | enabled_stats replaces inherited set | value_count, approx_count_distinct |
| address.zip | enabled_stats with custom histogram options | value_count, null_value_count, histogram(max_bins=1000, overflow=truncate) |
| address.state | Not listed in properties, so it receives no stats | (none) |
| tags (self) | Only null_value_count for the array column | null_value_count |
| tags (items) | enabled_stats replaces inherited set | value_count, approx_count_distinct, histogram(max_bins=200, overflow=none) |
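One set of per-field overrides consistent with the resolution table above is sketched below. Treat the exact shape as an assumption: field_name, self, properties, items, enabled_stats, and stat_options appear on this page, but the `path` key on property entries and the nesting of histogram options under the stat name are hypothetical.

```python
fields = [
    {"field_name": "user_id", "enabled_stats": ["value_count", "count_distinct"]},
    # stat_options only: the stat set is inherited from scope.
    {"field_name": "price", "stat_options": {"histogram": {"overflow": "truncate"}}},
    {
        "field_name": "address",
        "self": {"enabled_stats": ["null_value_count"]},
        "properties": [  # opt-in: state is not listed, so it gets no stats
            {"path": "city",  # "path" is a hypothetical key name
             "enabled_stats": ["value_count", "approx_count_distinct"]},
            {"path": "zip",
             "enabled_stats": ["value_count", "null_value_count", "histogram"],
             "stat_options": {"histogram": {"max_bins": 1000, "overflow": "truncate"}}},
        ],
    },
    {
        "field_name": "tags",
        "self": {"enabled_stats": ["null_value_count"]},
        # Histogram options are not set here, so they come from scope.
        "items": {"enabled_stats": ["value_count", "approx_count_distinct", "histogram"]},
    },
]

address = next(f for f in fields if f["field_name"] == "address")
configured_children = [p["path"] for p in address["properties"]]
```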
The address.state field gets no stats because it is not listed in the properties array. Nested property configuration is opt-in: unlisted children are skipped entirely.
Validation errors
When a configuration violates a constraint, the API returns a 400 Bad Request with a descriptive error message. Multiple validation errors may be reported in a single response, separated by semicolons.
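Since several errors can arrive in one message, a client may want to split them before displaying or logging. The sketch below assumes the message is available as a plain string; the response body format is not specified here, and the sample message text is invented for illustration.

```python
def split_errors(message):
    """Split a semicolon-separated validation message into individual errors."""
    return [part.strip() for part in message.split(";") if part.strip()]


# Hypothetical message text; real wording will differ.
sample = "defaults must include enabled_stats; max_bins must be between 2 and 100000"
errors = split_errors(sample)
```

This naive split would break an error whose own text contains a semicolon, so treat it as a display aid rather than a parser.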
Each error is documented with its exact trigger condition, an example request that produces it, and how to fix it:
| Error | Trigger condition |
|---|---|
| Missing Required Default Statistics | defaults must include enabled_stats |
| Duplicate Statistics | Same stat listed more than once |
| Conflicting Distinct Count Methods | approx_count_distinct + count_distinct together |
| Invalid Histogram Bin Count | max_bins outside [2, 100000] |
| Histogram Options Without Histogram Enabled | Histogram options without histogram stat |
| Too Many Histograms | More than 64 histogram-enabled primitive leaves |
| Mixed Primitive and Non-Primitive Configuration | Primitive + non-primitive fields on same node |
| Empty Node Configuration | Node with no configuration fields |
| Empty Namespace Configuration | Namespace without scope or fields |
| Empty Scope Configuration | Scope without enabled_stats or stat_options |
| Empty Properties List | Empty properties array |
| Duplicate Field Names | Same field_name appears more than once |
| Duplicate Property Paths | Same property path appears more than once |
| Conflicting Object and Array Configuration | Both properties and items on same node |
| Unsupported Statistics for Container Node | Stats other than null_value_count in self |
| Invalid Cron Expression | Malformed cron expression |
| Type Validation Error | Stats incompatible with the dataset schema types |
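A few of these constraints are simple enough to check client-side before sending a request. The sketch below covers four of them for a single resolved level; the function name, the error strings, and the assumed stat_options layout are illustrative, and the server remains the authority on validation.

```python
def validate_level(level):
    """Return a list of problems found in one resolved configuration level."""
    errors = []
    stats = level.get("enabled_stats", [])
    if len(stats) != len(set(stats)):
        errors.append("same stat listed more than once")
    if {"approx_count_distinct", "count_distinct"} <= set(stats):
        errors.append("approx_count_distinct and count_distinct listed together")
    # Assumed layout: histogram options nested under the stat name.
    histogram = level.get("stat_options", {}).get("histogram")
    if histogram is not None:
        if "histogram" not in stats:
            errors.append("histogram options without histogram stat")
        max_bins = histogram.get("max_bins")
        if max_bins is not None and not 2 <= max_bins <= 100000:
            errors.append("max_bins outside [2, 100000]")
    return errors


bad_level = {
    "enabled_stats": ["value_count", "value_count", "count_distinct",
                      "approx_count_distinct"],
    "stat_options": {"histogram": {"max_bins": 1}},
}
problems = validate_level(bad_level)
summary = "; ".join(problems)  # mirrors the semicolon-separated style
```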
Related content
- Dataset Statistics: Why statistics matter and how they work
- Configuring Dataset Statistics: Set up statistics through the platform UI
- Job Types: Job types including statistics computation jobs
- Tracking Jobs: Monitor job status and handle completion

