This page is the definitive reference for dataset statistics — the column-level metrics the platform computes, their type compatibility, and the configuration API for controlling what gets computed and when. For background on what statistics are and why they matter, see Dataset Statistics. To configure statistics through the platform UI, see the configuring dataset statistics guide.

API endpoints

| Method | Path | Description |
| --- | --- | --- |
| PUT | /datasets/{dataset_id}/statistics-configuration | Create or replace the statistics configuration |
| GET | /datasets/{dataset_id}/statistics-configuration | Retrieve the current configuration |
| DELETE | /datasets/{dataset_id}/statistics-configuration | Disable statistics (stop computing) |
All endpoints require a Bearer token with dataset write permissions.
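
As a rough illustration of the PUT endpoint, the sketch below builds the request with Python's standard library without sending it. The base URL, the dataset ID, the token placeholder, and the JSON Content-Type header are assumptions for illustration; only the path and the Bearer scheme come from this page.

```python
import json
import urllib.request

# Hypothetical API host; replace with the host of your platform.
BASE_URL = "https://api.example.com"


def build_put_config_request(dataset_id: str, token: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT request for a statistics configuration."""
    url = f"{BASE_URL}/datasets/{dataset_id}/statistics-configuration"
    return urllib.request.Request(
        url,
        data=json.dumps(config).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            # Assumption: the API accepts a JSON request body.
            "Content-Type": "application/json",
        },
    )


request = build_put_config_request(
    "ds_123",      # hypothetical dataset ID
    "YOUR_TOKEN",  # token with dataset write permissions
    {
        "defaults": {"enabled_stats": ["value_count", "null_value_count"]},
        "refresh": {"trigger": "manual"},
    },
)
# To actually send it: urllib.request.urlopen(request)
```

The same helper shape works for GET and DELETE by changing the method and dropping the body.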

Configuration schema overview

A statistics configuration has four top-level fields:
{
  "defaults": { ... },       // required — global stat selection
  "refresh": { ... },        // required — when to recompute
  "dataset": { ... },        // optional — overrides for dataset columns
  "rosetta_stone": { ... }   // optional — overrides for Rosetta Stone attributes
}

defaults (required)

The global stat selection applied to every eligible field. Must include enabled_stats — there is no parent to inherit from at this level.
{
  "enabled_stats": ["value_count", "null_value_count", "histogram"],
  "stat_options": {
    "histogram": { "max_bins": 100, "overflow": "truncate" }
  }
}

refresh (required)

Controls when statistics are recomputed. Exactly one trigger type:
| Trigger | Payload | Behavior |
| --- | --- | --- |
| cron | { "trigger": "cron", "cron_expression": "0 0 * * *" } | Recompute on a UTC cron schedule |
| on_update | { "trigger": "on_update" } | Recompute whenever dataset data changes |
| manual | { "trigger": "manual" } | Only compute on demand |
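
A client can check the refresh block before sending it. The following is a minimal sketch, not the server's validation logic: the function name and error strings are hypothetical, and the five-field cron check is an assumption based on the example expression shown above.

```python
VALID_TRIGGERS = ("cron", "on_update", "manual")


def validate_refresh(refresh: dict) -> list[str]:
    """Return a list of problems found in a refresh block (empty if none)."""
    errors = []
    trigger = refresh.get("trigger")
    if trigger not in VALID_TRIGGERS:
        errors.append(f"unknown trigger: {trigger!r}")
    elif trigger == "cron":
        expression = refresh.get("cron_expression")
        # Assumption: expressions use the classic five-field cron format.
        if not isinstance(expression, str) or len(expression.split()) != 5:
            errors.append("cron trigger needs a five-field cron_expression")
    return errors


print(validate_refresh({"trigger": "cron", "cron_expression": "0 0 * * *"}))
print(validate_refresh({"trigger": "cron"}))
```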

dataset and rosetta_stone (optional)

Namespace-level overrides. Each namespace has:
  • scope — overrides defaults for all fields in this namespace
  • fields — per-field overrides for individual columns/attributes
At least one of scope or fields must be provided if the namespace is present.

Supported statistics

Each statistic is computed per column. The computation column shows the underlying SQL-level operation.
| Stat name | Description | Computation |
| --- | --- | --- |
| value_count | Total number of non-null values in the column | COUNT(column) |
| null_value_count | Number of null values in the column | COUNT(*) - COUNT(column) |
| nan_value_count | Number of NaN (not-a-number) values; floating-point columns only | COUNT_IF(IS_NAN(column)) |
| lower_bound | Minimum value in the column | MIN(column) |
| upper_bound | Maximum value in the column | MAX(column) |
| approx_count_distinct | Approximate number of distinct values, using HyperLogLog for efficiency on large datasets | APPROX_COUNT_DISTINCT(column) |
| count_distinct | Exact number of distinct values | COUNT(DISTINCT column) |
| histogram | Frequency distribution of values across distinct buckets | Aggregation over value frequencies |
| mean | Arithmetic mean; numeric columns only | AVG(column) |
| standard_deviation | Population standard deviation; numeric columns only | STDDEV(column) |
| completeness | Ratio of non-null values to total rows (0 to 1) | COUNT(column) / COUNT(*) |
approx_count_distinct and count_distinct cannot both appear in the same enabled_stats list. Choose one. See Conflicting Distinct Count Methods.
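
To make the definitions concrete, here is a small in-memory sketch that computes several of these statistics for one column held in a Python list. It only illustrates the formulas in the table; it is not how the platform computes them, and the handling of NaN in the ordering and averaging stats is an assumption.

```python
import math
import statistics


def _is_nan(value) -> bool:
    return isinstance(value, float) and math.isnan(value)


def column_stats(values: list) -> dict:
    """Compute a subset of the documented statistics for one in-memory column."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    # NaN is a value, not a null, so value_count includes it.
    nan_count = sum(1 for v in non_null if _is_nan(v))
    # Assumption: NaNs are left out of min/max/distinct/mean/stddev here.
    comparable = [v for v in non_null if not _is_nan(v)]
    return {
        "value_count": len(non_null),
        "null_value_count": total - len(non_null),
        "nan_value_count": nan_count,
        "lower_bound": min(comparable) if comparable else None,
        "upper_bound": max(comparable) if comparable else None,
        "count_distinct": len(set(comparable)),
        "mean": statistics.fmean(comparable) if comparable else None,
        # Population standard deviation, matching the description above.
        "standard_deviation": statistics.pstdev(comparable) if comparable else None,
        "completeness": len(non_null) / total if total else None,
    }


stats = column_stats([10.0, 20.0, None, 20.0, 30.0])
print(stats)
```

For the sample column, four of five rows are non-null, so completeness is 0.8 and null_value_count is 1.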

Type compatibility matrix

Not all statistics apply to all data types. The table below shows which statistics are computed for each type.
| Type | Count | Supported statistics |
| --- | --- | --- |
| DoubleType | All 11 | value_count, null_value_count, nan_value_count, approx_count_distinct, count_distinct, lower_bound, upper_bound, histogram, mean, standard_deviation, completeness |
| LongType | 10 (all except nan_value_count) | value_count, null_value_count, approx_count_distinct, count_distinct, lower_bound, upper_bound, histogram, mean, standard_deviation, completeness |
| StringType | 8 | value_count, null_value_count, approx_count_distinct, count_distinct, lower_bound, upper_bound, histogram, completeness |
| TimestampTzType | 8 | value_count, null_value_count, approx_count_distinct, count_distinct, lower_bound, upper_bound, histogram, completeness |
| BooleanType | 6 | value_count, null_value_count, approx_count_distinct, count_distinct, histogram, completeness |
| ArrayType | 2 | value_count, null_value_count |
| ObjectType | 2 | value_count, null_value_count |
DoubleType is the only type that supports nan_value_count, since NaN is a floating-point concept. mean and standard_deviation are limited to numeric types (DoubleType and LongType). lower_bound and upper_bound require ordered types (numeric, timestamp, or string).
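
The matrix can be encoded as a lookup table to predict which requested stats survive for a given column type. This is a sketch of the filtering described on this page, with hypothetical names (SUPPORTED, filter_stats); the server's actual implementation is not documented here.

```python
# Stats shared by every primitive type in the matrix.
BASE = [
    "value_count", "null_value_count", "approx_count_distinct",
    "count_distinct", "histogram", "completeness",
]
# Ordered types additionally support bounds.
ORDERED = BASE + ["lower_bound", "upper_bound"]
# Numeric types additionally support mean and standard deviation.
NUMERIC = ORDERED + ["mean", "standard_deviation"]

SUPPORTED = {
    "DoubleType": set(NUMERIC + ["nan_value_count"]),
    "LongType": set(NUMERIC),
    "StringType": set(ORDERED),
    "TimestampTzType": set(ORDERED),
    "BooleanType": set(BASE),
    "ArrayType": {"value_count", "null_value_count"},
    "ObjectType": {"value_count", "null_value_count"},
}


def filter_stats(type_name: str, enabled_stats: list[str]) -> list[str]:
    """Keep only the stats the matrix lists for this type, preserving order."""
    allowed = SUPPORTED.get(type_name, set())
    return [stat for stat in enabled_stats if stat in allowed]


print(filter_stats("StringType", ["value_count", "mean", "histogram"]))
```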

Inheritance model

The configuration uses a layered inheritance chain:
defaults → namespace scope → field override → nested node overrides
Two distinct rules apply:

1. Stat set (enabled_stats) — all-or-nothing replacement

When a level provides enabled_stats, it completely replaces the parent’s stat set. There is no merging.

2. Stat options (stat_options) — per-property inheritance

When a level provides only stat_options (without enabled_stats), it inherits the stat set from its parent and overrides only the specified option properties. Unspecified properties keep the value from the resolved parent — not from the original defaults. This means overrides accumulate through the chain:
defaults:  histogram(max_bins=50, overflow=none)
    ↓ scope overrides max_bins
scope:     histogram(max_bins=200, overflow=none)
    ↓ field overrides overflow
field:     histogram(max_bins=200, overflow=truncate)
The field gets max_bins=200 from scope, not max_bins=50 from defaults.
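
The two rules can be sketched as a single resolution step applied once per level of the chain. The function below is an illustrative model rather than the platform's code; the name resolve_level is hypothetical, and keeping options for stats that are no longer enabled is an assumption of this sketch.

```python
import copy


def resolve_level(parent: dict, override: dict) -> dict:
    """Apply one level of overrides to an already-resolved parent config."""
    resolved = copy.deepcopy(parent)
    resolved.setdefault("stat_options", {})
    # Rule 1: enabled_stats replaces the parent's stat set entirely.
    if "enabled_stats" in override:
        resolved["enabled_stats"] = list(override["enabled_stats"])
    # Rule 2: stat_options are merged property by property.
    # Assumption: options for stats outside enabled_stats are kept but unused.
    for stat, properties in override.get("stat_options", {}).items():
        resolved["stat_options"].setdefault(stat, {}).update(properties)
    return resolved


defaults = {
    "enabled_stats": ["value_count", "histogram"],
    "stat_options": {"histogram": {"max_bins": 50, "overflow": "none"}},
}
scope = resolve_level(defaults, {"stat_options": {"histogram": {"max_bins": 200}}})
field = resolve_level(scope, {"stat_options": {"histogram": {"overflow": "truncate"}}})
print(field["stat_options"]["histogram"])
```

Because each level resolves against the already-resolved parent, the field sees max_bins=200 from scope rather than the original 50.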

Primitive vs. non-primitive fields

Fields map to one of two node kinds based on their schema type:
| Node kind | Schema types | Available config fields |
| --- | --- | --- |
| Primitive | string, integer, float, boolean, timestamp | enabled_stats, stat_options |
| Non-primitive | object, array | self, properties, items |
You cannot mix primitive config fields (enabled_stats, stat_options) with non-primitive config fields (self, properties, items) on the same node. See MixedNodeConfig. For non-primitive nodes:
  • self — stats on the container column itself (only null_value_count is supported)
  • properties — explicit configs for named children of an object (opt-in, not inherited)
  • items — config for array element nodes
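
One way to picture the node-kind rule is a small classifier that inspects which config keys a node carries. This is a hedged sketch: the function name and error messages are hypothetical, and the checks mirror the mixed, empty, and properties-plus-items constraints listed under Validation errors.

```python
PRIMITIVE_KEYS = {"enabled_stats", "stat_options"}
NON_PRIMITIVE_KEYS = {"self", "properties", "items"}


def node_kind(node: dict) -> str:
    """Classify a node config as 'primitive' or 'non_primitive', or raise."""
    has_primitive = bool(node.keys() & PRIMITIVE_KEYS)
    has_non_primitive = bool(node.keys() & NON_PRIMITIVE_KEYS)
    if has_primitive and has_non_primitive:
        raise ValueError("mixed primitive and non-primitive configuration")
    if not has_primitive and not has_non_primitive:
        raise ValueError("empty node configuration")
    if "properties" in node and "items" in node:
        raise ValueError("both properties and items on the same node")
    return "primitive" if has_primitive else "non_primitive"


print(node_kind({"field_name": "price", "stat_options": {"histogram": {"max_bins": 10}}}))
print(node_kind({"field_name": "tags", "self": {"enabled_stats": ["null_value_count"]}}))
```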

Example 1: Simple defaults-only configuration

The simplest useful configuration: define defaults and a refresh trigger, and every eligible field gets the same stats.
Goal: Compute value_count, null_value_count, and completeness for all fields, refreshing daily at midnight UTC.
{
  "defaults": {
    "enabled_stats": ["value_count", "null_value_count", "completeness"]
  },
  "refresh": {
    "trigger": "cron",
    "cron_expression": "0 0 * * *"
  }
}
That’s it. The server applies these three stats to every compatible column in both the dataset schema and Rosetta Stone attributes. Stats incompatible with a field’s type (e.g., mean on a string column) are automatically filtered.

Adding histogram with custom options

Extend the defaults to include a histogram with 200 bins and truncation overflow:
{
  "defaults": {
    "enabled_stats": ["value_count", "null_value_count", "completeness", "histogram"],
    "stat_options": {
      "histogram": {
        "max_bins": 200,
        "overflow": "truncate"
      }
    }
  },
  "refresh": {
    "trigger": "on_update"
  }
}

Disabling stats for a specific namespace

If you want stats on dataset columns but not Rosetta Stone attributes:
{
  "defaults": {
    "enabled_stats": ["value_count", "null_value_count"]
  },
  "refresh": { "trigger": "manual" },
  "rosetta_stone": {
    "scope": {
      "enabled_stats": []
    }
  }
}
The empty enabled_stats: [] explicitly disables all stats for Rosetta Stone fields.

Example 2: Complex configuration with nested structures

Scenario: A dataset with these columns:
  • user_id (string) — needs exact distinct count
  • price (float) — needs histogram with custom bins
  • address (object with children city, zip, state) — need stats on city and zip only
  • tags (array of strings) — need stats on array items, and null count on the array column itself

Step 1: Set defaults

Start with a broad set of defaults:
{
  "defaults": {
    "enabled_stats": ["value_count", "null_value_count", "histogram", "completeness"],
    "stat_options": {
      "histogram": { "max_bins": 50 }
    }
  },
  "refresh": { "trigger": "cron", "cron_expression": "0 3 * * 1" }
}

Step 2: Add namespace scope to tune histogram bins

Override histogram bins for all dataset fields to 200, while keeping the stat set inherited from defaults:
{
  "defaults": {
    "enabled_stats": ["value_count", "null_value_count", "histogram", "completeness"],
    "stat_options": {
      "histogram": { "max_bins": 50 }
    }
  },
  "refresh": { "trigger": "cron", "cron_expression": "0 3 * * 1" },
  "dataset": {
    "scope": {
      "stat_options": {
        "histogram": { "max_bins": 200 }
      }
    }
  }
}
Because scope provides only stat_options (no enabled_stats), it inherits the full stat set from defaults. Only max_bins is overridden — overflow retains the server default of none.

Step 3: Add per-field overrides

Now add the field-specific configurations:
{
  "defaults": {
    "enabled_stats": ["value_count", "null_value_count", "histogram", "completeness"],
    "stat_options": {
      "histogram": { "max_bins": 50 }
    }
  },
  "refresh": { "trigger": "cron", "cron_expression": "0 3 * * 1" },
  "dataset": {
    "scope": {
      "stat_options": {
        "histogram": { "max_bins": 200 }
      }
    },
    "fields": [
      {
        "field_name": "user_id",
        "enabled_stats": ["value_count", "count_distinct"]
      },
      {
        "field_name": "price",
        "stat_options": {
          "histogram": { "overflow": "truncate" }
        }
      },
      {
        "field_name": "address",
        "self": {
          "enabled_stats": ["null_value_count"]
        },
        "properties": [
          {
            "path": "city",
            "enabled_stats": ["value_count", "approx_count_distinct"]
          },
          {
            "path": "zip",
            "enabled_stats": ["value_count", "null_value_count", "histogram"],
            "stat_options": {
              "histogram": { "max_bins": 1000, "overflow": "truncate" }
            }
          }
        ]
      },
      {
        "field_name": "tags",
        "self": {
          "enabled_stats": ["null_value_count"]
        },
        "items": {
          "enabled_stats": ["value_count", "approx_count_distinct", "histogram"]
        }
      }
    ]
  }
}

What each field resolves to

| Field | Resolution | Effective stats |
| --- | --- | --- |
| user_id | enabled_stats replaces inherited set | value_count, count_distinct |
| price | Inherits stat set from scope (which inherited from defaults); overrides overflow | value_count, null_value_count, histogram(max_bins=200, overflow=truncate), completeness |
| address (self) | Only null_value_count (the only stat supported for non-primitive self) | null_value_count |
| address.city | enabled_stats replaces inherited set | value_count, approx_count_distinct |
| address.zip | enabled_stats with custom histogram options | value_count, null_value_count, histogram(max_bins=1000, overflow=truncate) |
| address.state | Not listed in properties; receives no stats | (none) |
| tags (self) | Only null_value_count for the array column | null_value_count |
| tags (items) | enabled_stats replaces inherited set | value_count, approx_count_distinct, histogram(max_bins=200, overflow=none) |
The address.state field gets no stats because it is not listed in the properties array. Nested property configuration is opt-in — unlisted children are skipped entirely.
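
The opt-in behavior for object children can be modeled in a few lines. The sketch below assumes each listed property carries its own enabled_stats, as in this example; the helper name and the list of schema children are illustrative only.

```python
def stats_for_children(schema_children: list[str], properties: list[dict]) -> dict[str, list[str]]:
    """Map each child of an object to its configured stats; unlisted children get none."""
    configured = {prop["path"]: prop.get("enabled_stats", []) for prop in properties}
    return {child: configured.get(child, []) for child in schema_children}


address_properties = [
    {"path": "city", "enabled_stats": ["value_count", "approx_count_distinct"]},
    {"path": "zip", "enabled_stats": ["value_count", "null_value_count", "histogram"]},
]
result = stats_for_children(["city", "zip", "state"], address_properties)
for child, stats in result.items():
    print(child, stats)
```

Here state maps to an empty list because it never appears in properties.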

Validation errors

When a configuration violates a constraint, the API returns a 400 Bad Request with a descriptive error message. Multiple validation errors may be reported in a single response, separated by semicolons. Each error is documented with its exact trigger condition, an example request that produces it, and how to fix it:

| Error | Summary |
| --- | --- |
| Missing Required Default Statistics | defaults must include enabled_stats |
| Duplicate Statistics | Same stat listed more than once |
| Conflicting Distinct Count Methods | approx_count_distinct + count_distinct together |
| Invalid Histogram Bin Count | max_bins outside [2, 100000] |
| Histogram Options Without Histogram Enabled | Histogram options without histogram stat |
| Too Many Histograms | More than 64 histogram-enabled primitive leaves |
| Mixed Primitive and Non-Primitive Configuration | Primitive + non-primitive fields on same node |
| Empty Node Configuration | Node with no configuration fields |
| Empty Namespace Configuration | Namespace without scope or fields |
| Empty Scope Configuration | Scope without enabled_stats or stat_options |
| Empty Properties List | Empty properties array |
| Duplicate Field Names | Same field_name appears more than once |
| Duplicate Property Paths | Same property path appears more than once |
| Conflicting Object and Array Configuration | Both properties and items on same node |
| Unsupported Statistics for Container Node | Stats other than null_value_count in self |
| Invalid Cron Expression | Malformed cron expression |
| Type Validation Error | Stats incompatible with the dataset schema types |
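
Some of these constraints can be checked on the client before calling the API. The sketch below covers only the defaults block and a handful of rules, and joins the findings with semicolons to mirror the response format described above. The function names and message wording are hypothetical, not the server's actual error text.

```python
def check_stat_node(node: dict, where: str) -> list[str]:
    """Check one enabled_stats / stat_options pair against a few documented rules."""
    errors = []
    stats = node.get("enabled_stats")
    if stats is not None:
        if len(stats) != len(set(stats)):
            errors.append(f"{where}: duplicate statistics")
        if {"approx_count_distinct", "count_distinct"} <= set(stats):
            errors.append(f"{where}: conflicting distinct count methods")
    histogram_options = node.get("stat_options", {}).get("histogram")
    if histogram_options is not None:
        max_bins = histogram_options.get("max_bins")
        if max_bins is not None and not 2 <= max_bins <= 100000:
            errors.append(f"{where}: max_bins outside [2, 100000]")
        # Only checkable locally when this level declares its own stat set.
        if stats is not None and "histogram" not in stats:
            errors.append(f"{where}: histogram options without histogram enabled")
    return errors


def check_defaults(config: dict) -> str:
    """Return a semicolon-separated error string for the defaults block ('' if clean)."""
    defaults = config.get("defaults", {})
    errors = []
    if "enabled_stats" not in defaults:
        errors.append("defaults: enabled_stats is required")
    errors.extend(check_stat_node(defaults, "defaults"))
    return "; ".join(errors)


bad_config = {
    "defaults": {
        "enabled_stats": ["value_count", "value_count", "approx_count_distinct", "count_distinct"],
        "stat_options": {"histogram": {"max_bins": 1}},
    }
}
print(check_defaults(bad_config))
```

A local check like this does not replace server validation; rules such as Type Validation Error depend on the dataset schema, which the client may not have.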

Related pages

  • Dataset Statistics: Why statistics matter and how they work
  • Configuring Dataset Statistics: Set up statistics through the platform UI
  • Job Types: Job types including statistics computation jobs
  • Tracking Jobs: Monitor job status and handle completion