Dataset Statistics Reference - Narrative I/O Knowledge Base

This page is the definitive reference for dataset statistics — the column-level metrics the platform computes, their type compatibility, and the configuration API for controlling what gets computed and when. For background on what statistics are and why they matter, see Dataset Statistics. To configure statistics through the platform UI, see the configuring dataset statistics guide.

API endpoints

Method	Path	Description
`PUT`	`/datasets/{dataset_id}/statistics-configuration`	Create or replace the statistics configuration
`GET`	`/datasets/{dataset_id}/statistics-configuration`	Retrieve the current configuration
`DELETE`	`/datasets/{dataset_id}/statistics-configuration`	Disable statistics (stop computing)

All endpoints require a Bearer token with dataset write permissions.
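
As a rough illustration of the endpoints above, the following Python sketch builds (but does not send) a PUT request. The base URL https://api.example.com, the dataset ID, the token, and the JSON Content-Type header are placeholders and assumptions, not values taken from this reference; substitute the ones that apply to your environment.

```python
import json
import urllib.request

# Placeholder: the real API base URL is an assumption, not from this page.
BASE_URL = "https://api.example.com"


def build_put_request(dataset_id, token, config):
    """Build a PUT request that creates or replaces a statistics configuration."""
    url = f"{BASE_URL}/datasets/{dataset_id}/statistics-configuration"
    return urllib.request.Request(
        url,
        data=json.dumps(config).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            # Assumed content type for a JSON request body.
            "Content-Type": "application/json",
        },
    )


config = {
    "defaults": {"enabled_stats": ["value_count", "null_value_count"]},
    "refresh": {"trigger": "manual"},
}
request = build_put_request(1234, "YOUR_TOKEN", config)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Swapping method="PUT" for "GET" or "DELETE" (and dropping the body) would target the other two endpoints in the table.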

Configuration schema overview

A statistics configuration has four top-level fields:

{
  "defaults": { ... },       // required — global stat selection
  "refresh": { ... },        // required — when to recompute
  "dataset": { ... },        // optional — overrides for dataset columns
  "rosetta_stone": { ... }   // optional — overrides for Rosetta Stone attributes
}

`defaults` (required)

The global stat selection applied to every eligible field. Must include enabled_stats — there is no parent to inherit from at this level.

{
  "enabled_stats": ["value_count", "null_value_count", "histogram"],
  "stat_options": {
    "histogram": { "max_bins": 100, "overflow": "truncate" }
  }
}

`refresh` (required)

Controls when statistics are recomputed. Specify exactly one trigger type:

Trigger	Payload	Behavior
`cron`	`{ "trigger": "cron", "cron_expression": "0 0 * * *" }`	Recompute on a UTC cron schedule
`on_update`	`{ "trigger": "on_update" }`	Recompute whenever dataset data changes
`manual`	`{ "trigger": "manual" }`	Only compute on demand
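
The trigger table above can be checked client-side before a configuration is submitted. The sketch below is a hypothetical helper, not part of the platform: check_refresh and its messages are illustrative names, and the five-field test is only a shallow assumption about cron syntax based on the examples on this page. The server remains the authority on what it accepts.

```python
VALID_TRIGGERS = {"cron", "on_update", "manual"}


def check_refresh(refresh):
    """Return a list of problems found in a refresh payload (empty if none)."""
    problems = []
    trigger = refresh.get("trigger")
    if trigger not in VALID_TRIGGERS:
        problems.append(f"unknown trigger: {trigger!r}")
    elif trigger == "cron":
        expr = refresh.get("cron_expression")
        # Shallow check only: assumes a standard five-field cron expression.
        if not isinstance(expr, str) or len(expr.split()) != 5:
            problems.append("cron trigger needs a five-field cron_expression")
    return problems


print(check_refresh({"trigger": "cron", "cron_expression": "0 0 * * *"}))
print(check_refresh({"trigger": "cron"}))
```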

`dataset` and `rosetta_stone` (optional)

Namespace-level overrides. Each namespace has:

scope — overrides defaults for all fields in this namespace
fields — per-field overrides for individual columns/attributes

At least one of scope or fields must be provided if the namespace is present.

Supported statistics

Each statistic is computed per column. The Computation column of the table below shows the underlying SQL-level operation.

Stat name	Description	Computation
`value_count`	Total number of non-null values in the column	`COUNT(column)`
`null_value_count`	Number of null values in the column	`COUNT(*) - COUNT(column)`
`nan_value_count`	Number of NaN (not-a-number) values — floating-point columns only	`COUNT_IF(IS_NAN(column))`
`lower_bound`	Minimum value in the column	`MIN(column)`
`upper_bound`	Maximum value in the column	`MAX(column)`
`approx_count_distinct`	Approximate number of distinct values, using HyperLogLog for efficiency on large datasets	`APPROX_COUNT_DISTINCT(column)`
`count_distinct`	Exact number of distinct values	`COUNT(DISTINCT column)`
`histogram`	Frequency distribution of values across distinct buckets	Aggregation over value frequencies
`mean`	Arithmetic mean — numeric columns only	`AVG(column)`
`standard_deviation`	Population standard deviation — numeric columns only	`STDDEV(column)`
`completeness`	Ratio of non-null values to total rows (0 to 1)	`COUNT(column) / COUNT(*)`

approx_count_distinct and count_distinct cannot both appear in the same enabled_stats list. Choose one. See Conflicting Distinct Count Methods.
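
To make the definitions in the table concrete, here is a small Python sketch that computes several of them for one column held as a list, with None standing in for null. It is an in-memory illustration of the definitions only, not how the platform computes them, and column_stats is a hypothetical name. It assumes at least one non-null value and a sample without NaN in the values used for bounds and moments.

```python
import math
import statistics


def column_stats(values):
    """Compute several of the listed statistics for one column held as a list."""
    non_null = [v for v in values if v is not None]
    return {
        "value_count": len(non_null),                     # COUNT(column)
        "null_value_count": len(values) - len(non_null),  # COUNT(*) - COUNT(column)
        "nan_value_count": sum(
            1 for v in non_null if isinstance(v, float) and math.isnan(v)
        ),
        "lower_bound": min(non_null),
        "upper_bound": max(non_null),
        "count_distinct": len(set(non_null)),
        "mean": statistics.mean(non_null),
        # Population form, matching the description in the table.
        "standard_deviation": statistics.pstdev(non_null),
        "completeness": len(non_null) / len(values),
    }


stats = column_stats([2.0, 4.0, 4.0, None, 6.0])
print(stats)
```

For this sample, four of the five rows are non-null, so completeness is 0.8 and the mean is 4.0.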

Type compatibility matrix

Not all statistics apply to all data types. The table below shows which statistics are computed for each type.

Type	Supported statistics
`DoubleType`	All 11: value_count, null_value_count, nan_value_count, approx_count_distinct, count_distinct, lower_bound, upper_bound, histogram, mean, standard_deviation, completeness
`LongType`	10 (all except nan_value_count): value_count, null_value_count, approx_count_distinct, count_distinct, lower_bound, upper_bound, histogram, mean, standard_deviation, completeness
`StringType`	8: value_count, null_value_count, approx_count_distinct, count_distinct, lower_bound, upper_bound, histogram, completeness
`TimestampTzType`	8: value_count, null_value_count, approx_count_distinct, count_distinct, lower_bound, upper_bound, histogram, completeness
`BooleanType`	6: value_count, null_value_count, approx_count_distinct, count_distinct, histogram, completeness
`ArrayType`	2: value_count, null_value_count
`ObjectType`	2: value_count, null_value_count

DoubleType is the only type that supports nan_value_count, since NaN is a floating-point concept. mean and standard_deviation are limited to numeric types (DoubleType and LongType). lower_bound and upper_bound require ordered types (numeric, timestamp, or string).
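
The matrix above can be expressed as a lookup table. The following Python sketch encodes it and filters a requested stat list down to what a given type supports; SUPPORTED and filter_stats are hypothetical names used for illustration, and the filtering only approximates what the server does on its side.

```python
# Stat sets built up from the type compatibility matrix.
CONTAINER = {"value_count", "null_value_count"}
BOOLEAN = CONTAINER | {
    "approx_count_distinct", "count_distinct", "histogram", "completeness",
}
ORDERED = BOOLEAN | {"lower_bound", "upper_bound"}
LONG = ORDERED | {"mean", "standard_deviation"}
DOUBLE = LONG | {"nan_value_count"}

SUPPORTED = {
    "DoubleType": DOUBLE,
    "LongType": LONG,
    "StringType": ORDERED,
    "TimestampTzType": ORDERED,
    "BooleanType": BOOLEAN,
    "ArrayType": CONTAINER,
    "ObjectType": CONTAINER,
}


def filter_stats(enabled_stats, type_name):
    """Keep only the stats the given type supports, preserving request order."""
    supported = SUPPORTED[type_name]
    return [stat for stat in enabled_stats if stat in supported]


print(filter_stats(["value_count", "mean", "histogram"], "StringType"))
```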

Inheritance model

The configuration uses a layered inheritance chain:

defaults → namespace scope → field override → nested node overrides

Two distinct rules apply:

1. Stat set (`enabled_stats`) — all-or-nothing replacement

When a level provides enabled_stats, it completely replaces the parent’s stat set. There is no merging.

2. Stat options (`stat_options`) — per-property inheritance

When a level provides only stat_options (without enabled_stats), it inherits the stat set from its parent and overrides only the specified option properties. Unspecified properties keep the value from the resolved parent — not from the original defaults. This means overrides accumulate through the chain:

defaults:  histogram(max_bins=50, overflow=none)
    ↓ scope overrides max_bins
scope:     histogram(max_bins=200, overflow=none)
    ↓ field overrides overflow
field:     histogram(max_bins=200, overflow=truncate)

The field gets max_bins=200 from scope, not max_bins=50 from defaults.
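
The two rules can be sketched as a single resolution step applied level by level. The Python below is a minimal model of the behavior described above, not the server's implementation; resolve is a hypothetical helper that takes an already-resolved parent and one level's override.

```python
import copy


def resolve(parent, override):
    """Resolve one level of the chain against its already-resolved parent."""
    resolved = copy.deepcopy(parent)
    if "enabled_stats" in override:
        # Rule 1: a provided stat set replaces the parent's set wholesale.
        resolved["enabled_stats"] = list(override["enabled_stats"])
    for stat, options in override.get("stat_options", {}).items():
        # Rule 2: options merge property by property onto the resolved parent.
        resolved.setdefault("stat_options", {}).setdefault(stat, {}).update(options)
    return resolved


defaults = {
    "enabled_stats": ["value_count", "histogram"],
    "stat_options": {"histogram": {"max_bins": 50, "overflow": "none"}},
}
scope = resolve(defaults, {"stat_options": {"histogram": {"max_bins": 200}}})
field = resolve(scope, {"stat_options": {"histogram": {"overflow": "truncate"}}})
print(field["stat_options"]["histogram"])
```

Because each level resolves against the previous result rather than against defaults, the field ends up with max_bins=200 and overflow=truncate.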

Primitive vs. non-primitive fields

Fields map to one of two node kinds based on their schema type:

Node kind	Schema types	Available config fields
Primitive	string, integer, float, boolean, timestamp	`enabled_stats`, `stat_options`
Non-primitive	object, array	`self`, `properties`, `items`

You cannot mix primitive and non-primitive fields on the same node. See Mixed Primitive and Non-Primitive Configuration (MixedNodeConfig). For non-primitive nodes:

self — stats on the container column itself (only null_value_count is supported)
properties — explicit configs for named children of an object (opt-in, not inherited)
items — config for array element nodes
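
The node-kind rule can be checked from the keys a node carries. This Python sketch is a hypothetical client-side check (node_kind and its error messages are illustrative, not the API's wording): it classifies a node and rejects one that mixes the two groups of config fields.

```python
PRIMITIVE_KEYS = {"enabled_stats", "stat_options"}
NON_PRIMITIVE_KEYS = {"self", "properties", "items"}


def node_kind(node):
    """Classify a node config as primitive or non-primitive by its keys."""
    keys = set(node)
    has_primitive = bool(keys & PRIMITIVE_KEYS)
    has_non_primitive = bool(keys & NON_PRIMITIVE_KEYS)
    if has_primitive and has_non_primitive:
        raise ValueError("node mixes primitive and non-primitive config fields")
    if has_non_primitive:
        return "non-primitive"
    if has_primitive:
        return "primitive"
    raise ValueError("node has no configuration fields")


print(node_kind({"field_name": "price", "stat_options": {"histogram": {}}}))
print(node_kind({"field_name": "tags", "self": {}, "items": {}}))
```

Only the node's own keys are inspected, so an enabled_stats list nested inside self does not count as mixing.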

Example 1: Simple defaults-only configuration

The simplest useful configuration: define defaults and a refresh trigger, and every eligible field gets the same stats.

Goal: Compute value_count, null_value_count, and completeness for all fields, refreshing daily at midnight UTC.

{
  "defaults": {
    "enabled_stats": ["value_count", "null_value_count", "completeness"]
  },
  "refresh": {
    "trigger": "cron",
    "cron_expression": "0 0 * * *"
  }
}

That’s it. The server applies these three stats to every compatible dataset column and Rosetta Stone attribute. Stats that are incompatible with a field’s type (e.g., mean on a string column) are filtered out automatically.

Adding histogram with custom options

Extend the defaults to include a histogram with 200 bins and truncation overflow:

{
  "defaults": {
    "enabled_stats": ["value_count", "null_value_count", "completeness", "histogram"],
    "stat_options": {
      "histogram": {
        "max_bins": 200,
        "overflow": "truncate"
      }
    }
  },
  "refresh": {
    "trigger": "on_update"
  }
}

Disabling stats for a specific namespace

If you want stats on dataset columns but not Rosetta Stone attributes:

{
  "defaults": {
    "enabled_stats": ["value_count", "null_value_count"]
  },
  "refresh": { "trigger": "manual" },
  "rosetta_stone": {
    "scope": {
      "enabled_stats": []
    }
  }
}

The empty enabled_stats: [] explicitly disables all stats for Rosetta Stone fields.

Example 2: Complex configuration with nested structures

Scenario: A dataset with these columns:

user_id (string): needs an exact distinct count
price (float): needs a histogram with custom bins
address (object with children city, zip, state): needs stats on city and zip only
tags (array of strings): needs stats on the array items, and a null count on the array column itself

Step 1: Set defaults

Start with a broad set of defaults:

{
  "defaults": {
    "enabled_stats": ["value_count", "null_value_count", "histogram", "completeness"],
    "stat_options": {
      "histogram": { "max_bins": 50 }
    }
  },
  "refresh": { "trigger": "cron", "cron_expression": "0 3 * * 1" }
}

Step 2: Add namespace scope to tune histogram bins

Override histogram bins for all dataset fields to 200, while keeping the stat set inherited from defaults:

{
  "defaults": {
    "enabled_stats": ["value_count", "null_value_count", "histogram", "completeness"],
    "stat_options": {
      "histogram": { "max_bins": 50 }
    }
  },
  "refresh": { "trigger": "cron", "cron_expression": "0 3 * * 1" },
  "dataset": {
    "scope": {
      "stat_options": {
        "histogram": { "max_bins": 200 }
      }
    }
  }
}

Because scope provides only stat_options (no enabled_stats), it inherits the full stat set from defaults. Only max_bins is overridden — overflow retains the server default of none.

Step 3: Add per-field overrides

Now add the field-specific configurations:

{
  "defaults": {
    "enabled_stats": ["value_count", "null_value_count", "histogram", "completeness"],
    "stat_options": {
      "histogram": { "max_bins": 50 }
    }
  },
  "refresh": { "trigger": "cron", "cron_expression": "0 3 * * 1" },
  "dataset": {
    "scope": {
      "stat_options": {
        "histogram": { "max_bins": 200 }
      }
    },
    "fields": [
      {
        "field_name": "user_id",
        "enabled_stats": ["value_count", "count_distinct"]
      },
      {
        "field_name": "price",
        "stat_options": {
          "histogram": { "overflow": "truncate" }
        }
      },
      {
        "field_name": "address",
        "self": {
          "enabled_stats": ["null_value_count"]
        },
        "properties": [
          {
            "path": "city",
            "enabled_stats": ["value_count", "approx_count_distinct"]
          },
          {
            "path": "zip",
            "enabled_stats": ["value_count", "null_value_count", "histogram"],
            "stat_options": {
              "histogram": { "max_bins": 1000, "overflow": "truncate" }
            }
          }
        ]
      },
      {
        "field_name": "tags",
        "self": {
          "enabled_stats": ["null_value_count"]
        },
        "items": {
          "enabled_stats": ["value_count", "approx_count_distinct", "histogram"]
        }
      }
    ]
  }
}

What each field resolves to

Field	Resolution	Effective stats
`user_id`	`enabled_stats` replaces inherited set	`value_count`, `count_distinct`
`price`	Inherits stat set from scope (which inherited from defaults); overrides `overflow`	`value_count`, `null_value_count`, `histogram(max_bins=200, overflow=truncate)`, `completeness`
`address` (self)	Only `null_value_count` (the only stat supported for non-primitive self)	`null_value_count`
`address.city`	`enabled_stats` replaces inherited set	`value_count`, `approx_count_distinct`
`address.zip`	`enabled_stats` with custom histogram options	`value_count`, `null_value_count`, `histogram(max_bins=1000, overflow=truncate)`
`address.state`	Not listed in `properties` — receives no stats	(none)
`tags` (self)	Only `null_value_count` for the array column	`null_value_count`
`tags` (items)	`enabled_stats` replaces inherited set	`value_count`, `approx_count_distinct`, `histogram(max_bins=200, overflow=none)`

The address.state field gets no stats because it is not listed in the properties array. Nested property configuration is opt-in — unlisted children are skipped entirely.
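
The opt-in behavior for object children can be sketched in a few lines of Python. This is a simplified model that looks only at each property's enabled_stats and ignores option inheritance; stats_for_children is a hypothetical helper, not a platform function.

```python
def stats_for_children(children, properties):
    """Map each child of an object column to its configured stats."""
    configured = {p["path"]: p.get("enabled_stats", []) for p in properties}
    # Children missing from `properties` are skipped, so they get no stats.
    return {child: configured.get(child, []) for child in children}


properties = [
    {"path": "city", "enabled_stats": ["value_count", "approx_count_distinct"]},
    {"path": "zip", "enabled_stats": ["value_count", "null_value_count", "histogram"]},
]
result = stats_for_children(["city", "zip", "state"], properties)
print(result)
```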

Validation errors

When a configuration violates a constraint, the API returns a 400 Bad Request with a descriptive error message. Multiple validation errors may be reported in a single response, separated by semicolons. The errors and their trigger conditions are listed below:

Error	Trigger condition
Missing Required Default Statistics	`defaults` must include `enabled_stats`
Duplicate Statistics	Same stat listed more than once
Conflicting Distinct Count Methods	`approx_count_distinct` + `count_distinct` together
Invalid Histogram Bin Count	`max_bins` outside [2, 100000]
Histogram Options Without Histogram Enabled	Histogram options without `histogram` stat
Too Many Histograms	More than 64 histogram-enabled primitive leaves
Mixed Primitive and Non-Primitive Configuration	Primitive + non-primitive fields on same node
Empty Node Configuration	Node with no configuration fields
Empty Namespace Configuration	Namespace without `scope` or `fields`
Empty Scope Configuration	Scope without `enabled_stats` or `stat_options`
Empty Properties List	Empty `properties` array
Duplicate Field Names	Same `field_name` appears more than once
Duplicate Property Paths	Same property `path` appears more than once
Conflicting Object and Array Configuration	Both `properties` and `items` on same node
Unsupported Statistics for Container Node	Stats other than `null_value_count` in `self`
Invalid Cron Expression	Malformed cron expression
Type Validation Error	Stats incompatible with the dataset schema types
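
A few of these checks are simple enough to run client-side before submitting a configuration. The Python sketch below covers four of them for a single enabled_stats/stat_options block. It is a hypothetical pre-flight check: validate_stat_block and its message strings are illustrative and are not the API's actual error text. The histogram-options check only fires when the block carries its own enabled_stats, because an inherited stat set cannot be judged locally.

```python
def validate_stat_block(block):
    """Check one stat block; return error messages joined by semicolons."""
    errors = []
    stats = block.get("enabled_stats", [])
    if len(stats) != len(set(stats)):
        errors.append("duplicate statistics")
    if {"approx_count_distinct", "count_distinct"} <= set(stats):
        errors.append("conflicting distinct count methods")
    histogram = block.get("stat_options", {}).get("histogram")
    if histogram is not None:
        # Only decidable locally when this block defines its own stat set.
        if "enabled_stats" in block and "histogram" not in stats:
            errors.append("histogram options without histogram enabled")
        max_bins = histogram.get("max_bins")
        if max_bins is not None and not 2 <= max_bins <= 100000:
            errors.append("invalid histogram bin count")
    return "; ".join(errors)


bad = {
    "enabled_stats": ["value_count", "value_count", "approx_count_distinct", "count_distinct"],
    "stat_options": {"histogram": {"max_bins": 1}},
}
print(validate_stat_block(bad))
```

An empty string means none of the four checks failed; the remaining errors in the list still need the server's validation.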