Rosetta Stone normalizes data through two core primitives: attributes and mappings. Attributes define the common schema—what the normalized data looks like. Mappings translate each provider’s native columns into that common schema.

The two core primitives

Attributes

An attribute is a standardized field definition in the common schema. Each attribute specifies:
  • Name: A unique identifier (e.g., hl7_gender, event_timestamp)
  • Description: A human-readable explanation of what the attribute represents
  • Type: The data type: string, long, double, boolean, timestamptz, object, or array
  • Validations: Rules that data must satisfy (expressed as an array of validation strings)

Example: Unique Identifier

The unique_identifier attribute captures identity data from various sources. It’s defined as an object with three properties:
{
  "id": 100,
  "name": "unique_identifier",
  "type": "object",
  "properties": {
    "type": {
      "type": "string"
    },
    "value": {
      "type": "string"
    },
    "context": {
      "type": "string"
    }
  },
  "required": ["type", "value"]
}
This structure allows different identity types to be represented consistently, whether the source data contains mobile advertising IDs, hashed emails, or cookie identifiers.

Example: Gender

The hl7_gender attribute normalizes gender data using the HL7 standard. It’s a string type with restricted enum values:
{
  "id": 200,
  "name": "hl7_gender",
  "type": "string",
  "enum": ["male", "female", "other", "unknown"],
  "description": "Gender using HL7 administrative gender codes"
}
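To make these definitions concrete, here is a minimal validation sketch in Python (illustrative only, not the platform's actual validator) that checks values against the two attribute definitions above:

```python
# Minimal sketch of attribute validation (illustrative; not the
# platform's actual validator). Definitions mirror the JSON above.
UNIQUE_IDENTIFIER = {
    "name": "unique_identifier",
    "type": "object",
    "properties": {"type": {"type": "string"},
                   "value": {"type": "string"},
                   "context": {"type": "string"}},
    "required": ["type", "value"],
}

HL7_GENDER = {
    "name": "hl7_gender",
    "type": "string",
    "enum": ["male", "female", "other", "unknown"],
}

def validates(attribute, value):
    """Return True if `value` satisfies the attribute definition."""
    if attribute["type"] == "string":
        if not isinstance(value, str):
            return False
        enum = attribute.get("enum")
        return enum is None or value in enum
    if attribute["type"] == "object":
        if not isinstance(value, dict):
            return False
        # Every required property must be present.
        return all(key in value for key in attribute["required"])
    return False

print(validates(HL7_GENDER, "male"))   # True
print(validates(HL7_GENDER, "M"))      # False: not in the enum
print(validates(UNIQUE_IDENTIFIER,
                {"type": "hashed_email", "value": "abc123"}))  # True
```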

Mappings

A mapping connects a specific column in a dataset to an attribute. Each mapping includes:
  • Source column: The column in the provider’s dataset
  • Target attribute: The Rosetta Stone attribute to map to
  • Transformation: An optional expression that converts the data
  • Dataset: The specific dataset this mapping applies to

Example: Gender mapping

A provider stores gender as "M" or "F" in a column called sex. The mapping would be:
{
  "source_column": "sex",
  "target_attribute": "hl7_gender",
  "transformation": "CASE WHEN sex = 'M' THEN 'male' WHEN sex = 'F' THEN 'female' ELSE 'unknown' END"
}
The transformation expression converts the provider’s format to the standard enum values.
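The logic of that CASE expression can be mirrored in plain Python (a sketch for illustration; the platform evaluates the SQL expression itself):

```python
# Sketch of the gender mapping's CASE expression in Python
# (illustrative; the platform evaluates the SQL transformation).
def transform_sex(sex):
    # 'M' -> 'male', 'F' -> 'female', anything else -> 'unknown'
    return {"M": "male", "F": "female"}.get(sex, "unknown")

rows = [{"sex": "M"}, {"sex": "F"}, {"sex": "X"}, {"sex": None}]
normalized = [transform_sex(row["sex"]) for row in rows]
print(normalized)  # ['male', 'female', 'unknown', 'unknown']
```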

The normalization pipeline

Rosetta Stone normalizes data through a three-stage pipeline:

Stage 1: Schema inference

When data is uploaded to Narrative, the system analyzes it to understand its structure:
  1. Column detection: Identifies column names and data types
  2. Pattern recognition: Detects common patterns (dates, identifiers, categorical data)
  3. Attribute suggestion: Uses machine learning to suggest which attributes each column maps to
This automated analysis provides a starting point—suggestions that can be accepted, modified, or rejected.
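The pattern-recognition step can be pictured with a toy regex-based detector (the platform uses machine learning models; this sketch only conveys the idea, and the pattern names are hypothetical):

```python
import re

# Toy illustration of pattern recognition over sample column values.
# The platform uses ML models; these regexes only convey the idea.
PATTERNS = [
    ("timestamp", re.compile(r"^\d{4}-\d{2}-\d{2}T")),
    ("date",      re.compile(r"^\d{2}/\d{2}/\d{4}$")),
    ("email",     re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
]

def suggest_pattern(sample_values):
    """Return the first pattern name matching every sample value, if any."""
    for name, pattern in PATTERNS:
        if all(pattern.match(v) for v in sample_values):
            return name
    return None

print(suggest_pattern(["01/15/2024", "03/02/2023"]))  # date
print(suggest_pattern(["a@b.com", "c@d.org"]))        # email
```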

Stage 2: Mapping creation

Mappings are created through a combination of machine learning and human curation:
  1. Auto-generated mappings: The system proposes mappings based on schema inference
  2. Human review: Data owners review suggestions and refine as needed
  3. Transformation definition: Complex mappings include transformation expressions
This iterative process ensures accuracy while minimizing manual effort. As more data flows through the system, the ML models improve their suggestions.

AI-assisted quality evaluation

After mappings are created, you can use AI to evaluate their quality and suggest improvements:
  1. Confidence scoring: AI analyzes each mapping and assigns a confidence score (0-100%) indicating how likely the mapping is to produce correct results
  2. Issue identification: The system highlights potential problems with transformations, such as missing case handling or type mismatches
  3. Suggestion generation: AI proposes new mappings for columns that aren’t yet normalized
This ongoing refinement helps maintain data quality as source schemas evolve or new data patterns emerge. See Confidence Scoring for details on how the AI evaluation works.

Stage 3: Query-time translation

When you query the narrative.rosetta_stone table:
  1. Query analysis: The system identifies which attributes you’re requesting
  2. Dataset discovery: Finds all datasets with mappings for those attributes
  3. Query translation: Rewrites your query for each dataset’s native schema
  4. Execution: Runs the translated queries against source data
  5. Normalization: Applies transformations and unions results
  6. Return: Delivers data in the consistent, normalized format
This happens transparently—you write one query, and the platform handles the complexity of accessing multiple heterogeneous sources.
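The six steps above can be sketched over in-memory rows (the dataset names, columns, and mapping structures here are hypothetical; the real engine rewrites and runs SQL against each dataset's native schema):

```python
# Sketch of the query-time pipeline over in-memory rows (hypothetical
# datasets and mappings; the real engine translates and executes SQL).
DATASETS = {
    "provider_a": [{"gender": "Male"}, {"gender": "Female"}],
    "provider_b": [{"sex": "M"}, {"sex": "F"}],
}

# Each mapping: attribute name -> function over a native row.
MAPPINGS = {
    "provider_a": {"hl7_gender": lambda row: row["gender"].lower()},
    "provider_b": {"hl7_gender": lambda row:
                   {"M": "male", "F": "female"}.get(row["sex"], "unknown")},
}

def query(attribute):
    """Find datasets mapping `attribute`, normalize each, union results."""
    results = []
    for dataset, mappings in MAPPINGS.items():
        if attribute in mappings:                # dataset discovery
            transform = mappings[attribute]      # query translation
            for row in DATASETS[dataset]:        # execution
                results.append(transform(row))   # normalization
    return results                               # unioned, normalized

print(query("hl7_gender"))  # ['male', 'female', 'male', 'female']
```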

Multiple mappings to the same attribute

When a dataset has multiple columns mapped to the same attribute, the normalization process:
  1. Evaluates each mapping independently
  2. Collects results into an array
  3. Expands the array to produce one output row per value
  4. Filters NULL values during expansion
This enables maximum data utilization—if a provider captures email addresses in multiple columns, queries receive all available values without needing to know the source schema structure. For example, if a dataset has email_1 and email_2 both mapped to raw_email, a query selecting that attribute returns one row per non-null email address. See Multiple columns mapped to the same attribute for detailed examples and when to use alternatives like COALESCE.
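The four-step expansion can be sketched as follows, using the hypothetical email_1/email_2 columns from the example above:

```python
# Sketch of expanding multiple mappings of one attribute into rows
# (hypothetical email_1/email_2 columns from the example above).
def expand(row, source_columns):
    values = [row.get(col) for col in source_columns]  # evaluate each mapping
    return [v for v in values if v is not None]        # filter NULLs, expand

print(expand({"email_1": "a@example.com", "email_2": None},
             ["email_1", "email_2"]))   # ['a@example.com']

print(expand({"email_1": "a@example.com", "email_2": "b@example.com"},
             ["email_1", "email_2"]))   # one output value per non-null email
```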

Normalization examples

Date normalization

Different providers store dates in various formats:
  • Provider A: column event_date, sample value 01/15/2024
  • Provider B: column timestamp, sample value 2024-01-15T14:30:00Z
  • Provider C: column dt, sample value 15-Jan-2024
All three map to the event_timestamp attribute. The mappings include transformations that parse each format and output ISO 8601:
-- Provider A mapping transformation
TO_TIMESTAMP(event_date, 'MM/DD/YYYY')

-- Provider B mapping transformation
timestamp  -- Already ISO 8601, no transformation needed

-- Provider C mapping transformation
TO_TIMESTAMP(dt, 'DD-Mon-YYYY')
When you query event_timestamp, you receive consistent ISO 8601 timestamps regardless of source.
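The three SQL transformations above can be mirrored in Python with strptime-style format strings (a sketch of their behavior, not the platform's execution path):

```python
from datetime import datetime

# Sketch of the three providers' date transformations in Python
# (the mappings use SQL; this mirrors their parsing behavior).
def provider_a(value):          # e.g. 01/15/2024
    return datetime.strptime(value, "%m/%d/%Y")

def provider_b(value):          # already ISO 8601; parse as-is
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def provider_c(value):          # e.g. 15-Jan-2024
    return datetime.strptime(value, "%d-%b-%Y")

for ts in (provider_a("01/15/2024"),
           provider_b("2024-01-15T14:30:00Z"),
           provider_c("15-Jan-2024")):
    print(ts.date().isoformat())  # 2024-01-15 for all three
```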

Gender normalization

Providers represent gender in many ways:
ProviderColumnValues
Provider Agender"male", "female"
Provider Bsex"M", "F"
Provider Cgender_code1, 2, 0
Provider Dgndr"m", "f", "nb"
Each mapping includes a transformation to the hl7_gender enum:
-- Provider A: Already matches, minimal transformation
LOWER(gender)

-- Provider B: Letter codes to words
CASE sex WHEN 'M' THEN 'male' WHEN 'F' THEN 'female' ELSE 'unknown' END

-- Provider C: Numeric codes to words
CASE gender_code WHEN 1 THEN 'male' WHEN 2 THEN 'female' ELSE 'unknown' END

-- Provider D: Handle non-binary
CASE LOWER(gndr) WHEN 'm' THEN 'male' WHEN 'f' THEN 'female' WHEN 'nb' THEN 'other' ELSE 'unknown' END

Validation and quality

Mappings aren’t just translations—they’re quality gates. Each mapping can enforce validations:
  • Type checking: Ensures values can be cast to the target type
  • Enum validation: Confirms values match allowed enum members
  • Range checking: Verifies numeric values fall within acceptable bounds
  • Pattern matching: Validates strings match expected formats (e.g., email patterns)
When data fails validation, the system can:
  • Reject the record
  • Map to a default value (like unknown for invalid gender)
  • Flag the record for review
This ensures the normalized view contains clean, consistent data.
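The three failure-handling options can be sketched for the hl7_gender enum (an illustrative policy function; the names and signature here are hypothetical, not the platform's API):

```python
# Sketch of validation-failure handling for hl7_gender (illustrative;
# function name and signature are hypothetical, not the platform's API).
ALLOWED = {"male", "female", "other", "unknown"}

def validate_gender(value, on_failure="default"):
    if value in ALLOWED:
        return value
    if on_failure == "reject":
        raise ValueError(f"invalid hl7_gender: {value!r}")
    if on_failure == "flag":
        return {"value": value, "needs_review": True}
    return "unknown"  # map to a default value

print(validate_gender("male"))             # 'male'
print(validate_gender("banana"))           # 'unknown'
print(validate_gender("banana", "flag"))   # flagged for review
```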