The two core primitives
Attributes
An attribute is a standardized field definition in the common schema. Each attribute specifies:

| Property | Description |
|---|---|
| Name | A unique identifier (e.g., hl7_gender, event_timestamp) |
| Description | Human-readable explanation of what the attribute represents |
| Type | The data type: string, long, double, boolean, timestamptz, object, or array |
| Validations | Rules that data must satisfy (as an array of validation strings) |
The unique_identifier attribute captures identity data from various sources. It’s defined as an object with three properties.
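The original attribute definition isn't reproduced here; as an illustrative sketch, it might look like the following. The property names (`type`, `value`, `source`) are assumptions for illustration, not the actual schema:

```python
# Hypothetical sketch of the unique_identifier attribute definition.
# Property names ("type", "value", "source") are illustrative assumptions.
unique_identifier = {
    "name": "unique_identifier",
    "description": "Identity data captured from various sources",
    "type": "object",
    "properties": {
        "type":   {"type": "string", "description": "Kind of identifier (e.g., email, device ID)"},
        "value":  {"type": "string", "description": "The identifier itself"},
        "source": {"type": "string", "description": "Where the identifier originated"},
    },
}
```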
The hl7_gender attribute normalizes gender data using the HL7 standard. It’s a string type with restricted enum values.
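A sketch of what that definition could look like; the exact enum members shown (male, female, other, unknown) are an assumption based on the HL7 administrative-gender value set:

```python
# Sketch of the hl7_gender attribute: a string restricted to an enum.
# The member list is assumed from the HL7 administrative-gender value set.
hl7_gender = {
    "name": "hl7_gender",
    "description": "Gender data normalized to the HL7 standard",
    "type": "string",
    "validations": ['enum("male", "female", "other", "unknown")'],
}
```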
Mappings
A mapping connects a specific column in a dataset to an attribute. Each mapping includes:

| Property | Description |
|---|---|
| Source column | The column in the provider’s dataset |
| Target attribute | The Rosetta Stone attribute to map to |
| Transformation | An optional expression to convert the data |
| Dataset | The specific dataset this mapping applies to |
For example, a provider might store gender as "M" or "F" in a column called sex. The mapping would pair that source column with the hl7_gender target attribute and a transformation that expands the single-letter codes.
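A sketch of such a mapping; the field names, dataset name, and Python-lambda transformation are stand-ins for the real mapping format and expression language:

```python
# Illustrative mapping from a provider's "sex" column to hl7_gender.
# The lambda is a Python stand-in for the actual transformation expression.
mapping = {
    "dataset": "provider_events",   # hypothetical dataset name
    "source_column": "sex",
    "target_attribute": "hl7_gender",
    "transformation": lambda v: {"M": "male", "F": "female"}.get(v, "unknown"),
}

mapping["transformation"]("M")   # 'male'
mapping["transformation"]("F")   # 'female'
mapping["transformation"]("X")   # 'unknown'
```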
The normalization pipeline
Rosetta Stone normalizes data through a three-stage pipeline:

Stage 1: Schema inference
When data is uploaded to Narrative, the system analyzes it to understand its structure:
- Column detection: Identifies column names and data types
- Pattern recognition: Detects common patterns (dates, identifiers, categorical data)
- Attribute suggestion: Uses machine learning to suggest which attributes each column maps to
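The real system uses machine learning for suggestion; as a much simpler illustration of the idea, a heuristic pass over sample values might look like this (the rules and function name are toy assumptions):

```python
import re

# Toy schema inference: guess a type and a candidate attribute from samples.
# These heuristics are illustrative, not Narrative's actual model.
def infer_column(name, samples):
    # Date-like values suggest a timestamp attribute.
    if all(re.fullmatch(r"\d{4}-\d{2}-\d{2}.*", s) for s in samples):
        return {"type": "timestamptz", "suggested_attribute": "event_timestamp"}
    # A small categorical vocabulary suggests a gender attribute.
    if set(samples) <= {"M", "F", "male", "female"}:
        return {"type": "string", "suggested_attribute": "hl7_gender"}
    return {"type": "string", "suggested_attribute": None}

infer_column("dt", ["2024-01-15", "2024-02-01"])
# -> {'type': 'timestamptz', 'suggested_attribute': 'event_timestamp'}
```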
Stage 2: Mapping creation
Mappings are created through a combination of machine learning and human curation:
- Auto-generated mappings: The system proposes mappings based on schema inference
- Human review: Data owners review suggestions and refine as needed
- Transformation definition: Complex mappings include transformation expressions
Stage 3: Query-time translation
When you query the narrative.rosetta_stone table:
- Query analysis: The system identifies which attributes you’re requesting
- Dataset discovery: Finds all datasets with mappings for those attributes
- Query translation: Rewrites your query for each dataset’s native schema
- Execution: Runs the translated queries against source data
- Normalization: Applies transformations and unions results
- Return: Delivers data in the consistent, normalized format
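The six steps above can be sketched end to end. The dataset names, mapping structure, and in-memory "execution" here are all stand-ins for the real query engine:

```python
# Toy query-time translation: discover mapped datasets, run against each
# dataset's native schema, normalize, and union the results.
datasets = {
    "provider_a": {"rows": [{"gender": "male"}],
                   "mappings": {"hl7_gender": ("gender", lambda v: v)}},
    "provider_b": {"rows": [{"sex": "F"}],
                   "mappings": {"hl7_gender": ("sex", {"M": "male", "F": "female"}.get)}},
}

def query_rosetta_stone(attribute):
    results = []
    for ds in datasets.values():
        if attribute not in ds["mappings"]:
            continue                    # dataset discovery: skip unmapped datasets
        column, transform = ds["mappings"][attribute]
        for row in ds["rows"]:          # execution against the native schema
            results.append({attribute: transform(row[column])})  # normalization
    return results                      # union of normalized rows

query_rosetta_stone("hl7_gender")
# [{'hl7_gender': 'male'}, {'hl7_gender': 'female'}]
```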
Normalization examples
Date normalization
Different providers store dates in various formats:

| Provider | Column | Sample value |
|---|---|---|
| Provider A | event_date | 01/15/2024 |
| Provider B | timestamp | 2024-01-15T14:30:00Z |
| Provider C | dt | 15-Jan-2024 |
All three columns map to the event_timestamp attribute, and each mapping includes a transformation that parses its source format and outputs ISO 8601.
When you query event_timestamp, you receive consistent ISO 8601 timestamps regardless of source.
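A sketch of the three per-provider transformations, with format strings inferred from the sample values above (the real expression language may differ):

```python
from datetime import datetime, timezone

# One parse expression per provider, all emitting ISO 8601 timestamps.
# Bare dates are assumed to be UTC midnight for illustration.
transforms = {
    "provider_a": lambda v: datetime.strptime(v, "%m/%d/%Y").replace(tzinfo=timezone.utc),
    "provider_b": lambda v: datetime.strptime(v, "%Y-%m-%dT%H:%M:%S%z"),
    "provider_c": lambda v: datetime.strptime(v, "%d-%b-%Y").replace(tzinfo=timezone.utc),
}

transforms["provider_a"]("01/15/2024").isoformat()            # '2024-01-15T00:00:00+00:00'
transforms["provider_b"]("2024-01-15T14:30:00Z").isoformat()  # '2024-01-15T14:30:00+00:00'
transforms["provider_c"]("15-Jan-2024").isoformat()           # '2024-01-15T00:00:00+00:00'
```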
Gender normalization
Providers represent gender in many ways:

| Provider | Column | Values |
|---|---|---|
| Provider A | gender | "male", "female" |
| Provider B | sex | "M", "F" |
| Provider C | gender_code | 1, 2, 0 |
| Provider D | gndr | "m", "f", "nb" |
All of these values map into the hl7_gender enum.
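A sketch of the per-provider lookups; the enum members, the meaning of Provider C's numeric codes (following ISO/IEC 5218: 0 unknown, 1 male, 2 female), and the "nb" → "other" choice are assumptions for illustration:

```python
# One lookup per provider, all normalizing to the assumed hl7_gender enum.
GENDER_MAPS = {
    "provider_a": {"male": "male", "female": "female"},
    "provider_b": {"M": "male", "F": "female"},
    "provider_c": {1: "male", 2: "female", 0: "unknown"},  # assumed code meanings
    "provider_d": {"m": "male", "f": "female", "nb": "other"},
}

def normalize_gender(provider, value):
    # Fall back to "unknown" when a value has no mapping.
    return GENDER_MAPS[provider].get(value, "unknown")

normalize_gender("provider_c", 1)     # 'male'
normalize_gender("provider_d", "nb")  # 'other'
```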
Validation and quality
Mappings aren’t just translations; they’re quality gates. Each mapping can enforce validations:
- Type checking: Ensures values can be cast to the target type
- Enum validation: Confirms values match allowed enum members
- Range checking: Verifies numeric values fall within acceptable bounds
- Pattern matching: Validates strings match expected formats (e.g., email patterns)
When data fails validation, the system can:
- Reject the record
- Map to a default value (like unknown for invalid gender)
- Flag the record for review
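A sketch of how a mapping's validation gate might apply those three failure behaviors; the function name, policy flag, and enum members are illustrative assumptions:

```python
# Toy validation gate: check a value against an enum, then apply a policy.
ALLOWED = {"male", "female", "other", "unknown"}  # assumed hl7_gender members

def apply_validation(value, policy="default"):
    """Return (output_value, status) for one record."""
    if value in ALLOWED:
        return value, None
    if policy == "reject":
        return None, "rejected"           # drop the record
    if policy == "flag":
        return value, "needs_review"      # keep it, but flag for review
    return "unknown", "defaulted"         # map to a default value

apply_validation("female")            # ('female', None)
apply_validation("banana")            # ('unknown', 'defaulted')
apply_validation("banana", "flag")    # ('banana', 'needs_review')
```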

