Joining Datasets - Narrative I/O Knowledge Base

This guide covers how to combine data from multiple datasets using JOIN operations. JOINs are essential for enriching your data with information from other sources.

Prerequisites

A Narrative I/O account with access to multiple datasets
Understanding of filtering and transforming data

What you’ll learn

How to use INNER JOIN to match records across datasets
How to use LEFT JOIN for optional matches
How to chain multiple JOINs
How to use anti-joins to find unmatched records
How to join with Rosetta Stone for identity resolution
Performance considerations for large joins

INNER JOIN

An INNER JOIN returns only rows that have matching values in both datasets.

Basic syntax

SELECT
  t1.column1,
  t2.column2
FROM company_data."123" t1
INNER JOIN company_data."456" t2
  ON t1.id = t2.id

Example: Enriching customer data

Combine customer information with their orders:

SELECT
  customers.customer_id,
  customers.email,
  orders.order_id,
  orders.order_date,
  orders.total_amount
FROM company_data."100" customers
INNER JOIN company_data."200" orders
  ON customers.customer_id = orders.customer_id
WHERE
  orders.order_date >= CURRENT_DATE - INTERVAL '30' DAY

Only customers who have orders (and orders with valid customers) appear in the results. Customers without orders are excluded.

Using table aliases

Aliases make queries more readable and are required when referencing columns:

SELECT
  c.customer_id,
  c.name,
  o.order_total
FROM company_data."100" AS c
INNER JOIN company_data."200" AS o
  ON c.customer_id = o.customer_id

LEFT JOIN

A LEFT JOIN returns all rows from the left table, plus matching rows from the right table. Unmatched rows have NULL values for right-table columns.

Basic syntax

SELECT
  t1.column1,
  t2.column2
FROM company_data."123" t1
LEFT JOIN company_data."456" t2
  ON t1.id = t2.id

Example: Find customers with or without orders

SELECT
  c.customer_id,
  c.email,
  COALESCE(o.order_count, 0) AS order_count
FROM company_data."100" c
LEFT JOIN (
  SELECT customer_id, COUNT(1) AS order_count
  FROM company_data."200"
  GROUP BY customer_id
) o ON c.customer_id = o.customer_id

This returns all customers. Those without orders show order_count = 0.

Filtering LEFT JOIN results

To find records without matches, filter for NULL:

SELECT
  c.customer_id,
  c.email
FROM company_data."100" c
LEFT JOIN company_data."200" o
  ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL  -- No matching orders

Multiple JOINs

Chain JOINs to combine data from three or more datasets:

SELECT
  c.customer_id,
  c.name,
  o.order_id,
  p.product_name
FROM company_data."100" c
INNER JOIN company_data."200" o
  ON c.customer_id = o.customer_id
INNER JOIN company_data."300" p
  ON o.product_id = p.product_id
WHERE
  o.order_date >= '2024-01-01'

Mixing JOIN types

You can mix INNER and LEFT JOINs:

SELECT
  c.customer_id,
  c.name,
  o.order_id,
  r.rating
FROM company_data."100" c
INNER JOIN company_data."200" o
  ON c.customer_id = o.customer_id
LEFT JOIN company_data."300" r
  ON o.order_id = r.order_id

This returns all customers with orders, optionally including ratings if they exist.

Anti-joins with NOT EXISTS

Find records in one dataset that don’t have matches in another:

Basic anti-join

SELECT
  c.customer_id,
  c.email
FROM company_data."100" c
WHERE NOT EXISTS (
  SELECT 1
  FROM company_data."200" o
  WHERE c.customer_id = o.customer_id
)

This returns customers who have never placed an order.

Example: Find inactive users

Find users who haven’t logged in recently:

SELECT
  u.user_id,
  u.email,
  u.signup_date
FROM company_data."100" u
WHERE NOT EXISTS (
  SELECT 1
  FROM company_data."200" logins
  WHERE u.user_id = logins.user_id
    AND logins.login_date >= CURRENT_DATE - INTERVAL '90' DAY
)

Joining with Rosetta Stone

Rosetta Stone enables identity resolution across datasets using normalized identifiers.

Access Rosetta Stone identifiers

SELECT
  d.user_attribute,
  d._rosetta_stone.unique_id
FROM company_data."123" d

Join datasets via Rosetta Stone

Match records across datasets using resolved identities:

SELECT
  d1.attribute1,
  d2.attribute2
FROM company_data."123" d1
INNER JOIN company_data."456" d2
  ON d1._rosetta_stone.unique_id = d2._rosetta_stone.unique_id

Query Rosetta Stone directly

SELECT
  unique_id,
  event_timestamp
FROM narrative.rosetta_stone
WHERE
  unique_id.b."type" = 'email'
  AND event_timestamp >= CURRENT_DATE - INTERVAL '30' DAY

Rosetta Stone queries can be scoped to different levels: global (narrative.rosetta_stone), company-specific (company_data._rosetta_stone), or dataset-specific (company_data."123"._rosetta_stone). See Rosetta Stone table references for complete syntax.

Join conditions

Multiple conditions

Join on multiple columns:

SELECT
  t1.value,
  t2.related_value
FROM company_data."123" t1
INNER JOIN company_data."456" t2
  ON t1.key1 = t2.key1
  AND t1.key2 = t2.key2

Complex conditions

Use expressions in join conditions:

SELECT
  events.event_id,
  sessions.session_id
FROM company_data."123" events
INNER JOIN company_data."456" sessions
  ON events.user_id = sessions.user_id
  AND events.event_timestamp BETWEEN sessions.start_time AND sessions.end_time

NULL-safe joins

Use IS NOT DISTINCT FROM for NULL-safe equality:

SELECT
  t1.id,
  t1.nullable_key,
  t2.value,
  t2.nullable_key AS t2_nullable_key
FROM company_data."123" t1
INNER JOIN company_data."456" t2
  ON t1.nullable_key IS NOT DISTINCT FROM t2.nullable_key

Self-joins

Join a dataset to itself for comparisons:

SELECT
  e1.employee_id,
  e1.name AS employee_name,
  e2.name AS manager_name
FROM company_data."123" e1
LEFT JOIN company_data."123" e2
  ON e1.manager_id = e2.employee_id

Subqueries in JOINs

Join with aggregated or transformed data:

SELECT
  c.customer_id,
  c.name,
  stats.total_spent,
  stats.order_count
FROM company_data."100" c
INNER JOIN (
  SELECT
    customer_id,
    SUM(total_amount) AS total_spent,
    COUNT(1) AS order_count
  FROM company_data."200"
  GROUP BY customer_id
) stats ON c.customer_id = stats.customer_id
WHERE stats.total_spent > 1000

Performance considerations

Use fully qualified column names

Always qualify column names in joins to avoid ambiguity and improve query parsing:

-- Good: Fully qualified
SELECT
  company_data."123".column1,
  company_data."456".column2
FROM company_data."123"
INNER JOIN company_data."456"
  ON company_data."123".id = company_data."456".id

-- Also good: With aliases
SELECT
  t1.column1,
  t2.column2
FROM company_data."123" t1
INNER JOIN company_data."456" t2
  ON t1.id = t2.id

Filter before joining

Apply filters as early as possible to reduce the data being joined:

-- Better: Filter in subqueries
SELECT
  c.customer_id,
  o.order_id
FROM (
  SELECT customer_id, email, status FROM company_data."100"
  WHERE status = 'active'
) c
INNER JOIN (
  SELECT order_id, customer_id, order_date FROM company_data."200"
  WHERE order_date >= '2024-01-01'
) o ON c.customer_id = o.customer_id

-- Or use CTEs for clarity
WITH active_customers AS (
  SELECT customer_id, email, status FROM company_data."100"
  WHERE status = 'active'
),
recent_orders AS (
  SELECT order_id, customer_id, order_date FROM company_data."200"
  WHERE order_date >= '2024-01-01'
)
SELECT
  c.customer_id,
  o.order_id
FROM active_customers c
INNER JOIN recent_orders o
  ON c.customer_id = o.customer_id

Avoid OR in join conditions

OR conditions in joins can cause performance issues:

-- Avoid
ON t1.id = t2.id OR t1.alt_id = t2.alt_id

-- Better: Use COALESCE or separate queries
ON COALESCE(t1.id, t1.alt_id) = COALESCE(t2.id, t2.alt_id)

See Query Optimization for more performance tips.

Include budget controls

Always include budget controls when joining datasets in materialized views:

CREATE MATERIALIZED VIEW "enriched_customers"
AS (
  SELECT
    c.customer_id,
    c.email,
    o.total_orders
  FROM company_data."100" c
  INNER JOIN (
    SELECT customer_id, COUNT(1) AS total_orders
    FROM company_data."200"
    GROUP BY customer_id
  ) o ON c.customer_id = o.customer_id
)
BUDGET 50 USD

Common patterns

Enrich with latest record

Get the most recent related record for each row:

SELECT
  c.customer_id,
  c.name,
  latest.last_order_date,
  latest.last_order_amount
FROM company_data."100" c
INNER JOIN (
  SELECT
    customer_id,
    order_date AS last_order_date,
    total_amount AS last_order_amount
  FROM company_data."200"
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY order_date DESC
  ) = 1
) latest ON c.customer_id = latest.customer_id

Aggregate before joining

Pre-aggregate to reduce join complexity:

SELECT
  c.customer_id,
  c.segment,
  metrics.total_value,
  metrics.avg_order
FROM company_data."100" c
INNER JOIN (
  SELECT
    customer_id,
    SUM(amount) AS total_value,
    AVG(amount) AS avg_order
  FROM company_data."200"
  GROUP BY customer_id
) metrics ON c.customer_id = metrics.customer_id

Join Performance

Understanding performance implications

Query Optimization

Tips for faster queries

Complex Joins Cookbook

Advanced join patterns and examples

Rosetta Stone

Identity resolution for cross-dataset matching

Overview

Platform

Data Ingestion

Querying with NQL

Data Collaboration

Data Activation

Connectors

Compliance

Rosetta Stone

Account Settings

Workflows

Webhooks

Tools

SDKs

​Prerequisites

​What you’ll learn

​INNER JOIN

​Basic syntax

​Example: Enriching customer data

​Using table aliases

​LEFT JOIN

​Basic syntax

​Example: Find customers with or without orders

​Filtering LEFT JOIN results

​Multiple JOINs

​Mixing JOIN types

​Anti-joins with NOT EXISTS

​Basic anti-join

​Example: Find inactive users

​Joining with Rosetta Stone

​Access Rosetta Stone identifiers

​Join datasets via Rosetta Stone

​Query Rosetta Stone directly

​Join conditions

​Multiple conditions

​Complex conditions

​NULL-safe joins

​Self-joins

​Subqueries in JOINs

​Performance considerations

​Use fully qualified column names

​Filter before joining

​Avoid OR in join conditions

​Include budget controls

​Common patterns

​Enrich with latest record

​Aggregate before joining

​Related content

Join Performance

Query Optimization

Complex Joins Cookbook

Rosetta Stone

Prerequisites

What you’ll learn

INNER JOIN

Basic syntax

Example: Enriching customer data

Using table aliases

LEFT JOIN

Basic syntax

Example: Find customers with or without orders

Filtering LEFT JOIN results

Multiple JOINs

Mixing JOIN types

Anti-joins with NOT EXISTS

Basic anti-join

Example: Find inactive users

Joining with Rosetta Stone

Access Rosetta Stone identifiers

Join datasets via Rosetta Stone

Query Rosetta Stone directly

Join conditions

Multiple conditions

Complex conditions

NULL-safe joins

Self-joins

Subqueries in JOINs

Performance considerations

Use fully qualified column names

Filter before joining

Avoid OR in join conditions

Include budget controls

Common patterns

Enrich with latest record

Aggregate before joining

Related content