This guide covers how to combine data from multiple datasets using JOIN operations. JOINs are essential for enriching your data with information from other sources.
Prerequisites
What you’ll learn
- How to use INNER JOIN to match records across datasets
- How to use LEFT JOIN for optional matches
- How to chain multiple JOINs
- How to use anti-joins to find unmatched records
- How to join with Rosetta Stone for identity resolution
- Performance considerations for large joins
INNER JOIN
An INNER JOIN returns only rows that have matching values in both datasets.
Basic syntax
SELECT
t1.column1,
t2.column2
FROM company_data."123" t1
INNER JOIN company_data."456" t2
ON t1.id = t2.id
Example: Enriching customer data
Combine customer information with their orders:
SELECT
customers.customer_id,
customers.email,
orders.order_id,
orders.order_date,
orders.total_amount
FROM company_data."100" customers
INNER JOIN company_data."200" orders
ON customers.customer_id = orders.customer_id
WHERE
orders.order_date >= CURRENT_DATE - INTERVAL '30' DAY
Only customers who have orders (and orders with valid customers) appear in the results. Customers without orders are excluded.
Using table aliases
Aliases make queries more readable and are required when referencing columns:
SELECT
c.customer_id,
c.name,
o.order_total
FROM company_data."100" AS c
INNER JOIN company_data."200" AS o
ON c.customer_id = o.customer_id
LEFT JOIN
A LEFT JOIN returns all rows from the left table, plus matching rows from the right table. Unmatched rows have NULL values for right-table columns.
Basic syntax
SELECT
t1.column1,
t2.column2
FROM company_data."123" t1
LEFT JOIN company_data."456" t2
ON t1.id = t2.id
Example: Find customers with or without orders
SELECT
c.customer_id,
c.email,
COALESCE(o.order_count, 0) AS order_count
FROM company_data."100" c
LEFT JOIN (
SELECT customer_id, COUNT(*) AS order_count
FROM company_data."200"
GROUP BY customer_id
) o ON c.customer_id = o.customer_id
This returns all customers. Those without orders show order_count = 0.
Filtering LEFT JOIN results
To find records without matches, filter for NULL:
SELECT
c.customer_id,
c.email
FROM company_data."100" c
LEFT JOIN company_data."200" o
ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL -- No matching orders
Multiple JOINs
Chain JOINs to combine data from three or more datasets:
SELECT
c.customer_id,
c.name,
o.order_id,
p.product_name
FROM company_data."100" c
INNER JOIN company_data."200" o
ON c.customer_id = o.customer_id
INNER JOIN company_data."300" p
ON o.product_id = p.product_id
WHERE
o.order_date >= '2024-01-01'
Mixing JOIN types
You can mix INNER and LEFT JOINs:
SELECT
c.customer_id,
c.name,
o.order_id,
r.rating
FROM company_data."100" c
INNER JOIN company_data."200" o
ON c.customer_id = o.customer_id
LEFT JOIN company_data."300" r
ON o.order_id = r.order_id
This returns all customers with orders, optionally including ratings if they exist.
Anti-joins with NOT EXISTS
Find records in one dataset that don’t have matches in another:
Basic anti-join
SELECT
c.customer_id,
c.email
FROM company_data."100" c
WHERE NOT EXISTS (
SELECT 1
FROM company_data."200" o
WHERE c.customer_id = o.customer_id
)
This returns customers who have never placed an order.
Example: Find inactive users
Find users who haven’t logged in recently:
SELECT
u.user_id,
u.email,
u.signup_date
FROM company_data."100" u
WHERE NOT EXISTS (
SELECT 1
FROM company_data."200" logins
WHERE u.user_id = logins.user_id
AND logins.login_date >= CURRENT_DATE - INTERVAL '90' DAY
)
Joining with Rosetta Stone
Rosetta Stone enables identity resolution across datasets using normalized identifiers.
Access Rosetta Stone identifiers
SELECT
d.user_attribute,
d._rosetta_stone.unique_id
FROM company_data."123" d
Join datasets via Rosetta Stone
Match records across datasets using resolved identities:
SELECT
d1.attribute1,
d2.attribute2
FROM company_data."123" d1
INNER JOIN company_data."456" d2
ON d1._rosetta_stone.unique_id = d2._rosetta_stone.unique_id
Query Rosetta Stone directly
SELECT
unique_id,
event_timestamp
FROM narrative.rosetta_stone
WHERE
unique_id.b."type" = 'email'
AND event_timestamp >= CURRENT_DATE - INTERVAL '30' DAY
Join conditions
Multiple conditions
Join on multiple columns:
SELECT
t1.value,
t2.related_value
FROM company_data."123" t1
INNER JOIN company_data."456" t2
ON t1.key1 = t2.key1
AND t1.key2 = t2.key2
Complex conditions
Use expressions in join conditions:
SELECT
events.event_id,
sessions.session_id
FROM company_data."123" events
INNER JOIN company_data."456" sessions
ON events.user_id = sessions.user_id
AND events.event_timestamp BETWEEN sessions.start_time AND sessions.end_time
NULL-safe joins
Use IS NOT DISTINCT FROM for NULL-safe equality:
SELECT t1.*, t2.*
FROM company_data."123" t1
INNER JOIN company_data."456" t2
ON t1.nullable_key IS NOT DISTINCT FROM t2.nullable_key
Self-joins
Join a dataset to itself for comparisons:
SELECT
e1.employee_id,
e1.name AS employee_name,
e2.name AS manager_name
FROM company_data."123" e1
LEFT JOIN company_data."123" e2
ON e1.manager_id = e2.employee_id
Subqueries in JOINs
Join with aggregated or transformed data:
SELECT
c.customer_id,
c.name,
stats.total_spent,
stats.order_count
FROM company_data."100" c
INNER JOIN (
SELECT
customer_id,
SUM(total_amount) AS total_spent,
COUNT(*) AS order_count
FROM company_data."200"
GROUP BY customer_id
) stats ON c.customer_id = stats.customer_id
WHERE stats.total_spent > 1000
Use fully qualified column names
Always qualify column names in joins to avoid ambiguity and improve query parsing:
-- Good: Fully qualified
SELECT
company_data."123".column1,
company_data."456".column2
FROM company_data."123"
INNER JOIN company_data."456"
ON company_data."123".id = company_data."456".id
-- Also good: With aliases
SELECT
t1.column1,
t2.column2
FROM company_data."123" t1
INNER JOIN company_data."456" t2
ON t1.id = t2.id
Filter before joining
Apply filters as early as possible to reduce the data being joined:
-- Better: Filter in subqueries
SELECT
c.customer_id,
o.order_id
FROM (
SELECT * FROM company_data."100"
WHERE status = 'active'
) c
INNER JOIN (
SELECT * FROM company_data."200"
WHERE order_date >= '2024-01-01'
) o ON c.customer_id = o.customer_id
-- Or use CTEs for clarity
WITH active_customers AS (
SELECT * FROM company_data."100"
WHERE status = 'active'
),
recent_orders AS (
SELECT * FROM company_data."200"
WHERE order_date >= '2024-01-01'
)
SELECT
c.customer_id,
o.order_id
FROM active_customers c
INNER JOIN recent_orders o
ON c.customer_id = o.customer_id
Avoid OR in join conditions
OR conditions in joins can cause performance issues:
-- Avoid
ON t1.id = t2.id OR t1.alt_id = t2.alt_id
-- Better: Use COALESCE or separate queries
ON COALESCE(t1.id, t1.alt_id) = COALESCE(t2.id, t2.alt_id)
See Query Optimization for more performance tips.
Include budget controls
Always include budget controls when joining datasets in materialized views:
CREATE MATERIALIZED VIEW "enriched_customers"
AS (
SELECT
c.customer_id,
c.email,
o.total_orders
FROM company_data."100" c
INNER JOIN (
SELECT customer_id, COUNT(*) AS total_orders
FROM company_data."200"
GROUP BY customer_id
) o ON c.customer_id = o.customer_id
)
BUDGET 50 USD
Common patterns
Enrich with latest record
Get the most recent related record for each row:
SELECT
c.customer_id,
c.name,
latest.last_order_date,
latest.last_order_amount
FROM company_data."100" c
INNER JOIN (
SELECT
customer_id,
order_date AS last_order_date,
total_amount AS last_order_amount
FROM company_data."200"
QUALIFY ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date DESC
) = 1
) latest ON c.customer_id = latest.customer_id
Aggregate before joining
Pre-aggregate to reduce join complexity:
SELECT
c.customer_id,
c.segment,
metrics.total_value,
metrics.avg_order
FROM company_data."100" c
INNER JOIN (
SELECT
customer_id,
SUM(amount) AS total_value,
AVG(amount) AS avg_order
FROM company_data."200"
GROUP BY customer_id
) metrics ON c.customer_id = metrics.customer_id
Related content