Building an identity graph

This guide walks through building an identity graph from a CRM dataset using Graph Studio. By the end, you will have resolved duplicate customer records into unified identities based on shared emails and phone numbers.

Graph Studio runs on both Snowflake and AWS data planes. For a conceptual overview, see Graph Studio.

Prerequisites

A dataset with Rosetta Stone attribute mappings for a unique identifier and at least one identity attribute (e.g., normalized email, phone number)
The dataset must be eligible as a Graph Studio source — the Edge Builder only lists datasets tagged for Graph Studio, and the Graph Builder only lists edges datasets. See Source eligibility if a source is missing from a picker.
A Snowflake or AWS data plane

Example dataset

This guide uses a CRM dataset called OFFICE_CRM:

CUSTOMER_ID	FIRST_NAME	LAST_NAME	EMAIL	PHONE
CRM-001	Michael	Scott	[email protected]	(570) 555-1234
CRM-002	Dwight	Schrute	[email protected]	15705552345
CRM-003	Jim	Halpert	[email protected]	5705553456
CRM-004	Michael	Scott	[email protected]	(570) 555-9876
CRM-005	Pam	Beesly	[email protected]	(570) 555-4567
CRM-006	Michael	Scott	[email protected]	(570) 555-9876

Michael Scott appears three times with different email and phone combinations. The goal is to resolve all three records into a single identity. The dataset is mapped to Rosetta Stone attributes for Unique Identifier (using CUSTOMER_ID), Normalized Email, Clear Text E.164 Phone Number, and Person Name.
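The PHONE column above mixes several formats, while the mapped attribute expects E.164. This guide does not describe how the normalization is performed, so the following is only a minimal sketch of the idea, assuming every number is a US number (country code 1). The function name `to_e164_us` is hypothetical and not part of any platform API.

```python
import re


def to_e164_us(raw: str) -> str:
    """Normalize a US phone number string to E.164 form (+1XXXXXXXXXX).

    Assumption: all inputs are US numbers; other countries are out of scope.
    """
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code
    if len(digits) != 10:
        raise ValueError(f"cannot normalize {raw!r} as a US number")
    return "+1" + digits


# The three formats that appear in the example table.
phones = ["(570) 555-1234", "15705552345", "5705553456"]
normalized = [to_e164_us(p) for p in phones]
print(normalized)
```

Once every value is in the same form, two records holding the same number compare equal regardless of how the number was typed into the CRM.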

Step 1: Build edges

Edges define how records connect to each other through shared identifiers. Navigate to My Data > Graph Studio and select the Edge builder tab.

Add a source dataset

Click Select Sources and choose your dataset. Set a source ID type (a label like OFFICE_CRM that identifies this system) and choose the source ID field (CUSTOMER_ID).

Choose target IDs

Target IDs are the Rosetta Stone attributes that serve as connection points. When two records share the same target ID value, the graph connects them.

Target IDs are grouped, and each group acts as a single connection type. For this example, add two target ID groups:

Normalized Email — connects records that share the same email address
Clear Text E.164 Phone Number + Person Name > first_name — connects records that share both the same phone number and first name. Combining these into one group means both values must match for a connection, which is more precise than matching on phone alone.
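The two groups above can be pictured as producing one edge per record per group, where a multi-attribute group contributes a single combined value. The sketch below is illustrative only: the field names, the `build_edges` helper, and the edge layout are assumptions rather than the actual schema of the edges dataset, and the email values are opaque placeholders because the real addresses are not shown in this guide.

```python
# Placeholder records: emails are opaque tokens, phones are already E.164.
records = [
    {"id": "CRM-001", "first_name": "Michael", "email": "email-a", "phone": "+15705551234"},
    {"id": "CRM-002", "first_name": "Dwight", "email": "email-b", "phone": "+15705552345"},
    {"id": "CRM-003", "first_name": "Jim", "email": "email-c", "phone": "+15705553456"},
    {"id": "CRM-004", "first_name": "Michael", "email": "email-d", "phone": "+15705559876"},
    {"id": "CRM-005", "first_name": "Pam", "email": "email-e", "phone": "+15705554567"},
    {"id": "CRM-006", "first_name": "Michael", "email": "email-a", "phone": "+15705559876"},
]

# Hypothetical group definitions: each group lists the fields that must ALL match.
TARGET_ID_GROUPS = {
    "email": ("email",),
    "phone_first_name": ("phone", "first_name"),
}


def build_edges(records, groups):
    """Return (source_id, group_name, values) tuples, one per record per group."""
    edges = []
    for rec in records:
        for group_name, fields in groups.items():
            values = tuple(rec.get(f) for f in fields)
            # Skip the group if any component is missing; a partial match
            # must not create a connection.
            if any(v is None or v == "" for v in values):
                continue
            edges.append((rec["id"], group_name, values))
    return edges


edges = build_edges(records, TARGET_ID_GROUPS)
for edge in edges:
    print(edge)
```

Because the phone and first name are stored as one tuple, two records connect through that group only when both values are equal, which is the precision gain described above.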

Finalize and build

Name the edge dataset (e.g., office_crm_edges) and click Build Edges.

Step 2: Build the graph

Switch to the Graph builder tab. This takes your edges and runs a connected components algorithm to discover which records belong to the same person.
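To make the connected components idea concrete, here is a small single-machine sketch using union-find. This is a stand-in technique chosen for brevity; the guide does not say how Graph Studio implements the algorithm, and the iteration-related parameters suggest it works differently at scale. The edge tuples and placeholder email values are assumptions made for illustration.

```python
from collections import defaultdict


def connected_components(edges):
    """Group source IDs that are linked, directly or transitively, via shared targets."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Sources and targets are tagged so their values can never collide.
    for source, target in edges:
        union(("source", source), ("target", target))

    groups = defaultdict(set)
    for node in list(parent):
        kind, value = node
        if kind == "source":
            groups[find(node)].add(value)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical edges: (source ID, target ID value).
edges = [
    ("CRM-001", ("email", "email-a")),
    ("CRM-001", ("phone_first_name", "+15705551234", "Michael")),
    ("CRM-004", ("email", "email-d")),
    ("CRM-004", ("phone_first_name", "+15705559876", "Michael")),
    ("CRM-006", ("email", "email-a")),
    ("CRM-006", ("phone_first_name", "+15705559876", "Michael")),
    ("CRM-002", ("email", "email-b")),
    ("CRM-002", ("phone_first_name", "+15705552345", "Dwight")),
]

identities = connected_components(edges)
print(identities)
```

CRM-001 and CRM-004 share no target directly, but both connect to CRM-006, so all three land in the same component.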

Select input sources

Click Select Sources and choose the edges dataset you just created.

Review algorithm parameters

The defaults work well for most use cases. If needed, you can later adjust max component size (caps how many records can merge into one identity), max iterations, and max degree threshold (excludes overly connected nodes such as shared corporate emails).

Optionally, use Exclusive attributes to pick one or more target ID types that must hold a single value per identity (for example, an SSN token or exact date of birth). After the connected-components pass converges, any component whose vertices disagree on a declared attribute is split apart, targeting known overmerge cases without dropping legitimate connections. Only first-party target ID types present on your input edges are selectable; leaving the field empty preserves the default behavior.
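The effect of two of these settings can be sketched in a few lines. Everything below is an assumption made for illustration: the helper names are hypothetical, the degree threshold is treated as inclusive, and the second function only detects components that disagree on an exclusive attribute. How the product actually splits such a component is not described in this guide.

```python
from collections import Counter


def drop_high_degree_targets(edges, max_degree):
    """Remove edges whose target value is attached to more than max_degree records."""
    degree = Counter(target for _, target in set(edges))
    return [(source, target) for source, target in edges if degree[target] <= max_degree]


def components_with_conflicts(components, attribute_values):
    """Return the components holding more than one distinct exclusive-attribute value."""
    conflicted = []
    for component in components:
        values = {attribute_values[r] for r in component if r in attribute_values}
        if len(values) > 1:
            conflicted.append(component)
    return conflicted


# Hypothetical data: a shared inbox used by four records, plus one personal email.
edges = [
    ("REC-1", "email:shared-inbox"),
    ("REC-2", "email:shared-inbox"),
    ("REC-3", "email:shared-inbox"),
    ("REC-4", "email:shared-inbox"),
    ("REC-1", "email:personal-1"),
    ("REC-5", "email:personal-1"),
]
kept = drop_high_degree_targets(edges, max_degree=3)
print(kept)

# Hypothetical exclusive attribute: a date-of-birth token per record.
components = [["REC-1", "REC-5"], ["REC-2"]]
dob_token = {"REC-1": "dob-x", "REC-5": "dob-y", "REC-2": "dob-z"}
conflicts = components_with_conflicts(components, dob_token)
print(conflicts)
```

The shared inbox is dropped because it touches four records, so it can no longer merge unrelated people, while the component containing REC-1 and REC-5 is flagged because its members carry different date-of-birth tokens.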

Finalize and build

Name the graph (e.g., office_crm_graph), choose a refresh schedule, and click Build Graph +.

Results

The graph resolves the six CRM records into four identities:

Identity	Records	How they connected
Identity 1	CRM-001, CRM-004, CRM-006	All three Michael Scott records — CRM-001 and CRM-006 share the same email; CRM-004 and CRM-006 share the same phone + first name
Identity 2	CRM-002	Dwight Schrute
Identity 3	CRM-003	Jim Halpert
Identity 4	CRM-005	Pam Beesly

Michael Scott’s three records are resolved into a single identity even though no two records share both the same email and phone — the graph follows transitive connections across shared values to link them together. Dwight, Jim, and Pam remain as separate identities because they have no overlapping identifiers with other records in this dataset.

Next steps

Add third-party data — Include an access rule as an additional source in the Edge Builder to introduce connections your first-party data cannot see on its own. See Graph Enrichment.
Set up recurring builds — Use a refresh schedule to keep your graph current as new data arrives.
