Data collaboration often involves matching records across different organizations—connecting a customer in your database to their activity in a partner’s dataset. But sharing raw email addresses or phone numbers creates privacy and compliance risks. Pseudonymization solves this problem by replacing identifiable information with hashed values that enable matching without exposing the underlying data.

Why pseudonymization matters

When organizations collaborate on data, they need a way to identify common records without sharing sensitive information. Consider two companies that want to find overlapping customers:
  • Company A has email addresses for their customers
  • Company B has email addresses for their customers
  • Both want to find the intersection without revealing their full customer lists
By hashing email addresses before sharing, both companies can find matches (identical hashes indicate identical emails) without either party seeing the other’s raw data.
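
The overlap scenario above can be sketched in a few lines of Python. This is a minimal illustration, not Narrative's implementation: the `hash_email` helper, the sample addresses, and the choice to trim and lowercase before hashing with SHA-256 are all assumptions made for the example.

```python
import hashlib


def hash_email(email: str) -> str:
    """Return the SHA-256 hex digest of an email (hypothetical helper).

    Assumes normalization means trimming surrounding whitespace and lowercasing.
    """
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Made-up customer lists; each company hashes its own list locally.
company_a_emails = ["alice@example.com", "bob@example.com", "carol@example.com"]
company_b_emails = ["Bob@Example.com ", "dave@example.com", "alice@example.com"]

hashes_a = {hash_email(e) for e in company_a_emails}
hashes_b = {hash_email(e) for e in company_b_emails}

# Only the hashes are compared; identical hashes indicate identical emails.
overlap = hashes_a & hashes_b
print(len(overlap))  # 2 (alice and bob appear in both lists)
```

Neither side needs the other's raw list to compute the intersection: each party only ever handles the other's hashes.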

Regulatory compliance

Privacy regulations like GDPR and CCPA recognize pseudonymization as a data protection technique. While pseudonymized data is still considered personal data under these regulations, it receives favorable treatment because:
  • The data cannot be attributed to a specific individual without additional information
  • It reduces risk if the data is exposed or breached
  • It demonstrates a commitment to data minimization principles

Reduced breach impact

If hashed data is exposed, attackers cannot directly use it to contact or identify individuals. Hashing is not encryption, and a determined attacker could still recover common values by hashing likely guesses and comparing the results, but it significantly raises the barrier compared to raw PII.

How hashing works

A hash function is a one-way mathematical transformation that converts input data into a fixed-length string of characters. The key properties that make hashing useful for pseudonymization:

Deterministic

The same input always produces the same output. This is essential for matching—if two organizations hash the same email address, they get identical hashes.
[email protected] → 06a240d11cc201676da976f7b49341181fd180da37cbe40a77432c0a366c80c3
[email protected] → 06a240d11cc201676da976f7b49341181fd180da37cbe40a77432c0a366c80c3  (same hash)

One-way (non-reversible)

You cannot mathematically reverse a hash to recover the original input. There is no “unhash” function.
06a240d11cc201676da976f7b49341181fd180da37cbe40a77432c0a366c80c3 → ???  (cannot reverse)

Collision-resistant

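For a secure hash function such as SHA-256, it is computationally infeasible to find two different inputs that produce the same hash, so matching hashes can be treated as matching inputs. Deliberately constructed collisions have been demonstrated for MD5 and SHA-1, but accidental collisions between real identifiers remain vanishingly unlikely.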
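
To give a rough sense of how unlikely an accidental collision is, the sketch below computes a simple upper bound (the number of input pairs divided by the number of possible hash values). It assumes hash outputs behave like uniformly random values and says nothing about deliberate attacks; the function name and the dataset size are hypothetical.

```python
from fractions import Fraction


def collision_probability_bound(num_inputs: int, hash_bits: int) -> Fraction:
    """Upper bound on the chance that any two of num_inputs distinct inputs
    share a hash, assuming outputs are uniformly distributed."""
    pairs = num_inputs * (num_inputs - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


n = 10_000_000_000  # hypothetical: ten billion distinct identifiers
for name, bits in [("MD5", 128), ("SHA-1", 160), ("SHA-256", 256)]:
    print(name, float(collision_probability_bound(n, bits)))
```

Even for the shortest of the three digests, the bound stays far below one in a billion at this scale.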

Sensitive to input changes

Even tiny changes to the input produce completely different hashes. This is why pre-formatting (lowercasing, whitespace removal) is critical: two versions of the same address that differ only in capitalization or stray whitespace produce entirely different hashes.
[email protected]  → 06a240d11cc201676da976f7b49341181fd180da37cbe40a77432c0a366c80c3
[email protected]  → 8f5a8e5e9a8f5e8a9e5f8a5e9a8f5e8a9e5f8a5e9a8f5e8a9e5f8a5e9a8f5e8a  (completely different)
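
The effect of pre-formatting can be checked directly. The following is a sketch under stated assumptions: `normalize_email` is a hypothetical helper that only trims surrounding whitespace and lowercases, and the addresses are made up for the example.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 hex digest of a string encoded as UTF-8."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def normalize_email(email: str) -> str:
    # Assumed pre-formatting: trim surrounding whitespace, then lowercase.
    return email.strip().lower()


# Three spellings of the same (made-up) address.
raw_variants = ["jane@example.com", "Jane@Example.com", " jane@example.com "]

raw_hashes = {sha256_hex(v) for v in raw_variants}
normalized_hashes = {sha256_hex(normalize_email(v)) for v in raw_variants}

print(len(raw_hashes))         # 3: every spelling hashes differently
print(len(normalized_hashes))  # 1: all spellings collapse to one hash
```

Both parties must apply the same normalization rules before hashing, otherwise records for the same person will silently fail to match.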

Supported hash algorithms

Narrative supports three widely used hash algorithms. Each produces a different output length and has different characteristics:
| Algorithm | Output | Status | Notes |
| --- | --- | --- | --- |
| MD5 | 128-bit (32 hex chars) | Legacy | Fast, widely supported. Cryptographically broken but acceptable for non-security use cases like identifier matching. |
| SHA-1 | 160-bit (40 hex chars) | Legacy | More secure than MD5. Also cryptographically deprecated but acceptable for identifier matching. |
| SHA-256 | 256-bit (64 hex chars) | Recommended | Part of the SHA-2 family. Currently considered secure and is the preferred choice for new implementations. |
For maximum compatibility when matching across datasets, generate all three hash formats. Different organizations may have standardized on different algorithms.
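
Generating all three formats for one identifier might look like the sketch below, which uses Python's standard `hashlib` module. The `all_hashes` helper, its trim-and-lowercase normalization, and the sample address are assumptions for illustration, not part of Narrative's tooling.

```python
import hashlib


def all_hashes(identifier: str) -> dict[str, str]:
    """Return MD5, SHA-1, and SHA-256 hex digests for one identifier.

    Hypothetical helper; assumes trim + lowercase normalization.
    """
    data = identifier.strip().lower().encode("utf-8")
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


hashes = all_hashes("jane@example.com")
for name, digest in hashes.items():
    # Digest lengths are 32, 40, and 64 hex characters respectively.
    print(name, len(digest), digest)
```

Storing the three digests side by side lets a dataset match against partners regardless of which algorithm they standardized on.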

Pseudonymization vs. anonymization

These terms are often confused but have important distinctions:
| Aspect | Pseudonymization | Anonymization |
| --- | --- | --- |
| Reversibility | Possible with additional information | Irreversible by design |
| Matching capability | Yes (same input produces same output) | No (cannot match records) |
| Regulatory status | Still personal data (GDPR) | Not personal data |
| Use case | Data collaboration, matching | Aggregate analytics, public datasets |
Hashing is a pseudonymization technique, not anonymization. The hashed value still represents a specific individual—you just cannot determine who without additional context or resources.

Why not anonymize?

True anonymization would prevent record matching entirely, defeating the purpose of data collaboration. Pseudonymization strikes a balance: it protects individual privacy while enabling the business value of finding common records across datasets.

Security considerations

While hashing provides meaningful privacy protection, it’s important to understand its limitations:

Dictionary attacks

Common email addresses (like [email protected]) could theoretically be discovered by hashing a dictionary of likely values and comparing. This is why:
  • Narrative never exposes raw hash values to unauthorized parties
  • Access controls govern who can query hashed identifiers
  • The platform’s security model provides further protections on top of hashing
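
To make the dictionary-attack risk concrete, the sketch below shows how an exposed, unsalted hash of a guessable address could be matched against a list of candidates. The addresses and variable names are hypothetical, and the example is illustrative only; it does not describe any real system.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 hex digest of a string encoded as UTF-8."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# A hash that has been exposed without its original value (made-up input).
leaked_hash = sha256_hex("info@example.com")

# The attacker's list of likely addresses (hypothetical).
candidates = ["admin@example.com", "info@example.com", "sales@example.com"]

# Precompute hash -> candidate once, then look up the exposed hash.
lookup = {sha256_hex(c): c for c in candidates}
recovered = lookup.get(leaked_hash)
print(recovered)  # info@example.com
```

The hash function is never inverted here; the attacker simply guesses inputs and compares outputs, which is why limiting who can see or query hashed identifiers matters.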

Not a substitute for access controls

Pseudonymization reduces risk but doesn’t eliminate the need for proper data governance. Always combine hashing with:
  • Role-based access controls
  • Query logging and auditing
  • Data sharing agreements