Hashing is the process of transforming input data into a fixed-length output using a cryptographic hash function. In data collaboration, hashing enables organizations to share and match data without exposing sensitive information like email addresses or phone numbers.
Why hashing matters for data collaboration
When organizations want to find overlapping customers or match records across datasets, they need a common key. But sharing raw email addresses or other personally identifiable information (PII) creates privacy and compliance risks.
Hashing solves this by:
- Converting PII into non-reversible pseudonyms
- Enabling deterministic matching (same input = same output)
- Allowing data collaboration without exposing underlying identifiers
Properties of hash functions
Hash functions have specific properties that make them useful for data protection:
Deterministic (consistent)
The same input always produces the same output, regardless of when or where the hash is computed. This enables matching—if two organizations hash the same email address, they get identical hashes.
[email protected] → 06a240d11cc201676da976f7b49341181fd180da37cbe40a77432c0a366c80c3
[email protected] → 06a240d11cc201676da976f7b49341181fd180da37cbe40a77432c0a366c80c3 (always same)
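The matching property can be sketched in a few lines of Python. This is a minimal illustration rather than any platform's actual implementation: the email addresses and the `hash_email` helper are hypothetical, and it assumes both organizations agree on SHA-256 and on lowercase-and-trim normalization.

```python
import hashlib


def hash_email(email: str) -> str:
    """Return the SHA-256 hex digest of a trimmed, lowercased email (assumed convention)."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical customer lists held by two separate organizations.
org_a_emails = ["alice@example.com", "bob@example.com", "carol@example.com"]
org_b_emails = ["Bob@Example.com ", "dave@example.com", "alice@example.com"]

# Each side hashes locally and shares only the digests, never the raw emails.
org_a_hashes = {hash_email(e) for e in org_a_emails}
org_b_hashes = {hash_email(e) for e in org_b_emails}

# Identical inputs give identical digests, so a set intersection finds the overlap.
overlap = org_a_hashes & org_b_hashes
print(len(overlap))  # 2 (alice and bob)
```

Neither side learns the other's non-matching addresses from the digests alone, which is the point of hashing before comparison.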
One-way (irreversible)
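There is no practical way to compute the original input from a hash alone; no “unhash” function exists. This protects the underlying PII, although guessable inputs can still be recovered by dictionary attacks (see Security considerations).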
06a240d11cc201676da976f7b49341181fd180da37cbe40a77432c0a366c80c3 → ??? (cannot reverse)
Collision-resistant
In practice, different inputs produce different outputs. Collisions must exist in theory because the output space is finite, but well-designed hash functions make them computationally infeasible to find.
Computationally efficient
Hashing billions of records is practical. The algorithms are designed to be fast while maintaining security properties.
Avalanche effect
Even tiny changes produce completely different outputs. This is why input normalization matters: two spellings of the same email address that differ by a single character produce entirely different hashes.
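To make the effect of a one-character change concrete, the sketch below counts how many output bits differ between the SHA-256 digests of two near-identical strings. The addresses and helper names are hypothetical and chosen only for illustration.

```python
import hashlib


def sha256_bytes(value: str) -> bytes:
    """Return the raw 32-byte SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(value.encode("utf-8")).digest()


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# The two inputs differ only in the case of the first letter.
lower = sha256_bytes("user@example.com")
upper = sha256_bytes("User@example.com")

changed = differing_bits(lower, upper)
print(f"{changed} of {len(lower) * 8} bits differ")
```

For a well-behaved hash function, roughly half of the 256 output bits are expected to flip, so the two digests share no visible resemblance.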
Common hash algorithms
Three hash algorithms are commonly used for pseudonymization in data collaboration:
| Algorithm | Output Length | Status | Example Output |
|---|---|---|---|
| MD5 | 32 hex chars | Legacy | 0c036b871e3c66c1724f68fd007c4718 |
| SHA-1 | 40 hex chars | Legacy | fafe72dfca878eb2084fb7478a44d279f7895b9b |
| SHA-256 | 64 hex chars | Recommended | ed2e59c337d01185f388a4e9334d6f2e5cb29652f222afe4b692582d2e1c3fce |
For maximum interoperability, generate all three hash formats when preparing data for collaboration. Different partners may have standardized on different algorithms.
Algorithm considerations
- MD5 and SHA-1 are cryptographically broken for security purposes but remain acceptable for identifier matching where collision attacks aren’t a concern
- SHA-256 (part of the SHA-2 family) is currently considered secure and is recommended for new implementations
- All three produce deterministic outputs suitable for matching
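Producing all three formats for one identifier could look like the following sketch, which uses Python's standard `hashlib` module. The `all_hashes` helper and the sample address are hypothetical, and the input is assumed to be normalized already.

```python
import hashlib


def all_hashes(identifier: str) -> dict[str, str]:
    """Return MD5, SHA-1, and SHA-256 hex digests for an already-normalized identifier."""
    data = identifier.encode("utf-8")
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


hashes = all_hashes("user@example.com")
for name, digest in hashes.items():
    # Digest lengths are 32, 40, and 64 hex characters respectively.
    print(f"{name:7} {len(digest)} hex chars  {digest}")
```

Storing the three digests side by side lets each partner match on whichever algorithm it has standardized on.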
Input normalization
Hash functions are sensitive to their input: any difference, however small, produces a completely different output. Before hashing, normalize your data:
Case sensitivity
Standardize to lowercase for case-insensitive identifiers like email addresses:
MD5("narrative") → a936fba0061bf85904722500de008e51
MD5("NARRATIVE") → 2402a932631e086f028721943b7350c8 (different!)
MD5("Narrative") → 0c036b871e3c66c1724f68fd007c4718 (different!)
Whitespace
Remove leading and trailing whitespace:
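A minimal sketch of combining both rules for email addresses, assuming the partners have agreed on trim-then-lowercase as the normalization convention. The helper names and sample values are hypothetical.

```python
import hashlib


def normalize_email(raw: str) -> str:
    """Strip surrounding whitespace, then lowercase (assumed convention)."""
    return raw.strip().lower()


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Three spellings of the same address, as they might appear in raw exports.
variants = ["user@example.com", "  User@Example.com", "USER@EXAMPLE.COM\n"]

raw_digests = {sha256_hex(v) for v in variants}
normalized_digests = {sha256_hex(normalize_email(v)) for v in variants}

print(len(raw_digests))         # 3: every raw variant hashes differently
print(len(normalized_digests))  # 1: all variants match after normalization
```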
Phone numbers
Normalize to E.164 format before hashing:
Original: (415) 555-1234
Normalized: +14155551234
Then hash: SHA256("+14155551234")
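The steps above can be sketched as follows. This is a deliberately naive normalizer: it assumes 10-digit national numbers belong to country code 1 unless the input already starts with `+`, and the `to_e164` helper is hypothetical. Production pipelines usually rely on a dedicated phone-number parsing library instead.

```python
import hashlib
import re


def to_e164(raw: str, default_country_code: str = "1") -> str:
    """Naive E.164 normalization; assumes 10-digit national numbers when no '+' is given."""
    raw = raw.strip()
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, parentheses, dots
    if raw.startswith("+"):
        return "+" + digits
    if len(digits) == 10:
        return "+" + default_country_code + digits
    if len(digits) == 11 and digits.startswith(default_country_code):
        return "+" + digits
    raise ValueError(f"cannot normalize phone number: {raw!r}")


normalized = to_e164("(415) 555-1234")
print(normalized)  # +14155551234

phone_hash = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
print(phone_hash)
```

Raising an error on unrecognized formats, rather than hashing them as-is, keeps unmatched garbage out of the shared dataset.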
Security considerations
Dictionary attacks
While hash functions are irreversible, attackers can create dictionaries mapping common inputs to their hash outputs. If someone has a dictionary of all possible email hashes, they can look up your hash to find the original email.
This is most concerning for inputs with limited possibilities:
- Phone numbers: Only ~10 billion 10-digit numbers—feasible to hash them all
- Common emails: Popular email patterns could be pre-computed
- Short strings: Limited character combinations
Hashing alone may not provide sufficient protection for high-value identifiers with limited input spaces. Consider additional protections like Narrative ID for sensitive use cases.
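To see why a small input space is a problem, consider the sketch below. It assumes a hypothetical attacker who already knows the first six digits of a US number, leaving only 10,000 candidates to precompute; the same approach scales to the full 10-digit space with more compute and storage.

```python
import hashlib


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# The hashed identifier an attacker has obtained (hypothetical number).
target_hash = sha256_hex("+14155551234")

# Precompute a hash -> phone dictionary for every number of the form +1415555XXXX.
lookup = {}
for n in range(10_000):
    candidate = f"+1415555{n:04d}"
    lookup[sha256_hex(candidate)] = candidate

# "Reversing" the hash is now a single dictionary lookup.
recovered = lookup.get(target_hash)
print(recovered)  # +14155551234
```

The hash function itself is never inverted here; the attacker simply enumerates every possible input, which is why hashing alone is weak protection for identifiers like phone numbers.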
Mitigation strategies
Narrative’s platform adds protections beyond simple hashing:
- Access controls limit who can query hashed identifiers
- Narrative ID provides per-partner encoding
- Query logging enables auditing of data access
Not a substitute for access controls
Hashing reduces risk but doesn’t eliminate the need for proper data governance. Always combine hashing with:
- Role-based access controls
- Query logging and auditing
- Data sharing agreements