🧠 All Things AI
Intermediate

Data Anonymization

Data anonymization attempts to remove or modify personal information in a dataset so that individuals can no longer be identified. In the context of AI, anonymization is applied to training data (to allow model training on sensitive data), to model outputs (to prevent the model from revealing personal information), and to evaluation datasets (to allow public benchmarking without privacy violations). However, anonymization is not a reliable privacy guarantee: many anonymization techniques can be reversed by re-identification attacks.

Pseudonymisation vs Anonymization under GDPR

Pseudonymisation (GDPR Article 4(5))

Personal data is processed such that it can no longer be attributed to a specific data subject without the use of additional information, which must be kept separately. The data remains personal data under GDPR. Pseudonymisation reduces risk but does NOT exempt the controller from GDPR obligations.

Examples: Replacing names with a user ID; hashing email addresses; tokenising payment card numbers
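A note on the hashing example: a plain hash of an email address is weak pseudonymisation, because an attacker can brute-force it by hashing guessed addresses. A keyed hash (HMAC), with the key stored separately from the data, is the more defensible approach. A minimal sketch (the key value and token length are illustrative):

```python
import hmac
import hashlib

# The secret key is the "additional information" GDPR requires to be
# kept separately; without it the mapping cannot be recomputed.
SECRET_KEY = b"stored-separately-from-the-data"  # illustrative value

def pseudonymise(email: str) -> str:
    """Replace an email address with a stable keyed token."""
    digest = hmac.new(SECRET_KEY, email.lower().encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

# Same input and key yield the same token, so joins across tables
# still work, which is exactly why this is pseudonymised, not anonymous.
token = pseudonymise("alice@example.com")
assert token == pseudonymise("Alice@Example.com")
```

Because the tokens are stable, the data remains linkable and therefore stays inside GDPR's scope.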

Anonymization (GDPR Recital 26)

Data is anonymised if it has been processed in a way that individuals can no longer be identified — directly or indirectly — with reasonable effort. Truly anonymised data falls outside GDPR entirely. In practice, achieving true anonymization is extremely difficult.

The test: Could a determined adversary with access to auxiliary datasets re-identify individuals? If yes, the data is not truly anonymous.

k-Anonymity

k-anonymity (Sweeney, 2002) requires that every record in a dataset is indistinguishable from at least k-1 other records with respect to a set of quasi-identifying attributes (QIDs) — attributes that are not direct identifiers but could be combined to re-identify individuals (e.g., age, zip code, gender).

How k-anonymization is achieved

  • Generalisation: Replace specific values with less specific ranges (age 34 → "30-39"; zip code 94702 → "947**")
  • Suppression: Remove records or attribute values that cannot be generalised into a group of k matching records without excessive information loss
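The definition above can be checked mechanically: k is the size of the smallest group of records that share the same quasi-identifier values. A minimal sketch (the records and column names are invented for illustration):

```python
from collections import Counter

# Toy table already generalised as in the examples above.
records = [
    {"age": "30-39", "zip": "947**", "gender": "F", "diagnosis": "flu"},
    {"age": "30-39", "zip": "947**", "gender": "F", "diagnosis": "cancer"},
    {"age": "30-39", "zip": "947**", "gender": "F", "diagnosis": "flu"},
    {"age": "40-49", "zip": "946**", "gender": "M", "diagnosis": "asthma"},
]
QIDS = ("age", "zip", "gender")

def k_of(table, qids):
    """k = size of the smallest equivalence class over the quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in qids) for r in table)
    return min(groups.values())

print(k_of(records, QIDS))  # prints 1: the lone 40-49/M record breaks k-anonymity
```

To reach k = 2 here, the outlier record would have to be generalised further or suppressed, which is the information-loss trade-off described above.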

Limitations of k-anonymity

  • Homogeneity attack: If all k records in a group share the same sensitive attribute value, an adversary who knows someone is in the group can infer their sensitive value
  • Background knowledge attack: An adversary with external knowledge can narrow down which record within a k-group belongs to a target individual

l-Diversity and t-Closeness

l-diversity and t-closeness were developed to address the weaknesses of k-anonymity:

  • l-diversity — Requirement: each equivalence class (group of k-anonymous records) must contain at least l "well-represented" values of the sensitive attribute. Limitation addressed: prevents the homogeneity attack, because the sensitive attribute has diversity within each group.
  • t-closeness — Requirement: the distribution of sensitive values in each equivalence class must be within distance t of the overall dataset distribution. Limitation addressed: prevents the skewness attack — even an l-diverse group leaks information if rare sensitive values cluster in it.
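The simplest variant, distinct l-diversity, can be checked by counting distinct sensitive values per equivalence class. A minimal sketch (records are invented; real l-diversity uses stronger notions of "well-represented" than distinct counts):

```python
from collections import defaultdict

# A k=2-anonymous toy table over the quasi-identifiers (age, zip).
records = [
    {"age": "30-39", "zip": "947**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "947**", "diagnosis": "cancer"},
    {"age": "40-49", "zip": "946**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "946**", "diagnosis": "asthma"},
]
QIDS = ("age", "zip")

def l_of(table, qids, sensitive):
    """l = fewest distinct sensitive values in any equivalence class."""
    classes = defaultdict(set)
    for r in table:
        classes[tuple(r[q] for q in qids)].add(r[sensitive])
    return min(len(values) for values in classes.values())

# The 40-49 group is homogeneous (all "asthma"), so l = 1: a table can
# satisfy k-anonymity yet still permit the homogeneity attack.
print(l_of(records, QIDS, "diagnosis"))  # prints 1
```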

Despite these extensions, all three models share a fundamental weakness: they can be defeated by re-identification attacks using auxiliary data. The Netflix Prize re-identification attack (Narayanan & Shmatikov, 2008) demonstrated this convincingly.

Re-Identification Attacks

Why anonymization is often reversible

  • Auxiliary data linkage: An adversary can cross-reference an "anonymised" dataset with public data (social media, voter registration, hospital admissions) to re-identify records. The Netflix Prize dataset, anonymised by replacing user names with random IDs, was re-identified by linking rating patterns to public IMDb reviews.
  • Quasi-identifier combinations: Latanya Sweeney (2000) showed that 87% of the US population could be uniquely identified by just three attributes: gender, date of birth, and 5-digit ZIP code.
  • Machine learning re-identification: ML models can infer sensitive attributes from supposedly non-sensitive features — inferring race from name, location from browsing history, or health status from purchase patterns.
  • Genomic data: Even single nucleotide polymorphisms (SNPs) — a tiny fraction of a genome — are sufficient to uniquely identify individuals. Genomic anonymization is widely considered infeasible.
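The linkage attack in the first bullet can be sketched as a join on shared quasi-identifiers between a "de-identified" release and a public dataset that still carries names. All records below are invented for illustration:

```python
# "Anonymised" release: names removed, quasi-identifiers kept.
released = [
    {"birth": "1984-03-12", "zip": "94702", "gender": "F", "diagnosis": "diabetes"},
    {"birth": "1990-07-01", "zip": "94110", "gender": "M", "diagnosis": "flu"},
]
# Public auxiliary data (e.g. a voter roll): names WITH the same attributes.
public = [
    {"name": "Alice Smith", "birth": "1984-03-12", "zip": "94702", "gender": "F"},
    {"name": "Bob Jones",   "birth": "1990-07-01", "zip": "94110", "gender": "M"},
]
QIDS = ("birth", "zip", "gender")

def link(released, public, qids):
    """Join the two datasets on the quasi-identifiers: name -> sensitive value."""
    index = {tuple(p[q] for q in qids): p["name"] for p in public}
    return {index[key]: r["diagnosis"]
            for r in released
            if (key := tuple(r[q] for q in qids)) in index}

print(link(released, public, QIDS))  # re-identified names mapped to diagnoses
```

When the quasi-identifier combination is unique (as Sweeney's 87% figure suggests it often is), the join recovers identities exactly.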

Synthetic Data as an Alternative

Synthetic data — data generated by a model trained on real data, but which does not contain the original records — is increasingly used as an alternative or complement to anonymization:

Advantages over anonymization

  • No real records are present — re-identification is significantly harder
  • Can generate unlimited synthetic examples, addressing data scarcity
  • Can be generated to correct for imbalances in the original data
  • More flexible than anonymization — utility tends to degrade more gracefully as privacy requirements tighten, whereas aggressive generalisation and suppression destroy data quality quickly

Risks and limitations

  • Generative models can memorise and reproduce training data — synthetic data does not automatically guarantee privacy
  • Distribution fidelity: synthetic data may not capture rare subgroup patterns, introducing its own bias
  • Regulatory status is unclear in some jurisdictions — GDPR guidance on synthetic data as "truly anonymous" is not settled
  • Combining synthetic data with real data in a pipeline requires careful privacy accounting
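The memorisation risk in the first bullet can be probed with a crude audit: flag synthetic records that appear verbatim in the training data. A minimal sketch (the data is invented; real audits also test near-duplicates and membership inference, not just exact matches):

```python
# Training records and generator output as simple tuples.
train = [("1984-03-12", "94702", "diabetes"),
         ("1990-07-01", "94110", "flu")]
synthetic = [("1984-03-12", "94702", "diabetes"),  # copied by the generator
             ("1975-11-30", "90210", "asthma")]

def leaked(train, synthetic):
    """Return synthetic records that are exact copies of training records."""
    seen = set(train)
    return [s for s in synthetic if s in seen]

print(leaked(train, synthetic))  # one verbatim training record leaked
```

Passing such a check is necessary but not sufficient: a generator can still leak through near-copies or attribute disclosure, which is why synthetic data is not automatically private.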

Checklist: Do You Understand This?

  • Under GDPR, what is the difference between pseudonymisation and anonymization, and which one removes GDPR obligations?
  • Explain k-anonymity and describe the homogeneity attack that defeats it.
  • How does l-diversity extend k-anonymity, and what additional protection does t-closeness provide?
  • Describe the Netflix Prize re-identification attack — what does it reveal about anonymization?
  • Why is synthetic data not automatically private, and what risk does memorisation in generative models create?