🧠 All Things AI
Advanced

Differential Privacy

Differential privacy (DP) is a mathematical framework that provides a rigorous, quantifiable guarantee about how much information about any individual is revealed by a computation. Unlike anonymisation (which can often be reversed) or access controls (which are binary), DP provides a formal bound on privacy loss that holds even against an adversary with arbitrary auxiliary information. It is increasingly required by regulators and has been adopted by Apple, Google, Meta, and the US Census Bureau for sensitive statistical computations.

The Formal Definition

A randomised mechanism M is (ε, δ)-differentially private if for any two datasets D and D' that differ by at most one record (called adjacent datasets), and for any set of possible outputs S:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S] + δ

  • ε (epsilon) — privacy budget: Controls the maximum privacy loss. Lower ε = stronger privacy. ε = 0 means identical output distributions (perfect privacy, useless mechanism). Typical values in practice: 0.1 to 10.
  • δ (delta) — failure probability: The probability that the ε bound is violated. It should be much smaller than 1/n, where n is the dataset size — a δ on the order of 1/n would permit a mechanism that publishes one full record outright. Pure DP sets δ = 0; approximate DP allows a small positive δ.
  • Intuition: An observer who sees the output of M cannot determine with high confidence whether any specific individual's record was in the input dataset.
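The definition can be checked in closed form for a simple case. The sketch below (illustrative numbers; `laplace_pdf` is a helper defined here, not a library function) uses the Laplace mechanism on a counting query: the output density ratio between adjacent datasets is exactly the quantity that ε bounds.

```python
import math

def laplace_pdf(x, mu, b):
    """Density of the Laplace distribution with mean mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2 * b)

eps = 1.0
sensitivity = 1.0       # a counting query changes by at most 1 per record
b = sensitivity / eps   # Laplace scale calibrated for eps-DP

# Adjacent datasets whose true counts are 100 and 101: the density ratio
# at every output point is bounded by e^eps, which is the (eps, 0)-DP bound.
for x in (90.0, 99.5, 100.5, 120.0):
    ratio = laplace_pdf(x, 100, b) / laplace_pdf(x, 101, b)
    assert ratio <= math.exp(eps) + 1e-12
```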

Noise Mechanisms

DP is achieved by adding carefully calibrated random noise to the output of a computation. The noise magnitude is determined by the sensitivity of the computation — how much the output can change if a single record is added or removed.

Laplace mechanism (pure DP)

Add noise drawn from the Laplace distribution. Scale parameter = L1 sensitivity / ε. Achieves pure ε-DP (δ = 0).

Best for: Numerical queries (counting queries, histograms, sums) where outputs are real numbers. The most interpretable mechanism.
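A minimal sketch of the Laplace mechanism on a counting query. Function names and the data are illustrative; the sampler uses the inverse-CDF trick because Python's standard library has no Laplace distribution.

```python
import math
import random

def sample_laplace(scale, rng):
    """Inverse-CDF sample from Laplace(0, scale)."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(data, predicate, epsilon, rng=random):
    """eps-DP count: adding or removing one record changes the count by at
    most 1, so the sensitivity is 1 and the Laplace scale is 1 / epsilon."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + sample_laplace(1.0 / epsilon, rng)

ages = [34, 61, 29, 45, 70, 52]
print(noisy_count(ages, lambda a: a >= 50, epsilon=1.0, rng=random.Random(0)))
```

The true count here is 3; each release returns 3 plus fresh Laplace noise, so repeated queries spend additional privacy budget.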

Gaussian mechanism (approximate DP)

Add noise drawn from the Gaussian (normal) distribution. Standard deviation σ = L2 sensitivity · √(2 ln(1.25/δ)) / ε (this classic calibration is valid for ε ≤ 1). Achieves (ε, δ)-DP.

Best for: High-dimensional outputs like gradient vectors (used in DP-SGD). Gaussian noise composes better with advanced composition theorems.
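A sketch of the Gaussian mechanism using the σ formula above (function names are illustrative; production code should rely on a vetted library such as Opacus or Google's DP library rather than hand-rolled noise):

```python
import math
import random

def gaussian_sigma(l2_sensitivity, epsilon, delta):
    """Classic calibration (valid for epsilon <= 1):
    sigma = l2_sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon."""
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def gaussian_mechanism(vector, l2_sensitivity, epsilon, delta, rng=random):
    """Add i.i.d. N(0, sigma^2) noise to each coordinate of a vector query."""
    sigma = gaussian_sigma(l2_sensitivity, epsilon, delta)
    return [v + rng.gauss(0.0, sigma) for v in vector]

# e.g. a clipped gradient sum with L2 sensitivity 1.0
noisy = gaussian_mechanism([0.2, -1.3, 0.7], 1.0, epsilon=1.0, delta=1e-5)
```

Note the noise is calibrated to the L2 norm of the whole vector, not per coordinate, which is why this mechanism scales well to high-dimensional outputs.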

Exponential mechanism

For selecting from a set of categorical outputs. Samples output r with probability proportional to e^(ε · u(x, r) / (2Δu)), where u is a quality (utility) score and Δu is its sensitivity.

Best for: Differentially private selection problems — choosing the best query response, hyperparameter, or model from a set.
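A sketch of the exponential mechanism as weighted sampling (function names are illustrative; scores are shifted by their maximum for numerical stability, which does not change the sampling distribution):

```python
import math
import random

def exponential_mechanism(candidates, utility, sensitivity, epsilon, rng=random):
    """Sample candidate r with probability proportional to
    exp(epsilon * utility(r) / (2 * sensitivity))."""
    scores = [utility(r) for r in candidates]
    m = max(scores)  # shift by the max for numerical stability
    weights = [math.exp(epsilon * (s - m) / (2.0 * sensitivity)) for s in scores]
    pick = rng.random() * sum(weights)
    for r, w in zip(candidates, weights):
        pick -= w
        if pick <= 0:
            return r
    return candidates[-1]

# e.g. privately pick the most common category, with counts as the utility
counts = {"red": 14, "green": 3, "blue": 9}
choice = exponential_mechanism(list(counts), counts.get, sensitivity=1.0, epsilon=1.0)
```

With large ε the mechanism almost always returns the highest-utility candidate; as ε approaches 0 it approaches uniform sampling.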

Randomised response

Classic mechanism for local DP: each respondent answers truthfully with probability p, randomly otherwise. Provides plausible deniability for each individual response.

Best for: Collecting sensitive survey data; local differential privacy where the data collector cannot be trusted.
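A sketch of binary randomised response with the standard debiasing step (names illustrative; with truth probability p this satisfies local DP with ε = ln((1 + p)/(1 − p))):

```python
import random

def randomized_response(truth, p, rng=random):
    """Report the true bit with probability p; otherwise report a fair coin flip."""
    return truth if rng.random() < p else rng.random() < 0.5

def debiased_rate(reports, p):
    """Unbiased estimate of the true 'yes' rate from noisy reports:
    E[report] = p * rate + (1 - p) / 2, so invert that affine map."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p) / 2) / p

# Simulate 100k respondents, 30% of whom hold the sensitive attribute.
rng = random.Random(0)
truths = [i < 30_000 for i in range(100_000)]
reports = [randomized_response(t, p=0.5, rng=rng) for t in truths]
print(debiased_rate(reports, p=0.5))  # close to 0.3
```

Each individual response is deniable, yet the population-level rate is still recoverable, which is the core trade of local DP.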

Local vs Global Differential Privacy

  • Trust model: Global DP (central model) has a trusted aggregator that receives the raw data and applies DP before publishing. Local DP has no trusted aggregator; noise is added on-device before the data leaves, so even the server never sees true values.
  • Privacy guarantee: Global protects against anyone who sees the published output. Local also protects against the data collector and anyone downstream.
  • Utility: Global gives better accuracy for the same ε, since noise is added once to the aggregate. Local gives lower accuracy, since the noise must hide each individual record, not just its contribution to the aggregate.
  • Real-world use: Global is used by the US Census Bureau, research datasets, and private model training (DP-SGD). Local is used by Apple (iOS telemetry) and Google RAPPOR (Chrome usage statistics).
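The utility gap can be quantified analytically for a simple count (a back-of-the-envelope sketch under the stated assumptions; the p formula inverts ε = ln((1 + p)/(1 − p)) for binary randomised response):

```python
import math

eps, n = 1.0, 10_000

# Central model: one Laplace(1/eps) draw on the exact count.
central_std = math.sqrt(2) / eps          # std of Laplace(b) is b * sqrt(2)

# Local model: each record randomised at the same eps before collection.
p = (math.exp(eps) - 1) / (math.exp(eps) + 1)
# Each report is a Bernoulli with variance <= 1/4; debiasing divides by p,
# and scaling the rate estimate back up to a count multiplies by n.
local_std = math.sqrt(n) * 0.5 / p

print(central_std, local_std)  # ~1.4 vs ~108: local noise grows with sqrt(n)
```

The central model's error is constant in n, while the local model's error grows with √n, which is why local DP deployments need very large populations to be useful.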

DP-SGD: Differential Privacy for Model Training

DP-SGD (Abadi et al., 2016) is the canonical algorithm for training neural networks with differential privacy. It makes two modifications to standard stochastic gradient descent, per-example gradient clipping and Gaussian noise addition, and tracks the resulting privacy loss with an accountant:

  1. Per-example gradient clipping: For each training example in a mini-batch, compute the gradient and clip it to a maximum L2 norm C. This bounds the sensitivity of the gradient computation to C.
  2. Gaussian noise addition: Add Gaussian noise with scale σ = C · (noise_multiplier) to the clipped, summed gradients before the update step.
  3. Privacy accounting: Track the cumulative privacy loss (ε, δ) across all training steps using the moments accountant or Rényi DP composition theorems. Training stops when the budget is exhausted.
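The first two steps can be sketched for a toy model whose per-example gradients are plain Python lists (illustrative only; real training should use Opacus or TF Privacy, and this omits the privacy accountant entirely):

```python
import math
import random

def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr,
                rng=random):
    """One DP-SGD update on a flat parameter vector."""
    dim = len(params)
    summed = [0.0] * dim
    for g in per_example_grads:
        # 1. Clip each example's gradient to L2 norm <= clip_norm.
        norm = math.sqrt(sum(v * v for v in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(dim):
            summed[i] += g[i] * scale
    # 2. Add Gaussian noise with sigma = clip_norm * noise_multiplier
    #    to the clipped sum, then average over the batch.
    sigma = clip_norm * noise_multiplier
    batch = len(per_example_grads)
    noisy = [(s + rng.gauss(0.0, sigma)) / batch for s in summed]
    return [p - lr * g for p, g in zip(params, noisy)]
```

Setting noise_multiplier = 0 with a generous clip_norm recovers plain averaged SGD, a useful sanity check; step 3 (accounting) lives outside the update loop in real implementations.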

DP-SGD adoption for LLMs

  • Fine-tuning LLMs on sensitive data (medical notes, legal documents, financial records) with DP-SGD gives a formal bound on how much any single training record can influence the fine-tuned model, which in turn limits memorisation of individual records
  • Google has published DP-SGD results on fine-tuning BERT and T5
  • Apple uses DP for on-device learning in iOS
  • Libraries: Google's DP library, Opacus (PyTorch), TF Privacy

DP-SGD limitations

  • Accuracy cost is significant at strong privacy levels (ε < 1)
  • Training is slower — per-example gradient computation is 2–10x more expensive than standard batched gradients
  • Privacy budget accounting limits the number of training epochs
  • Amplification by sampling (subsampled DP) requires careful implementation to avoid errors

Choosing Epsilon in Practice

  • ε ≤ 1: strong (the research standard for sensitive data). Medical records, financial data, politically sensitive data.
  • 1 < ε ≤ 10: moderate (typical in ML deployments balancing utility). Most production DP deployments; Apple iOS uses ε ≤ 8 for some features.
  • ε > 10: weak (minimal protection in practice). May satisfy a regulatory compliance framing, but provides little individual protection.

There is no universal right ε. The choice depends on the sensitivity of the data, the threat model, the required utility, and the regulatory context. Regulators are beginning to specify ε requirements (some EU healthcare research frameworks suggest ε ≤ 1 for medical data).
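One way to build intuition for these tiers: for a sensitivity-1 counting query, the Laplace noise standard deviation is √2/ε, so each tier corresponds directly to how blurry a published count becomes.

```python
import math

# Laplace noise std on a sensitivity-1 count is sqrt(2) / eps.
for eps in (0.1, 1.0, 10.0):
    print(f"eps = {eps:4}: noise std ~ {math.sqrt(2) / eps:6.2f}")
```

At ε = 0.1 a single count is blurred by roughly ±14, while at ε = 10 the noise is nearly negligible, which is why large ε values offer little practical protection.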

Checklist: Do You Understand This?

  • State the (ε, δ)-differential privacy definition in plain language — what does ε bound?
  • What is the difference between the Laplace and Gaussian mechanisms, and when would you use each?
  • What is the key difference between local and global differential privacy?
  • Describe the two modifications DP-SGD makes to standard SGD and why each is needed.
  • Why does a lower ε provide stronger privacy but lower accuracy?
  • At what ε value would you use DP for training on medical records, and why?