🧠 All Things AI
Advanced

Differential Privacy

Differential privacy (DP) is a mathematical framework that provides a rigorous, quantifiable guarantee about how much information about any individual is revealed by a computation. Unlike anonymisation (which can often be reversed) or access controls (which are binary), DP provides a formal bound on privacy loss that holds even against an adversary with arbitrary auxiliary information. It is increasingly required by regulators and has been adopted by Apple, Google, Meta, and the US Census Bureau for sensitive statistical computations.

The Formal Definition

A randomised mechanism M is (ε, δ)-differentially private if for any two datasets D and D' that differ by at most one record (called adjacent datasets), and for any set of possible outputs S:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S] + δ

  • ε (epsilon) — privacy budget: Controls the maximum privacy loss. Lower ε = stronger privacy. ε = 0 means identical output distributions (perfect privacy, useless mechanism). Typical values in practice: 0.1 to 10.
  • δ (delta) — failure probability: The probability that the ε bound is violated. It should be much smaller than 1/n, where n is the dataset size — a δ on the order of 1/n would permit a mechanism that publishes one full record outright. Pure DP sets δ = 0; approximate DP allows a small positive δ.
  • Intuition: An observer who sees the output of M cannot determine with high confidence whether any specific individual's record was in the input dataset.
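The definition can be checked in closed form for a simple case. The sketch below (illustrative numbers; `laplace_pdf` is a helper defined here, not a library function) uses the Laplace mechanism on a counting query: the output density ratio between adjacent datasets is exactly the quantity that ε bounds.

```python
import math

def laplace_pdf(x, mu, b):
    """Density of the Laplace distribution with mean mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2 * b)

eps = 1.0
sensitivity = 1.0       # a counting query changes by at most 1 per record
b = sensitivity / eps   # Laplace scale calibrated for eps-DP

# Adjacent datasets whose true counts are 100 and 101: the density ratio
# at every output point is bounded by e^eps, which is the (eps, 0)-DP bound.
for x in (90.0, 99.5, 100.5, 120.0):
    ratio = laplace_pdf(x, 100, b) / laplace_pdf(x, 101, b)
    assert ratio <= math.exp(eps) + 1e-12
```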

Noise Mechanisms

DP is achieved by adding carefully calibrated random noise to the output of a computation. The noise magnitude is determined by the sensitivity of the computation — how much the output can change if a single record is added or removed.

Laplace mechanism (pure DP)

Add noise drawn from the Laplace distribution. Scale parameter = L1 sensitivity / ε. Achieves pure ε-DP (δ = 0).

Best for: Numerical queries (counting queries, histograms, sums) where outputs are real numbers. The most interpretable mechanism.
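A minimal sketch of the Laplace mechanism on a counting query. Function names and the data are illustrative; the sampler uses the inverse-CDF trick because Python's standard library has no Laplace distribution.

```python
import math
import random

def sample_laplace(scale, rng):
    """Inverse-CDF sample from Laplace(0, scale)."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(data, predicate, epsilon, rng=random):
    """eps-DP count: adding or removing one record changes the count by at
    most 1, so the sensitivity is 1 and the Laplace scale is 1 / epsilon."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + sample_laplace(1.0 / epsilon, rng)

ages = [34, 61, 29, 45, 70, 52]
print(noisy_count(ages, lambda a: a >= 50, epsilon=1.0, rng=random.Random(0)))
```

The true count here is 3; each release returns 3 plus fresh Laplace noise, so repeated queries spend additional privacy budget.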

Gaussian mechanism (approximate DP)

Add noise drawn from the Gaussian (normal) distribution. Standard deviation σ = L2 sensitivity · √(2 ln(1.25/δ)) / ε (this classic calibration is valid for ε ≤ 1). Achieves (ε, δ)-DP.

Best for: High-dimensional outputs like gradient vectors (used in DP-SGD). Gaussian noise composes better with advanced composition theorems.
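A sketch of the Gaussian mechanism using the σ formula above (function names are illustrative; production code should rely on a vetted library such as Opacus or Google's DP library rather than hand-rolled noise):

```python
import math
import random

def gaussian_sigma(l2_sensitivity, epsilon, delta):
    """Classic calibration (valid for epsilon <= 1):
    sigma = l2_sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon."""
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def gaussian_mechanism(vector, l2_sensitivity, epsilon, delta, rng=random):
    """Add i.i.d. N(0, sigma^2) noise to each coordinate of a vector query."""
    sigma = gaussian_sigma(l2_sensitivity, epsilon, delta)
    return [v + rng.gauss(0.0, sigma) for v in vector]

# e.g. a clipped gradient sum with L2 sensitivity 1.0
noisy = gaussian_mechanism([0.2, -1.3, 0.7], 1.0, epsilon=1.0, delta=1e-5)
```

Note the noise is calibrated to the L2 norm of the whole vector, not per coordinate, which is why this mechanism scales well to high-dimensional outputs.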

Exponential mechanism

For selecting from a set of categorical outputs. Samples output r with probability proportional to e^(ε · u(x, r) / (2Δu)), where u is a quality (utility) score and Δu is its sensitivity.

Best for: Differentially private selection problems — choosing the best query response, hyperparameter, or model from a set.
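A sketch of the exponential mechanism as weighted sampling (function names are illustrative; scores are shifted by their maximum for numerical stability, which does not change the sampling distribution):

```python
import math
import random

def exponential_mechanism(candidates, utility, sensitivity, epsilon, rng=random):
    """Sample candidate r with probability proportional to
    exp(epsilon * utility(r) / (2 * sensitivity))."""
    scores = [utility(r) for r in candidates]
    m = max(scores)  # shift by the max for numerical stability
    weights = [math.exp(epsilon * (s - m) / (2.0 * sensitivity)) for s in scores]
    pick = rng.random() * sum(weights)
    for r, w in zip(candidates, weights):
        pick -= w
        if pick <= 0:
            return r
    return candidates[-1]

# e.g. privately pick the most common category, with counts as the utility
counts = {"red": 14, "green": 3, "blue": 9}
choice = exponential_mechanism(list(counts), counts.get, sensitivity=1.0, epsilon=1.0)
```

With large ε the mechanism almost always returns the highest-utility candidate; as ε approaches 0 it approaches uniform sampling.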

Randomised response

Classic mechanism for local DP: each respondent answers truthfully with probability p, randomly otherwise. Provides plausible deniability for each individual response.

Best for: Collecting sensitive survey data; local differential privacy where the data collector cannot be trusted.
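A sketch of binary randomised response with the standard debiasing step (names illustrative; with truth probability p this satisfies local DP with ε = ln((1 + p)/(1 − p))):

```python
import random

def randomized_response(truth, p, rng=random):
    """Report the true bit with probability p; otherwise report a fair coin flip."""
    return truth if rng.random() < p else rng.random() < 0.5

def debiased_rate(reports, p):
    """Unbiased estimate of the true 'yes' rate from noisy reports:
    E[report] = p * rate + (1 - p) / 2, so invert that affine map."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p) / 2) / p

# Simulate 100k respondents, 30% of whom hold the sensitive attribute.
rng = random.Random(0)
truths = [i < 30_000 for i in range(100_000)]
reports = [randomized_response(t, p=0.5, rng=rng) for t in truths]
print(debiased_rate(reports, p=0.5))  # close to 0.3
```

Each individual response is deniable, yet the population-level rate is still recoverable, which is the core trade of local DP.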

Local vs Global Differential Privacy

  • Trust model: Global DP (central model) has a trusted aggregator that receives the raw data and applies DP before publishing. Local DP has no trusted aggregator; noise is added on-device before the data leaves, so even the server never sees true values.
  • Privacy guarantee: Global protects against anyone who sees the published output. Local also protects against the data collector and anyone downstream.
  • Utility: Global gives better accuracy for the same ε, since noise is added once to the aggregate. Local gives lower accuracy, since the noise must hide each individual record, not just its contribution to the aggregate.
  • Real-world use: Global is used by the US Census Bureau, research datasets, and private model training (DP-SGD). Local is used by Apple (iOS telemetry) and Google RAPPOR (Chrome usage statistics).
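The utility gap can be quantified analytically for a simple count (a back-of-the-envelope sketch under the stated assumptions; the p formula inverts ε = ln((1 + p)/(1 − p)) for binary randomised response):

```python
import math

eps, n = 1.0, 10_000

# Central model: one Laplace(1/eps) draw on the exact count.
central_std = math.sqrt(2) / eps          # std of Laplace(b) is b * sqrt(2)

# Local model: each record randomised at the same eps before collection.
p = (math.exp(eps) - 1) / (math.exp(eps) + 1)
# Each report is a Bernoulli with variance <= 1/4; debiasing divides by p,
# and scaling the rate estimate back up to a count multiplies by n.
local_std = math.sqrt(n) * 0.5 / p

print(central_std, local_std)  # ~1.4 vs ~108: local noise grows with sqrt(n)
```

The central model's error is constant in n, while the local model's error grows with √n, which is why local DP deployments need very large populations to be useful.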

DP-SGD: Differential Privacy for Model Training

DP-SGD (Abadi et al., 2016) is the canonical algorithm for training neural networks with differential privacy. It makes two modifications to standard stochastic gradient descent, per-example gradient clipping and Gaussian noise addition, and tracks the resulting privacy loss with an accountant:

  1. Per-example gradient clipping: For each training example in a mini-batch, compute the gradient and clip it to a maximum L2 norm C. This bounds the sensitivity of the gradient computation to C.
  2. Gaussian noise addition: Add Gaussian noise with scale σ = C · (noise_multiplier) to the clipped, summed gradients before the update step.
  3. Privacy accounting: Track the cumulative privacy loss (ε, δ) across all training steps using the moments accountant or Rényi DP composition theorems. Training stops when the budget is exhausted.
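The first two steps can be sketched for a toy model whose per-example gradients are plain Python lists (illustrative only; real training should use Opacus or TF Privacy, and this omits the privacy accountant entirely):

```python
import math
import random

def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr,
                rng=random):
    """One DP-SGD update on a flat parameter vector."""
    dim = len(params)
    summed = [0.0] * dim
    for g in per_example_grads:
        # 1. Clip each example's gradient to L2 norm <= clip_norm.
        norm = math.sqrt(sum(v * v for v in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(dim):
            summed[i] += g[i] * scale
    # 2. Add Gaussian noise with sigma = clip_norm * noise_multiplier
    #    to the clipped sum, then average over the batch.
    sigma = clip_norm * noise_multiplier
    batch = len(per_example_grads)
    noisy = [(s + rng.gauss(0.0, sigma)) / batch for s in summed]
    return [p - lr * g for p, g in zip(params, noisy)]
```

Setting noise_multiplier = 0 with a generous clip_norm recovers plain averaged SGD, a useful sanity check; step 3 (accounting) lives outside the update loop in real implementations.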

DP-SGD adoption for LLMs

  • Fine-tuning LLMs on sensitive data (medical notes, legal documents, financial records) with DP-SGD gives a formal bound on how much any single training record can influence the fine-tuned model, which in turn limits memorisation of individual records
  • Google has published DP-SGD results on fine-tuning BERT and T5
  • Apple uses DP for on-device learning in iOS
  • Libraries: Google's DP library, Opacus (PyTorch), TF Privacy

DP-SGD limitations

  • Accuracy cost is significant at strong privacy levels (ε < 1)
  • Training is slower — per-example gradient computation is 2–10x more expensive than standard batched gradients
  • Privacy budget accounting limits the number of training epochs
  • Amplification by sampling (subsampled DP) requires careful implementation to avoid errors

Choosing Epsilon in Practice

  • ε ≤ 1: strong (the research standard for sensitive data). Medical records, financial data, politically sensitive data.
  • 1 < ε ≤ 10: moderate (typical in ML deployments balancing utility). Most production DP deployments; Apple iOS uses ε ≤ 8 for some features.
  • ε > 10: weak (minimal protection in practice). May satisfy a regulatory compliance framing, but provides little individual protection.

There is no universal right ε. The choice depends on the sensitivity of the data, the threat model, the required utility, and the regulatory context. Regulators are beginning to specify ε requirements (some EU healthcare research frameworks suggest ε ≤ 1 for medical data).
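One way to build intuition for these tiers: for a sensitivity-1 counting query, the Laplace noise standard deviation is √2/ε, so each tier corresponds directly to how blurry a published count becomes.

```python
import math

# Laplace noise std on a sensitivity-1 count is sqrt(2) / eps.
for eps in (0.1, 1.0, 10.0):
    print(f"eps = {eps:4}: noise std ~ {math.sqrt(2) / eps:6.2f}")
```

At ε = 0.1 a single count is blurred by roughly ±14, while at ε = 10 the noise is nearly negligible, which is why large ε values offer little practical protection.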

Checklist: Do You Understand This?

  • State the (ε, δ)-differential privacy definition in plain language — what does ε bound?
  • What is the difference between the Laplace and Gaussian mechanisms, and when would you use each?
  • What is the key difference between local and global differential privacy?
  • Describe the two modifications DP-SGD makes to standard SGD and why each is needed.
  • Why does a lower ε provide stronger privacy but lower accuracy?
  • At what ε value would you use DP for training on medical records, and why?