Fairness Metrics
Algorithmic fairness cannot be determined by intuition — it requires precise mathematical definitions that can be measured and compared across groups. The challenge is that there are multiple distinct, internally consistent fairness criteria, and they are often mutually incompatible. Choosing the right metric for your context is not a technical decision — it is a values decision that must be made explicitly by the relevant stakeholders, including affected communities.
Notation and Setup
For a binary classification problem:
- Y — true outcome (1 = positive, 0 = negative)
- Ŷ — model prediction (1 = predicted positive, 0 = predicted negative)
- A — sensitive attribute (e.g., race, gender, age group)
- S — model score (continuous probability output before thresholding)
- Groups a₀ and a₁ — the two demographic groups being compared
Group Fairness Metrics
Group fairness (also called statistical fairness) requires that some statistical property of the model's predictions is equal across demographic groups.
| Metric | Definition | When to use |
|---|---|---|
| Demographic Parity | P(Ŷ=1 \| A=a₀) = P(Ŷ=1 \| A=a₁): equal positive prediction rates across groups | Anti-discrimination in access decisions; when historical outcomes are themselves discriminatory |
| Equal Opportunity | P(Ŷ=1 \| Y=1, A=a₀) = P(Ŷ=1 \| Y=1, A=a₁): equal true positive rates (recall) across groups | High-stakes classification where failing to identify true positives causes harm (medical screening, fraud detection for victims) |
| Equalized Odds | P(Ŷ=1 \| Y=y, A=a₀) = P(Ŷ=1 \| Y=y, A=a₁) for y ∈ {0, 1}: equal TPR and FPR across groups simultaneously | When both false positive and false negative harms are significant; stronger than equal opportunity |
| Predictive Parity | P(Y=1 \| Ŷ=1, A=a₀) = P(Y=1 \| Ŷ=1, A=a₁): equal precision (PPV) across groups | When a positive prediction triggers a costly intervention; both groups should receive equally reliable positive predictions |
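The four metrics above reduce to comparing a handful of per-group rates. A minimal sketch of computing them with NumPy (the function name and toy data are hypothetical):

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Per-group selection rate, TPR (recall), and PPV (precision)."""
    rates = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        sel = yp.mean()                                          # P(Yhat=1 | A=g)
        tpr = yp[yt == 1].mean() if (yt == 1).any() else np.nan  # P(Yhat=1 | Y=1, A=g)
        ppv = yt[yp == 1].mean() if (yp == 1).any() else np.nan  # P(Y=1 | Yhat=1, A=g)
        rates[g] = {"selection_rate": sel, "tpr": tpr, "ppv": ppv}
    return rates

# toy data: two groups of four examples each
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
group  = np.array(["a0", "a0", "a0", "a0", "a1", "a1", "a1", "a1"])
print(group_rates(y_true, y_pred, group))
```

Demographic parity compares `selection_rate` across groups, equal opportunity compares `tpr`, equalized odds additionally compares FPR, and predictive parity compares `ppv`.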
Calibration
A model is calibrated if its score output S represents a true probability of the positive outcome. A score of 0.7 should mean the event occurs 70% of the time for examples with that score. Calibration across groups requires this to hold within each demographic group separately.
Why calibration matters
Decision-makers using model scores to inform (not just automate) decisions rely on the score representing a meaningful probability. If a model is miscalibrated for Group B, a score of 0.7 for Group B actually means a different underlying risk than 0.7 for Group A.
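One common way to check this is a binned reliability comparison: within each group, compare the mean score in each bin to the observed positive rate. A sketch, assuming scores in [0, 1] (the function name and bin count are illustrative):

```python
import numpy as np

def calibration_by_group(scores, y_true, group, n_bins=5):
    """For each group, return (mean score, observed positive rate) per score bin.
    A model calibrated within groups has these two numbers close in every bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    out = {}
    for g in np.unique(group):
        s, y = scores[group == g], y_true[group == g]
        rows = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (s >= lo) & (s < hi) if hi < 1.0 else (s >= lo)
            if in_bin.any():
                rows.append((float(s[in_bin].mean()), float(y[in_bin].mean())))
        out[g] = rows
    return out

# synthetic scores that are calibrated by construction
rng = np.random.default_rng(0)
scores = rng.uniform(size=2000)
y_true = (rng.uniform(size=2000) < scores).astype(int)
group = np.array(["a0"] * 1000 + ["a1"] * 1000)
print(calibration_by_group(scores, y_true, group))
```

Large gaps between the two numbers in a bin for one group but not the other are the per-group miscalibration the paragraph above describes.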
Calibration vs equalized odds conflict
The Chouldechova (2017) result shows that if base rates differ between groups, a well-calibrated model cannot simultaneously achieve equal false positive rates and equal false negative rates across groups, except in the degenerate case of a perfect predictor. This is a mathematical impossibility, not an engineering failure.
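The result follows from an identity relating the error rates: FPR = p/(1−p) · (1−PPV)/PPV · (1−FNR), where p is the group's base rate. The toy numbers below (not from any real system) show that if PPV and FNR are held equal across groups, different base rates force different FPRs:

```python
def implied_fpr(base_rate, ppv, fnr):
    """FPR implied by the identity FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR)."""
    p = base_rate
    return p / (1 - p) * (1 - ppv) / ppv * (1 - fnr)

# hold PPV (a calibration proxy) and FNR equal across two groups
ppv, fnr = 0.6, 0.3
fpr_a = implied_fpr(0.3, ppv, fnr)   # base rate 0.3 -> FPR 0.2
fpr_b = implied_fpr(0.5, ppv, fnr)   # base rate 0.5 -> FPR ~0.467
print(fpr_a, fpr_b)  # the two FPRs cannot be equal
```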
Individual Fairness
Individual fairness shifts the level of analysis from groups to individuals: similar individuals should receive similar predictions. Formally: if d(x₁, x₂) is small (individuals are similar on relevant features), then |S(x₁) - S(x₂)| should also be small.
- Strengths: Captures the intuition that discrimination is about treating similar people differently; compatible with contexts where group-level fairness is legally constrained
- Weaknesses: Defining the similarity metric d is itself a value-laden choice; hard to audit at scale; does not guarantee aggregate equity across groups
- Counterfactual fairness: A related concept — the prediction would be the same in the counterfactual world where the individual belonged to a different demographic group, all else equal
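A brute-force audit of the Lipschitz-style condition can be sketched as follows; the Euclidean metric and the constant L here stand in for the value-laden similarity choice noted above, and all names are illustrative:

```python
import itertools
import numpy as np

def lipschitz_violations(X, scores, metric, L=1.0):
    """Flag pairs (i, j) where |S(x_i) - S(x_j)| > L * d(x_i, x_j).
    The metric d and the constant L encode the similarity judgment."""
    violations = []
    for i, j in itertools.combinations(range(len(X)), 2):
        if abs(scores[i] - scores[j]) > L * metric(X[i], X[j]):
            violations.append((i, j))
    return violations

euclid = lambda a, b: float(np.linalg.norm(a - b))
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
scores = np.array([0.2, 0.8, 0.9])   # points 0 and 1 are near-identical but scored very differently
print(lipschitz_violations(X, scores, euclid))  # flags the (0, 1) pair
```

The pairwise loop is O(n²), which is exactly the "hard to audit at scale" weakness listed above; practical audits typically sample pairs instead.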
Impossibility Theorems
Multiple mathematical results prove that common fairness criteria cannot be satisfied simultaneously when base rates (prevalence of the positive outcome) differ across groups:
| Result | What it shows | Implication |
|---|---|---|
| Chouldechova (2017) | Calibration + equal FPR + equal FNR are mutually exclusive when base rates differ | Risk score tools like COMPAS cannot satisfy all three simultaneously |
| Kleinberg et al. (2016) | Calibration within groups is incompatible with balance for the positive and negative classes (equal average scores among positives, and among negatives, across groups), except when base rates are equal or prediction is perfect | You must choose which fairness criterion matters most for your use case |
| Demographic Parity vs Accuracy | Enforcing demographic parity reduces accuracy when base rates differ (Hardt et al., 2016) | Fairness interventions have a measurable cost; this cost is ethically worth paying in many contexts |
These results do not mean fairness is impossible. They mean that fairness criteria encode different value choices, and teams must explicitly choose which harms to prioritise minimising — typically in consultation with affected communities and domain experts.
Choosing the Right Metric
| Use case | Harm to avoid | Preferred metric |
|---|---|---|
| Hiring or lending | Unequal access to opportunity | Demographic parity or equal opportunity |
| Medical screening | Missing disease in one group (false negatives) | Equal opportunity (equal recall) |
| Criminal risk scoring | False accusations (false positives) | Equal FPR (component of equalized odds) |
| Fraud detection | Over-flagging legitimate transactions | Calibration + equal precision |
| Content moderation | Disproportionate suppression of one group's speech | Equalized odds across demographic groups |
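When the chosen metric is equal opportunity, one common post-processing remedy (in the spirit of Hardt et al., 2016) is to pick group-specific score thresholds that match true positive rates. A sketch on synthetic data; the function name, data, and target are all illustrative:

```python
import numpy as np

def thresholds_for_equal_tpr(scores, y_true, group, target_tpr=0.8):
    """Per-group score thresholds so each group catches ~target_tpr of its
    true positives (equal opportunity via post-processing). A sketch only:
    a real deployment would tune and validate on held-out data."""
    thresholds = {}
    for g in np.unique(group):
        pos_scores = scores[(group == g) & (y_true == 1)]
        # the (1 - target_tpr) quantile of positives' scores yields ~target TPR
        thresholds[g] = float(np.quantile(pos_scores, 1 - target_tpr))
    return thresholds

# synthetic data: positives score higher on average in both groups
rng = np.random.default_rng(1)
n = 4000
group = np.array(["a0"] * 2000 + ["a1"] * 2000)
y_true = rng.integers(0, 2, size=n)
scores = 0.35 * y_true + 0.6 * rng.uniform(size=n)
th = thresholds_for_equal_tpr(scores, y_true, group, target_tpr=0.8)
print(th)  # group-specific thresholds; TPR is ~0.8 in each group
```

Note the trade-off the impossibility results predict: equalizing TPR this way generally leaves selection rates and FPRs unequal when base rates or score distributions differ.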
Checklist: Do You Understand This?
- What is the difference between demographic parity and equal opportunity?
- Define equalized odds — what two conditions must simultaneously hold?
- What does calibration across groups mean, and why is it important when humans use scores to make decisions?
- State the Chouldechova impossibility result in plain language.
- When base rates differ across groups, which fairness criteria are mutually exclusive?
- For a medical screening use case, which fairness metric should you prioritise and why?
- What does individual fairness require, and what makes it difficult to operationalise?