Fairness Metrics
Algorithmic fairness cannot be determined by intuition — it requires precise mathematical definitions that can be measured and compared across groups. The challenge is that there are multiple distinct, internally consistent fairness criteria, and they are often mutually incompatible. Choosing the right metric for your context is not a technical decision — it is a values decision that must be made explicitly by the relevant stakeholders, including affected communities.
Notation and Setup
For a binary classification problem:
- Y — true outcome (1 = positive, 0 = negative)
- Ŷ — model prediction (1 = predicted positive, 0 = predicted negative)
- A — sensitive attribute (e.g., race, gender, age group)
- S — model score (continuous probability output before thresholding)
- Groups a₀ and a₁ — the two demographic groups being compared
Group Fairness Metrics
Group fairness (also called statistical fairness) requires that some statistical property of the model's predictions is equal across demographic groups.
| Metric | Definition | When to use |
|---|---|---|
| Demographic Parity | P(Ŷ=1 \| A=a₀) = P(Ŷ=1 \| A=a₁): equal positive prediction rates across groups | Anti-discrimination in access decisions; when historical outcomes are themselves discriminatory |
| Equal Opportunity | P(Ŷ=1 \| Y=1, A=a₀) = P(Ŷ=1 \| Y=1, A=a₁): equal true positive rates (recall) across groups | High-stakes classification where failing to identify true positives causes harm (medical screening, fraud detection for victims) |
| Equalized Odds | P(Ŷ=1 \| Y=y, A=a₀) = P(Ŷ=1 \| Y=y, A=a₁) for y ∈ {0, 1}: equal TPR and FPR across groups simultaneously | When both false positive and false negative harms are significant; stronger than equal opportunity |
| Predictive Parity | P(Y=1 \| Ŷ=1, A=a₀) = P(Y=1 \| Ŷ=1, A=a₁): equal precision (PPV) across groups | When a positive prediction triggers a costly intervention; both groups should receive equally reliable positive predictions |
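The four metrics above reduce to comparing a handful of per-group rates. A minimal sketch of computing them with NumPy (the function name and toy data are hypothetical):

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Per-group selection rate, TPR (recall), and PPV (precision)."""
    rates = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        sel = yp.mean()                                          # P(Yhat=1 | A=g)
        tpr = yp[yt == 1].mean() if (yt == 1).any() else np.nan  # P(Yhat=1 | Y=1, A=g)
        ppv = yt[yp == 1].mean() if (yp == 1).any() else np.nan  # P(Y=1 | Yhat=1, A=g)
        rates[g] = {"selection_rate": sel, "tpr": tpr, "ppv": ppv}
    return rates

# toy data: two groups of four examples each
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
group  = np.array(["a0", "a0", "a0", "a0", "a1", "a1", "a1", "a1"])
print(group_rates(y_true, y_pred, group))
```

Demographic parity compares `selection_rate` across groups, equal opportunity compares `tpr`, equalized odds additionally compares FPR, and predictive parity compares `ppv`.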
Calibration
A model is calibrated if its score output S represents a true probability of the positive outcome. A score of 0.7 should mean the event occurs 70% of the time for examples with that score. Calibration across groups requires this to hold within each demographic group separately.
Why calibration matters
Decision-makers using model scores to inform (not just automate) decisions rely on the score representing a meaningful probability. If a model is miscalibrated for Group B, a score of 0.7 for Group B actually means a different underlying risk than 0.7 for Group A.
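One common way to check this is a binned reliability comparison: within each group, compare the mean score in each bin to the observed positive rate. A sketch, assuming scores in [0, 1] (the function name and bin count are illustrative):

```python
import numpy as np

def calibration_by_group(scores, y_true, group, n_bins=5):
    """For each group, return (mean score, observed positive rate) per score bin.
    A model calibrated within groups has these two numbers close in every bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    out = {}
    for g in np.unique(group):
        s, y = scores[group == g], y_true[group == g]
        rows = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (s >= lo) & (s < hi) if hi < 1.0 else (s >= lo)
            if in_bin.any():
                rows.append((float(s[in_bin].mean()), float(y[in_bin].mean())))
        out[g] = rows
    return out

# synthetic scores that are calibrated by construction
rng = np.random.default_rng(0)
scores = rng.uniform(size=2000)
y_true = (rng.uniform(size=2000) < scores).astype(int)
group = np.array(["a0"] * 1000 + ["a1"] * 1000)
print(calibration_by_group(scores, y_true, group))
```

Large gaps between the two numbers in a bin for one group but not the other are the per-group miscalibration the paragraph above describes.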
Calibration vs equalized odds conflict
The Chouldechova (2017) result shows that if base rates differ between groups, a well-calibrated model cannot simultaneously achieve equal false positive rates and equal false negative rates across groups, except in the degenerate case of a perfect predictor. This is a mathematical impossibility, not an engineering failure.
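The result follows from an identity relating the error rates: FPR = p/(1−p) · (1−PPV)/PPV · (1−FNR), where p is the group's base rate. The toy numbers below (not from any real system) show that if PPV and FNR are held equal across groups, different base rates force different FPRs:

```python
def implied_fpr(base_rate, ppv, fnr):
    """FPR implied by the identity FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR)."""
    p = base_rate
    return p / (1 - p) * (1 - ppv) / ppv * (1 - fnr)

# hold PPV (a calibration proxy) and FNR equal across two groups
ppv, fnr = 0.6, 0.3
fpr_a = implied_fpr(0.3, ppv, fnr)   # base rate 0.3 -> FPR 0.2
fpr_b = implied_fpr(0.5, ppv, fnr)   # base rate 0.5 -> FPR ~0.467
print(fpr_a, fpr_b)  # the two FPRs cannot be equal
```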
Individual Fairness
Individual fairness shifts the level of analysis from groups to individuals: similar individuals should receive similar predictions. Formally: if d(x₁, x₂) is small (individuals are similar on relevant features), then |S(x₁) - S(x₂)| should also be small.
- Strengths: Captures the intuition that discrimination is about treating similar people differently; compatible with contexts where group-level fairness is legally constrained
- Weaknesses: Defining the similarity metric d is itself a value-laden choice; hard to audit at scale; does not guarantee aggregate equity across groups
- Counterfactual fairness: A related concept — the prediction would be the same in the counterfactual world where the individual belonged to a different demographic group, all else equal
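A brute-force audit of the Lipschitz-style condition can be sketched as follows; the Euclidean metric and the constant L here stand in for the value-laden similarity choice noted above, and all names are illustrative:

```python
import itertools
import numpy as np

def lipschitz_violations(X, scores, metric, L=1.0):
    """Flag pairs (i, j) where |S(x_i) - S(x_j)| > L * d(x_i, x_j).
    The metric d and the constant L encode the similarity judgment."""
    violations = []
    for i, j in itertools.combinations(range(len(X)), 2):
        if abs(scores[i] - scores[j]) > L * metric(X[i], X[j]):
            violations.append((i, j))
    return violations

euclid = lambda a, b: float(np.linalg.norm(a - b))
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
scores = np.array([0.2, 0.8, 0.9])   # points 0 and 1 are near-identical but scored very differently
print(lipschitz_violations(X, scores, euclid))  # flags the (0, 1) pair
```

The pairwise loop is O(n²), which is exactly the "hard to audit at scale" weakness listed above; practical audits typically sample pairs instead.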
Impossibility Theorems
Multiple mathematical results prove that common fairness criteria cannot be satisfied simultaneously when base rates (prevalence of the positive outcome) differ across groups:
| Result | What it shows | Implication |
|---|---|---|
| Chouldechova (2017) | Calibration + equal FPR + equal FNR are mutually exclusive when base rates differ | Risk score tools like COMPAS cannot satisfy all three simultaneously |
| Kleinberg et al. (2016) | Calibration within groups is incompatible with balance for the positive and negative classes (equal average scores among positives, and among negatives, across groups), except when base rates are equal or prediction is perfect | You must choose which fairness criterion matters most for your use case |
| Demographic Parity vs Accuracy | Enforcing demographic parity reduces accuracy when base rates differ (Hardt et al., 2016) | Fairness interventions have a measurable cost; this cost is ethically worth paying in many contexts |
These results do not mean fairness is impossible. They mean that fairness criteria encode different value choices, and teams must explicitly choose which harms to prioritise minimising — typically in consultation with affected communities and domain experts.
Choosing the Right Metric
| Use case | Harm to avoid | Preferred metric |
|---|---|---|
| Hiring or lending | Unequal access to opportunity | Demographic parity or equal opportunity |
| Medical screening | Missing disease in one group (false negatives) | Equal opportunity (equal recall) |
| Criminal risk scoring | False accusations (false positives) | Equal FPR (component of equalized odds) |
| Fraud detection | Over-flagging legitimate transactions | Calibration + equal precision |
| Content moderation | Disproportionate suppression of one group's speech | Equalized odds across demographic groups |
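When the chosen metric is equal opportunity, one common post-processing remedy (in the spirit of Hardt et al., 2016) is to pick group-specific score thresholds that match true positive rates. A sketch on synthetic data; the function name, data, and target are all illustrative:

```python
import numpy as np

def thresholds_for_equal_tpr(scores, y_true, group, target_tpr=0.8):
    """Per-group score thresholds so each group catches ~target_tpr of its
    true positives (equal opportunity via post-processing). A sketch only:
    a real deployment would tune and validate on held-out data."""
    thresholds = {}
    for g in np.unique(group):
        pos_scores = scores[(group == g) & (y_true == 1)]
        # the (1 - target_tpr) quantile of positives' scores yields ~target TPR
        thresholds[g] = float(np.quantile(pos_scores, 1 - target_tpr))
    return thresholds

# synthetic data: positives score higher on average in both groups
rng = np.random.default_rng(1)
n = 4000
group = np.array(["a0"] * 2000 + ["a1"] * 2000)
y_true = rng.integers(0, 2, size=n)
scores = 0.35 * y_true + 0.6 * rng.uniform(size=n)
th = thresholds_for_equal_tpr(scores, y_true, group, target_tpr=0.8)
print(th)  # group-specific thresholds; TPR is ~0.8 in each group
```

Note the trade-off the impossibility results predict: equalizing TPR this way generally leaves selection rates and FPRs unequal when base rates or score distributions differ.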
Checklist: Do You Understand This?
- What is the difference between demographic parity and equal opportunity?
- Define equalized odds — what two conditions must simultaneously hold?
- What does calibration across groups mean, and why is it important when humans use scores to make decisions?
- State the Chouldechova impossibility result in plain language.
- When base rates differ across groups, which fairness criteria are mutually exclusive?
- For a medical screening use case, which fairness metric should you prioritise and why?
- What does individual fairness require, and what makes it difficult to operationalise?