🧠 All Things AI

Fairness Metrics

Algorithmic fairness cannot be determined by intuition — it requires precise mathematical definitions that can be measured and compared across groups. The challenge is that there are multiple distinct, internally consistent fairness criteria, and they are often mutually incompatible. Choosing the right metric for your context is not a technical decision — it is a values decision that must be made explicitly by the relevant stakeholders, including affected communities.

Notation and Setup

For a binary classification problem:

  • Y — true outcome (1 = positive, 0 = negative)
  • Ŷ — model prediction (1 = predicted positive, 0 = predicted negative)
  • A — sensitive attribute (e.g., race, gender, age group)
  • S — model score (continuous probability output before thresholding)
  • Groups a₀ and a₁ — the two demographic groups being compared

Group Fairness Metrics

Group fairness (also called statistical fairness) requires that some statistical property of the model's predictions is equal across demographic groups.

Demographic Parity
  • Definition: P(Ŷ=1 | A=a₀) = P(Ŷ=1 | A=a₁), i.e. equal positive prediction rates across groups
  • When to use: anti-discrimination in access decisions; when historical outcomes are themselves discriminatory

Equal Opportunity
  • Definition: P(Ŷ=1 | Y=1, A=a₀) = P(Ŷ=1 | Y=1, A=a₁), i.e. equal true positive rates (recall) across groups
  • When to use: high-stakes classification where failing to identify true positives causes harm (medical screening, fraud detection for victims)

Equalized Odds
  • Definition: equal TPR and equal FPR across groups simultaneously; stronger than equal opportunity
  • When to use: when both false positive and false negative harms are significant

Predictive Parity
  • Definition: P(Y=1 | Ŷ=1, A=a₀) = P(Y=1 | Ŷ=1, A=a₁), i.e. equal precision (PPV) across groups
  • When to use: when a positive prediction triggers a costly intervention, so both groups should receive equally reliable positive predictions
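Each of these criteria can be checked directly from predictions, labels, and group membership. The sketch below (with made-up toy arrays, purely for illustration) computes the four per-group quantities the criteria compare: selection rate, TPR, FPR, and PPV.

```python
import numpy as np

# Hypothetical labels, predictions, and group membership for illustration.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 0, 0])
group  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

def group_rates(y_true, y_pred, group, g):
    """Selection rate, TPR, FPR, and PPV for one group."""
    m = group == g
    yt, yp = y_true[m], y_pred[m]
    selection = yp.mean()             # P(Yhat=1 | A=g): demographic parity
    tpr = yp[yt == 1].mean()          # P(Yhat=1 | Y=1, A=g): equal opportunity
    fpr = yp[yt == 0].mean()          # P(Yhat=1 | Y=0, A=g): second half of equalized odds
    ppv = yt[yp == 1].mean()          # P(Y=1 | Yhat=1, A=g): predictive parity
    return selection, tpr, fpr, ppv

r0 = group_rates(y_true, y_pred, group, 0)
r1 = group_rates(y_true, y_pred, group, 1)
for name, v0, v1 in zip(["selection rate", "TPR", "FPR", "PPV"], r0, r1):
    print(f"{name:14s}  group0={v0:.2f}  group1={v1:.2f}  gap={abs(v0 - v1):.2f}")
```

On this toy data the selection rates match (demographic parity holds) while the TPRs differ (equal opportunity is violated), illustrating that satisfying one criterion says nothing about the others.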

Calibration

A model is calibrated if its score output S represents a true probability of the positive outcome. A score of 0.7 should mean the event occurs 70% of the time for examples with that score. Calibration across groups requires this to hold within each demographic group separately.

Why calibration matters

Decision-makers using model scores to inform (not just automate) decisions rely on the score representing a meaningful probability. If a model is miscalibrated for one group, a score of 0.7 for that group reflects a different underlying risk than the same score does for other groups, so human reviewers will systematically over- or under-estimate risk for that group.
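A simple way to audit calibration within groups is to bin examples by score and compare the mean score in each bin against the observed positive rate. A rough sketch with hypothetical scores (`binned_calibration` is an illustrative helper, not a standard API):

```python
import numpy as np

# Hypothetical scores, outcomes, and group labels for illustration.
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.9, 0.7, 0.6, 0.4, 0.1])
y_true = np.array([1,   1,   1,   0,   0,   1,   0,   1,   1,   0])
group  = np.array([0,   0,   0,   0,   0,   1,   1,   1,   1,   1])

def binned_calibration(scores, y_true, n_bins=5):
    """Mean score vs. observed positive rate in each score bin.
    For a calibrated model the two numbers are close in every bin."""
    bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        m = bins == b
        if m.any():
            rows.append((b, scores[m].mean(), y_true[m].mean()))
    return rows

for g in (0, 1):
    m = group == g
    print(f"Group {g}:")
    for b, mean_score, pos_rate in binned_calibration(scores[m], y_true[m]):
        print(f"  bin {b}: mean score {mean_score:.2f}, observed rate {pos_rate:.2f}")
```

Calibration across groups requires running this check within each group separately; an overall reliability curve can look fine while one group's bins are badly off.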

Calibration vs equalized odds conflict

The Chouldechova (2017) result shows that if base rates differ between groups, a well-calibrated model cannot simultaneously achieve equal false positive rates and equal false negative rates across groups. This is a mathematical impossibility, not an engineering failure.
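The impossibility follows from an accounting identity: fixing a group's PPV (which calibration constrains) and its FNR pins down its FPR as a function of the base rate, so groups with different base rates cannot share all three. A sketch with illustrative numbers:

```python
def implied_fpr(base_rate, ppv, fnr):
    """Chouldechova (2017) identity: FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR).
    Follows from counting true and false positives at a fixed PPV and FNR."""
    p = base_rate
    return p / (1 - p) * (1 - ppv) / ppv * (1 - fnr)

# Same PPV and same FNR in both groups, but different base rates:
# the false positive rates are forced apart.
fpr_a = implied_fpr(base_rate=0.30, ppv=0.70, fnr=0.20)
fpr_b = implied_fpr(base_rate=0.50, ppv=0.70, fnr=0.20)
print(f"FPR group A: {fpr_a:.3f}")  # prints 0.147
print(f"FPR group B: {fpr_b:.3f}")  # prints 0.343
```

Holding PPV and FNR equal while the base rate moves from 0.30 to 0.50 more than doubles the implied FPR; no threshold tweak can avoid this, only relaxing one of the three criteria can.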

Individual Fairness

Individual fairness shifts the level of analysis from groups to individuals: similar individuals should receive similar predictions. Formally: if d(x₁, x₂) is small (individuals are similar on relevant features), then |S(x₁) - S(x₂)| should also be small.

  • Strengths: Captures the intuition that discrimination is about treating similar people differently; compatible with contexts where group-level fairness is legally constrained
  • Weaknesses: Defining the similarity metric d is itself a value-laden choice; hard to audit at scale; does not guarantee aggregate equity across groups
  • Counterfactual fairness: A related concept — the prediction would be the same in the counterfactual world where the individual belonged to a different demographic group, all else equal
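The Lipschitz-style condition above can be audited on sampled pairs. The sketch below uses a hypothetical similarity metric (Euclidean distance over two features) and a deliberately discontinuous scoring rule, to show how hard cutoffs produce individual-fairness violations; all names and data are illustrative.

```python
import itertools
import numpy as np

# Hypothetical applicants: the first two are nearly identical.
X = np.array([[0.90, 0.10], [0.85, 0.15], [0.20, 0.80], [0.25, 0.75]])

def score(x):
    # Stand-in for a thresholded model: a hard cutoff on one feature.
    return 1.0 if x[0] > 0.87 else 0.0

def lipschitz_violations(X, score, d, L=1.0):
    """Flag pairs where |S(x1) - S(x2)| > L * d(x1, x2): similar inputs
    receiving dissimilar scores, a violation of individual fairness under d."""
    violations = []
    for i, j in itertools.combinations(range(len(X)), 2):
        gap = abs(score(X[i]) - score(X[j]))
        if gap > L * d(X[i], X[j]):
            violations.append((i, j, gap))
    return violations

euclidean = lambda a, b: np.linalg.norm(a - b)
violations = lipschitz_violations(X, score, euclidean)
print(violations)  # applicant 0 clears the cutoff; near-identical applicant 1 does not
```

Note that the verdict depends entirely on the choice of d: under a metric that treats the first feature as highly relevant, the same score function might pass. That is the value-laden choice the weaknesses bullet refers to.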

Impossibility Theorems

Multiple mathematical results prove that common fairness criteria cannot be satisfied simultaneously when base rates (prevalence of the positive outcome) differ across groups:

Chouldechova (2017)
  • What it shows: calibration, equal FPR, and equal FNR are mutually exclusive when base rates differ
  • Implication: risk score tools like COMPAS cannot satisfy all three simultaneously

Kleinberg et al. (2016)
  • What it shows: calibration within groups is incompatible with equalized odds except in degenerate cases
  • Implication: you must choose which fairness criterion matters most for your use case

Demographic parity vs accuracy (Hardt et al., 2016)
  • What it shows: enforcing demographic parity reduces accuracy when base rates differ
  • Implication: fairness interventions have a measurable cost; in many contexts that cost is ethically worth paying

These results do not mean fairness is impossible. They mean that fairness criteria encode different value choices, and teams must explicitly choose which harms to prioritise minimising — typically in consultation with affected communities and domain experts.

Choosing the Right Metric

Hiring or lending
  • Harm to avoid: unequal access to opportunity
  • Preferred metric: demographic parity or equal opportunity

Medical screening
  • Harm to avoid: missing disease in one group (false negatives)
  • Preferred metric: equal opportunity (equal recall)

Criminal risk scoring
  • Harm to avoid: false accusations (false positives)
  • Preferred metric: equal FPR (a component of equalized odds)

Fraud detection
  • Harm to avoid: over-flagging legitimate transactions
  • Preferred metric: calibration + equal precision

Content moderation
  • Harm to avoid: disproportionate suppression of one group's speech
  • Preferred metric: equalized odds across demographic groups
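Once a metric is chosen, one common intervention is post-processing: picking a separate decision threshold per group so that the chosen rate matches. A rough sketch for an equal-opportunity target, in the spirit of Hardt et al. (2016); `threshold_for_tpr` and the toy data are illustrative, not a library API.

```python
import numpy as np

def threshold_for_tpr(scores, y_true, target_tpr):
    """Smallest per-group threshold whose TPR reaches at least target_tpr."""
    pos = np.sort(scores[y_true == 1])[::-1]   # positive-class scores, high to low
    k = int(np.ceil(target_tpr * len(pos)))
    return pos[k - 1]

# Hypothetical scores, outcomes, and group labels.
scores = np.array([0.9, 0.6, 0.4, 0.3, 0.7, 0.5, 0.8, 0.2])
y_true = np.array([1,   1,   0,   0,   1,   1,   0,   0])
group  = np.array([0,   0,   0,   0,   1,   1,   1,   1])

for g in (0, 1):
    m = group == g
    t = threshold_for_tpr(scores[m], y_true[m], target_tpr=0.5)
    preds = scores[m] >= t
    tpr = preds[y_true[m] == 1].mean()
    print(f"group {g}: threshold {t:.2f}, TPR {tpr:.2f}")  # TPR 0.50 in both groups
```

The two groups end up with different thresholds but the same recall. The cost shows up elsewhere, typically as different FPRs or selection rates, which is exactly the trade-off the impossibility results predict.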

Checklist: Do You Understand This?

  • What is the difference between demographic parity and equal opportunity?
  • Define equalized odds — what two conditions must simultaneously hold?
  • What does calibration across groups mean, and why is it important when humans use scores to make decisions?
  • State the Chouldechova impossibility result in plain language.
  • When base rates differ across groups, which fairness criteria are mutually exclusive?
  • For a medical screening use case, which fairness metric should you prioritise and why?
  • What does individual fairness require, and what makes it difficult to operationalise?