🧠 All Things AI
Intermediate

Bias Testing

Declaring that a model "has been evaluated for bias" is not the same as having actually tested for it. Effective bias testing is a systematic process — not a one-time checkbox — that uses multiple complementary techniques to surface different types of disparities. This page covers the core testing methods and the tooling that supports them, plus how to integrate bias testing into the ML development lifecycle.

Slice Analysis

Slice analysis is the practice of disaggregating evaluation metrics by subgroup. Instead of reporting one overall accuracy number, you compute accuracy (precision, recall, F1, etc.) for each demographic subgroup or meaningful segment of the population.

How to run slice analysis

  • Identify the sensitive attributes relevant to your use case (age, gender, ethnicity, geography)
  • Ensure your test set is large enough per slice for meaningful statistics (minimum ~100 examples per group)
  • Compute the same metrics used for overall model evaluation, per slice
  • Flag any slice with a metric gap exceeding your defined threshold (e.g., >5pp recall gap)
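The steps above can be sketched in plain Python. This is a minimal illustration only; a library such as Fairlearn's MetricFrame (covered in the tooling section) does the same disaggregation with less code. The 0.05 default below mirrors the 5pp gap threshold mentioned above.

```python
from typing import Dict, List

def slice_metrics(y_true: List[int], y_pred: List[int],
                  groups: List[str]) -> Dict[str, Dict[str, float]]:
    """Compute recall and support per demographic slice (a minimal slice analysis)."""
    results: Dict[str, Dict[str, float]] = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        tp = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
        fn = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 0)
        recall = tp / (tp + fn) if (tp + fn) else float("nan")
        results[g] = {"recall": recall, "support": len(idx)}
    return results

def flag_gaps(results: Dict[str, Dict[str, float]],
              metric: str = "recall", threshold: float = 0.05) -> List[str]:
    """Flag slices whose metric trails the best slice by more than the threshold."""
    best = max(r[metric] for r in results.values())
    return [g for g, r in results.items() if best - r[metric] > threshold]
```

Note the `support` field: it lets you apply the minimum-examples-per-slice check before trusting any per-group number.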

Common pitfalls

  • Insufficient test data per slice — small groups yield high-variance estimates
  • Testing on the same distribution as training — slice analysis only catches what is in the test set
  • Intersectional groups ignored — "women over 65 with low income" is a slice that matters
  • Metrics chosen post-hoc to show fairness rather than pre-specified

Counterfactual Fairness Testing

Counterfactual testing asks: would the model output the same prediction if the sensitive attribute were changed while all other features were held constant? This directly operationalises the anti-discrimination intuition that decisions should not depend on protected characteristics.

Counterfactual test procedure

  1. Take a test example and record the model prediction
  2. Create a counterfactual version: change the sensitive attribute value (e.g., "male" → "female") while keeping all other features identical
  3. Record the model prediction on the counterfactual
  4. Flag cases where the prediction changes — these are potential bias signals
  5. Aggregate the flip rate across your test set to get a counterfactual fairness score
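The procedure might look like the following sketch, assuming the model is exposed as a callable that takes a feature dict; the attribute name and value pair are illustrative, not a prescribed interface.

```python
def counterfactual_flip_rate(model, examples, attr, values=("male", "female")):
    """Fraction of examples whose prediction changes when the sensitive
    attribute is swapped while all other features are held fixed."""
    flips = 0
    for x in examples:
        original = model(x)
        cf = dict(x)  # copy so the original example is untouched
        cf[attr] = values[1] if x[attr] == values[0] else values[0]
        if model(cf) != original:
            flips += 1
    return flips / len(examples)
```

A flip rate near zero is necessary but not sufficient: the model may still depend on proxies for the sensitive attribute that this swap leaves untouched.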

For language models, counterfactual testing uses paired prompt templates: swap gendered pronouns, names associated with different ethnicities, or explicit demographic references and compare output sentiment, quality, or classification outcome. For example: "The [man/woman] applied for the loan. Based on this information, what is your assessment?" — responses should not systematically differ.
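Generating such paired prompts is a simple template expansion; the slot name and wording here are illustrative.

```python
def paired_prompts(template: str, slot: str, variants: list) -> list:
    """Expand a prompt template into one version per demographic variant,
    for side-by-side comparison of model outputs."""
    return [template.format(**{slot: v}) for v in variants]

prompts = paired_prompts(
    "The {subject} applied for the loan. "
    "Based on this information, what is your assessment?",
    "subject",
    ["man", "woman"],
)
```

Each pair is then sent to the model and the outputs compared on sentiment, quality, or classification outcome.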

Adversarial Probing

Adversarial probing uses specifically constructed inputs to reveal stereotyping, differential treatment, or toxicity amplification in language models and classifiers.

Stereotype probing

Use fill-in-the-blank templates to test for stereotyped completions: "The [nurse/engineer/doctor] was good at their job because they were ___." Measure differential completion patterns across demographic variations of the template.

Toxicity amplification

Test whether toxic content detection or generation rates differ systematically by demographic group. A content moderation model that is more likely to flag dialect speech associated with one group is exhibiting disparate impact.

Word embedding association tests

WEAT (Word Embedding Association Test) and related methods measure the degree to which demographic-associated words are closer in embedding space to positive/negative attribute words — surfacing implicit associations in model representations.
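A toy sketch of the WEAT effect size, following the standard formulation (difference in mean differential association of two target word sets, normalised by the pooled standard deviation). The vectors below are illustrative stand-ins for real word embeddings.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def assoc(w, A, B):
    """s(w, A, B): mean similarity to attribute set A minus mean similarity to B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """WEAT effect size for target sets X, Y against attribute sets A, B."""
    sX = [assoc(x, A, B) for x in X]
    sY = [assoc(y, A, B) for y in Y]
    return (np.mean(sX) - np.mean(sY)) / np.std(sX + sY, ddof=1)
```

A large positive effect size means X-words sit closer to A-attribute words (and Y-words to B-attribute words) than chance would suggest.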

Image generation probing

For image generation, run neutral prompts ("a CEO", "a nurse", "a criminal") and measure the demographic distribution of generated images. Count representation percentages across hundreds of generations.
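Assuming a separate annotation pass (human raters or an attribute classifier) has produced a demographic label per generated image, tallying representation shares is straightforward. A sketch:

```python
from collections import Counter

def representation_share(labels):
    """Given one demographic label per generated image, return each
    group's share of the batch."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}
```

Compare these shares across prompts ("a CEO" vs "a nurse") and against a reference distribution to quantify skew.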

Fairness Testing Tooling

  • Fairlearn (Microsoft). Best for: classification and regression with scikit-learn. Key capabilities: MetricFrame for slice analysis; fairness-aware algorithms (Exponentiated Gradient, Threshold Optimizer).
  • AI Fairness 360 (AIF360) (IBM Research). Best for: research and production bias analysis. Key capabilities: 70+ fairness metrics; pre-processing, in-processing, and post-processing mitigation algorithms.
  • What-If Tool (Google). Best for: interactive exploration in notebooks and TensorBoard. Key capabilities: visual counterfactual editing; slice analysis dashboard; multiple fairness metric overlays.
  • Aequitas (UChicago DSaPP). Best for: audit reporting and policy decisions. Key capabilities: bias audits with statistical significance; exportable audit reports; interactive web UI.
  • Responsible AI Toolbox (Microsoft). Best for: an end-to-end responsible-AI dashboard. Key capabilities: integrates fairness, error analysis, interpretability, causal analysis, and counterfactuals in one UI.

Integrating Bias Testing into CI/CD

Bias testing should be automated and continuous — run on every model update, not just at initial deployment. Treat fairness metrics as first-class quality gates alongside accuracy metrics.

1. Define fairness thresholds

Before development: specify which metrics matter (demographic parity gap, recall gap), at what threshold a model fails the fairness gate, and for which groups. Document these as acceptance criteria.

2. Build a stratified test set

Ensure test data includes sufficient examples per sensitive group. Tag examples with demographic labels (where legally permitted and ethically appropriate for your context). Freeze the test set — do not update it based on model results.

3. Automated fairness evaluation

After each training run: compute the full MetricFrame or AIF360 audit. Calculate group-level metric deltas. Compare against thresholds. Fail the pipeline if any metric exceeds the defined gap threshold.

4. Bias regression gate

Block model promotion to staging or production if fairness metrics regress vs the current production model — even if overall accuracy improves. Treat a fairness regression as a blocker, not a trade-off to be made ad hoc.

5. Mitigation and re-test

If a model fails the fairness gate: apply mitigation techniques (resampling, re-weighting, threshold adjustment, adversarial debiasing). Re-run the full evaluation. Document the mitigation applied and the residual gap.
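The gate logic in the first four steps might be sketched as follows, assuming each fairness metric is stored as a group-gap value (e.g. a recall gap); the metric names and data structures are illustrative.

```python
def fairness_gate(candidate: dict, production: dict, thresholds: dict) -> list:
    """Return the reasons a candidate model fails the fairness gate:
    a per-group gap above its threshold, or a regression vs production.
    An empty list means the candidate passes."""
    failures = []
    for metric, max_gap in thresholds.items():
        if candidate[metric] > max_gap:
            failures.append(f"{metric} gap {candidate[metric]:.3f} "
                            f"exceeds threshold {max_gap}")
        if candidate[metric] > production[metric]:
            failures.append(f"{metric} regressed vs production "
                            f"({production[metric]:.3f} -> {candidate[metric]:.3f})")
    return failures
```

In a CI pipeline, a non-empty failure list would fail the job and block promotion, mirroring the regression-as-blocker policy above.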

Bias Mitigation Techniques

Pre-processing

Resampling training data to balance representation across groups; re-weighting examples; transforming features to remove group correlation. Applied before training begins.
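A minimal re-weighting sketch: inverse-frequency weights so every group contributes equal total weight, assuming the trainer accepts per-example sample weights (as scikit-learn's `sample_weight` arguments do).

```python
from collections import Counter

def group_weights(groups):
    """Per-example weights such that each group's weights sum to the same
    total (inverse-frequency re-weighting, a simple pre-processing step)."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]
```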

In-processing

Adding fairness constraints to the training objective; adversarial debiasing (training an adversary to be unable to predict the sensitive attribute from model representations); multi-task fairness objectives.

Post-processing

Adjusting prediction thresholds per group to equalise a specific metric (e.g., set different score cut-offs for each group to achieve equal opportunity). Does not retrain the model — quickest to deploy.
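A sketch of per-group threshold selection targeting equal opportunity (equal true-positive rate across groups); the target rate and data layout are illustrative, and production systems would also validate the resulting false-positive rates.

```python
import numpy as np

def per_group_thresholds(scores, y_true, groups, target_tpr=0.8):
    """For each group, pick the highest score cut-off that still reaches
    the target true-positive rate on that group's positives."""
    cutoffs = {}
    for g in set(groups):
        pos = sorted((s for s, y, grp in zip(scores, y_true, groups)
                      if grp == g and y == 1), reverse=True)
        k = int(np.ceil(target_tpr * len(pos)))  # positives we must accept
        cutoffs[g] = pos[k - 1] if k else float("inf")
    return cutoffs
```

Applying `score >= cutoffs[group]` at inference then equalises recall across groups without retraining.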

Data augmentation

Generating additional training examples for underrepresented groups; counterfactual augmentation (adding flipped-attribute versions of existing examples to the training set to reduce spurious correlations).
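Counterfactual augmentation can be sketched as follows, assuming a binary sensitive attribute with a known swap mapping; the field names are hypothetical.

```python
def counterfactual_augment(rows, attr, swap=None):
    """Append a flipped-attribute copy of each row, keeping the label,
    so the sensitive attribute no longer correlates with the outcome."""
    if swap is None:
        swap = {"male": "female", "female": "male"}
    flipped = [{**r, attr: swap[r[attr]]} for r in rows if r[attr] in swap]
    return rows + flipped
```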

Checklist: Do You Understand This?

  • What is slice analysis and what is the minimum test set size per slice for reliable results?
  • Describe counterfactual fairness testing: what does it test and how is it operationalised?
  • What is the difference between pre-processing, in-processing, and post-processing mitigation?
  • Name two open-source tools for fairness evaluation and what they are best used for.
  • What does it mean to treat a fairness metric as a quality gate in a CI/CD pipeline?
  • Why should intersectional subgroups (e.g., elderly women with low income) be considered in slice analysis?