🧠 All Things AI
Intermediate

Regression Testing

An LLM application can break without anyone touching the code. Provider model updates, changes to retrieved context, prompt edits by a non-engineer, or seasonal data drift can all silently degrade quality. Regression testing is the practice of running a fixed evaluation suite on every change — and blocking deployment when quality drops. This page covers the regression testing pipeline, what triggers a regression run, and how to detect prompt drift in production.

What Triggers a Regression Run

| Trigger | What it catches | How often |
| --- | --- | --- |
| PR / code push | Prompt changes, tool changes, retrieval changes | Every change |
| Model version update | Provider-side behaviour changes, capability shifts | On provider announcement |
| Scheduled nightly run | Prompt drift — gradual behaviour change without code change | Nightly or weekly |
| Knowledge base update | RAG quality impact of new/changed documents | On index update |
| Production alert | Degraded quality detected by monitoring → run full eval to confirm scope | On alert |

The CI/CD Regression Pipeline

Standard regression pipeline:

  1. Load the golden dataset: fixed set of (input, expected_output, context) tuples stored in version control alongside the prompt
  2. Run the application under test: execute the current prompt + retrieval pipeline against every input in the dataset
  3. Score each output: apply evaluation metrics (LLM-as-judge, similarity, contains, schema) to each result
  4. Compare to baseline: compare mean scores to the baseline scores recorded at the last passing release
  5. Gate the deployment: fail the CI job if any metric drops more than the configured threshold (e.g., >5% degradation)
  6. Update baseline on merge: when a PR passes and merges, record its scores as the new baseline
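Steps 4 and 5 can be sketched in a few lines. This is a minimal illustration, not a real framework: the function name, the score dictionary shape, and the 5% default are all assumptions for the example.

```python
from statistics import mean

def gate_deployment(current_scores, baseline_scores, max_drop=0.05):
    """Compare per-metric mean scores against the last passing baseline.

    current_scores / baseline_scores: {metric_name: [per-example scores]}.
    Fails the gate if any metric's mean drops by more than max_drop
    relative to its baseline mean.
    """
    failures = {}
    for metric, baseline in baseline_scores.items():
        base_mean = mean(baseline)
        cur_mean = mean(current_scores[metric])
        drop = (base_mean - cur_mean) / base_mean  # relative degradation
        if drop > max_drop:
            failures[metric] = (base_mean, cur_mean)
    return len(failures) == 0, failures
```

In CI, a falsy first return value would fail the job; on merge, the current scores would be written back as the new baseline (step 6).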

Key principle:

Compare to baseline, not to an absolute threshold. A prompt that scored 82% last release and now scores 78% has regressed — even if 78% is "acceptable" in isolation. Relative degradation is the signal.

Prompt Drift

Prompt drift is the gradual change in LLM output behaviour over time — even when the prompt itself has not changed. It happens because model providers update their models continuously, temperature sampling is non-deterministic, and context changes (retrieved documents, conversation history) accumulate over time.

Causes of prompt drift

  • Model updates: provider silently updates the model behind an API version — behaviour shifts without any code change on your side
  • Retrieval drift: documents in your knowledge base change over time — the same query returns different context, producing different outputs
  • Prompt creep: small unauthorised edits accumulate in the system prompt without going through evaluation
  • Temperature sampling: at temperature > 0, score variance is real — weekly averages can drift without a true regression

Drift detection

  • Run your regression suite weekly even when no code changes — scheduled CI job
  • Track score history over time in a chart — gradual drift shows as a trend, not a sudden drop
  • Use a fixed evaluation model (e.g., always GPT-4o-2024-08-06) — never let the judge model update without re-baselining
  • Pin model versions via API version string where providers support it (OpenAI: gpt-4o-2024-08-06)
  • Alert when 7-day rolling average drops >3% from the 30-day average
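The rolling-average alert in the last bullet is simple to implement. A sketch, assuming one mean eval score per day in chronological order (window sizes and the 3% threshold follow the rule above; everything else is illustrative):

```python
from statistics import mean

def drift_alert(daily_scores, short_window=7, long_window=30, max_drop=0.03):
    """Flag drift when the short-window rolling mean falls more than
    max_drop (relative) below the long-window mean.

    daily_scores: chronological list of daily mean eval scores,
    most recent last.
    """
    if len(daily_scores) < long_window:
        return False  # not enough history to judge drift yet
    recent = mean(daily_scores[-short_window:])
    longer = mean(daily_scores[-long_window:])
    return (longer - recent) / longer > max_drop
```

Because drift is gradual, this fires on a sustained downward trend that a single-run threshold would miss.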

Golden Dataset Management

The golden dataset is the foundation of regression testing. Its quality determines whether regressions are caught. A dataset that is too small misses coverage; one that is too large is expensive to run on every PR.

Dataset management rules:

  • Version control the dataset: store alongside the prompt in git — a prompt change and dataset change must be reviewed together
  • Size target: 50–200 examples per prompt for CI (fast enough to run on every PR); 500–1000 for weekly full eval
  • Cover all task types: ensure the dataset samples proportionally from all input categories, not just the easiest ones
  • Include real failure cases: when a regression is found and fixed, add the failing input to the dataset so it never regresses again
  • Refresh quarterly: add new examples reflecting new user behaviour patterns; archive outdated examples
  • Never modify examples to make tests pass: the dataset is ground truth — only the prompt or application changes
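One common way to store such a dataset is a JSONL file committed next to the prompt, one example per line. A loading sketch that fails fast on malformed records (the field names mirror the tuple shape described earlier; the format choice itself is an assumption, not a requirement):

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"input", "expected_output", "context"}

def load_golden_dataset(path):
    """Load a version-controlled golden dataset from JSONL.

    Each non-empty line must be a JSON object with the fields in
    REQUIRED_FIELDS; malformed records raise instead of being
    silently skipped, so dataset errors surface in CI.
    """
    examples = []
    for lineno, line in enumerate(Path(path).read_text().splitlines(), 1):
        if not line.strip():
            continue
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
        examples.append(record)
    return examples
```

Keeping the file in git means a prompt change and a dataset change land in the same PR and get reviewed together.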

Tools for Regression Testing

promptfoo

YAML-driven test runner with built-in CI integration. Define prompts, datasets, and assertions in config; run promptfoo eval in CI. Produces HTML comparison reports and a JSON results file suitable for baseline tracking. Can compare current PR against main branch automatically.

DeepEval + Confident AI

DeepEval runs locally and in CI (GitHub Actions plugin). Confident AI (hosted service) adds regression suite dashboards, baseline tracking, and A/B prompt comparison — it becomes a quality gate that blocks merges if scores drop.

LangSmith + Evidently AI

LangSmith stores evaluation runs and tracks scores over time. Evidently AI provides statistical drift detection — it tests whether score distributions have shifted significantly using statistical tests, distinguishing real drift from sampling noise.
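Evidently's drift tests are more elaborate, but the core question (could these two score samples plausibly come from the same distribution?) can be illustrated with a basic two-sample permutation test. This is a generic statistics sketch, not Evidently's API:

```python
import random

def permutation_test(baseline, current, n_permutations=2000, seed=0):
    """Two-sample permutation test on the difference in mean scores.

    Returns a p-value: the fraction of random relabelings whose mean
    difference is at least as extreme as the observed one. A small
    p-value suggests real drift rather than sampling noise.
    """
    rng = random.Random(seed)  # fixed seed for reproducible CI runs
    observed = abs(sum(baseline) / len(baseline) - sum(current) / len(current))
    pooled = list(baseline) + list(current)
    n = len(baseline)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        a, b = pooled[:n], pooled[n:]
        if abs(sum(a) / n - sum(b) / len(b)) >= observed:
            hits += 1
    return hits / n_permutations
```

This is exactly the distinction the section draws: at temperature > 0, week-to-week averages wobble, and a significance test separates that wobble from a genuine shift.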

Setting Regression Thresholds

Threshold guidelines:

  • Absolute minimum: block if any metric falls below a floor (e.g., faithfulness <0.7 is always a failure regardless of baseline)
  • Relative regression: block if any metric drops >5% from baseline (sensitive) or >10% (lenient, for early projects)
  • Different thresholds per metric: safety metrics (hallucination rate, refusal accuracy) have tighter thresholds than style metrics (tone, conciseness)
  • Accept improvements: if a change improves scores across the board, update the baseline — do not block improvements
  • Human review for borderline cases: if a change degrades one metric but improves another, route to human reviewer rather than auto-blocking
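The rules above can be combined into one gate. The metric names, floor values, and drop thresholds below are illustrative defaults, not prescriptions:

```python
# Illustrative policy: an absolute floor for the safety metric, a
# tighter relative-drop threshold for safety than for style.
POLICY = {
    "faithfulness": {"floor": 0.70, "max_drop": 0.03},
    "tone":         {"floor": 0.00, "max_drop": 0.10},
}

def check_release(current, baseline, policy=POLICY):
    """Return (verdict, reasons). verdict is 'pass', 'fail', or
    'review' (mixed result: some metrics degraded, others improved,
    so a human should decide)."""
    reasons, degraded, improved = [], False, False
    for metric, rules in policy.items():
        cur, base = current[metric], baseline[metric]
        if cur < rules["floor"]:
            reasons.append(f"{metric} {cur:.2f} below floor {rules['floor']:.2f}")
            return "fail", reasons  # an absolute floor always blocks
        drop = (base - cur) / base if base else 0.0
        if drop > rules["max_drop"]:
            reasons.append(f"{metric} dropped {drop:.1%} from baseline")
            degraded = True
        elif cur > base:
            improved = True
    if degraded and improved:
        return "review", reasons  # borderline trade-off: route to a human
    return ("fail", reasons) if degraded else ("pass", reasons)
```

A clean pass would also trigger the baseline update described earlier, so improvements are captured rather than blocked.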

Checklist: Do You Understand This?

  • What five events should trigger a regression test run, and what does each one catch?
  • Why should regression testing compare to a baseline rather than an absolute threshold?
  • What is prompt drift, and what are the four main causes of it?
  • What size should a golden dataset be for CI regression runs vs weekly full evaluations?
  • Why should the evaluation judge model (LLM-as-judge) be pinned to a fixed version?
  • What should happen when a regression test catches a bug that gets fixed — how does the dataset change?