Regression Testing
An LLM application can break without anyone touching the code. Provider model updates, changes to retrieved context, prompt edits by a non-engineer, or seasonal data drift can all silently degrade quality. Regression testing is the practice of running a fixed evaluation suite on every change — and blocking deployment when quality drops. This page covers the regression testing pipeline, what triggers a regression run, and how to detect prompt drift in production.
What Triggers a Regression Run
| Trigger | What it catches | How often |
|---|---|---|
| PR / code push | Prompt changes, tool changes, retrieval changes | Every change |
| Model version update | Provider-side behaviour changes, capability shifts | On provider announcement |
| Scheduled nightly run | Prompt drift — gradual behaviour change without code change | Nightly or weekly |
| Knowledge base update | RAG quality impact of new/changed documents | On index update |
| Production alert | Degraded quality detected by monitoring → run full eval to confirm scope | On alert |
The CI/CD Regression Pipeline
Standard regression pipeline:
- Load the golden dataset: fixed set of (input, expected_output, context) tuples stored in version control alongside the prompt
- Run the application under test: execute the current prompt + retrieval pipeline against every input in the dataset
- Score each output: apply evaluation metrics (LLM-as-judge, similarity, contains, schema) to each result
- Compare to baseline: compare mean scores to the baseline scores recorded at the last passing release
- Gate the deployment: fail the CI job if any metric drops more than the configured threshold (e.g., >5% degradation)
- Update baseline on merge: when a PR passes and merges, record its scores as the new baseline
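The compare-and-gate steps above can be sketched in a few lines. This is an illustrative sketch, not any specific tool's API; the function name and the 5% default threshold are assumptions:

```python
def regression_gate(current_scores, baseline_mean, max_drop=0.05):
    """Compare this run's mean score to the recorded baseline and gate the deploy.

    current_scores: per-example metric scores from this run (0.0-1.0)
    baseline_mean:  mean score recorded at the last passing release
    max_drop:       maximum tolerated relative degradation (here 5%)
    Returns (passed, mean_score, relative_drop).
    """
    mean_score = sum(current_scores) / len(current_scores)
    # Relative degradation against the baseline, not an absolute floor
    relative_drop = (baseline_mean - mean_score) / baseline_mean
    return relative_drop <= max_drop, mean_score, relative_drop

# A run averaging 0.76 against a 0.82 baseline has dropped ~7.3% and fails:
passed, mean, drop = regression_gate([0.76] * 100, baseline_mean=0.82)
```

In CI, a False result fails the job; on merge, the new mean is written back as the baseline for the next release.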
Key principle:
Compare to baseline, not to an absolute threshold. A prompt that scored 82% last release and now scores 78% has regressed — even if 78% is "acceptable" in isolation. Relative degradation is the signal.
Prompt Drift
Prompt drift is the gradual change in LLM output behaviour over time — even when the prompt itself has not changed. It happens because model providers update their models continuously, temperature sampling is non-deterministic, and context changes (retrieved documents, conversation history) accumulate over time.
Causes of prompt drift
- Model updates: provider silently updates the model behind an API version — behaviour shifts without any code change on your side
- Retrieval drift: documents in your knowledge base change over time — the same query returns different context, producing different outputs
- Prompt creep: small unauthorised edits accumulate in the system prompt without going through evaluation
- Temperature sampling: at temperature > 0, score variance is real — weekly averages can drift without a true regression
Drift detection
- Run your regression suite weekly even when no code changes — scheduled CI job
- Track score history over time in a chart — gradual drift shows as a trend, not a sudden drop
- Use a fixed evaluation model (e.g., always GPT-4o-2024-08-06) — never let the judge model update without re-baselining
- Pin model versions via API version string where providers support it (e.g., OpenAI's gpt-4o-2024-08-06)
- Alert when the 7-day rolling average drops >3% from the 30-day average
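The rolling-average alert rule can be sketched as follows. The window sizes and 3% threshold mirror the rule of thumb above; the function name is illustrative:

```python
from statistics import mean

def drift_alert(daily_scores, short_window=7, long_window=30, threshold=0.03):
    """Flag drift when the short-window rolling average falls more than
    `threshold` (relative) below the long-window average.

    daily_scores: chronological list of daily mean evaluation scores.
    """
    if len(daily_scores) < long_window:
        return False  # not enough history to compare windows yet
    recent = mean(daily_scores[-short_window:])        # 7-day rolling average
    reference = mean(daily_scores[-long_window:])      # 30-day average
    return (reference - recent) / reference > threshold
```

Because drift is gradual, this fires on a sustained downward trend rather than a single noisy run, which a per-run threshold would miss.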
Golden Dataset Management
The golden dataset is the foundation of regression testing. Its quality determines whether regressions are caught. A dataset that is too small misses coverage; one that is too large is expensive to run on every PR.
Dataset management rules:
- Version control the dataset: store alongside the prompt in git — a prompt change and dataset change must be reviewed together
- Size target: 50–200 examples per prompt for CI (fast enough to run on every PR); 500–1000 for weekly full eval
- Cover all task types: ensure the dataset samples proportionally from all input categories, not just the easiest ones
- Include real failure cases: when a regression is found and fixed, add the failing input to the dataset so it never regresses again
- Refresh quarterly: add new examples reflecting new user behaviour patterns; archive outdated examples
- Never modify examples to make tests pass: the dataset is ground truth — only the prompt or application changes
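In practice the golden dataset is often a JSONL file checked into git next to the prompt. A minimal loader with a CI sanity check might look like this (the file layout and field names are assumptions, not a standard):

```python
import json

def load_golden_dataset(path):
    """Load (input, expected_output, context) examples from a JSONL file
    stored in version control alongside the prompt.

    Fails fast in CI if any example is missing required fields, so a bad
    dataset edit is caught in review rather than silently skipped.
    """
    examples = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            ex = json.loads(line)
            missing = {"input", "expected_output"} - ex.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            examples.append(ex)  # "context" is optional for non-RAG prompts
    return examples
```

Storing the dataset this way means a prompt change and a dataset change land in the same PR diff and get reviewed together.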
Tools for Regression Testing
promptfoo
YAML-driven test runner with built-in CI integration. Define prompts, datasets, and assertions in config; run promptfoo eval in CI. Produces HTML comparison reports and a JSON results file suitable for baseline tracking. Can compare current PR against main branch automatically.
DeepEval + Confident AI
DeepEval runs locally and in CI (GitHub Actions plugin). Confident AI (hosted service) adds regression suite dashboards, baseline tracking, and A/B prompt comparison — it becomes a quality gate that blocks merges if scores drop.
LangSmith + Evidently AI
LangSmith stores evaluation runs and tracks scores over time. Evidently AI provides statistical drift detection — it tests whether score distributions have shifted significantly using statistical tests, distinguishing real drift from sampling noise.
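Evidently's exact API aside, the underlying idea — testing whether two score samples plausibly come from the same distribution — can be sketched with a simple permutation test on the difference in means (pure stdlib, illustrative only):

```python
import random

def permutation_test(baseline_scores, current_scores, n_permutations=5000, seed=0):
    """Estimate the probability that the observed gap in mean scores could
    arise from sampling noise alone (two-sided permutation test).

    A small p-value suggests real drift; a large one suggests the gap is
    within normal run-to-run variance (e.g., temperature sampling).
    """
    rng = random.Random(seed)
    observed = abs(sum(baseline_scores) / len(baseline_scores)
                   - sum(current_scores) / len(current_scores))
    pooled = list(baseline_scores) + list(current_scores)
    n = len(baseline_scores)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel scores at random and recompute the gap
        gap = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / (len(pooled) - n))
        if gap >= observed:
            extreme += 1
    return extreme / n_permutations
```

This is what separates "the weekly average moved" from "the distribution actually shifted" — the same distinction the temperature-sampling caveat above is about.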
Setting Regression Thresholds
Threshold guidelines:
- Absolute minimum: block if any metric falls below a floor (e.g., faithfulness <0.7 is always a failure regardless of baseline)
- Relative regression: block if any metric drops >5% from baseline (sensitive) or >10% (lenient, for early projects)
- Different thresholds per metric: safety metrics (hallucination rate, refusal accuracy) have tighter thresholds than style metrics (tone, conciseness)
- Accept improvements: if a change improves scores across the board, update the baseline — do not block improvements
- Human review for borderline cases: if a change degrades one metric but improves another, route to human reviewer rather than auto-blocking
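Combining absolute floors with per-metric relative thresholds might look like the sketch below. Metric names, floor values, and drop limits are illustrative:

```python
def check_thresholds(current, baseline, floors, max_drops, default_drop=0.05):
    """Apply absolute floors and per-metric relative-regression limits.

    current/baseline: {metric_name: mean_score}
    floors:           absolute minimums (always a failure below these)
    max_drops:        per-metric tolerated relative drop from baseline
    Returns a list of failure messages; an empty list means the gate passes.
    """
    failures = []
    for metric, score in current.items():
        if metric in floors and score < floors[metric]:
            failures.append(f"{metric}: {score:.2f} below floor {floors[metric]}")
        base = baseline.get(metric)
        if base:
            drop = (base - score) / base  # negative drop = improvement, passes
            if drop > max_drops.get(metric, default_drop):
                failures.append(f"{metric}: dropped {drop:.1%} from baseline")
    return failures
```

Safety metrics get tight `max_drops` entries; style metrics can rely on the looser default. A run that degrades one metric while improving another produces a single failure message, which a CI step can route to a human reviewer instead of auto-blocking.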
Checklist: Do You Understand This?
- What five events should trigger a regression test run, and what does each one catch?
- Why should regression testing compare to a baseline rather than an absolute threshold?
- What is prompt drift, and what are the four main causes of it?
- What size should a golden dataset be for CI regression runs vs weekly full evaluations?
- Why should the evaluation judge model (LLM-as-judge) be pinned to a fixed version?
- What should happen when a regression test catches a bug that gets fixed — how does the dataset change?