Regression Testing
An LLM application can break without anyone touching the code. Provider model updates, changes to retrieved context, prompt edits by a non-engineer, or seasonal data drift can all silently degrade quality. Regression testing is the practice of running a fixed evaluation suite on every change — and blocking deployment when quality drops. This page covers the regression testing pipeline, what triggers a regression run, and how to detect prompt drift in production.
What Triggers a Regression Run
| Trigger | What it catches | How often |
|---|---|---|
| PR / code push | Prompt changes, tool changes, retrieval changes | Every change |
| Model version update | Provider-side behaviour changes, capability shifts | On provider announcement |
| Scheduled nightly run | Prompt drift — gradual behaviour change without code change | Nightly or weekly |
| Knowledge base update | RAG quality impact of new/changed documents | On index update |
| Production alert | Degraded quality detected by monitoring → run full eval to confirm scope | On alert |
The CI/CD Regression Pipeline
Standard regression pipeline:
- Load the golden dataset: fixed set of (input, expected_output, context) tuples stored in version control alongside the prompt
- Run the application under test: execute the current prompt + retrieval pipeline against every input in the dataset
- Score each output: apply evaluation metrics (LLM-as-judge, similarity, contains, schema) to each result
- Compare to baseline: compare mean scores to the baseline scores recorded at the last passing release
- Gate the deployment: fail the CI job if any metric drops more than the configured threshold (e.g., >5% degradation)
- Update baseline on merge: when a PR passes and merges, record its scores as the new baseline
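The compare-and-gate steps above can be sketched in a few lines. This is an illustrative sketch, not any specific tool's API; the function name and the 5% default threshold are assumptions:

```python
def regression_gate(current_scores, baseline_mean, max_drop=0.05):
    """Compare this run's mean score to the recorded baseline and gate the deploy.

    current_scores: per-example metric scores from this run (0.0-1.0)
    baseline_mean:  mean score recorded at the last passing release
    max_drop:       maximum tolerated relative degradation (here 5%)
    Returns (passed, mean_score, relative_drop).
    """
    mean_score = sum(current_scores) / len(current_scores)
    # Relative degradation against the baseline, not an absolute floor
    relative_drop = (baseline_mean - mean_score) / baseline_mean
    return relative_drop <= max_drop, mean_score, relative_drop

# A run averaging 0.76 against a 0.82 baseline has dropped ~7.3% and fails:
passed, mean, drop = regression_gate([0.76] * 100, baseline_mean=0.82)
```

In CI, a False result fails the job; on merge, the new mean is written back as the baseline for the next release.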
Key principle:
Compare to baseline, not to an absolute threshold. A prompt that scored 82% last release and now scores 78% has regressed — even if 78% is "acceptable" in isolation. Relative degradation is the signal.
Prompt Drift
Prompt drift is the gradual change in LLM output behaviour over time — even when the prompt itself has not changed. It happens because model providers update their models continuously, temperature sampling is non-deterministic, and context changes (retrieved documents, conversation history) accumulate over time.
Causes of prompt drift
- Model updates: provider silently updates the model behind an API version — behaviour shifts without any code change on your side
- Retrieval drift: documents in your knowledge base change over time — the same query returns different context, producing different outputs
- Prompt creep: small unauthorised edits accumulate in the system prompt without going through evaluation
- Temperature sampling: at temperature > 0, score variance is real — weekly averages can drift without a true regression
Drift detection
- Run your regression suite weekly even when no code changes — scheduled CI job
- Track score history over time in a chart — gradual drift shows as a trend, not a sudden drop
- Use a fixed evaluation model (e.g., always GPT-4o-2024-08-06) — never let the judge model update without re-baselining
- Pin model versions via API version string where providers support it (e.g., OpenAI's gpt-4o-2024-08-06)
- Alert when the 7-day rolling average drops >3% from the 30-day average
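The rolling-average alert rule can be sketched as follows. The window sizes and 3% threshold mirror the rule of thumb above; the function name is illustrative:

```python
from statistics import mean

def drift_alert(daily_scores, short_window=7, long_window=30, threshold=0.03):
    """Flag drift when the short-window rolling average falls more than
    `threshold` (relative) below the long-window average.

    daily_scores: chronological list of daily mean evaluation scores.
    """
    if len(daily_scores) < long_window:
        return False  # not enough history to compare windows yet
    recent = mean(daily_scores[-short_window:])        # 7-day rolling average
    reference = mean(daily_scores[-long_window:])      # 30-day average
    return (reference - recent) / reference > threshold
```

Because drift is gradual, this fires on a sustained downward trend rather than a single noisy run, which a per-run threshold would miss.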
Golden Dataset Management
The golden dataset is the foundation of regression testing. Its quality determines whether regressions are caught. A dataset that is too small misses coverage; one that is too large is expensive to run on every PR.
Dataset management rules:
- Version control the dataset: store alongside the prompt in git — a prompt change and dataset change must be reviewed together
- Size target: 50–200 examples per prompt for CI (fast enough to run on every PR); 500–1000 for weekly full eval
- Cover all task types: ensure the dataset samples proportionally from all input categories, not just the easiest ones
- Include real failure cases: when a regression is found and fixed, add the failing input to the dataset so it never regresses again
- Refresh quarterly: add new examples reflecting new user behaviour patterns; archive outdated examples
- Never modify examples to make tests pass: the dataset is ground truth — only the prompt or application changes
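In practice the golden dataset is often a JSONL file checked into git next to the prompt. A minimal loader with a CI sanity check might look like this (the file layout and field names are assumptions, not a standard):

```python
import json

def load_golden_dataset(path):
    """Load (input, expected_output, context) examples from a JSONL file
    stored in version control alongside the prompt.

    Fails fast in CI if any example is missing required fields, so a bad
    dataset edit is caught in review rather than silently skipped.
    """
    examples = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            ex = json.loads(line)
            missing = {"input", "expected_output"} - ex.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            examples.append(ex)  # "context" is optional for non-RAG prompts
    return examples
```

Storing the dataset this way means a prompt change and a dataset change land in the same PR diff and get reviewed together.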
Tools for Regression Testing
promptfoo
YAML-driven test runner with built-in CI integration. Define prompts, datasets, and assertions in config; run promptfoo eval in CI. Produces HTML comparison reports and a JSON results file suitable for baseline tracking. Can compare current PR against main branch automatically.
DeepEval + Confident AI
DeepEval runs locally and in CI (GitHub Actions plugin). Confident AI (hosted service) adds regression suite dashboards, baseline tracking, and A/B prompt comparison — it becomes a quality gate that blocks merges if scores drop.
LangSmith + Evidently AI
LangSmith stores evaluation runs and tracks scores over time. Evidently AI provides statistical drift detection — it tests whether score distributions have shifted significantly using statistical tests, distinguishing real drift from sampling noise.
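Evidently's exact API aside, the underlying idea — testing whether two score samples plausibly come from the same distribution — can be sketched with a simple permutation test on the difference in means (pure stdlib, illustrative only):

```python
import random

def permutation_test(baseline_scores, current_scores, n_permutations=5000, seed=0):
    """Estimate the probability that the observed gap in mean scores could
    arise from sampling noise alone (two-sided permutation test).

    A small p-value suggests real drift; a large one suggests the gap is
    within normal run-to-run variance (e.g., temperature sampling).
    """
    rng = random.Random(seed)
    observed = abs(sum(baseline_scores) / len(baseline_scores)
                   - sum(current_scores) / len(current_scores))
    pooled = list(baseline_scores) + list(current_scores)
    n = len(baseline_scores)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel scores at random and recompute the gap
        gap = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / (len(pooled) - n))
        if gap >= observed:
            extreme += 1
    return extreme / n_permutations
```

This is what separates "the weekly average moved" from "the distribution actually shifted" — the same distinction the temperature-sampling caveat above is about.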
Setting Regression Thresholds
Threshold guidelines:
- Absolute minimum: block if any metric falls below a floor (e.g., faithfulness <0.7 is always a failure regardless of baseline)
- Relative regression: block if any metric drops >5% from baseline (sensitive) or >10% (lenient, for early projects)
- Different thresholds per metric: safety metrics (hallucination rate, refusal accuracy) have tighter thresholds than style metrics (tone, conciseness)
- Accept improvements: if a change improves scores across the board, update the baseline — do not block improvements
- Human review for borderline cases: if a change degrades one metric but improves another, route to human reviewer rather than auto-blocking
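Combining absolute floors with per-metric relative thresholds might look like the sketch below. Metric names, floor values, and drop limits are illustrative:

```python
def check_thresholds(current, baseline, floors, max_drops, default_drop=0.05):
    """Apply absolute floors and per-metric relative-regression limits.

    current/baseline: {metric_name: mean_score}
    floors:           absolute minimums (always a failure below these)
    max_drops:        per-metric tolerated relative drop from baseline
    Returns a list of failure messages; an empty list means the gate passes.
    """
    failures = []
    for metric, score in current.items():
        if metric in floors and score < floors[metric]:
            failures.append(f"{metric}: {score:.2f} below floor {floors[metric]}")
        base = baseline.get(metric)
        if base:
            drop = (base - score) / base  # negative drop = improvement, passes
            if drop > max_drops.get(metric, default_drop):
                failures.append(f"{metric}: dropped {drop:.1%} from baseline")
    return failures
```

Safety metrics get tight `max_drops` entries; style metrics can rely on the looser default. A run that degrades one metric while improving another produces a single failure message, which a CI step can route to a human reviewer instead of auto-blocking.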
Checklist: Do You Understand This?
- What five events should trigger a regression test run, and what does each one catch?
- Why should regression testing compare to a baseline rather than an absolute threshold?
- What is prompt drift, and what are the four main causes of it?
- What size should a golden dataset be for CI regression runs vs weekly full evaluations?
- Why should the evaluation judge model (LLM-as-judge) be pinned to a fixed version?
- What should happen when a regression test catches a bug that gets fixed — how does the dataset change?