Prompt Unit Tests
A prompt that works today may silently break tomorrow — after a model update, a context change, or a subtle prompt edit. Prompt unit tests are the LLM equivalent of software unit tests: automated checks that verify your prompts produce correct, consistent outputs against a known set of inputs. This page covers the testing philosophy, assertion types, tooling, and how to structure a prompt test suite that actually catches regressions.
Why Prompt Testing Is Different
Traditional software unit tests assert exact outputs: assert result == "expected". LLM outputs are probabilistic — the same prompt produces different text on every call. Prompt testing must therefore assert properties of outputs rather than exact strings: does the output contain the right information? Is it the right format? Does it avoid prohibited content? Does it score above a quality threshold when evaluated by another LLM?
What does NOT work
- Exact string matching — LLM output varies by run, model, temperature
- Running once and calling it tested — one pass proves nothing about consistency
- Manual review of every prompt change — does not scale, misses regressions
- Testing only the happy path — boundary cases and adversarial inputs are where failures hide
What works
- Property assertions: does output contain X, avoid Y, parse as JSON, match schema?
- Similarity thresholds: is the output semantically close to the expected answer?
- LLM-as-judge scoring: does a rubric-evaluating model score this output ≥ N?
- Running N times and asserting pass rate ≥ threshold (handles stochasticity)
- Testing across a curated set of diverse inputs, not just one
Five Assertion Types
1. Contains / Not-contains
Assert the output includes a required substring or pattern, or does not include a prohibited one. Cheapest to run — no LLM call needed for evaluation.
Example: JSON response must contain "status" key; must not contain the string "I cannot" in a customer service bot.
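Both examples above can be checked with plain string and JSON operations, no LLM call needed. A minimal sketch in Python, with hypothetical bot outputs for illustration:

```python
import json

def check_contains(output: str) -> bool:
    """Pass if the output parses as JSON with a "status" key and
    never contains the prohibited phrase "I cannot"."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return "status" in data and "I cannot" not in output

# Hypothetical model outputs for illustration
assert check_contains('{"status": "resolved", "reply": "Refund issued."}')
assert not check_contains('{"reply": "I cannot help with that."}')
```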
2. Schema / Format validation
Assert the output parses correctly as JSON, matches a JSON Schema, or follows a required format (e.g., starts with a list, ends with a summary).
Example: extraction prompt must return { "name": string, "email": string, "score": number } — validate with JSON Schema.
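A sketch of that format check, using a hand-rolled type check for illustration (a real suite would validate against a full JSON Schema, for example with the jsonschema package):

```python
import json

# Required fields and their expected types for the extraction prompt
SCHEMA = {"name": str, "email": str, "score": (int, float)}

def validate_extraction(output: str) -> bool:
    """Pass if the output parses as JSON and every required field
    is present with the expected type."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(isinstance(data.get(key), typ) for key, typ in SCHEMA.items())

assert validate_extraction('{"name": "Ada", "email": "ada@example.com", "score": 0.9}')
assert not validate_extraction('{"name": "Ada", "score": "high"}')  # wrong type, missing field
```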
3. Semantic similarity
Assert the output is semantically close to an expected reference answer using embedding cosine similarity. Threshold typically 0.8–0.95 depending on task.
Example: a summarisation prompt must produce output with similarity ≥ 0.85 to the gold summary.
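The check reduces to cosine similarity between two embedding vectors. A sketch with small placeholder vectors standing in for real embeddings of the output and the gold summary:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def similarity_test_passes(output_vec, gold_vec, threshold=0.85):
    """Pass if the output embedding is close enough to the gold embedding.
    In a real suite both vectors come from an embedding model; the
    three-dimensional vectors below are illustrative placeholders."""
    return cosine_similarity(output_vec, gold_vec) >= threshold

assert similarity_test_passes([0.9, 0.1, 0.4], [0.8, 0.2, 0.5])
```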
4. LLM-as-judge scoring
A second LLM call evaluates the output against a rubric and returns a score. The test passes if the score meets a threshold. Most expensive but handles complex quality criteria.
Example: "Rate this customer service response for empathy, accuracy, and conciseness on a scale of 1–5. Return JSON." Pass if avg score ≥ 4.0.
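The test harness itself only needs to parse the judge's reply and compare the average against the threshold. A sketch where judge_reply is a hypothetical response from the judge model (the second LLM call is stubbed out):

```python
import json

def judge_passes(judge_reply: str, threshold: float = 4.0) -> bool:
    """Parse the judge model's JSON reply and pass if the average
    rubric score meets the threshold."""
    scores = json.loads(judge_reply)
    return sum(scores.values()) / len(scores) >= threshold

# In a real test, judge_reply comes from a second LLM call carrying the rubric;
# these hypothetical replies illustrate the pass/fail logic.
assert judge_passes('{"empathy": 5, "accuracy": 4, "conciseness": 4}')      # avg 4.33
assert not judge_passes('{"empathy": 3, "accuracy": 4, "conciseness": 4}')  # avg 3.67
```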
5. Custom function
A Python or JS function that inspects the output and returns pass/fail. Used for complex domain-specific checks that cannot be expressed as a pattern or LLM prompt.
Example: SQL generation prompt — run the generated SQL against a test DB and assert it returns the expected row count.
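Using Python's built-in sqlite3, the SQL example can be sketched as follows; the table schema and the "generated" query are illustrative stand-ins for real model output:

```python
import sqlite3

def sql_test_passes(generated_sql: str, expected_rows: int) -> bool:
    """Run the generated SQL against an in-memory test DB and pass
    if it executes cleanly and returns the expected row count."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE users (id INTEGER, active INTEGER);"
        "INSERT INTO users VALUES (1, 1), (2, 0), (3, 1);"
    )
    try:
        rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid SQL is an automatic fail
    finally:
        conn.close()
    return len(rows) == expected_rows

# Hypothetical model output for a "list active users" prompt
assert sql_test_passes("SELECT id FROM users WHERE active = 1", expected_rows=2)
```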
Testing Tools
DeepEval (Confident AI)
Open-source Python framework designed to be "pytest for LLMs". Test cases are LLMTestCase objects evaluated with an assert_test() call. Supports 15+ built-in metrics including faithfulness, answer relevancy, contextual precision, and custom LLM-as-judge. Integrates with CI/CD and tracks results over time.
- Structure: LLMTestCase(input, actual_output, expected_output, context)
- Metrics: AnswerRelevancyMetric, FaithfulnessMetric, HallucinationMetric, custom
- CI: GitHub Actions integration; test suite reports on every PR
promptfoo
YAML/JSON-based configuration tool for prompt testing. Define prompts, test cases, and assertions in config files; run with promptfoo eval. Supports 30+ assertion types including contains, json, llm-rubric, similar, and javascript custom functions.
- No code required for basic tests — configuration-driven
- Compare prompts side-by-side across multiple models
- Built-in red team mode: runs adversarial test sets automatically
- GitHub Actions native — designed for CI/CD integration
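A minimal promptfoo configuration sketch; the prompt text, provider name, and variable values are illustrative, and the assertion types shown (is-json, contains, llm-rubric) are among promptfoo's built-ins:

```yaml
prompts:
  - "Summarise this support ticket and return JSON with a status field: {{ticket}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      ticket: "My order arrived damaged."
    assert:
      - type: is-json
      - type: contains
        value: "status"
      - type: llm-rubric
        value: "The summary is accurate and appropriately empathetic"
```

Run with promptfoo eval; each test case is evaluated against every prompt/provider pair.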
Structuring a Prompt Test Suite
Test case categories to cover:
- Happy path (30%): typical, well-formed inputs where the prompt should clearly succeed
- Edge cases (30%): empty inputs, very long inputs, inputs in multiple languages, inputs with unusual formatting
- Boundary cases (20%): inputs right at the limits of what the prompt should handle
- Adversarial inputs (20%): inputs designed to confuse, jailbreak, or produce incorrect outputs — these are where real failures hide
Minimum viable test suite: 20–50 test cases per prompt. Fewer than that misses regressions; many more creates maintenance burden without a proportional coverage gain.
Handling Stochasticity
LLMs are non-deterministic at temperature > 0. A test that passes once may fail on the next run due to sampling variation. Two approaches:
Temperature=0 for deterministic tests
Set temperature to 0 for unit tests that require consistent pass/fail. Output is still not byte-identical across models but is far more consistent. Use this for format, schema, and contains assertions.
Pass-rate threshold for quality tests
For quality metrics tested at temperature > 0, run N samples and assert pass rate ≥ threshold. Example: run 5 times, assert 4 of 5 score ≥ 4.0. Captures reliability, not just peak performance.
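The pass-rate approach can be sketched as a small harness; generate and score below are stand-ins for the model call and the judge, injected so the logic is testable:

```python
def pass_rate_test(generate, score, n=5, pass_score=4.0, required=4):
    """Run the prompt n times and pass if at least `required` runs
    meet the quality bar. `generate` produces one model output;
    `score` is the judge (e.g. an LLM-as-judge call)."""
    passes = sum(1 for _ in range(n) if score(generate()) >= pass_score)
    return passes >= required

# Illustrative stub: judged scores varying across five runs (4 of 5 >= 4.0)
runs = iter([4.5, 4.2, 3.8, 4.6, 4.1])
assert pass_rate_test(lambda: next(runs), score=lambda s: s)
```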
What to Always Test
- Output format — every prompt that requires structured output should have a schema validation test
- Required content — any field, phrase, or section that must be present
- Prohibited content — anything the prompt must never generate (competitor names, harmful content, hallucinated citations)
- Boundary length — the prompt should work for both very short and very long inputs
- Language fallback — if your prompt is English-only, test that it handles non-English input gracefully (reject or translate, per design)
- Refusal behaviour — for safety-constrained prompts, test that prohibited requests are actually refused
Checklist: Do You Understand This?
- Why does exact string matching fail as a prompt test assertion, and what should you assert instead?
- What are the five assertion types, and which one is cheapest vs most expensive to evaluate?
- How does LLM-as-judge work as a test assertion, and what does the judge need in its prompt to produce useful scores?
- What four categories of test cases should a prompt test suite cover, and in roughly what proportions?
- How do you handle stochasticity in a prompt test that uses temperature > 0?
- Name three things every prompt that produces structured output should always have tested.