Prompt Unit Tests
A prompt that works today may silently break tomorrow — after a model update, a context change, or a subtle prompt edit. Prompt unit tests are the LLM equivalent of software unit tests: automated checks that verify your prompts produce correct, consistent outputs against a known set of inputs. This page covers the testing philosophy, assertion types, tooling, and how to structure a prompt test suite that actually catches regressions.
Why Prompt Testing Is Different
Traditional software unit tests assert exact outputs: assert result == "expected". LLM outputs are probabilistic — the same prompt produces different text on every call. Prompt testing must therefore assert properties of outputs rather than exact strings: does the output contain the right information? Is it the right format? Does it avoid prohibited content? Does it score above a quality threshold when evaluated by another LLM?
What does NOT work
- Exact string matching — LLM output varies by run, model, temperature
- Running once and calling it tested — one pass proves nothing about consistency
- Manual review of every prompt change — does not scale, misses regressions
- Testing only the happy path — boundary cases and adversarial inputs are where failures hide
What works
- Property assertions: does output contain X, avoid Y, parse as JSON, match schema?
- Similarity thresholds: is the output semantically close to the expected answer?
- LLM-as-judge scoring: does a rubric-evaluating model score this output ≥ N?
- Running N times and asserting pass rate ≥ threshold (handles stochasticity)
- Testing across a curated set of diverse inputs, not just one
Five Assertion Types
1. Contains / Not-contains
Assert the output includes a required substring or pattern, or does not include a prohibited one. Cheapest to run — no LLM call needed for evaluation.
Example: JSON response must contain "status" key; must not contain the string "I cannot" in a customer service bot.
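Both examples above can be checked with plain string and JSON operations, no LLM call needed. A minimal sketch in Python, with hypothetical bot outputs for illustration:

```python
import json

def check_contains(output: str) -> bool:
    """Pass if the output parses as JSON with a "status" key and
    never contains the prohibited phrase "I cannot"."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return "status" in data and "I cannot" not in output

# Hypothetical model outputs for illustration
assert check_contains('{"status": "resolved", "reply": "Refund issued."}')
assert not check_contains('{"reply": "I cannot help with that."}')
```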
2. Schema / Format validation
Assert the output parses correctly as JSON, matches a JSON Schema, or follows a required format (e.g., starts with a list, ends with a summary).
Example: extraction prompt must return { "name": string, "email": string, "score": number } — validate with JSON Schema.
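A sketch of that format check, using a hand-rolled type check for illustration (a real suite would validate against a full JSON Schema, for example with the jsonschema package):

```python
import json

# Required fields and their expected types for the extraction prompt
SCHEMA = {"name": str, "email": str, "score": (int, float)}

def validate_extraction(output: str) -> bool:
    """Pass if the output parses as JSON and every required field
    is present with the expected type."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(isinstance(data.get(key), typ) for key, typ in SCHEMA.items())

assert validate_extraction('{"name": "Ada", "email": "ada@example.com", "score": 0.9}')
assert not validate_extraction('{"name": "Ada", "score": "high"}')  # wrong type, missing field
```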
3. Semantic similarity
Assert the output is semantically close to an expected reference answer using embedding cosine similarity. Threshold typically 0.8–0.95 depending on task.
Example: a summarisation prompt must produce output with similarity ≥ 0.85 to the gold summary.
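The check reduces to cosine similarity between two embedding vectors. A sketch with small placeholder vectors standing in for real embeddings of the output and the gold summary:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def similarity_test_passes(output_vec, gold_vec, threshold=0.85):
    """Pass if the output embedding is close enough to the gold embedding.
    In a real suite both vectors come from an embedding model; the
    three-dimensional vectors below are illustrative placeholders."""
    return cosine_similarity(output_vec, gold_vec) >= threshold

assert similarity_test_passes([0.9, 0.1, 0.4], [0.8, 0.2, 0.5])
```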
4. LLM-as-judge scoring
A second LLM call evaluates the output against a rubric and returns a score. The test passes if the score meets a threshold. Most expensive but handles complex quality criteria.
Example: "Rate this customer service response for empathy, accuracy, and conciseness on a scale of 1–5. Return JSON." Pass if avg score ≥ 4.0.
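The test harness itself only needs to parse the judge's reply and compare the average against the threshold. A sketch where judge_reply is a hypothetical response from the judge model (the second LLM call is stubbed out):

```python
import json

def judge_passes(judge_reply: str, threshold: float = 4.0) -> bool:
    """Parse the judge model's JSON reply and pass if the average
    rubric score meets the threshold."""
    scores = json.loads(judge_reply)
    return sum(scores.values()) / len(scores) >= threshold

# In a real test, judge_reply comes from a second LLM call carrying the rubric;
# these hypothetical replies illustrate the pass/fail logic.
assert judge_passes('{"empathy": 5, "accuracy": 4, "conciseness": 4}')      # avg 4.33
assert not judge_passes('{"empathy": 3, "accuracy": 4, "conciseness": 4}')  # avg 3.67
```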
5. Custom function
A Python or JS function that inspects the output and returns pass/fail. Used for complex domain-specific checks that cannot be expressed as a pattern or LLM prompt.
Example: SQL generation prompt — run the generated SQL against a test DB and assert it returns the expected row count.
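Using Python's built-in sqlite3, the SQL example can be sketched as follows; the table schema and the "generated" query are illustrative stand-ins for real model output:

```python
import sqlite3

def sql_test_passes(generated_sql: str, expected_rows: int) -> bool:
    """Run the generated SQL against an in-memory test DB and pass
    if it executes cleanly and returns the expected row count."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE users (id INTEGER, active INTEGER);"
        "INSERT INTO users VALUES (1, 1), (2, 0), (3, 1);"
    )
    try:
        rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid SQL is an automatic fail
    finally:
        conn.close()
    return len(rows) == expected_rows

# Hypothetical model output for a "list active users" prompt
assert sql_test_passes("SELECT id FROM users WHERE active = 1", expected_rows=2)
```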
Testing Tools
DeepEval (Confident AI)
Open-source Python framework designed to be "pytest for LLMs". Test cases are LLMTestCase objects evaluated with an assert_test() call. Supports 15+ built-in metrics including faithfulness, answer relevancy, contextual precision, and custom LLM-as-judge. Integrates with CI/CD and tracks results over time.
- Structure: LLMTestCase(input, actual_output, expected_output, context)
- Metrics: AnswerRelevancyMetric, FaithfulnessMetric, HallucinationMetric, custom
- CI: GitHub Actions integration; test suite reports on every PR
promptfoo
YAML/JSON-based configuration tool for prompt testing. Define prompts, test cases, and assertions in config files; run with promptfoo eval. Supports 30+ assertion types including contains, json, llm-rubric, similar, and javascript custom functions.
- No code required for basic tests — configuration-driven
- Compare prompts side-by-side across multiple models
- Built-in red team mode: runs adversarial test sets automatically
- GitHub Actions native — designed for CI/CD integration
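A minimal promptfoo configuration sketch; the prompt text, provider name, and variable values are illustrative, and the assertion types shown (is-json, contains, llm-rubric) are among promptfoo's built-ins:

```yaml
prompts:
  - "Summarise this support ticket and return JSON with a status field: {{ticket}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      ticket: "My order arrived damaged."
    assert:
      - type: is-json
      - type: contains
        value: "status"
      - type: llm-rubric
        value: "The summary is accurate and appropriately empathetic"
```

Run with promptfoo eval; each test case is evaluated against every prompt/provider pair.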
Structuring a Prompt Test Suite
Test case categories to cover:
- Happy path (30%): typical, well-formed inputs where the prompt should clearly succeed
- Edge cases (30%): empty inputs, very long inputs, inputs in multiple languages, inputs with unusual formatting
- Boundary cases (20%): inputs right at the limits of what the prompt should handle
- Adversarial inputs (20%): inputs designed to confuse, jailbreak, or produce incorrect outputs — these are where real failures hide
Minimum viable test suite: 20–50 test cases per prompt. Fewer than that misses regressions; many more creates maintenance burden without a proportional coverage gain.
Handling Stochasticity
LLMs are non-deterministic at temperature > 0. A test that passes once may fail on the next run due to sampling variation. Two approaches:
Temperature=0 for deterministic tests
Set temperature to 0 for unit tests that require consistent pass/fail. Output is still not byte-identical across models but is far more consistent. Use this for format, schema, and contains assertions.
Pass-rate threshold for quality tests
For quality metrics tested at temperature > 0, run N samples and assert pass rate ≥ threshold. Example: run 5 times, assert 4 of 5 score ≥ 4.0. Captures reliability, not just peak performance.
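The pass-rate approach can be sketched as a small harness; generate and score below are stand-ins for the model call and the judge, injected so the logic is testable:

```python
def pass_rate_test(generate, score, n=5, pass_score=4.0, required=4):
    """Run the prompt n times and pass if at least `required` runs
    meet the quality bar. `generate` produces one model output;
    `score` is the judge (e.g. an LLM-as-judge call)."""
    passes = sum(1 for _ in range(n) if score(generate()) >= pass_score)
    return passes >= required

# Illustrative stub: judged scores varying across five runs (4 of 5 >= 4.0)
runs = iter([4.5, 4.2, 3.8, 4.6, 4.1])
assert pass_rate_test(lambda: next(runs), score=lambda s: s)
```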
What to Always Test
- Output format — every prompt that requires structured output should have a schema validation test
- Required content — any field, phrase, or section that must be present
- Prohibited content — anything the prompt must never generate (competitor names, harmful content, hallucinated citations)
- Boundary length — the prompt should work for both very short and very long inputs
- Language fallback — if your prompt is English-only, test that it handles non-English input gracefully (reject or translate, per design)
- Refusal behaviour — for safety-constrained prompts, test that prohibited requests are actually refused
Checklist: Do You Understand This?
- Why does exact string matching fail as a prompt test assertion, and what should you assert instead?
- What are the five assertion types, and which one is cheapest vs most expensive to evaluate?
- How does LLM-as-judge work as a test assertion, and what does the judge need in its prompt to produce useful scores?
- What four categories of test cases should a prompt test suite cover, and in roughly what proportions?
- How do you handle stochasticity in a prompt test that uses temperature > 0?
- Name three things every prompt that produces structured output should always have tested.