🧠 All Things AI
Intermediate

Prompt Unit Tests

A prompt that works today may silently break tomorrow — after a model update, a context change, or a subtle prompt edit. Prompt unit tests are the LLM equivalent of software unit tests: automated checks that verify your prompts produce correct, consistent outputs against a known set of inputs. This page covers the testing philosophy, assertion types, tooling, and how to structure a prompt test suite that actually catches regressions.

Why Prompt Testing Is Different

Traditional software unit tests assert exact outputs: assert result == "expected". LLM outputs are probabilistic — the same prompt produces different text on every call. Prompt testing must therefore assert properties of outputs rather than exact strings: does the output contain the right information? Is it the right format? Does it avoid prohibited content? Does it score above a quality threshold when evaluated by another LLM?

What does NOT work

  • Exact string matching — LLM output varies by run, model, temperature
  • Running once and calling it tested — one pass proves nothing about consistency
  • Manual review of every prompt change — does not scale, misses regressions
  • Testing only the happy path — boundary cases and adversarial inputs are where failures hide

What works

  • Property assertions: does output contain X, avoid Y, parse as JSON, match schema?
  • Similarity thresholds: is the output semantically close to the expected answer?
  • LLM-as-judge scoring: does a rubric-evaluating model score this output ≥ N?
  • Running N times and asserting pass rate ≥ threshold (handles stochasticity)
  • Testing across a curated set of diverse inputs, not just one

Five Assertion Types

1. Contains / Not-contains

Assert the output includes a required substring or pattern, or does not include a prohibited one. Cheapest to run — no LLM call needed for evaluation.

Example: JSON response must contain "status" key; must not contain the string "I cannot" in a customer service bot.

2. Schema / Format validation

Assert the output parses correctly as JSON, matches a JSON Schema, or follows a required format (e.g., starts with a list, ends with a summary).

Example: extraction prompt must return { "name": string, "email": string, "score": number } — validate with JSON Schema.

3. Semantic similarity

Assert the output is semantically close to an expected reference answer using embedding cosine similarity. Threshold typically 0.8–0.95 depending on task.

Example: a summarisation prompt must produce output with similarity ≥ 0.85 to the gold summary.

4. LLM-as-judge scoring

A second LLM call evaluates the output against a rubric and returns a score. The test passes if the score meets a threshold. Most expensive but handles complex quality criteria.

Example: "Rate this customer service response for empathy, accuracy, and conciseness on a scale of 1–5. Return JSON." Pass if avg score ≥ 4.0.

5. Custom function

A Python or JS function that inspects the output and returns pass/fail. Used for complex domain-specific checks that cannot be expressed as a pattern or LLM prompt.

Example: SQL generation prompt — run the generated SQL against a test DB and assert it returns the expected row count.

Testing Tools

DeepEval (Confident AI)

Open-source Python framework designed to be pytest for LLMs. Test cases are Python classes with an assert_test() function. Supports 15+ built-in metrics including faithfulness, answer relevancy, contextual precision, and custom LLM-as-judge. Integrates with CI/CD and tracks results over time.

  • Structure: LLMTestCase(input, actual_output, expected_output, context)
  • Metrics: AnswerRelevancyMetric, FaithfulnessMetric, HallucinationMetric, custom
  • CI: GitHub Actions integration; test suite reports on every PR

promptfoo

YAML/JSON-based configuration tool for prompt testing. Define prompts, test cases, and assertions in config files; run with promptfoo eval. Supports 30+ assertion types including contains, json, llm-rubric, similar, and javascript custom functions.

  • No code required for basic tests — configuration-driven
  • Compare prompts side-by-side across multiple models
  • Built-in red team mode: runs adversarial test sets automatically
  • GitHub Actions native — designed for CI/CD integration

Structuring a Prompt Test Suite

Test case categories to cover:

  • Happy path (30%): typical, well-formed inputs where the prompt should clearly succeed
  • Edge cases (30%): empty inputs, very long inputs, inputs in multiple languages, inputs with unusual formatting
  • Boundary cases (20%): inputs right at the limits of what the prompt should handle
  • Adversarial inputs (20%): inputs designed to confuse, jailbreak, or produce incorrect outputs — these are where real failures hide

Minimum viable test suite: 20–50 test cases per prompt. Fewer misses regressions; more creates maintenance burden without proportional coverage gain.

Handling Stochasticity

LLMs are non-deterministic at temperature > 0. A test that passes once may fail on the next run due to sampling variation. Two approaches:

Temperature=0 for deterministic tests

Set temperature to 0 for unit tests that require consistent pass/fail. Output is still not byte-identical across models but is far more consistent. Use this for format, schema, and contains assertions.

Pass-rate threshold for quality tests

For quality metrics tested at temperature > 0, run N samples and assert pass rate ≥ threshold. Example: run 5 times, assert 4 of 5 score ≥ 4.0. Captures reliability, not just peak performance.

What to Always Test

  • Output format — every prompt that requires structured output should have a schema validation test
  • Required content — any field, phrase, or section that must be present
  • Prohibited content — anything the prompt must never generate (competitor names, harmful content, hallucinated citations)
  • Boundary length — the prompt should work for both very short and very long inputs
  • Language fallback — if your prompt is English-only, test that it handles non-English input gracefully (reject or translate, per design)
  • Refusal behaviour — for safety-constrained prompts, test that prohibited requests are actually refused

Checklist: Do You Understand This?

  • Why does exact string matching fail as a prompt test assertion, and what should you assert instead?
  • What are the five assertion types, and which one is cheapest vs most expensive to evaluate?
  • How does LLM-as-judge work as a test assertion, and what does the judge need in its prompt to produce useful scores?
  • What four categories of test cases should a prompt test suite cover, and in roughly what proportions?
  • How do you handle stochasticity in a prompt test that uses temperature > 0?
  • Name three things every prompt that produces structured output should always have tested.