🧠 All Things AI
Intermediate

Chatbot Evaluation

Shipping a chatbot without an evaluation practice is flying blind. Every prompt change, model upgrade, or RAG adjustment can silently degrade quality in ways that only surface when users complain. This page maps the full evaluation stack: which metrics matter for which bot type, how LLM-as-judge works and where it fails, how to build a regression suite that catches problems before deployment, and how to wire it all into CI/CD so evaluation happens automatically on every change.

Why Evaluation Is Hard for LLMs

Traditional software testing checks deterministic outputs: given input X, output must be Y. LLM chatbots break this model:

  • Multiple valid responses exist for most queries — exact-match metrics (BLEU, ROUGE, F1) fail because they assume a single correct answer
  • Quality is multi-dimensional — a response can be factually correct but off-brand, or helpful but unsafe
  • Models behave probabilistically — the same prompt may score differently on repeated runs
  • Errors compound in multi-turn conversations — a single-turn eval misses conversation-level failures
  • Distribution shift happens silently — your test set reflects past queries; user behaviour evolves

The answer is a layered evaluation practice: automated metrics for continuous monitoring, LLM-as-judge for nuanced scoring, human review for calibration, and red-teaming for adversarial robustness — each playing a different role.

Core Evaluation Metrics

The right metrics depend on your bot type. But several dimensions apply across all chatbots:

MetricWhat it measuresHow measured
FaithfulnessIs the response grounded in retrieved context, or hallucinated?LLM-as-judge, RAGAS
Answer relevanceDoes the response actually address the question asked?LLM-as-judge, embedding similarity
Context relevanceDid retrieval surface the right documents for the query?LLM-as-judge, RAGAS
Task completion rateDid the user achieve their goal? (task bots)Deterministic (success/fail), human review
Slot accuracyWere entities (names, dates, amounts) extracted correctly?Deterministic comparison vs. ground truth
Suggestion accept rateWhat fraction of copilot suggestions did users accept?Product analytics
Error rate in accepted suggestionsHow often did an accepted suggestion contain an error?Human review, post-acceptance audit
Containment rateConversations resolved without human escalationProduct analytics
Latency (p50 / p95)Response time — critical for UX (<2–5s target)Infrastructure monitoring
Cost per interactionToken usage × model pricingLLM API usage logs

Metrics by Bot Type

Each chatbot pattern from the FAQ vs Task vs Copilot page has a distinct evaluation profile:

FAQ / RAG Chatbots — maximise groundedness

MetricTargetKey test cases
Faithfulness>90%Queries where context contains the answer
Citation accuracy>85% (baseline: 65–70%)Multi-document synthesis; claims mapped to sources
Out-of-domain rejection>95%Queries on topics absent from knowledge base
Answer relevance>85%Follow-up questions, rephrased queries

Task Bots — maximise completion accuracy

MetricTargetKey test cases
Task completion rate>85%End-to-end happy-path flows
Slot accuracy (by entity type)Dates >95%, names >80%, amounts >90%Boundary values, special characters, ambiguous inputs
Error recoveryBot re-prompts correctly on invalid inputInvalid emails, past dates, out-of-range values
Dialogue efficiencyTurns taken vs. theoretical minimumCompare conversational paths across prompt versions

Copilot Bots — maximise useful suggestions, minimise harmful ones

MetricTargetKey test cases
Suggestion accept rateBaseline → track improvementA/B test suggestion variants
Error rate in accepted suggestions<5%Post-acceptance audit; intentionally tricky contexts
Safety compliance0% policy violationsRed-team: adversarial context injection
Context relevance>80%Suggestions across diverse task states

LLM-as-Judge

LLM-as-judge uses a separate LLM (the "judge") to score your chatbot's outputs against a rubric. When implemented well, it aligns with human judgment at 85%+ — actually higher than human-to-human inter-rater agreement (typically ~81%). This makes it the most scalable high-quality evaluation method available.

Two judging patterns

Point-wise (direct assessment)

Judge evaluates one response against a rubric and returns a score or pass/fail. Use for: post-deployment monitoring, regression testing, A/B prompt comparison.

Pairwise (comparative)

Judge picks the better of two candidate responses. More robust to scoring biases; ideal for A/B testing prompt versions or models head-to-head.

Known biases to mitigate

  • Position bias: judges prefer responses in certain positions (first/last) regardless of quality
  • Length bias: longer responses rated as "more helpful" even when concise answers are better
  • Self-preference: GPT-4 judging GPT-4 output inflates scores — use a different model for judging
  • Prompt sensitivity: small rubric changes produce large score shifts — lock your judge prompt
  • Flakiness: same input may score differently across runs — average multiple calls for high-stakes decisions

Best practices for LLM-as-judge (2025)

  • Write domain-specific rubrics — generic "is this helpful?" prompts miss task-specific failure modes
  • Break complex criteria into yes/no sub-questions — simpler questions produce more consistent judgments
  • Add few-shot examples to the judge prompt — increases GPT-4 consistency from 65% to 77.5%
  • Request chain-of-thought reasoning in the judge output — makes failures debuggable
  • Validate judge quality: have humans label 50–100 examples and measure judge-human agreement — target >80% before trusting at scale
  • Use dedicated judge models (Prometheus variants) or a different model family than your chatbot
  • For critical decisions: multi-agent judging (MAJ-EVAL) — multiple judge agents with different personas debate the score; outperforms single-judge on complex tasks

Human vs Automated Evaluation

DimensionAutomated (metrics + LLM judge)Human evaluation
ScaleThousands of evals per runTens to hundreds (expensive, slow)
SpeedMinutes per eval runDays to weeks per round
Cost$0.10–$2 per run (50–500 cases)High — annotators, coordination, review
NuanceGood with LLM-as-judge; misses subtle tone/policy issuesGold standard — handles ambiguity, edge cases, cultural context
ConsistencyRepeatable given same prompt and model versionSubject to drift, fatigue, interpretation differences
Best roleContinuous monitoring, regression detection, every-PR gateRubric validation, high-stakes decisions, auditing automated evals

The 2025 consensus: run automated evals continuously as a monitoring layer; use human evaluation periodically to calibrate your rubrics and validate that automated metrics still correlate with real user satisfaction. Neither alone is sufficient.

Evaluation Frameworks

ToolFocusStrengthsOpen sourceBest for
RAGASRAG pipelinesReference-free; evaluates retriever + generator independently; easy LangChain/LlamaIndex integrationYesRAG experimentation, fast metric iteration
DeepEvalGeneral LLM appsUnit-test style; native CI/CD; custom metrics; GitHub Actions integration; red-teaming via DeepTeamPartialProduction regression testing across all bot types
LangSmithLangChain observabilityDeep trace visualization; drill into embedding/retrieval/ranking/generation steps; trace-based evalsNoDebugging complex LangChain workflows
PromptfooSecurity & prompt evalsYAML-config (no Python needed); strong red-teaming; CLI-friendly; GitHub ActionsYesSecurity testing, adversarial prompts, quick eval setup
BraintrustEnd-to-end platformAuto-converts prod traces to test cases; regression detection; PR comments; Braintrust GitHub ActionNoTeams wanting plug-and-play prod eval without infra build-out
Arize PhoenixOpen-source observabilityFully open-source; self-hostable; OpenTelemetry standard; no vendor lock-inYesTeams prioritising self-hosting and avoiding proprietary platforms

Regression Testing

A regression suite is a fixed set of test cases you re-run after every prompt change, model update, or RAG adjustment. The goal: catch unintended degradation before it reaches production.

Building your test set

StageGuidance
Minimum size50–100 cases to detect obvious regressions; 200–500 for comprehensive coverage
StratificationCover: simple queries, complex synthesis, edge cases, out-of-domain, multi-turn, adversarial
Ground truthFor deterministic metrics (slot accuracy, task completion), store expected answers; for LLM-judge metrics, store rubric scores from calibration run
Living documentAdd every production failure to the test set immediately — failures are the most valuable test cases
Refresh cadenceQuarterly review: add new query patterns, remove stale cases, rebalance category distribution

Regression thresholds (typical starting points)

  • Faithfulness: fail build if drop >3–5% vs. baseline
  • Task completion rate: fail if drop >2%
  • Latency (p95): fail if increase >20%
  • Cost per interaction: warn if increase >30%
  • Safety/hallucination: any regression is a hard fail — no exceptions
  • Decision rule: small wins in one metric never justify regressions in safety or accuracy

CI/CD Integration

The goal: every PR that touches a prompt, system instruction, model version, or RAG configuration automatically runs the eval suite and posts results as a PR comment before merge is allowed.

Typical CI/CD eval pipeline
PR submitted (prompt / model / RAG change)
GitHub Action triggered → load test dataset (50–500 cases)
Run chatbot with new version against all test cases
Score outputs (deterministic metrics + LLM-as-judge)
Compare vs baseline (previous production version)
Post PR comment: metrics by category, deltas, regressions flagged
PASS: all metrics above thresholds → merge allowed
FAIL: regression detected → requires review + justification

Practical economics: a 100-case eval run using GPT-4o as judge costs roughly $0.50–$1.00 and completes in 2–5 minutes. Running twice daily costs under $2 — trivially cheap compared to the cost of a quality regression reaching production users.

Tiered runs: run a fast 20–50 case eval on every PR (30 seconds, <$0.25); run a comprehensive 500-case eval on the main branch before deploy. Keep CI fast by reserving thorough evals for merge gates, not every commit.

Red-Teaming

Red-teaming stress-tests your chatbot with adversarial inputs before deployment. As of 2025, the EU AI Act and U.S. executive orders explicitly require adversarial testing for high-risk AI systems — red-teaming is becoming a compliance requirement, not optional.

Attack taxonomy

Prompt injection: user embeds instructions designed to override your system prompt
Jailbreaking: crafted inputs that bypass safety guidelines
Hallucination triggers: prompts designed to elicit false confident claims
Context confusion: multi-turn attacks exploiting memory mismanagement
Toxicity/bias triggers: inputs designed to elicit harmful or biased language
Data leakage: attempts to extract system prompt or training data
Omission attacks: requests the bot should refuse but accepts
Topic drift: gradually steering the bot off its designated scope

2025 research context

The PRISM automated red-teaming framework (August 2025) achieved a 100% attack success rate against 37 of 41 state-of-the-art LLMs in multi-turn adversarial dialogue scenarios. Even frontier models remain vulnerable to systematic attacks. This underlines why red-teaming must be a continuous practice, not a one-off pre-launch activity.

Tools for automated red-teaming:

  • Promptfoo: built-in red-team mode via YAML config — lowest friction to get started
  • DeepTeam (Confident AI): automated attack generation with categorised failure reports
  • Giskard: dynamic multi-turn stress tests; detects hallucinations, omissions, prompt injections, data leakage

Common Mistakes

Metric mistakes

  • Using BLEU / ROUGE: these assume a single correct answer; they fail for open-ended LLM outputs. Use semantic metrics or LLM-as-judge instead
  • Domain-irrelevant benchmarks: MMLU performance doesn't predict domain-specific helpfulness. Always eval on representative task data
  • Aggregate scores hiding slice failures: "85% overall" can mean 30% failure on a critical query category. Always stratify results

Process mistakes

  • No regression testing: evaluating only new changes without a baseline means you cannot detect drift
  • Evals not connected to deployment: results that don't gate merges get ignored. Implement quality gates
  • Single-turn only: errors compound in multi-turn conversations. Build multi-turn test cases that check context retention and consistency
  • Same model as judge: GPT-4 judging its own outputs inflates scores — use a different model or model family
  • No post-deploy monitoring: metrics validated in test can drift in production as query distribution shifts

Minimum Viable Eval Setup

Start here before building anything more complex:

  1. Build a test set: 100 representative queries covering your bot's use cases, edge cases, and known failure modes
  2. Define 3–5 critical metrics for your bot type (faithfulness + answer relevance + latency for FAQ bots; task completion + slot accuracy for task bots)
  3. Implement LLM-as-judge scoring with a domain-specific rubric + few-shot examples; validate against 50 human-labeled cases
  4. Run the full suite before every production deploy — fail the deploy if any metric regresses beyond threshold
  5. Sample 20–50 real production conversations weekly; review failures and add them to the test set

This five-step setup catches the majority of regressions for under $5/day in eval costs. Add red-teaming, multi-agent judging, and comprehensive CI/CD integration as your system matures.

Checklist: Do You Understand This?

  • Can you name two reasons why BLEU / ROUGE are poor metrics for LLM chatbot evaluation?
  • Do you know which three metrics matter most for a RAG/FAQ bot vs a task bot vs a copilot?
  • Can you explain the position bias and length bias problems with LLM-as-judge, and how to mitigate them?
  • Do you understand why human evaluation and automated evaluation are complements, not substitutes?
  • Can you describe what a regression suite is and what failure thresholds you would set for faithfulness and safety?
  • Do you know the difference between RAGAS, DeepEval, and Promptfoo — and when you would reach for each?
  • Can you describe the 2025 PRISM finding and what it implies about red-teaming cadence?