Chatbot Evaluation
Shipping a chatbot without an evaluation practice is flying blind. Every prompt change, model upgrade, or RAG adjustment can silently degrade quality in ways that only surface when users complain. This page maps the full evaluation stack: which metrics matter for which bot type, how LLM-as-judge works and where it fails, how to build a regression suite that catches problems before deployment, and how to wire it all into CI/CD so evaluation happens automatically on every change.
Why Evaluation Is Hard for LLMs
Traditional software testing checks deterministic outputs: given input X, output must be Y. LLM chatbots break this model:
- Multiple valid responses exist for most queries — exact-match metrics (BLEU, ROUGE, F1) fail because they assume a single correct answer
- Quality is multi-dimensional — a response can be factually correct but off-brand, or helpful but unsafe
- Models behave probabilistically — the same prompt may score differently on repeated runs
- Errors compound in multi-turn conversations — a single-turn eval misses conversation-level failures
- Distribution shift happens silently — your test set reflects past queries; user behaviour evolves
The answer is a layered evaluation practice: automated metrics for continuous monitoring, LLM-as-judge for nuanced scoring, human review for calibration, and red-teaming for adversarial robustness — each playing a different role.
Core Evaluation Metrics
The right metrics depend on your bot type. But several dimensions apply across all chatbots:
| Metric | What it measures | How measured |
|---|---|---|
| Faithfulness | Is the response grounded in retrieved context, or hallucinated? | LLM-as-judge, RAGAS |
| Answer relevance | Does the response actually address the question asked? | LLM-as-judge, embedding similarity |
| Context relevance | Did retrieval surface the right documents for the query? | LLM-as-judge, RAGAS |
| Task completion rate | Did the user achieve their goal? (task bots) | Deterministic (success/fail), human review |
| Slot accuracy | Were entities (names, dates, amounts) extracted correctly? | Deterministic comparison vs. ground truth |
| Suggestion accept rate | What fraction of copilot suggestions did users accept? | Product analytics |
| Error rate in accepted suggestions | How often did an accepted suggestion contain an error? | Human review, post-acceptance audit |
| Containment rate | Conversations resolved without human escalation | Product analytics |
| Latency (p50 / p95) | Response time — critical for UX (<2–5s target) | Infrastructure monitoring |
| Cost per interaction | Token usage × model pricing | LLM API usage logs |
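The last row is fully deterministic: cost per interaction falls straight out of your token logs. A minimal sketch, with the pricing figures as placeholder assumptions (substitute your provider's current per-million-token rates):

```python
def cost_per_interaction(prompt_tokens: int, completion_tokens: int,
                         input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Cost in dollars for one interaction: token usage x model pricing."""
    return (prompt_tokens * input_price_per_1m
            + completion_tokens * output_price_per_1m) / 1_000_000

# Illustrative (not current) prices: $2.50/M input tokens, $10/M output tokens.
cost = cost_per_interaction(1200, 300, 2.50, 10.00)
print(f"${cost:.4f}")  # $0.0060
```

Aggregating this per conversation, rather than per turn, is what makes the "cost per interaction" row comparable across prompt versions that change turn counts.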
Metrics by Bot Type
Each chatbot pattern from the FAQ vs Task vs Copilot page has a distinct evaluation profile:
FAQ / RAG Chatbots — maximise groundedness
| Metric | Target | Key test cases |
|---|---|---|
| Faithfulness | >90% | Queries where context contains the answer |
| Citation accuracy | >85% (baseline: 65–70%) | Multi-document synthesis; claims mapped to sources |
| Out-of-domain rejection | >95% | Queries on topics absent from knowledge base |
| Answer relevance | >85% | Follow-up questions, rephrased queries |
Task Bots — maximise completion accuracy
| Metric | Target | Key test cases |
|---|---|---|
| Task completion rate | >85% | End-to-end happy-path flows |
| Slot accuracy (by entity type) | Dates >95%, names >80%, amounts >90% | Boundary values, special characters, ambiguous inputs |
| Error recovery | Bot re-prompts correctly on invalid input | Invalid emails, past dates, out-of-range values |
| Dialogue efficiency | Turns taken vs. theoretical minimum | Compare conversational paths across prompt versions |
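Slot accuracy is the one task-bot metric in this table you can compute deterministically, with no judge involved. A minimal sketch comparing extracted slots to ground truth, grouped by entity type (the field layout is illustrative, not a standard):

```python
from collections import defaultdict

def slot_accuracy(cases: list[dict]) -> dict[str, float]:
    """Per-entity-type accuracy: fraction of slots whose extracted value
    exactly matches ground truth. Each case maps entity_type -> (expected, got)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for case in cases:
        for entity_type, (expected, got) in case.items():
            totals[entity_type] += 1
            hits[entity_type] += int(expected == got)
    return {t: hits[t] / totals[t] for t in totals}

cases = [
    {"date": ("2025-03-01", "2025-03-01"), "name": ("Ana Díaz", "Ana Diaz")},
    {"date": ("2025-04-15", "2025-04-15"), "amount": ("99.50", "99.50")},
]
print(slot_accuracy(cases))  # dates exact-match; the accented name misses
```

Exact string match is deliberately strict here; in practice you may normalise dates and casing first, but keep the per-entity-type breakdown, since the table's targets differ by type.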
Copilot Bots — maximise useful suggestions, minimise harmful ones
| Metric | Target | Key test cases |
|---|---|---|
| Suggestion accept rate | Baseline → track improvement | A/B test suggestion variants |
| Error rate in accepted suggestions | <5% | Post-acceptance audit; intentionally tricky contexts |
| Safety compliance | 0% policy violations | Red-team: adversarial context injection |
| Context relevance | >80% | Suggestions across diverse task states |
LLM-as-Judge
LLM-as-judge uses a separate LLM (the "judge") to score your chatbot's outputs against a rubric. When implemented well, it aligns with human judgment at 85%+ — actually higher than human-to-human inter-rater agreement (typically ~81%). This makes it the most scalable high-quality evaluation method available.
Two judging patterns
Point-wise (direct assessment)
Judge evaluates one response against a rubric and returns a score or pass/fail. Use for: post-deployment monitoring, regression testing, A/B prompt comparison.
Pairwise (comparative)
Judge picks the better of two candidate responses. More robust to scoring biases; ideal for A/B testing prompt versions or models head-to-head.
Known biases to mitigate
- Position bias: judges prefer responses in certain positions (first/last) regardless of quality
- Length bias: longer responses rated as "more helpful" even when concise answers are better
- Self-preference: GPT-4 judging GPT-4 output inflates scores — use a different model for judging
- Prompt sensitivity: small rubric changes produce large score shifts — lock your judge prompt
- Flakiness: same input may score differently across runs — average multiple calls for high-stakes decisions
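Position bias in pairwise judging is commonly mitigated by scoring both orderings and treating disagreement as a tie. A minimal sketch, where `judge` stands in for your actual LLM call (its signature here is an assumption):

```python
from typing import Callable

def pairwise_verdict(judge: Callable[[str, str, str], str],
                     query: str, resp_a: str, resp_b: str) -> str:
    """Run the judge twice with candidate order swapped to cancel position bias.
    `judge` returns "first" or "second"; we map back to "A"/"B" and
    only accept a winner when both orderings agree."""
    v1 = judge(query, resp_a, resp_b)          # A shown first
    v2 = judge(query, resp_b, resp_a)          # B shown first
    pick1 = "A" if v1 == "first" else "B"
    pick2 = "B" if v2 == "first" else "A"
    return pick1 if pick1 == pick2 else "tie"

# A stub judge that always prefers whichever response is shown first:
biased = lambda q, first, second: "first"
print(pairwise_verdict(biased, "q", "resp A", "resp B"))  # tie: the bias cancels out
```

The same call-twice-and-aggregate pattern also addresses flakiness: for high-stakes comparisons, repeat each ordering several times and majority-vote.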
Best practices for LLM-as-judge (2025)
- Write domain-specific rubrics — generic "is this helpful?" prompts miss task-specific failure modes
- Break complex criteria into yes/no sub-questions — simpler questions produce more consistent judgments
- Add few-shot examples to the judge prompt — increases GPT-4 consistency from 65% to 77.5%
- Request chain-of-thought reasoning in the judge output — makes failures debuggable
- Validate judge quality: have humans label 50–100 examples and measure judge-human agreement — target >80% before trusting at scale
- Use dedicated judge models (Prometheus variants) or a different model family than your chatbot
- For critical decisions: multi-agent judging (MAJ-EVAL) — multiple judge agents with different personas debate the score; outperforms single-judge on complex tasks
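The calibration step above (validate judge quality against human labels) reduces to a simple agreement rate over the same examples. A minimal sketch:

```python
def judge_human_agreement(human: list[str], judge: list[str]) -> float:
    """Fraction of examples where the judge's label matches the human label."""
    if len(human) != len(judge):
        raise ValueError("label lists must be the same length")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)

human_labels = ["pass", "fail", "pass", "pass", "fail"]
judge_labels = ["pass", "fail", "fail", "pass", "fail"]
agreement = judge_human_agreement(human_labels, judge_labels)
print(f"{agreement:.0%}")  # 80%: right at the threshold, so expand the labeled set
```

Raw agreement is the simplest check; on imbalanced labels (mostly "pass") consider a chance-corrected statistic such as Cohen's kappa before trusting the judge at scale.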
Human vs Automated Evaluation
| Dimension | Automated (metrics + LLM judge) | Human evaluation |
|---|---|---|
| Scale | Thousands of evals per run | Tens to hundreds (expensive, slow) |
| Speed | Minutes per eval run | Days to weeks per round |
| Cost | $0.10–$2 per run (50–500 cases) | High — annotators, coordination, review |
| Nuance | Good with LLM-as-judge; misses subtle tone/policy issues | Gold standard — handles ambiguity, edge cases, cultural context |
| Consistency | Repeatable given same prompt and model version | Subject to drift, fatigue, interpretation differences |
| Best role | Continuous monitoring, regression detection, every-PR gate | Rubric validation, high-stakes decisions, auditing automated evals |
The 2025 consensus: run automated evals continuously as a monitoring layer; use human evaluation periodically to calibrate your rubrics and validate that automated metrics still correlate with real user satisfaction. Neither alone is sufficient.
Evaluation Frameworks
| Tool | Focus | Strengths | Open source | Best for |
|---|---|---|---|---|
| RAGAS | RAG pipelines | Reference-free; evaluates retriever + generator independently; easy LangChain/LlamaIndex integration | Yes | RAG experimentation, fast metric iteration |
| DeepEval | General LLM apps | Unit-test style; native CI/CD; custom metrics; GitHub Actions integration; red-teaming via DeepTeam | Partial | Production regression testing across all bot types |
| LangSmith | LangChain observability | Deep trace visualization; drill into embedding/retrieval/ranking/generation steps; trace-based evals | No | Debugging complex LangChain workflows |
| Promptfoo | Security & prompt evals | YAML-config (no Python needed); strong red-teaming; CLI-friendly; GitHub Actions | Yes | Security testing, adversarial prompts, quick eval setup |
| Braintrust | End-to-end platform | Auto-converts prod traces to test cases; regression detection; PR comments; Braintrust GitHub Action | No | Teams wanting plug-and-play prod eval without infra build-out |
| Arize Phoenix | Open-source observability | Fully open-source; self-hostable; OpenTelemetry standard; no vendor lock-in | Yes | Teams prioritising self-hosting and avoiding proprietary platforms |
Regression Testing
A regression suite is a fixed set of test cases you re-run after every prompt change, model update, or RAG adjustment. The goal: catch unintended degradation before it reaches production.
Building your test set
| Stage | Guidance |
|---|---|
| Minimum size | 50–100 cases to detect obvious regressions; 200–500 for comprehensive coverage |
| Stratification | Cover: simple queries, complex synthesis, edge cases, out-of-domain, multi-turn, adversarial |
| Ground truth | For deterministic metrics (slot accuracy, task completion), store expected answers; for LLM-judge metrics, store rubric scores from calibration run |
| Living document | Add every production failure to the test set immediately — failures are the most valuable test cases |
| Refresh cadence | Quarterly review: add new query patterns, remove stale cases, rebalance category distribution |
Regression thresholds (typical starting points)
- Faithfulness: fail build if drop >3–5% vs. baseline
- Task completion rate: fail if drop >2%
- Latency (p95): fail if increase >20%
- Cost per interaction: warn if increase >30%
- Safety/hallucination: any regression is a hard fail — no exceptions
- Decision rule: small wins in one metric never justify regressions in safety or accuracy
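These thresholds can be enforced as a single gate function comparing the current run to a stored baseline. A minimal sketch using the starting points above (metric names are illustrative; the faithfulness check uses the 5% end of the 3–5% range):

```python
def regression_gate(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
    """Return a list of failure messages; an empty list means the build passes."""
    failures = []
    # Relative drops that fail the build
    if current["faithfulness"] < baseline["faithfulness"] - 0.05:
        failures.append("faithfulness dropped >5% vs. baseline")
    if current["task_completion"] < baseline["task_completion"] - 0.02:
        failures.append("task completion dropped >2% vs. baseline")
    if current["p95_latency_s"] > baseline["p95_latency_s"] * 1.20:
        failures.append("p95 latency increased >20%")
    # Safety: any regression is a hard fail, regardless of other wins
    if current["safety_violations"] > baseline["safety_violations"]:
        failures.append("safety regression: hard fail, no exceptions")
    return failures

baseline = {"faithfulness": 0.92, "task_completion": 0.88,
            "p95_latency_s": 3.0, "safety_violations": 0}
current = {"faithfulness": 0.90, "task_completion": 0.85,
           "p95_latency_s": 3.2, "safety_violations": 0}
print(regression_gate(baseline, current))  # only the task-completion drop trips the gate
```

Note the decision rule is structural: safety sits outside any trade-off logic, so an improvement elsewhere can never mask a safety regression.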
CI/CD Integration
The goal: every PR that touches a prompt, system instruction, model version, or RAG configuration automatically runs the eval suite and posts results as a PR comment before merge is allowed.
Practical economics: a 100-case eval run using GPT-4o as judge costs roughly $0.50–$1.00 and completes in 2–5 minutes. Running twice daily costs under $2 — trivially cheap compared to the cost of a quality regression reaching production users.
Tiered runs: run a fast 20–50 case eval on every PR (30 seconds, <$0.25); run a comprehensive 500-case eval on the main branch before deploy. Keep CI fast by reserving thorough evals for merge gates, not every commit.
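Wired into GitHub Actions, the tiered scheme looks roughly like this. This is a sketch only: `eval/run.py`, its `--suite` flag, the requirements path, and the secret name are all placeholder assumptions for whatever runner your framework provides (DeepEval, Promptfoo, and Braintrust each ship their own Action):

```yaml
name: chatbot-evals
on:
  pull_request:        # fast tier: 20-50 cases on every PR
  push:
    branches: [main]   # full tier: comprehensive suite before deploy

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r eval/requirements.txt
      # Placeholder runner script: exits nonzero (blocking merge) on any
      # threshold breach; suite size depends on the trigger.
      - name: Run eval suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          if [ "${{ github.event_name }}" = "pull_request" ]; then
            python eval/run.py --suite fast    # ~30s, <$0.25
          else
            python eval/run.py --suite full    # 500 cases before deploy
          fi
```

Marking the job as a required status check in branch protection is what turns the eval run into an actual merge gate rather than an advisory comment.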
Red-Teaming
Red-teaming stress-tests your chatbot with adversarial inputs before deployment. As of 2025, the EU AI Act and U.S. executive orders explicitly require adversarial testing for high-risk AI systems — red-teaming is becoming a compliance requirement, not optional.
Attack taxonomy
Cover at least these adversarial categories in your red-team suite:
- Prompt injection: instructions smuggled into user input or retrieved documents that override the system prompt
- Jailbreaks: role-play, obfuscation, or multi-turn manipulation that coaxes the bot past its safety policy
- Data leakage: extraction of the system prompt, retrieved context, or other users' personal data
- Harmful content elicitation: disallowed requests phrased indirectly or embedded inside legitimate-looking tasks
- Off-task manipulation: steering the bot into claims or actions outside its mandate, such as unauthorised commitments
2025 research context
The PRISM automated red-teaming framework (August 2025) achieved a 100% attack success rate against 37 of 41 state-of-the-art LLMs in multi-turn adversarial dialogue scenarios. Even frontier models remain vulnerable to systematic attacks. This underlines why red-teaming must be a continuous practice, not a one-off pre-launch activity.
Tools for automated red-teaming:
- Promptfoo: built-in red-team mode via YAML config — lowest friction to get started
- DeepTeam (Confident AI): automated attack generation with categorised failure reports
- Giskard: dynamic multi-turn stress tests; detects hallucinations, omissions, prompt injections, data leakage
Common Mistakes
Metric mistakes
- Using BLEU / ROUGE: these assume a single correct answer; they fail for open-ended LLM outputs. Use semantic metrics or LLM-as-judge instead
- Domain-irrelevant benchmarks: MMLU performance doesn't predict domain-specific helpfulness. Always eval on representative task data
- Aggregate scores hiding slice failures: "85% overall" can mean 30% failure on a critical query category. Always stratify results
Process mistakes
- No regression testing: evaluating only new changes without a baseline means you cannot detect drift
- Evals not connected to deployment: results that don't gate merges get ignored. Implement quality gates
- Single-turn only: errors compound in multi-turn conversations. Build multi-turn test cases that check context retention and consistency
- Same model as judge: GPT-4 judging its own outputs inflates scores — use a different model or model family
- No post-deploy monitoring: metrics validated in test can drift in production as query distribution shifts
Minimum Viable Eval Setup
Start here before building anything more complex:
- Build a test set: 100 representative queries covering your bot's use cases, edge cases, and known failure modes
- Define 3–5 critical metrics for your bot type (faithfulness + answer relevance + latency for FAQ bots; task completion + slot accuracy for task bots)
- Implement LLM-as-judge scoring with a domain-specific rubric + few-shot examples; validate against 50 human-labeled cases
- Run the full suite before every production deploy — fail the deploy if any metric regresses beyond threshold
- Sample 20–50 real production conversations weekly; review failures and add them to the test set
This five-step setup catches the majority of regressions for under $5/day in eval costs. Add red-teaming, multi-agent judging, and comprehensive CI/CD integration as your system matures.
Checklist: Do You Understand This?
- Can you name two reasons why BLEU / ROUGE are poor metrics for LLM chatbot evaluation?
- Do you know which three metrics matter most for a RAG/FAQ bot vs a task bot vs a copilot?
- Can you explain the position bias and length bias problems with LLM-as-judge, and how to mitigate them?
- Do you understand why human evaluation and automated evaluation are complements, not substitutes?
- Can you describe what a regression suite is and what failure thresholds you would set for faithfulness and safety?
- Do you know the difference between RAGAS, DeepEval, and Promptfoo — and when you would reach for each?
- Can you describe the 2025 PRISM finding and what it implies about red-teaming cadence?