🧠 All Things AI
Intermediate

RAG Evaluation

A RAG pipeline with no evaluation is a pipeline you cannot improve. You are flying blind — unable to tell whether a retrieval change helped, whether a prompt edit hurt faithfulness, or whether your system handles edge cases. This page builds a complete evaluation strategy: the right metrics at each pipeline stage, how to score them without manual annotation, how to build a useful test set, and which tools to use from local development through to production monitoring.

What to Measure — The Two Subsystems

RAG has two distinct subsystems that can fail independently. Evaluation must cover both — measuring only the final answer misses failures that are fixable at the retrieval layer.

Subsystem | What can go wrong | Metrics to measure it
Retrieval | Wrong chunks returned, relevant chunks missed, too much noise | Context Precision, Context Recall, MRR, NDCG
Generation | Answer not grounded in context (hallucination), answer misses the question, wrong answer | Faithfulness, Answer Relevancy, Answer Correctness

Core Metrics

Faithfulness (Generation)

Every factual claim in the generated answer must be supported by the retrieved context. Score = (claims supported by context) / (total claims in answer). A score of 1.0 means the answer is fully grounded; 0.0 means every statement was invented.

What causes low faithfulness: LLM overrides context with training knowledge ("parametric override"), hallucinated details not in any retrieved chunk, partial context causing the LLM to fill in gaps
Fix direction: Stronger grounding instruction in the system prompt; fewer but higher-precision chunks; check whether the answer appears in the retrieved context at all (→ retrieval problem, not generation)
How it's scored: LLM-as-judge: decompose answer into atomic claims, check each against context chunks
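Once a judge has labelled each atomic claim, the scoring step is simple arithmetic. A minimal sketch (the claim decomposition and the judge call itself are assumed to happen elsewhere):

```python
def faithfulness_score(claim_verdicts):
    """claim_verdicts: one boolean per atomic claim in the answer,
    True if the judge found the claim supported by the retrieved context."""
    if not claim_verdicts:
        # An answer with no factual claims scores zero by convention here.
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)
```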

Answer Relevancy (Generation)

Does the answer actually address the user's question? A faithful answer can still be irrelevant if it talks about related-but-wrong content. Score = semantic similarity between the answer and the original question.

What causes low relevancy: Retrieved chunks are about a related topic but not the specific question; LLM hedges excessively; system prompt constraints cause topic avoidance
How it's scored: RAGAS generates reverse questions from the answer and checks if they match the original question — no ground truth answer needed
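The scoring step of this reverse-question approach can be sketched as a mean cosine similarity. The reverse questions are assumed to have been generated and embedded already; vectors are passed in directly:

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def answer_relevancy(question_vec, reverse_question_vecs):
    """Mean cosine similarity between the original question's embedding and
    the embeddings of LLM-generated reverse questions."""
    sims = [cosine(question_vec, v) for v in reverse_question_vecs]
    return sum(sims) / len(sims)
```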

Context Precision (Retrieval)

Of all the chunks retrieved, what proportion are actually relevant to the question? Score = (relevant chunks retrieved) / (total chunks retrieved). High noise in retrieval leads to low precision — the LLM gets confused by irrelevant content.

What causes low precision: k too large; poor embedding model; no metadata filtering; query too broad
Fix direction: Reduce k; add metadata pre-filters; use hybrid search; add reranker to drop irrelevant chunks
How it's scored: Requires knowing which chunks are relevant (ground truth labels) or LLM-as-judge for relevance

Context Recall (Retrieval)

Did retrieval surface all the information needed to answer the question? Score = (relevant chunks retrieved) / (total relevant chunks that exist in the corpus). A score below 1.0 means some required information was not retrieved — the LLM cannot answer correctly even if it wanted to.

What causes low recall: k too small; embedding mismatch (vocabulary gap); answer spans multiple chunks that were not all retrieved; metadata filter too strict
Fix direction: Increase k; use hybrid search (BM25 + vector); widen metadata filters; check chunking strategy for split answers
How it's scored: Requires a ground truth answer — RAGAS checks if all statements in the ground truth appear in retrieved context
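Given relevance labels per chunk (from ground truth or an LLM judge), both retrieval metrics reduce to set arithmetic. A minimal sketch over chunk ids; note that RAGAS's Context Precision additionally weights by rank, which this sketch omits:

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for c in retrieved_ids if c in relevant) / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of all relevant chunks in the corpus that were retrieved."""
    if not relevant_ids:
        return 1.0  # nothing was needed, so retrieval is trivially complete
    retrieved = set(retrieved_ids)
    return sum(1 for c in relevant_ids if c in retrieved) / len(relevant_ids)
```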

Answer Correctness (End-to-End)

The composite metric: is the final answer factually correct relative to a ground truth answer? Combines faithfulness and semantic similarity to the reference. Requires a labelled dataset with reference answers; it is the most expensive metric to compute, but the one that maps most directly to user experience.

When to use: For high-stakes deployments where you have the budget to maintain a labelled eval set; for regression testing after major pipeline changes
When to skip: Early development — reference-free metrics (Faithfulness, Context Precision) give faster feedback loops with no labelling cost

Metric Quick Reference

Metric | Subsystem | Needs ground truth? | Start with?
Faithfulness | Generation | No (reference-free) | ✓ Yes
Answer Relevancy | Generation | No (reference-free) | ✓ Yes
Context Precision | Retrieval | Partial (relevance labels or LLM judge) | ✓ Yes
Context Recall | Retrieval | Yes (reference answer) | Later
Answer Correctness | End-to-end | Yes (reference answer) | Later

LLM-as-Judge

Most RAG metrics require a judge to decide "is this claim supported by the context?" or "is this chunk relevant?". Human annotation is accurate but expensive and slow. The dominant 2025 approach is LLM-as-judge: use a capable LLM (GPT-4o, Claude 3.7) to perform the scoring, prompted with a structured rubric.

How it works: For each metric, the judge receives a structured prompt: the question, the retrieved context, the generated answer, and a rubric. It returns a score (0–1 or a category) and an explanation. The explanation is as important as the score — it lets you diagnose the failure.
Model choice for judge: Use a model one tier above your application model. If your RAG uses Claude 3.5 Haiku, judge with Claude 3.7 Sonnet or GPT-4o. Self-evaluation (using the same model) has a known bias toward rating its own outputs highly.
Calibration: Before trusting LLM judge scores, calibrate against 20–50 human-labelled examples. If judge accuracy on the calibration set is below ~85%, your rubric or model is too weak for the task.
Cost note: LLM-as-judge multiplies your evaluation token cost. For a test set of 200 questions with 5 chunks per question, a faithfulness evaluation adds on the order of 200 × 5 = 1,000 judge calls. Budget accordingly or batch with a cheaper judge model.
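The two halves of a judge call are prompt assembly and verdict parsing. A sketch of both; the rubric wording here is illustrative, not an official RAGAS or vendor prompt:

```python
def build_faithfulness_prompt(question, contexts, claim):
    """Assemble a structured judge prompt for one atomic claim."""
    ctx = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "You are grading whether a claim is supported by the context.\n"
        f"Question: {question}\n"
        f"Context:\n{ctx}\n"
        f"Claim: {claim}\n"
        "Answer with SUPPORTED or NOT_SUPPORTED, then one sentence of explanation."
    )

def parse_verdict(judge_reply):
    """True if the judge's reply begins with SUPPORTED."""
    return judge_reply.strip().upper().startswith("SUPPORTED")
```

Keeping the explanation sentence in the reply (rather than asking for a bare label) is what makes failures diagnosable later.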

Building a Test Set

Evaluation is only as good as the test questions. A poor test set produces misleading scores — the system looks great on easy questions while failing on the hard ones that matter to users.

What to include

Answerable questions — Questions whose answers are clearly in the knowledge base. These test retrieval and generation quality when the system should succeed.
Unanswerable questions — Questions whose answers are not in the knowledge base. These test whether the system correctly says "I don't know" rather than hallucinating. Often the most important category for production quality.
Multi-hop questions — Questions that require combining information from 2+ separate chunks. Test whether retrieval and synthesis both work for complex queries.
Adversarial questions — Questions designed to trigger known failure modes: vocabulary mismatch, recently updated documents, ambiguous terms, very short factual lookups vs. long explanatory answers.
Real user queries — Sampled from actual user logs once the system is live. Real queries are always the most important — they reveal failure modes you would not think to construct synthetically.

Size guidance

Early development: 50–100 questions is enough to detect major regressions. Start here — perfect is the enemy of useful.
Pre-launch: 200–500 questions covering all question types and document categories. Enough to detect 5–10% regressions reliably.
Production: Continuously augmented with sampled real queries, auto-labelled by LLM judge, with periodic human review of flagged cases.

Synthetic test set generation

RAGAS provides TestsetGenerator — it takes your corpus documents and automatically generates question/answer pairs by having an LLM read chunks and write questions. Useful for bootstrapping a test set quickly. Caveat: synthetic questions tend to be cleaner and easier than real user questions — supplement with adversarial and real-query examples.

Evaluation Frameworks

RAGAS — Reference-Free RAG Evaluation

The most widely adopted framework for RAG-specific evaluation. Open source (MIT). Provides Faithfulness, Answer Relevancy, Context Precision, Context Recall, and Answer Correctness out of the box. All metrics are implemented as LLM-as-judge pipelines; the reference-free metrics (Faithfulness, Answer Relevancy, judge-scored Context Precision) need no labelled dataset, while Context Recall and Answer Correctness require reference answers.

Install: pip install ragas
Input: Dataset with columns: question, answer, contexts (list of retrieved chunks), optionally ground_truth
Output: Per-question and aggregate scores for each metric
2025 update: RAGAS v0.2+ supports custom LLM judges, async evaluation, and multi-turn conversation evaluation
Best for: Local development, CI/CD gates, fast iteration without infrastructure setup

Arize Phoenix — Observability + Evaluation

Open-source AI observability platform built on OpenTelemetry. Instruments your pipeline automatically (LangChain, LlamaIndex, OpenAI SDK) and captures every retrieval call, chunk, prompt, and response as a traced span. Includes built-in LLM-as-judge evaluators for RAG metrics.

Key advantage: Trace-level visibility — see exactly which chunks were retrieved for every query, not just aggregate scores
Deployment: Self-hosted (Docker) or Arize Cloud; free self-hosted tier
Best for: Production monitoring, debugging specific failed queries, teams who want observability and evaluation unified

LangSmith — Tracing + Dataset Management

LangChain's hosted observability and evaluation platform. Traces every chain/agent run, allows you to save runs as test examples, build datasets, and run evaluation experiments comparing pipeline versions side-by-side.

Key advantage: Dataset management — curate test sets directly from production traces; run A/B comparisons between pipeline versions
Pricing: Free tier (5K traces/month), Developer ($39/month), Plus (custom)
Best for: Teams using LangChain/LangGraph; systematic A/B testing of retrieval or prompt changes

DeepEval — Testing Framework for LLMs

Open-source evaluation framework designed to feel like pytest for LLMs. Write evaluation assertions as unit tests — integrates into CI/CD pipelines naturally. Supports RAG-specific metrics including Faithfulness, Contextual Precision, and Contextual Recall, plus hallucination detection and custom metrics.

Key advantage: CI/CD native — evaluation runs as part of your test suite before each deployment
Best for: Engineering teams who want evaluation gated in the CI pipeline, not just manual runs

Evaluation Workflow — Development to Production

Phase 1: Development (Local)

1. Build a 50-question test set — Mix answerable, unanswerable, and multi-hop. Use RAGAS TestsetGenerator to bootstrap, then manually add unanswerable and adversarial cases.
2. Run RAGAS baseline — Score your current pipeline on all five core metrics. Record the baseline. Every future change is measured as a delta from this.
3. Identify the weakest metric — Fix the worst metric first. If Faithfulness is 0.6, work on grounding before worrying about Answer Relevancy at 0.85. Optimise the bottleneck.
4. Isolate subsystem failures — Low Context Recall → retrieval problem. Low Faithfulness with high Context Recall → generation/prompting problem. Never tune both at once.
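The isolation logic in steps 3–4 can be sketched as a small triage function; the 0.8 threshold is illustrative, not a recommendation:

```python
def diagnose(scores, threshold=0.8):
    """Map metric scores to the subsystem to fix first."""
    if scores.get("context_recall", 1.0) < threshold:
        # The needed information never reached the LLM: fix retrieval first,
        # regardless of what the generation metrics say.
        return "retrieval"
    if scores.get("faithfulness", 1.0) < threshold:
        # Context was there but the answer drifted from it: a grounding /
        # prompting problem.
        return "generation"
    return "ok"
```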

Phase 2: CI/CD Gates

Set regression thresholds: Define minimum acceptable scores for each metric — e.g. Faithfulness ≥ 0.80, Context Precision ≥ 0.75. Block deployment if any threshold is breached.
Fast eval set for CI: Keep a 20–50 question "smoke test" set that runs in CI (not the full 200-question suite). Full evaluation runs nightly or on release branches only.
Freeze the test set: Never re-generate or modify the test set between CI runs. If you regenerate it, your baseline scores are invalid. Treat the test set like a database migration — changes are versioned and deliberate.
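A regression gate of this kind is a few lines in any CI script. A sketch using the example thresholds above:

```python
THRESHOLDS = {"faithfulness": 0.80, "context_precision": 0.75}

def regression_gate(scores, thresholds=THRESHOLDS):
    """Return the metrics that breach their minimum; an empty list means
    the deployment may proceed."""
    return [m for m, floor in thresholds.items() if scores.get(m, 0.0) < floor]
```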

Phase 3: Production Monitoring

Instrument with OpenTelemetry: Add Arize Phoenix or LangSmith tracing. Capture question, retrieved chunks, and answer for every production request. This creates a continuous stream of real evaluation data.
Sample-evaluate production traffic: Run LLM-as-judge faithfulness scoring on a random 5–10% sample of production requests. Cheaper than evaluating everything; sufficient for trend detection.
User feedback as signal: Thumbs up/down, explicit corrections, or follow-up "that's wrong" messages are direct quality signals. Log them alongside the trace — they are your highest-signal evaluation data.
Augment test set with production failures: When a production trace is flagged (low LLM judge score, user thumbs-down), curate it into your eval set after adding a reference answer. This closes the feedback loop — real failures drive future evaluation.
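For the sampled scoring above, hashing the trace id (rather than calling a random generator) makes the sampling decision deterministic: the same trace always gets the same verdict, no matter which worker handles it. A sketch:

```python
import hashlib

def should_evaluate(trace_id, rate=0.05):
    """Deterministically select roughly `rate` of traces for LLM-judge scoring."""
    h = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return (h % 10_000) < rate * 10_000
```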

What Works Well

Start with the reference-free metrics

Faithfulness and Answer Relevancy require no labelled data, can be computed on any query, and give immediate signal. Do not delay evaluation because you "don't have a labelled dataset" — start with these today.

Trace-level debugging beats aggregate scores

An aggregate Faithfulness of 0.75 tells you there's a problem. Looking at the specific low-faithfulness traces tells you what to fix. Always dig into the failing examples — patterns emerge quickly (e.g. "all faithfulness failures involve tables" → fix table parsing).

Evaluate retrieval and generation separately

Separating retrieval metrics from generation metrics makes it much faster to identify where to invest. A low overall score could mean excellent generation on bad retrieval results — fixing retrieval will show a much larger improvement than tuning the generation prompt.

Unanswerable questions are your most important test cases

Most evaluation focuses on answerable questions. But the failure mode that most damages user trust is confident hallucination on out-of-scope questions. Include at least 20% unanswerable questions in your test set and measure the "correct abstention" rate explicitly.
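Correct abstention reduces to a counter over the unanswerable slice of the test set. A sketch, where the `system_abstained` flag is assumed to come from a keyword check or an LLM judge outside this snippet:

```python
def correct_abstention_rate(results):
    """results: (is_answerable, system_abstained) pairs for the whole test set.
    Returns the abstention rate over unanswerable questions only, or None if
    the set contains no unanswerable questions."""
    flags = [abstained for answerable, abstained in results if not answerable]
    if not flags:
        return None
    return sum(flags) / len(flags)
```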

Failure Modes in Evaluation Itself

Regenerating the test set between runs

If you regenerate your test set (e.g. with RAGAS TestsetGenerator) after every pipeline change, you are comparing apples to oranges. Scores cannot be compared across runs if the questions change. Fix and version your test set. Treat adding new questions as an explicit, deliberate action.

Optimising for the metric instead of the outcome

High Faithfulness can be achieved by making the system very conservative — only repeating exact phrases from the context. But users want coherent, synthesised answers. Calibrate LLM judge prompts against real user satisfaction, not just technical faithfulness, to avoid gaming.

Only evaluating the happy path

If your test set contains only clean, well-formed questions whose answers are clearly in the corpus, your scores will be optimistically high. Real user queries are messy, ambiguous, and sometimes unanswerable. Deliberately include hard cases.

No evaluation in production

Offline evaluation catches regressions but misses distribution shift — the real query patterns that emerge after launch are always different from your test set. A system with no production evaluation will degrade silently as user behaviour evolves or the knowledge base updates.

2025–2026 Developments

RAGAS v0.2 — multi-turn and agentic evaluation (2025)

RAGAS extended its framework to support multi-turn conversations (not just single Q&A pairs) and agentic RAG pipelines where the agent decides what to retrieve. Metrics now include agent-specific dimensions like tool selection accuracy and multi-hop coherence. The async evaluation engine significantly reduces wall-clock time for large test sets.

Giskard emerges as enterprise alternative (2025)

Giskard gained traction as an enterprise RAG evaluation platform with automated vulnerability scanning — it generates adversarial test cases targeting prompt injection, hallucination, and off-topic responses. Unlike RAGAS (quality metrics), Giskard focuses on safety and robustness testing, making the two complementary.

Evaluation standardisation across platforms

By 2025, Faithfulness, Context Precision, Context Recall, and Answer Relevancy had become de facto standard metric names across Arize Phoenix, LangSmith, Datadog LLM Observability, and major cloud platforms (AWS Bedrock evaluation, Azure AI evaluation). The RAGAS metric definitions are effectively the industry standard, even if implementations vary slightly.

Checklist: Do You Understand This?

  • Can you name the five core RAG metrics and which subsystem (retrieval vs. generation) each measures?
  • Which two metrics are reference-free (no ground truth needed)? Which two require a reference answer?
  • Can you explain how LLM-as-judge works for Faithfulness scoring, and what calibration means in this context?
  • What five types of questions should a good test set include?
  • Can you describe the three phases of evaluation maturity (local → CI/CD → production)?
  • What is the "regenerating test set" failure mode and why is it dangerous?
  • If Faithfulness is high but Context Recall is low, what does that tell you about where the failure is?
  • Can you name three evaluation frameworks and describe when you would use each?