RAG Evaluation
A RAG pipeline with no evaluation is a pipeline you cannot improve. You are flying blind — unable to tell whether a retrieval change helped, whether a prompt edit hurt faithfulness, or whether your system handles edge cases. This page builds a complete evaluation strategy: the right metrics at each pipeline stage, how to score them without manual annotation, how to build a useful test set, and which tools to use from local development through to production monitoring.
What to Measure — The Two Subsystems
RAG has two distinct subsystems that can fail independently. Evaluation must cover both — measuring only the final answer misses failures that are fixable at the retrieval layer.
| Subsystem | What can go wrong | Metrics to measure it |
|---|---|---|
| Retrieval | Wrong chunks returned, relevant chunks missed, too much noise | Context Precision, Context Recall, MRR, NDCG |
| Generation | Answer not grounded in context (hallucination), answer misses the question, answer wrong | Faithfulness, Answer Relevancy, Answer Correctness |
Core Metrics
Faithfulness (Generation)
Every factual claim in the generated answer must be supported by the retrieved context. Score = (claims supported by context) / (total claims in answer). A score of 1.0 means the answer is fully grounded; 0.0 means every statement was invented.
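Once a judge has produced a per-claim verdict list, the score itself is a simple ratio. A minimal sketch (the verdict list is assumed to come from an LLM judge that extracts claims from the answer and checks each against the context):

```python
def faithfulness(claim_verdicts: list[bool]) -> float:
    """Fraction of claims in the answer supported by the retrieved context."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Judge found 3 of 4 claims supported by the context:
score = faithfulness([True, True, True, False])  # 0.75
```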
Answer Relevancy (Generation)
Does the answer actually address the user's question? A faithful answer can still be irrelevant if it talks about related-but-wrong content. Score = semantic similarity between the answer and the original question.
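The similarity core is plain cosine similarity over embeddings. A sketch with hypothetical embedding vectors (in RAGAS's implementation, questions are regenerated from the answer and compared to the original question, but the cosine step is the same):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# In practice these would come from an embedding model (hypothetical values here):
question_vec = [0.20, 0.80, 0.10]
answer_vec = [0.25, 0.75, 0.05]
relevancy = cosine_similarity(question_vec, answer_vec)
```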
Context Precision (Retrieval)
Of all the chunks retrieved, what proportion are actually relevant to the question? Score = (relevant chunks retrieved) / (total chunks retrieved). High noise in retrieval leads to low precision — the LLM gets confused by irrelevant content.
Context Recall (Retrieval)
Did retrieval surface all the information needed to answer the question? Score = (relevant chunks retrieved) / (total relevant chunks that exist in the corpus). A score below 1.0 means some required information was not retrieved — the LLM cannot answer correctly even if it wanted to.
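Both retrieval metrics follow directly from the definitions, assuming relevance labels exist (from human annotation or an LLM judge):

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Proportion of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Proportion of all relevant chunks that were retrieved."""
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    found = set(retrieved)
    return sum(1 for c in relevant if c in found) / len(relevant)

retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c3", "c5"}
context_precision(retrieved, relevant)  # 2/4 = 0.5
context_recall(retrieved, relevant)     # 2/3, c5 was missed
```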
Answer Correctness (End-to-End)
The composite metric: is the final answer factually correct relative to a ground truth answer? Combines faithfulness and semantic similarity to the reference. Requires a labelled dataset with reference answers — the most expensive metric to compute, but the one that maps most directly to user experience.
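The combination is typically a weighted blend. A sketch with illustrative weights (RAGAS uses a similar 0.75/0.25 split by default, but the weights are tunable):

```python
def answer_correctness(factual_f1: float, semantic_sim: float,
                       w_factual: float = 0.75, w_semantic: float = 0.25) -> float:
    """Blend factual overlap with the reference answer (an F1 over matched
    claims) and embedding similarity to the reference. Weights are tunable."""
    return w_factual * factual_f1 + w_semantic * semantic_sim

answer_correctness(factual_f1=0.8, semantic_sim=0.4)  # 0.75*0.8 + 0.25*0.4 = 0.7
```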
Metric Quick Reference
| Metric | Subsystem | Needs ground truth? | Start with? |
|---|---|---|---|
| Faithfulness | Generation | No (reference-free) | ✓ Yes |
| Answer Relevancy | Generation | No (reference-free) | ✓ Yes |
| Context Precision | Retrieval | Partial (relevance labels or LLM judge) | ✓ Yes |
| Context Recall | Retrieval | Yes (reference answer) | Later |
| Answer Correctness | End-to-end | Yes (reference answer) | Later |
LLM-as-Judge
Most RAG metrics require a judge to decide "is this claim supported by the context?" or "is this chunk relevant?". Human annotation is accurate but expensive and slow. The dominant 2025 approach is LLM-as-judge: use a capable LLM (GPT-4o, Claude 3.7) to perform the scoring, prompted with a structured rubric.
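The mechanics reduce to a structured prompt plus deterministic parsing of the judge's output. A minimal sketch (the prompt template and parser are hypothetical; the judge call itself, e.g. an OpenAI or Anthropic API request, is omitted):

```python
JUDGE_PROMPT = """You are grading whether each claim is supported by the context.

Context:
{context}

Claims:
{claims}

For each claim, answer only "yes" or "no", one per line."""

def build_prompt(context: str, claims: list[str]) -> str:
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(claims))
    return JUDGE_PROMPT.format(context=context, claims=numbered)

def parse_verdicts(judge_output: str) -> list[bool]:
    """Turn the judge's raw text reply into per-claim booleans."""
    return [line.strip().lower().startswith("yes")
            for line in judge_output.splitlines() if line.strip()]

# Feeding the judge's reply to the parser yields inputs for the metric formulas:
parse_verdicts("yes\nno\nYes")  # [True, False, True]
```

Constraining the judge to a rigid output format (one "yes"/"no" per line) is what makes the score reproducible enough to track across runs.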
Building a Test Set
Evaluation is only as good as the test questions. A poor test set produces misleading scores — the system looks great on easy questions while failing on the hard ones that matter to users.
What to include
Size guidance
Synthetic test set generation
RAGAS ships a TestsetGenerator that takes your corpus documents and automatically generates question/answer pairs by having an LLM read chunks and write questions. Useful for bootstrapping a test set quickly. Caveat: synthetic questions tend to be cleaner and easier than real user questions — supplement with adversarial and real-query examples.
Evaluation Frameworks
RAGAS — Reference-Free RAG Evaluation
The most widely adopted framework for RAG-specific evaluation. Open source (MIT). Provides Faithfulness, Answer Relevancy, Context Precision, Context Recall, and Answer Correctness out of the box. All metrics are implemented as LLM-as-judge pipelines — no labelled dataset required for the core four metrics.
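A sketch of the evaluation data RAGAS expects (the column names follow its documented schema; the values are made up). The `evaluate()` call itself is shown commented because it requires a configured judge LLM, e.g. an OpenAI API key:

```python
# One row per test question; "contexts" is a list of retrieved chunks per question.
rows = {
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our refund policy allows returns within 30 days."]],
    "ground_truth": ["30 days."],  # optional; needed for Context Recall / Answer Correctness
}

# Requires a judge LLM to be configured:
# from datasets import Dataset
# from ragas import evaluate
# from ragas.metrics import faithfulness, answer_relevancy, context_precision
# result = evaluate(Dataset.from_dict(rows),
#                   metrics=[faithfulness, answer_relevancy, context_precision])
```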
Install with pip install ragas. Evaluation expects a dataset with columns: question, answer, contexts (a list of retrieved chunks per question), and optionally ground_truth.
Arize Phoenix — Observability + Evaluation
Open-source AI observability platform built on OpenTelemetry. Instruments your pipeline automatically (LangChain, LlamaIndex, OpenAI SDK) and captures every retrieval call, chunk, prompt, and response as a traced span. Includes built-in LLM-as-judge evaluators for RAG metrics.
LangSmith — Tracing + Dataset Management
LangChain's hosted observability and evaluation platform. Traces every chain/agent run, allows you to save runs as test examples, build datasets, and run evaluation experiments comparing pipeline versions side-by-side.
DeepEval — Testing Framework for LLMs
Open-source evaluation framework designed to feel like pytest for LLMs. Write evaluation assertions as unit tests — integrates into CI/CD pipelines naturally. Supports RAG-specific metrics including Faithfulness, Contextual Precision, and Contextual Recall, plus hallucination detection and custom metrics.
Evaluation Workflow — Development to Production
Phase 1: Development (Local)
Build a small, versioned test set: use the RAGAS TestsetGenerator to bootstrap, then manually add unanswerable and adversarial cases.
Phase 2: CI/CD Gates
Phase 3: Production Monitoring
What Works Well
Start with the reference-free metrics
Faithfulness and Answer Relevancy require no labelled data, can be computed on any query, and give immediate signal. Do not delay evaluation because you "don't have a labelled dataset" — start with these today.
Trace-level debugging beats aggregate scores
An aggregate Faithfulness of 0.75 tells you there's a problem. Looking at the specific low-faithfulness traces tells you what to fix. Always dig into the failing examples — patterns emerge quickly (e.g. "all faithfulness failures involve tables" → fix table parsing).
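The triage step is mechanical once per-trace scores exist. A sketch (the trace dicts and threshold are illustrative, whatever your tooling exports):

```python
def failing_traces(traces: list[dict], metric: str = "faithfulness",
                   threshold: float = 0.7) -> list[dict]:
    """Return traces scoring below the threshold, worst first, for manual review."""
    bad = [t for t in traces if t[metric] < threshold]
    return sorted(bad, key=lambda t: t[metric])

traces = [
    {"question": "q1", "faithfulness": 0.95},
    {"question": "q2", "faithfulness": 0.40},
    {"question": "q3", "faithfulness": 0.65},
]
[t["question"] for t in failing_traces(traces)]  # ["q2", "q3"]
```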
Evaluate retrieval and generation separately
Separating retrieval metrics from generation metrics makes it much faster to identify where to invest. A low overall score could mean excellent generation on bad retrieval results — fixing retrieval will show a much larger improvement than tuning the generation prompt.
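As a crude triage rule, the two subsystem scores point at where to invest (the thresholds here are illustrative, not a standard):

```python
def diagnose(context_recall: float, faithfulness: float) -> str:
    """Low recall means the answer's raw material never arrived, so fix
    retrieval first; low faithfulness with good recall points at generation."""
    if context_recall < 0.8:
        return "fix retrieval"
    if faithfulness < 0.8:
        return "fix generation"
    return "ship it"

diagnose(context_recall=0.5, faithfulness=0.95)  # "fix retrieval"
```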
Unanswerable questions are your most important test cases
Most evaluation focuses on answerable questions. But the failure mode that most damages user trust is confident hallucination on out-of-scope questions. Include at least 20% unanswerable questions in your test set and measure the "correct abstention" rate explicitly.
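Measuring abstention can start very simply. A sketch using keyword markers (a crude stand-in — production systems typically use an LLM judge to classify refusals; the marker strings are illustrative):

```python
REFUSAL_MARKERS = ("i don't know", "not in the provided", "cannot answer")

def is_abstention(answer: str) -> bool:
    a = answer.lower()
    return any(m in a for m in REFUSAL_MARKERS)

def correct_abstention_rate(answers: list[str]) -> float:
    """Of the answers to known-unanswerable questions, how often did
    the system correctly decline?"""
    if not answers:
        return 0.0
    return sum(is_abstention(a) for a in answers) / len(answers)

# Answers to three known-unanswerable questions:
correct_abstention_rate([
    "I don't know based on the provided documents.",
    "The warranty lasts five years.",   # confident hallucination
    "That information is not in the provided context.",
])  # 2/3
```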
Failure Modes in Evaluation Itself
Regenerating the test set between runs
If you regenerate your test set (e.g. with RAGAS TestsetGenerator) after every pipeline change, you are comparing apples to oranges. Scores cannot be compared across runs if the questions change. Fix and version your test set. Treat adding new questions as an explicit, deliberate action.
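One lightweight way to enforce this is to stamp the test set with a content hash, so every evaluation run can record exactly which version of the questions it scored. A sketch (file layout and field names are assumptions):

```python
import hashlib
import json

def save_testset(questions: list[dict], path: str) -> str:
    """Write the test set with a content hash; any change to the questions
    changes the version, making cross-run comparisons auditable."""
    payload = json.dumps(questions, sort_keys=True, ensure_ascii=False)
    version = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"version": version, "questions": questions}, f, indent=2)
    return version
```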
Optimising for the metric instead of the outcome
High Faithfulness can be achieved by making the system very conservative — only repeating exact phrases from the context. But users want coherent, synthesised answers. Calibrate LLM judge prompts against real user satisfaction, not just technical faithfulness, to avoid gaming.
Only evaluating the happy path
If your test set contains only clean, well-formed questions whose answers are clearly in the corpus, your scores will be optimistically high. Real user queries are messy, ambiguous, and sometimes unanswerable. Deliberately include hard cases.
No evaluation in production
Offline evaluation catches regressions but misses distribution shift — the real query patterns that emerge after launch are always different from your test set. A system with no production evaluation will degrade silently as user behaviour evolves or the knowledge base updates.
2025–2026 Developments
RAGAS v0.2 — multi-turn and agentic evaluation (2025)
RAGAS extended its framework to support multi-turn conversations (not just single Q&A pairs) and agentic RAG pipelines where the agent decides what to retrieve. Metrics now include agent-specific dimensions like tool selection accuracy and multi-hop coherence. The async evaluation engine significantly reduces wall-clock time for large test sets.
Giskard emerges as enterprise alternative (2025)
Giskard gained traction as an enterprise RAG evaluation platform with automated vulnerability scanning — it generates adversarial test cases targeting prompt injection, hallucination, and off-topic responses. Unlike RAGAS (quality metrics), Giskard focuses on safety and robustness testing, making the two complementary.
Evaluation standardisation across platforms
By 2025, Faithfulness, Context Precision, Context Recall, and Answer Relevancy had become de facto standard metric names across Arize Phoenix, LangSmith, Datadog LLM Observability, and major cloud platforms (AWS Bedrock evaluation, Azure AI evaluation). The RAGAS metric definitions are effectively the industry standard, even if implementations vary slightly.
Checklist: Do You Understand This?
- Can you name the five core RAG metrics and which subsystem (retrieval vs. generation) each measures?
- Which two metrics are reference-free (no ground truth needed)? Which two require a reference answer?
- Can you explain how LLM-as-judge works for Faithfulness scoring, and what calibration means in this context?
- What five types of questions should a good test set include?
- Can you describe the three phases of evaluation maturity (local → CI/CD → production)?
- What is the "regenerating test set" failure mode and why is it dangerous?
- If Faithfulness is high but Context Recall is low, what does that tell you about where the failure is?
- Can you name three evaluation frameworks and describe when you would use each?