Advanced

RAG Evaluation

RAG systems have two failure surfaces: retrieval (wrong chunks returned) and generation (Claude misrepresents retrieved content). Evaluation must cover both. This page covers the core metrics, the RAGAS framework, and how to use Claude itself as an evaluator at scale.

The Three Core Metrics

Faithfulness

Does the generated answer contain only claims that are supported by the retrieved context? Faithfulness is the inverse of hallucination rate: a faithful answer makes no claims that go beyond what the retrieved chunks say, so a lower score means more unsupported claims.

Score range: 0–1 (1 = every claim in the answer is supported by the retrieved context)
Failure mode: Claude elaborates or fills gaps using its parametric knowledge when the retrieved content is incomplete
Fix: Add "Do not include information not present in the provided context" to your RAG prompt
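As a worked example of the 0–1 range, one common way to arrive at a faithfulness score is to split the answer into individual claims, judge each one against the context, and take the supported fraction. The sketch below assumes the per-claim verdicts already exist (from a human or an LLM judge); `faithfulness_score` and the verdict list are illustrative names, not part of any library.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that are supported by the retrieved context."""
    if not claim_supported:
        # Assumption: an answer with no checkable claims is treated as faithful.
        return 1.0
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts for an answer that makes four claims,
# one of which does not appear in the retrieved chunks.
verdicts = [True, True, False, True]
print(faithfulness_score(verdicts))  # 0.75
```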

Answer Relevance

Does the answer actually address the user's question? A faithful answer that does not answer the question is still a failure.

Score range: 0–1 (1 = answer directly and completely addresses the question)
Failure mode: The retriever returns tangentially related content, and Claude generates a technically accurate but off-topic answer from it
Fix: Improve retrieval quality; add a question-answer alignment check to your prompt
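The prompt-level fixes mentioned so far (restricting the answer to the provided context, and checking that the answer addresses the question) can be combined in one prompt template. This is a minimal sketch: `build_rag_prompt` is a hypothetical helper, and the exact instruction wording is an assumption that you would tune for your own system.

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a RAG prompt with grounding and alignment instructions."""
    # Number the chunks so the model (and a later reviewer) can refer to them.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n"
        "Do not include information not present in the provided context.\n"
        "Before answering, check that your answer addresses the question that was asked. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_rag_prompt(
    "What is the refund policy?",
    ["Policy doc chunk 1", "Policy doc chunk 2"],
)
print(prompt)
```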

Context Relevance (Retrieval Quality)

Was the right content retrieved? Context relevance measures what fraction of the retrieved chunks actually contain information needed to answer the question.

Score range: 0–1 (1 = every retrieved chunk is relevant to answering the question)
Failure mode: Top-k retrieval returns many chunks that are semantically nearby but not actually useful
Fix: Use a smaller k, add a reranking step, improve chunking, or switch to hybrid search
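To make the definition concrete, the score can be computed as the relevant fraction of the retrieved chunks. The sketch below assumes each chunk has already been labelled relevant or not; `context_relevance` and the labels are illustrative. It also shows why a smaller k helps only when the relevant chunks are ranked first.

```python
def context_relevance(chunk_is_relevant: list[bool]) -> float:
    """Fraction of retrieved chunks that are relevant to the question."""
    if not chunk_is_relevant:
        # Assumption: retrieving nothing counts as a score of 0.
        return 0.0
    return sum(chunk_is_relevant) / len(chunk_is_relevant)


# Hypothetical relevance labels for five retrieved chunks, in rank order.
labels = [True, True, False, False, False]

print(context_relevance(labels))      # 0.4 with k=5
print(context_relevance(labels[:2]))  # 1.0 with k=2
```

Cutting k also risks dropping a relevant chunk that ranks low, so it is worth checking the other two metrics after changing it.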

RAGAS: Automated RAG Evaluation

RAGAS (Retrieval Augmented Generation Assessment) is an open-source Python framework that computes all three metrics automatically using an LLM judge. It takes a dataset of questions, retrieved contexts, and generated answers and produces scores for each metric.

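# Note: metric names and expected column names have changed between RAGAS
# releases (context_relevancy in particular is not available in every version),
# so check these imports against the version you have installed.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset

# Your evaluation dataset: one entry per question, aligned by index
data = {
    "question": ["What is the refund policy?", "How do I reset my password?"],
    "answer": ["Refunds are available within 30 days...", "Go to Settings > Security..."],
    # One list of retrieved chunks per question
    "contexts": [
        ["Policy doc chunk 1", "Policy doc chunk 2"],
        ["Help article chunk 1"]
    ],
    "ground_truth": ["30-day refund window", "Settings > Security > Reset Password"]
}

dataset = Dataset.from_dict(data)
# Runs the LLM judge over every row; requires credentials for the judge model
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_relevancy])
print(result)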

RAGAS uses an LLM judge (for example GPT-4 or Claude, depending on how you configure it) to evaluate each metric. This means evaluation costs tokens, but it produces scores that approximate human judgement without manual labelling.

LLM-as-Judge: Using Claude to Score RAG Outputs

For production systems, you can implement your own LLM-as-judge evaluation that runs on each query log to monitor quality over time:

import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

def evaluate_faithfulness(question: str, context: str, answer: str) -> float:
    """Ask Claude to score whether the answer is grounded in the context."""
    prompt = f"""Rate how faithful the answer is to the provided context.
Score 1 if every claim in the answer is supported by the context.
Score 0 if none of the claims in the answer are supported by the context.
Score between 0 and 1 for partial faithfulness.

Context: {context}
Question: {question}
Answer: {answer}

Respond with only a number between 0 and 1."""

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    # Raises ValueError if the reply is not a bare number.
    return float(response.content[0].text.strip())

Use a fast, inexpensive model (Haiku) for automated evaluation: when you are scoring every entry in a query log, speed and cost usually matter more than marginal gains in judge quality.
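The judge function above converts the reply with `float()`, which raises an error if the model returns anything other than a bare number. When running over large query logs, it may be safer to parse defensively and skip unusable replies. The helper below is a sketch of one way to do that; `parse_score` is a hypothetical name, and returning `None` for out-of-range values is an assumption about how you want to handle them.

```python
import re
from typing import Optional


def parse_score(raw: str) -> Optional[float]:
    """Extract a score in [0, 1] from a judge reply, or None if unusable."""
    # Take the first number in the reply, allowing forms like "1", "0.8", ".5".
    match = re.search(r"-?\d*\.?\d+", raw)
    if match is None:
        return None
    value = float(match.group())
    # Assumption: out-of-range values are discarded rather than clamped.
    return value if 0.0 <= value <= 1.0 else None


print(parse_score("0.8"))           # 0.8
print(parse_score("Score: 0.75"))   # 0.75
print(parse_score("I cannot rate")) # None
```

Counting how often `None` comes back is itself a useful signal: a rising rate suggests the judge prompt needs tightening.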

Building an Evaluation Dataset

A good evaluation dataset is the foundation of meaningful RAG metrics:

Question diversity: Cover different question types — factual, comparison, multi-hop, edge cases, and "not in the knowledge base" questions
Ground truth answers: Write expected answers for each question manually or use domain experts
Size: 50–200 questions is sufficient for most systems; focus on coverage, not quantity
Update regularly: Add questions that previously failed; include questions from real user logs
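One way to keep the question-diversity goal honest is to tag each example with its question type and check coverage automatically. This is a sketch only: the `EvalExample` structure, the type labels, and `missing_types` are illustrative choices, not a required format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalExample:
    question: str
    ground_truth: str
    question_type: str  # one of REQUIRED_TYPES


# Hypothetical labels mirroring the question types listed above.
REQUIRED_TYPES = {"factual", "comparison", "multi_hop", "edge_case", "not_in_kb"}


def missing_types(examples: list[EvalExample]) -> set[str]:
    """Return the question types that have no examples yet."""
    counts = Counter(ex.question_type for ex in examples)
    return {t for t in REQUIRED_TYPES if counts[t] == 0}


examples = [
    EvalExample("What is the refund policy?", "30-day refund window", "factual"),
    EvalExample("How do I reset my password?", "Settings > Security > Reset Password", "factual"),
]
print(sorted(missing_types(examples)))
```

Here both examples are factual, so the check reports the other four types as gaps to fill.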

What to Measure and When

During development: Measure context relevance — diagnose retrieval quality before worrying about generation quality
Before launch: Full RAGAS evaluation on your gold dataset — establish baseline scores
In production: Sample-based LLM-as-judge evaluation on real queries — detect regressions without manual review
After changes: Re-run the full evaluation after any change to chunking strategy, embedding model, retrieval k, or prompt — changes that improve one metric can degrade another
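Because a change can improve one metric while degrading another, it helps to compare every metric against the baseline after each run rather than watching a single number. The sketch below assumes scores are stored as plain dictionaries; `find_regressions`, the tolerance of 0.02, and the example scores are all made up for illustration.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return metrics whose score dropped by more than `tolerance`, with the drop."""
    regressions = {}
    for metric, base in baseline.items():
        # A metric missing from the current run is treated as a score of 0.
        drop = base - current.get(metric, 0.0)
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


# Hypothetical scores before and after a chunking change.
baseline = {"faithfulness": 0.92, "answer_relevance": 0.88, "context_relevance": 0.70}
after_change = {"faithfulness": 0.85, "answer_relevance": 0.89, "context_relevance": 0.78}

print(find_regressions(baseline, after_change))  # {'faithfulness': 0.07}
```

In this made-up run, context relevance improved while faithfulness regressed, which is the kind of trade-off a full re-evaluation is meant to surface.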

Checklist: Do You Understand This?

Three metrics: faithfulness (is every claim grounded in the context?), answer relevance (answers the question?), context relevance (right chunks retrieved?)
RAGAS: open-source framework that automates all three metrics using an LLM judge
LLM-as-judge: run Claude (Haiku for speed) as a quality scorer on production query logs
Evaluation dataset: 50–200 manually written question/answer pairs covering diverse question types
Measure context relevance first (retrieval quality) — if retrieval is wrong, generation quality doesn't matter