Synthetic Data for Evaluation
Building an evaluation dataset manually — collecting real user queries, labelling expected outputs, and verifying ground truth — is expensive and slow. Synthetic data generation uses LLMs to create evaluation examples automatically from your existing documents or domain knowledge. Done well, synthetic datasets are diverse, cover edge cases, and can be produced in hours rather than weeks. This page covers when synthetic data works, the main generation techniques, and the tools that automate it.
When Synthetic Data Works
Good use cases for synthetic data
- RAG system evaluation: generate questions from your document corpus that the system should be able to answer
- Edge case coverage: generate examples for scenarios that are rare in real traffic but critical to handle correctly
- Early-stage evaluation before real user data exists
- Augmenting small real datasets — real data stays the gold standard, synthetic fills gaps
- Testing multilingual or domain-specific scenarios where labellers are scarce
When synthetic data fails
- The generator model has the same failure modes as the system under test — synthetic data will not catch these
- Real user queries have a distribution the LLM cannot replicate (slang, typos, cultural context)
- Ground truth requires domain expertise the LLM lacks (legal interpretations, medical diagnosis)
- Using synthetic data as a substitute for real data in final production evaluation — always validate against real users before launch
RAGAS TestsetGenerator
RAGAS (Retrieval Augmented Generation Assessment) provides an automated testset generator specifically designed for RAG evaluation. Given a document corpus, it generates question-answer-context triples using an evolutionary paradigm that produces diverse question types.
How RAGAS TestsetGenerator works:
- Build a KnowledgeGraph from your documents — extracts entities, relationships, and key facts
- Apply transformations to enrich the graph (summaries, paraphrases, cross-document links)
- Synthesise questions using three query types:
  - AbstractQuerySynthesizer (25%): high-level conceptual questions
  - ComparativeAbstractQuerySynthesizer (25%): compare two concepts or documents
  - SpecificQuerySynthesizer (50%): factual, detail-level questions
- Generate ground-truth answers for each question using the source documents
- Output: dataset of (question, ground_truth_answer, source_context) tuples ready for evaluation
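The weighted mix of query types in the steps above can be sketched in plain Python. This is not the RAGAS API: the functions and document fields below are hypothetical stand-ins for what RAGAS builds from its KnowledgeGraph, and the sketch only illustrates how a 25/25/50 synthesizer mix produces (question, ground_truth, source_context) tuples.

```python
import random

# Hypothetical stand-ins for RAGAS's query synthesizers. Each takes a
# document dict and returns a (question, ground_truth, context) triple.
def abstract_query(doc):
    return (f"What is the main idea behind {doc['topic']}?",
            doc["summary"], doc["text"])

def comparative_query(doc):
    return (f"How does {doc['topic']} compare with related approaches?",
            doc["summary"], doc["text"])

def specific_query(doc):
    return (f"What does the document state about {doc['topic']}?",
            doc["fact"], doc["text"])

# The 25% / 25% / 50% mix described above.
SYNTHESIZERS = [
    (abstract_query, 0.25),
    (comparative_query, 0.25),
    (specific_query, 0.50),
]

def generate_testset(docs, n, seed=0):
    """Sample n (question, ground_truth, source_context) tuples from docs."""
    rng = random.Random(seed)
    funcs, weights = zip(*SYNTHESIZERS)
    testset = []
    for _ in range(n):
        synth = rng.choices(funcs, weights=weights, k=1)[0]
        question, ground_truth, context = synth(rng.choice(docs))
        testset.append({"question": question,
                        "ground_truth_answer": ground_truth,
                        "source_context": context})
    return testset
```

In the real library the synthesizers operate on knowledge-graph nodes rather than raw dicts, but the output shape is the same: a dataset of question/answer/context records ready for evaluation.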
RAGAS dataset quality
- Evolutionary generation ensures variety — not just paraphrases of the same question
- Abstract questions test reasoning; specific questions test retrieval accuracy
- Comparative questions reveal whether the system can reason across multiple documents
- Ground truth is document-grounded — provides a verifiable baseline for faithfulness evaluation
LLM-Based Test Case Generation
Beyond RAGAS, you can generate evaluation examples directly with a capable LLM. This is the right approach when your system is not RAG-based, or when you need examples that test specific behaviours not tied to a document corpus.
Prompt pattern for test case generation:
```text
You are generating evaluation test cases for [DESCRIBE SYSTEM].

Generate [N] diverse test cases that cover:
- [Category 1: e.g., simple factual questions]
- [Category 2: e.g., multi-step reasoning]
- [Category 3: e.g., edge cases with ambiguous phrasing]
- [Category 4: e.g., adversarial inputs]

For each test case return JSON:
{"input": "...", "expected_output": "...", "category": "...", "difficulty": "easy|medium|hard"}
```
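LLM output in this format needs validation before it enters your evaluation set: models occasionally emit malformed JSON, drop a key, or invent a difficulty value. A minimal parser for one-JSON-object-per-line output, assuming the schema from the prompt above, might look like this:

```python
import json

REQUIRED_KEYS = {"input", "expected_output", "category", "difficulty"}
VALID_DIFFICULTIES = {"easy", "medium", "hard"}

def parse_test_cases(raw_lines):
    """Parse LLM output, one JSON test case per line, dropping malformed cases."""
    cases = []
    for line in raw_lines:
        try:
            case = json.loads(line)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        if not REQUIRED_KEYS <= case.keys():
            continue  # missing a required field
        if case["difficulty"] not in VALID_DIFFICULTIES:
            continue  # difficulty outside the allowed enum
        cases.append(case)
    return cases
```

Silently dropping bad cases is fine for generation pipelines where you can simply ask for more; log the discard rate, since a high rate usually signals a prompt problem.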
Use a more capable model than your system under test to generate test cases: for example, GPT-4o or Claude 3.5 Sonnet generating tests for a system running on GPT-4o-mini or Claude 3.5 Haiku.
Question Type Coverage
| Question type | What it tests | Target % |
|---|---|---|
| Simple factual | Basic retrieval accuracy — can the system find the answer? | 30% |
| Multi-hop reasoning | Can the system connect facts from multiple documents? | 20% |
| Abstractive summary | Can the system synthesise information, not just retrieve it? | 20% |
| Out-of-scope | Does the system correctly decline questions outside its knowledge base? | 15% |
| Comparative | Can the system compare entities, options, or time periods? | 15% |
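A quick way to enforce the target mix in the table above is to compare the category distribution of a generated dataset against those percentages. This sketch assumes each test case carries a `category` field using hypothetical slug names for the five rows; the 5-point tolerance is an arbitrary choice, not a standard.

```python
from collections import Counter

# Target shares from the question-type coverage table.
TARGETS = {
    "simple_factual": 0.30,
    "multi_hop": 0.20,
    "abstractive_summary": 0.20,
    "out_of_scope": 0.15,
    "comparative": 0.15,
}

def coverage_report(cases, tolerance=0.05):
    """Compare actual category shares against targets, within a tolerance."""
    counts = Counter(case["category"] for case in cases)
    total = len(cases)
    report = {}
    for category, target in TARGETS.items():
        actual = counts.get(category, 0) / total if total else 0.0
        report[category] = {
            "target": target,
            "actual": round(actual, 3),
            "ok": abs(actual - target) <= tolerance,
        }
    return report
```

If a category comes up short, generate more examples for just that category rather than regenerating the whole set.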
Quality Control for Synthetic Data
Synthetic data can have quality problems: questions that are unanswerable from the corpus, ground truth answers that are wrong, or questions so easy they provide no signal. Always apply quality filtering before using synthetic data in evaluation.
Quality filtering checklist:
- Answerability check: verify the ground-truth answer can actually be derived from the provided context — discard unanswerable examples
- Ground truth verification: for critical evaluation sets, have a human review a 10% sample of synthetic Q&A pairs
- Diversity check: measure embedding-based similarity across questions — discard near-duplicates (cosine similarity > 0.95)
- Difficulty calibration: run your system against the dataset early; if it passes >95% immediately, the dataset is too easy — add harder examples
- Bias audit: check that the dataset does not over-represent one section of your corpus or one question type
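The diversity check in the list above reduces to a greedy near-duplicate filter over question embeddings. The sketch below uses plain Python and assumes embeddings are already computed (by whatever embedding model you use elsewhere); the 0.95 threshold matches the checklist.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def deduplicate(questions, embeddings, threshold=0.95):
    """Keep a question only if it is not a near-duplicate of one already kept."""
    kept, kept_embeddings = [], []
    for question, emb in zip(questions, embeddings):
        if all(cosine_similarity(emb, prev) <= threshold for prev in kept_embeddings):
            kept.append(question)
            kept_embeddings.append(emb)
    return kept
```

This is O(n²) in the number of kept questions, which is fine for evaluation sets of a few thousand examples; for larger sets, an approximate nearest-neighbour index is the usual replacement.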
Synthetic vs Real Data — The Right Mix
Recommended proportion
- Early stage (no real data): 100% synthetic — validate system feasibility
- Beta / limited release: 70% synthetic, 30% real — synthetic fills coverage gaps
- Production: ≥50% real — real user data should dominate the evaluation signal
- Never use 100% synthetic data as the final quality gate before a major release
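The stage-based proportions above can be turned into a simple release gate. The minimum real-data shares below restate the guidance in this section; they are a policy choice for illustration, not an industry standard.

```python
# Minimum share of real examples per release stage, per the guidance above.
MIN_REAL_SHARE = {
    "early": 0.0,       # 100% synthetic is acceptable
    "beta": 0.3,        # roughly 70% synthetic / 30% real
    "production": 0.5,  # real data should dominate
}

def mix_ok(n_real, n_synthetic, stage):
    """Return True if the real/synthetic mix meets the stage's minimum real share."""
    total = n_real + n_synthetic
    if total == 0:
        return False
    return n_real / total >= MIN_REAL_SHARE[stage]
```

A check like this belongs in CI for the evaluation dataset itself, so a release cannot ship against a quality gate that is accidentally all-synthetic.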
When to add real data
- Any time a real user query caused a production failure — add it to the dataset
- After each major feature launch — collect and label a sample of new traffic
- When synthetic scores diverge from production quality — real data recalibrates
Checklist: Do You Understand This?
- When does synthetic evaluation data work well, and when does it fail?
- What are the three question types RAGAS TestsetGenerator produces, and what does each one test?
- What model capability level should generate your test cases relative to the system under test?
- What is the diversity check step in quality filtering, and why are near-duplicate questions a problem?
- What proportion of real vs synthetic data should a production evaluation set contain?
- How does the "difficulty calibration" quality check work, and what does it mean if your system passes >95% of the dataset immediately?