Synthetic Data for Evaluation
Building an evaluation dataset manually — collecting real user queries, labelling expected outputs, and verifying ground truth — is expensive and slow. Synthetic data generation uses LLMs to create evaluation examples automatically from your existing documents or domain knowledge. Done well, synthetic datasets are diverse, cover edge cases, and can be produced in hours rather than weeks. This page covers when synthetic data works, the main generation techniques, and the tools that automate it.
When Synthetic Data Works
Good use cases for synthetic data
- RAG system evaluation: generate questions from your document corpus that the system should be able to answer
- Edge case coverage: generate examples for scenarios that are rare in real traffic but critical to handle correctly
- Early-stage evaluation before real user data exists
- Augmenting small real datasets — real data stays the gold standard, synthetic fills gaps
- Testing multilingual or domain-specific scenarios where labellers are scarce
When synthetic data fails
- The generator model has the same failure modes as the system under test — synthetic data will not catch these
- Real user queries have a distribution the LLM cannot replicate (slang, typos, cultural context)
- Ground truth requires domain expertise the LLM lacks (legal interpretations, medical diagnosis)
- Using synthetic data as a substitute for real data in final production evaluation — always validate against real users before launch
RAGAS TestsetGenerator
RAGAS (Retrieval Augmented Generation Assessment) provides an automated testset generator specifically designed for RAG evaluation. Given a document corpus, it generates question-answer-context triples using an evolutionary paradigm that produces diverse question types.
How RAGAS TestsetGenerator works:
- Build a KnowledgeGraph from your documents — extracts entities, relationships, and key facts
- Apply transformations to enrich the graph (summaries, paraphrases, cross-document links)
- Synthesise questions using three query types:
  - AbstractQuerySynthesizer (25%): high-level conceptual questions
  - ComparativeAbstractQuerySynthesizer (25%): compare two concepts or documents
  - SpecificQuerySynthesizer (50%): factual, detail-level questions
- Generate ground-truth answers for each question using the source documents
- Output: dataset of (question, ground_truth_answer, source_context) tuples ready for evaluation
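The weighted mix of query types in the steps above can be sketched in plain Python. This is not the RAGAS API: the functions and document fields below are hypothetical stand-ins for what RAGAS builds from its KnowledgeGraph, and the sketch only illustrates how a 25/25/50 synthesizer mix produces (question, ground_truth, source_context) tuples.

```python
import random

# Hypothetical stand-ins for RAGAS's query synthesizers. Each takes a
# document dict and returns a (question, ground_truth, context) triple.
def abstract_query(doc):
    return (f"What is the main idea behind {doc['topic']}?",
            doc["summary"], doc["text"])

def comparative_query(doc):
    return (f"How does {doc['topic']} compare with related approaches?",
            doc["summary"], doc["text"])

def specific_query(doc):
    return (f"What does the document state about {doc['topic']}?",
            doc["fact"], doc["text"])

# The 25% / 25% / 50% mix described above.
SYNTHESIZERS = [
    (abstract_query, 0.25),
    (comparative_query, 0.25),
    (specific_query, 0.50),
]

def generate_testset(docs, n, seed=0):
    """Sample n (question, ground_truth, source_context) tuples from docs."""
    rng = random.Random(seed)
    funcs, weights = zip(*SYNTHESIZERS)
    testset = []
    for _ in range(n):
        synth = rng.choices(funcs, weights=weights, k=1)[0]
        question, ground_truth, context = synth(rng.choice(docs))
        testset.append({"question": question,
                        "ground_truth_answer": ground_truth,
                        "source_context": context})
    return testset
```

In the real library the synthesizers operate on knowledge-graph nodes rather than raw dicts, but the output shape is the same: a dataset of question/answer/context records ready for evaluation.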
RAGAS dataset quality
- Evolutionary generation ensures variety — not just paraphrases of the same question
- Abstract questions test reasoning; specific questions test retrieval accuracy
- Comparative questions reveal whether the system can reason across multiple documents
- Ground truth is document-grounded — provides a verifiable baseline for faithfulness evaluation
LLM-Based Test Case Generation
Beyond RAGAS, you can generate evaluation examples directly with a capable LLM. This is the right approach when your system is not RAG-based, or when you need examples that test specific behaviours not tied to a document corpus.
Prompt pattern for test case generation:
```text
You are generating evaluation test cases for [DESCRIBE SYSTEM].

Generate [N] diverse test cases that cover:
- [Category 1: e.g., simple factual questions]
- [Category 2: e.g., multi-step reasoning]
- [Category 3: e.g., edge cases with ambiguous phrasing]
- [Category 4: e.g., adversarial inputs]

For each test case return JSON:
{"input": "...", "expected_output": "...", "category": "...", "difficulty": "easy|medium|hard"}
```
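LLM output in this format needs validation before it enters your evaluation set: models occasionally emit malformed JSON, drop a key, or invent a difficulty value. A minimal parser for one-JSON-object-per-line output, assuming the schema from the prompt above, might look like this:

```python
import json

REQUIRED_KEYS = {"input", "expected_output", "category", "difficulty"}
VALID_DIFFICULTIES = {"easy", "medium", "hard"}

def parse_test_cases(raw_lines):
    """Parse LLM output, one JSON test case per line, dropping malformed cases."""
    cases = []
    for line in raw_lines:
        try:
            case = json.loads(line)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        if not REQUIRED_KEYS <= case.keys():
            continue  # missing a required field
        if case["difficulty"] not in VALID_DIFFICULTIES:
            continue  # difficulty outside the allowed enum
        cases.append(case)
    return cases
```

Silently dropping bad cases is fine for generation pipelines where you can simply ask for more; log the discard rate, since a high rate usually signals a prompt problem.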
Use a more capable model than your system under test to generate test cases: for example, GPT-4o or Claude 3.5 Sonnet generating tests for a system running on GPT-4o-mini or Claude 3.5 Haiku.
Question Type Coverage
| Question type | What it tests | Target % |
|---|---|---|
| Simple factual | Basic retrieval accuracy — can the system find the answer? | 30% |
| Multi-hop reasoning | Can the system connect facts from multiple documents? | 20% |
| Abstractive summary | Can the system synthesise information, not just retrieve it? | 20% |
| Out-of-scope | Does the system correctly decline questions outside its knowledge base? | 15% |
| Comparative | Can the system compare entities, options, or time periods? | 15% |
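A quick way to enforce the target mix in the table above is to compare the category distribution of a generated dataset against those percentages. This sketch assumes each test case carries a `category` field using hypothetical slug names for the five rows; the 5-point tolerance is an arbitrary choice, not a standard.

```python
from collections import Counter

# Target shares from the question-type coverage table.
TARGETS = {
    "simple_factual": 0.30,
    "multi_hop": 0.20,
    "abstractive_summary": 0.20,
    "out_of_scope": 0.15,
    "comparative": 0.15,
}

def coverage_report(cases, tolerance=0.05):
    """Compare actual category shares against targets, within a tolerance."""
    counts = Counter(case["category"] for case in cases)
    total = len(cases)
    report = {}
    for category, target in TARGETS.items():
        actual = counts.get(category, 0) / total if total else 0.0
        report[category] = {
            "target": target,
            "actual": round(actual, 3),
            "ok": abs(actual - target) <= tolerance,
        }
    return report
```

If a category comes up short, generate more examples for just that category rather than regenerating the whole set.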
Quality Control for Synthetic Data
Synthetic data can have quality problems: questions that are unanswerable from the corpus, ground truth answers that are wrong, or questions so easy they provide no signal. Always apply quality filtering before using synthetic data in evaluation.
Quality filtering checklist:
- Answerability check: verify the ground-truth answer can actually be derived from the provided context — discard unanswerable examples
- Ground truth verification: for critical evaluation sets, have a human review a 10% sample of synthetic Q&A pairs
- Diversity check: measure embedding-based similarity across questions — discard near-duplicates (cosine similarity > 0.95)
- Difficulty calibration: run your system against the dataset early; if it passes >95% immediately, the dataset is too easy — add harder examples
- Bias audit: check that the dataset does not over-represent one section of your corpus or one question type
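The diversity check in the list above reduces to a greedy near-duplicate filter over question embeddings. The sketch below uses plain Python and assumes embeddings are already computed (by whatever embedding model you use elsewhere); the 0.95 threshold matches the checklist.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def deduplicate(questions, embeddings, threshold=0.95):
    """Keep a question only if it is not a near-duplicate of one already kept."""
    kept, kept_embeddings = [], []
    for question, emb in zip(questions, embeddings):
        if all(cosine_similarity(emb, prev) <= threshold for prev in kept_embeddings):
            kept.append(question)
            kept_embeddings.append(emb)
    return kept
```

This is O(n²) in the number of kept questions, which is fine for evaluation sets of a few thousand examples; for larger sets, an approximate nearest-neighbour index is the usual replacement.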
Synthetic vs Real Data — The Right Mix
Recommended proportion
- Early stage (no real data): 100% synthetic — validate system feasibility
- Beta / limited release: 70% synthetic, 30% real — synthetic fills coverage gaps
- Production: ≥50% real — real user data should dominate the evaluation signal
- Never use 100% synthetic data as the final quality gate before a major release
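The stage-based proportions above can be turned into a simple release gate. The minimum real-data shares below restate the guidance in this section; they are a policy choice for illustration, not an industry standard.

```python
# Minimum share of real examples per release stage, per the guidance above.
MIN_REAL_SHARE = {
    "early": 0.0,       # 100% synthetic is acceptable
    "beta": 0.3,        # roughly 70% synthetic / 30% real
    "production": 0.5,  # real data should dominate
}

def mix_ok(n_real, n_synthetic, stage):
    """Return True if the real/synthetic mix meets the stage's minimum real share."""
    total = n_real + n_synthetic
    if total == 0:
        return False
    return n_real / total >= MIN_REAL_SHARE[stage]
```

A check like this belongs in CI for the evaluation dataset itself, so a release cannot ship against a quality gate that is accidentally all-synthetic.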
When to add real data
- Any time a real user query caused a production failure — add it to the dataset
- After each major feature launch — collect and label a sample of new traffic
- When synthetic scores diverge from production quality — real data recalibrates
Checklist: Do You Understand This?
- When does synthetic evaluation data work well, and when does it fail?
- What are the three question types RAGAS TestsetGenerator produces, and what does each one test?
- What model capability level should generate your test cases relative to the system under test?
- What is the diversity check step in quality filtering, and why are near-duplicate questions a problem?
- What proportion of real vs synthetic data should a production evaluation set contain?
- How does the "difficulty calibration" quality check work, and what does it mean if your system passes >95% of the dataset immediately?