Prompting vs RAG vs Fine-Tuning
The most common mistake in AI system design is reaching for fine-tuning too early. Most problems are better solved by improving prompts or adding retrieval augmentation. Fine-tuning has its place — but it's later in the escalation ladder than most people expect.
The Customisation Escalation Ladder
Start at the bottom. Only escalate when the level below genuinely cannot solve your problem, because each step up adds complexity and cost.
Level 1: Prompting. System prompt engineering, few-shot examples, chain-of-thought, structured output instructions. Solves 80%+ of behaviour problems. Cost: developer time only.
Level 2: RAG. Inject relevant documents at query time. Solves knowledge currency, private data, attribution, and large-corpus needs. Does not change model behaviour.
Level 3: Fine-tuning. Train on your data to change model behaviour: style, tone, format, terminology, latency optimisation. Requires curated data, compute, and maintenance.
Level 4: Pre-training. Train from scratch or continue pre-training. Only justified for highly specialised domains where base models fundamentally lack the vocabulary. Costs millions of dollars.
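The ladder can be sketched as a bottom-up decision function. This is a simplification under the assumptions above (the flag names are mine, not a standard API), but it makes the escalation order explicit:

```python
def recommend_approach(
    needs_fresh_or_private_knowledge: bool,
    needs_style_or_format_change: bool,
    prompting_already_exhausted: bool,
    domain_vocabulary_missing_from_base: bool,
) -> str:
    """Walk the escalation ladder bottom-up: prompting, then RAG,
    then fine-tuning, then pre-training. Only escalate when the
    level below genuinely cannot solve the problem."""
    if not prompting_already_exhausted:
        return "prompting"      # Level 1: solves 80%+ of behaviour problems
    if needs_fresh_or_private_knowledge:
        return "rag"            # Level 2: knowledge problem, not behaviour
    if needs_style_or_format_change:
        return "fine-tuning"    # Level 3: behaviour, not knowledge
    if domain_vocabulary_missing_from_base:
        return "pre-training"   # Level 4: millions of dollars
    return "prompting"          # default: stay at the bottom
```

Note that a knowledge problem routes to RAG before fine-tuning is even considered, which mirrors the rest of this section.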
What Better Prompting Achieves
Before concluding you need fine-tuning, exhaust these prompting approaches:
- Detailed system prompt — Specify persona, domain context, output format, tone, what to avoid. A good system prompt can replicate most fine-tuning style effects.
- Few-shot examples — 5–20 examples of input/output pairs showing exactly the behaviour you want. Often more reliable than fine-tuning on 50 examples.
- Output format instructions — Structured JSON, specific section headers, prescribed response length. Models follow these reliably without fine-tuning.
- Constraint and rubric prompting — Tell the model what not to do, what to always include, and how to score its own output.
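The first two approaches combine naturally: a detailed system prompt plus few-shot pairs, assembled as a chat message list. A minimal sketch, using a hypothetical ticket-classification task (the categories and wording are illustrative; the message structure works with any chat-completions-style API):

```python
SYSTEM_PROMPT = (
    "You are a support-ticket classifier for an e-commerce site. "
    "Reply with exactly one category: billing, shipping, or returns. "
    "Do not add explanations."
)

# Few-shot pairs: (user input, desired output). 5-20 of these often
# beats fine-tuning on a similarly small dataset.
FEW_SHOT_EXAMPLES = [
    ("I was charged twice for order #1881.", "billing"),
    ("My package says delivered but never arrived.", "shipping"),
    ("The jacket doesn't fit - how do I send it back?", "returns"),
]

def build_messages(query: str) -> list[dict]:
    """Assemble system prompt + few-shot pairs + the live query."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for user_text, label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": query})
    return messages
```

The few-shot pairs demonstrate both the task and the output format, so the model sees exactly the behaviour you want before it answers.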
When RAG is the Right Answer
Use RAG (not fine-tuning) when:
- Knowledge currency — You need information beyond the model's training cutoff (product docs, recent events, regulations)
- Attribution / citations — Answers must be traceable to source documents for compliance or user trust
- Private data — You cannot include proprietary data in a fine-tuning dataset sent to an external provider
- Changing data — Knowledge updates frequently; fine-tuning is static; retrieval is dynamic
- Large knowledge base — No model context can hold a 10,000-page documentation set, but a vector index can
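The core RAG loop is short: embed the query, retrieve the nearest documents, and inject them into the prompt with source ids for attribution. A toy sketch with hand-written 2-d "embeddings" standing in for a real embedding model and vector index:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec: list[float], corpus: list[dict], k: int = 2) -> list[dict]:
    """Return the k documents whose embeddings are closest to the query."""
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, d["embedding"]),
                    reverse=True)
    return ranked[:k]

def build_rag_prompt(question: str, docs: list[dict]) -> str:
    """Inject retrieved documents, tagged with ids so answers are citable."""
    context = "\n\n".join(f"[{d['id']}] {d['text']}" for d in docs)
    return (f"Answer using only the sources below and cite their ids.\n\n"
            f"{context}\n\nQuestion: {question}")
```

Because retrieval happens at query time, updating knowledge means re-indexing documents, not retraining anything; that is the dynamic-vs-static distinction above.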
Fine-tuning and RAG are not mutually exclusive. You can fine-tune a model for style/format and use RAG for knowledge. Many production systems combine both.
Legitimate Fine-Tuning Use Cases
Good fine-tuning candidates
- Consistent brand voice / tone across all outputs
- Strict output format (specific JSON schema, report structure)
- Domain-specific jargon and terminology that base models get wrong
- Latency optimisation: fine-tuned small model matches large model quality on narrow task
- Reducing prompt length (behaviour baked into weights, not repeated in every prompt)
Fine-tuning won't fix these
- Factual knowledge past training cutoff — use RAG
- Hallucination rate — model still invents facts; fine-tuning shifts style, not factual accuracy
- General reasoning ability — you can't train a 7B model to reason like GPT-5
- Consistent behaviour across all tasks — fine-tuning optimises for the training distribution
- "Make it smarter" — capability is set at pre-training
Data Requirements
Fine-tuning requires training examples (input/output pairs):
- Minimum for style/format: 50–200 high-quality examples often produce measurable improvement
- Reliable domain adaptation: 500–5,000 examples covering the task variation you care about
- Instruction fine-tuning: 10,000–100,000 diverse examples for robust general instruction following
- Quality > quantity: 200 carefully curated examples typically outperform 2,000 noisy examples
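Hosted fine-tuning services commonly accept training data as chat-format JSONL: one JSON object per line, each holding a `messages` array (exact field names vary by provider, so treat this as a sketch). Since quality beats quantity, it pays to gate every example before it enters the file:

```python
import json

# Hypothetical style-transfer examples for a release-notes task.
examples = [
    {"messages": [
        {"role": "system", "content": "You write release notes in our house style."},
        {"role": "user", "content": "Fixed crash when uploading big files."},
        {"role": "assistant", "content": "Resolved: large uploads no longer crash the app."},
    ]},
]

def validate(example: dict) -> bool:
    """Minimal quality gate: every example needs a user/assistant pair
    and no empty message bodies."""
    msgs = example.get("messages", [])
    roles = [m.get("role") for m in msgs]
    has_pair = "user" in roles and "assistant" in roles
    non_empty = all(m.get("content", "").strip() for m in msgs)
    return has_pair and non_empty

# One JSON object per line; drop anything that fails the gate.
jsonl_blob = "\n".join(json.dumps(ex) for ex in examples if validate(ex))
```

A real pipeline would add deduplication and label checks, but even this filter catches the empty or one-sided examples that quietly degrade a training set.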
Cost Comparison Over 12 Months
| Approach | Setup cost | Monthly cost at 1M queries | Maintenance burden |
|---|---|---|---|
| Better prompting | 1–5 dev days | API costs only | Minimal |
| RAG | 1–4 weeks dev | API + vector DB (~$100–500) | Document ingestion pipeline |
| Hosted fine-tuning (GPT-4o mini) | Data prep + $0.025/1K training tokens | Fine-tuned API cost (slightly lower) | Retraining on data drift |
| Self-hosted fine-tuning (QLoRA) | GPU time + significant engineering | GPU hosting + maintenance | High — model updates, hardware |
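Total cost of ownership over 12 months is just setup plus recurring spend, which makes the comparison easy to model. All figures below are illustrative assumptions (including the developer-day rate), not quotes; plug in your own numbers:

```python
def twelve_month_cost(setup: float, monthly: float, months: int = 12) -> float:
    """One-time setup cost plus recurring monthly spend."""
    return setup + monthly * months

DEV_DAY = 800.0  # assumed loaded cost of one developer-day

# Rough mapping of the table above onto concrete (assumed) numbers.
approaches = {
    "prompting": twelve_month_cost(setup=3 * DEV_DAY, monthly=1_000),        # API only
    "rag": twelve_month_cost(setup=15 * DEV_DAY, monthly=1_000 + 300),       # API + vector DB
    "hosted_fine_tune": twelve_month_cost(setup=10 * DEV_DAY + 500, monthly=900),
}
```

Under these assumptions prompting stays cheapest over the year, which is the point of the escalation ladder: the higher rungs must buy something the lower rungs cannot.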
Evaluating Fine-Tuning Results
Fine-tuning without rigorous evaluation is guesswork. Before training, hold out an evaluation set the model never sees; after training, compare the fine-tuned model against the base model (with your best prompt) on that held-out set, and only ship if the fine-tuned model measurably wins.
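The comparison itself can be a few lines. A sketch using exact-match accuracy on a held-out set; the toy "models" below are hypothetical stand-ins for real API calls, and a real task would likely need a softer metric than exact match:

```python
def exact_match_rate(predict, held_out) -> float:
    """Share of held-out items where the prediction equals the reference."""
    return sum(predict(q) == ref for q, ref in held_out) / len(held_out)

# Held-out pairs the model never saw during training (toy example).
held_out = [("2+2", "4"), ("3+3", "6"), ("5+5", "10")]

def base_model(query: str) -> str:
    return "4"  # stand-in: a model stuck on one answer

def tuned_model(query: str) -> str:
    a, b = query.split("+")
    return str(int(a) + int(b))  # stand-in: the behaviour we trained for

base_score = exact_match_rate(base_model, held_out)    # 1/3
tuned_score = exact_match_rate(tuned_model, held_out)  # 1.0
```

The decision rule is then explicit: ship the fine-tuned model only if `tuned_score` beats `base_score` by a margin that justifies the training and maintenance cost.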
Checklist: Do You Understand This?
- What are the four levels of the customisation escalation ladder?
- Name three prompting approaches that should be exhausted before considering fine-tuning.
- Name three legitimate fine-tuning use cases and three things fine-tuning cannot fix.
- When is RAG the right answer instead of fine-tuning?
- What is the minimum data requirement for fine-tuning to show improvement?
- How do you evaluate whether fine-tuning actually helped?