
Prompting vs RAG vs Fine-Tuning

The most common mistake in AI system design is reaching for fine-tuning too early. Most problems are better solved by improving prompts or adding retrieval augmentation. Fine-tuning has its place, but it sits further up the escalation ladder than most people expect.

The Customisation Escalation Ladder

Start at the bottom. Only escalate when the level below genuinely cannot solve your problem, because each step up adds complexity and cost.

Level 1 — Better Prompting (start here)

System prompt engineering, few-shot examples, chain-of-thought, structured output instructions. Solves 80%+ of behaviour problems. Cost: developer time only.

Level 2 — RAG (add knowledge)

Inject relevant documents at query time. Solves knowledge currency, private data, attribution, and large corpus needs. Does not change model behaviour.

Level 3 — Fine-Tuning (narrow cases)

Train on your data to change model behaviour: style, tone, format, terminology, latency optimisation. Requires curated data, compute, and maintenance.

Level 4 — Pre-Training (almost never)

Train from scratch or continue pre-training. Only justified for highly specialised domains where base models fundamentally lack vocabulary. Millions of dollars.

What Better Prompting Achieves

Before concluding you need fine-tuning, exhaust these prompting approaches:

  • Detailed system prompt — Specify persona, domain context, output format, tone, what to avoid. A good system prompt can replicate most fine-tuning style effects.
  • Few-shot examples — 5–20 input/output pairs showing exactly the behaviour you want. Often more reliable than fine-tuning on 50 examples.
  • Output format instructions — Structured JSON, specific section headers, prescribed response length. Models follow these reliably without fine-tuning.
  • Constraint and rubric prompting — Tell the model what not to do, what to always include, and how to score its own output.
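Assembled together, these approaches form a single request. A minimal sketch of building the message list (the company name, persona, and JSON schema are illustrative assumptions; the provider API call itself is omitted):

```python
# Assemble a system prompt, few-shot examples, and format instructions
# into one chat-style message list. The structure mirrors common chat
# APIs; the exact provider call is out of scope here.

SYSTEM_PROMPT = (
    "You are a support assistant for Acme Analytics.\n"  # persona + domain (hypothetical)
    "Answer in at most three sentences.\n"               # length constraint
    'Respond with JSON: {"answer": str, "confidence": "low|med|high"}.\n'  # output format
    "Never speculate about unreleased features."         # what to avoid
)

# Few-shot pairs demonstrating the exact behaviour we want.
FEW_SHOT = [
    ("How do I reset my API key?",
     '{"answer": "Go to Settings > API Keys and click Regenerate.", "confidence": "high"}'),
    ("Will you support SSO next quarter?",
     '{"answer": "I cannot comment on unreleased features.", "confidence": "high"}'),
]

def build_messages(user_query: str) -> list[dict]:
    """System prompt first, then few-shot pairs, then the live query."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for question, answer in FEW_SHOT:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": user_query})
    return messages

msgs = build_messages("How do I export my data?")
```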

When RAG is the Right Answer

Use RAG (not fine-tuning) when:

  • Knowledge currency — You need information beyond the model's training cutoff (product docs, recent events, regulations)
  • Attribution / citations — Answers must be traceable to source documents for compliance or user trust
  • Private data — You cannot include proprietary data in a fine-tuning dataset sent to an external provider
  • Changing data — Knowledge updates frequently; fine-tuning is static; retrieval is dynamic
  • Large knowledge base — No model context can hold a 10,000-page documentation set, but a vector index can
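When those conditions hold, the core loop is retrieve-then-inject. A toy sketch, with word overlap standing in for a real embedding model and vector index (document contents are invented for illustration):

```python
# Toy retrieval-augmented prompt assembly. In production, score() would be
# cosine similarity over embeddings, and DOCS would live in a vector index;
# word overlap keeps this sketch self-contained.

DOCS = [
    "Invoices are emailed on the first business day of each month.",
    "API keys can be regenerated under Settings > API Keys.",
    "Data exports are available in CSV and Parquet formats.",
]

def score(query: str, doc: str) -> int:
    """Count shared lowercase words between query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k highest-scoring documents for this query."""
    return sorted(DOCS, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Inject retrieved context ahead of the question at query time."""
    context = "\n".join(f"- {d}" for d in retrieve(query))
    return (
        "Answer using ONLY the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

prompt = build_prompt("When are invoices sent?")
```

Because retrieval happens per query, updating knowledge means re-indexing documents, not retraining anything.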

Fine-tuning and RAG are not mutually exclusive. You can fine-tune a model for style/format and use RAG for knowledge. Many production systems combine both.

Legitimate Fine-Tuning Use Cases

Good fine-tuning candidates

  • Consistent brand voice / tone across all outputs
  • Strict output format (specific JSON schema, report structure)
  • Domain-specific jargon and terminology that base models get wrong
  • Latency optimisation: fine-tuned small model matches large model quality on narrow task
  • Reducing prompt length (behaviour baked into weights, not repeated in every prompt)
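For candidates like these, hosted fine-tuning typically expects a JSONL file of chat transcripts demonstrating the target voice and format. A sketch of preparing one (the messages-list schema follows the common convention for chat fine-tuning; the example content is invented, so check your provider's docs for exact field names):

```python
import json

# Each training example is one conversation demonstrating the target
# brand voice and output format.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are the Acme support voice: concise, warm, no jargon."},
            {"role": "user", "content": "my upload failed"},
            {"role": "assistant", "content": "Sorry about that! Retry from the dashboard; files under 2 GB upload fastest."},
        ]
    },
    # ...hundreds more curated examples in the same shape
]

def write_jsonl(path: str, rows: list[dict]) -> None:
    """Write one JSON object per line, the usual fine-tuning upload format."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

write_jsonl("train.jsonl", examples)
```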

Fine-tuning won't fix these

  • Factual knowledge past training cutoff — use RAG
  • Hallucination rate — model still invents facts; fine-tuning shifts style, not factual accuracy
  • General reasoning ability — you can't train a 7B model to reason like GPT-5
  • Consistent behaviour across all tasks — fine-tuning optimises for training distribution
  • "Make it smarter" — capability is set at pre-training

Data Requirements

Fine-tuning requires training examples (input/output pairs):

  • Minimum for style/format: 50–200 high-quality examples often produce measurable improvement
  • Reliable domain adaptation: 500–5,000 examples covering the task variation you care about
  • Instruction fine-tuning: 10,000–100,000 diverse examples for robust general instruction following
  • Quality > quantity: 200 carefully curated examples typically outperform 2,000 noisy examples
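Whatever the target size, deduplicate and carve off a held-out slice before any training run. A minimal sketch, assuming exact-match deduplication is sufficient:

```python
import json
import random

def split_examples(examples: list[dict], test_frac: float = 0.15, seed: int = 0):
    """Deduplicate, shuffle, then split into (train, test) before training."""
    seen, unique = set(), []
    for ex in examples:
        key = json.dumps(ex, sort_keys=True)  # exact-duplicate detection
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    rng = random.Random(seed)                 # fixed seed makes the split reproducible
    rng.shuffle(unique)
    n_test = max(1, int(len(unique) * test_frac))
    return unique[n_test:], unique[:n_test]

# 100 raw rows collapse to 40 unique examples, split 34 train / 6 held out
train, held_out = split_examples([{"text": f"example {i % 40}"} for i in range(100)])
```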

Cost Comparison Over 12 Months

| Approach | Setup cost | Monthly cost at 1M queries | Maintenance burden |
| --- | --- | --- | --- |
| Better prompting | 1–5 dev days | API costs only | Minimal |
| RAG | 1–4 weeks dev | API + vector DB (~$100–500) | Document ingestion pipeline |
| Hosted fine-tuning (GPT-4o mini) | Data prep + $0.025/1K training tokens | Fine-tuned API cost (slightly lower) | Retraining on data drift |
| Self-hosted fine-tuning (QLoRA) | GPU time + significant engineering | GPU hosting + maintenance | High — model updates, hardware |

Evaluating Fine-Tuning Results

Fine-tuning without rigorous evaluation is guesswork. Required steps:

  1. Hold out a test set: reserve 10–20% of your data before training starts
  2. Define a metric: format compliance, BLEU, or human preference
  3. Evaluate base and tuned: run both models on the same test set and compare
  4. Out-of-distribution test: did tuning hurt general capability?
  5. A/B in production: real users before full rollout
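The steps above reduce to a small harness: run base and tuned models over the same held-out prompts and compare one metric. A sketch using format compliance (valid-JSON rate) as the metric; model outputs are inlined here because the model call itself is out of scope:

```python
import json

def format_compliance(outputs: list[str]) -> float:
    """Fraction of outputs that parse as JSON with an 'answer' field."""
    ok = 0
    for out in outputs:
        try:
            ok += "answer" in json.loads(out)
        except (json.JSONDecodeError, TypeError):
            pass  # non-JSON output counts as non-compliant
    return ok / len(outputs)

def compare(base_outputs: list[str], tuned_outputs: list[str]) -> dict:
    """Score both models on the same held-out set; positive delta favours tuning."""
    base = format_compliance(base_outputs)
    tuned = format_compliance(tuned_outputs)
    return {"base": base, "tuned": tuned, "delta": tuned - base}

# Same held-out prompts sent to both models; outputs inlined for the sketch.
result = compare(
    base_outputs=['{"answer": "yes"}', "Sure, happy to help!"],
    tuned_outputs=['{"answer": "yes"}', '{"answer": "no"}'],
)
```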


Checklist: Do You Understand This?

  • What are the four levels of the customisation escalation ladder?
  • Name three prompting approaches that should be exhausted before considering fine-tuning.
  • Name three legitimate fine-tuning use cases and three things fine-tuning cannot fix.
  • When is RAG the right answer instead of fine-tuning?
  • What is the minimum data requirement for fine-tuning to show improvement?
  • How do you evaluate whether fine-tuning actually helped?