
Prompting vs RAG vs Fine-Tuning

The most common mistake in AI system design is reaching for fine-tuning too early. Most problems are better solved by improving prompts or adding retrieval augmentation. Fine-tuning has its place, but it sits further up the escalation ladder than most people expect.

The Customisation Escalation Ladder

Start at the bottom. Only escalate when the level below genuinely cannot solve your problem, because each step up adds complexity and cost.

Level 1 — Better Prompting (start here)

System prompt engineering, few-shot examples, chain-of-thought, structured output instructions. Solves 80%+ of behaviour problems. Cost: developer time only.

Level 2 — RAG (add knowledge)

Inject relevant documents at query time. Solves knowledge currency, private data, attribution, and large corpus needs. Does not change model behaviour.

Level 3 — Fine-Tuning (narrow cases)

Train on your data to change model behaviour: style, tone, format, terminology, latency optimisation. Requires curated data, compute, and maintenance.

Level 4 — Pre-Training (almost never)

Train from scratch or continue pre-training. Only justified for highly specialised domains where base models fundamentally lack vocabulary. Millions of dollars.

What Better Prompting Achieves

Before concluding you need fine-tuning, exhaust these prompting approaches:

  • Detailed system prompt — Specify persona, domain context, output format, tone, what to avoid. A good system prompt can replicate most fine-tuning style effects.
  • Few-shot examples — 5–20 input/output pairs showing exactly the behaviour you want. Often more reliable than fine-tuning on 50 examples.
  • Output format instructions — Structured JSON, specific section headers, prescribed response length. Models follow these reliably without fine-tuning.
  • Constraint and rubric prompting — Tell the model what not to do, what to always include, and how to score its own output.
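Assembled together, these approaches form a single request. A minimal sketch of building the message list (the company name, persona, and JSON schema are illustrative assumptions; the provider API call itself is omitted):

```python
# Assemble a system prompt, few-shot examples, and format instructions
# into one chat-style message list. The structure mirrors common chat
# APIs; the exact provider call is out of scope here.

SYSTEM_PROMPT = (
    "You are a support assistant for Acme Analytics.\n"  # persona + domain (hypothetical)
    "Answer in at most three sentences.\n"               # length constraint
    'Respond with JSON: {"answer": str, "confidence": "low|med|high"}.\n'  # output format
    "Never speculate about unreleased features."         # what to avoid
)

# Few-shot pairs demonstrating the exact behaviour we want.
FEW_SHOT = [
    ("How do I reset my API key?",
     '{"answer": "Go to Settings > API Keys and click Regenerate.", "confidence": "high"}'),
    ("Will you support SSO next quarter?",
     '{"answer": "I cannot comment on unreleased features.", "confidence": "high"}'),
]

def build_messages(user_query: str) -> list[dict]:
    """System prompt first, then few-shot pairs, then the live query."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for question, answer in FEW_SHOT:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": user_query})
    return messages

msgs = build_messages("How do I export my data?")
```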

When RAG is the Right Answer

Use RAG (not fine-tuning) when:

  • Knowledge currency — You need information beyond the model's training cutoff (product docs, recent events, regulations)
  • Attribution / citations — Answers must be traceable to source documents for compliance or user trust
  • Private data — You cannot include proprietary data in a fine-tuning dataset sent to an external provider
  • Changing data — Knowledge updates frequently; fine-tuning is static; retrieval is dynamic
  • Large knowledge base — No model context can hold a 10,000-page documentation set, but a vector index can
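When those conditions hold, the core loop is retrieve-then-inject. A toy sketch, with word overlap standing in for a real embedding model and vector index (document contents are invented for illustration):

```python
# Toy retrieval-augmented prompt assembly. In production, score() would be
# cosine similarity over embeddings, and DOCS would live in a vector index;
# word overlap keeps this sketch self-contained.

DOCS = [
    "Invoices are emailed on the first business day of each month.",
    "API keys can be regenerated under Settings > API Keys.",
    "Data exports are available in CSV and Parquet formats.",
]

def score(query: str, doc: str) -> int:
    """Count shared lowercase words between query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k highest-scoring documents for this query."""
    return sorted(DOCS, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Inject retrieved context ahead of the question at query time."""
    context = "\n".join(f"- {d}" for d in retrieve(query))
    return (
        "Answer using ONLY the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

prompt = build_prompt("When are invoices sent?")
```

Because retrieval happens per query, updating knowledge means re-indexing documents, not retraining anything.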

Fine-tuning and RAG are not mutually exclusive. You can fine-tune a model for style/format and use RAG for knowledge. Many production systems combine both.

Legitimate Fine-Tuning Use Cases

Good fine-tuning candidates

  • Consistent brand voice / tone across all outputs
  • Strict output format (specific JSON schema, report structure)
  • Domain-specific jargon and terminology that base models get wrong
  • Latency optimisation: fine-tuned small model matches large model quality on narrow task
  • Reducing prompt length (behaviour baked into weights, not repeated in every prompt)
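For candidates like these, hosted fine-tuning typically expects a JSONL file of chat transcripts demonstrating the target voice and format. A sketch of preparing one (the messages-list schema follows the common convention for chat fine-tuning; the example content is invented, so check your provider's docs for exact field names):

```python
import json

# Each training example is one conversation demonstrating the target
# brand voice and output format.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are the Acme support voice: concise, warm, no jargon."},
            {"role": "user", "content": "my upload failed"},
            {"role": "assistant", "content": "Sorry about that! Retry from the dashboard; files under 2 GB upload fastest."},
        ]
    },
    # ...hundreds more curated examples in the same shape
]

def write_jsonl(path: str, rows: list[dict]) -> None:
    """Write one JSON object per line, the usual fine-tuning upload format."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

write_jsonl("train.jsonl", examples)
```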

Fine-tuning won't fix these

  • Factual knowledge past training cutoff — use RAG
  • Hallucination rate — model still invents facts; fine-tuning shifts style, not factual accuracy
  • General reasoning ability — you can't train a 7B model to reason like GPT-5
  • Consistent behaviour across all tasks — fine-tuning optimises for training distribution
  • "Make it smarter" — capability is set at pre-training

Data Requirements

Fine-tuning requires training examples (input/output pairs):

  • Minimum for style/format: 50–200 high-quality examples often produce measurable improvement
  • Reliable domain adaptation: 500–5,000 examples covering the task variation you care about
  • Instruction fine-tuning: 10,000–100,000 diverse examples for robust general instruction following
  • Quality > quantity: 200 carefully curated examples typically outperform 2,000 noisy examples
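Whatever the target size, deduplicate and carve off a held-out slice before any training run. A minimal sketch, assuming exact-match deduplication is sufficient:

```python
import json
import random

def split_examples(examples: list[dict], test_frac: float = 0.15, seed: int = 0):
    """Deduplicate, shuffle, then split into (train, test) before training."""
    seen, unique = set(), []
    for ex in examples:
        key = json.dumps(ex, sort_keys=True)  # exact-duplicate detection
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    rng = random.Random(seed)                 # fixed seed makes the split reproducible
    rng.shuffle(unique)
    n_test = max(1, int(len(unique) * test_frac))
    return unique[n_test:], unique[:n_test]

# 100 raw rows collapse to 40 unique examples, split 34 train / 6 held out
train, held_out = split_examples([{"text": f"example {i % 40}"} for i in range(100)])
```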

Cost Comparison Over 12 Months

| Approach | Setup cost | Monthly cost at 1M queries | Maintenance burden |
| --- | --- | --- | --- |
| Better prompting | 1–5 dev days | API costs only | Minimal |
| RAG | 1–4 weeks dev | API + vector DB (~$100–500) | Document ingestion pipeline |
| Hosted fine-tuning (GPT-4o mini) | Data prep + $0.025/1K training tokens | Fine-tuned API cost (slightly lower) | Retraining on data drift |
| Self-hosted fine-tuning (QLoRA) | GPU time + significant engineering | GPU hosting + maintenance | High — model updates, hardware |

Evaluating Fine-Tuning Results

Fine-tuning without rigorous evaluation is guesswork. Required steps:

  1. Hold out a test set: reserve 10–20% of your data before training starts
  2. Define a metric: format compliance, BLEU, or human preference
  3. Evaluate base and tuned: run both models on the same test set and compare
  4. Out-of-distribution test: did tuning hurt general capability?
  5. A/B in production: real users before full rollout
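The steps above reduce to a small harness: run base and tuned models over the same held-out prompts and compare one metric. A sketch using format compliance (valid-JSON rate) as the metric; model outputs are inlined here because the model call itself is out of scope:

```python
import json

def format_compliance(outputs: list[str]) -> float:
    """Fraction of outputs that parse as JSON with an 'answer' field."""
    ok = 0
    for out in outputs:
        try:
            ok += "answer" in json.loads(out)
        except (json.JSONDecodeError, TypeError):
            pass  # non-JSON output counts as non-compliant
    return ok / len(outputs)

def compare(base_outputs: list[str], tuned_outputs: list[str]) -> dict:
    """Score both models on the same held-out set; positive delta favours tuning."""
    base = format_compliance(base_outputs)
    tuned = format_compliance(tuned_outputs)
    return {"base": base, "tuned": tuned, "delta": tuned - base}

# Same held-out prompts sent to both models; outputs inlined for the sketch.
result = compare(
    base_outputs=['{"answer": "yes"}', "Sure, happy to help!"],
    tuned_outputs=['{"answer": "yes"}', '{"answer": "no"}'],
)
```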


Checklist: Do You Understand This?

  • What are the four levels of the customisation escalation ladder?
  • Name three prompting approaches that should be exhausted before considering fine-tuning.
  • Name three legitimate fine-tuning use cases and three things fine-tuning cannot fix.
  • When is RAG the right answer instead of fine-tuning?
  • What is the minimum data requirement for fine-tuning to show improvement?
  • How do you evaluate whether fine-tuning actually helped?