
Supervised Fine-Tuning (SFT)

A pre-trained language model is a powerful statistical engine that has learned to predict the next token across a vast corpus of text. What it has not learned is how to be useful: how to follow an instruction, refuse an unsafe request, or structure an answer clearly. Supervised Fine-Tuning (SFT) is the first post-training step that changes this. Given a dataset of (input, desired output) pairs — where the input is a prompt or instruction and the output is the ideal response — SFT continues training the model on these examples, nudging the weights to favour responses that match the demonstrated behaviour. The result is an instruction-following model that forms the foundation for further alignment stages such as RLHF or DPO.

SFT is conceptually simple: it is standard supervised learning with cross-entropy loss applied to a pre-trained model rather than a randomly initialised one. The difficulty lies in the data — curating high-quality, diverse instruction-response pairs is expensive and the single most important lever for the quality of the resulting model.

Instruction Tuning

Instruction tuning is SFT applied specifically to teach a model to follow natural-language instructions across diverse task types. Rather than training on examples of a single task, the model sees thousands of different task formats — summarisation, translation, question answering, coding, classification — framed as instructions. The hypothesis, validated by FLAN (Wei et al., 2021), is that training on a sufficiently diverse set of tasks in instruction format causes the model to generalise to unseen tasks expressed as instructions at inference time.

FLAN (Google, 2021)

Fine-tuned LaMDA on 60+ NLP tasks phrased as instructions. Demonstrated that instruction tuning on diverse tasks dramatically improves zero-shot performance on held-out tasks. Established instruction diversity as a key axis.

InstructGPT (OpenAI, 2022)

Popularised the three-stage RLHF pipeline for instruction following. The first stage, the SFT step, used human-written demonstrations of ideal assistant behaviour. A GPT-3 model fine-tuned only on these SFT examples already outperformed the much larger base GPT-3 on human preference ratings.

Alpaca (Stanford, 2023)

Used GPT-3.5 to generate 52,000 instruction-response pairs from 175 seed tasks via self-instruct. Fine-tuned LLaMA-7B on this synthetic data. Produced a surprisingly capable instruction follower at minimal cost, demonstrating synthetic data as a viable path.

Vicuna (LMSYS, 2023)

Fine-tuned LLaMA on 70,000 ChatGPT conversations shared by users. Showed that conversations (multi-turn dialogue) rather than single-turn instruction-response pairs can produce strong conversational capability. Multi-turn data format matters.

Data Format and Chat Templates

Fine-tuning data must match the token format the model will see at inference time. There is no universal format — each model family uses its own special tokens and role conventions, and mismatching them causes subtle but severe degradation.

# Llama 2 chat format
<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is the capital of France? [/INST] The capital of France is Paris.</s>

# Llama 3 / instruct format
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is the capital of France?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
The capital of France is Paris.<|eot_id|>

# ChatML format (Qwen, many open models)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>

The practical consequence is that your training data pipeline must apply the correct chat template for the specific model you are fine-tuning — not just the architecture family but the specific checkpoint, since base and instruct variants sometimes differ. The Hugging Face tokenizer.apply_chat_template() method handles this automatically for supported models, reading the template from the tokenizer config. Using the wrong format trains the model to expect delimiters that will not appear at inference, or vice versa, producing garbled outputs.
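To make the idea of a chat template concrete, the sketch below renders a message list into ChatML by hand. The helper name `render_chatml` and its exact whitespace handling are assumptions made for illustration; in a real pipeline the template should come from the checkpoint's own tokenizer via `tokenizer.apply_chat_template()` rather than from hand-written string formatting.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as a ChatML string.

    Hypothetical helper for illustration; whitespace conventions can
    differ between checkpoints that use ChatML-style tokens.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # At inference time the prompt ends with an open assistant turn.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

# Full conversation, as it would appear in the training set.
training_text = render_chatml(conversation)

# System and user turns only, ending where the model should start generating.
inference_prompt = render_chatml(conversation[:2], add_generation_prompt=True)
```

The training text begins with exactly the string the model will be given at inference time, which is the property a correct template guarantees.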

Training Loss: Why Masking the Instruction Tokens Matters

SFT uses standard cross-entropy loss, but with a crucial detail: loss is computed only on the output (response) tokens, not on the input (instruction) tokens. This is called completion masking or response-only loss.

Sequence: [INST]  What  is    2+2?  [/INST]  The  answer  is  4.
Labels:   -100    -100  -100  -100  -100     The  answer  is  4.
                                             ← loss here only →

Setting instruction token labels to -100 causes PyTorch's cross-entropy loss to ignore them, because -100 is the default ignore_index. This matters for two reasons. First, the instruction is always supplied as context at inference time, so the model never needs to generate it; including it in the loss spends capacity on predicting the exact wording of your dataset's prompts rather than on generating responses. Second, gradient quality degrades: loss on instruction tokens adds signal that has nothing to do with response quality, and when prompts are long and responses short it can dominate the total. Omitting instruction tokens produces cleaner gradients focused entirely on whether the model generates good completions.
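A minimal sketch of response-only loss in plain Python, with no framework dependency. The function names, the token ids and the uniform "model" are all made up for illustration, and the log-probabilities are assumed to be already aligned so that `log_probs[t]` scores the token at position `t`. Real training code would normally rely on the framework's own handling of the -100 ignore index.

```python
import math

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def masked_cross_entropy(log_probs, labels):
    """Mean negative log-likelihood over positions whose label is not masked.

    log_probs[t][v] is the log-probability given to token v at position t.
    """
    losses = [
        -log_probs[t][label]
        for t, label in enumerate(labels)
        if label != IGNORE_INDEX
    ]
    return sum(losses) / len(losses) if losses else 0.0


prompt_ids = [5, 6, 7]   # made-up token ids for the instruction
response_ids = [1, 2]    # made-up token ids for the response
input_ids, labels = build_labels(prompt_ids, response_ids)

# A stand-in model that assigns equal probability to every token.
vocab_size = 8
uniform = [[math.log(1 / vocab_size)] * vocab_size for _ in input_ids]
loss = masked_cross_entropy(uniform, labels)
```

Only the two response positions contribute to `loss`; the three prompt positions are skipped no matter what the model predicts there.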

In multi-turn conversations, the convention extends naturally: loss is computed only on the assistant turns, not on any user or system turns, regardless of where they appear in the sequence.
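The multi-turn convention can be sketched the same way. The helper below is hypothetical and assumes each turn has already been tokenised, with any template tokens included in that turn's ids; the ids themselves are invented for the example.

```python
IGNORE_INDEX = -100  # label value excluded from the loss


def mask_non_assistant(turns):
    """Build (input_ids, labels) from a list of (role, token_ids) turns.

    Only assistant tokens keep their ids as labels; system and user
    tokens are masked wherever they appear in the conversation.
    """
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


turns = [
    ("system", [1, 2]),
    ("user", [3, 4, 5]),
    ("assistant", [6, 7]),
    ("user", [8]),
    ("assistant", [9, 10, 11]),
]
input_ids, labels = mask_non_assistant(turns)
```

Both assistant turns contribute to the loss, while the second user turn is masked even though it sits between them.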

Dataset Size and Quality

The central finding of the LIMA paper (Zhou et al., 2023) — "Less is More for Alignment" — is that 1,000 carefully curated, diverse examples can instruction-tune a strong base model to near-state-of-the-art performance. This challenges the intuition that more data is always better and has significant practical implications.

What makes data high-quality
  • Responses written by skilled human annotators or strong models
  • Diverse task coverage — instruction types, domains, lengths
  • Accurate, well-reasoned responses (factual correctness matters)
  • Consistent format and style matching target inference behaviour
  • Absence of toxic, biased, or misleading examples

What large noisy datasets do
  • Introduce contradictory style signals that confuse the model
  • Dilute rare high-quality examples with low-quality repetitive ones
  • Require more compute without proportional quality gain
  • May include formatting artifacts, truncated responses, factual errors
  • Risk encoding distribution biases that are hard to diagnose

Practical dataset sizes for SFT span a wide range: domain-specific task adaptation can work with 1K–10K examples; general instruction following typically uses 50K–500K; the largest published SFT datasets (Tulu, OpenHermes, SlimOrca) contain 100K–1M examples but apply quality filtering. The diminishing returns curve is steep — the first 10K diverse, clean examples do far more work than examples 100K–110K.
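As a rough illustration of the kind of quality filtering mentioned above, the sketch below removes duplicate prompts and very short (possibly truncated) responses. The function name, the `instruction` and `response` field names and the 20-character threshold are all assumptions; real pipelines use much richer heuristics and often model-based scoring.

```python
def filter_examples(examples, min_response_chars=20):
    """Keep the first example per normalised instruction, dropping thin responses.

    Hypothetical heuristic filter: the threshold is arbitrary and would
    wrongly drop tasks whose correct answers are legitimately short.
    """
    seen = set()
    kept = []
    for example in examples:
        # Normalise case and whitespace so near-identical prompts collide.
        instruction = " ".join(example["instruction"].lower().split())
        response = example["response"].strip()
        if not instruction or len(response) < min_response_chars:
            continue
        if instruction in seen:
            continue
        seen.add(instruction)
        kept.append(example)
    return kept


raw = [
    {"instruction": "Summarise the article.",
     "response": "The article argues that data quality matters most."},
    {"instruction": "summarise  the article.",
     "response": "A duplicate prompt with different spacing and case."},
    {"instruction": "What is the capital of France?",
     "response": "The capi"},
    {"instruction": "Explain overfitting.",
     "response": "Overfitting is when a model memorises its training data."},
]
cleaned = filter_examples(raw)
```

Here the duplicate prompt and the truncated response are dropped, leaving two of the four examples.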

Full Fine-Tuning vs Parameter-Efficient Methods

Full fine-tuning updates every parameter in the model. For a 7B parameter model in BF16, that is 14 GB for the weights alone. Training adds roughly 3× more on top: 14 GB for the gradients and 28 GB for Adam's first and second moment estimates (assuming these are also kept in BF16), totalling ~56 GB of GPU memory, and that is before activations and the training batch. Mixed-precision setups that keep FP32 master weights and FP32 optimizer state need considerably more. By the same arithmetic, a 70B model requires an impractical ~560 GB. Full fine-tuning is maximally flexible but costly.
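The arithmetic behind these figures can be sketched as follows. The function is a hypothetical back-of-the-envelope estimator that assumes weights, gradients and both Adam moments are stored at the same precision (2 bytes per value for BF16) and ignores activations entirely.

```python
BYTES_PER_GB = 1e9


def full_finetune_memory_gb(num_params, bytes_per_value=2):
    """Rough memory for weights, gradients and Adam state, excluding activations."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    adam_moments = 2 * num_params * bytes_per_value  # first and second moments
    total = weights + gradients + adam_moments
    return {
        "weights_gb": weights / BYTES_PER_GB,
        "gradients_gb": gradients / BYTES_PER_GB,
        "adam_moments_gb": adam_moments / BYTES_PER_GB,
        "total_gb": total / BYTES_PER_GB,
    }


estimate_7b = full_finetune_memory_gb(7_000_000_000)
estimate_70b = full_finetune_memory_gb(70_000_000_000)
```

Under these assumptions the 7B estimate comes to 14 GB of weights and 56 GB in total, and the 70B estimate to 560 GB.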

Method | Params trained | Memory (7B) | Best for
Full fine-tuning | 100% | ~56 GB | Maximal customisation, large dataset
LoRA | ~1% | ~14 GB | Most tasks; widely used default
QLoRA | ~1% | ~6 GB | Consumer GPU; large base models
Prompt tuning | <0.01% | ~14 GB | Many tasks sharing one frozen model; extreme budget

Parameter-efficient fine-tuning (PEFT) methods freeze the original weights and add or modify a small number of trainable parameters. The optimizer therefore only maintains state for those few parameters, dramatically reducing memory. The base model weights remain unchanged, enabling the adapter to be swapped without reloading the full model and allowing multiple adapters to share a single base.
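To see why so few parameters are trainable, consider LoRA applied to a single weight matrix: instead of updating all d_out × d_in entries, it trains two low-rank factors of shapes d_out × r and r × d_in. The sketch below counts both; the function name and the 4096 × 4096 layer size are illustrative assumptions, not figures for any specific model.

```python
def lora_trainable_fraction(d_in, d_out, rank):
    """Return (lora_params, full_params, fraction) for one adapted weight matrix."""
    full_params = d_in * d_out
    lora_params = rank * (d_in + d_out)  # factor A is r x d_in, factor B is d_out x r
    return lora_params, full_params, lora_params / full_params


lora_params, full_params, fraction = lora_trainable_fraction(
    d_in=4096, d_out=4096, rank=8
)
```

For this layer the adapter has 65,536 trainable values against 16,777,216 in the full matrix, a little under 0.4%, and the optimizer only needs state for that small set.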

Catastrophic Forgetting

Catastrophic forgetting is the tendency of a neural network to lose previously learned capabilities when trained on new data. When you fine-tune a general-purpose model on a narrow domain — say, only medical question answering — gradient updates shift the weights toward that distribution, degrading performance on tasks not represented in the fine-tuning data. The model that was previously competent at coding, summarisation, and translation may lose meaningful capability on all three.

The severity depends on several factors: how many gradient steps are taken, how different the fine-tuning distribution is from pretraining, and how aggressively the learning rate is set. A model fine-tuned for 1 epoch on 10K diverse examples typically forgets much less than one fine-tuned for 10 epochs on 1K narrow examples.

Data mixing

Mix a fraction of general pretraining data into the fine-tuning set. Even 10–20% general data can anchor the model's broad capabilities while still shifting it toward the target distribution. Trade-off: requires access to suitable general data and more training compute.
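One way to build such a mixture is sketched below. The function name, the default 15% fraction and the fixed seed are assumptions for illustration; the helper samples enough general examples that they make up roughly the requested share of the final set, then shuffles.

```python
import random


def mix_datasets(task_examples, general_examples, general_fraction=0.15, seed=0):
    """Return a shuffled list in which about general_fraction of items are general."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    # Solve n_general / (n_task + n_general) = general_fraction for n_general.
    n_general = round(len(task_examples) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general_examples))
    rng = random.Random(seed)
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed


# Placeholder examples tagged by source so the mixture can be inspected.
task_examples = [("task", i) for i in range(850)]
general_examples = [("general", i) for i in range(1000)]
mixed = mix_datasets(task_examples, general_examples, general_fraction=0.15)
general_share = sum(1 for source, _ in mixed if source == "general") / len(mixed)
```

With 850 task examples and a 15% target, the helper adds 150 general examples, giving a mixed set of 1,000.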

Low learning rate

Fine-tuning at a much lower learning rate than pretraining (typically 1e-5 to 5e-5) makes smaller weight updates per step, preserving more of the pretrained distribution. Combined with early stopping at 1–3 epochs, this is the simplest forgetting mitigation.

PEFT methods

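Because LoRA and other PEFT methods freeze the base weights entirely, they provide structural protection against forgetting: the original parameter values literally cannot change. Only the small adapter parameters update, and these are added to the frozen weights rather than replacing them. The protection is not absolute, since the adapted model (base plus adapter) can still behave worse on tasks outside the fine-tuning data, but the low-rank update tends to limit how far it drifts, and removing the adapter restores the original model exactly.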
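The additive structure can be illustrated with tiny matrices in plain Python. Everything here is a toy assumption: a 2 × 2 "base weight", a rank-1 adapter and invented "trained" values. It shows that the effective weight is the frozen base plus the product of the two adapter factors, and that the base itself is never modified.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def effective_weight(base, lora_b, lora_a, scale=1.0):
    """Return base + scale * (B @ A) as a new matrix, leaving base untouched."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


base = [[1.0, 2.0], [3.0, 4.0]]    # frozen 2x2 weight
lora_a = [[0.5, -0.5]]             # rank-1 factor A, shape (1, 2)
lora_b_init = [[0.0], [0.0]]       # B starts at zero, so the adapter is a no-op
lora_b_trained = [[1.0], [2.0]]    # invented values standing in for training

before = effective_weight(base, lora_b_init, lora_a)
after = effective_weight(base, lora_b_trained, lora_a)
```

Before training the effective weight equals the base; afterwards it differs, yet `base` still holds its original values, so dropping the adapter recovers the original model.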

Key SFT Hyperparameters

The following hyperparameters have the most impact on SFT outcome:

Parameter | Typical range | Notes
Learning rate | 1e-5 – 5e-5 | Much lower than pretraining; use cosine decay with warmup
Epochs | 1 – 3 | More epochs increase forgetting risk and overfitting on small datasets
Batch size | 16 – 128 (effective) | Use gradient accumulation to achieve large effective batch sizes on limited hardware
Max sequence length | 512 – 8192 tokens | Longer sequences increase memory quadratically (attention); pack short sequences to maximise GPU utilisation
Warmup steps | 50 – 200 steps | Linear warmup prevents large early updates from damaging pretrained weights
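Two of these settings are easy to express in code: a linear-warmup, cosine-decay learning rate schedule and the effective batch size obtained through gradient accumulation. The sketch below is a hand-rolled illustration with hypothetical function names and default values taken from the ranges in the table; training frameworks ship their own schedulers, which would normally be used instead.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_steps=100, min_lr=0.0):
    """Learning rate with linear warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so the first updates are small.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (peak_lr - min_lr) * cosine


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


schedule = [lr_at_step(step, total_steps=1000) for step in range(1001)]
batch = effective_batch_size(per_device_batch=4, grad_accum_steps=8, num_devices=2)
```

The schedule rises to the peak rate at the end of warmup and then decays smoothly towards the minimum, while a per-device batch of 4 with 8 accumulation steps on 2 devices gives an effective batch of 64.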

Checklist: Do You Understand This?

  • Can you explain why cross-entropy loss is applied only to response tokens during SFT, and describe what happens if you include instruction tokens in the loss?
  • Can you describe what chat templates are, name at least two different formats, and explain what goes wrong when the training format does not match the inference format?
  • Can you summarise the LIMA paper's finding and explain why dataset quality can outperform dataset quantity in SFT?
  • Can you describe the memory cost breakdown for full fine-tuning a 7B model (weights + optimizer state) and explain why PEFT methods dramatically reduce this?
  • Can you define catastrophic forgetting, explain what causes it during SFT, and describe three mitigation strategies and their trade-offs?
  • Can you compare FLAN, InstructGPT, Alpaca, and Vicuna — what data did each use, and what did each demonstrate about instruction tuning?