LoRA & QLoRA Approaches
Full fine-tuning updates every model weight, which requires storing gradients and optimizer state for all parameters and is impractical for large models. LoRA (Low-Rank Adaptation) trains only small adapter matrices, achieving near-full-tuning quality with a fraction of the compute and memory. QLoRA extends this to consumer GPUs by quantising the base model to 4-bit.
LoRA: The Core Idea
In full fine-tuning, all ~7–70 billion weights are updated during backpropagation. LoRA instead freezes the base model entirely and adds small trainable adapter matrices alongside selected layers:
The math, simplified
For a weight matrix W (dimensions d × k), LoRA adds two small matrices: B (d × r) and A (r × k), where r is the rank (e.g. r=8 or r=16). During training, only A and B are updated. The effective weight becomes W + BA. Since r << d and r << k, the number of trainable parameters, r(d + k), is tiny compared to the d·k entries of the full matrix.
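A quick sketch of these shapes and counts in plain NumPy, following the LoRA paper's convention (B is d × r and initialised to zero, A is r × k), with illustrative dimensions not tied to any particular model:

```python
import numpy as np

d, k, r = 4096, 4096, 16               # illustrative layer size and rank

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))        # frozen base weight, never updated
B = np.zeros((d, r))                   # trainable adapter, zero init
A = rng.standard_normal((r, k)) * 0.01 # trainable adapter

# Forward pass sees the effective weight W + BA
x = rng.standard_normal(d)
y = x @ (W + B @ A)                    # equals x @ W + (x @ B) @ A

full_params = d * k                    # what full fine-tuning would train
lora_params = r * (d + k)              # what LoRA actually trains
print(full_params, lora_params, full_params // lora_params)
# 16777216 131072 128
```

Zero-initialising B means BA starts as a no-op, so training begins from exactly the base model's behaviour.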
Example: a 70B model has tens of billions of weight parameters. With LoRA rank=16 applied to attention layers, you train roughly 50–100 million parameters instead, on the order of 1,000× fewer.
Why LoRA Works
The hypothesis behind LoRA is that the weight updates needed for fine-tuning have low intrinsic rank, meaning the direction of change in weight space can be captured by low-dimensional matrices. This has been empirically validated: LoRA fine-tunes achieve 90–99% of full fine-tuning quality on most tasks.
Memory advantage: during training, you only need to load the base model weights (frozen, no gradients) plus the tiny adapter matrices. Gradient computation only flows through adapters.
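Back-of-envelope arithmetic makes this advantage concrete. The sketch below compares gradient-plus-Adam-state memory for a full fine-tune against adapters only, assuming 2 bytes/param for FP16 gradients and 8 bytes/param for FP32 Adam moments; the 7B and 40M figures are illustrative, not measured:

```python
base_params = 7e9       # hypothetical 7B base model
adapter_params = 40e6   # hypothetical LoRA adapter size

def train_state_gb(n_params: float) -> float:
    # FP16 gradients (2 B/param) + FP32 Adam m and v (8 B/param)
    return n_params * (2 + 8) / 1e9

full = train_state_gb(base_params)      # every weight needs this state
lora = train_state_gb(adapter_params)   # only the adapters need it
print(f"full: {full:.0f} GB, lora: {lora:.1f} GB")
# full: 70 GB, lora: 0.4 GB
```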
QLoRA: LoRA on Consumer Hardware
QLoRA (Quantised LoRA) from Dettmers et al. (2023) extends LoRA by quantising the frozen base model to 4-bit NormalFloat (NF4):
- Base model loaded in 4-bit, using ~1/4 the VRAM of FP16
- Adapter matrices trained in 16-bit, preserving gradient quality
- Paged optimiser states prevent OOM during training
Result: a 70B model whose FP16 weights alone occupy ~140GB can be fine-tuned on a single 48GB GPU, and 7–13B models fit on a 24GB card (RTX 4090 or A10G). This significantly democratised fine-tuning of large models.
| Method | GPU VRAM for 70B model | Quality vs full fine-tune |
|---|---|---|
| Full fine-tuning (FP16) | ~800GB+ (weights + gradients + optimizer states) | 100% (baseline) |
| LoRA (FP16 base) | ~140GB (frozen weights) + small adapter | ~95ā99% |
| QLoRA (NF4 base) | ~35ā40GB | ~92ā97% |
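The weight-storage entries in this table follow from simple arithmetic over bits per parameter; a sketch (ignoring activations, gradients, and quantisation block overhead):

```python
# Back-of-envelope VRAM for storing 70B weights at different precisions.
params = 70e9

def weight_gb(bits_per_param: float) -> float:
    return params * bits_per_param / 8 / 1e9

fp16 = weight_gb(16)   # frozen FP16 base for plain LoRA
nf4 = weight_gb(4)     # 4-bit NF4 base for QLoRA
print(f"FP16 weights: {fp16:.0f} GB, NF4 weights: {nf4:.0f} GB")
# FP16 weights: 140 GB, NF4 weights: 35 GB
```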
Key LoRA Parameters
- r (rank) – Controls adapter capacity. Higher rank = more parameters = more capacity but slower training. Common values: 8, 16, 32, 64. Start with r=16 for most tasks.
- lora_alpha – Scaling factor (typically set to r or 2r). Adapter updates are scaled by lora_alpha / r, so alpha controls their magnitude.
- target_modules – Which weight matrices to apply LoRA to. Common choices: `q_proj, v_proj` (attention only, cheaper) or `all-linear` (all linear layers, better quality).
- lora_dropout – Regularisation; typically 0.05–0.1.
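A minimal sketch of how the alpha scaling interacts with rank (NumPy, illustrative sizes; this mirrors the lora_alpha / r scaling used by the peft library):

```python
import numpy as np

d, k, r, alpha = 64, 48, 16, 32   # illustrative sizes; alpha = 2r

rng = np.random.default_rng(1)
W = rng.standard_normal((d, k))   # frozen base weight
B = rng.standard_normal((d, r))   # adapter
A = rng.standard_normal((r, k))   # adapter

scaling = alpha / r               # 2.0 here: raising r alone dilutes the
W_eff = W + (B @ A) * scaling     # update, so alpha is often raised with it
```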
Training Data Format
Most LoRA training uses instruction-tuning format (also called "chat format"):
```
# Alpaca-style (common for instruction tuning)
{
  "instruction": "Summarise this document in 3 bullet points.",
  "input": "The document text...",
  "output": "• Key point 1\n• Key point 2\n• Key point 3"
}

# Chat format (for conversational models)
{
  "messages": [
    {"role": "user", "content": "Summarise this: ..."},
    {"role": "assistant", "content": "• Key point 1\n..."}
  ]
}
```

Quality matters more than quantity. Each example should demonstrate exactly the behaviour you want. Remove duplicates, fix formatting errors, and remove examples where the "correct" output is ambiguous.
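Converting between the two formats is mechanical; a sketch (the `alpaca_to_chat` helper is hypothetical, not part of any library):

```python
def alpaca_to_chat(example: dict) -> dict:
    """Convert one Alpaca-style record into the chat-messages format."""
    user_content = example["instruction"]
    if example.get("input"):  # fold the optional input into the user turn
        user_content += "\n\n" + example["input"]
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }

record = {
    "instruction": "Summarise this document in 3 bullet points.",
    "input": "The document text...",
    "output": "- Key point 1\n- Key point 2\n- Key point 3",
}
print(alpaca_to_chat(record))
```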
PEFT Library: Practical Implementation
Hugging Face's `peft` library makes LoRA training straightforward:
```python
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# e.g. trainable params: ~42M || all params: ~8.07B || trainable%: ~0.52
```

Merging Adapters for Deployment
After training, you have a base model + adapter weights. For production deployment, you typically merge the adapter into the base model:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base + adapter
model = AutoModelForCausalLM.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(model, adapter_path)

# Merge and save
model = model.merge_and_unload()
model.save_pretrained("./merged-model")
```

The merged model behaves identically to the adapter-augmented version but runs with standard inference tools, without needing the adapter loaded separately. It can then be converted to GGUF with llama.cpp and quantised for Ollama deployment.
Axolotl: Community Fine-Tuning Framework
Axolotl is one of the most popular open-source frameworks for community LoRA/QLoRA fine-tuning:
- YAML-based configuration (no custom training code needed)
- Supports LoRA, QLoRA, full fine-tuning
- Multiple dataset format support (Alpaca, ShareGPT, Axolotl format)
- Multi-GPU training via DeepSpeed/FSDP
- Integration with Weights & Biases for training monitoring
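A minimal QLoRA config sketch in Axolotl's YAML style; the dataset path is a hypothetical local file, and field names can differ between Axolotl versions, so treat this as illustrative rather than copy-paste ready:

```yaml
base_model: meta-llama/Llama-3.1-8B-Instruct

load_in_4bit: true        # quantise the frozen base to 4-bit (QLoRA)
adapter: qlora

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true  # adapt all linear layers

datasets:
  - path: my_dataset.jsonl   # hypothetical local file
    type: alpaca

sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
output_dir: ./qlora-out
```

Axolotl's CLI consumes a file like this directly; no custom training code is needed.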
Checklist: Do You Understand This?
- What is the core insight behind LoRA ā what does "low-rank adaptation" mean?
- How does QLoRA reduce VRAM requirements compared to standard LoRA?
- What are the `r` and `target_modules` parameters and how do you choose them?
- What does "merging adapters" mean and why do you do it before deployment?
- What tool is commonly used for community LoRA/QLoRA fine-tuning with YAML config?