LoRA & QLoRA Approaches
Full fine-tuning updates every model weight, which requires storing gradients and optimizer state for all parameters and is impractical for large models. LoRA (Low-Rank Adaptation) trains only small adapter matrices, achieving near-full-tuning quality with a fraction of the compute and memory. QLoRA extends this to consumer GPUs by quantising the base model to 4-bit.
LoRA: The Core Idea
In full fine-tuning, all ~7–70 billion weights are updated during backpropagation. LoRA instead freezes the base model entirely and adds small trainable adapter matrices alongside selected layers:
The math, simplified
For a weight matrix W (dimensions d × k), LoRA adds two small matrices: B (d × r) and A (r × k), where r is the rank (e.g. r=8 or r=16). During training, only A and B are updated. The effective weight becomes W + BA. Since r << d and r << k, the number of trainable parameters, r(d + k), is tiny compared to the d·k entries of the full matrix.
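A quick sketch of these shapes and counts in plain NumPy, following the LoRA paper's convention (B is d × r and initialised to zero, A is r × k), with illustrative dimensions not tied to any particular model:

```python
import numpy as np

d, k, r = 4096, 4096, 16               # illustrative layer size and rank

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))        # frozen base weight, never updated
B = np.zeros((d, r))                   # trainable adapter, zero init
A = rng.standard_normal((r, k)) * 0.01 # trainable adapter

# Forward pass sees the effective weight W + BA
x = rng.standard_normal(d)
y = x @ (W + B @ A)                    # equals x @ W + (x @ B) @ A

full_params = d * k                    # what full fine-tuning would train
lora_params = r * (d + k)              # what LoRA actually trains
print(full_params, lora_params, full_params // lora_params)
# 16777216 131072 128
```

Zero-initialising B means BA starts as a no-op, so training begins from exactly the base model's behaviour.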
Example: a 70B model has tens of billions of weight parameters. With LoRA rank=16 applied to attention layers, you train roughly 50–100 million parameters instead, on the order of 1,000× fewer.
Why LoRA Works
The hypothesis behind LoRA is that the weight updates needed for fine-tuning have low intrinsic rank, meaning the direction of change in weight space can be captured by low-dimensional matrices. This has been empirically validated: LoRA fine-tunes achieve 90–99% of full fine-tuning quality on most tasks.
Memory advantage: during training, you only need to load the base model weights (frozen, no gradients) plus the tiny adapter matrices. Gradient computation only flows through adapters.
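Back-of-envelope arithmetic makes this advantage concrete. The sketch below compares gradient-plus-Adam-state memory for a full fine-tune against adapters only, assuming 2 bytes/param for FP16 gradients and 8 bytes/param for FP32 Adam moments; the 7B and 40M figures are illustrative, not measured:

```python
base_params = 7e9       # hypothetical 7B base model
adapter_params = 40e6   # hypothetical LoRA adapter size

def train_state_gb(n_params: float) -> float:
    # FP16 gradients (2 B/param) + FP32 Adam m and v (8 B/param)
    return n_params * (2 + 8) / 1e9

full = train_state_gb(base_params)      # every weight needs this state
lora = train_state_gb(adapter_params)   # only the adapters need it
print(f"full: {full:.0f} GB, lora: {lora:.1f} GB")
# full: 70 GB, lora: 0.4 GB
```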
QLoRA: LoRA on Consumer Hardware
QLoRA (Quantised LoRA) from Dettmers et al. (2023) extends LoRA by quantising the frozen base model to 4-bit NormalFloat (NF4):
- Base model loaded in 4-bit, using ~1/4 the VRAM of FP16
- Adapter matrices trained in 16-bit, preserving gradient quality
- Paged optimiser states prevent OOM during training
Result: a 70B model whose FP16 weights alone occupy ~140GB can be fine-tuned on a single 48GB GPU, and 7–13B models fit on a 24GB card (RTX 4090 or A10G). This significantly democratised fine-tuning of large models.
| Method | GPU VRAM for 70B model | Quality vs full fine-tune |
|---|---|---|
| Full fine-tuning (FP16) | ~800GB+ (weights + gradients + optimizer states) | 100% (baseline) |
| LoRA (FP16 base) | ~140GB (frozen weights) + small adapter | ~95ā99% |
| QLoRA (NF4 base) | ~35ā40GB | ~92ā97% |
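The weight-storage entries in this table follow from simple arithmetic over bits per parameter; a sketch (ignoring activations, gradients, and quantisation block overhead):

```python
# Back-of-envelope VRAM for storing 70B weights at different precisions.
params = 70e9

def weight_gb(bits_per_param: float) -> float:
    return params * bits_per_param / 8 / 1e9

fp16 = weight_gb(16)   # frozen FP16 base for plain LoRA
nf4 = weight_gb(4)     # 4-bit NF4 base for QLoRA
print(f"FP16 weights: {fp16:.0f} GB, NF4 weights: {nf4:.0f} GB")
# FP16 weights: 140 GB, NF4 weights: 35 GB
```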
Key LoRA Parameters
- r (rank) – Controls adapter capacity. Higher rank = more parameters = more capacity but slower training. Common values: 8, 16, 32, 64. Start with r=16 for most tasks.
- lora_alpha – Scaling factor (typically set to r or 2r). Adapter updates are scaled by lora_alpha / r, so alpha controls their magnitude.
- target_modules – Which weight matrices to apply LoRA to. Common choices: `q_proj, v_proj` (attention only, cheaper) or `all-linear` (all linear layers, better quality).
- lora_dropout – Regularisation; typically 0.05–0.1.
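A minimal sketch of how the alpha scaling interacts with rank (NumPy, illustrative sizes; this mirrors the lora_alpha / r scaling used by the peft library):

```python
import numpy as np

d, k, r, alpha = 64, 48, 16, 32   # illustrative sizes; alpha = 2r

rng = np.random.default_rng(1)
W = rng.standard_normal((d, k))   # frozen base weight
B = rng.standard_normal((d, r))   # adapter
A = rng.standard_normal((r, k))   # adapter

scaling = alpha / r               # 2.0 here: raising r alone dilutes the
W_eff = W + (B @ A) * scaling     # update, so alpha is often raised with it
```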
Training Data Format
Most LoRA training uses instruction-tuning format (also called "chat format"):
```
# Alpaca-style (common for instruction tuning)
{
  "instruction": "Summarise this document in 3 bullet points.",
  "input": "The document text...",
  "output": "• Key point 1\n• Key point 2\n• Key point 3"
}

# Chat format (for conversational models)
{
  "messages": [
    {"role": "user", "content": "Summarise this: ..."},
    {"role": "assistant", "content": "• Key point 1\n..."}
  ]
}
```

Quality matters more than quantity. Each example should demonstrate exactly the behaviour you want. Remove duplicates, fix formatting errors, and remove examples where the "correct" output is ambiguous.
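Converting between the two formats is mechanical; a sketch (the `alpaca_to_chat` helper is hypothetical, not part of any library):

```python
def alpaca_to_chat(example: dict) -> dict:
    """Convert one Alpaca-style record into the chat-messages format."""
    user_content = example["instruction"]
    if example.get("input"):  # fold the optional input into the user turn
        user_content += "\n\n" + example["input"]
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }

record = {
    "instruction": "Summarise this document in 3 bullet points.",
    "input": "The document text...",
    "output": "- Key point 1\n- Key point 2\n- Key point 3",
}
print(alpaca_to_chat(record))
```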
PEFT Library: Practical Implementation
Hugging Face's `peft` library makes LoRA training straightforward:
```python
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# e.g. trainable params: ~42M || all params: ~8.07B || trainable%: ~0.52
```

Merging Adapters for Deployment
After training, you have a base model + adapter weights. For production deployment, you typically merge the adapter into the base model:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base + adapter
model = AutoModelForCausalLM.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(model, adapter_path)

# Merge and save
model = model.merge_and_unload()
model.save_pretrained("./merged-model")
```

The merged model behaves identically to the adapter-augmented version but runs with standard inference tools, without needing the adapter loaded separately. It can then be converted to GGUF with llama.cpp and quantised for Ollama deployment.
Axolotl: Community Fine-Tuning Framework
Axolotl is one of the most popular open-source frameworks for community LoRA/QLoRA fine-tuning:
- YAML-based configuration (no custom training code needed)
- Supports LoRA, QLoRA, full fine-tuning
- Multiple dataset format support (Alpaca, ShareGPT, Axolotl format)
- Multi-GPU training via DeepSpeed/FSDP
- Integration with Weights & Biases for training monitoring
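A minimal QLoRA config sketch in Axolotl's YAML style; the dataset path is a hypothetical local file, and field names can differ between Axolotl versions, so treat this as illustrative rather than copy-paste ready:

```yaml
base_model: meta-llama/Llama-3.1-8B-Instruct

load_in_4bit: true        # quantise the frozen base to 4-bit (QLoRA)
adapter: qlora

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true  # adapt all linear layers

datasets:
  - path: my_dataset.jsonl   # hypothetical local file
    type: alpaca

sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
output_dir: ./qlora-out
```

Axolotl's CLI consumes a file like this directly; no custom training code is needed.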
Checklist: Do You Understand This?
- What is the core insight behind LoRA ā what does "low-rank adaptation" mean?
- How does QLoRA reduce VRAM requirements compared to standard LoRA?
- What are the `r` and `target_modules` parameters and how do you choose them?
- What does "merging adapters" mean and why do you do it before deployment?
- What tool is commonly used for community LoRA/QLoRA fine-tuning with YAML config?