🧠 All Things AI
Advanced

LoRA & QLoRA: Parameter-Efficient Fine-Tuning

Full fine-tuning of large language models is prohibitively expensive for most practitioners. A 7B-parameter model requires around 56 GB of GPU memory once optimizer state is accounted for; a 70B model requires over 500 GB. LoRA (Low-Rank Adaptation, Hu et al., 2022) solved this problem elegantly by observing that the weight updates produced during fine-tuning have low intrinsic rank: they can be well approximated by the product of two small matrices. Instead of updating the original weight matrices directly, LoRA freezes them and trains only these small decomposition matrices, reducing trainable parameters by around 99% with minimal quality loss.

QLoRA (Dettmers et al., 2023) extended this by quantising the frozen base weights to 4-bit precision, slashing the memory cost of the base model itself. Together, LoRA and QLoRA have become the standard techniques for fine-tuning on consumer and research-grade hardware, enabling work that previously required multi-GPU server clusters.

The LoRA Mathematics

Consider a weight matrix W ∈ ℝ^(d×k) in a transformer layer. During full fine-tuning, the update is ΔW with the same shape d×k. LoRA instead constrains ΔW to be a product of two low-rank matrices:

W' = W + ΔW = W + B·A

where:
  B ∈ ℝ^(d×r) is the up-projection matrix
  A ∈ ℝ^(r×k) is the down-projection matrix
  r << min(d, k) is the rank (a hyperparameter)

Parameter count:
  Original ΔW:  d × k                (e.g. 4096 × 4096 = 16.7M params)
  LoRA B + A:   d×r + r×k = r(d+k)   (e.g. r = 8: 8 × 8192 = 65K params)
  Reduction:    ~256× fewer trainable parameters

During training, W is frozen and only A and B are updated. A is initialised with a random Gaussian; B is initialised to zeros, so ΔW = BA = 0 at the start of training: the adapter contributes nothing initially and the model starts from exactly its pretrained behaviour. As training proceeds, A and B learn a low-rank correction to each weight matrix.

A scaling factor α/r (where α is a second hyperparameter, often set to match r or to a fixed value like 16) is applied to the product BA before adding it to W. This controls the magnitude of the adapter's contribution relative to the frozen weights and is the primary way to tune how aggressively the adapter influences the model.
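A minimal NumPy sketch of the adapted forward pass makes the mechanics concrete (the shapes and sizes here are illustrative, not tied to any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 48, 8, 16

W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable, small Gaussian init
B = np.zeros((d, r))                     # trainable, zero init

def lora_forward(x):
    # frozen path plus the scaled low-rank correction (alpha / r) * B @ A @ x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(k)
# Because B starts at zero, the adapter is a no-op at initialisation:
assert np.allclose(lora_forward(x), W @ x)
```

Note that the low-rank path computes A @ x first (projecting down to r dimensions) and then applies B, so the full d×k matrix BA is never materialised during the forward pass.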

Rank Selection and Expressiveness

The rank r is the central hyperparameter of LoRA. It controls the expressiveness of the adapter: a higher rank can represent more complex weight updates at the cost of more trainable parameters.

Rank r    | Trainable params (7B, typical) | Use case
r = 4     | ~3M                            | Minimal task adaptation; style fine-tuning
r = 8     | ~6M                            | Good default; most instruction-tuning tasks
r = 16    | ~12M                           | More complex behaviours; domain adaptation
r = 32-64 | ~24-48M                        | Approaching full fine-tuning quality; diminishing returns beyond r = 64

The empirical finding from the original LoRA paper and subsequent work is that surprisingly low rank is sufficient for most fine-tuning tasks. The authors found r = 1 to 4 performed competitively on many benchmarks. This supports the underlying hypothesis that the weight updates that matter during fine-tuning live in a genuinely low-dimensional subspace of the full parameter space. Starting with r = 8 or r = 16 is a reliable default; increasing beyond r = 64 rarely yields meaningful improvement.
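The parameter counts above are easy to sanity-check with the r(d+k) formula. A small helper (hypothetical, using the 4096×4096 projection shape from the earlier example; exact totals depend on which matrices are adapted):

```python
def lora_params(r, shapes):
    # Each adapted d x k matrix contributes r * (d + k) trainable params.
    return sum(r * (d + k) for d, k in shapes)

# One 4096 x 4096 projection at r = 8: 65,536 params vs 16.7M frozen.
assert lora_params(8, [(4096, 4096)]) == 65_536

# Hypothetical 7B-style config: 32 layers, Q and V projections adapted.
print(lora_params(8, [(4096, 4096)] * (32 * 2)))  # 4,194,304 -> ~4M params
```

Adapting more matrices per layer pushes the total into the single-digit millions, consistent with the table.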

Where to Apply LoRA Adapters

LoRA can be applied to any weight matrix in the model, but in practice the choice of which matrices to adapt significantly affects both quality and efficiency.

Common: Q and V projection

The original LoRA paper applied adapters only to the query (W_Q) and value (W_V) projection matrices in each attention layer. This is still a common choice and produces strong results. The key, output, and FFN matrices are left frozen.

Better: Q, K, V, O + FFN

Adapting all four attention projections (Q, K, V, O) plus the FFN weight matrices in each block consistently outperforms Q+V only, at roughly 3× more trainable parameters. This is the configuration used in the QLoRA paper and a common recommendation when fine-tuning with the Hugging Face PEFT library.
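In the Hugging Face PEFT library this choice is expressed via target_modules. A sketch for a Llama-style model (the module names below are Llama's; other architectures use different names):

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # all attention projections
        "gate_proj", "up_proj", "down_proj",      # FFN matrices
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)  # base_model loaded beforehand
```

Restricting target_modules to ["q_proj", "v_proj"] reproduces the original paper's Q+V-only setup.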

Embedding layers

For tasks requiring new vocabulary or languages, adapting the embedding and language model head matrices can help. This is less common and requires careful handling of tokenizer changes.

Which layers (depth)

By default, LoRA is applied to all transformer layers equally. Some work suggests applying higher rank to earlier layers and lower rank to later ones, or targeting only the middle layers, but uniform application is simpler and generally competitive.

Merging Adapters at Inference

One of LoRA's most practically important properties is that the adapter can be merged into the base weights after training, eliminating any inference overhead. Since W' = W + BA, and both W and BA have the same shape, addition produces a new weight matrix with no additional computation at inference time. The merged model is indistinguishable from a full fine-tuned model in terms of architecture.

# Merging in Hugging Face PEFT
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load base model + adapter
model = AutoModelForCausalLM.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(model, adapter_path)

# Merge: folds BA into W, removes adapter modules
merged = model.merge_and_unload()

# merged is now a standard model: save and serve normally
merged.save_pretrained('merged-model/')

Keeping adapters separate (unmerged) also has advantages: multiple adapters can be swapped onto the same base model at runtime, enabling multi-task serving from a single base model load. This is the basis of adapter serving systems where a fleet of fine-tuned personas or domain experts share one base model in GPU memory.

QLoRA: Fine-Tuning 70B Models on a Single GPU

QLoRA (Dettmers et al., 2023) combined LoRA with aggressive quantisation of the frozen base model weights. The key insight is that the base weights, once frozen, do not need high precision for the gradient computations that flow through the LoRA adapters; they only need to be accurate enough for the forward pass. Quantising them to 4 bits drastically reduces the memory footprint of the frozen model, while the LoRA adapters themselves are maintained in BF16 for gradient stability.

NF4 Quantisation

NormalFloat4 (NF4) is an information-theoretically optimal quantisation format for weights that follow a normal distribution, which pretrained model weights typically do. It stores each weight in 4 bits using quantisation levels spaced to minimise quantisation error for normally distributed values, outperforming standard int4 at equal bit width.
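The idea can be sketched in a few lines: build a 16-entry codebook shaped like the normal distribution, then store each weight as the index of its nearest level plus one absmax scale per block. This is only an illustration; the quantile spacing here is a made-up choice and the real NF4 codebook (16 fixed constants) lives in the bitsandbytes library.

```python
import numpy as np
from statistics import NormalDist

# Illustrative codebook: 16 levels at quantiles of N(0, 1), normalised to [-1, 1].
nd = NormalDist()
qs = np.linspace(0.03, 0.97, 16)              # hypothetical quantile spacing
levels = np.array([nd.inv_cdf(q) for q in qs])
levels /= np.abs(levels).max()

def quantise_block(w):
    # Absmax-scale one block of weights, store 4-bit codebook indices + one scale.
    scale = np.abs(w).max()
    idx = np.abs(w[:, None] / scale - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def dequantise_block(idx, scale):
    return levels[idx] * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(64)                   # one block of 64 weights
idx, scale = quantise_block(w)
w_hat = dequantise_block(idx, scale)
```

Because the levels cluster where normal weights are dense (near zero), the round-trip error is smaller for typical weights than with 16 uniformly spaced int4 levels.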

Double Quantisation

Quantisation requires storing scale constants (one per block of weights). Double quantisation reduces memory further by also quantising these scale constants, in effect quantising the quantisation constants. This saves roughly 0.37 bits per parameter on average, which accumulates significantly at 70B scale.
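The 0.37-bit figure follows from the block sizes used in the QLoRA paper: first-level blocks of 64 weights with FP32 scales, the scales themselves then quantised to 8 bits in second-level blocks of 256:

```python
# Scale-constant overhead in bits per weight, before and after double quantisation.
block1, block2 = 64, 256                       # QLoRA paper block sizes

plain  = 32 / block1                           # one FP32 scale per 64-weight block
double = 8 / block1 + 32 / (block1 * block2)   # 8-bit scales + FP32 second-level scales

print(plain - double)                          # ~0.373 bits per parameter saved
```

At 70B parameters, ~0.37 bits per parameter works out to roughly 3 GB of memory saved.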

Paged Optimizers

Gradient updates for the LoRA adapter parameters use Adam, which maintains first and second moment estimates. QLoRA uses NVIDIA's unified memory to page optimizer state to CPU RAM when GPU memory is exhausted and page it back when needed. This handles the memory spikes that occur during gradient accumulation without out-of-memory crashes.
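In practice the paged optimizer comes from the bitsandbytes library. A hedged fragment (assumes bitsandbytes is installed and a PEFT-wrapped model is in scope; PagedAdamW8bit is its paged, 8-bit AdamW variant):

```python
import bitsandbytes as bnb

# Only the LoRA adapter parameters require gradients, so only they get optimizer state.
optimizer = bnb.optim.PagedAdamW8bit(
    (p for p in model.parameters() if p.requires_grad),
    lr=2e-4,
)
```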

Practical Memory Numbers

The memory savings from LoRA and QLoRA are substantial and make previously inaccessible model sizes trainable on commodity hardware:

Model       | Full fine-tune (BF16) | LoRA (BF16 base) | QLoRA (NF4 base)
Llama-2 7B  | ~56 GB                | ~16 GB           | ~6 GB
Llama-2 13B | ~104 GB               | ~28 GB           | ~10 GB
Llama-2 70B | ~560 GB               | ~140 GB          | ~48 GB

A single NVIDIA H100 (80 GB) cannot run full fine-tuning of Llama-2-70B, and even LoRA over a BF16 base (~140 GB) does not fit. With QLoRA the job fits comfortably at ~48 GB, leaving headroom for batch size and sequence length. A consumer RTX 4090 (24 GB) can QLoRA-fine-tune models up to roughly 13B parameters, and a 7B model fits on an RTX 3090 (24 GB) as well.

These numbers assume r = 16, all attention + FFN matrices adapted, sequence length 2048, and gradient accumulation to reach an effective batch size of 64. Reducing sequence length or rank, or using gradient checkpointing, further reduces peak memory.
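The base-weight portion of these figures is simple arithmetic (bits per parameter times parameter count); the remainder is adapter, gradient, and activation overhead. A back-of-envelope helper:

```python
def base_weight_mem_gb(n_params_billions, bits_per_param):
    # 1B params at 8 bits = ~1 GB, so scale by bits / 8.
    return n_params_billions * bits_per_param / 8

print(base_weight_mem_gb(70, 16))  # BF16 base: 140.0 GB
print(base_weight_mem_gb(70, 4))   # NF4 base:   35.0 GB (overhead brings it to ~48 GB)
```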

Quality of LoRA vs Full Fine-Tuning

On most instruction-following and domain adaptation benchmarks, LoRA with r = 16 applied to all attention and FFN matrices achieves quality within a few percentage points of full fine-tuning. QLoRA, despite the aggressive 4-bit quantisation of the base model, typically degrades quality by less than 1% on held-out evaluations compared to LoRA on a BF16 base, a remarkably small penalty for a 4× memory reduction.

The scenarios where full fine-tuning has a clear advantage are those requiring deep distributional shift: training a model on a completely new domain (say, a highly specialised scientific corpus), learning a new language from scratch, or tasks where the desired behaviour is very far from the model's pretrained capabilities. For the common case of instruction tuning, style adaptation, or domain grounding on top of a strong base model, LoRA/QLoRA is the practical default.

Checklist: Do You Understand This?

  • Can you write out the LoRA weight update equation (W' = W + BA) and explain the shapes of B and A, the role of rank r, and why B is initialised to zeros?
  • Can you calculate the number of trainable parameters introduced by LoRA with r = 8 applied to a 4096×4096 weight matrix, and compare it to the original matrix size?
  • Can you explain what "merging" a LoRA adapter means mathematically and describe the inference-time benefit of doing so?
  • Can you describe the three key innovations of QLoRA โ€” NF4 quantisation, double quantisation, and paged optimizers โ€” and explain what memory cost each one addresses?
  • Can you state the approximate GPU memory required to QLoRA fine-tune Llama-2-70B and explain which GPU(s) can accomplish this?
  • Can you explain the trade-offs between applying LoRA only to Q+V matrices versus all attention and FFN matrices, in terms of parameter count and quality?