🧠 All Things AI
Advanced

Prompt Tuning & Adapter Methods

Before LoRA became the dominant approach to parameter-efficient fine-tuning, several other methods were developed that occupy different points on the expressiveness–efficiency trade-off curve. Prompt tuning and prefix tuning operate entirely in the embedding space of the frozen model, training only the soft tokens that condition the model's behaviour. Adapter layers insert small bottleneck modules directly inside transformer blocks. Understanding these methods explains why LoRA emerged as the practical standard, and also reveals the scenarios where the alternatives remain the better choice.

Prompt Tuning: Trainable Soft Tokens

Prompt tuning (Lester et al., 2021) is the simplest possible form of parameter-efficient adaptation: learn a small set of continuous token embeddings that, when prepended to the input, cause a frozen model to perform a target task. These learned tokens have no corresponding vocabulary entries and are never decoded into text, but they occupy positions in the input sequence and influence every subsequent token through the standard attention computation.

Hard prompt (discrete, not trained):
  [Classify the sentiment of:] The movie was wonderful.

Soft prompt (continuous, trained):
  [v₁][v₂][v₃][v₄][v₅] The movie was wonderful.
   ↑ 5 learned embedding vectors prepended to the input.
     v₁...v₅ are updated by gradient descent; the model is frozen.

Only the soft prompt embeddings, typically 10 to 100 token vectors, are trained. For a model with a 4096-dimensional embedding space and 20 soft prompt tokens, this is 4096 × 20 = 81,920 parameters, compared to billions in the model. That is roughly 0.001% of the parameter count of a 7B model, two to three orders of magnitude fewer than a typical LoRA configuration.
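As a sanity check, the parameter arithmetic above takes only a few lines of Python (the 7B base size is illustrative):

```python
# Soft prompt parameter count: embedding_dim x num_soft_tokens.
d_model = 4096          # embedding dimension of the frozen model
num_soft_tokens = 20    # length of the learned soft prompt

soft_prompt_params = d_model * num_soft_tokens
print(soft_prompt_params)  # 81920

# Fraction of a 7B-parameter base model (illustrative size).
base_params = 7_000_000_000
print(f"{soft_prompt_params / base_params:.6%}")  # 0.001170%
```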

The original paper demonstrated that with models at or above 10B parameters, prompt tuning can approach full fine-tuning quality on SuperGLUE benchmarks. With smaller models the gap is substantial: the model does not have sufficient capacity to translate soft prompt signals into complex behaviour changes via frozen weights alone. This size dependency is prompt tuning's main limitation.

Prefix Tuning: Per-Layer Trainable Prefixes

Prefix tuning (Li and Liang, 2021) is a more expressive variant that prepends trainable prefix vectors not just to the input embeddings, but to the keys and values of every attention layer in the model. This allows the learned prefix to directly influence the attention patterns at every layer rather than propagating only through attention from the input layer.

Standard attention at layer l:
  K_l = W_K · H_l
  V_l = W_V · H_l
  Attention(Q_l, K_l, V_l)

Prefix tuning at layer l:
  K_l = concat([P^K_l, W_K · H_l])
  V_l = concat([P^V_l, W_V · H_l])
  Attention(Q_l, K_l, V_l)

where P^K_l, P^V_l are trainable prefix matrices of fixed length

Prefix vectors are inserted at every layer independently, so the total parameter count is: prefix_length × d_model × 2 (K and V) × num_layers. For a 12-layer model with d_model=768 and a prefix of length 10, this is 10 × 768 × 2 × 12 = 184,320 parameters, about 0.17% of a BERT-base-scale model's ~110M total. The original paper trained a reparameterisation MLP to generate the prefix vectors (to avoid poor initialisation), then discarded the MLP after training.
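The same count as a small helper, with a second configuration at assumed LLaMA-7B-scale dimensions (d_model=4096, 32 layers) for comparison:

```python
# Prefix tuning parameter count: prefix_len x d_model x 2 (K and V) x num_layers.
def prefix_param_count(prefix_len: int, d_model: int, num_layers: int) -> int:
    return prefix_len * d_model * 2 * num_layers

print(prefix_param_count(10, 768, 12))   # 184320  (the 12-layer example above)
print(prefix_param_count(10, 4096, 32))  # 2621440 (a LLaMA-7B-scale model)
```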

Prefix tuning consistently outperforms input-only prompt tuning, especially on smaller models, because each layer receives a direct conditioning signal rather than relying on the attention mechanism to propagate influence from the input layer. The cost is modest additional complexity and slightly more parameters.

Adapter Layers: Bottleneck Modules Inside Transformer Blocks

Adapter layers (Houlsby et al., 2019) take a structurally different approach. Rather than modifying the input space, they insert small trainable modules directly inside each transformer block. Each adapter is a bottleneck MLP: a down-projection to a low-dimensional representation, a nonlinearity, and an up-projection back to the original dimension, with a residual skip connection.

Adapter module:
  h → LayerNorm(h)
    → W_down (d → m)   [down-project to bottleneck dim m]
    → ReLU (or GELU)
    → W_up (m → d)     [up-project back to d]
    → + h              [residual skip]

Position in transformer block (Houlsby et al. original):
  Multi-Head Attention → Add+Norm → [Adapter] → FFN → Add+Norm → [Adapter]

Parameter count per adapter:
  d×m + m + m×d + d = 2dm + m + d ≈ 2dm
  For d=768, m=64: 2 × 768 × 64 = 98,304 params per adapter

The bottleneck dimension m controls the capacity of each adapter. At m = 64 for a d_model = 768 model, each adapter introduces roughly 100K parameters. With 12 layers and 2 adapters per layer, the total is approximately 2.4M, around 2% of BERT-base's 110M parameters. This is larger than LoRA for equivalent tasks but still dramatically smaller than full fine-tuning.
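A minimal pure-Python sketch of the adapter forward pass with toy dimensions (LayerNorm omitted; weights random rather than learned). It also illustrates a common initialisation trick: with the up-projection set to zero, the adapter starts as an identity function, so training begins from the unmodified model's behaviour:

```python
import random

def matvec(W, x):
    # W: list of rows; returns W @ x
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, u) for u in v]

def adapter(h, W_down, W_up):
    # down-project -> nonlinearity -> up-project -> residual skip
    z = relu(matvec(W_down, h))                       # bottleneck, dim m
    out = matvec(W_up, z)                             # back to dim d
    return [hi + oi for hi, oi in zip(h, out)]        # residual connection

d, m = 8, 2                                           # toy dims; real: d=768, m=64
random.seed(0)
W_down = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(m)]
W_up_zero = [[0.0] * m for _ in range(d)]             # zero-init up-projection

h = [random.gauss(0, 1) for _ in range(d)]
assert adapter(h, W_down, W_up_zero) == h             # identity at initialisation

# Weight-only parameter count: 2 * d * m
print(2 * 768 * 64)  # 98304 per adapter; x 24 adapters in a 12-layer model ~ 2.36M
```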

Adapters were the primary PEFT method before LoRA and still appear in some production systems, particularly in natural language processing applications built on encoder models (BERT, RoBERTa) where they have a well-established track record.

Adapters vs LoRA: The Inference Latency Difference

The fundamental disadvantage of adapter layers relative to LoRA is inference latency. Adapters insert sequential computation into the forward pass: the adapter module runs after the attention sublayer and must complete before the FFN sublayer begins. This adds wall-clock time per layer, which compounds over 32 or more layers in a large model. On a GPU this is costly because the adapter computations are small and do not saturate the GPU's parallelism, leaving it underutilised relative to the main model computation.

Adapter layers: inference overhead

Adapter modules are sequential: they cannot be parallelised with the layers they follow. Benchmarks report 4–8% latency increases on GPU inference, rising with the number of adapters per layer and most pronounced at small batch sizes, where the small adapter kernels are hardest to hide. The overhead is paid on every forward pass, whether or not the adapter's behaviour is needed.

LoRA: zero inference overhead

After merging the low-rank update BA into W, LoRA produces a single weight matrix with the same shape as the original. The merged forward pass is mathematically identical to that of the unmodified model, so inference incurs no additional compute. This is the decisive practical advantage of LoRA over adapters for production deployment.
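A toy numerical check of the merge (pure Python, tiny matrices, with the α/r scaling factor omitted for clarity): computing W·x + B·(A·x) with separate matrices gives the same result as one matvec against the merged matrix W + BA:

```python
import random

def matmul(X, Y):
    n, k, m = len(X), len(Y), len(Y[0])
    return [[sum(X[i][t] * Y[t][j] for t in range(k)) for j in range(m)] for i in range(n)]

def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

random.seed(1)
d, r = 4, 2                                                     # toy dims; real: d=4096, r=8..64
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen base weight
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # r x d
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # d x r
x = [random.gauss(0, 1) for _ in range(d)]

# Unmerged: two extra matvecs per forward pass
y_unmerged = [w + ba for w, ba in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Merged: fold BA into W once; a single matvec, same shape as W
W_merged = [[w + p for w, p in zip(rw, rp)] for rw, rp in zip(W, matmul(B, A))]
y_merged = matvec(W_merged, x)

assert all(abs(a - b) < 1e-9 for a, b in zip(y_unmerged, y_merged))
```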

Other PEFT Methods: (IA)³, LoftQ, and Beyond

The PEFT landscape extends beyond LoRA, adapters, and prefix tuning. Several additional methods target extreme parameter efficiency or specific deployment constraints.

(IA)³

"Infused Adapter by Inhibiting and Amplifying Inner Activations" (Liu et al., 2022). Instead of adding matrices, (IA)Β³ trains learned rescaling vectors that element-wise scale the keys, values, and FFN activations. Only ~0.01% of parameters β€” even fewer than prompt tuning β€” yet competitive on few-shot classification tasks. Works by rescaling rather than adding, so it can also be merged at inference.

LoftQ

LoftQ (Li et al., 2023) improves on QLoRA's initialisation. Instead of initialising LoRA with random A and zero B over a quantised base, LoftQ alternates quantisation with SVD steps to find an initialisation in which the quantisation error is absorbed into the initial LoRA adapter values. This reduces the mismatch between the quantised and full-precision weights at the start of training, producing better final quality than standard QLoRA on the same hardware.

DoRA

DoRA (Weight-Decomposed Low-Rank Adaptation, Liu et al., 2024) decomposes each weight matrix into its magnitude and direction components and applies LoRA only to the direction. The magnitude (a scalar per output neuron) is trained separately. DoRA consistently outperforms LoRA at equal rank across a range of tasks, at essentially the same parameter count, and has been adopted as a drop-in alternative in the PEFT library.
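A toy sketch of the magnitude/direction split (shown per output row to match the "scalar per output neuron" description; DoRA then applies LoRA only to the direction component and trains the magnitudes directly):

```python
import math

# Decompose each weight row w into a magnitude ||w|| and a unit direction w / ||w||.
W = [[3.0, 4.0],
     [0.0, 2.0]]

mags = [math.sqrt(sum(w ** 2 for w in row)) for row in W]      # one scalar per output neuron
direction = [[w / m for w in row] for row, m in zip(W, mags)]  # unit-norm rows

# Reconstruction: magnitude x direction recovers W (up to float rounding).
W_rec = [[m * d for d in drow] for m, drow in zip(mags, direction)]
print(mags)  # [5.0, 2.0]
assert all(abs(a - b) < 1e-12 for ra, rb in zip(W_rec, W) for a, b in zip(ra, rb))
```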

The Hugging Face PEFT Library

In practice, all of these methods are accessed through the Hugging Face PEFT library, which provides a unified interface for configuring, applying, saving, and loading adapters for any supported model architecture.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# 1. Load base model
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B')

# 2. Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,            # rank
    lora_alpha=32,   # scaling factor
    lora_dropout=0.05,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj',
                    'gate_proj', 'up_proj', 'down_proj'],
    bias='none',
)

# 3. Wrap model: freezes base, adds LoRA modules
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints trainable vs. total parameter counts
# (roughly 42M trainable of ~8B total at r=16, i.e. about 0.5%)

# 4. Train normally with your SFT loop...

# 5. Save adapter only (not base model)
model.save_pretrained('my-lora-adapter/')

The PEFT library handles the mechanics of freezing base weights, routing gradients only through adapter parameters, and providing model-architecture-specific default target module names. Switching from LoRA to DoRA, (IA)³, or prefix tuning requires changing only the config class.
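A hedged sketch of what those config swaps look like (the class names come from the PEFT library; exact argument sets vary by version, so treat the keyword arguments as illustrative):

```python
from peft import (
    LoraConfig, PrefixTuningConfig, PromptTuningConfig, IA3Config, TaskType,
)

# Prompt tuning: train only num_virtual_tokens soft embeddings
prompt_cfg = PromptTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)

# Prefix tuning: per-layer trainable K/V prefixes
prefix_cfg = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=10)

# (IA)^3: learned rescaling vectors on K, V, and FFN activations
ia3_cfg = IA3Config(task_type=TaskType.CAUSAL_LM)

# DoRA: a flag on LoraConfig in recent PEFT versions
dora_cfg = LoraConfig(task_type=TaskType.CAUSAL_LM, r=16, use_dora=True)

# Each is applied the same way: model = get_peft_model(model, cfg)
```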

When to Use Which Method

| Scenario | Recommended method | Reason |
| --- | --- | --- |
| Standard instruction tuning on any GPU | LoRA (r=8–16) | Best quality-to-compute ratio; zero inference overhead when merged |
| Fine-tuning 30B–70B on consumer hardware | QLoRA (NF4 + LoRA) | 4-bit base enables single-GPU training; quality loss minimal |
| API-only model access (no weight access) | Prompt tuning / hard prompts | Only option when you cannot touch model weights; requires a large model for effectiveness |
| Extreme parameter budget (<0.01%) | (IA)³ | Fewest trainable params; mergeable at inference; good for classification/few-shot tasks |
| Serving many task-specific adapters from one base | LoRA (unmerged) | Adapters can be swapped at runtime; base loaded once; enables multi-tenant adapter serving |
| Maximising quality regardless of cost | Full fine-tuning | Necessary for deep distributional shift, new languages, highly specialised domains |

Checklist: Do You Understand This?

  • Can you explain the difference between a hard prompt and a soft prompt, and describe precisely what is trained during prompt tuning and what is frozen?
  • Can you describe where prefix tuning inserts its trainable parameters (specifically, which components of which layers) and explain why this makes it more expressive than input-only prompt tuning?
  • Can you draw the architecture of an adapter module (down-project → nonlinearity → up-project → residual skip) and explain why the residual connection is important?
  • Can you explain why adapter layers introduce inference latency but LoRA does not, and describe the specific mechanism by which LoRA achieves zero overhead after deployment?
  • Can you describe what (IA)³ trains (rescaling vectors), explain how they are applied during the forward pass, and state the approximate parameter count relative to LoRA?
  • Can you list the six scenarios in the comparison table above and justify the method recommendation for each one?