# Memory Bandwidth: The Real Bottleneck
When engineers first look at GPU specs, they focus on FLOPS, the compute throughput. But for LLM inference, the limiting factor is almost never FLOPS. It is memory bandwidth: how fast the GPU can load model weights from HBM (High Bandwidth Memory) into the compute units. Understanding this shapes every inference optimization decision, from quantization to batching to Flash Attention.
## The Roofline Model
The Roofline model is a visual performance model that predicts whether a computation is limited by compute or memory bandwidth. It has two key parameters:
- Peak compute (FLOPS): the maximum arithmetic throughput of the hardware (e.g., 989 TFLOPS dense BF16 on H100)
- Peak memory bandwidth: the maximum rate of data movement from HBM (e.g., 3.35 TB/s on H100)
The arithmetic intensity of a kernel is the ratio of floating-point operations performed to bytes of memory accessed:
Arithmetic Intensity = FLOPs performed / Bytes read from memory
Ridge point = Peak FLOPS / Peak Bandwidth
H100 ridge point = 989 TFLOPS / 3.35 TB/s ≈ 295 FLOPs/byte
If a kernel's arithmetic intensity is above the ridge point, it is compute-bound: adding more bandwidth would not help, because it is already using all the FLOPS. If its arithmetic intensity is below the ridge point, it is memory-bound: adding more compute would not help, because it is waiting for data from HBM.
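The ridge-point arithmetic above can be sketched in a few lines of Python (the H100 figures are the ones quoted in the text; the function names are illustrative):

```python
# Roofline sketch: classify a kernel as compute- or memory-bound.
PEAK_FLOPS = 989e12   # H100 dense BF16, FLOP/s
PEAK_BW = 3.35e12     # H100 HBM bandwidth, bytes/s

def ridge_point(peak_flops: float, peak_bw: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where compute and bandwidth limits meet."""
    return peak_flops / peak_bw

def classify(intensity: float) -> str:
    """Kernels below the ridge point are memory-bound, above it compute-bound."""
    return "compute-bound" if intensity >= ridge_point(PEAK_FLOPS, PEAK_BW) else "memory-bound"

print(round(ridge_point(PEAK_FLOPS, PEAK_BW)))  # 295 FLOPs/byte
print(classify(1.0))                            # memory-bound (single-request decode)
```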
## Why LLM Decoding Is Memory-Bound
During autoregressive token generation (the decoding phase), the model generates exactly one token per forward pass. Consider what that means for a 70B parameter model in FP16:
| Item | Value |
|---|---|
| Model parameters | 70 billion |
| Weight size in FP16 | 70B × 2 bytes = 140 GB |
| FLOPs per token (decode) | ~2 × 70B = 140 GFLOPs |
| Arithmetic intensity (decode) | 140 GFLOPs / 140 GB ≈ 1 FLOP/byte |
| H100 ridge point | ~295 FLOPs/byte |
| Result | ~295× below the ridge point: massively memory-bound |
The ceiling for single-request inference is set by bandwidth alone. An H100 can load 140 GB of weights at 3.35 TB/s in about 42 ms, giving a theoretical maximum of roughly 24 tokens/second for Llama 70B, regardless of how many FLOPS the chip has. The GPU's 989 TFLOPS sit almost entirely idle during single-request decode.
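A back-of-the-envelope check of those numbers (values taken from the text):

```python
# Single-request decode ceiling: every generated token must stream
# all model weights from HBM once, so bandwidth alone sets the pace.
WEIGHT_BYTES = 140e9   # 70B params x 2 bytes (FP16)
PEAK_BW = 3.35e12      # H100 HBM bandwidth, bytes/s

seconds_per_token = WEIGHT_BYTES / PEAK_BW
tokens_per_second = 1.0 / seconds_per_token
print(f"{seconds_per_token * 1e3:.1f} ms/token, ~{tokens_per_second:.0f} tok/s")
# prints "41.8 ms/token, ~24 tok/s"
```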
## KV Cache: The Memory Complication
Transformer attention requires access to all previous tokens' key and value vectors when generating each new token. Without caching, the model would recompute all previous keys and values on every step, wasting enormous compute. The KV cache stores these tensors in HBM, trading memory for compute.
But KV cache adds significantly to memory pressure:
KV cache size formula:

KV cache bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes_per_element

The leading factor of 2 covers the separate K and V tensors, and num_kv_heads (not query heads) is what counts: GQA models share each KV head across several query heads. For Llama 3 70B at 8K context, batch size 1, FP16, with GQA's 8 KV heads: 2 × 80 × 8 × 128 × 8192 × 1 × 2 ≈ 2.7 GB. A full multi-head variant with 64 KV heads would need ~21 GB for the same context. Either way, the cache comes on top of the 140 GB of FP16 weights, which alone exceed a single H100's 80 GB of HBM. This is why Llama 70B in FP16 requires 2× H100s for inference at meaningful context lengths.
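The formula is easy to wrap in a helper (a sketch; the function name is made up, and the Llama 3 70B shape parameters — 80 layers, 8 KV heads under GQA, head_dim 128 — are the ones used above):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_element: int = 2) -> int:
    # Factor of 2 covers the separate K and V tensors.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element

# Llama 3 70B at 8K context, batch 1, FP16
gqa = kv_cache_bytes(80, 8, 128, 8192, 1)    # GQA: 8 KV heads
mha = kv_cache_bytes(80, 64, 128, 8192, 1)   # hypothetical full-MHA variant: 64 KV heads
print(f"GQA: {gqa / 1e9:.1f} GB, MHA: {mha / 1e9:.1f} GB")
# prints "GQA: 2.7 GB, MHA: 21.5 GB"
```

This also shows why GQA is so valuable for serving: it cuts KV cache traffic by the query-to-KV head ratio (8× here) without touching the weights.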
For very long contexts (128K+ tokens) and large batch sizes, the KV cache can rival or even exceed the model weights in size. This is a key driver of techniques like sliding window attention, GQA (Grouped Query Attention), and MLA (Multi-head Latent Attention, used in DeepSeek-V2/V3) that compress or reduce KV cache storage.
## Batching: The Primary Solution
The most effective fix for memory-bound inference is batching: processing multiple requests simultaneously. When 32 requests are batched together, the GPU loads model weights once and produces 32 output tokens in the same time it would produce 1 token for a single request. Arithmetic intensity scales with batch size.
| Batch Size | Arithmetic Intensity | Bound | Tokens/sec (H100, 70B FP16) |
|---|---|---|---|
| 1 | ~1 FLOP/byte | Memory-bound | ~24 tok/s |
| 16 | ~16 FLOPs/byte | Memory-bound | ~380 tok/s total |
| 256 | ~256 FLOPs/byte | Near ridge point | ~5,000+ tok/s total |
| 512+ | >295 FLOPs/byte | Compute-bound | Scales with FLOPS |
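A toy model reproduces the shape of that table. It is a deliberate simplification: it ignores KV cache reads (which also grow with batch size) and assumes perfect overlap of compute and memory traffic:

```python
# Toy decode-throughput model: per step the GPU streams the weights once
# (bandwidth cost, independent of batch) and does batch x ~2N FLOPs (compute cost).
PEAK_FLOPS = 989e12      # H100 dense BF16, FLOP/s
PEAK_BW = 3.35e12        # H100 HBM bandwidth, bytes/s
WEIGHT_BYTES = 140e9     # 70B FP16
FLOPS_PER_TOKEN = 140e9  # ~2 x params

def decode_tokens_per_sec(batch: int) -> float:
    bandwidth_time = WEIGHT_BYTES / PEAK_BW               # same for any batch size
    compute_time = batch * FLOPS_PER_TOKEN / PEAK_FLOPS   # grows with batch size
    return batch / max(bandwidth_time, compute_time)      # the slower limit wins

for b in (1, 16, 256, 512):
    print(f"batch {b:>3}: ~{decode_tokens_per_sec(b):,.0f} tok/s")
```

Past the ridge point the model saturates at PEAK_FLOPS / FLOPS_PER_TOKEN ≈ 7,000 tok/s; real systems saturate earlier because KV cache traffic grows with batch size.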
Continuous batching (used by vLLM, TGI) goes further: instead of waiting for all requests in a batch to finish before starting new ones, it dynamically slots new requests into the batch as slots free up. This dramatically improves GPU utilization for production serving.
## Prefill vs Decode: Two Different Problems
LLM serving has two fundamentally different phases with different performance profiles:
Prefill and decode have opposite performance characteristics.
### Prefill Phase
All input tokens are processed simultaneously. For a 2000-token prompt, the GPU computes attention over 2000 tokens in parallel: high arithmetic intensity, compute-bound.
Optimization: maximize batch size of prompts, use Flash Attention for memory efficiency, use tensor parallelism to distribute attention computation.
### Decode Phase
One token generated per forward pass. Arithmetic intensity proportional to batch size. At small batches, almost entirely memory-bound: most FLOPS sit idle.
Optimization: maximize batch size via continuous batching, quantize weights to reduce bytes loaded, use speculative decoding to amortize memory reads.
## Flash Attention: IO-Aware Attention
Standard attention computes the full N×N attention score matrix (where N is sequence length) and materializes it in HBM. For a 32K context, a single head's FP16 score matrix is 32768² × 2 bytes ≈ 2.1 GB; across 64 heads that is ≈ 137 GB per layer. Reading and writing this matrix dominates runtime even though attention is a small fraction of total FLOPs.
Flash Attention (Dao et al., 2022) reorganizes the computation to avoid materializing the full attention matrix in HBM. It tiles the computation to fit in SRAM, touching HBM only for the inputs and the final output, never for intermediate scores.
- Memory reduction: from O(N²) HBM usage to O(N), critical for long context
- Bandwidth reduction: 10–20× fewer bytes read/written to HBM for attention
- Exact result: not an approximation; the output matches standard attention up to floating-point rounding
- Adoption: used in all major serving frameworks and training runs as of 2024
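The tiling idea can be sketched in NumPy for a single head. This is a didactic sketch of the online-softmax trick only, not the fused CUDA kernel; real Flash Attention keeps each tile in SRAM and fuses the whole loop into one kernel:

```python
import numpy as np

def tiled_attention(q, k, v, tile=128):
    """softmax(q @ k.T / sqrt(d)) @ v without building the full N x N score matrix."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(v, dtype=np.float64)
    row_max = np.full(n, -np.inf)   # running max of each query row's scores
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, k.shape[0], tile):
        kt, vt = k[start:start + tile], v[start:start + tile]
        s = (q @ kt.T) * scale                    # scores for this tile only
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)    # rescale earlier partial results
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vt
        row_max = new_max
    return out / row_sum[:, None]

# Matches the naive implementation up to floating-point rounding:
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
scores = q @ k.T / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
reference = (weights / weights.sum(axis=1, keepdims=True)) @ v
assert np.allclose(tiled_attention(q, k, v, tile=64), reference)
```

The key move is the running max/sum pair: each tile's partial softmax is rescaled as later tiles raise the row maximum, so only O(N) state ever leaves the loop.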
## Quantization and Bandwidth
If inference is memory-bound because of the bytes loaded per token, reducing bytes per weight directly improves throughput. This is the core of the quantization argument for inference:
| Precision | Bytes/Param | 70B Model Size | BW Required (24 tok/s) | Quality Impact |
|---|---|---|---|---|
| FP16 | 2 | 140 GB | 3.35 TB/s (H100 limit) | Baseline |
| INT8 (W8A8) | 1 | 70 GB | 1.68 TB/s | ~0.5% quality loss |
| INT4 (AWQ/GPTQ) | 0.5 | 35 GB | 0.84 TB/s | ~1–3% quality loss |
| Q4_K_M (GGUF) | ~0.6 | ~42 GB | ~1.0 TB/s | Widely used, minimal loss |
INT4 quantization allows a 70B model to fit on a single H100 (80 GB HBM) and cuts the bytes loaded per token by 4× relative to FP16, raising the bandwidth-bound throughput ceiling by up to 4× (in practice less, once dequantization overhead and other bottlenecks appear). This is why quantized inference is the industry default for serving large models at cost-competitive prices.
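The bandwidth column in the table follows directly from bytes per parameter (H100 numbers from the text; this ignores activation traffic and dequantization cost):

```python
# Decode throughput ceiling vs weight precision: fewer bytes per weight
# means fewer bytes streamed per token, so a higher bandwidth-bound ceiling.
PEAK_BW = 3.35e12   # H100 HBM bandwidth, bytes/s
PARAMS = 70e9       # 70B-parameter model

ceilings = {}
for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    model_bytes = PARAMS * bytes_per_param
    ceilings[name] = PEAK_BW / model_bytes  # tok/s if bandwidth is the only limit
    print(f"{name}: {model_bytes / 1e9:.0f} GB -> ceiling ~{ceilings[name]:.0f} tok/s")
# prints 24 tok/s for FP16, 48 for INT8, 96 for INT4
```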
## Checklist: Do You Understand This?
- What is the Roofline model, and what does arithmetic intensity measure?
- Why is LLM decode memory-bound even on GPUs with 989 TFLOPS of compute?
- Calculate the KV cache size for a 70B model at 4K context, batch size 8, FP16.
- How does batching improve arithmetic intensity, and at what batch size does a 70B inference workload become compute-bound on H100?
- What is the difference between prefill and decode phases, and why do they require different optimization strategies?
- How does Flash Attention reduce HBM bandwidth consumption without changing the mathematical result?
- Why does INT4 quantization improve inference throughput beyond simply reducing model storage size?