# Memory Bandwidth: The Real Bottleneck
When engineers first look at GPU specs, they focus on FLOPS, the compute throughput. But for LLM inference, the limiting factor is almost never FLOPS. It is memory bandwidth: how fast the GPU can load model weights from HBM (High Bandwidth Memory) into the compute units. Understanding this shapes every inference optimization decision, from quantization to batching to Flash Attention.
## The Roofline Model
The Roofline model is a visual performance model that predicts whether a computation is limited by compute or memory bandwidth. It has two key parameters:
- Peak compute (FLOPS): the maximum arithmetic throughput of the hardware (e.g., 989 TFLOPS dense BF16 on H100)
- Peak memory bandwidth: the maximum rate of data movement from HBM (e.g., 3.35 TB/s on H100)
The arithmetic intensity of a kernel is the ratio of floating-point operations performed to bytes of memory accessed:
Arithmetic Intensity = FLOPs performed / Bytes read from memory
Ridge point = Peak FLOPS / Peak Bandwidth
H100 ridge point = 989 TFLOPS / 3.35 TB/s ≈ 295 FLOPs/byte
If a kernel's arithmetic intensity is above the ridge point, it is compute-bound: adding more bandwidth would not help, because it is already using all the FLOPS. If its arithmetic intensity is below the ridge point, it is memory-bound: adding more compute would not help, because it is waiting for data from HBM.
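The ridge-point arithmetic above can be sketched in a few lines of Python (the H100 figures are the ones quoted in the text; the function names are illustrative):

```python
# Roofline sketch: classify a kernel as compute- or memory-bound.
PEAK_FLOPS = 989e12   # H100 dense BF16, FLOP/s
PEAK_BW = 3.35e12     # H100 HBM bandwidth, bytes/s

def ridge_point(peak_flops: float, peak_bw: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where compute and bandwidth limits meet."""
    return peak_flops / peak_bw

def classify(intensity: float) -> str:
    """Kernels below the ridge point are memory-bound, above it compute-bound."""
    return "compute-bound" if intensity >= ridge_point(PEAK_FLOPS, PEAK_BW) else "memory-bound"

print(round(ridge_point(PEAK_FLOPS, PEAK_BW)))  # 295 FLOPs/byte
print(classify(1.0))                            # memory-bound (single-request decode)
```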
## Why LLM Decoding Is Memory-Bound
During autoregressive token generation (the decoding phase), the model generates exactly one token per forward pass. Consider what that means for a 70B parameter model in FP16:
| Item | Value |
|---|---|
| Model parameters | 70 billion |
| Weight size in FP16 | 70B × 2 bytes = 140 GB |
| FLOPs per token (decode) | ~2 × 70B = 140 GFLOPs |
| Arithmetic intensity (decode) | 140 GFLOPs / 140 GB ≈ 1 FLOP/byte |
| H100 ridge point | ~295 FLOPs/byte |
| Result | ~295× below the ridge point: massively memory-bound |
The ceiling for single-request inference is set by bandwidth alone. An H100 can load 140 GB of weights at 3.35 TB/s in about 42 ms, giving a theoretical maximum of roughly 24 tokens/second for Llama 70B, regardless of how many FLOPS the chip has. The GPU's 989 TFLOPS sit almost entirely idle during single-request decode.
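A back-of-the-envelope check of those numbers (values taken from the text):

```python
# Single-request decode ceiling: every generated token must stream
# all model weights from HBM once, so bandwidth alone sets the pace.
WEIGHT_BYTES = 140e9   # 70B params x 2 bytes (FP16)
PEAK_BW = 3.35e12      # H100 HBM bandwidth, bytes/s

seconds_per_token = WEIGHT_BYTES / PEAK_BW
tokens_per_second = 1.0 / seconds_per_token
print(f"{seconds_per_token * 1e3:.1f} ms/token, ~{tokens_per_second:.0f} tok/s")
# prints "41.8 ms/token, ~24 tok/s"
```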
## KV Cache: The Memory Complication
Transformer attention requires access to all previous tokens' key and value vectors when generating each new token. Without caching, the model would recompute all previous keys and values on every step, wasting enormous compute. The KV cache stores these tensors in HBM, trading memory for compute.
But KV cache adds significantly to memory pressure:
KV cache size formula:

KV cache bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes_per_element

The leading factor of 2 covers the separate K and V tensors, and num_kv_heads (not query heads) is what counts: GQA models share each KV head across several query heads. For Llama 3 70B at 8K context, batch size 1, FP16, with GQA's 8 KV heads: 2 × 80 × 8 × 128 × 8192 × 1 × 2 ≈ 2.7 GB. A full multi-head variant with 64 KV heads would need ~21 GB for the same context. Either way, the cache comes on top of the 140 GB of FP16 weights, which alone exceed a single H100's 80 GB of HBM. This is why Llama 70B in FP16 requires 2× H100s for inference at meaningful context lengths.
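The formula is easy to wrap in a helper (a sketch; the function name is made up, and the Llama 3 70B shape parameters — 80 layers, 8 KV heads under GQA, head_dim 128 — are the ones used above):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_element: int = 2) -> int:
    # Factor of 2 covers the separate K and V tensors.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element

# Llama 3 70B at 8K context, batch 1, FP16
gqa = kv_cache_bytes(80, 8, 128, 8192, 1)    # GQA: 8 KV heads
mha = kv_cache_bytes(80, 64, 128, 8192, 1)   # hypothetical full-MHA variant: 64 KV heads
print(f"GQA: {gqa / 1e9:.1f} GB, MHA: {mha / 1e9:.1f} GB")
# prints "GQA: 2.7 GB, MHA: 21.5 GB"
```

This also shows why GQA is so valuable for serving: it cuts KV cache traffic by the query-to-KV head ratio (8× here) without touching the weights.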
For very long contexts (128K+ tokens) and large batch sizes, the KV cache can rival or even exceed the model weights in size. This is a key driver of techniques like sliding window attention, GQA (Grouped Query Attention), and MLA (Multi-head Latent Attention, used in DeepSeek-V2/V3) that compress or reduce KV cache storage.
## Batching: The Primary Solution
The most effective fix for memory-bound inference is batching: processing multiple requests simultaneously. When 32 requests are batched together, the GPU loads model weights once and produces 32 output tokens in the same time it would produce 1 token for a single request. Arithmetic intensity scales with batch size.
| Batch Size | Arithmetic Intensity | Bound | Tokens/sec (H100, 70B FP16) |
|---|---|---|---|
| 1 | ~1 FLOP/byte | Memory-bound | ~24 tok/s |
| 16 | ~16 FLOPs/byte | Memory-bound | ~380 tok/s total |
| 256 | ~256 FLOPs/byte | Near ridge point | ~5,000+ tok/s total |
| 512+ | >295 FLOPs/byte | Compute-bound | Scales with FLOPS |
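A toy model reproduces the shape of that table. It is a deliberate simplification: it ignores KV cache reads (which also grow with batch size) and assumes perfect overlap of compute and memory traffic:

```python
# Toy decode-throughput model: per step the GPU streams the weights once
# (bandwidth cost, independent of batch) and does batch x ~2N FLOPs (compute cost).
PEAK_FLOPS = 989e12      # H100 dense BF16, FLOP/s
PEAK_BW = 3.35e12        # H100 HBM bandwidth, bytes/s
WEIGHT_BYTES = 140e9     # 70B FP16
FLOPS_PER_TOKEN = 140e9  # ~2 x params

def decode_tokens_per_sec(batch: int) -> float:
    bandwidth_time = WEIGHT_BYTES / PEAK_BW               # same for any batch size
    compute_time = batch * FLOPS_PER_TOKEN / PEAK_FLOPS   # grows with batch size
    return batch / max(bandwidth_time, compute_time)      # the slower limit wins

for b in (1, 16, 256, 512):
    print(f"batch {b:>3}: ~{decode_tokens_per_sec(b):,.0f} tok/s")
```

Past the ridge point the model saturates at PEAK_FLOPS / FLOPS_PER_TOKEN ≈ 7,000 tok/s; real systems saturate earlier because KV cache traffic grows with batch size.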
Continuous batching (used by vLLM, TGI) goes further: instead of waiting for all requests in a batch to finish before starting new ones, it dynamically slots new requests into the batch as slots free up. This dramatically improves GPU utilization for production serving.
## Prefill vs Decode: Two Different Problems
LLM serving has two fundamentally different phases with different performance profiles:
Prefill and decode have opposite performance characteristics.
### Prefill Phase
All input tokens are processed simultaneously. For a 2000-token prompt, the GPU computes attention over 2000 tokens in parallel: high arithmetic intensity, compute-bound.
Optimization: maximize batch size of prompts, use Flash Attention for memory efficiency, use tensor parallelism to distribute attention computation.
### Decode Phase
One token generated per forward pass. Arithmetic intensity proportional to batch size. At small batches, almost entirely memory-bound: most FLOPS sit idle.
Optimization: maximize batch size via continuous batching, quantize weights to reduce bytes loaded, use speculative decoding to amortize memory reads.
## Flash Attention: IO-Aware Attention
Standard attention computes the full N×N attention score matrix (where N is sequence length) and materializes it in HBM. For a 32K context, a single head's FP16 score matrix is 32768² × 2 bytes ≈ 2.1 GB; across 64 heads that is ≈ 137 GB per layer. Reading and writing this matrix dominates runtime even though attention is a small fraction of total FLOPs.
Flash Attention (Dao et al., 2022) reorganizes the computation to avoid materializing the full attention matrix in HBM. It tiles the computation to fit in SRAM, touching HBM only for the inputs and the final output, never for intermediate scores.
- Memory reduction: from O(N²) HBM usage to O(N), critical for long context
- Bandwidth reduction: 10–20× fewer bytes read/written to HBM for attention
- Exact result: not an approximation; the output matches standard attention up to floating-point rounding
- Adoption: used in all major serving frameworks and training runs as of 2024
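The tiling idea can be sketched in NumPy for a single head. This is a didactic sketch of the online-softmax trick only, not the fused CUDA kernel; real Flash Attention keeps each tile in SRAM and fuses the whole loop into one kernel:

```python
import numpy as np

def tiled_attention(q, k, v, tile=128):
    """softmax(q @ k.T / sqrt(d)) @ v without building the full N x N score matrix."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(v, dtype=np.float64)
    row_max = np.full(n, -np.inf)   # running max of each query row's scores
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, k.shape[0], tile):
        kt, vt = k[start:start + tile], v[start:start + tile]
        s = (q @ kt.T) * scale                    # scores for this tile only
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)    # rescale earlier partial results
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vt
        row_max = new_max
    return out / row_sum[:, None]

# Matches the naive implementation up to floating-point rounding:
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
scores = q @ k.T / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
reference = (weights / weights.sum(axis=1, keepdims=True)) @ v
assert np.allclose(tiled_attention(q, k, v, tile=64), reference)
```

The key move is the running max/sum pair: each tile's partial softmax is rescaled as later tiles raise the row maximum, so only O(N) state ever leaves the loop.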
## Quantization and Bandwidth
If inference is memory-bound because of the bytes loaded per token, reducing bytes per weight directly improves throughput. This is the core of the quantization argument for inference:
| Precision | Bytes/Param | 70B Model Size | BW Required (24 tok/s) | Quality Impact |
|---|---|---|---|---|
| FP16 | 2 | 140 GB | 3.35 TB/s (H100 limit) | Baseline |
| INT8 (W8A8) | 1 | 70 GB | 1.68 TB/s | ~0.5% quality loss |
| INT4 (AWQ/GPTQ) | 0.5 | 35 GB | 0.84 TB/s | ~1–3% quality loss |
| Q4_K_M (GGUF) | ~0.6 | ~42 GB | ~1.0 TB/s | Widely used, minimal loss |
INT4 quantization allows a 70B model to fit on a single H100 (80 GB HBM) and cuts the bytes loaded per token by 4× relative to FP16, raising the bandwidth-bound throughput ceiling by up to 4× (in practice less, once dequantization overhead and other bottlenecks appear). This is why quantized inference is the industry default for serving large models at cost-competitive prices.
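The bandwidth column in the table follows directly from bytes per parameter (H100 numbers from the text; this ignores activation traffic and dequantization cost):

```python
# Decode throughput ceiling vs weight precision: fewer bytes per weight
# means fewer bytes streamed per token, so a higher bandwidth-bound ceiling.
PEAK_BW = 3.35e12   # H100 HBM bandwidth, bytes/s
PARAMS = 70e9       # 70B-parameter model

ceilings = {}
for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    model_bytes = PARAMS * bytes_per_param
    ceilings[name] = PEAK_BW / model_bytes  # tok/s if bandwidth is the only limit
    print(f"{name}: {model_bytes / 1e9:.0f} GB -> ceiling ~{ceilings[name]:.0f} tok/s")
# prints 24 tok/s for FP16, 48 for INT8, 96 for INT4
```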
## Checklist: Do You Understand This?
- What is the Roofline model, and what does arithmetic intensity measure?
- Why is LLM decode memory-bound even on GPUs with 989 TFLOPS of compute?
- Calculate the KV cache size for a 70B model at 4K context, batch size 8, FP16.
- How does batching improve arithmetic intensity, and at what batch size does a 70B inference workload become compute-bound on H100?
- What is the difference between prefill and decode phases, and why do they require different optimization strategies?
- How does Flash Attention reduce HBM bandwidth consumption without changing the mathematical result?
- Why does INT4 quantization improve inference throughput beyond simply reducing model storage size?