🧠 All Things AI
Advanced

FLOPS, MFU & Compute Efficiency

FLOPS (Floating Point Operations Per Second) is the most commonly cited GPU metric, but the raw number on a spec sheet rarely corresponds to what you actually get during training or inference. Understanding the gap between theoretical and achieved FLOPS — and how to close it — is fundamental to efficient AI infrastructure.

Counting FLOPs for Transformers

Before you can measure efficiency, you need to know how many FLOPs a computation requires. For transformers, there are well-established rules of thumb.

Matrix Multiply (the core operation)

Matrix multiply: (M × K) × (K × N)

FLOPs = 2 × M × K × N

(The factor of 2 accounts for one multiply and one add per output element. Each output element requires K multiply-adds, for a total of M × N × K multiply-adds = 2MKN FLOPs.)
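This rule is one line of code; a quick sketch (the helper name is ours, for illustration):

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matrix multiply: m*n output
    elements, each needing k multiply-adds = 2k FLOPs."""
    return 2 * m * k * n

# One token (m=1) through a 4096 x 4096 projection:
print(matmul_flops(1, 4096, 4096))   # 33,554,432 FLOPs = 2 * d_model^2
```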

Transformer Forward Pass

For a transformer with N non-embedding parameters, each forward pass over one token requires approximately 2N FLOPs. The dominant contribution comes from the linear projections in attention (Q, K, V, O projections) and the feed-forward network layers. Embedding lookups are excluded because they are memory reads, not floating-point operations.

| Component | FLOPs per token | Notes |
|---|---|---|
| Q, K, V projections (per layer) | 6 × d_model² | 3 projections × 2 × d_model² each |
| Attention scores (per layer) | 4 × T × d_model | Scales with sequence length T |
| Output projection (per layer) | 2 × d_model² | One linear layer |
| FFN (per layer) | ~16 × d_model² | 4× intermediate expansion, 2 matrices of 8 × d_model² each |
| Total (approximate) | ~2N per token | N = non-embedding parameter count |
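Summing the per-layer terms for a concrete configuration shows why the breakdown collapses to ~2N. A minimal sketch (the function name and GPT-3-like config are illustrative):

```python
def forward_flops_per_token(d_model: int, n_layers: int, seq_len: int) -> float:
    """Approximate forward-pass FLOPs per token from the per-layer breakdown."""
    qkv = 6 * d_model**2            # Q, K, V projections
    attn = 4 * seq_len * d_model    # scores + attention-weighted values
    out = 2 * d_model**2            # output projection
    ffn = 16 * d_model**2           # two matrices, 4x intermediate expansion
    return n_layers * (qkv + attn + out + ffn)

# GPT-3-scale config: d_model=12288, 96 layers, 2048-token context.
flops = forward_flops_per_token(12288, 96, 2048)
ratio = flops / (2 * 175e9)
print(f"{ratio:.3f}")   # close to 1: the 2N rule holds at short sequence lengths
```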

Training FLOPs: The 6N Rule

Training requires a forward pass, a backward pass, and an optimizer update. The backward pass requires approximately 2× the FLOPs of the forward pass (computing gradients for both weights and activations). This gives the standard approximation:

Training FLOPs per token ≈ 6N

where N = number of non-embedding parameters

Total training FLOPs = 6 × N × D

D = total training tokens

Example: GPT-3 (175B params, 300B tokens) = 6 × 175×10⁹ × 300×10⁹ ≈ 3.15 × 10²³ FLOPs

This 6N rule is the standard used in papers and compute estimates. It slightly underestimates true FLOPs (it excludes the attention softmax, layer norms, and the attention-score terms that grow quadratically with sequence length) but is accurate enough for estimation and comparison.
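The rule is trivial to apply in code; a minimal sketch (the helper name is ours):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Total training FLOPs via the 6N rule:
    2N forward + ~4N backward per token, times D tokens."""
    return 6 * n_params * n_tokens

# GPT-3: 175B non-embedding parameters, 300B training tokens.
print(f"{training_flops(175e9, 300e9):.2e}")   # 3.15e+23
```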

Model FLOPs Utilization (MFU)

MFU is the single most useful metric for evaluating training efficiency. It measures how much of the hardware's theoretical peak FLOPS is being used productively:

MFU = (Achieved model FLOP/s) / (Hardware peak FLOP/s)

Achieved FLOP/s = 6 × N × tokens_per_second for training

Hardware peak = e.g., 989 TFLOPS dense BF16 per H100

MFU of 50% on H100 means the hardware is executing at 495 TFLOPS of productive model math — the rest of the time the GPU is doing overhead work (communication, memory allocation, kernel launches) or is simply idle. Published MFU values from major training runs:

| Model | Hardware | MFU | Notes |
|---|---|---|---|
| PaLM (540B) | 6144× TPU v4 | 46% | Reported in PaLM paper (2022) |
| GPT-NeoX (20B) | A100 cluster | 38–44% | EleutherAI open training |
| Llama 3 (405B) | H100 clusters | ~38–43% | Meta reported ~43% at scale |
| DeepSeek-V3 (671B MoE) | 2048× H800 | ~45%+ | FP8 training; efficient MoE routing |

Why MFU Falls Below 100%

Even with perfect software, distributed training across many GPUs creates unavoidable overhead. Multiple factors compound to reduce MFU:

Communication overhead

All-reduce synchronization during backward pass — all GPUs must wait for gradient aggregation before the next forward pass begins. With 1000+ GPUs, even a 1% communication overhead compounds. Can be partially hidden by overlapping compute with communication.
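A rough cost model makes the scaling visible. This sketch uses the standard bandwidth term for a ring all-reduce (each GPU moves 2(n−1)/n of the buffer over its link); it ignores latency terms and real NVLink/InfiniBand topology, and the function name is ours:

```python
def ring_allreduce_seconds(grad_bytes: float, n_gpus: int, link_gbps: float) -> float:
    """Bandwidth-optimal ring all-reduce time: each GPU sends and
    receives 2 * (n-1)/n of the gradient buffer over its link."""
    bytes_on_wire = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return bytes_on_wire / (link_gbps * 1e9 / 8)   # Gb/s -> bytes/s

# 7B-parameter gradients in BF16 (2 bytes each), 8 GPUs, 400 Gb/s links:
print(f"{ring_allreduce_seconds(7e9 * 2, 8, 400):.3f} s per step if not overlapped")
```

Against a ~1 s step time, half a second of un-overlapped communication is exactly why compute-communication overlap matters.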

Memory bandwidth limits

Even during compute-bound training, individual kernels can become memory-bound if they have low arithmetic intensity (e.g., layer norm, embedding lookups). These kernels leave Tensor Cores idle while waiting for HBM reads.

Kernel launch overhead

Each PyTorch operation launches a separate GPU kernel. The CPU must submit these in sequence; short operations can leave the GPU stalled between kernel launches. Operator fusion combines many small kernels into one, eliminating this overhead.

Pipeline bubbles

Pipeline parallelism creates idle time at the start and end of each micro-batch (the "bubble") where GPUs are waiting for data from the previous stage. Larger micro-batch counts reduce bubble fraction but increase memory pressure.
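The bubble fraction has a simple closed form under the standard GPipe/1F1B model: with p stages and m micro-batches, the idle fraction is (p−1)/(m+p−1). A quick sketch:

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of a pipeline step for p stages and m micro-batches
    (standard GPipe/1F1B bubble model: (p - 1) / (m + p - 1))."""
    return (stages - 1) / (micro_batches + stages - 1)

for m in (4, 16, 64):
    print(f"8 stages, {m:>2} micro-batches: {bubble_fraction(8, m):.1%} bubble")
```

Going from 4 to 64 micro-batches on 8 stages cuts the bubble from ~64% to ~10%, which is the memory-for-MFU trade the paragraph above describes.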

Strategies to Improve MFU

1. Mixed Precision (BF16 + FP32)

Use BF16 for activations and matrix math (roughly 2× Tensor Core throughput vs TF32, and far more than non-Tensor-Core FP32); keep FP32 master weights and optimizer states. Unlike FP16, BF16's wide exponent range usually makes loss scaling unnecessary. Standard since 2021.

2. Operator Fusion

Combine multiple small GPU kernels (e.g., LayerNorm → Dropout → Residual add) into a single fused kernel. Eliminates HBM round-trips between operations. Flash Attention is the canonical example.

3. Gradient Checkpointing

Instead of storing all activations for the backward pass, recompute them on-demand. Reduces memory footprint by ~√layers, enabling larger batch sizes — which increases MFU by improving arithmetic intensity.

4. FP8 Training

H100 supports FP8 formats (E4M3 and E5M2). FP8 Tensor Cores provide ~2× throughput vs BF16. Requires per-tensor scaling factors to avoid underflow and overflow. Used in DeepSeek-V3 training; Llama 3.1 was trained in BF16, with FP8 used mainly for inference.

5. Compute-Communication Overlap

Structure distributed training so gradient all-reduce runs concurrently with ongoing compute (e.g., overlapping a layer's gradient communication with the backward pass of earlier layers). Requires careful pipeline scheduling, such as the 1F1B schedule from PipeDream used in Megatron-LM.
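The ~√layers memory claim for gradient checkpointing (strategy 3 above) falls out of a simple cost model. A sketch under the idealized assumption that every layer's activation is the same size (the helper name is ours):

```python
import math

def activations_stored(n_layers: int, segment: int) -> int:
    """Peak layer activations held with checkpointing every `segment`
    layers: one stored checkpoint per segment, plus one segment's
    worth of activations recomputed during the backward pass."""
    return math.ceil(n_layers / segment) + segment

n_layers = 64
best = min(range(1, n_layers + 1), key=lambda s: activations_stored(n_layers, s))
print(best, activations_stored(n_layers, best))   # optimum at sqrt(64) = 8, storing 16
```

Minimizing L/s + s gives s = √L, so peak activation memory scales as O(√L) instead of O(L) while costing one extra forward pass of recompute.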

FP8 Training in Practice

FP8 (8-bit floating point) was introduced with H100 and represents the current frontier of training efficiency. NVIDIA supports two FP8 formats: E4M3 (4 exponent bits, 3 mantissa) for weights and activations, and E5M2 (5 exponent, 2 mantissa) for gradients where range matters more than precision.
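The two formats trade mantissa for exponent, which shows up directly in their maximum representable magnitudes. A sketch assuming the OCP FP8 spec conventions (E5M2 reserves its top exponent code for inf/NaN like IEEE types; E4M3 reclaims it for finite values, with only mantissa 111 as NaN):

```python
# Max normal magnitudes, derived from the format definitions:
e5m2_max = (1 + 0.5 + 0.25) * 2 ** 15   # 1.75 * 2^15 = 57344.0
e4m3_max = (1 + 0.5 + 0.25) * 2 ** 8    # 1.75 * 2^8  = 448.0
bf16_max = (2 - 2 ** -7) * 2 ** 127     # ~3.39e38, for comparison
print(e4m3_max, e5m2_max)
```

A range topping out at 448 (E4M3) versus BF16's ~3.4e38 is why FP8 cannot be used without per-tensor scaling.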

FP8 engineering challenges

FP8 has a very narrow dynamic range. Without careful scaling, small gradient values underflow to zero (vanishing gradients) and large values overflow to NaN. Production FP8 recipes therefore maintain explicit scale factors: NVIDIA's Transformer Engine uses delayed scaling, tracking the maximum absolute value (amax) of each tensor over recent steps to set a per-tensor scale, while DeepSeek-V3 uses finer-grained block-wise scaling. The scaling machinery adds a small overhead (on the order of 1–2%) but enables roughly 2× higher matmul throughput.
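The core of delayed scaling fits in a few lines. A minimal sketch, not the Transformer Engine implementation (which uses power-of-two margins and fused kernels); names are illustrative:

```python
from collections import deque

def fp8_scale(amax_history: deque, fp8_max: float = 448.0) -> float:
    """Delayed-scaling sketch: choose the scale so the largest value
    seen in recent steps maps to the top of the E4M3 range (448)."""
    return fp8_max / max(amax_history)

# Per-tensor amax tracked over a window of recent training steps:
history = deque([0.013, 0.021, 0.017], maxlen=16)
scale = fp8_scale(history)
print(max(history) * scale)   # largest recent value lands exactly at 448.0
```

Using a history window rather than the current step's amax is the "delayed" part: it avoids a serializing pass over each tensor before quantizing it.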

Measuring MFU in Practice

To measure MFU during training, you need two numbers: the theoretical peak FLOPS of your hardware configuration and the actual training throughput in tokens per second.

Example: measuring MFU for a 7B model on 8× H100

tokens_per_second = batch_size * seq_len / step_time_seconds
achieved_flops = 6 * N * tokens_per_second
hardware_peak = 8 * 989e12            # 8× H100, BF16 dense peak
mfu = achieved_flops / hardware_peak

# If step_time = 1.2 s, batch_size = 8, seq_len = 4096, N = 7e9:
#   tokens/s       = 8 × 4096 / 1.2 ≈ 27,307
#   achieved_flops = 6 × 7e9 × 27,307 ≈ 1.147e15
#   hardware_peak  = 8 × 989e12 = 7.912e15
#   MFU            = 1.147e15 / 7.912e15 ≈ 14.5%  (needs optimization!)

NVIDIA Nsight Systems and Nsight Compute provide kernel-level profiling to identify which operations are bottlenecks. PyTorch Profiler (with Chrome trace export) shows CPU-GPU synchronization points, idle gaps, and kernel duration breakdowns.

Checklist: Do You Understand This?

  • What does "2N FLOPs per token" mean for a transformer forward pass, and why are embeddings excluded?
  • Apply the 6N rule: estimate total training FLOPs for a 7B model trained on 1 trillion tokens.
  • Define MFU and explain what it measures. What is a good MFU for a well-tuned training run on H100?
  • List three independent reasons why achieved FLOPS falls below theoretical peak during distributed training.
  • What is operator fusion, and how does it improve MFU compared to running individual ops sequentially?
  • Why does FP8 training achieve ~2× throughput vs BF16, and what is the main engineering risk?
  • How would you measure MFU for a training run given only step time, batch size, sequence length, and model parameter count?