Mixture of Experts (MoE) — How It Works
Mixture of Experts (MoE) is an architectural technique that allows a model to have far more parameters than it activates for any given token. The central idea: decouple model capacity from compute cost. A model can have 100 billion parameters but activate only 10 billion per token — giving you the expressiveness of a large model at the inference cost of a smaller one.
MoE is now used in several of the world's most capable models: GPT-4 (reported), Mixtral 8×7B and 8×22B, DeepSeek-V2/V3, and Google's Gemini 1.5. Understanding it is essential for understanding modern frontier architecture.
The Basic Idea — Experts Replace the Dense FFN
In a standard transformer block, after the attention layer comes a Feed-Forward Network (FFN): two linear projections with a nonlinearity between them, applied independently to each token. The FFN's inner dimension is typically 4× the model's hidden dimension, making the FFN the largest component of a transformer block by parameter count.
In an MoE transformer, the single dense FFN is replaced with N expert FFNs. Each expert is an independent FFN with the same shape as the dense FFN would have been. For each token, a router selects K of the N experts to activate. Only those K experts process the token; the rest are ignored.
In a dense model, every parameter activates for every token: total params equal active params, and compute scales directly with total model size.
In an MoE model with N=8 and K=2, 8 experts exist but only 2 activate per token. Total FFN params are 8× a single expert's, while active FFN params are only 2× — you get 8× the capacity at roughly 2× the FFN compute cost.
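The arithmetic above can be checked with a short sketch. The dimensions below are illustrative assumptions (a 4096 hidden dimension with the typical 4× FFN expansion), not the exact configuration of any released model:

```python
# Illustrative parameter arithmetic for one MoE FFN layer.
# All dimensions are assumptions for the sake of example.
hidden_dim = 4096
ffn_dim = 4 * hidden_dim      # typical 4x expansion
n_experts = 8
top_k = 2

# One expert FFN: up-projection + down-projection (biases ignored).
params_per_expert = hidden_dim * ffn_dim + ffn_dim * hidden_dim

total_ffn_params = n_experts * params_per_expert
active_ffn_params = top_k * params_per_expert

print(f"total FFN params:  {total_ffn_params:,}")   # 8x one expert
print(f"active FFN params: {active_ffn_params:,}")  # 2x one expert
```

Note that this ratio applies only to the FFN layers; attention parameters are shared and always active, so the whole-model active fraction is higher than K/N.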
The Router — How Experts Are Selected
The router is a small linear layer that takes each token's hidden state vector as input and produces N logits — one per expert. These logits are converted to routing weights via softmax, and the top-K experts (by weight) are selected.
The token is then processed by each of the K selected experts independently, and the K expert outputs are combined as a weighted sum using the routing weights (in many implementations, the K selected weights are first renormalized to sum to 1):

output = w₁ · Expert₁(x) + w₂ · Expert₂(x) + … + w_K · Expert_K(x)
The routing weights thus control how much each selected expert influences the final output. If one expert has a much higher routing weight than another, it dominates the output for that token.
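The whole route-and-combine step fits in a few lines of NumPy. This is a minimal single-token sketch; renormalizing the selected weights is one common convention (used, for example, in Mixtral-style implementations), not a universal rule:

```python
import numpy as np

def moe_forward(x, router_w, experts, top_k=2):
    """Route one token through a top-K mixture of experts (sketch).

    x        : (hidden_dim,) token hidden state
    router_w : (n_experts, hidden_dim) router projection
    experts  : list of callables, each acting as an expert FFN
    """
    logits = router_w @ x                        # one logit per expert
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                     # softmax over all experts
    top = np.argsort(weights)[-top_k:]           # indices of the top-K experts
    top_w = weights[top] / weights[top].sum()    # renormalize over selected
    # Weighted sum of only the selected experts' outputs.
    return sum(w * experts[i](x) for w, i in zip(top_w, top))

# Toy usage: 4 "experts" that just scale the input.
rng = np.random.default_rng(0)
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
x = rng.standard_normal(8)
router_w = rng.standard_normal((4, 8))
y = moe_forward(x, router_w, experts)
```

Real implementations route a whole batch at once and dispatch tokens to experts in parallel, but the per-token math is exactly this.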
Load Balancing — The Critical Training Challenge
A naive MoE router has a serious problem: it collapses. If one expert starts producing slightly better outputs than others early in training, the router learns to route more tokens to it. That expert then gets more gradient signal and improves faster. The other experts see few tokens, receive little training signal, and stagnate. Eventually, almost all tokens go to 1–2 experts — the model effectively becomes a dense model again, but wastes the parameters of the unused experts.
This is called expert collapse, and avoiding it is one of the central challenges in training MoE models.
**Auxiliary load-balancing loss.** An additional loss term penalizes uneven expert utilization, encouraging the router to distribute tokens roughly uniformly across experts. It is added to the main language modeling loss with a small coefficient.
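One widely used formulation is the Switch Transformer's auxiliary loss, the product of the fraction of tokens dispatched to each expert and the mean router probability for that expert, scaled by N. A sketch (for K=1 routing):

```python
import numpy as np

def load_balancing_loss(router_probs, expert_indices, n_experts):
    """Switch-Transformer-style auxiliary loss (one common formulation).

    router_probs   : (n_tokens, n_experts) softmax router outputs
    expert_indices : (n_tokens,) the expert each token was routed to (K=1)
    """
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_indices, minlength=n_experts) / len(expert_indices)
    # P_i: mean router probability assigned to expert i
    p = router_probs.mean(axis=0)
    # Minimized (value 1.0) when both f and P are uniform at 1/N.
    return n_experts * float(np.dot(f, p))

# Perfectly balanced toy batch: uniform probs, round-robin routing.
n_tokens, n_experts = 8, 4
probs = np.full((n_tokens, n_experts), 1.0 / n_experts)
idx = np.arange(n_tokens) % n_experts
loss = load_balancing_loss(probs, idx, n_experts)  # 1.0 at perfect balance
```

If the router sends every token to one expert, both `f` and `p` concentrate on that expert and the loss grows toward N, pushing gradients back toward balance.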
**Expert capacity limits.** Each expert has a maximum "capacity" — the maximum number of tokens it can process per batch. Tokens routed to a full expert are dropped (or handled by the next-best expert). This prevents overloading hot experts.
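A minimal sketch of token dropping under a capacity limit; the first-come-first-served slot assignment and the capacity formula in the docstring are common conventions, not the only ones:

```python
import numpy as np

def apply_capacity(expert_indices, n_experts, capacity):
    """Drop tokens routed beyond each expert's capacity (sketch).

    Returns a boolean mask: True where the token keeps its expert slot.
    `capacity` is often computed as ceil(capacity_factor * n_tokens / n_experts).
    """
    keep = np.zeros(len(expert_indices), dtype=bool)
    counts = np.zeros(n_experts, dtype=int)
    for t, e in enumerate(expert_indices):   # earlier tokens claim slots first
        if counts[e] < capacity:
            counts[e] += 1
            keep[t] = True
    return keep

# 6 tokens, 2 experts, capacity 2: expert 0 is "hot" and overflows.
idx = np.array([0, 0, 0, 1, 0, 1])
mask = apply_capacity(idx, n_experts=2, capacity=2)
```

Here the third and fifth tokens routed to expert 0 exceed its capacity of 2 and are dropped (in a real model they would pass through via the residual connection, or fall back to another expert).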
**Expert dropout.** Randomly drop experts during training (similar to neuron dropout). This forces the router to distribute load, because any single expert might be unavailable, preventing over-specialization.
**Shared experts.** Some experts always activate for every token (shared/global experts), while the remaining experts are routed. This ensures universal knowledge is always available while specialized experts handle token-specific knowledge.
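The shared-expert idea (used, for example, in DeepSeek-style MoE layers) can be sketched as an unconditional term added to the usual routed top-K sum; the exact combination rule varies across implementations:

```python
import numpy as np

def shared_plus_routed(x, shared_experts, routed_experts, router_w, top_k=2):
    """Combine always-on shared experts with top-K routed experts (sketch)."""
    # Shared experts process every token unconditionally, no routing.
    out = sum(e(x) for e in shared_experts)
    # Routed experts go through the usual softmax + top-K gate.
    logits = router_w @ x
    w = np.exp(logits - logits.max())
    w /= w.sum()
    top = np.argsort(w)[-top_k:]
    out = out + sum(w[i] * routed_experts[i](x) for i in top)
    return out

# Toy usage: one shared expert, two routed experts, a zero router
# (zero logits give uniform routing weights of 0.5 each).
x = np.ones(4)
shared = [lambda x: 2.0 * x]
routed = [lambda x: x, lambda x: x]
y = shared_plus_routed(x, shared, routed, router_w=np.zeros((2, 4)))
```

In the toy call, the shared expert contributes 2x and the two routed experts contribute 0.5x each, so the output is 3x.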
Switch Transformer (2021) — First Large-Scale MoE LLM
Google's Switch Transformer (2021) was the first demonstration that MoE could scale efficiently to very large language models. It used K=1 (route each token to exactly one expert — hard routing), which simplified the load-balancing problem and reduced compute overhead. The paper showed that for the same compute budget, Switch Transformer achieved better performance than a dense T5 baseline. The key finding: capacity (total parameters) matters even if not all of it activates, because different tokens can use different experts to specialize their processing.
MoE vs Dense — Tradeoffs in Practice
| Dimension | Dense Model | MoE Model |
|---|---|---|
| FFN parameters activated per token | 100% (all params) | K/N fraction (e.g., 25% with K=2, N=8) |
| Total memory required | Model size | Full model size (all experts must fit in memory) |
| Inference compute (FLOPs) | Scales with total params | Scales with active params only |
| Distributed inference complexity | Standard tensor/pipeline parallelism | Expert parallelism required; all-to-all communication between GPUs |
| Training stability | Well-understood, stable | Requires careful load balancing; can collapse |
| Quality per active FLOP | Baseline | Higher — more total capacity helps reasoning |
The practical summary: MoE gives you better quality per FLOP at the cost of more total memory and higher distributed system complexity. For organizations running massive inference workloads (where compute cost per token matters enormously), MoE is a compelling choice. For teams running models on a single machine or needing predictable, simple inference, dense models are easier to operate.
Checklist: Do You Understand This?
- What does it mean to "decouple model capacity from compute cost" in MoE?
- In an MoE transformer, what component does the set of N expert FFNs replace?
- Describe how the router selects which experts to activate for a given token.
- What is expert collapse, and why does it happen during training?
- Name two techniques used to encourage load balancing across experts during training.
- What did the Switch Transformer demonstrate, and what routing approach did it use?
- In a MoE model with 8 experts and K=2, what fraction of FFN parameters are active per token, and what fraction must fit in memory for inference?
- Why does distributed inference of MoE models require "expert parallelism" and inter-GPU communication that dense models do not?