Mixture of Experts (MoE) — How It Works
Mixture of Experts (MoE) is an architectural technique that allows a model to have far more parameters than it activates for any given token. The central idea: decouple model capacity from compute cost. A model can have 100 billion parameters but activate only 10 billion per token — giving you the expressiveness of a large model at the inference cost of a smaller one.
MoE is now used in several of the world's most capable models: GPT-4 (reported), Mixtral 8×7B and 8×22B, DeepSeek-V2/V3, and Google's Gemini 1.5. Understanding it is essential for understanding modern frontier architecture.
The Basic Idea — Experts Replace the Dense FFN
In a standard transformer block, after the attention layer comes a Feed-Forward Network (FFN): two linear projections with a nonlinearity between them, applied independently to each token. The FFN's inner dimension is typically 4× the model's hidden dimension, making the FFN the largest component of a transformer block by parameter count.
In an MoE transformer, the single dense FFN is replaced with N expert FFNs. Each expert is an independent FFN with the same shape as the dense FFN would have been. For each token, a router selects K of the N experts to activate. Only those K experts process the token; the rest are ignored.
In a dense model, every parameter activates for every token: total params equal active params, and compute scales directly with total model size.
In an MoE model with N=8 and K=2, 8 experts exist but only 2 activate per token. Total FFN params are 8× a single expert's, while active FFN params are only 2× — you get 8× the capacity at roughly 2× the FFN compute cost.
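The arithmetic above can be checked with a short sketch. The dimensions below are illustrative assumptions (a 4096 hidden dimension with the typical 4× FFN expansion), not the exact configuration of any released model:

```python
# Illustrative parameter arithmetic for one MoE FFN layer.
# All dimensions are assumptions for the sake of example.
hidden_dim = 4096
ffn_dim = 4 * hidden_dim      # typical 4x expansion
n_experts = 8
top_k = 2

# One expert FFN: up-projection + down-projection (biases ignored).
params_per_expert = hidden_dim * ffn_dim + ffn_dim * hidden_dim

total_ffn_params = n_experts * params_per_expert
active_ffn_params = top_k * params_per_expert

print(f"total FFN params:  {total_ffn_params:,}")   # 8x one expert
print(f"active FFN params: {active_ffn_params:,}")  # 2x one expert
```

Note that this ratio applies only to the FFN layers; attention parameters are shared and always active, so the whole-model active fraction is higher than K/N.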
The Router — How Experts Are Selected
The router is a small linear layer that takes each token's hidden state vector as input and produces N logits — one per expert. These logits are converted to routing weights via softmax, and the top-K experts (by weight) are selected.
The token is then processed by each of the K selected experts independently, and the K expert outputs are combined as a weighted sum using the routing weights (in many implementations, the K selected weights are first renormalized to sum to 1):

output = w₁ · Expert₁(x) + w₂ · Expert₂(x) + … + w_K · Expert_K(x)
The routing weights thus control how much each selected expert influences the final output. If one expert has a much higher routing weight than another, it dominates the output for that token.
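The whole route-and-combine step fits in a few lines of NumPy. This is a minimal single-token sketch; renormalizing the selected weights is one common convention (used, for example, in Mixtral-style implementations), not a universal rule:

```python
import numpy as np

def moe_forward(x, router_w, experts, top_k=2):
    """Route one token through a top-K mixture of experts (sketch).

    x        : (hidden_dim,) token hidden state
    router_w : (n_experts, hidden_dim) router projection
    experts  : list of callables, each acting as an expert FFN
    """
    logits = router_w @ x                        # one logit per expert
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                     # softmax over all experts
    top = np.argsort(weights)[-top_k:]           # indices of the top-K experts
    top_w = weights[top] / weights[top].sum()    # renormalize over selected
    # Weighted sum of only the selected experts' outputs.
    return sum(w * experts[i](x) for w, i in zip(top_w, top))

# Toy usage: 4 "experts" that just scale the input.
rng = np.random.default_rng(0)
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
x = rng.standard_normal(8)
router_w = rng.standard_normal((4, 8))
y = moe_forward(x, router_w, experts)
```

Real implementations route a whole batch at once and dispatch tokens to experts in parallel, but the per-token math is exactly this.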
Load Balancing — The Critical Training Challenge
A naive MoE router has a serious problem: it collapses. If one expert starts producing slightly better outputs than others early in training, the router learns to route more tokens to it. That expert then gets more gradient signal and improves faster. The other experts see few tokens, receive little training signal, and stagnate. Eventually, almost all tokens go to 1–2 experts — the model effectively becomes a dense model again, but wastes the parameters of the unused experts.
This is called expert collapse, and avoiding it is one of the central challenges in training MoE models.
**Auxiliary load-balancing loss.** An additional loss term penalizes uneven expert utilization, encouraging the router to distribute tokens roughly uniformly across experts. It is added to the main language modeling loss with a small coefficient.
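One widely used formulation is the Switch Transformer's auxiliary loss, the product of the fraction of tokens dispatched to each expert and the mean router probability for that expert, scaled by N. A sketch (for K=1 routing):

```python
import numpy as np

def load_balancing_loss(router_probs, expert_indices, n_experts):
    """Switch-Transformer-style auxiliary loss (one common formulation).

    router_probs   : (n_tokens, n_experts) softmax router outputs
    expert_indices : (n_tokens,) the expert each token was routed to (K=1)
    """
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_indices, minlength=n_experts) / len(expert_indices)
    # P_i: mean router probability assigned to expert i
    p = router_probs.mean(axis=0)
    # Minimized (value 1.0) when both f and P are uniform at 1/N.
    return n_experts * float(np.dot(f, p))

# Perfectly balanced toy batch: uniform probs, round-robin routing.
n_tokens, n_experts = 8, 4
probs = np.full((n_tokens, n_experts), 1.0 / n_experts)
idx = np.arange(n_tokens) % n_experts
loss = load_balancing_loss(probs, idx, n_experts)  # 1.0 at perfect balance
```

If the router sends every token to one expert, both `f` and `p` concentrate on that expert and the loss grows toward N, pushing gradients back toward balance.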
**Expert capacity limits.** Each expert has a maximum "capacity" — the maximum number of tokens it can process per batch. Tokens routed to a full expert are dropped (or handled by the next-best expert). This prevents overloading hot experts.
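A minimal sketch of token dropping under a capacity limit; the first-come-first-served slot assignment and the capacity formula in the docstring are common conventions, not the only ones:

```python
import numpy as np

def apply_capacity(expert_indices, n_experts, capacity):
    """Drop tokens routed beyond each expert's capacity (sketch).

    Returns a boolean mask: True where the token keeps its expert slot.
    `capacity` is often computed as ceil(capacity_factor * n_tokens / n_experts).
    """
    keep = np.zeros(len(expert_indices), dtype=bool)
    counts = np.zeros(n_experts, dtype=int)
    for t, e in enumerate(expert_indices):   # earlier tokens claim slots first
        if counts[e] < capacity:
            counts[e] += 1
            keep[t] = True
    return keep

# 6 tokens, 2 experts, capacity 2: expert 0 is "hot" and overflows.
idx = np.array([0, 0, 0, 1, 0, 1])
mask = apply_capacity(idx, n_experts=2, capacity=2)
```

Here the third and fifth tokens routed to expert 0 exceed its capacity of 2 and are dropped (in a real model they would pass through via the residual connection, or fall back to another expert).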
**Expert dropout.** Randomly drop experts during training (similar to neuron dropout). This forces the router to distribute load, because any single expert might be unavailable, preventing over-specialization.
**Shared experts.** Some experts always activate for every token (shared/global experts), while the remaining experts are routed. This ensures universal knowledge is always available while specialized experts handle token-specific knowledge.
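The shared-expert idea (used, for example, in DeepSeek-style MoE layers) can be sketched as an unconditional term added to the usual routed top-K sum; the exact combination rule varies across implementations:

```python
import numpy as np

def shared_plus_routed(x, shared_experts, routed_experts, router_w, top_k=2):
    """Combine always-on shared experts with top-K routed experts (sketch)."""
    # Shared experts process every token unconditionally, no routing.
    out = sum(e(x) for e in shared_experts)
    # Routed experts go through the usual softmax + top-K gate.
    logits = router_w @ x
    w = np.exp(logits - logits.max())
    w /= w.sum()
    top = np.argsort(w)[-top_k:]
    out = out + sum(w[i] * routed_experts[i](x) for i in top)
    return out

# Toy usage: one shared expert, two routed experts, a zero router
# (zero logits give uniform routing weights of 0.5 each).
x = np.ones(4)
shared = [lambda x: 2.0 * x]
routed = [lambda x: x, lambda x: x]
y = shared_plus_routed(x, shared, routed, router_w=np.zeros((2, 4)))
```

In the toy call, the shared expert contributes 2x and the two routed experts contribute 0.5x each, so the output is 3x.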
Switch Transformer (2021) — First Large-Scale MoE LLM
Google's Switch Transformer (2021) was the first demonstration that MoE could scale efficiently to very large language models. It used K=1 (route each token to exactly one expert — hard routing), which simplified the load-balancing problem and reduced compute overhead. The paper showed that for the same compute budget, Switch Transformer achieved better performance than a dense T5 baseline. The key finding: capacity (total parameters) matters even if not all of it activates, because different tokens can use different experts to specialize their processing.
MoE vs Dense — Tradeoffs in Practice
| Dimension | Dense Model | MoE Model |
|---|---|---|
| FFN parameters activated per token | 100% (all params) | K/N fraction (e.g., 25% with K=2, N=8) |
| Total memory required | Model size | Full model size (all experts must fit in memory) |
| Inference compute (FLOPs) | Scales with total params | Scales with active params only |
| Distributed inference complexity | Standard tensor/pipeline parallelism | Expert parallelism required; all-to-all communication between GPUs |
| Training stability | Well-understood, stable | Requires careful load balancing; can collapse |
| Quality per active FLOP | Baseline | Higher — more total capacity helps reasoning |
The practical summary: MoE gives you better quality per FLOP at the cost of more total memory and higher distributed system complexity. For organizations running massive inference workloads (where compute cost per token matters enormously), MoE is a compelling choice. For teams running models on a single machine or needing predictable, simple inference, dense models are easier to operate.
Checklist: Do You Understand This?
- What does it mean to "decouple model capacity from compute cost" in MoE?
- In an MoE transformer, what component does the set of N expert FFNs replace?
- Describe how the router selects which experts to activate for a given token.
- What is expert collapse, and why does it happen during training?
- Name two techniques used to encourage load balancing across experts during training.
- What did the Switch Transformer demonstrate, and what routing approach did it use?
- In a MoE model with 8 experts and K=2, what fraction of FFN parameters are active per token, and what fraction must fit in memory for inference?
- Why does distributed inference of MoE models require "expert parallelism" and inter-GPU communication that dense models do not?