
Mistral & Mixtral Internals

Mistral AI, founded in Paris in 2023, produced a series of open-weight models that consistently outperformed larger models from established labs. Mistral 7B outperformed Llama 2 13B. Mixtral 8Ɨ7B outperformed Llama 2 70B while using 6Ɨ less compute per forward pass. These results were not accidental — they followed from specific, deliberate architectural choices that are now widely copied across the open-weight ecosystem.

Mistral 7B (2023) — Efficient Attention at 7 Billion Parameters

Mistral 7B shipped with two architectural innovations that distinguished it from Llama 2 at the same parameter scale: Sliding Window Attention (SWA) and Grouped Query Attention (GQA). Together, they allow Mistral 7B to handle longer contexts more efficiently and serve faster with a smaller KV cache footprint.

Sliding Window Attention (SWA)

Standard full self-attention has quadratic complexity: each token attends to all previous tokens. For a sequence of length N, attention requires O(N²) compute and O(N²) memory. This makes long contexts prohibitively expensive.

Sliding Window Attention restricts each token to attending only to the W previous tokens (the window). Mistral 7B uses W = 4,096. The attention pattern is now O(N Ɨ W): linear in sequence length rather than quadratic. For short sequences (N ≤ W), this is identical to full causal attention. For long sequences (N > W), it is dramatically cheaper.
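The cost difference can be made concrete by counting the (query, key) pairs each scheme scores. A minimal sketch; the window W = 4,096 matches Mistral 7B, while the sequence length here is just illustrative:

```python
def attention_pairs(n, w=None):
    """Number of (query, key) pairs scored under causal attention.

    Full attention: token i attends to all i + 1 positions up to itself -> O(N^2).
    Sliding window: token i attends to at most the w most recent positions -> O(N * w).
    """
    if w is None:
        return n * (n + 1) // 2
    return sum(min(i + 1, w) for i in range(n))

n, w = 32_768, 4_096
full = attention_pairs(n)
swa = attention_pairs(n, w)
print(f"full attention:  {full:,} pairs")
print(f"sliding window:  {swa:,} pairs")
print(f"savings:         {full / swa:.1f}x")  # the gap keeps growing with n
```

Because the windowed cost grows linearly while full attention grows quadratically, the savings factor itself grows with sequence length.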

How Long-Range Information Still Flows

A natural concern: if each token only attends to W=4,096 previous tokens, how does a token at position 10,000 "know about" a token at position 1?

The answer is depth. In a 32-layer model, information flows transitively. At layer 1, position 4,097 can attend to positions 1–4,096. At layer 2, position 8,193 can attend to position 4,097, whose layer-1 representation already encodes position 1. After k layers, a token can effectively "see" information from up to k Ɨ W positions back. With 32 layers and W = 4,096, the effective receptive field is 32 Ɨ 4,096 = 131,072 tokens, far beyond the window size.

The cost: SWA provides access to long-range context indirectly, through the chain of intermediate tokens at each layer. This may compress or lose fine-grained long-range dependencies that full attention would capture directly. For most practical tasks — summarization, QA, code — this indirect access is sufficient.
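This transitive flow is easy to verify with a toy reachability computation. A sketch; the sequence length and window here are tiny illustrative values, not Mistral's real sizes:

```python
def earliest_reachable(seq_len, window, layers):
    """For each position, the earliest position whose information can have
    propagated to it after `layers` sliding-window attention layers."""
    earliest = list(range(seq_len))  # layer 0: each token only "contains" itself
    for _ in range(layers):
        # One layer lets position i pull from positions i - window .. i.
        earliest = [earliest[max(0, i - window)] for i in range(seq_len)]
    return earliest

# With window 4, one layer reaches 4 positions back; three layers reach 12 back.
reach = earliest_reachable(seq_len=16, window=4, layers=3)
print(reach[12])  # 0: position 12 can see position 0 after 3 layers
print(reach[15])  # 3: position 15 can only see back to position 3
```

Each extra layer pushes the horizon back by one full window, which is exactly the k Ɨ W receptive-field argument above.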

GQA in Mistral — KV Cache Compression

Mistral 7B uses 8 KV heads shared across 32 query heads — a 4:1 ratio of query heads to KV heads. This is Grouped Query Attention (GQA), which Llama 2 only used at 70B scale. By applying GQA at 7B, Mistral 7B reduces KV cache memory by 4Ɨ compared to standard multi-head attention at the same parameter count.

Smaller KV cache means faster inference (less memory bandwidth), larger batch sizes (more requests per GPU), and longer effective context windows within the same memory budget. For a serving infrastructure operator, these translate directly to lower cost per token.
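The savings are easy to quantify. A back-of-the-envelope sketch using Mistral 7B's published shape (32 layers, 32 query heads, 8 KV heads, head dimension 128); the 4Ɨ factor is just the ratio of KV heads:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_val=2):
    """KV cache size: a K and a V tensor (hence the factor 2) per layer,
    each of shape (seq_len, num_kv_heads, head_dim), in FP16 (2 bytes)."""
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_val

args = dict(num_layers=32, head_dim=128, seq_len=4_096)
mha = kv_cache_bytes(num_kv_heads=32, **args)  # what full multi-head attention would need
gqa = kv_cache_bytes(num_kv_heads=8, **args)   # Mistral 7B's grouped-query attention
print(f"MHA KV cache @ 4k context: {mha / 2**30:.1f} GiB")
print(f"GQA KV cache @ 4k context: {gqa / 2**30:.1f} GiB")  # 4x smaller
```

Note that the cache scales linearly with both batch size and sequence length, so this per-sequence saving multiplies across every concurrent request.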

Mixtral 8Ɨ7B (2023) — MoE Applied to Mistral

Mixtral 8Ɨ7B is a Mixture of Experts version of Mistral 7B. The architecture replaces each dense FFN block with 8 expert FFNs, routing each token to the best 2 (K=2). Key numbers:

Metric | Mixtral 8Ɨ7B | Llama 2 70B (comparison)
Total parameters | 46.7B | 70B
Active parameters per token | 12.9B | 70B
Compute per token (FLOPs) | ~6Ɨ less than Llama 2 70B | Baseline
Quality (benchmark average) | Beats Llama 2 70B on most tasks | Baseline

The result: Mixtral 8Ɨ7B delivers better quality than Llama 2 70B at a fraction of the inference cost. The reason is capacity: the 8 specialized expert FFNs collectively hold far more parameters than a dense FFN with the same per-token compute budget, even though only 2 activate per token. Routing lets the model apply specialized knowledge to each token.
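Top-2 routing can be sketched in a few lines. A plain-Python toy: the router weights and the expert callables below are placeholders standing in for Mixtral's real linear router and expert FFNs:

```python
import math

def moe_forward(x, router_w, experts, k=2):
    """One token through a Mixtral-style MoE layer (simplified sketch).

    x: token hidden state (list of floats)
    router_w: one weight vector per expert; the router scores each expert
    experts: callables standing in for the expert FFNs
    """
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    logits = [dot(w, x) for w in router_w]
    top_k = sorted(range(len(logits)), key=logits.__getitem__)[-k:]
    # Softmax over the selected experts only.
    exp_l = [math.exp(logits[i]) for i in top_k]
    gates = [e / sum(exp_l) for e in exp_l]
    # Only k expert FFNs execute; the other len(experts) - k are skipped.
    out = [0.0] * len(x)
    for g, i in zip(gates, top_k):
        for j, v in enumerate(experts[i](x)):
            out[j] += g * v
    return out
```

With 8 experts and k = 2, each token pays for 2 FFN evaluations while the model keeps 8 FFNs' worth of parameters available to the router.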

The practical constraint: you still need to store all 46.7B parameters in memory — all 8 experts must reside on GPU even though only 2 activate. Mixtral 8Ɨ7B requires approximately 90GB of GPU memory in FP16, meaning you need 2–4 GPUs for comfortable inference. But the throughput (tokens per second) is much higher than a dense 70B model on the same hardware.
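The memory-vs-compute split falls straight out of the parameter counts. A back-of-the-envelope sketch at 2 bytes per parameter (FP16/BF16), weights only, ignoring KV cache and activations:

```python
def fp16_gb(params_billion):
    """Weight memory in GB at 2 bytes per parameter (FP16/BF16)."""
    return params_billion * 2  # 1e9 params * 2 bytes = 2 GB per billion

total_params, active_params = 46.7, 12.9
print(f"resident weights (all 8 experts): ~{fp16_gb(total_params):.0f} GB")
print(f"weights touched per token:        ~{fp16_gb(active_params):.0f} GB")
```

The gap between the two numbers is the MoE bargain: you pay the large figure in memory and the small figure in per-token compute and memory bandwidth.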

Mixtral 8Ɨ22B (2024) — Scaling Up

Mixtral 8Ɨ22B extends the same architecture to a larger scale: 141B total parameters, 39B active per token. It was designed to compete with GPT-3.5 and early GPT-4 class models, with particular strength in code and mathematics. On several coding benchmarks (HumanEval, MBPP) it matches or exceeds GPT-3.5. The Apache 2.0 license allows unrestricted commercial use — a notable contrast to Llama's more restrictive terms at the time.

Mistral Nemo (2024) — The Tekken Tokenizer

Mistral Nemo (12B parameters) introduced the Tekken tokenizer with a 131,072-token vocabulary, among the largest of any production model. A larger vocabulary means better tokenization efficiency for non-English text, code with unusual syntax, and technical documents with domain-specific terms. Mistral reports that Tekken compresses source code and many natural languages roughly 30% more efficiently than its previous SentencePiece tokenizer, and some languages 2–3Ɨ more efficiently (e.g., Korean and Arabic). Mistral Nemo was designed for enterprise deployment and is available under an Apache 2.0 license.

The Architectural Pattern Mistral Established

Mistral established a template that most serious open-weight models now follow:

RoPE

Rotary position embeddings for context extension. Used by Llama, Qwen, Phi, DeepSeek, Gemma.

GQA

Grouped Query Attention for KV cache efficiency. Now universal across open-weight models of all sizes.

SWA (optional)

Sliding Window Attention for long-context efficiency. Less universal — many models use full attention with RoPE scaling instead.

Checklist: Do You Understand This?

  • What is Sliding Window Attention, and how does it reduce the computational complexity of attention from O(N²) to O(NƗW)?
  • If each token only attends to the W=4,096 most recent tokens, how can the model still access information from far earlier in the sequence?
  • How does GQA reduce KV cache memory in Mistral 7B, and what ratio of query heads to KV heads does it use?
  • Mixtral 8Ɨ7B has 46.7B total parameters but only 12.9B active per token. Explain this gap and why it matters for inference cost.
  • Despite lower active parameters, Mixtral 8Ɨ7B outperforms Llama 2 70B. What architectural property explains this?
  • Why must all expert weights reside in GPU memory even though only K=2 experts activate per token?
  • What does a larger tokenizer vocabulary (like Tekken's 131K) improve, and for which types of content does it matter most?