Information Theory for ML
Information theory was invented by Claude Shannon in 1948 to answer a deceptively simple question: how much information does a message contain? Decades later, the same framework quietly underlies most of modern machine learning. The cross-entropy loss you minimize when training a classifier is a direct application of Shannon's entropy. The KL divergence that appears in VAEs, diffusion models, and RLHF reward shaping comes from the same source. Perplexity — the standard metric for language models — is entropy in disguise. Understanding these connections reveals why the math works, not just how to use it.
Why Information Theory
Four core information-theoretic quantities appear throughout ML:
- Entropy H(X) — measures uncertainty in a probability distribution; minimizing the entropy of model predictions is a common regularizer in semi-supervised learning
- Cross-entropy H(p, q) — the standard loss for classification; minimizing it is equivalent to maximum likelihood estimation
- KL divergence D_KL(p‖q) — measures how different two distributions are; used in VAEs, policy optimization, and the KL penalty in RLHF
- Mutual information I(X;Y) — measures statistical dependence; drives feature selection and the information bottleneck theory of deep learning
None of these are arbitrary design choices. They all derive from one consistent framework about the fundamental relationship between probability and surprise.
Entropy
Shannon entropy measures the average surprise (or uncertainty) in a random variable X. If an event has probability p, its surprise is -log(p) — low-probability events are more surprising. Entropy averages this across all possible outcomes:

H(X) = -Σₓ p(x) · log p(x)
Two base conventions exist. Using log₂ gives entropy in bits — the minimum number of binary questions needed to identify an outcome. Using ln gives entropy in nats. PyTorch and TensorFlow use the natural log by convention.
Maximum Entropy (Uniform)
A fair 8-sided die has H = log₂(8) = 3 bits. Every outcome is equally likely, so uncertainty is at its maximum: you need 3 yes/no questions to identify the outcome.
Minimum Entropy (Deterministic)
A biased coin that always lands heads has H = 0 bits — no uncertainty, and no information is gained from observing it. By comparison, a fair coin has H = 1 bit.
For language models, high entropy next-token distributions mean the model is uncertain (many plausible completions). Low entropy means the model is confident. Temperature sampling directly manipulates this: temperature > 1 raises entropy (more random); temperature < 1 lowers entropy (more peaked toward the argmax).
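The entropy examples above, and the effect of temperature on entropy, can be checked numerically. This is a minimal stdlib-only sketch; the logit values are made up for illustration:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H = -sum p * log(p); terms with p = 0 contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def apply_temperature(logits, T):
    """Softmax with temperature; T > 1 flattens the distribution, T < 1 sharpens it."""
    exps = [math.exp(l / T) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Fair 8-sided die: maximum entropy, log2(8) ≈ 3 bits
print(entropy([1/8] * 8))

# Biased coin that always lands heads: zero entropy
print(entropy([1.0, 0.0]))

# Entropy rises monotonically with temperature for a fixed set of logits
logits = [2.0, 1.0, 0.5, 0.1]
for T in (0.5, 1.0, 2.0):
    print(T, round(entropy(apply_temperature(logits, T)), 3))
```

Running this shows the die at ~3 bits, the deterministic coin at 0, and the next-token entropy growing as temperature increases.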
Cross-Entropy
Cross-entropy measures how much information you need to encode samples from distribution p using a code optimized for distribution q:

H(p, q) = -Σₓ p(x) · log q(x)
In classification training, p is the true label distribution (a one-hot vector: 100% probability on the correct class) and q is the model's predicted probability distribution (softmax output). Minimizing cross-entropy loss drives the model to assign high probability to the correct class.
When p is one-hot, the sum collapses to a single term, H(p, q) = -log q(correct class) — exactly the negative log-likelihood of the correct class. Minimizing cross-entropy loss is therefore equivalent to maximum likelihood estimation — the model learns to assign as much probability mass as possible to correct outcomes. This connection is why cross-entropy is the universal training loss for any model that outputs a probability distribution: classifiers, language models, and more.
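The collapse to negative log-likelihood for a one-hot label can be verified directly. A minimal sketch with made-up prediction probabilities:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# One-hot true label (class 1) and a softmax-style prediction
p = [0.0, 1.0, 0.0]
q = [0.1, 0.7, 0.2]

# For a one-hot p, cross-entropy equals -log q(correct class)
print(cross_entropy(p, q))
print(-math.log(q[1]))  # same value: the negative log-likelihood
```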
KL Divergence
KL divergence measures the "extra bits" (or nats) required to encode samples from p using a code designed for q. It quantifies how different two distributions are from an information-theoretic perspective:

D_KL(p‖q) = Σₓ p(x) · log( p(x) / q(x) )
Key properties:
- Always ≥ 0 — zero only when p = q exactly (Gibbs' inequality)
- Asymmetric — D_KL(p‖q) ≠ D_KL(q‖p) in general; which direction matters for your application
- Decomposes cross-entropy: H(p, q) = H(p) + D_KL(p‖q)
The decomposition means minimizing cross-entropy loss (when the true entropy H(p) is fixed, as it is for one-hot labels) is exactly equivalent to minimizing KL divergence between the true distribution and the model.
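Both the decomposition and the asymmetry can be checked numerically. A minimal sketch with two made-up discrete distributions on the same support:

```python
import math

def entropy(p):
    """H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

# Decomposition: H(p, q) = H(p) + D_KL(p || q)
print(cross_entropy(p, q), entropy(p) + kl(p, q))

# Asymmetry: the two directions generally differ
print(kl(p, q), kl(q, p))
```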
Where KL Divergence Appears in ML
- VAEs: ELBO loss = reconstruction loss + D_KL(posterior ‖ prior); the KL term regularizes the latent space toward a Gaussian
- Diffusion models: Training loss bounds KL divergence between the denoising distribution and the true reverse process
- RLHF: KL penalty between the fine-tuned policy and the reference model prevents the model from drifting too far from its pre-training behavior
- Knowledge distillation: Student minimizes KL divergence from teacher's soft probability outputs
Mutual Information
Mutual information I(X;Y) measures how much knowing Y reduces uncertainty about X — equivalently, how much information X and Y share:

I(X;Y) = Σₓ,ᵧ p(x,y) · log[ p(x,y) / (p(x)·p(y)) ] = H(X) − H(X|Y)
When X and Y are statistically independent, p(x,y) = p(x)·p(y) and I(X;Y) = 0. When Y perfectly determines X, I(X;Y) = H(X) — knowing Y removes all uncertainty about X.
In machine learning, mutual information appears in:
- Feature selection: Select features X with highest I(X; Y_label) — they carry the most information about the target
- Representation learning: Contrastive methods like SimCLR can be interpreted as maximizing mutual information between two views of the same data
- Clustering: Mutual information between cluster assignments and ground truth labels measures clustering quality (normalized MI is a standard metric)
- Information bottleneck: Explicitly trades off compression of X against prediction of Y — see next section
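The two limiting cases — independence giving I = 0, and perfect dependence giving I = H(X) — can be checked from a small joint distribution table. A minimal stdlib-only sketch:

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log[ p(x,y) / (p(x) p(y)) ], in nats.

    `joint` is a 2-D table: joint[i][j] = p(X = i, Y = j).
    """
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

# Independent variables: p(x,y) = p(x) p(y), so I(X;Y) = 0
independent = [[0.25, 0.25], [0.25, 0.25]]
print(mutual_information(independent))

# Y determines X exactly: I(X;Y) = H(X) = ln(2) nats
dependent = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(dependent))
```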
Perplexity
Perplexity is the standard evaluation metric for language models. It measures how surprised the model is, on average, by each token in a held-out test set — the exponential of the average cross-entropy per token:

perplexity = exp( −(1/N) Σᵢ log q(tokenᵢ | context) )
Intuition: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly from k equally-likely options at each step. A model with perplexity 10 is effectively choosing between 10 equally plausible next tokens. A model with perplexity 100 is much more uncertain.
| Perplexity | Interpretation | Typical Context |
|---|---|---|
| 1 | Perfect — always predicts correctly | Impossible on real data; memorization artifact |
| 5–10 | Excellent | State-of-the-art LLMs on in-domain text |
| 20–50 | Good — reasonable prediction | Decent models on broader benchmarks |
| 100+ | Weak — highly uncertain | Out-of-domain text, or weak baselines |
Perplexity comparisons are only valid on the same tokenizer and test set. A model with a larger vocabulary can report lower perplexity simply because it splits text into fewer tokens per word, reducing the average cross-entropy per token. Cross-model perplexity comparisons require careful normalization.
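The "uniform choice among k options" intuition falls straight out of the formula: if the model assigns probability 1/k to every actual token, perplexity is exactly k. A minimal sketch with made-up per-token probabilities:

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability per token."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# Probabilities the model assigned to the tokens that actually occurred
probs = [0.5, 0.1, 0.25, 0.05]
print(perplexity(probs))

# Uniform 1/k at every step gives perplexity exactly k
print(perplexity([1/10] * 4))  # 10 equally-likely options -> perplexity 10
```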
The Information Bottleneck
The information bottleneck principle, proposed by Tishby et al., offers a theoretical account of what deep learning does during training. The goal is to compress the input X into a representation Z that:
- Retains as much information about the target Y as possible — maximize I(Z; Y)
- Discards as much irrelevant information about X as possible — minimize I(X; Z)

Formally, the objective is to minimize I(X; Z) − β·I(Z; Y), where the coefficient β controls the trade-off between compression and prediction.
The interpretation for neural networks: each layer progressively compresses the input, stripping out irrelevant variation (e.g. lighting, viewpoint for an image classifier) while preserving task-relevant structure (e.g. object identity). The hidden layers are traversing the information plane — trading raw input information for task-relevant information.
Practical Consequences
- Regularization (dropout, weight decay) encourages compression — it pushes representations to discard spurious features
- Bottleneck architectures (autoencoders, VAEs) literally implement information bottleneck by design
- Representation quality for transfer learning is related to how much task-relevant information the representation preserves while being general
- The information bottleneck framework is debated — some findings (especially about the fitting/compression phases) have not fully replicated — but it remains a useful conceptual lens
Checklist: Do You Understand This?
- Can you write the formula for Shannon entropy and explain what high versus low entropy means intuitively?
- Can you derive why minimizing cross-entropy loss is equivalent to maximum likelihood estimation?
- Can you explain why KL divergence is asymmetric and give one ML context where the direction matters?
- Do you understand the relationship H(p, q) = H(p) + D_KL(p‖q) and what it implies for training?
- Can you explain what a perplexity of 20 means for a language model, and why cross-model perplexity comparison requires care?
- Can you describe the information bottleneck objective and connect it to how deep networks learn representations?