Information Theory for ML
Information theory was invented by Claude Shannon in 1948 to answer a deceptively simple question: how much information does a message contain? Decades later, the same framework quietly underlies most of modern machine learning. The cross-entropy loss you minimize when training a classifier is a direct application of Shannon's entropy. The KL divergence that appears in VAEs, diffusion models, and RLHF reward shaping comes from the same source. Perplexity — the standard metric for language models — is entropy in disguise. Understanding these connections reveals why the math works, not just how to use it.
Why Information Theory
Four core information-theoretic quantities appear throughout ML:
- Entropy H(X) — measures uncertainty in a probability distribution; minimizing the entropy of model predictions is a common regularizer in semi-supervised learning
- Cross-entropy H(p, q) — the standard loss for classification; minimizing it is equivalent to maximum likelihood estimation
- KL divergence D_KL(p‖q) — measures how different two distributions are; used in VAEs, policy optimization, and the KL penalty in RLHF
- Mutual information I(X;Y) — measures statistical dependence; drives feature selection and the information bottleneck theory of deep learning
None of these are arbitrary design choices. They all derive from one consistent framework about the fundamental relationship between probability and surprise.
Entropy
Shannon entropy measures the average surprise (or uncertainty) in a random variable X. If an event has probability p, its surprise is -log(p) — low-probability events are more surprising. Entropy averages this across all possible outcomes:

H(X) = -Σₓ p(x) · log p(x)
Two base conventions exist. Using log₂ gives entropy in bits — the minimum number of binary questions needed to identify an outcome. Using ln gives entropy in nats. PyTorch and TensorFlow use the natural log by convention.
Maximum Entropy (Uniform)
A fair 8-sided die has H = log₂(8) = 3 bits. Every outcome is equally likely, so uncertainty is at its maximum: you need 3 yes/no questions to identify the outcome.
Minimum Entropy (Deterministic)
A biased coin that always lands heads has H = 0 bits — no uncertainty, and no information is gained from observing it. By comparison, a fair coin has H = 1 bit.
For language models, high entropy next-token distributions mean the model is uncertain (many plausible completions). Low entropy means the model is confident. Temperature sampling directly manipulates this: temperature > 1 raises entropy (more random); temperature < 1 lowers entropy (more peaked toward the argmax).
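The entropy examples above, and the effect of temperature on entropy, can be checked numerically. This is a minimal stdlib-only sketch; the logit values are made up for illustration:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H = -sum p * log(p); terms with p = 0 contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def apply_temperature(logits, T):
    """Softmax with temperature; T > 1 flattens the distribution, T < 1 sharpens it."""
    exps = [math.exp(l / T) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Fair 8-sided die: maximum entropy, log2(8) ≈ 3 bits
print(entropy([1/8] * 8))

# Biased coin that always lands heads: zero entropy
print(entropy([1.0, 0.0]))

# Entropy rises monotonically with temperature for a fixed set of logits
logits = [2.0, 1.0, 0.5, 0.1]
for T in (0.5, 1.0, 2.0):
    print(T, round(entropy(apply_temperature(logits, T)), 3))
```

Running this shows the die at ~3 bits, the deterministic coin at 0, and the next-token entropy growing as temperature increases.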
Cross-Entropy
Cross-entropy measures how much information you need to encode samples from distribution p using a code optimized for distribution q:

H(p, q) = -Σₓ p(x) · log q(x)
In classification training, p is the true label distribution (a one-hot vector: 100% probability on the correct class) and q is the model's predicted probability distribution (softmax output). Minimizing cross-entropy loss drives the model to assign high probability to the correct class.
When p is one-hot, the sum collapses to a single term, H(p, q) = -log q(correct class) — exactly the negative log-likelihood of the correct class. Minimizing cross-entropy loss is therefore equivalent to maximum likelihood estimation — the model learns to assign as much probability mass as possible to correct outcomes. This connection is why cross-entropy is the universal training loss for any model that outputs a probability distribution: classifiers, language models, and more.
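The collapse to negative log-likelihood for a one-hot label can be verified directly. A minimal sketch with made-up prediction probabilities:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# One-hot true label (class 1) and a softmax-style prediction
p = [0.0, 1.0, 0.0]
q = [0.1, 0.7, 0.2]

# For a one-hot p, cross-entropy equals -log q(correct class)
print(cross_entropy(p, q))
print(-math.log(q[1]))  # same value: the negative log-likelihood
```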
KL Divergence
KL divergence measures the "extra bits" (or nats) required to encode samples from p using a code designed for q. It quantifies how different two distributions are from an information-theoretic perspective:

D_KL(p‖q) = Σₓ p(x) · log( p(x) / q(x) )
Key properties:
- Always ≥ 0 — zero only when p = q exactly (Gibbs' inequality)
- Asymmetric — D_KL(p‖q) ≠ D_KL(q‖p) in general; which direction matters for your application
- Decomposes cross-entropy: H(p, q) = H(p) + D_KL(p‖q)
The decomposition means minimizing cross-entropy loss (when the true entropy H(p) is fixed, as it is for one-hot labels) is exactly equivalent to minimizing KL divergence between the true distribution and the model.
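Both the decomposition and the asymmetry can be checked numerically. A minimal sketch with two made-up discrete distributions on the same support:

```python
import math

def entropy(p):
    """H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

# Decomposition: H(p, q) = H(p) + D_KL(p || q)
print(cross_entropy(p, q), entropy(p) + kl(p, q))

# Asymmetry: the two directions generally differ
print(kl(p, q), kl(q, p))
```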
Where KL Divergence Appears in ML
- VAEs: ELBO loss = reconstruction loss + D_KL(posterior ‖ prior); the KL term regularizes the latent space toward a Gaussian
- Diffusion models: Training loss bounds KL divergence between the denoising distribution and the true reverse process
- RLHF: KL penalty between the fine-tuned policy and the reference model prevents the model from drifting too far from its pre-training behavior
- Knowledge distillation: Student minimizes KL divergence from teacher's soft probability outputs
Mutual Information
Mutual information I(X;Y) measures how much knowing Y reduces uncertainty about X — equivalently, how much information X and Y share:

I(X;Y) = Σₓ,ᵧ p(x,y) · log[ p(x,y) / (p(x)·p(y)) ] = H(X) − H(X|Y)
When X and Y are statistically independent, p(x,y) = p(x)·p(y) and I(X;Y) = 0. When Y perfectly determines X, I(X;Y) = H(X) — knowing Y removes all uncertainty about X.
In machine learning, mutual information appears in:
- Feature selection: Select features X with highest I(X; Y_label) — they carry the most information about the target
- Representation learning: Contrastive methods like SimCLR can be interpreted as maximizing mutual information between two views of the same data
- Clustering: Mutual information between cluster assignments and ground truth labels measures clustering quality (normalized MI is a standard metric)
- Information bottleneck: Explicitly trades off compression of X against prediction of Y — see next section
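The two limiting cases — independence giving I = 0, and perfect dependence giving I = H(X) — can be checked from a small joint distribution table. A minimal stdlib-only sketch:

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log[ p(x,y) / (p(x) p(y)) ], in nats.

    `joint` is a 2-D table: joint[i][j] = p(X = i, Y = j).
    """
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

# Independent variables: p(x,y) = p(x) p(y), so I(X;Y) = 0
independent = [[0.25, 0.25], [0.25, 0.25]]
print(mutual_information(independent))

# Y determines X exactly: I(X;Y) = H(X) = ln(2) nats
dependent = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(dependent))
```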
Perplexity
Perplexity is the standard evaluation metric for language models. It measures how surprised the model is, on average, by each token in a held-out test set — the exponential of the average cross-entropy per token:

perplexity = exp( −(1/N) Σᵢ log q(tokenᵢ | context) )
Intuition: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly from k equally-likely options at each step. A model with perplexity 10 is effectively choosing between 10 equally plausible next tokens. A model with perplexity 100 is much more uncertain.
| Perplexity | Interpretation | Typical Context |
|---|---|---|
| 1 | Perfect — always predicts correctly | Impossible on real data; memorization artifact |
| 5–10 | Excellent | State-of-the-art LLMs on in-domain text |
| 20–50 | Good — reasonable prediction | Decent models on broader benchmarks |
| 100+ | Weak — highly uncertain | Out-of-domain text, or weak baselines |
Perplexity comparisons are only valid on the same tokenizer and test set. A model with a larger vocabulary can report lower perplexity simply because it splits text into fewer tokens per word, reducing the average cross-entropy per token. Cross-model perplexity comparisons require careful normalization.
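The "uniform choice among k options" intuition falls straight out of the formula: if the model assigns probability 1/k to every actual token, perplexity is exactly k. A minimal sketch with made-up per-token probabilities:

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability per token."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# Probabilities the model assigned to the tokens that actually occurred
probs = [0.5, 0.1, 0.25, 0.05]
print(perplexity(probs))

# Uniform 1/k at every step gives perplexity exactly k
print(perplexity([1/10] * 4))  # 10 equally-likely options -> perplexity 10
```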
The Information Bottleneck
The information bottleneck principle, proposed by Tishby et al., offers a theoretical account of what deep learning does during training. The goal is to compress the input X into a representation Z that:
- Retains as much information about the target Y as possible — maximize I(Z; Y)
- Discards as much irrelevant information about X as possible — minimize I(X; Z)

Formally, the objective is to minimize I(X; Z) − β·I(Z; Y), where the coefficient β controls the trade-off between compression and prediction.
The interpretation for neural networks: each layer progressively compresses the input, stripping out irrelevant variation (e.g. lighting, viewpoint for an image classifier) while preserving task-relevant structure (e.g. object identity). The hidden layers are traversing the information plane — trading raw input information for task-relevant information.
Practical Consequences
- Regularization (dropout, weight decay) encourages compression — it pushes representations to discard spurious features
- Bottleneck architectures (autoencoders, VAEs) literally implement information bottleneck by design
- Representation quality for transfer learning is related to how much task-relevant information the representation preserves while being general
- The information bottleneck framework is debated — some findings (especially about the fitting/compression phases) have not fully replicated — but it remains a useful conceptual lens
Checklist: Do You Understand This?
- Can you write the formula for Shannon entropy and explain what high versus low entropy means intuitively?
- Can you derive why minimizing cross-entropy loss is equivalent to maximum likelihood estimation?
- Can you explain why KL divergence is asymmetric and give one ML context where the direction matters?
- Do you understand the relationship H(p, q) = H(p) + D_KL(p‖q) and what it implies for training?
- Can you explain what a perplexity of 20 means for a language model, and why cross-model perplexity comparison requires care?
- Can you describe the information bottleneck objective and connect it to how deep networks learn representations?