Probability & Statistics for ML
Machine learning models do not output facts — they output distributions. A language model assigns a probability to every possible next token. A classifier assigns a probability to each class. An image generator samples from a learned distribution over pixel values. Understanding probability is not optional for working seriously with ML systems; it is the language in which those systems are defined, trained, and evaluated.
This page covers the probability and statistics concepts that appear most frequently in ML theory and practice — from the definition of a random variable to the connection between maximum likelihood estimation and cross-entropy loss.
Why Probability in ML
The world is uncertain, and data is noisy. A discriminative model trained to classify images never sees exactly the same input twice. A language model generating text must choose among many plausible continuations, not just one correct answer. Probability gives us the mathematical tools to represent, reason about, and quantify that uncertainty.
Three specific reasons probability is central to ML:
- Training objectives are probabilistic. The most common loss functions — cross-entropy, negative log-likelihood, KL divergence — are derived from probability theory. They are not arbitrary engineering choices; they emerge directly from maximum likelihood estimation.
- Model outputs are distributions. Softmax converts logits to a probability distribution over classes. Temperature sampling draws from a categorical distribution over tokens. Knowing the distribution lets you quantify confidence, not just return a point estimate.
- Uncertainty quantification matters in production. A model that reports 99% confidence yet fails on edge cases is dangerous unless its confidence is calibrated. Determining whether model confidence actually tracks accuracy requires probabilistic reasoning.
Random Variables & Distributions
A random variable X is a variable whose value is determined by a random process. Discrete random variables take on countable values (e.g., which token appears next); continuous random variables take on values in a continuous range (e.g., the value of a network activation).
A probability mass function (PMF) P(X = x) gives the probability that a discrete variable equals a specific value. All probabilities must be non-negative and sum to 1. A probability density function (PDF) f(x) describes continuous distributions — probabilities are areas under the curve, not point values. The cumulative distribution function (CDF) F(x) = P(X ≤ x) gives the probability that the variable is at most x.
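As a minimal sketch, the PMF and CDF definitions can be checked in a few lines of Python (the fair six-sided die is an illustrative assumption, not from the text above):

```python
# PMF and CDF for a discrete random variable: a fair six-sided die
# (illustrative example distribution).

pmf = {x: 1 / 6 for x in range(1, 7)}  # P(X = x) for each face

# All probabilities are non-negative and sum to 1
assert all(p >= 0 for p in pmf.values())
assert abs(sum(pmf.values()) - 1.0) < 1e-12

def cdf(x):
    """F(x) = P(X <= x): accumulate the PMF up to x."""
    return sum(p for value, p in pmf.items() if value <= x)

print(cdf(3))  # P(X <= 3), i.e. 1/2 for a fair die
```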
Key distributions you will encounter constantly:
| Distribution | Type | Parameters | Where in ML |
|---|---|---|---|
| Bernoulli | Discrete | p ∈ [0,1] | Binary classification, dropout (keep/drop each neuron) |
| Categorical | Discrete | p₁…pK summing to 1 | Token prediction, multiclass classification, softmax output |
| Gaussian (Normal) | Continuous | μ, σ² | Weight initialisation, VAE latent space, noise in diffusion |
| Uniform | Continuous | a, b | Random initialisation (Glorot), random sampling for data augmentation |
| Exponential | Continuous | λ | Waiting times, certain energy-based model connections |
The Gaussian distribution deserves special attention. It appears everywhere because of the Central Limit Theorem (CLT): the sum of many independent random variables with finite variance, suitably normalised, tends toward a Gaussian distribution as the number of terms grows, regardless of their individual distributions. Network activations are sums of many weighted inputs, which is why they tend to be approximately Gaussian, and in turn why Gaussian assumptions about noise and priors are reasonable in so many ML contexts.
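A quick simulation illustrates the CLT; the choice of Uniform(0,1) summands and the sample sizes below are arbitrary:

```python
import random
import statistics

# CLT sketch: sums of independent Uniform(0,1) draws look Gaussian.
# Each Uniform(0,1) has mean 0.5 and variance 1/12, so a sum of n = 30
# draws has mean 15 and standard deviation sqrt(30/12), about 1.58.
random.seed(0)
n, trials = 30, 20_000
sums = [sum(random.random() for _ in range(n)) for _ in range(trials)]

print(statistics.mean(sums))   # close to 15
print(statistics.stdev(sums))  # close to 1.58
```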
Expectation, Variance, Covariance
The expectation E[X] is the probability-weighted average value of a random variable. For a discrete variable: E[X] = Σₓ x · P(X = x). For a continuous variable: E[X] = ∫ x f(x) dx. Expectation is linear: E[aX + bY] = aE[X] + bE[Y], which is why loss functions defined as expectations over training data decompose into sums over examples.
Variance Var[X] measures how spread out values are around the mean. High variance means the distribution is wide; low variance means values cluster tightly around the mean. Standard deviation σ = √Var[X] is in the same units as X, which makes it more interpretable.
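The definitions above translate directly into code; here is a sketch using a Bernoulli(0.3) PMF as an assumed example:

```python
# Expectation and variance from a discrete PMF, computed by definition.
# Example distribution (an illustrative assumption): Bernoulli(p = 0.3).
pmf = {0: 0.7, 1: 0.3}

mean = sum(x * p for x, p in pmf.items())               # E[X] = Σ x·P(X=x)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # Var[X] = E[(X−E[X])²]
std = var ** 0.5                                        # same units as X

# Matches the closed-form Bernoulli results: E[X] = p = 0.3,
# Var[X] = p(1−p) = 0.21.
print(mean, var)
```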
Covariance measures whether two variables tend to move together:
- Positive covariance: when X is high, Y tends to be high.
- Negative covariance: when X is high, Y tends to be low.
- Zero covariance: no linear relationship (nonlinear dependence may still exist).

The covariance matrix Σ for a d-dimensional random vector x contains all pairwise covariances Σᵢⱼ = Cov(xᵢ, xⱼ), with variances on the diagonal. Covariance matrices appear in PCA, Gaussian distributions, Kalman filters, and the analysis of training data correlations. Correlated features cause multicollinearity, which inflates the variance of fitted coefficients; understanding covariance tells you when to apply decorrelation preprocessing.
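A small sketch of sample covariance and correlation computed by definition (the two feature lists are made-up data chosen so the relationship is perfectly linear):

```python
# Sample covariance: average product of deviations from the two means.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # ys = 2·xs, so covariance is strongly positive

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Normalising by the standard deviations gives the correlation in [−1, 1].
sx = (sum((x - mx) ** 2 for x in xs) / (n - 1)) ** 0.5
sy = (sum((y - my) ** 2 for y in ys) / (n - 1)) ** 0.5
corr = cov / (sx * sy)
print(cov, corr)  # correlation is exactly 1 for a perfect linear relation
```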
Bayes' Theorem
Bayes' theorem relates conditional probabilities and is arguably the single most important formula in probabilistic machine learning. In ML terms, for observed data D and model parameters θ:

P(θ | D) = P(D | θ) · P(θ) / P(D)

where:
- Prior P(θ): what you believe about the parameters before seeing data. L2 regularisation (weight decay) is equivalent to placing a Gaussian prior on weights.
- Likelihood P(D | θ): how probable the observed data is, given those parameters. This is what maximum likelihood estimation maximises.
- Posterior P(θ | D): the updated belief after seeing data. Bayesian inference computes the full posterior; point estimates collapse it to a single value.
- Evidence P(D): the marginalisation over all parameter values — usually intractable to compute exactly, which motivates variational inference and MCMC methods.
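The four quantities can be seen working together in a toy discrete example; the grid of candidate coin biases and the flip counts below are illustrative assumptions:

```python
# Bayes' rule on a discrete grid: inferring a coin's bias θ from flips.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]           # candidate parameter values
prior = [1 / len(thetas)] * len(thetas)       # uniform prior P(θ)
heads, tails = 7, 3                           # observed data D

# Likelihood P(D | θ) is proportional to θ^heads · (1−θ)^tails
likelihood = [t ** heads * (1 - t) ** tails for t in thetas]

# Posterior is proportional to prior × likelihood; the evidence P(D)
# is the normalising sum (tractable here only because the grid is tiny).
unnorm = [p * l for p, l in zip(prior, likelihood)]
evidence = sum(unnorm)
posterior = [u / evidence for u in unnorm]

best = thetas[posterior.index(max(posterior))]
print(best)  # θ = 0.7 gets the most posterior mass after 7 heads, 3 tails
```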
The frequentist perspective treats parameters as fixed (unknown) quantities and probabilities as long-run frequencies. The Bayesian perspective treats parameters as random variables with distributions representing degrees of belief. In practice, most deep learning is frequentist (MLE-based), but Bayesian ideas appear in regularisation, uncertainty estimation, hyperparameter search, and the probabilistic graphical model foundations of generative models.
Maximum Likelihood Estimation (MLE)
MLE finds the parameter values θ that make the observed training data most probable. Given N independent training examples x₁, …, xN, the likelihood is the product of individual probabilities: L(θ) = P(x₁ | θ) · P(x₂ | θ) · … · P(xN | θ) = ∏ᵢ P(xᵢ | θ).
The log-likelihood trick replaces the product with a sum by taking the logarithm: log L(θ) = Σᵢ log P(xᵢ | θ). This is valid because log is monotonically increasing (maximising log L is equivalent to maximising L), and it transforms numerically unstable products of small probabilities into stable sums of log-probabilities.
The critical connection to deep learning: cross-entropy loss is negative log-likelihood. When you train a classifier to minimise cross-entropy, you are performing MLE for a categorical distribution. The loss for a single example with true class y is ℓ = −log p(y), where p(y) is the probability the model assigns to the true class.
Summing this over all training examples gives the average negative log-likelihood — which is exactly the cross-entropy loss minimised by every classification neural network. Understanding MLE demystifies why cross-entropy is the "right" loss: it is the natural probabilistic objective for categorical prediction problems.
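The MLE/cross-entropy connection in a few lines (the predicted probability vectors and labels are invented for illustration):

```python
import math

# Cross-entropy loss as average negative log-likelihood of a
# categorical model over a (tiny, made-up) dataset.
probs = [
    [0.7, 0.2, 0.1],   # model's predicted distribution for example 1
    [0.1, 0.8, 0.1],   # model's predicted distribution for example 2
]
labels = [0, 1]        # true class indices

# Per-example loss is -log p(true class); averaging over examples
# gives exactly the cross-entropy loss minimised in classification.
nll = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
print(nll)  # about 0.29
```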
Key Distributions in Deep Learning
| Distribution | Where used | Key parameter |
|---|---|---|
| Gaussian N(μ, σ²) | Weight initialisation, VAE latent space, diffusion noise schedule | Mean μ, variance σ² |
| Categorical | Token sampling (softmax output), multiclass output head | Class probabilities p₁…pK (temperature T scales logits before softmax) |
| Bernoulli | Dropout (keep probability p), binary classification sigmoid | p = probability of 1 |
| Uniform | Glorot weight initialisation (U[−√(6/(n_in+n_out)), √(6/(n_in+n_out))]) | a = lower bound, b = upper bound |
| Dirichlet | Prior over categorical distributions in topic models (LDA) | Concentration α |
The softmax function converts a vector of real-valued logits z into a valid probability distribution over K classes: softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ).
Softmax outputs are a categorical distribution. The temperature parameter T (common in LLM sampling) divides the logits before softmax: softmax(z/T). Low T (T < 1) sharpens the distribution toward the most probable token (more deterministic). High T (T > 1) flattens it, giving more probability mass to low-probability tokens (more creative/random output).
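A sketch of softmax with temperature, assuming illustrative logit values:

```python
import math

def softmax(logits, temperature=1.0):
    """softmax(z/T): exponentiate scaled logits, then normalise."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up logits for three tokens/classes

sharp = softmax(logits, temperature=0.1)  # low T: near-deterministic
flat = softmax(logits, temperature=2.0)   # high T: closer to uniform
print(sharp[0], flat[0])  # the top token dominates at low T, not at high T
```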
Statistical Hypothesis Testing Basics
When you compare two models — model A achieves 87.3% accuracy, model B achieves 88.1% — is that difference meaningful? Statistical hypothesis testing provides a framework to answer this rigorously.
The null hypothesis H₀ typically asserts no difference (e.g., both models have equal expected accuracy). The alternative hypothesis H₁ asserts there is a difference. A p-value is the probability of observing a difference at least as large as measured, assuming H₀ is true. A small p-value (typically < 0.05) is evidence against H₀.
The p-value is NOT the probability that H₀ is true
A p-value of 0.03 means: if H₀ were true, there is a 3% chance of seeing data this extreme. It does not mean there is a 97% chance your model is genuinely better. This misinterpretation is one of the most common errors in empirical ML evaluation. With enough data, trivially small differences will be statistically significant. Always report effect size alongside p-values.
A t-test compares means from two samples, accounting for variance and sample size. In model evaluation, you might run each model on 5 random seeds and apply a paired t-test to see whether the accuracy difference is consistent across seeds or just the result of one lucky run.
Effect size quantifies the practical magnitude of a difference, independently of sample size. Cohen's d measures the difference in means divided by pooled standard deviation. A statistically significant improvement of 0.1% accuracy after testing on a million examples may have negligible effect size and be practically irrelevant for deployment decisions.
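A sketch of the paired t-statistic and Cohen's d for the multi-seed scenario above (the accuracy numbers are invented; converting t to a p-value needs the t-distribution CDF, e.g. from scipy.stats, which is omitted here):

```python
import statistics

# Paired comparison of two models over 5 seeds (made-up accuracies).
acc_a = [0.871, 0.873, 0.869, 0.875, 0.872]
acc_b = [0.880, 0.882, 0.879, 0.884, 0.881]

diffs = [b - a for a, b in zip(acc_a, acc_b)]
n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)

# Paired t-statistic: mean difference divided by its standard error.
t_stat = mean_d / (sd_d / n ** 0.5)

# Cohen's d for paired data: mean difference in units of its spread.
# It stays fixed as n grows, whereas t keeps growing with sample size.
cohens_d = mean_d / sd_d
print(t_stat, cohens_d)
```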
Good Evaluation Practice
- Report mean ± standard deviation over multiple runs
- Use paired tests when comparing models on the same eval set
- Check both statistical significance and effect size
- Use held-out test sets never seen during development
Caveats in ML Benchmarking
- Benchmark data contamination inflates apparent accuracy
- Aggregate metrics can mask demographic disparities
- Leaderboard overfitting: iterating on test sets inflates scores
- Compute differences confound model architecture comparisons
Checklist: Do You Understand This?
- Why does minimising cross-entropy loss in a classifier correspond to maximum likelihood estimation under a categorical distribution?
- In Bayes' theorem, what does the prior represent, and which regularisation technique is equivalent to placing a Gaussian prior on model weights?
- If the temperature of a language model is set to 0.1 vs 2.0, how does the shape of the output distribution change, and what effect does that have on generated text?
- What is the difference between statistical significance and effect size, and why do you need both when comparing two models?
- Why does the Gaussian distribution appear so commonly in ML, and what theorem explains this prevalence?
- If the covariance between two input features is strongly positive, what does that imply for a model that treats them as independent?