
Linear Algebra for ML

If you strip away the software abstractions — PyTorch, Keras, TensorFlow — what remains is a vast system of matrix operations executing on specialised hardware. Every forward pass through a neural network is a sequence of matrix multiplications and elementwise transformations. Every attention head computes three matrix products and a softmax. Every embedding lookup retrieves a row from a weight matrix. Linear algebra is not background knowledge for machine learning; it is machine learning, expressed in mathematical notation.

This page builds the concepts you need to read ML papers, understand what GPUs are actually doing, and reason about model architecture at a level below the framework API.

Why Linear Algebra

Data in ML is almost always represented as numeric arrays. A single greyscale image of 28×28 pixels is a matrix with 784 values. A batch of 32 such images is a 3-dimensional tensor of shape (32, 28, 28). A sentence passed through a tokeniser becomes a sequence of integer IDs, which are then looked up in an embedding matrix to produce a 2D matrix of shape (sequence_length, embedding_dim).

Neural network layers are parameterised linear transformations. A fully-connected layer with 512 input features and 256 output features holds a weight matrix W of shape (256, 512) and a bias vector b of shape (256,). The forward computation is simply y = Wx + b. Stacking many such layers — with non-linear activation functions between them — creates the expressive function approximators we call deep networks.
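As a concrete sketch, the 512→256 layer described above can be written in a few lines of NumPy (the random initialisation here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fully-connected layer: 512 input features -> 256 output features.
W = rng.standard_normal((256, 512)) * 0.01  # weight matrix, shape (256, 512)
b = np.zeros(256)                           # bias vector, shape (256,)

x = rng.standard_normal(512)                # one input activation vector
y = W @ x + b                               # forward pass: y = Wx + b
```

The output shape (256,) is determined entirely by the number of rows of W.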

Transformer attention is three simultaneous matrix projections (Q, K, V) followed by a scaled dot-product computation. The attention score between every query-key pair comes from one matrix multiplication; the weighted aggregation of values is another. The entire mechanism is expressible in a single equation of linear algebra, which is precisely why it maps so efficiently onto the massively parallel matrix engines inside GPUs and TPUs.

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V
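A minimal NumPy sketch of the formula above (a single head, no masking; the shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of value rows

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention(Q, K, V)                          # shape (4, 8)
```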

Vectors

A vector is an ordered list of numbers. In ML, a vector typically represents a single data point, a single token embedding, or the activations of a layer for one input. Notation: bold lowercase v or a column vector written as a vertical list. When we write vector shapes, we say v ∈ ℝⁿ meaning v has n real-valued components.

The dot product (inner product) of two vectors measures their alignment:

u · v = Σᵢ uᵢvᵢ = u₁v₁ + u₂v₂ + … + uₙvₙ

The L2 norm (Euclidean length) of a vector is:

‖v‖₂ = √(v₁² + v₂² + … + vₙ²) = √(vᵀv)

Cosine similarity divides the dot product by both norms, giving a value in [−1, 1] that measures directional similarity regardless of magnitude:

cos(u, v) = (u · v) / (‖u‖₂ · ‖v‖₂)

Cosine similarity is ubiquitous in retrieval. When you store document embeddings in a vector database and query with a question embedding, you rank documents by cosine similarity. A similarity close to 1.0 means the two embeddings are semantically similar; close to 0, unrelated; close to −1.0, roughly opposite in meaning.
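The definitions above amount to a few lines of NumPy; the vectors here are toy examples:

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(u, v) = (u · v) / (‖u‖₂ ‖v‖₂)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
assert np.isclose(cosine_similarity(u, 2 * u), 1.0)   # same direction, any scale
assert np.isclose(cosine_similarity(u, -u), -1.0)     # opposite direction

v = np.array([3.0, 0.0, -1.0])                        # u · v = 3 + 0 - 3 = 0
assert np.isclose(cosine_similarity(u, v), 0.0)       # orthogonal -> unrelated
```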

Matrices

A matrix is a 2D array of numbers with m rows and n columns — shape notation m×n. Matrices represent linear transformations: multiplying a matrix by a vector maps the vector to a new space.

The matrix-vector product Ax = b takes an m×n matrix A and an n-dimensional vector x and produces an m-dimensional vector b. Each component of b is the dot product of the corresponding row of A with x. This is how a neural network layer transforms its input: A is the weight matrix, x is the input activation vector, and b is the pre-activation output.

b = Ax where A is (m×n), x is (n×1), b is (m×1)

The matrix-matrix product AB requires the inner dimensions to match: if A is (m×k) and B is (k×n), the result is (m×n). Each element of the result (i, j) is the dot product of row i of A with column j of B. This is the most compute-intensive operation in deep learning — a single transformer forward pass for a large model performs billions of floating-point multiply-accumulate operations in this form.

The transpose Aᵀ flips a matrix along its diagonal — rows become columns. A (3×5) matrix transposed becomes (5×3). The transpose is used constantly in attention (computing QKᵀ), in gradient computation during backpropagation, and in formulating the normal equations for least-squares problems.
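The shape rules are worth internalising, since most framework bugs are shape mismatches. A quick NumPy check (shapes chosen for illustration):

```python
import numpy as np

A = np.ones((3, 5))
B = np.ones((5, 2))

assert (A @ B).shape == (3, 2)   # inner dims (5) match; result is (m×n)
assert A.T.shape == (5, 3)       # transpose swaps the axes

# QKᵀ in attention: queries (n_q×d) times transposed keys (d×n_k).
Q = np.ones((4, 8))
K = np.ones((6, 8))
assert (Q @ K.T).shape == (4, 6)
```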

Key Matrix Operations

Operation           | Notation | ML use
Dot product         | uᵀv      | Cosine similarity, attention scores, logistic regression
Matrix multiply     | AB       | Linear layer forward pass, attention (QKᵀ, ·V), projections
Transpose           | Aᵀ       | Attention keys, backprop gradient flow, normal equations
Elementwise multiply| A ⊙ B    | Dropout masks, gating mechanisms (GLU, SwiGLU), attention masks
Outer product       | uvᵀ      | LoRA updates (low-rank decomposition), gradient outer products
Matrix inverse      | A⁻¹      | Solving linear systems, Gaussian process inference, covariance inversion

Eigenvalues & Eigenvectors

An eigenvector of a square matrix A is a non-zero vector v that, when multiplied by A, only changes in scale (not direction):

Av = λv

The scalar λ is the corresponding eigenvalue. Geometrically, applying the transformation A to its eigenvector stretches or compresses it by λ but leaves its orientation unchanged. If λ is negative, the vector flips direction.
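The defining equation is easy to verify numerically; this uses a small diagonal matrix, whose eigenvalues are simply its diagonal entries:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)

# Each column of eigvecs satisfies A v = λ v.
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)
```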

In practice, why do eigenvalues matter? Consider the covariance matrix of a dataset. Its eigenvectors point in the directions of greatest variance in the data — these are the principal components used in PCA (Principal Component Analysis). The corresponding eigenvalues tell you how much variance each direction explains. PCA projects data onto the top-k eigenvectors, discarding directions of low variance, producing a lower-dimensional representation that preserves most information.
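A compact PCA sketch following exactly this recipe (the synthetic 2-D data, built so that most variance lies along one direction, is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: one direction dominates the variance.
X = rng.standard_normal((500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X -= X.mean(axis=0)                       # centre the data first

cov = X.T @ X / (len(X) - 1)              # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]         # sort by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the top-1 principal component (k = 1).
Z = X @ eigvecs[:, :1]
explained = eigvals[0] / eigvals.sum()    # fraction of variance explained
```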

Eigenvalues also govern training dynamics. The largest eigenvalue of the Hessian matrix (second derivatives of the loss) controls the maximum safe learning rate for gradient descent. Models with a flat loss landscape (small eigenvalues) are associated with better generalisation — understanding this connects directly to sharpness-aware minimisation (SAM) and related techniques.

Singular Value Decomposition (SVD)

SVD generalises eigendecomposition to non-square matrices and is one of the most powerful tools in all of applied mathematics. Any m×n matrix A can be factored as:

A = U Σ Vᵀ

Where U is an m×m orthogonal matrix (left singular vectors), Σ is an m×n diagonal matrix with non-negative values σ₁ ≥ σ₂ ≥ … ≥ 0 on the diagonal (singular values), and Vᵀ is the transpose of an n×n orthogonal matrix (right singular vectors).

The singular values in Σ tell you how much each component contributes. If you keep only the top-k singular values and discard the rest (truncated SVD), you get the best rank-k approximation of A in terms of Frobenius norm. This is the mathematical foundation of dimensionality reduction, topic modelling via LSA, and image compression.
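The optimality claim (the Eckart–Young theorem) can be checked numerically: the Frobenius error of the rank-k truncation equals the square root of the sum of the discarded squared singular values. A sketch with an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-2 approximation of A

# Frobenius error of the truncation = sqrt of the discarded σᵢ².
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt((s[k:] ** 2).sum()))
```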

Most importantly for modern ML: SVD is the theoretical basis of LoRA (Low-Rank Adaptation). LoRA fine-tunes a pretrained weight matrix W by adding a low-rank update ΔW = BA, where B is (d×r) and A is (r×k) with r ≪ min(d,k). This is equivalent to approximating the needed weight change with a rank-r matrix — precisely the insight from truncated SVD. Fine-tuning with LoRA updates only ~0.1–1% of parameters because most of the useful adaptation signal lives in a low-dimensional subspace.
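A toy sketch of the LoRA update for a single weight matrix (the values of d, k, and r are illustrative; real implementations also scale the update by a factor α/r, omitted here for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8                    # rank r ≪ min(d, k)

W = rng.standard_normal((d, k))          # frozen pretrained weight
B = np.zeros((d, r))                     # B starts at zero, so the initial
A = rng.standard_normal((r, k)) * 0.01   # update ΔW = BA is exactly zero

x = rng.standard_normal(k)
y = W @ x + B @ (A @ x)                  # adapted forward pass

# Trainable fraction for this matrix: (d·r + r·k) / (d·k)
frac = (d * r + r * k) / (d * k)         # 0.03125 at r = 8, d = k = 512
```

Note that computing B @ (A @ x) rather than (B @ A) @ x avoids ever materialising the full d×k update matrix.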

SVD Applications in ML

  • PCA via covariance matrix SVD
  • LoRA / QLoRA weight adaptation
  • Recommendation systems (matrix factorisation)
  • Latent Semantic Analysis (LSA)
  • Pseudoinverse computation (A⁺ = VΣ⁺Uᵀ)

Intuition for Singular Values

  • Large σᵢ → that dimension captures major structure
  • Near-zero σᵢ → that dimension is noise or redundant
  • Rank of A = number of non-zero singular values
  • Condition number = σ_max / σ_min (numerical stability)
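These facts can be checked directly; the matrix below is built from two outer products of independent vectors, so exactly two singular values are non-zero:

```python
import numpy as np

u1, v1 = np.array([1.0, 0.0, 0.0]), np.array([1.0, 2.0, 3.0, 4.0])
u2, v2 = np.array([0.0, 1.0, 0.0]), np.array([4.0, 3.0, 2.0, 1.0])
A = np.outer(u1, v1) + np.outer(u2, v2)   # shape (3, 4), rank 2

s = np.linalg.svd(A, compute_uv=False)    # singular values, descending
assert np.linalg.matrix_rank(A) == 2
assert np.sum(s > 1e-10) == 2             # rank = count of non-zero σᵢ

cond = s[0] / s[1]                        # ratio of the two non-zero σᵢ
```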

Norms and Distances

Norms measure the "size" of a vector or matrix. Several norms appear regularly in ML, each with different properties and uses.

For a vector x ∈ ℝⁿ:

L1 norm: ‖x‖₁ = Σᵢ |xᵢ| (sum of absolute values)
L2 norm: ‖x‖₂ = √(Σᵢ xᵢ²) (Euclidean length)
L∞ norm: ‖x‖∞ = max(|xᵢ|) (largest absolute value)

L2 regularisation (weight decay) adds λ‖w‖₂² to the loss, penalising large weights and encouraging solutions where parameter values are small and spread across many dimensions. The gradient of the L2 penalty is simply 2λw — proportional to the weight itself — which leads to the characteristic "shrinkage towards zero" of weight decay.

L1 regularisation (LASSO) adds λ‖w‖₁ to the loss. Its gradient is ±λ for each non-zero weight (constant, not proportional), which has a qualitatively different effect: L1 regularisation drives many weights to exactly zero, producing sparse models. Sparse weights are easier to interpret and compress. Mixture-of-Experts sparsity and attention sparsity both relate to this L1-inducing sparsity principle.
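The qualitative difference shows up even on a single weight if you apply only the penalty gradients (the λ and learning-rate values here are arbitrary):

```python
import numpy as np

lam, lr = 0.5, 0.1
w_l1 = w_l2 = 0.5

for _ in range(20):
    w_l2 -= lr * 2 * lam * w_l2          # gradient of λ‖w‖₂² is 2λw
    w_l1 -= lr * lam * np.sign(w_l1)     # gradient of λ‖w‖₁ is λ·sign(w)
    w_l1 = max(w_l1, 0.0)                # don't overshoot past zero

# w_l2 shrinks geometrically but never reaches zero;
# w_l1 hits exactly 0.0 and stays there.
```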

The Frobenius norm of a matrix A is the L2 norm of all its entries treated as a single vector:

‖A‖_F = √(Σᵢ Σⱼ aᵢⱼ²) = √(trace(AᵀA)) = √(Σᵢ σᵢ²)

The last equality — that the Frobenius norm equals the square root of the sum of squared singular values — ties norms directly back to SVD and is the reason truncated SVD provides the optimal low-rank approximation measured by Frobenius norm.
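All three forms of the identity can be verified in NumPy on an arbitrary matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

fro = np.linalg.norm(A, "fro")
s = np.linalg.svd(A, compute_uv=False)

assert np.isclose(fro, np.sqrt((A ** 2).sum()))      # entrywise definition
assert np.isclose(fro, np.sqrt(np.trace(A.T @ A)))   # trace form
assert np.isclose(fro, np.sqrt((s ** 2).sum()))      # sum of squared σᵢ
```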

Common Confusion: L2 Norm vs L2 Loss

"L2 regularisation" and "L2 loss" are different things. L2 regularisation adds ‖w‖₂² to the loss to penalise large weights. L2 loss (mean squared error) uses ‖y - ŷ‖₂² as the training objective. Both involve squared L2 norms but they play entirely different roles in the optimisation problem.

Checklist: Do You Understand This?

  • Can you explain why a neural network layer forward pass is a matrix-vector product, and what the shape of the weight matrix determines?
  • If two embedding vectors have cosine similarity 0.92, what does that mean semantically? What would cosine similarity of 0.0 mean?
  • What does SVD decompose a matrix into, and why does truncated SVD give the best low-rank approximation?
  • How does LoRA use the low-rank factorisation insight from SVD to reduce fine-tuning parameter count?
  • What is the difference in practical effect between L1 and L2 regularisation on model weights?
  • If a covariance matrix has eigenvectors and eigenvalues, what does the largest eigenvalue's eigenvector represent about the dataset?