
Calculus & Optimization for ML

Training a neural network is fundamentally an optimization problem. You have a loss function — a number that measures how wrong the model is — and your goal is to find the set of weights that makes that number as small as possible. Calculus is the tool that tells you which direction to move the weights, and by how much. Without derivatives there is no training; without the chain rule there is no backpropagation; without gradient descent there is no way to navigate a loss landscape with billions of parameters.

Why Calculus in ML

The loss function L maps a set of weights W to a scalar value representing model error. To minimize L you need to know its local slope with respect to every weight — that slope is the derivative. For a network with millions of parameters, you need the slope along every dimension simultaneously, which is the gradient. Backpropagation is simply the chain rule applied recursively through the computation graph to compute that gradient efficiently. The entire modern deep learning ecosystem — from transformers to diffusion models — rests on this one idea: differentiate through the computation, then move weights in the direction that reduces loss.

Derivatives and Gradients

A derivative measures the instantaneous rate of change of a function with respect to one variable. For a function f(x), the derivative f'(x) tells you how much f changes when x changes by a tiny amount. When a function has multiple inputs — as every neural network layer does — you take a partial derivative for each input, holding the others fixed.

∂L/∂w₁ — how much the loss changes when weight w₁ changes, all other weights fixed

The gradient ∇L is the vector of all partial derivatives stacked together. It points in the direction of steepest ascent of L. To reduce L we move in the opposite direction — the negative gradient. This is gradient descent: take a small step downhill on the loss surface with every iteration.

∇L(W) = [∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ]
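The descent loop above can be sketched in a few lines. This is a toy example on the illustrative loss L(w) = w₁² + 3w₂² (whose gradient is easy to write analytically), not a library API:

```python
# Gradient descent on a toy loss L(w) = w1^2 + 3*w2^2.
# The gradient [∂L/∂w1, ∂L/∂w2] = [2*w1, 6*w2] points uphill,
# so each step moves the weights along the NEGATIVE gradient.

def loss(w):
    return w[0] ** 2 + 3 * w[1] ** 2

def grad(w):
    return [2 * w[0], 6 * w[1]]   # [∂L/∂w1, ∂L/∂w2]

w = [4.0, -2.0]
eta = 0.1                          # learning rate
for _ in range(100):
    g = grad(w)
    w = [wi - eta * gi for wi, gi in zip(w, g)]   # step downhill

print(w)   # both weights shrink toward the minimum at (0, 0)
```

The starting point and learning rate are arbitrary; the point is that repeatedly stepping against the gradient drives both coordinates toward the minimizer.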

The Chain Rule

The chain rule is the mathematical principle that makes backpropagation work. It states that the derivative of a composition of functions is the product of the derivatives of its components, each evaluated at the appropriate inner value:

d/dx [f(g(x))] = f′(g(x)) · g′(x)

A neural network is exactly a composition of functions. Consider a minimal two-layer network:

  1. Linear layer: z = Wx + b
  2. Activation: a = σ(z)
  3. Loss: L = loss(a, y)

To update W, we need ∂L/∂W. Applying the chain rule:

∂L/∂W = (∂L/∂a) · (∂a/∂z) · (∂z/∂W) = loss_grad · σ′(z) · xᵀ

Each term is easy to compute locally. The power of backpropagation is that it applies this rule layer by layer, from the output back to the input, reusing intermediate results. No matter how deep the network, the chain rule lets you compute every gradient in a single backward pass.
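The backward pass above can be checked by hand in the scalar case. Here is a minimal sketch with z = w·x + b, a sigmoid activation, and a squared-error loss; all numeric values are illustrative, and the finite-difference check at the end confirms the chain-rule gradient:

```python
import math

# Manual backprop for the minimal two-layer example, scalar case:
#   z = w*x + b,  a = sigmoid(z),  L = (a - y)^2

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 1.5, 0.0
w, b = 0.8, -0.3

# forward pass
z = w * x + b
a = sigmoid(z)
L = (a - y) ** 2

# backward pass: one local derivative per step, multiplied together
dL_da = 2 * (a - y)             # ∂L/∂a
da_dz = a * (1 - a)             # σ'(z) = σ(z)·(1 − σ(z))
dz_dw = x                       # ∂z/∂w
dL_dw = dL_da * da_dz * dz_dw   # ∂L/∂w by the chain rule

# sanity check against a finite-difference estimate
eps = 1e-6
L_plus = (sigmoid((w + eps) * x + b) - y) ** 2
numeric = (L_plus - L) / eps
print(dL_dw, numeric)   # the two estimates agree closely
```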

Gradient Descent Variants

Not all gradient descent is equal. The choice of how many examples you use to estimate the gradient per step significantly affects training dynamics.

Variant | Update Rule | Key Property
Batch GD | W ← W - η · ∇L(full dataset) | Exact gradient; stable; very slow on large datasets
SGD | W ← W - η · ∇L(one example) | Noisy gradient; fast updates; can escape saddle points
Mini-batch SGD | W ← W - η · ∇L(batch of k) | Best of both; used in virtually all modern training
Adam | Per-parameter adaptive η | Momentum + adaptive rates; default for most deep learning
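A mini-batch SGD loop can be sketched as follows. This toy fits a 1-D linear model to noiseless data y = 2x; the batch size, learning rate, and dataset are illustrative assumptions:

```python
import random

# Mini-batch SGD on 1-D linear regression with squared loss.
# Each step estimates the gradient from a small random batch
# rather than the full dataset or a single example.

random.seed(0)
data = [(x, 2.0 * x) for x in [i / 100 for i in range(100)]]

w, eta, batch_size = 0.0, 0.1, 8
for _ in range(200):
    batch = random.sample(data, batch_size)       # noisy gradient estimate
    g = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
    w -= eta * g                                  # descend the estimate

print(w)   # close to the true slope 2.0
```

Each batch gives a different (noisy) gradient, yet the iterates still converge, which is exactly why mini-batch SGD dominates in practice: cheap steps, acceptable noise.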

Convexity

A convex function has the property that the line segment connecting any two points on its graph lies on or above the graph. For optimization this is ideal: every local minimum is also a global minimum, so gradient descent (with a suitable learning rate) converges to the best solution. Logistic regression and SVMs have convex loss functions.

Neural networks are non-convex. The loss landscape has saddle points, flat plateaus, narrow valleys, and many local minima. In theory this should make optimization intractable. In practice, for large overparameterized networks, most local minima are nearly as good as the global minimum — the loss landscape is surprisingly benign at scale. Empirical results from training GPT-scale models confirm this: even though we cannot guarantee optimality, stochastic gradient descent reliably finds excellent solutions.

Learning Rate and Saddle Points

The learning rate η controls how large each update step is. It is the most important hyperparameter in training:

Too High

Updates overshoot the minimum. Loss oscillates or diverges. The model never converges.

Too Low

Training is extremely slow. Gradient becomes tiny in flat regions. Model may get stuck before reaching a good minimum.

Saddle points are locations where the gradient is zero but the point is neither a maximum nor a minimum — the curvature is positive in some dimensions and negative in others. In high-dimensional spaces (millions of parameters), saddle points are far more common than true local minima. Pure gradient descent stalls at saddle points. Mini-batch SGD naturally escapes them because the noise in each gradient estimate provides random perturbations that kick the optimizer off the flat region.
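The stall-versus-escape behavior can be demonstrated on the classic saddle f(x, y) = x² − y², whose gradient vanishes at the origin (an illustrative toy, not a real loss). Injected Gaussian noise stands in for mini-batch sampling noise:

```python
import random

# f(x, y) = x^2 - y^2 has a saddle at the origin: curvature is
# positive along x, negative along y. Starting on the ridge (y = 0),
# plain gradient descent slides into the saddle and stalls; noisy
# gradients kick y off the ridge, so the optimizer escapes.

random.seed(1)
eta = 0.1

def grad(x, y):
    return 2 * x, -2 * y

# plain gradient descent: y stays exactly 0, so we stall at (0, 0)
x, y = 0.5, 0.0
for _ in range(100):
    gx, gy = grad(x, y)
    x, y = x - eta * gx, y - eta * gy
plain = (x, y)

# noisy gradients: random perturbations push y off the ridge
x, y = 0.5, 0.0
for _ in range(100):
    gx, gy = grad(x, y)
    x = x - eta * (gx + random.gauss(0, 0.01))
    y = y - eta * (gy + random.gauss(0, 0.01))
noisy = (x, y)

print(plain)   # ≈ (0, 0): stuck at the saddle
print(noisy)   # |y| has grown large: escaped along the -y direction
```

Because f decreases without bound along y, "escape" here means leaving the zero-gradient region; on a real loss surface the optimizer would continue down into a basin.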

Learning rate schedules — warmup then decay (cosine, linear) — are standard practice: start with a low rate, ramp up to avoid instability early in training, then decay to enable fine-grained convergence. Transformers universally use warmup.
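A warmup-then-cosine schedule can be sketched in a few lines. The peak rate, warmup length, and total step count below are illustrative values, not prescriptions:

```python
import math

# Linear warmup followed by cosine decay: ramp from 0 to the peak
# learning rate over the warmup steps, then follow half a cosine
# down to 0 over the remaining steps.

def lr_at(step, peak=3e-4, warmup=1000, total=10000):
    if step < warmup:
        return peak * step / warmup                  # linear warmup
    progress = (step - warmup) / (total - warmup)    # 0 → 1 over decay
    return peak * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(0))       # 0.0 — start cold
print(lr_at(1000))    # 3e-4 — peak at the end of warmup
print(lr_at(10000))   # ≈ 0 — fully decayed
```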

Second-Order Methods

First-order methods only use the gradient (first derivatives). Second-order methods also use the Hessian — the matrix of second derivatives — to get curvature information. Newton's method uses the Hessian to take curvature-aware steps that can converge much faster:

W ← W - H⁻¹ · ∇L (Newton's method)

The problem: for a network with n parameters, the Hessian has n² entries. A model with 100 million parameters would need a 10¹⁶-entry matrix — completely impractical. This is why all large-scale training uses first-order methods. L-BFGS (a quasi-Newton method) approximates the Hessian efficiently and is used for smaller problems like fine-tuning with limited data.
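Newton's one-step behavior is easiest to see in one dimension, where the Hessian is just the scalar second derivative. For the illustrative quadratic L(w) = (w − 3)² + 1:

```python
# Newton's method in 1-D: L(w) = (w - 3)^2 + 1 has gradient
# L'(w) = 2*(w - 3) and constant second derivative L''(w) = 2
# (the 1x1 "Hessian"). On a quadratic, the curvature-aware step
# W ← W - H⁻¹·∇L lands on the exact minimum in a single update.

def grad(w):
    return 2 * (w - 3)

hess = 2.0

w = 10.0
w = w - grad(w) / hess   # one Newton step
print(w)                 # 3.0 — the exact minimizer
```

Gradient descent on the same function would need many steps; the catch, as noted above, is that forming (let alone inverting) the Hessian is hopeless at n² entries for large n.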

Adaptive Optimizers

Adaptive optimizers automatically adjust the learning rate for each parameter individually based on gradient history. This eliminates much of the sensitivity to a global learning rate.

Adam (Adaptive Moment Estimation)

Combines momentum (exponential moving average of past gradients) with RMSProp (exponential moving average of squared gradients). Each parameter gets its own effective learning rate, scaled by how consistent its gradient direction has been. Default choice for transformers, CNNs, and most modern architectures.

m_t = β₁·m_(t-1) + (1-β₁)·g_t    (momentum)
v_t = β₂·v_(t-1) + (1-β₂)·g_t²   (squared gradient)
m̂_t = m_t / (1-β₁ᵗ),  v̂_t = v_t / (1-β₂ᵗ)   (bias correction)
W ← W - η · m̂_t / (√v̂_t + ε)

AdamW

Adam with decoupled weight decay. In vanilla Adam, L2 regularization interacts with the adaptive scaling in a suboptimal way. AdamW applies weight decay directly to weights rather than through the gradient. This is the standard optimizer for training large language models (GPT, BERT, Llama).

Adagrad

Accumulates all past squared gradients. Frequently updated parameters get smaller learning rates over time; infrequently updated parameters get larger rates. Useful for sparse data. Weakness: learning rate monotonically decreases and can reach near-zero — training stalls. RMSProp and Adam fix this by using exponential decay instead of full accumulation.
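Adagrad's stalling weakness is visible in a toy with a constant gradient, where the accumulator only ever grows and the effective step η/√G shrinks monotonically:

```python
import math

# Adagrad effective step size under a constant gradient g = 1.
# G accumulates ALL past squared gradients, so η / sqrt(G)
# can only shrink — eventually updates become negligible.

eta, eps, G = 0.1, 1e-8, 0.0
steps = []
for t in range(1, 1001):
    g = 1.0                     # pretend the gradient is constant
    G += g * g                  # full accumulation, never decays
    steps.append(eta * g / (math.sqrt(G) + eps))

print(steps[0], steps[-1])   # step shrinks from 0.1 toward 0
```

RMSProp and Adam replace the full sum G with an exponential moving average, so the effective step can recover when gradients shrink.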

Practical Optimizer Guidance

  • Default for transformers and LLMs: AdamW with cosine decay + warmup
  • Vision models (CNNs): SGD with momentum often outperforms Adam at convergence, though Adam trains faster
  • Fine-tuning small datasets: L-BFGS or lower-LR Adam with early stopping
  • Reinforcement learning: Adam is standard; PPO training uses Adam

Checklist: Do You Understand This?

  • Can you explain what a gradient is and why ML follows the negative gradient?
  • Can you apply the chain rule to compute ∂L/∂W for a two-layer network?
  • Do you understand why mini-batch SGD is preferred over full-batch gradient descent?
  • Can you explain what a saddle point is and why it matters more than local minima in deep learning?
  • Do you understand what makes AdamW different from Adam, and why it is preferred for LLMs?
  • Can you explain why second-order methods are not used for large neural networks?