Feedforward Networks
A feedforward network is the foundational computational unit of deep learning. Information flows in one direction — from input to output — through a series of layers, each performing a linear transformation followed by a non-linear activation. Understanding how these networks work mechanically is essential before studying any specialised architecture: transformers, convolutional nets, and diffusion models all build on top of these same principles.
The Perceptron
Frank Rosenblatt introduced the perceptron in 1958 as a mathematical model of a biological neuron. The computation is straightforward: take a vector of inputs, multiply each by a learned weight, sum the results, add a bias term, then apply a threshold function. If the weighted sum exceeds the threshold, the neuron fires (outputs 1); otherwise it stays silent (outputs 0).
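The computation described above can be sketched in a few lines. The weights below are hand-picked to compute logical AND; they are an illustrative choice, not anything from the original 1958 model:

```python
import numpy as np

def perceptron(x, w, b):
    """Rosenblatt's perceptron: weighted sum plus bias, then a hard threshold."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Hand-chosen weights so the neuron fires only when both inputs are 1 (logical AND)
w = np.array([1.0, 1.0])
b = -1.5

print(perceptron(np.array([1, 1]), w, b))  # 1: weighted sum 2.0 - 1.5 = 0.5 exceeds the threshold
print(perceptron(np.array([1, 0]), w, b))  # 0: weighted sum 1.0 - 1.5 = -0.5 stays silent
```

Learning, in the original algorithm, means nudging w and b whenever the threshold output disagrees with the label; the forward computation itself never changes.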
This model generated enormous excitement — and then a sharp backlash. Minsky and Papert demonstrated in their 1969 book Perceptrons that the single-layer model cannot solve the XOR problem: it can only learn decision boundaries that are straight lines in input space. XOR is not linearly separable — no single line can divide the four XOR input points into their correct classes. This result essentially froze neural network funding for over a decade, a period known as the first AI winter. The resolution, which came in the 1980s, was not to discard the perceptron but to stack multiple layers of them.
From Perceptron to MLP
A Multi-Layer Perceptron (MLP) adds one or more hidden layers between the input and output. The hidden layers project the input into a new representation, and the output layer maps that representation to the final prediction. The crucial insight is that stacked layers can represent non-linear decision boundaries of arbitrary complexity — they can curve, fold, and partition the input space in ways that a single layer cannot.
But depth alone is not enough. If every layer applies only a linear transformation, then the composition of all the layers is still a linear transformation. Formally, if layer 1 computes W₁x and layer 2 computes W₂(W₁x), the result is simply (W₂W₁)x — equivalent to a single matrix multiplication. No matter how many linear layers you stack, the network can only learn linear functions. This is why activation functions are not optional. They are what makes depth meaningful.
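The collapse of stacked linear layers can be checked numerically; the shapes and random seed here are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # layer 1: maps 3 features to 4
W2 = rng.standard_normal((2, 4))   # layer 2: maps 4 features to 2
x = rng.standard_normal(3)

# Two stacked linear layers with no activation between them...
two_layers = W2 @ (W1 @ x)
# ...equal one linear layer whose matrix is the product W2 @ W1
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```

Inserting any non-linearity between the two matrix multiplies breaks this equivalence, which is exactly the point of activation functions.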
Anatomy of a Layer
Each layer in an MLP performs the same operation. Given an input vector x of dimension n_in, the layer produces an output vector of dimension n_out:

y = φ(W·x + b)

where W is an n_out × n_in weight matrix, b is a bias vector of length n_out, and φ is the activation function applied element-wise.
In practice, inputs arrive in batches. If a batch contains B examples, x has shape (n_in, B) and the output has shape (n_out, B). The matrix multiply W·x handles all B examples in parallel — this is the key reason GPUs accelerate deep learning so effectively. GPUs are designed to perform large matrix multiplications rapidly, and training a neural network is largely a sequence of such operations.
The parameter count scales as n_out × n_in per layer. A single hidden layer with 512 input features and 1024 hidden units requires 512 × 1024 + 1024 = 525,312 parameters — already half a million for one layer. This is why large models have billions of parameters: they have many wide layers stacked deep.
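A minimal sketch of one batched layer, using the 512 → 1024 sizes from the paragraph above; the batch size, initialisation scale, and ReLU choice are illustrative assumptions:

```python
import numpy as np

n_in, n_out, B = 512, 1024, 32
rng = np.random.default_rng(0)

W = rng.standard_normal((n_out, n_in)) * 0.01  # weight matrix, shape (n_out, n_in)
b = np.zeros((n_out, 1))                       # bias, broadcast across the batch
x = rng.standard_normal((n_in, B))             # batch of B examples as columns

y = np.maximum(0.0, W @ x + b)  # one layer: linear transformation, then ReLU

print(y.shape)          # (1024, 32): n_out features for each of the B examples
print(W.size + b.size)  # 525312, matching the parameter count computed above
```

The single `W @ x` matrix multiply processes all 32 examples at once, which is the batching parallelism the paragraph above attributes to GPUs.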
Activation Functions
The choice of activation function has evolved substantially as networks grew deeper. Earlier functions like sigmoid and tanh were replaced by ReLU in hidden layers because it avoids the vanishing gradient problem in deep stacks. More recent models use smoother variants suited to specific architectures.
| Function | Formula | Output Range | Primary Use |
|---|---|---|---|
| Sigmoid | 1 / (1 + e⁻ˣ) | (0, 1) | Binary classification output layer |
| Tanh | (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ) | (−1, 1) | Hidden layers (historical), RNNs |
| ReLU | max(0, x) | [0, ∞) | Default for hidden layers (CNNs, MLPs) |
| GELU | x · Φ(x) | ≈ (−0.17, ∞) | Transformers (BERT, GPT) |
| SiLU / Swish | x · sigmoid(x) | ≈ (−0.28, ∞) | LLaMA, Mistral, modern LLMs |
Sigmoid and tanh both suffer from saturation: when x is very large or very small, the gradient of the function approaches zero. This causes the vanishing gradient problem in deep networks. ReLU solves this for positive inputs — its gradient is exactly 1 for any positive value, so gradients pass backward through the layer without shrinking. However, ReLU can produce "dead neurons": if a neuron's weighted sum is always negative, its output is always zero and it never receives a gradient update, effectively becoming permanently inactive. GELU and SiLU are smooth variants that avoid ReLU's hard zero for negative inputs and empirically outperform it in large language model training.
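The table's functions can be sketched directly. Note that the GELU below uses the common tanh approximation to x · Φ(x) rather than the exact Gaussian CDF, an assumption that matches what most implementations ship by default:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def silu(x):
    # SiLU / Swish: x weighted by its own sigmoid; smooth, slightly negative for x < 0
    return x * sigmoid(x)

def gelu(x):
    # Tanh approximation of x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

print(relu(-2.0))  # 0.0: hard zero, the source of dead neurons
print(silu(-2.0))  # small negative value: the smooth variants keep a gradient alive
```

The contrast at x = −2 is the practical difference the paragraph above describes: ReLU outputs exactly zero (and its gradient is zero too), while SiLU and GELU stay slightly negative and differentiable.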
The Universal Approximation Theorem
The Universal Approximation Theorem, proved by Cybenko (1989) and generalised by Hornik (1991), states that a feedforward network with a single hidden layer of sufficient width can approximate any continuous function on a compact subset of Rⁿ to arbitrary precision. The theorem is profound because it establishes that MLPs are not limited to some subset of functions — in principle, they can represent anything.
The practical limitation is the word "sufficient width." A shallow network may need exponentially many neurons to represent a function that a deeper network can represent with far fewer total parameters. This is the theoretical motivation for depth: deep networks are more parameter-efficient than wide shallow ones for many real-world functions. The theorem tells you what's theoretically possible. It says nothing about whether gradient descent will find the right parameters, how much data you need, or how long training will take. Those are separate, harder problems.
What the theorem guarantees:
- Any continuous function can be approximated
- One hidden layer is theoretically sufficient
- Width can compensate for lack of depth

What it does not guarantee:
- That training will find the right weights
- That the network will generalise to new data
- That a shallow network is efficient
The Forward Pass
Consider a 3-layer MLP for a classification task with 784 input features (e.g., a flattened 28×28 image), two hidden layers of 256 neurons each, and 10 output classes. The forward pass applies each layer in turn: the input is multiplied by the first weight matrix and shifted by its bias, the result passes through ReLU, the second hidden layer repeats the same pattern, and the output layer produces 10 unnormalised scores (logits).
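One way this forward pass might be sketched in NumPy. He initialisation and a single-example input are assumptions here, not something the text specifies:

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    # He initialisation, a common choice for ReLU layers (an assumption, not fixed by the text)
    W = rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)
    b = np.zeros((n_out, 1))
    return W, b

rng = np.random.default_rng(0)
W1, b1 = init_layer(784, 256, rng)   # input -> hidden 1
W2, b2 = init_layer(256, 256, rng)   # hidden 1 -> hidden 2
W3, b3 = init_layer(256, 10, rng)    # hidden 2 -> output

x = rng.standard_normal((784, 1))        # one flattened 28x28 image
h1 = np.maximum(0.0, W1 @ x + b1)        # hidden layer 1: shape (256, 1)
h2 = np.maximum(0.0, W2 @ h1 + b2)       # hidden layer 2: shape (256, 1)
logits = W3 @ h2 + b3                    # output layer: shape (10, 1), no activation

print(logits.shape)  # (10, 1)
```

The logits would typically be passed through softmax (covered below) to obtain class probabilities.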
Each layer progressively transforms the representation. The first hidden layer might learn low-level patterns — combinations of pixel intensities that activate when certain edges or textures are present. The second hidden layer combines those into higher-level concepts — parts of shapes or consistent feature groupings. The output layer uses those high-level representations to make the final class prediction. This hierarchical representation learning is the deep learning advantage: the network discovers intermediate features automatically, rather than requiring hand-engineered feature extraction.
Softmax and Output Layer Design
The softmax function converts a vector of real-valued logits (unconstrained scores) into a probability distribution — all values sum to 1, and each value lies in (0, 1):

softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
The choice of output layer activation depends on the task. For binary classification, a single sigmoid neuron outputs a probability of the positive class. For multi-class classification (mutually exclusive classes), softmax over all class logits. For multi-label classification (multiple classes can be true simultaneously), sigmoid applied independently to each logit. For regression, no activation at all — the raw linear output is the prediction, allowing any real value.
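A minimal softmax sketch. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick not mentioned above; it leaves the result unchanged mathematically but prevents overflow for large logits:

```python
import numpy as np

def softmax(z):
    # Shift by the max logit for numerical stability; the ratio is unaffected
    e = np.exp(z - np.max(z))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.sum())     # 1.0: a valid probability distribution
print(probs.argmax())  # 0: the largest logit gets the largest probability
```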
Overfitting and Regularisation
Overfitting occurs when a network memorises the training data rather than learning generalisable patterns. The symptom is a widening gap between training loss (continues to fall) and validation loss (stalls or increases). Several standard techniques address this:
Dropout: During training, randomly set a fraction p of activations to zero. Each training step uses a different random sub-network. At inference time, all neurons are active and weights are scaled by (1−p). Forces the network not to rely on any single neuron and builds redundant representations.
Weight decay (L2 regularisation): Add a penalty term λ · ‖W‖² to the loss function. This penalises large weights, pulling them toward zero. Equivalent to placing a Gaussian prior over weights in the Bayesian view. Prevents any single weight from becoming excessively large, which often indicates overfitting.
Batch normalisation: Normalise activations within each mini-batch to zero mean and unit variance, then apply learned scale and shift parameters. Reduces internal covariate shift, allows higher learning rates, and provides mild regularisation. Standard in CNNs; layer normalisation is preferred in transformers.
Early stopping is another widely used approach: monitor validation loss during training and stop when it stops improving. It requires no changes to the network architecture and is computationally free. In practice, most training runs use a combination of weight decay and dropout, with batch normalisation in convolutional architectures.
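The dropout scheme described above (zero a fraction p of activations during training, scale by 1 − p at inference) might be sketched as follows. Applying the inference-time scaling to activations rather than weights is an equivalent simplification assumed here:

```python
import numpy as np

def dropout_train(h, p, rng):
    # Training mode: zero each activation independently with probability p
    mask = rng.random(h.shape) >= p
    return h * mask

def dropout_eval(h, p):
    # Inference mode: keep every unit and scale by (1 - p), matching expected activation
    return h * (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(10)
print(dropout_train(h, 0.5, rng))  # roughly half the activations zeroed, rest unchanged
print(dropout_eval(h, 0.5))        # every activation kept, scaled to 0.5
```

Many modern implementations instead use "inverted" dropout, dividing by (1 − p) during training so inference needs no scaling at all; the two schemes are equivalent in expectation.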
Checklist: Do You Understand This?
- Can you explain why a single-layer perceptron cannot solve XOR, and what mathematical property of the problem causes this?
- Can you write the output equation for one MLP layer, including shapes of W, x, and b, and calculate the parameter count?
- Can you explain why stacking linear layers without activation functions is equivalent to a single linear layer?
- Can you name the activation function used in LLaMA-family models and explain why it was chosen over ReLU?
- Can you state what the Universal Approximation Theorem guarantees and, equally importantly, what it does not guarantee?
- Can you describe what dropout does during training versus inference and explain the intuition behind why it reduces overfitting?