CNNs: Convolutional Neural Networks
Convolutional Neural Networks (CNNs) were the dominant architecture for computer vision for nearly a decade and remain widely deployed in production today. Their design insight is simple but profound: images have spatial structure, and that structure should be baked into the architecture rather than learned from scratch. This page explains why that matters, how convolution implements the idea, and what the landmark CNN architectures contributed.
The Problem MLPs Miss
A Multi-Layer Perceptron (MLP) treats every input feature as independent. For images, this means every pixel is connected to every neuron in the next layer with its own learned weight. The spatial relationship between adjacent pixels (the fact that a pixel and its neighbour are almost certainly part of the same edge or texture) is ignored entirely.
The scale problem makes this worse. A 224×224 RGB image has 150,528 input values. If the first hidden layer has just 1,000 neurons, that single layer requires over 150 million parameters, before any useful feature has been learned. The model would memorise training data rather than generalise. CNNs sidestep this by exploiting two constraints that are almost always true for images: nearby pixels are related, and the same local pattern (an edge, a corner) is meaningful wherever it appears in the image.
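The arithmetic above can be checked directly. Here is a sketch comparing the dense layer's parameter count with a single convolutional layer that shares a set of small filters (the 64-filter, 3×3 configuration is an illustrative assumption, not a figure from the text):

```python
# Dense layer: every one of the 150,528 input values gets its own weight
# per neuron, plus one bias per neuron.
inputs = 224 * 224 * 3          # 150,528 input values
hidden = 1_000
dense_params = inputs * hidden + hidden

# Conv layer: 64 filters of shape 3x3 over 3 input channels, each filter
# reused at every spatial position, plus one bias per filter.
filters, k, in_ch = 64, 3, 3
conv_params = filters * (k * k * in_ch) + filters

print(dense_params)  # 150,529,000 parameters
print(conv_params)   # 1,792 parameters
```

Weight sharing is what makes the second number independent of the image size: the filters cost the same whether the input is 32×32 or 224×224.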
The Convolution Operation
The core operation is sliding a small matrix, called a filter or kernel, across the input. At each position the filter is placed, every element of the filter is multiplied by the corresponding input value beneath it, and the products are summed to a single scalar. This produces one value in the feature map (the output). The filter slides to the next position (by a distance called the stride) and the process repeats.
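The sliding multiply-and-sum can be sketched in a few lines of plain Python. This is a minimal, unpadded version; note that deep learning frameworks actually compute cross-correlation (the filter is not flipped) under the name "convolution", and this sketch does the same:

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image`, summing elementwise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # One output value: elementwise multiply-and-sum under the filter
            total = sum(
                image[i * stride + di][j * stride + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(total)
        out.append(row)
    return out

# A 3x3 filter over a 4x4 input yields a 2x2 feature map (no padding, stride 1)
image = [[1, 2, 3, 0],
         [0, 1, 2, 3],
         [3, 0, 1, 2],
         [2, 3, 0, 1]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]
print(conv2d(image, kernel))  # [[9, 6], [4, 9]]
```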
A 3×3 vertical edge-detection filter might have weights of -1 in its left column, 0 in its centre column, and +1 in its right column.
Applied to an image, this filter produces large positive values where there is a vertical edge (dark on left, light on right) and large negative values for the opposite edge direction. Crucially, these filters are not hand-coded; the network learns them from data via backpropagation. Early in training the filters are noise; after training on millions of images they look strikingly like biological receptive fields.
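A minimal numeric sketch of that behaviour, correlating the vertical-edge filter with a dark-to-light step image (a toy example, computing a single output position for brevity):

```python
# Image with a vertical edge: dark (0) on the left, light (9) on the right
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]

# Vertical edge filter: -1 on the left column, 0 centre, +1 on the right
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

def response(image, kernel, top, left):
    """Filter response with the kernel's top-left corner at (top, left)."""
    return sum(image[top + i][left + j] * kernel[i][j]
               for i in range(3) for j in range(3))

print(response(image, kernel, 0, 0))   # 27: strong positive response on the edge

# Flipping the image left-to-right reverses the edge direction
flipped = [row[::-1] for row in image]
print(response(flipped, kernel, 0, 0))  # -27: the opposite-direction response
```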
Key CNN Concepts
| Concept | Definition | Benefit |
|---|---|---|
| Local receptive field | Each neuron only sees a small local patch of the input | Encodes spatial locality; nearby pixels processed together |
| Weight sharing | The same filter (same weights) is applied at every position | Drastically fewer parameters; filter learning is efficient |
| Translation equivariance | If the input shifts, the feature map shifts the same way | Recognise features regardless of where they appear in the image |
| Stride | Step size of the filter as it slides | Controls output spatial size; stride 2 halves dimensions |
| Padding | Adding zeros around the input border | Preserves spatial size; prevents edge information loss |
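Stride and padding together determine the output spatial size through the standard formula out = floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, p the padding, and s the stride. A quick sketch:

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1

# 224x224 input, 3x3 filter, stride 1, padding 1: spatial size preserved
print(conv_output_size(224, 3, stride=1, padding=1))  # 224
# The same filter with stride 2 halves the spatial dimensions
print(conv_output_size(224, 3, stride=2, padding=1))  # 112
```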
CNN Architecture Components
A full CNN stacks several types of layers:
Convolutional layers
Apply N filters to the input, producing N feature maps. Each filter detects one type of pattern. The depth of a conv layer (the number of filters) is a hyperparameter; 64, 128, or 256 channels are common.
Pooling layers
Downsample feature maps spatially. Max pooling takes the maximum value in each window (e.g. 2×2), retaining the strongest activation. Average pooling takes the mean. Pooling adds a degree of translation invariance (not just equivariance).
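Both pooling variants over non-overlapping 2×2 windows can be sketched as:

```python
def pool2x2(feature_map, op):
    """Downsample by applying op() to each non-overlapping 2x2 window."""
    out = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            window = [feature_map[i][j],     feature_map[i][j + 1],
                      feature_map[i + 1][j], feature_map[i + 1][j + 1]]
            row.append(op(window))
        out.append(row)
    return out

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]
print(pool2x2(fmap, max))                        # [[4, 2], [2, 8]]
print(pool2x2(fmap, lambda w: sum(w) / len(w)))  # [[2.5, 1.0], [0.75, 6.5]]
```

Max pooling keeps only the strongest activation per window, which is why a small shift of the input often leaves the pooled output unchanged: the invariance mentioned above.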
Activation (ReLU)
Applied after each conv layer. ReLU(x) = max(0, x) introduces non-linearity cheaply and largely avoids vanishing gradients, enabling deep stacking.
Fully connected layers
At the end of the convolutional stack, feature maps are flattened and passed through dense MLP layers to produce class scores or embeddings. Global average pooling (GAP) is a modern alternative that avoids the large parameter cost of FC layers.
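The parameter saving from GAP can be illustrated with representative sizes (the 7×7×512 final feature map and 1,000 classes are assumed, VGG-like figures, not values from the text):

```python
# Flatten + fully connected: every feature-map value connects to every class
h, w, channels, classes = 7, 7, 512, 1000
fc_params = h * w * channels * classes + classes

# Global average pooling: each channel collapses to its mean (no parameters),
# leaving only a small channels -> classes linear layer
gap_params = channels * classes + classes

print(fc_params)   # 25,089,000
print(gap_params)  # 513,000
```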
Depth is crucial. Stacking convolutional layers creates a hierarchical feature detector. Each layer operates on the feature maps produced by the layer below, so later layers have an effective receptive field that covers a much larger region of the original image without any single layer needing a large kernel.
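The growth of the effective receptive field can be computed directly. For a stack of conv layers, the receptive field grows by (k - 1) times the cumulative stride ("jump") at each layer; a sketch:

```python
def receptive_field(kernel_sizes, strides):
    """Effective receptive field after a stack of conv layers.
    r grows by (k - 1) * jump per layer; jump multiplies by each stride."""
    r, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        r += (k - 1) * jump
        jump *= s
    return r

# Two stacked 3x3, stride-1 convs together see a 5x5 region; three see 7x7,
# without any single layer needing a large kernel
print(receptive_field([3, 3], [1, 1]))        # 5
print(receptive_field([3, 3, 3], [1, 1, 1]))  # 7

# A stride-2 layer doubles the growth contributed by every layer above it
print(receptive_field([3, 3, 3], [1, 2, 1]))  # 9
```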
How Representations Evolve Through Layers
One of the most important empirical findings about CNNs is that the learned representations are interpretable and hierarchical:
- Early layers learn low-level features: oriented edges, colour blobs, simple gradients. These look almost identical across very different trained CNNs and across different datasets.
- Middle layers combine edges into textures and simple shapes: circles, grids, repeating patterns.
- Later layers respond to complex object parts: eyes, wheels, faces, windows.
- Final layers produce abstract class-discriminative features that fire for whole objects in context.
This hierarchy was confirmed by feature activation maximisation (synthesising an input that maximally activates a given neuron) and explains why transfer learning works so well: early- and middle-layer features generalise across tasks and datasets.
Classic Architectures
| Architecture | Year | Key Innovation |
|---|---|---|
| LeNet-5 | 1998 | Pioneered CNN for handwritten digit recognition; conv + pool + FC pattern established |
| AlexNet | 2012 | ReLU activations, dropout regularisation, GPU training; won ImageNet by a huge margin and ignited the deep learning era |
| VGG-16/19 | 2014 | Showed that depth with uniform 3ร3 filters beats wider shallower nets; highly transferable features |
| ResNet | 2015 | Residual (skip) connections: output = F(x) + x; solved vanishing gradient for very deep nets; trained 152-layer networks reliably |
| EfficientNet | 2019 | Compound scaling: jointly scale width, depth, and resolution by a fixed ratio; SOTA accuracy at much lower FLOP count |
ResNet's residual connections deserve special mention. The core problem with very deep networks was that gradients vanished as they propagated back through many layers. The skip connection creates a gradient highway: even if the learned residual F(x) is near zero, the identity path x guarantees gradient flow. This enabled networks far deeper than VGG (152 layers versus VGG's 19), with better accuracy and generalisation.
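A minimal numeric sketch of the residual idea, output = F(x) + x, on plain Python lists. The residual functions here are toy stand-ins for the pair of conv layers a real block would contain:

```python
def residual_block(x, residual_fn):
    """output = F(x) + x: the identity path is added to the learned residual."""
    fx = residual_fn(x)
    return [xi + fi for xi, fi in zip(x, fx)]

# Even if the learned residual is (near) zero, the block passes x through
# unchanged, so stacking many such blocks cannot destroy the signal
zero_residual = lambda x: [0 for _ in x]
print(residual_block([1, -2, 3], zero_residual))  # [1, -2, 3]

# A non-trivial residual nudges the representation instead of replacing it
shift_residual = lambda x: [1 for _ in x]
print(residual_block([1, -2, 3], shift_residual))  # [2, -1, 4]
```

The same additive structure explains the gradient highway: the derivative of F(x) + x with respect to x is F'(x) + 1, so the "+1" term carries gradient even when F'(x) vanishes.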
CNNs Beyond Images
The convolution idea generalises beyond 2D images. 1D convolutions slide a filter along a sequence, which is useful for time-series classification, raw audio waveforms, and character-level text. Temporal CNNs (TCNs) stack 1D convolutions with dilated kernels to capture long-range temporal dependencies efficiently without recurrence.
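A 1D convolution with dilation can be sketched the same way as the 2D case: a dilation of d spaces the filter taps d steps apart, so when dilations double from layer to layer (1, 2, 4, ...) the receptive field grows exponentially with depth. A toy example with a two-tap difference filter:

```python
def conv1d(seq, kernel, dilation=1):
    """Valid 1D convolution whose filter taps are `dilation` steps apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # input region covered by one output value
    return [sum(seq[i + j * dilation] * kernel[j] for j in range(k))
            for i in range(len(seq) - span + 1)]

seq = [1, 2, 4, 8, 16, 32, 64]
kernel = [1, -1]  # first-difference filter

print(conv1d(seq, kernel, dilation=1))  # [-1, -2, -4, -8, -16, -32]
print(conv1d(seq, kernel, dilation=2))  # [-3, -6, -12, -24, -48]
```

With dilation 2 the filter compares values two steps apart, covering a wider span of the sequence with the same two weights.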
Vision Transformers (ViT), introduced in 2020, now challenge CNNs on most image benchmarks. ViT splits the image into patches, treats each patch as a token, and applies self-attention. Trained on very large datasets, ViTs outperform CNNs; at smaller data scales, CNNs still have an advantage because their inductive biases (locality, weight sharing) act as a useful prior. In practice:
Use CNNs when...
- Deploying to edge / mobile devices (low FLOP count)
- Small to medium training data sizes
- Latency-sensitive real-time inference
- Fine-tuning a pre-trained vision backbone
Use ViTs when...
- Large-scale pretraining available (ImageNet-21k, LAION)
- Multimodal tasks (CLIP, Flamingo use ViT image encoders)
- Tasks needing global context across the whole image
- Unified architectures combining vision and language
Checklist: Do You Understand This?
- Can you explain why an MLP applied directly to pixels does not scale, and what property of images CNNs exploit instead?
- What does weight sharing mean and why does it reduce the parameter count compared to a fully connected layer?
- What is the difference between translation equivariance (from convolution) and translation invariance (from pooling)?
- What problem did residual connections in ResNet solve, and why does the identity shortcut help gradient flow?
- In what scenarios would you choose a CNN backbone over a Vision Transformer for a production deployment?