CNNs: Convolutional Neural Networks
Convolutional Neural Networks (CNNs) were the dominant architecture for computer vision for nearly a decade and remain widely deployed in production today. Their design insight is simple but profound: images have spatial structure, and that structure should be baked into the architecture rather than learned from scratch. This page explains why that matters, how convolution implements the idea, and what the landmark CNN architectures contributed.
The Problem MLPs Miss
A Multi-Layer Perceptron (MLP) treats every input feature as independent. For images, this means every pixel is connected to every neuron in the next layer with its own learned weight. The spatial relationship between adjacent pixels (the fact that a pixel and its neighbour are almost certainly part of the same edge or texture) is ignored entirely.
The scale problem makes this worse. A 224×224 RGB image has 150,528 input values. If the first hidden layer has just 1,000 neurons, that single layer requires over 150 million parameters, before any useful feature has been learned. The model would memorise training data rather than generalise. CNNs sidestep this by exploiting two constraints that are almost always true for images: nearby pixels are related, and the same local pattern (an edge, a corner) is meaningful wherever it appears in the image.
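The arithmetic above can be checked directly. Here is a sketch comparing the dense layer's parameter count with a single convolutional layer that shares a set of small filters (the 64-filter, 3×3 configuration is an illustrative assumption, not a figure from the text):

```python
# Dense layer: every one of the 150,528 input values gets its own weight
# per neuron, plus one bias per neuron.
inputs = 224 * 224 * 3          # 150,528 input values
hidden = 1_000
dense_params = inputs * hidden + hidden

# Conv layer: 64 filters of shape 3x3 over 3 input channels, each filter
# reused at every spatial position, plus one bias per filter.
filters, k, in_ch = 64, 3, 3
conv_params = filters * (k * k * in_ch) + filters

print(dense_params)  # 150,529,000 parameters
print(conv_params)   # 1,792 parameters
```

Weight sharing is what makes the second number independent of the image size: the filters cost the same whether the input is 32×32 or 224×224.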
The Convolution Operation
The core operation is sliding a small matrix, called a filter or kernel, across the input. At each position the filter is placed, every element of the filter is multiplied by the corresponding input value beneath it, and the products are summed to a single scalar. This produces one value in the feature map (the output). The filter slides to the next position (by a distance called the stride) and the process repeats.
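The sliding multiply-and-sum can be sketched in a few lines of plain Python. This is a minimal, unpadded version; note that deep learning frameworks actually compute cross-correlation (the filter is not flipped) under the name "convolution", and this sketch does the same:

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image`, summing elementwise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # One output value: elementwise multiply-and-sum under the filter
            total = sum(
                image[i * stride + di][j * stride + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(total)
        out.append(row)
    return out

# A 3x3 filter over a 4x4 input yields a 2x2 feature map (no padding, stride 1)
image = [[1, 2, 3, 0],
         [0, 1, 2, 3],
         [3, 0, 1, 2],
         [2, 3, 0, 1]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]
print(conv2d(image, kernel))  # [[9, 6], [4, 9]]
```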
A 3×3 vertical edge-detection filter might have weights of -1 in its left column, 0 in its centre column, and +1 in its right column.
Applied to an image, this filter produces large positive values where there is a vertical edge (dark on left, light on right) and large negative values for the opposite edge direction. Crucially, these filters are not hand-coded; the network learns them from data via backpropagation. Early in training the filters are noise; after training on millions of images they look strikingly like biological receptive fields.
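A minimal numeric sketch of that behaviour, correlating the vertical-edge filter with a dark-to-light step image (a toy example, computing a single output position for brevity):

```python
# Image with a vertical edge: dark (0) on the left, light (9) on the right
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]

# Vertical edge filter: -1 on the left column, 0 centre, +1 on the right
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

def response(image, kernel, top, left):
    """Filter response with the kernel's top-left corner at (top, left)."""
    return sum(image[top + i][left + j] * kernel[i][j]
               for i in range(3) for j in range(3))

print(response(image, kernel, 0, 0))   # 27: strong positive response on the edge

# Flipping the image left-to-right reverses the edge direction
flipped = [row[::-1] for row in image]
print(response(flipped, kernel, 0, 0))  # -27: the opposite-direction response
```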
Key CNN Concepts
| Concept | Definition | Benefit |
|---|---|---|
| Local receptive field | Each neuron only sees a small local patch of the input | Encodes spatial locality; nearby pixels processed together |
| Weight sharing | The same filter (same weights) is applied at every position | Drastically fewer parameters; filter learning is efficient |
| Translation equivariance | If the input shifts, the feature map shifts the same way | Recognise features regardless of where they appear in the image |
| Stride | Step size of the filter as it slides | Controls output spatial size; stride 2 halves dimensions |
| Padding | Adding zeros around the input border | Preserves spatial size; prevents edge information loss |
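Stride and padding together determine the output spatial size through the standard formula out = floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, p the padding, and s the stride. A quick sketch:

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1

# 224x224 input, 3x3 filter, stride 1, padding 1: spatial size preserved
print(conv_output_size(224, 3, stride=1, padding=1))  # 224
# The same filter with stride 2 halves the spatial dimensions
print(conv_output_size(224, 3, stride=2, padding=1))  # 112
```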
CNN Architecture Components
A full CNN stacks several types of layers:
Convolutional layers
Apply N filters to the input, producing N feature maps. Each filter detects one type of pattern. The depth of a conv layer (the number of filters) is a hyperparameter; 64, 128, or 256 channels are common.
Pooling layers
Downsample feature maps spatially. Max pooling takes the maximum value in each window (e.g. 2×2), retaining the strongest activation. Average pooling takes the mean. Pooling adds a degree of translation invariance (not just equivariance).
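Both pooling variants over non-overlapping 2×2 windows can be sketched as:

```python
def pool2x2(feature_map, op):
    """Downsample by applying op() to each non-overlapping 2x2 window."""
    out = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            window = [feature_map[i][j],     feature_map[i][j + 1],
                      feature_map[i + 1][j], feature_map[i + 1][j + 1]]
            row.append(op(window))
        out.append(row)
    return out

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]
print(pool2x2(fmap, max))                        # [[4, 2], [2, 8]]
print(pool2x2(fmap, lambda w: sum(w) / len(w)))  # [[2.5, 1.0], [0.75, 6.5]]
```

Max pooling keeps only the strongest activation per window, which is why a small shift of the input often leaves the pooled output unchanged: the invariance mentioned above.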
Activation (ReLU)
Applied after each conv layer. ReLU(x) = max(0, x) introduces non-linearity cheaply and largely avoids vanishing gradients, enabling deep stacking.
Fully connected layers
At the end of the convolutional stack, feature maps are flattened and passed through dense MLP layers to produce class scores or embeddings. Global average pooling (GAP) is a modern alternative that avoids the large parameter cost of FC layers.
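The parameter saving from GAP can be illustrated with representative sizes (the 7×7×512 final feature map and 1,000 classes are assumed, VGG-like figures, not values from the text):

```python
# Flatten + fully connected: every feature-map value connects to every class
h, w, channels, classes = 7, 7, 512, 1000
fc_params = h * w * channels * classes + classes

# Global average pooling: each channel collapses to its mean (no parameters),
# leaving only a small channels -> classes linear layer
gap_params = channels * classes + classes

print(fc_params)   # 25,089,000
print(gap_params)  # 513,000
```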
Depth is crucial. Stacking convolutional layers creates a hierarchical feature detector. Each layer operates on the feature maps produced by the layer below, so later layers have an effective receptive field that covers a much larger region of the original image without any single layer needing a large kernel.
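The growth of the effective receptive field can be computed directly. For a stack of conv layers, the receptive field grows by (k - 1) times the cumulative stride ("jump") at each layer; a sketch:

```python
def receptive_field(kernel_sizes, strides):
    """Effective receptive field after a stack of conv layers.
    r grows by (k - 1) * jump per layer; jump multiplies by each stride."""
    r, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        r += (k - 1) * jump
        jump *= s
    return r

# Two stacked 3x3, stride-1 convs together see a 5x5 region; three see 7x7,
# without any single layer needing a large kernel
print(receptive_field([3, 3], [1, 1]))        # 5
print(receptive_field([3, 3, 3], [1, 1, 1]))  # 7

# A stride-2 layer doubles the growth contributed by every layer above it
print(receptive_field([3, 3, 3], [1, 2, 1]))  # 9
```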
How Representations Evolve Through Layers
One of the most important empirical findings about CNNs is that the learned representations are interpretable and hierarchical:
- Early layers learn low-level features: oriented edges, colour blobs, simple gradients. These look almost identical across very different trained CNNs and across different datasets.
- Middle layers combine edges into textures and simple shapes: circles, grids, repeating patterns.
- Later layers respond to complex object parts: eyes, wheels, faces, windows.
- Final layers produce abstract class-discriminative features that fire for whole objects in context.
This hierarchy was confirmed by feature activation maximisation (synthesising an input that maximally activates a given neuron) and explains why transfer learning works so well: early- and middle-layer features generalise across tasks and datasets.
Classic Architectures
| Architecture | Year | Key Innovation |
|---|---|---|
| LeNet-5 | 1998 | Pioneered CNN for handwritten digit recognition; conv + pool + FC pattern established |
| AlexNet | 2012 | ReLU activations, dropout regularisation, GPU training; won ImageNet by a huge margin and ignited the deep learning era |
| VGG-16/19 | 2014 | Showed that depth with uniform 3ร3 filters beats wider shallower nets; highly transferable features |
| ResNet | 2015 | Residual (skip) connections: output = F(x) + x; solved vanishing gradient for very deep nets; trained 152-layer networks reliably |
| EfficientNet | 2019 | Compound scaling: jointly scale width, depth, and resolution by a fixed ratio; SOTA accuracy at much lower FLOP count |
ResNet's residual connections deserve special mention. The core problem with very deep networks was that gradients vanished as they propagated back through many layers. The skip connection creates a gradient highway: even if the learned residual F(x) is near zero, the identity path x guarantees gradient flow. This enabled networks far deeper than VGG (152 layers versus VGG's 19), with better accuracy and generalisation.
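A minimal numeric sketch of the residual idea, output = F(x) + x, on plain Python lists. The residual functions here are toy stand-ins for the pair of conv layers a real block would contain:

```python
def residual_block(x, residual_fn):
    """output = F(x) + x: the identity path is added to the learned residual."""
    fx = residual_fn(x)
    return [xi + fi for xi, fi in zip(x, fx)]

# Even if the learned residual is (near) zero, the block passes x through
# unchanged, so stacking many such blocks cannot destroy the signal
zero_residual = lambda x: [0 for _ in x]
print(residual_block([1, -2, 3], zero_residual))  # [1, -2, 3]

# A non-trivial residual nudges the representation instead of replacing it
shift_residual = lambda x: [1 for _ in x]
print(residual_block([1, -2, 3], shift_residual))  # [2, -1, 4]
```

The same additive structure explains the gradient highway: the derivative of F(x) + x with respect to x is F'(x) + 1, so the "+1" term carries gradient even when F'(x) vanishes.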
CNNs Beyond Images
The convolution idea generalises beyond 2D images. 1D convolutions slide a filter along a sequence, which is useful for time-series classification, raw audio waveforms, and character-level text. Temporal CNNs (TCNs) stack 1D convolutions with dilated kernels to capture long-range temporal dependencies efficiently without recurrence.
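A 1D convolution with dilation can be sketched the same way as the 2D case: a dilation of d spaces the filter taps d steps apart, so when dilations double from layer to layer (1, 2, 4, ...) the receptive field grows exponentially with depth. A toy example with a two-tap difference filter:

```python
def conv1d(seq, kernel, dilation=1):
    """Valid 1D convolution whose filter taps are `dilation` steps apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # input region covered by one output value
    return [sum(seq[i + j * dilation] * kernel[j] for j in range(k))
            for i in range(len(seq) - span + 1)]

seq = [1, 2, 4, 8, 16, 32, 64]
kernel = [1, -1]  # first-difference filter

print(conv1d(seq, kernel, dilation=1))  # [-1, -2, -4, -8, -16, -32]
print(conv1d(seq, kernel, dilation=2))  # [-3, -6, -12, -24, -48]
```

With dilation 2 the filter compares values two steps apart, covering a wider span of the sequence with the same two weights.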
Vision Transformers (ViT), introduced in 2020, now challenge CNNs on most image benchmarks. ViT splits the image into patches, treats each patch as a token, and applies self-attention. Trained on very large datasets, ViTs outperform CNNs; at smaller data scales, CNNs still have an advantage because their inductive biases (locality, weight sharing) act as a useful prior. In practice:
Use CNNs when...
- Deploying to edge / mobile devices (low FLOP count)
- Small to medium training data sizes
- Latency-sensitive real-time inference
- Fine-tuning a pre-trained vision backbone
Use ViTs when...
- Large-scale pretraining available (ImageNet-21k, LAION)
- Multimodal tasks (CLIP, Flamingo use ViT image encoders)
- Tasks needing global context across the whole image
- Unified architectures combining vision and language
Checklist: Do You Understand This?
- Can you explain why an MLP applied directly to pixels does not scale, and what property of images CNNs exploit instead?
- What does weight sharing mean and why does it reduce the parameter count compared to a fully connected layer?
- What is the difference between translation equivariance (from convolution) and translation invariance (from pooling)?
- What problem did residual connections in ResNet solve, and why does the identity shortcut help gradient flow?
- In what scenarios would you choose a CNN backbone over a Vision Transformer for a production deployment?