🧠 All Things AI
Advanced

RNNs, LSTMs & Sequence Models

Before transformers, recurrent neural networks were the dominant architecture for any task involving sequences: language modelling, machine translation, speech recognition, and time-series prediction. Understanding their mechanics and limitations is not merely historical: it explains precisely why the transformer's design choices matter, and it informs ongoing work on architectures like RWKV and Mamba that attempt to recover the computational efficiency of recurrence while matching transformer quality.

The Sequence Modelling Problem

Language, audio, and time-series data share a common structure: the correct output at each step depends on what came before. A fixed-size feedforward network has no mechanism for this: it maps a fixed-length input to a fixed-length output with no memory of previous computations. The sequence modelling challenge is to propagate relevant information across variable-length inputs while keeping computation tractable.

The basic recipe of a recurrent neural network (RNN) is elegant: maintain a hidden state vector h that is updated at each time step as a function of the current input and the previous hidden state. After processing the full sequence, h encodes everything the network has seen.

hₜ = tanh(Wₕ · hₜ₋₁ + Wₓ · xₜ + b)
hₜ: hidden state at step t (working memory)
xₜ: input at step t (e.g. token embedding)
Wₕ: hidden-to-hidden weight matrix (shared across all steps)
Wₓ: input-to-hidden weight matrix (shared across all steps)

The same weight matrices Wₕ and Wₓ are applied at every time step: the network is "unrolled" across the sequence, with weight sharing enforcing that the learned rule for updating memory is consistent regardless of position. Training uses backpropagation through time (BPTT): unroll the full sequence into a very deep feedforward graph, compute a loss at each step, and backpropagate gradients through the entire unrolled graph.
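The unrolled computation can be sketched as a minimal plain-RNN forward pass in NumPy; the dimensions, random weights, and initialisation are illustrative, not from the text:

```python
import numpy as np

def rnn_forward(xs, Wh, Wx, b):
    """Unroll a plain RNN over a sequence, reusing the same weights at every step."""
    h = np.zeros(Wh.shape[0])         # h_0: initial hidden state
    for x in xs:                      # one serial step per input
        h = np.tanh(Wh @ h + Wx @ x + b)
    return h                          # final hidden state summarises the sequence

rng = np.random.default_rng(0)
d_h, d_x, T = 8, 4, 10                # illustrative sizes
Wh = rng.normal(0, 0.3, (d_h, d_h))   # shared hidden-to-hidden weights
Wx = rng.normal(0, 0.3, (d_h, d_x))   # shared input-to-hidden weights
b = np.zeros(d_h)
xs = rng.normal(size=(T, d_x))
h = rnn_forward(xs, Wh, Wx, b)
print(h.shape)  # (8,): fixed size regardless of sequence length
```

Note that the loop is inherently serial: each step needs the previous hidden state before it can run.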

The Vanishing Gradient Problem

The weight sharing that makes RNNs elegant also creates their central failure mode. When backpropagating through an unrolled RNN of length T, the gradient with respect to an early hidden state h₀ involves multiplying through T applications of the same Jacobian matrix ∂hₜ/∂hₜ₋₁.

∂L/∂h₀ = ∂L/∂h_T · ∏ᵢ₌₁ᵀ (∂hᵢ/∂hᵢ₋₁)
Each factor: ∂hᵢ/∂hᵢ₋₁ = Wₕᵀ · diag(tanh′(·))
If singular values of Wₕ < 1: product → 0 (vanishing)
If singular values of Wₕ > 1: product → ∞ (exploding)

Since tanh saturates at ±1 with a derivative near zero at extreme values, the Jacobian for any saturated neuron is nearly zero. A product of T near-zero matrices decays exponentially. In practice, plain RNNs cannot reliably propagate gradient more than about 10–15 steps. For a sentence of 50 words, the network has essentially no training signal from the beginning of the sentence by the time it reaches the end. Exploding gradients are typically addressed with gradient clipping; vanishing gradients require architectural solutions.
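The decay is easy to observe numerically. This NumPy sketch multiplies the per-step Jacobians ∂hₜ/∂hₜ₋₁ of a small tanh RNN (illustrative dimensions and weight scale, chosen so the singular values sit below 1) and tracks the norm of the product:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 50
Wh = rng.normal(0, 0.1, (d, d))   # small weights: singular values below 1
h = rng.normal(size=d)

J = np.eye(d)
norms = []
for t in range(T):
    h = np.tanh(Wh @ h)
    # One-step Jacobian (transposed form used in backprop): Wh^T . diag(tanh')
    # where tanh'(pre-activation) = 1 - tanh(pre)^2 = 1 - h^2
    J = (Wh.T @ np.diag(1 - h**2)) @ J
    norms.append(np.linalg.norm(J))

print(f"after 1 step: {norms[0]:.3e}, after {T} steps: {norms[-1]:.3e}")
# the accumulated gradient norm collapses toward zero
```

With weights scaled the other way (singular values above 1), the same loop explodes instead of vanishing.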

LSTM: Long Short-Term Memory

Hochreiter and Schmidhuber (1997) introduced the Long Short-Term Memory cell to solve the vanishing gradient problem. The key insight is to separate memory into two streams: a cell state cₜ (long-term memory) and a hidden state hₜ (working memory). The cell state flows through the sequence with only additive updates, creating a gradient highway that bypasses the multiplicative decay.

Forget Gate (f)

Decides what to erase from the cell state. Sigmoid output in [0, 1]: 0 means fully forget, 1 means fully retain.

fₜ = σ(Wf · [hₜ₋₁, xₜ] + bf)
Input Gate (i)

Decides what new information to write to the cell state. Gate value × candidate update.

iₜ = σ(Wi · [hₜ₋₁, xₜ] + bi)
g̃ₜ = tanh(Wg · [hₜ₋₁, xₜ] + bg)
Output Gate (o)

Decides what portion of the cell state to expose as the hidden state hₜ.

oₜ = σ(Wo · [hₜ₋₁, xₜ] + bo)
hₜ = oₜ ⊙ tanh(cₜ)

The cell state update combines the forget gate (selective erasure) and the input gate (selective writing):

cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ g̃ₜ
fₜ ⊙ cₜ₋₁: retain relevant past memory (additive update)
iₜ ⊙ g̃ₜ: write new candidate memory (additive update)
Key property: ∂cₜ/∂cₜ₋₁ = fₜ (elementwise, not a matrix product)
When fₜ ≈ 1: gradient flows unchanged, no vanishing

The critical property is the additive update to cₜ. Because the cell state changes by addition rather than multiplication by a weight matrix, the gradient of cₜ with respect to cₜ₋₁ is simply the forget gate value, a number close to 1 for important memories. Gradients flow through this path without the exponential decay that kills plain RNNs. Sigmoid gates (outputs 0 to 1) act as differentiable switches, while tanh squashes candidate updates and hidden state outputs to prevent unbounded growth.
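A single LSTM step can be sketched in NumPy as follows. Packing the four gate matrices into one stacked W is a common implementation convenience, not something the equations require, and the dimensions are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W stacks the four gate matrices (f, i, o, g) row-wise."""
    z = W @ np.concatenate([h_prev, x]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gates: differentiable switches
    g = np.tanh(g)                                # candidate update
    c = f * c_prev + i * g                        # additive cell-state update
    h = o * np.tanh(c)                            # exposed hidden state
    return h, c

rng = np.random.default_rng(1)
d_h, d_x = 8, 4                                   # illustrative sizes
W = rng.normal(0, 0.3, (4 * d_h, d_h + d_x))
b = np.zeros(4 * d_h)
h = c = np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)
```

The gradient highway is visible in the line `c = f * c_prev + i * g`: the path from `c_prev` to `c` is an elementwise multiply by the gate, not a pass through a weight matrix.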

GRU: Gated Recurrent Unit

Cho et al. (2014) introduced the Gated Recurrent Unit as a streamlined alternative to the LSTM. GRU merges the cell and hidden states into a single vector and uses only two gates rather than three, reducing parameter count by about 25%.

Reset Gate (r)

Controls how much of the previous hidden state to use when computing the candidate update. When r ≈ 0, the candidate is computed from the input alone: the gate resets memory for a fresh start. Useful when the model encounters a topic shift.

rₜ = σ(Wr · [hₜ₋₁, xₜ])
Update Gate (z)

Controls how much to interpolate between the previous hidden state and the candidate update. When z ≈ 0, the previous state is copied unchanged, enabling long-range memory without vanishing gradients, analogous to LSTM's forget gate set to 1.

zₜ = σ(Wz · [hₜ₋₁, xₜ])
h̃ₜ = tanh(W · [rₜ ⊙ hₜ₋₁, xₜ])  (candidate update)
hₜ = (1 − zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ  (interpolated output)
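The two-gate update can be sketched in NumPy; the dimensions, weight scale, and concatenated weight layout are illustrative assumptions, not from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wr, Wz, W):
    """One GRU step: two gates, a single merged state vector."""
    hx = np.concatenate([h_prev, x])
    r = sigmoid(Wr @ hx)                               # reset gate
    z = sigmoid(Wz @ hx)                               # update gate
    h_cand = np.tanh(W @ np.concatenate([r * h_prev, x]))  # candidate update
    return (1 - z) * h_prev + z * h_cand               # interpolated output

rng = np.random.default_rng(1)
d_h, d_x = 8, 4                                        # illustrative sizes
Wr, Wz, W = rng.normal(0, 0.3, (3, d_h, d_h + d_x))
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):
    h = gru_step(x, h, Wr, Wz, W)
print(h.shape)
```

Compared with the LSTM sketch, there is no separate cell state and one fewer gate, which is where the parameter savings come from.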

Empirically, GRU and LSTM perform comparably on most tasks. GRU trains slightly faster and requires less memory, making it preferable when model size is constrained. LSTM tends to edge ahead on tasks requiring very precise memory control over long horizons. In practice, both have been largely superseded by transformers for language tasks, though GRU remains a common choice in production time-series and streaming systems.

Bidirectional RNNs

A standard RNN processes tokens left to right, so the hidden state at position i reflects only what came before. For tasks like named entity recognition, sentiment classification, or question answering, context from both sides is useful. A bidirectional RNN runs two RNNs in parallel, one left-to-right and one right-to-left, then concatenates the hidden states at each position.

Bidirectional RNN: a forward pass produces h₁ᶠ … hₙᶠ, a backward pass produces hₙᵇ … h₁ᵇ, and the two are concatenated [hᶠ; hᵇ] at each position, giving full context at every step.

The concatenated state [hᵢᶠ; hᵢᵇ] at position i has seen the full sequence from both ends. This enables high-quality per-token representations for classification and tagging. The critical limitation is that bidirectional models require the complete sequence upfront, so they cannot be used autoregressively for generation. BERT (bidirectional transformer encoder) is strong for classification but cannot generate text token-by-token for precisely this reason.
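A bidirectional pass is just two unidirectional passes plus a concatenation. This NumPy sketch (illustrative sizes, untrained random weights) shows the shape bookkeeping, including re-aligning the backward states to their original positions:

```python
import numpy as np

def rnn_states(xs, Wh, Wx):
    """Run a plain RNN and return the hidden state at every position."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x)
        states.append(h)
    return np.stack(states)            # shape (T, d_h)

rng = np.random.default_rng(2)
d_h, d_x, T = 4, 3, 6                  # illustrative sizes
Wh_f, Wh_b = rng.normal(0, 0.3, (2, d_h, d_h))
Wx_f, Wx_b = rng.normal(0, 0.3, (2, d_h, d_x))
xs = rng.normal(size=(T, d_x))

h_fwd = rnn_states(xs, Wh_f, Wx_f)               # left-to-right states
h_bwd = rnn_states(xs[::-1], Wh_b, Wx_b)[::-1]   # right-to-left, re-aligned
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)    # [h_f; h_b] per position
print(h_bi.shape)  # (6, 8): full context at every position
```

Note the backward pass needs the whole sequence before it can produce its first state, which is exactly why this construction rules out autoregressive generation.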

Why Transformers Replaced RNNs

Despite gating, RNNs carry three structural disadvantages that transformers address directly:

Sequential Computation

Step t depends on step t−1. The entire sequence must be processed in order; there is no GPU parallelism across time steps during training. A sequence of 1,000 tokens requires 1,000 serial steps. Transformers compute all positions simultaneously via matrix operations, achieving massive parallelism.

Fixed-Size Bottleneck

All information about a long sequence must be compressed into a hidden state of fixed dimensionality (typically 512–2048). For very long sequences this creates an information bottleneck: early tokens get overwritten. Transformers attend directly to any past token via the KV cache.

Long-Range Dependency

Even with gating, information must flow through every intermediate step between two distant tokens. LSTMs struggle reliably beyond 100–200 token dependencies. Transformers connect any two positions with O(1) path length regardless of distance.

Where RNNs Are Still Used

RNNs have not disappeared entirely. Several niches retain them in 2025:

Streaming Inference

RNNs process one token at a time with O(1) memory (the fixed-size hidden state). For applications responding to a continuous audio or sensor stream (keyword detection, real-time telemetry), this is decisive. A transformer must materialise the full context window for each generated token, making streaming costly.
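The O(1)-memory property can be sketched as a stateful scorer that consumes an unbounded stream one frame at a time; the RNN weights and the scalar "score" readout are illustrative, standing in for something like a wake-word detector:

```python
import numpy as np

def stream_scorer(Wh, Wx, w_out):
    """Stateful streaming scorer: constant memory regardless of stream length."""
    h = np.zeros(Wh.shape[0])          # the only state ever kept
    def step(x):
        nonlocal h
        h = np.tanh(Wh @ h + Wx @ x)   # fold the new frame into the state
        return float(w_out @ h)        # e.g. a detection score per frame
    return step

rng = np.random.default_rng(4)
d_h, d_x = 8, 3                        # illustrative sizes
score = stream_scorer(rng.normal(0, 0.3, (d_h, d_h)),
                      rng.normal(0, 0.3, (d_h, d_x)),
                      rng.normal(size=d_h))

for frame in rng.normal(size=(1000, d_x)):  # arbitrarily long stream
    s = score(frame)                        # constant work and memory per frame
print(type(s).__name__)
```

Nothing accumulates with stream length: no history buffer, no growing cache, just the fixed-size hidden state.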

Constrained Hardware

Microcontrollers and embedded systems with kilobytes of RAM cannot run even small transformer blocks. LSTM-based models remain the architecture of choice for wake-word detection, anomaly detection on sensor data, and on-device sequence tasks where memory is the binding constraint.

RWKV and Mamba: Modern RNN-Style Architectures

A significant research thread in 2023–2025 has re-examined recurrence as an architectural primitive, motivated by the inference efficiency of RNNs (a fixed-size state and O(1) compute per generated token) versus transformers (a KV cache that grows with context length and O(n²) attention cost for long contexts).

These designs sit on a spectrum: at one end the pure transformer (GPT / Llama) with parallel training, an O(n²) KV cache, and the best quality; at the other the pure RNN (classic LSTM) with sequential training, O(1) inference, and a quality ceiling; Mamba-2 and RWKV-6 sit between the two.

RWKV (Receptance Weighted Key Value, Peng et al.) reformulates the RNN update equation so that the time-mixing operation can be expressed as a linear attention variant, enabling parallel training while retaining O(1) inference per step. RWKV-6 models at 7B and 14B parameters approach comparably sized transformers on standard language benchmarks.

Mamba (Gu and Dao, 2023) uses a selective state-space model (SSM) where the state transition parameters are functions of the input, allowing the model to selectively remember or forget based on content, analogous to LSTM gating but expressible as a hardware-efficient scan operation on GPUs. Mamba-2 (2024) simplifies the architecture to use structured state spaces that map more cleanly to tensor operations and has been scaled to competitive performance at multi-billion parameter counts. Both architectures represent active frontier research into whether recurrence can match attention quality at scale.
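Neither RWKV nor Mamba is reproduced here, but the core trick they build on, a recurrence that computes the same result as a parallel attention-style formulation, can be illustrated with plain linear attention in NumPy. The exp feature map and the tiny dimensions are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 6, 4
Q, K, V = rng.normal(size=(3, T, d))
phi = lambda m: np.exp(m)  # positive feature map (illustrative choice)

# Parallel, attention-style causal computation: each step scores the whole prefix
out_parallel = np.zeros((T, d))
for t in range(T):
    w = phi(Q[t]) @ phi(K[: t + 1]).T          # unnormalised scores over prefix
    out_parallel[t] = w @ V[: t + 1] / w.sum()

# Recurrent computation: fixed-size state, one cheap update per step
S = np.zeros((d, d))                            # running sum of phi(k) v^T
z = np.zeros(d)                                 # running sum of phi(k)
out_recurrent = np.zeros((T, d))
for t in range(T):
    S += np.outer(phi(K[t]), V[t])
    z += phi(K[t])
    out_recurrent[t] = phi(Q[t]) @ S / (phi(Q[t]) @ z)

print(np.allclose(out_parallel, out_recurrent))  # identical outputs, O(1) state
```

The parallel form supports GPU-friendly training over whole sequences; the recurrent form gives constant-memory, constant-cost generation. Having both views of the same computation is the property these architectures recover.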

Checklist: Do You Understand This?

  • Can you write the plain RNN update equation and explain what the shared weight matrices Wₕ and Wₓ represent?
  • Can you explain why multiplying a gradient through T identical Jacobians causes vanishing, and which property of tanh makes it worse?
  • Can you name the three LSTM gates, describe what each controls, and explain why the additive cell state update prevents vanishing gradients?
  • Can you describe the two GRU gates and explain why a GRU uses fewer parameters than an LSTM while achieving comparable performance?
  • Can you explain why bidirectional RNNs cannot be used for autoregressive generation, and name one task where bidirectionality is strictly necessary?
  • Can you state the three structural limitations of RNNs that transformers address, and name one situation where an RNN is still preferable to a transformer?
  • Can you describe at a high level what property RWKV and Mamba recover from recurrent architectures and why that property matters for inference efficiency?