All Things AI, by Subhojit Dey

Transformer Architecture

This section covers the transformer architecture that underlies most modern LLMs: the core attention mechanism, multi-head attention, the full transformer block, positional encodings, and the main architectural variants.

In This Section

Attention Mechanism — The Core Idea

Queries, keys, values, scaled dot-product attention, and Flash Attention.

Multi-Head Self-Attention

Why multiple heads, what each learns, GQA, and MQA.

The Full Transformer Block

LayerNorm, FFN, residuals, SwiGLU, and pre-norm vs post-norm.

Positional Encoding & Variants

Sinusoidal, learned, RoPE, ALiBi, and how context length extension works.

Transformer Variants — BERT, GPT, T5

Encoder-only, decoder-only, encoder-decoder — the three paradigms.

Page built: 01 Jun 2026