TPU vs GPU vs Custom Silicon
GPUs are general-purpose parallel processors adapted for AI. But the scale and economic pressure of AI training and inference have driven every major cloud provider and several startups to design custom accelerators optimized for specific workloads. Understanding this landscape matters for architects choosing infrastructure, engineers interpreting benchmark claims, and anyone reasoning about the future of AI compute costs.
Google TPU — Systolic Array Design
Google developed the first Tensor Processing Unit (TPU) in 2015, deploying it internally for inference before training support was added. Unlike a GPU — which runs many small matrix operations across thousands of independent cores — a TPU uses a systolic array: a regular grid of multiply-accumulate units where data flows through in a wave pattern, with each unit passing its result to the next.
Systolic array intuition
Imagine a 256×256 grid of multipliers. Data flows left-to-right and top-to-bottom simultaneously. Each cell multiplies two numbers and adds to a running sum. By the time data exits the array, an entire matrix multiplication is complete — with zero memory traffic for intermediate results. This eliminates the HBM bandwidth bottleneck that plagues GPUs for regular-shaped matrix math.
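The wave pattern above can be sketched in a few lines of NumPy. This toy output-stationary model indexes operands with the skew `t - i - j` that the edge feeders implement in hardware; it illustrates the dataflow only, not any real TPU pipeline:

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy output-stationary systolic array: each cell (i, j) holds a
    running sum; A streams in from the left, B from the top, skewed so
    matching operands meet at the right cell on the right cycle."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    acc = np.zeros((n, m))
    # On cycle t, cell (i, j) sees A[i, t - i - j] and B[t - i - j, j].
    # The skew (t - i - j) is what the edge feeders implement in hardware;
    # here we simply index into the operands directly.
    for t in range(n + m + k - 2):
        for i in range(n):
            for j in range(m):
                s = t - i - j
                if 0 <= s < k:
                    acc[i, j] += A[i, s] * B[s, j]
    return acc

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

Each cell only ever talks to its neighbors; no intermediate value touches external memory, which is the property the prose above describes.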
| Generation | Year | Peak compute (per chip) | Key Feature |
|---|---|---|---|
| TPU v1 | 2015 | 92 TOPS (INT8) | Inference only; deployed in Google Search/Translate |
| TPU v3 | 2018 | 123 TFLOPS (BF16) | Training support; liquid cooling; 1,024-chip Pods |
| TPU v4 | 2021 | 275 TFLOPS (BF16) | 3D torus network; 4,096-chip Pods; trained PaLM, Gemini |
| TPU v5e / v5p | 2023–24 | 197 / 459 TFLOPS (BF16) | Inference-optimized (v5e) and training (v5p) variants; available on GCP |
TPU Pods connect thousands of chips via a custom 3D torus interconnect, enabling direct chip-to-chip communication across the entire cluster without routing through host CPUs or Ethernet switches. This is what makes it practical to train models like Gemini Ultra across thousands of chips simultaneously.
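The torus topology is easy to picture in code. In the sketch below every chip has exactly six wrap-around neighbors; the 16×16×16 dimensions are illustrative (they give 4,096 chips, the size of a v4 Pod, though the real Pod's physical wiring differs in detail):

```python
def torus_neighbors(x, y, z, dims=(16, 16, 16)):
    """The six wrap-around neighbors of chip (x, y, z) in a 3D torus.
    Wrapping means no chip sits on an edge, so routes stay short and
    uniform; dims=(16, 16, 16) is an illustrative 4,096-chip cluster."""
    X, Y, Z = dims
    return [((x + 1) % X, y, z), ((x - 1) % X, y, z),
            (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
            (x, y, (z + 1) % Z), (x, y, (z - 1) % Z)]

# A "corner" chip still has six neighbors -- the torus has no corners.
assert len(set(torus_neighbors(0, 0, 0))) == 6
assert (15, 0, 0) in torus_neighbors(0, 0, 0)  # wraps to the far face
```

Because every chip has the same local view, collective operations like all-reduce can be scheduled identically everywhere, which is what large-scale synchronous training needs.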
TPU vs GPU — Real Tradeoffs
When TPU wins
- Fixed-shape, regular workloads (standard transformer training)
- Very large scale (thousands of chips, weeks of training)
- JAX-native code (XLA compiler, native TPU compilation)
- Google Cloud commitment (only available on GCP)
- Cost: TPU v5e often 2–3× cheaper per FLOP for training on GCP
When GPU wins
- Dynamic shapes and custom CUDA kernels
- PyTorch-native code (90% of research)
- Fine-tuning, LoRA, experimental architectures
- Multi-cloud or on-premises deployment
- Custom ops not yet supported in XLA
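One concrete reason dynamic shapes favor GPUs: XLA specializes each compiled program to its concrete input shapes, so shape churn triggers recompilation. A toy shape-keyed compile cache (the kernel strings are illustrative stand-ins for compiled binaries):

```python
# Toy shape-keyed compile cache. XLA really does specialize programs to
# input shapes; the kernel strings here stand in for compiled binaries.
compile_cache = {}

def run(shape):
    if shape not in compile_cache:                 # unseen shape: compile
        # On real hardware this step can take seconds to minutes.
        compile_cache[shape] = f"kernel{len(compile_cache)}"
    return compile_cache[shape]

for _ in range(1000):
    run((32, 4096))            # fixed shapes: one compile, then cache hits
assert len(compile_cache) == 1

for seq_len in (17, 93, 41):
    run((1, seq_len))          # ragged sequence lengths: a compile each time
assert len(compile_cache) == 4
```

This is why padding or bucketing sequence lengths is standard practice on TPUs, while a GPU's dynamic dispatch simply absorbs the variation.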
Groq LPU — Deterministic Inference
Groq's Language Processing Unit (LPU) takes a fundamentally different architectural approach. Where GPUs and TPUs schedule work dynamically across parallel cores, the LPU uses a sequential, compiled execution model: the entire computation graph is compiled ahead of time, and execution is completely deterministic with no runtime scheduling overhead.
Architecture
A Tensor Streaming Processor: functional units are laid out along a single stream, fed from 230 MB of on-chip SRAM, and data flows through at compiler-determined timing. No thread scheduling, no caches, no dynamic dispatch.
Performance
Hundreds of tokens per second for Llama 3 70B as of 2024, far exceeding single-H100 throughput at batch size one. Latency is low and, thanks to deterministic execution, consistent (great for interactive applications).
Limitations
Inference only (no training). Models must be compiled for the GroqChip architecture, and only supported model architectures run. On-chip SRAM is the only memory (no HBM), so large models are sharded across many chips.
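The scheduling contrast can be caricatured in a few lines: if the compiler assigns every op a fixed start cycle ahead of time, end-to-end latency is a constant known before the first run. This is a toy model of static scheduling, not Groq's actual toolchain:

```python
# Toy contrast with dynamic dispatch: a 'compiler' assigns every op a
# fixed start cycle once, so runtime has nothing left to decide.
# Op names and durations are illustrative, not any real ISA.

def compile_schedule(graph):
    """Assign each (op, duration) a start cycle at compile time.
    Total latency is known before the program ever runs -- the
    source of the LPU-style determinism described above."""
    schedule, cycle = [], 0
    for op, duration in graph:
        schedule.append((cycle, op))
        cycle += duration          # resource conflicts resolved here, once
    return schedule, cycle

graph = [("load_weights", 4), ("matmul", 10), ("softmax", 3), ("matmul2", 10)]
schedule, total = compile_schedule(graph)
assert total == 27                         # latency fixed statically
assert [c for c, _ in schedule] == [0, 4, 14, 17]
```

A GPU resolves the equivalent ordering at run time via warp schedulers and caches, which is flexible but makes per-request latency a distribution rather than a constant.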
Cerebras CS-3 — Wafer-Scale Computing
Cerebras takes a radical approach: instead of building a chip and connecting multiple chips, they build a chip the size of an entire 300mm silicon wafer. The CS-3 (2024) is one monolithic die containing 4 trillion transistors, 900,000 AI cores, and 44 GB of on-chip SRAM.
The key insight: inter-chip communication (NVLink, PCIe, high-speed Ethernet) is always slower and more power-hungry than on-chip interconnects. By eliminating chip boundaries entirely, Cerebras achieves essentially zero communication latency between all 900K cores.
The wafer-scale challenge: yield
Standard chip manufacturing expects a handful of defects per wafer; that is why dies are cut small, so that most dies escape defects entirely. Cerebras instead routes around defective cores at wafer level using redundant cores and programmable routing. This is a solved engineering problem for them, and a significant manufacturing moat.
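The yield argument follows from the classic Poisson defect model, where the probability of a defect-free die is e^(-D0·A). The defect density below is illustrative (real foundry values are proprietary):

```python
import math

def poisson_yield(defect_density_per_cm2, area_cm2):
    """Classic Poisson yield model: probability a die has zero defects."""
    return math.exp(-defect_density_per_cm2 * area_cm2)

D0 = 0.1          # defects/cm^2 -- illustrative, real values are proprietary
gpu_die = 8.0     # ~800 mm^2, roughly a reticle-limited GPU die
wafer = 460.0     # ~46,000 mm^2 of usable area on a 300 mm wafer

print(poisson_yield(D0, gpu_die))   # ~0.45: about half of large dies are perfect
print(poisson_yield(D0, wafer))     # ~1e-20: a perfect wafer essentially never exists
print(D0 * wafer)                   # ~46 expected defects per wafer
```

A perfect wafer is a statistical impossibility, but a few dozen dead cores out of 900,000 is negligible once the routing fabric can steer around them, which is exactly the redundancy strategy described above.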
Cerebras targets very large model training where memory bandwidth is the bottleneck. 44 GB of on-chip SRAM is tiny by HBM standards (H100 has 80 GB), but SRAM bandwidth is orders of magnitude higher than HBM. For certain workloads (very large batch sizes, specific model architectures), this can outperform H100 clusters.
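A simple roofline model shows why bandwidth, not peak FLOPS, decides this. Attainable throughput is min(peak, bandwidth × arithmetic intensity); the SRAM bandwidth figure below is a round stand-in for the PB/s-range numbers Cerebras quotes, and the H100 figures are its public BF16 peak and HBM3 bandwidth:

```python
def attainable_tflops(peak_tflops, bandwidth_tb_s, flops_per_byte):
    """Simple roofline model: a kernel is memory-bound whenever
    bandwidth * arithmetic_intensity falls below peak compute."""
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)

# A low-arithmetic-intensity kernel: 10 FLOPs per byte moved.
ai = 10
print(attainable_tflops(989, 3.35, ai))    # H100 + HBM3: ~33.5 TFLOPS, bandwidth-bound
print(attainable_tflops(989, 1000.0, ai))  # same peak fed by SRAM: full 989 TFLOPS
```

For high-intensity kernels (big dense matmuls) HBM is fine; it is the bandwidth-bound regimes where on-chip SRAM changes the outcome, which is the niche Cerebras targets.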
Tenstorrent — Open Architecture
Founded by Jim Keller (chip architect behind AMD Zen, Apple A4/A5, Tesla FSD chip), Tenstorrent builds AI accelerators on a RISC-V base with an open-architecture philosophy. Their Wormhole and Blackhole chips target datacenter inference and edge deployment.
Tenstorrent's differentiator is a software-first approach: their TT-Metalium framework gives developers low-level, open control over data movement and compute scheduling, positioning it as an open alternative to CUDA's closed stack. The chips are also physically tiled: multiple chips connect like puzzle pieces, with the network-on-chip extending across chip boundaries.
AWS Trainium / Inferentia
Amazon built custom silicon to reduce dependency on NVIDIA and lower training/inference costs for EC2 customers. Two separate chips serve two purposes:
AWS Trainium 2
Purpose: model training. Competitive BF16 throughput vs H100 at lower cost on EC2. Within a Trn2 instance, chips are connected by the high-bandwidth NeuronLink interconnect.
Ecosystem: AWS Neuron SDK + PyTorch/JAX integration. Growing support for popular architectures (Llama, Falcon, Stable Diffusion). Some custom ops still require porting.
AWS Inferentia 2
Purpose: inference at scale. Inf2 instances optimized for latency and throughput. Competitive cost-per-token for standard transformer architectures.
Used by Amazon internally for Alexa, recommendations, search ranking. Available as Inf2 EC2 instances; cost often 30–50% lower than comparable GPU-based inference.
Apple Neural Engine — On-Device Inference
Apple's Neural Engine (ANE) is a dedicated hardware block on every Apple Silicon chip (M-series Macs, A-series iPhones, iPads). It is not a general-purpose accelerator — it is specifically optimized for low-power, on-device inference using the CoreML framework.
Apple Silicon integrates CPU, GPU, Neural Engine, and unified memory on one die
The key architectural advantage of Apple Silicon for on-device AI is unified memory: CPU, GPU, and Neural Engine all share the same physical memory pool, eliminating PCIe copy overhead. A 7B parameter model in 4-bit quantization (~4 GB) fits in M4 unified memory and can run at usable speeds (10–30 tokens/second) using tools like Ollama or LM Studio.
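The ~4 GB figure is simple arithmetic: parameters × bits per weight, plus some overhead for quantization scales and runtime buffers (the 10% overhead used here is an assumption, not a measured number):

```python
def model_memory_gb(n_params, bits_per_weight, overhead=1.1):
    """Rough weight-memory footprint. The 10% overhead for quantization
    scales, zero-points, and runtime buffers is an assumption."""
    return n_params * bits_per_weight / 8 / 1e9 * overhead

print(model_memory_gb(7e9, 4))    # ~3.85 GB: the '~4 GB' figure above
print(model_memory_gb(7e9, 16))   # ~15.4 GB: why an FP16 7B model needs a bigger Mac
```

The same arithmetic explains model sizing on any device; unified memory just means the whole pool (not a separate VRAM partition) is available to hold the weights.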
38 TOPS sounds modest next to H100's 989 TFLOPS, but Apple achieves this at approximately 15W vs H100's 700W. For on-device inference, performance-per-watt and privacy (no data leaving the device) matter more than raw throughput.
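The performance-per-watt comparison is rough because the precisions differ (the ANE is rated in INT8 TOPS, while the H100 figure here is BF16 TFLOPS), but even as an order-of-magnitude sketch it shows why the ANE is viable inside a phone or laptop power budget:

```python
def perf_per_watt(ops_per_second, watts):
    """Throughput per watt. Only meaningful when the units and
    precisions match; here they do not, so treat results as rough."""
    return ops_per_second / watts

ane = perf_per_watt(38, 15)      # ~2.5 TOPS/W (M4-class, INT8, whole-SoC-ish power)
h100 = perf_per_watt(989, 700)   # ~1.4 TFLOPS/W (BF16, SXM TDP)
print(ane, h100)
```

The efficiency gap is smaller than the raw throughput gap suggests; what the datacenter chip cannot do is fit in a 15 W envelope at all, and that absolute power budget, plus privacy, is the on-device story.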
Accelerator Landscape Comparison
| Chip | Maker | Primary Use | Key Advantage | Key Limitation |
|---|---|---|---|---|
| H100 SXM5 | NVIDIA | Training + inference | Best ecosystem, CUDA | Expensive, supply-constrained |
| MI300X | AMD | Training + inference | 192 GB HBM3, high bandwidth | ROCm ecosystem smaller |
| TPU v5p | Google | Large-scale training | Pod scale, cost on GCP | GCP lock-in, JAX-first |
| LPU | Groq | Inference only | Fastest token throughput | No training, limited model support |
| CS-3 | Cerebras | Large model training | Zero inter-chip latency | Niche, expensive, small memory |
| Trainium 2 | AWS | Training (AWS) | Cost on EC2 vs GPU | AWS lock-in, ecosystem smaller |
| Neural Engine (M4) | Apple | On-device inference | Perf/watt, privacy, unified memory | CoreML only, no training |
Checklist: Do You Understand This?
- What is a systolic array, and how does it differ from a GPU's SIMT execution model?
- Why are TPUs only available through Google Cloud, and what training workloads are they best suited for?
- What is the core architectural innovation of Groq's LPU, and why does deterministic execution improve latency?
- What is wafer-scale integration, and what manufacturing challenge does Cerebras have to solve to make it viable?
- Why does Apple Silicon's unified memory architecture benefit on-device LLM inference specifically?
- When would you choose AWS Inferentia over H100 for inference workloads?
- What is the primary software ecosystem gap that prevents AMD MI300X from displacing NVIDIA for training?