TPU vs GPU vs Custom Silicon
GPUs are general-purpose parallel processors adapted for AI. But the scale and economic pressure of AI training and inference have driven every major cloud provider and several startups to design custom accelerators optimized for specific workloads. Understanding this landscape matters for architects choosing infrastructure, engineers interpreting benchmark claims, and anyone reasoning about the future of AI compute costs.
Google TPU — Systolic Array Design
Google developed the first Tensor Processing Unit (TPU) in 2015, deploying it internally for inference before training support was added. Unlike a GPU — which runs many small matrix operations across thousands of independent cores — a TPU uses a systolic array: a regular grid of multiply-accumulate units where data flows through in a wave pattern, with each unit passing its result to the next.
Systolic array intuition
Imagine a 256×256 grid of multipliers. Data flows left-to-right and top-to-bottom simultaneously. Each cell multiplies two numbers and adds to a running sum. By the time data exits the array, an entire matrix multiplication is complete — with zero memory traffic for intermediate results. This eliminates the HBM bandwidth bottleneck that plagues GPUs for regular-shaped matrix math.
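The wave pattern above can be sketched in a few lines of NumPy. This toy output-stationary model indexes operands with the skew `t - i - j` that the edge feeders implement in hardware; it illustrates the dataflow only, not any real TPU pipeline:

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy output-stationary systolic array: each cell (i, j) holds a
    running sum; A streams in from the left, B from the top, skewed so
    matching operands meet at the right cell on the right cycle."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    acc = np.zeros((n, m))
    # On cycle t, cell (i, j) sees A[i, t - i - j] and B[t - i - j, j].
    # The skew (t - i - j) is what the edge feeders implement in hardware;
    # here we simply index into the operands directly.
    for t in range(n + m + k - 2):
        for i in range(n):
            for j in range(m):
                s = t - i - j
                if 0 <= s < k:
                    acc[i, j] += A[i, s] * B[s, j]
    return acc

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

Each cell only ever talks to its neighbors; no intermediate value touches external memory, which is the property the prose above describes.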
| Generation | Year | Peak compute (per chip) | Key Feature |
|---|---|---|---|
| TPU v1 | 2015 | 92 TOPS (INT8) | Inference only; deployed in Google Search/Translate |
| TPU v3 | 2018 | 123 TFLOPS (BF16) | Training support; liquid cooling; 1,024-chip Pods |
| TPU v4 | 2021 | 275 TFLOPS (BF16) | 3D torus network; 4,096-chip Pods; trained PaLM, Gemini |
| TPU v5e / v5p | 2023–24 | 197 / 459 TFLOPS (BF16) | Inference-optimized (v5e) and training (v5p) variants; available on GCP |
TPU Pods connect thousands of chips via a custom 3D torus interconnect, enabling direct chip-to-chip communication across the entire cluster without routing through host CPUs or Ethernet switches. This is what makes it practical to train models like Gemini Ultra across thousands of chips simultaneously.
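The torus topology is easy to picture in code. In the sketch below every chip has exactly six wrap-around neighbors; the 16×16×16 dimensions are illustrative (they give 4,096 chips, the size of a v4 Pod, though the real Pod's physical wiring differs in detail):

```python
def torus_neighbors(x, y, z, dims=(16, 16, 16)):
    """The six wrap-around neighbors of chip (x, y, z) in a 3D torus.
    Wrapping means no chip sits on an edge, so routes stay short and
    uniform; dims=(16, 16, 16) is an illustrative 4,096-chip cluster."""
    X, Y, Z = dims
    return [((x + 1) % X, y, z), ((x - 1) % X, y, z),
            (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
            (x, y, (z + 1) % Z), (x, y, (z - 1) % Z)]

# A "corner" chip still has six neighbors -- the torus has no corners.
assert len(set(torus_neighbors(0, 0, 0))) == 6
assert (15, 0, 0) in torus_neighbors(0, 0, 0)  # wraps to the far face
```

Because every chip has the same local view, collective operations like all-reduce can be scheduled identically everywhere, which is what large-scale synchronous training needs.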
TPU vs GPU — Real Tradeoffs
When TPU wins
- Fixed-shape, regular workloads (standard transformer training)
- Very large scale (thousands of chips, weeks of training)
- JAX-native code (XLA compiler, native TPU compilation)
- Google Cloud commitment (only available on GCP)
- Cost: TPU v5e often 2–3× cheaper per FLOP for training on GCP
When GPU wins
- Dynamic shapes and custom CUDA kernels
- PyTorch-native code (90% of research)
- Fine-tuning, LoRA, experimental architectures
- Multi-cloud or on-premises deployment
- Custom ops not yet supported in XLA
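One concrete reason dynamic shapes favor GPUs: XLA specializes each compiled program to its concrete input shapes, so shape churn triggers recompilation. A toy shape-keyed compile cache (the kernel strings are illustrative stand-ins for compiled binaries):

```python
# Toy shape-keyed compile cache. XLA really does specialize programs to
# input shapes; the kernel strings here stand in for compiled binaries.
compile_cache = {}

def run(shape):
    if shape not in compile_cache:                 # unseen shape: compile
        # On real hardware this step can take seconds to minutes.
        compile_cache[shape] = f"kernel{len(compile_cache)}"
    return compile_cache[shape]

for _ in range(1000):
    run((32, 4096))            # fixed shapes: one compile, then cache hits
assert len(compile_cache) == 1

for seq_len in (17, 93, 41):
    run((1, seq_len))          # ragged sequence lengths: a compile each time
assert len(compile_cache) == 4
```

This is why padding or bucketing sequence lengths is standard practice on TPUs, while a GPU's dynamic dispatch simply absorbs the variation.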
Groq LPU — Deterministic Inference
Groq's Language Processing Unit (LPU) takes a fundamentally different architectural approach. Where GPUs and TPUs schedule work dynamically across parallel cores, the LPU uses a sequential, compiled execution model: the entire computation graph is compiled ahead of time, and execution is completely deterministic with no runtime scheduling overhead.
Architecture
A Tensor Streaming Processor: functional units are laid out along a single stream, fed from 230 MB of on-chip SRAM, and data flows through at compiler-determined timing. No thread scheduling, no caches, no dynamic dispatch.
Performance
Hundreds of tokens per second for Llama 3 70B as of 2024, far exceeding single-H100 throughput at batch size one. Latency is low and, thanks to deterministic execution, consistent (great for interactive applications).
Limitations
Inference only (no training). Models must be compiled for the GroqChip architecture, and only supported model architectures run. On-chip SRAM is the only memory (no HBM), so large models are sharded across many chips.
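The scheduling contrast can be caricatured in a few lines: if the compiler assigns every op a fixed start cycle ahead of time, end-to-end latency is a constant known before the first run. This is a toy model of static scheduling, not Groq's actual toolchain:

```python
# Toy contrast with dynamic dispatch: a 'compiler' assigns every op a
# fixed start cycle once, so runtime has nothing left to decide.
# Op names and durations are illustrative, not any real ISA.

def compile_schedule(graph):
    """Assign each (op, duration) a start cycle at compile time.
    Total latency is known before the program ever runs -- the
    source of the LPU-style determinism described above."""
    schedule, cycle = [], 0
    for op, duration in graph:
        schedule.append((cycle, op))
        cycle += duration          # resource conflicts resolved here, once
    return schedule, cycle

graph = [("load_weights", 4), ("matmul", 10), ("softmax", 3), ("matmul2", 10)]
schedule, total = compile_schedule(graph)
assert total == 27                         # latency fixed statically
assert [c for c, _ in schedule] == [0, 4, 14, 17]
```

A GPU resolves the equivalent ordering at run time via warp schedulers and caches, which is flexible but makes per-request latency a distribution rather than a constant.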
Cerebras CS-3 — Wafer-Scale Computing
Cerebras takes a radical approach: instead of building a chip and connecting multiple chips, they build a chip the size of an entire 300mm silicon wafer. The CS-3 (2024) is one monolithic die containing 4 trillion transistors, 900,000 AI cores, and 44 GB of on-chip SRAM.
The key insight: inter-chip communication (NVLink, PCIe, high-speed Ethernet) is always slower and more power-hungry than on-chip interconnects. By eliminating chip boundaries entirely, Cerebras achieves essentially zero communication latency between all 900K cores.
The wafer-scale challenge: yield
Standard chip manufacturing expects a handful of defects per wafer; that is why dies are cut small, so that most dies escape defects entirely. Cerebras instead routes around defective cores at wafer level using redundant cores and programmable routing. This is a solved engineering problem for them, and a significant manufacturing moat.
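The yield argument follows from the classic Poisson defect model, where the probability of a defect-free die is e^(-D0·A). The defect density below is illustrative (real foundry values are proprietary):

```python
import math

def poisson_yield(defect_density_per_cm2, area_cm2):
    """Classic Poisson yield model: probability a die has zero defects."""
    return math.exp(-defect_density_per_cm2 * area_cm2)

D0 = 0.1          # defects/cm^2 -- illustrative, real values are proprietary
gpu_die = 8.0     # ~800 mm^2, roughly a reticle-limited GPU die
wafer = 460.0     # ~46,000 mm^2 of usable area on a 300 mm wafer

print(poisson_yield(D0, gpu_die))   # ~0.45: about half of large dies are perfect
print(poisson_yield(D0, wafer))     # ~1e-20: a perfect wafer essentially never exists
print(D0 * wafer)                   # ~46 expected defects per wafer
```

A perfect wafer is a statistical impossibility, but a few dozen dead cores out of 900,000 is negligible once the routing fabric can steer around them, which is exactly the redundancy strategy described above.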
Cerebras targets very large model training where memory bandwidth is the bottleneck. 44 GB of on-chip SRAM is tiny by HBM standards (H100 has 80 GB), but SRAM bandwidth is orders of magnitude higher than HBM. For certain workloads (very large batch sizes, specific model architectures), this can outperform H100 clusters.
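A simple roofline model shows why bandwidth, not peak FLOPS, decides this. Attainable throughput is min(peak, bandwidth × arithmetic intensity); the SRAM bandwidth figure below is a round stand-in for the PB/s-range numbers Cerebras quotes, and the H100 figures are its public BF16 peak and HBM3 bandwidth:

```python
def attainable_tflops(peak_tflops, bandwidth_tb_s, flops_per_byte):
    """Simple roofline model: a kernel is memory-bound whenever
    bandwidth * arithmetic_intensity falls below peak compute."""
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)

# A low-arithmetic-intensity kernel: 10 FLOPs per byte moved.
ai = 10
print(attainable_tflops(989, 3.35, ai))    # H100 + HBM3: ~33.5 TFLOPS, bandwidth-bound
print(attainable_tflops(989, 1000.0, ai))  # same peak fed by SRAM: full 989 TFLOPS
```

For high-intensity kernels (big dense matmuls) HBM is fine; it is the bandwidth-bound regimes where on-chip SRAM changes the outcome, which is the niche Cerebras targets.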
Tenstorrent — Open Architecture
Founded by Jim Keller (chip architect behind AMD Zen, Apple A4/A5, Tesla FSD chip), Tenstorrent builds AI accelerators on a RISC-V base with an open-architecture philosophy. Their Wormhole and Blackhole chips target datacenter inference and edge deployment.
Tenstorrent's differentiator is a software-first approach: their TT-Metalium framework gives developers low-level, open control over data movement and compute scheduling, positioning it as an open alternative to CUDA's closed stack. The chips are also physically tiled: multiple chips connect like puzzle pieces, with the network-on-chip extending across chip boundaries.
AWS Trainium / Inferentia
Amazon built custom silicon to reduce dependency on NVIDIA and lower training/inference costs for EC2 customers. Two separate chips serve two purposes:
AWS Trainium 2
Purpose: model training. Competitive BF16 throughput vs H100 at lower cost on EC2. Within a Trn2 instance, chips are connected by the high-bandwidth NeuronLink interconnect.
Ecosystem: AWS Neuron SDK + PyTorch/JAX integration. Growing support for popular architectures (Llama, Falcon, Stable Diffusion). Some custom ops still require porting.
AWS Inferentia 2
Purpose: inference at scale. Inf2 instances optimized for latency and throughput. Competitive cost-per-token for standard transformer architectures.
Used by Amazon internally for Alexa, recommendations, search ranking. Available as Inf2 EC2 instances; cost often 30–50% lower than comparable GPU-based inference.
Apple Neural Engine — On-Device Inference
Apple's Neural Engine (ANE) is a dedicated hardware block on every Apple Silicon chip (M-series Macs, A-series iPhones, iPads). It is not a general-purpose accelerator — it is specifically optimized for low-power, on-device inference using the CoreML framework.
Apple Silicon integrates CPU, GPU, Neural Engine, and unified memory on one die
The key architectural advantage of Apple Silicon for on-device AI is unified memory: CPU, GPU, and Neural Engine all share the same physical memory pool, eliminating PCIe copy overhead. A 7B parameter model in 4-bit quantization (~4 GB) fits in M4 unified memory and can run at usable speeds (10–30 tokens/second) using tools like Ollama or LM Studio.
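The ~4 GB figure is simple arithmetic: parameters × bits per weight, plus some overhead for quantization scales and runtime buffers (the 10% overhead used here is an assumption, not a measured number):

```python
def model_memory_gb(n_params, bits_per_weight, overhead=1.1):
    """Rough weight-memory footprint. The 10% overhead for quantization
    scales, zero-points, and runtime buffers is an assumption."""
    return n_params * bits_per_weight / 8 / 1e9 * overhead

print(model_memory_gb(7e9, 4))    # ~3.85 GB: the '~4 GB' figure above
print(model_memory_gb(7e9, 16))   # ~15.4 GB: why an FP16 7B model needs a bigger Mac
```

The same arithmetic explains model sizing on any device; unified memory just means the whole pool (not a separate VRAM partition) is available to hold the weights.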
38 TOPS sounds modest next to H100's 989 TFLOPS, but Apple achieves this at approximately 15W vs H100's 700W. For on-device inference, performance-per-watt and privacy (no data leaving the device) matter more than raw throughput.
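The performance-per-watt comparison is rough because the precisions differ (the ANE is rated in INT8 TOPS, while the H100 figure here is BF16 TFLOPS), but even as an order-of-magnitude sketch it shows why the ANE is viable inside a phone or laptop power budget:

```python
def perf_per_watt(ops_per_second, watts):
    """Throughput per watt. Only meaningful when the units and
    precisions match; here they do not, so treat results as rough."""
    return ops_per_second / watts

ane = perf_per_watt(38, 15)      # ~2.5 TOPS/W (M4-class, INT8, whole-SoC-ish power)
h100 = perf_per_watt(989, 700)   # ~1.4 TFLOPS/W (BF16, SXM TDP)
print(ane, h100)
```

The efficiency gap is smaller than the raw throughput gap suggests; what the datacenter chip cannot do is fit in a 15 W envelope at all, and that absolute power budget, plus privacy, is the on-device story.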
Accelerator Landscape Comparison
| Chip | Maker | Primary Use | Key Advantage | Key Limitation |
|---|---|---|---|---|
| H100 SXM5 | NVIDIA | Training + inference | Best ecosystem, CUDA | Expensive, supply-constrained |
| MI300X | AMD | Training + inference | 192 GB HBM3, high bandwidth | ROCm ecosystem smaller |
| TPU v5p | Google | Large-scale training | Pod scale, cost on GCP | GCP lock-in, JAX-first |
| LPU | Groq | Inference only | Fastest token throughput | No training, limited model support |
| CS-3 | Cerebras | Large model training | Zero inter-chip latency | Niche, expensive, small memory |
| Trainium 2 | AWS | Training (AWS) | Cost on EC2 vs GPU | AWS lock-in, ecosystem smaller |
| Neural Engine (M4) | Apple | On-device inference | Perf/watt, privacy, unified memory | CoreML only, no training |
Checklist: Do You Understand This?
- What is a systolic array, and how does it differ from a GPU's SIMT execution model?
- Why are TPUs only available through Google Cloud, and what training workloads are they best suited for?
- What is the core architectural innovation of Groq's LPU, and why does deterministic execution improve latency?
- What is wafer-scale integration, and what manufacturing challenge does Cerebras have to solve to make it viable?
- Why does Apple Silicon's unified memory architecture benefit on-device LLM inference specifically?
- When would you choose AWS Inferentia over H100 for inference workloads?
- What is the primary software ecosystem gap that prevents AMD MI300X from displacing NVIDIA for training?