
When to Self-Host Models

Self-hosting an open-weight model means running inference on your own hardware or cloud infrastructure. It is not always better — but in specific scenarios it is clearly the right choice. This page gives you the decision framework, cost math, and hardware guide to make that decision confidently.

The Five Triggers for Self-Hosting

1. Data Residency / Privacy

Data cannot leave your infrastructure: HIPAA PHI, classified data, client source code under NDA, unreleased financial data. Cloud APIs are off the table regardless of cost.

2. Cost at Scale

Token volumes where API pricing exceeds self-hosted infrastructure costs. Rough crossover: ~100M tokens/day for mid-tier models. At 1B tokens/day, self-hosting is dramatically cheaper.

3. Latency

Applications requiring sub-100ms first token (on-device voice pipelines, kiosk applications, offline use). Network round-trips to cloud APIs add 50–300ms.

4. Customisation

Domain-specific fine-tuning that changes model behaviour in ways cloud APIs cannot accommodate. Self-hosting is required to serve custom fine-tuned weights.

5. Offline / Disconnected Environments

Edge deployments, industrial systems, aircraft, submarines, remote field operations. No internet connectivity means no cloud API.

Cost Break-Even Analysis

The key calculation: at what query volume is self-hosted infrastructure cheaper than API pricing?

| Setup | One-time hardware cost | Monthly infra cost | Effective cost / 1M tokens at 100M tokens/day |
|---|---|---|---|
| RTX 4090 (24GB) PC for 70B (Q4, RAM offload) | ~$2,500–4,000 | ~$50 (power, cooling) | ~$0.02–0.05 |
| 2× A100 80GB cloud instance | $0 | ~$4,000–8,000 (cloud GPU) | ~$1.33–2.67 |
| GPT-4o API (direct) | $0 | ~$7,500 (at 100M tokens/day) | $2.50 |
| Llama 3.1 70B via Together.ai | $0 | ~$2,640 (at 100M tokens/day) | $0.88 |

Self-hosting beats cloud API pricing significantly at scale, but you trade infrastructure management complexity for cost savings. Cloud GPU instances (rented A100s) bridge the gap — you get self-hosted privacy without buying hardware.
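The break-even arithmetic above can be sketched as a small calculator. The prices mirror the table and the 36-month amortisation window is an illustrative assumption, not a quote:

```python
# Sketch: at what daily token volume does self-hosting become cheaper
# than a cloud API? Prices are illustrative assumptions from the table.

def monthly_cost_api(tokens_per_day: float, price_per_1m: float) -> float:
    """API cost scales linearly with volume (30-day month)."""
    return tokens_per_day * 30 / 1_000_000 * price_per_1m

def monthly_cost_selfhost(hardware_cost: float, monthly_infra: float,
                          amortisation_months: int = 36) -> float:
    """Self-hosted cost is flat: amortised hardware plus power/cooling."""
    return hardware_cost / amortisation_months + monthly_infra

def breakeven_tokens_per_day(hardware_cost: float, monthly_infra: float,
                             price_per_1m: float,
                             amortisation_months: int = 36) -> float:
    """Daily volume above which self-hosting is cheaper than the API."""
    fixed = monthly_cost_selfhost(hardware_cost, monthly_infra,
                                  amortisation_months)
    return fixed / (30 * price_per_1m / 1_000_000)

# RTX 4090 rig (~$3,250 + $50/month) vs. Llama 3.1 70B API at $0.88/1M:
be = breakeven_tokens_per_day(3_250, 50, 0.88)
print(f"Break-even: {be / 1e6:.1f}M tokens/day")  # → Break-even: 5.3M tokens/day
```

Note how low the break-even sits against open-weight API pricing: a few million tokens per day, consistent with the ~10M tokens/day rule of thumb later in this page once you add operational overhead.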

Hardware Requirements by Model Size

| Model size | Quantisation | Min VRAM (GPU) | Consumer hardware options |
|---|---|---|---|
| 7B | Q4 (4-bit) | 6 GB | RTX 3060, M1/M2 16GB |
| 7B | FP16 (full) | 16 GB | RTX 4080 (16GB), M2 Pro 32GB |
| 13B / 14B | Q4 | 10 GB | RTX 3080, M2 Pro |
| 70B | Q4 | 40 GB | RTX 4090 (24GB) + RAM offload, M2 Ultra 96GB |
| 70B | FP16 | 140 GB | 2× A100 80GB, Mac Studio M2 Ultra 192GB |

Apple Silicon Advantage

Apple M-series chips have a unique advantage for local inference: unified memory architecture — CPU and GPU share the same high-bandwidth memory. This means:

  • A MacBook Pro M3 Max (128GB) can run 70B models at Q8 (~70GB) with negligible quality loss; FP16 at ~140GB still needs a 192GB M2 Ultra
  • No GPU VRAM limit separate from system RAM
  • llama.cpp and Ollama use Metal GPU acceleration natively
  • Efficient power usage — fan-free inference for 7B and smaller models

For individual developers and small teams, a Mac Studio or Mac Pro with M-series chip is often the most cost-effective and capable local inference machine.

Quantisation: Bigger Models in Less VRAM

Quantisation reduces the precision of model weights to shrink VRAM requirements:

  • FP16 (half precision) — Full quality; 2 bytes per parameter. A 70B model needs ~140GB VRAM.
  • INT8 (Q8) — Minimal quality loss (<1% on most benchmarks); 1 byte per parameter. 70B needs ~70GB.
  • INT4 (Q4) — Small but noticeable quality loss on hard tasks; ~0.5 bytes per parameter. 70B needs ~35GB. Most popular for consumer use.
  • GGUF — The file format used by llama.cpp/Ollama for quantised models; supports mixed quantisation per layer (Q4_K_M is the most popular balanced option).
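The bytes-per-parameter figures above translate directly into a quick VRAM estimator. A minimal sketch, weights only by default (the `overhead` knob for KV cache, activations, and runtime buffers is an assumption that grows with context length and batch size):

```python
# Sketch: rough VRAM estimate from parameter count and quantisation level.
# Bytes-per-parameter values mirror the list above; 1B params at 1 byte/param
# is ~1 GB of weights.

BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def vram_gb(params_billions: float, quant: str, overhead: float = 0.0) -> float:
    """Weight memory in GB, optionally padded for KV cache and activations."""
    weights_gb = params_billions * BYTES_PER_PARAM[quant]
    return weights_gb * (1 + overhead)

for size, quant in [(7, "q4"), (70, "q4"), (70, "fp16")]:
    print(f"{size}B {quant}: ~{vram_gb(size, quant):.1f} GB weights")
```

Running this reproduces the headline numbers in the list (~35 GB for 70B Q4, ~140 GB for 70B FP16); real deployments need headroom on top, which is why the hardware table quotes 40 GB minimum for 70B Q4.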

When NOT to Self-Host

  • Frontier model capability required: GPT-5, Claude Opus, Gemini Ultra are not available as open weights. If you need frontier quality, API is your only option.
  • Team lacks infrastructure skills: Running vLLM in production requires MLOps expertise. The "savings" evaporate if you need to hire specialists.
  • Low volume / prototyping: Below ~10M tokens/day, API pricing is almost always more cost-effective when you include hardware amortisation.
  • Rapid iteration required: New model releases happen every few weeks. Self-hosted means you manage upgrades; cloud APIs update automatically.

Hybrid Strategy

Most production teams end up with a hybrid:

  1. Local 7B model for classification, routing, simple extraction — zero API cost
  2. Cloud API (flagship tier) for complex generation, coding, analysis
  3. Cloud API (reasoning tier) for hard maths, logic, planning — on demand only

Route at inference time based on query complexity — a simple heuristic or small classifier can direct 60–80% of queries to the cheap local model, dramatically reducing total API spend.
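The routing heuristic above can be as simple as a keyword-and-length check. A hedged sketch; the tier names, keyword lists, and length threshold are all illustrative assumptions you would tune on your own traffic:

```python
# Sketch: inference-time routing. Simple queries go to the local 7B model;
# the rest escalate to a cloud flagship or reasoning tier. Keyword lists
# and thresholds are illustrative, not a recommendation.

REASONING_HINTS = {"prove", "derive", "optimise", "step by step", "plan"}
COMPLEX_HINTS = {"write", "refactor", "summarise", "analyse", "code"}

def route(query: str) -> str:
    """Return the tier that should handle this query."""
    q = query.lower()
    if any(hint in q for hint in REASONING_HINTS):
        return "cloud-reasoning"   # hard maths, logic, planning
    if len(q.split()) > 60 or any(hint in q for hint in COMPLEX_HINTS):
        return "cloud-flagship"    # complex generation, coding, analysis
    return "local-7b"              # classification, routing, extraction

print(route("Which category does this ticket belong to: billing or sales?"))
print(route("Write a script that refactors these modules"))
print(route("Prove that this schedule is feasible, step by step"))
```

In production you would replace the keyword sets with a small trained classifier, but even this crude version illustrates the shape: the cheap local path is the default, and escalation is the exception.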

Checklist: Do You Understand This?

  • What are the five main triggers for choosing self-hosted inference over cloud APIs?
  • At approximately what daily token volume does self-hosting start to beat cloud API pricing?
  • What is the Apple Silicon unified memory architecture advantage for local inference?
  • What is quantisation, and what is the trade-off between Q4 and FP16?
  • Name two scenarios where self-hosting is the wrong choice despite potential cost savings.
  • Describe a hybrid routing strategy that balances cost and quality.