When to Self-Host Models
Self-hosting an open-weight model means running inference on your own hardware or cloud infrastructure. It is not always the better option, but in specific scenarios it is clearly the right choice. This page gives you the decision framework, cost maths, and hardware guide to make that decision confidently.
The Five Triggers for Self-Hosting
1. Data Residency / Privacy
Data cannot leave your infrastructure: HIPAA PHI, classified data, client source code under NDA, unreleased financial data. Cloud APIs are off the table regardless of cost.
2. Cost at Scale
Token volumes where API pricing exceeds self-hosted infrastructure costs. The crossover for mid-tier models typically falls in the tens of millions of tokens/day, depending on hardware amortisation and utilisation. At 1B tokens/day, self-hosting is dramatically cheaper.
3. Latency
Applications requiring sub-100ms first token (on-device voice pipelines, kiosk applications, offline use). Network round-trips to cloud APIs add 50–300ms.
4. Customisation
Domain-specific fine-tuning that changes model behaviour in ways cloud APIs cannot accommodate. Self-hosting is required to serve custom fine-tuned weights.
5. Offline / Disconnected Environments
Edge deployments, industrial systems, aircraft, submarines, remote field operations. No internet connectivity means no cloud API.
Cost Break-Even Analysis
The key calculation: at what query volume is self-hosted infrastructure cheaper than API pricing?
| Setup | One-time hardware cost | Monthly infra cost | Effective cost/1M tokens at 100M/day |
|---|---|---|---|
| RTX 4090 (24GB) PC for 70B model (Q4, partial RAM offload) | ~$2,500–4,000 | ~$50 (power, cooling) | ~$0.02–0.05 |
| 2× A100 80GB cloud instance | $0 | ~$4,000–8,000 (cloud GPU) | ~$0.04–0.08 |
| GPT-4o API (direct) | $0 | ~$7,500 (at 100M tokens/day) | $2.50 |
| Llama 3.1 70B via Together.ai | $0 | ~$2,640 (at 100M tokens/day) | $0.88 |
Self-hosting beats cloud API pricing significantly at scale, but you take on infrastructure management complexity in exchange for the savings. Note that the per-token figures assume sustained full utilisation; a single consumer GPU cannot actually serve 100M tokens/day from a 70B model, so multiple cards or a smaller model are implied at that volume. Cloud GPU instances (rented A100s) bridge the gap: you get self-hosted privacy without buying hardware.
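The break-even arithmetic above can be sketched in a few lines. The hardware price, amortisation window, and API rate below are illustrative assumptions, not vendor quotes:

```python
# Break-even sketch: at what daily token volume does self-hosting
# beat a pay-per-token API? All numbers here are illustrative.

def monthly_self_hosted_cost(hardware_cost: float,
                             amortisation_months: int,
                             monthly_infra: float) -> float:
    """Hardware amortised linearly, plus recurring power/cooling."""
    return hardware_cost / amortisation_months + monthly_infra

def breakeven_tokens_per_day(monthly_cost: float,
                             api_price_per_1m: float) -> float:
    """Daily token volume at which API spend equals self-hosted spend."""
    monthly_tokens = monthly_cost / api_price_per_1m * 1_000_000
    return monthly_tokens / 30

# Example: $3,000 GPU amortised over 36 months, $50/month power,
# compared against an API priced at $0.88 per 1M tokens.
cost = monthly_self_hosted_cost(3_000, 36, 50)   # ≈ $133/month
volume = breakeven_tokens_per_day(cost, 0.88)
print(f"Break-even at ~{volume / 1e6:.1f}M tokens/day")
```

Run with your own hardware quote and the API rate you actually pay; the result is sensitive to both, which is why published crossover figures vary so widely.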
Hardware Requirements by Model Size
| Model size | Quantisation | Min VRAM (GPU) | Consumer hardware options |
|---|---|---|---|
| 7B | Q4 (4-bit) | 6 GB | RTX 3060, M1/M2 16GB |
| 7B | FP16 (full) | 16 GB | RTX 4080 (16GB), M2 Pro 32GB |
| 13B / 14B | Q4 | 10 GB | RTX 3080, M2 Pro |
| 70B | Q4 | 40 GB | RTX 4090 (24GB) + RAM offload, M2 Max 96GB |
| 70B | FP16 | 140 GB | 2× A100 80GB, Mac Studio M2 Ultra 192GB |
Apple Silicon Advantage
Apple M-series chips have a unique advantage for local inference: unified memory architecture — CPU and GPU share the same high-bandwidth memory. This means:
- A MacBook Pro M3 Max (128GB) can run 70B models at Q8 with negligible quality loss (FP16 70B needs ~140GB, beyond any laptop)
- No GPU VRAM limit separate from system RAM
- llama.cpp and Ollama use Metal GPU acceleration natively
- Efficient power draw: 7B and smaller models run near-silently (fanless on MacBook Air)
For individual developers and small teams, a Mac Studio or Mac Pro with M-series chip is often the most cost-effective and capable local inference machine.
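As a concrete example of the local workflow this section describes, here is a minimal sketch of calling a locally running Ollama server from Python. It assumes Ollama is installed, listening on its default port (11434), and that a model tagged `llama3.1` has already been pulled:

```python
# Minimal sketch of querying a local Ollama server (no third-party
# dependencies). Model name and port are the Ollama defaults.
import json
import urllib.request

def build_request(prompt: str, model: str = "llama3.1") -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, host: str = "http://localhost:11434") -> str:
    """POST the prompt and return the model's full response text."""
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama daemon):
#   print(generate("Summarise unified memory in one sentence."))
```

On Apple Silicon, Ollama uses Metal acceleration automatically; no configuration is needed beyond pulling the model.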
Quantisation: Bigger Models in Less VRAM
Quantisation reduces the precision of model weights to shrink VRAM requirements:
- FP16 (half precision) — Full quality; 2 bytes per parameter. A 70B model needs ~140GB VRAM.
- INT8 (Q8) — Minimal quality loss (<1% on most benchmarks); 1 byte per parameter. 70B needs ~70GB.
- INT4 (Q4) — Small but noticeable quality loss on hard tasks; ~0.5 bytes per parameter. 70B needs ~35GB. Most popular for consumer use.
- GGUF — The file format used by llama.cpp/Ollama for quantised models; supports mixed quantisation per layer (Q4_K_M is the most popular balanced option).
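The bytes-per-parameter figures above reduce to a simple rule-of-thumb VRAM estimate. This covers the weights only; KV cache and runtime overhead come on top, so treat the result as a floor:

```python
# Rule-of-thumb VRAM for model weights, from bytes per parameter.
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def weight_vram_gb(params_billions: float, quant: str) -> float:
    """GB consumed by the weights alone at the given quantisation."""
    return params_billions * BYTES_PER_PARAM[quant]

for quant in ("fp16", "q8", "q4"):
    print(f"70B @ {quant}: ~{weight_vram_gb(70, quant):.0f} GB")
```

This reproduces the 140 / 70 / 35 GB figures for a 70B model quoted above.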
When NOT to Self-Host
- Frontier model capability required: GPT-5, Claude Opus, Gemini Ultra are not available as open weights. If you need frontier quality, API is your only option.
- Team lacks infrastructure skills: Running vLLM in production requires MLOps expertise. The "savings" evaporate if you need to hire specialists.
- Low volume / prototyping: Below ~10M tokens/day, API pricing is almost always more cost-effective when you include hardware amortisation.
- Rapid iteration required: New model releases happen every few weeks. Self-hosted means you manage upgrades; cloud APIs update automatically.
Hybrid Strategy
Most production teams end up with a hybrid:
- Local 7B model for classification, routing, simple extraction — zero API cost
- Cloud API (flagship tier) for complex generation, coding, analysis
- Cloud API (reasoning tier) for hard maths, logic, planning — on demand only
Route at inference time based on query complexity — a simple heuristic or small classifier can direct 60–80% of queries to the cheap local model, dramatically reducing total API spend.
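A minimal version of such a heuristic router might look like this. The keyword list and length threshold are illustrative assumptions; production systems often replace them with a small trained classifier:

```python
# Toy complexity router: cheap local model for simple queries,
# cloud tiers for everything else. Thresholds are illustrative.
HARD_KEYWORDS = ("prove", "derive", "refactor", "step by step", "plan")

def route(query: str) -> str:
    """Return the tier a query should be served by."""
    q = query.lower()
    if any(k in q for k in HARD_KEYWORDS):
        return "cloud-reasoning"   # hard maths, logic, planning
    if len(q.split()) > 60:
        return "cloud-flagship"    # long, complex generation
    return "local-7b"              # classification, routing, extraction

print(route("Which category does this support ticket belong to?"))  # local-7b
print(route("Prove the algorithm terminates."))  # cloud-reasoning
```

Even a crude rule like this captures the bulk of the savings, because simple classification and extraction queries dominate most production traffic.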
Checklist: Do You Understand This?
- What are the five main triggers for choosing self-hosted inference over cloud APIs?
- At approximately what daily token volume does self-hosting start to beat cloud API pricing?
- What is the Apple Silicon unified memory architecture advantage for local inference?
- What is quantisation, and what is the trade-off between Q4 and FP16?
- Name two scenarios where self-hosting is the wrong choice despite potential cost savings.
- Describe a hybrid routing strategy that balances cost and quality.