Local & Self-Hosted Inference
Running models locally means no per-token costs, full data privacy, and offline capability. The trade-off is an upfront hardware investment and a smaller set of usable models. This section covers the when and how of self-hosted inference, from a developer laptop to production servers.
In This Section
When to Self-Host
The decision framework — data residency requirements, cost at scale, latency constraints, and the hardware math that determines whether local inference makes sense.
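The hardware math often reduces to a break-even calculation: how many months of API spend does the hardware cost amortize against? A minimal sketch, using hypothetical prices (the per-token rate, hardware cost, and power cost below are placeholder assumptions, not quotes — substitute your own numbers):

```python
# Break-even sketch for self-hosting vs. a hosted API.
# All three constants are illustrative assumptions, not real prices.
API_COST_PER_1M_TOKENS = 3.00   # assumed blended $/1M tokens for a hosted API
HARDWARE_COST = 2500.00         # assumed one-time cost of a GPU workstation
POWER_COST_PER_MONTH = 30.00    # assumed electricity for running it

def breakeven_months(tokens_per_month: float) -> float:
    """Months until self-hosting is cheaper than the hosted API."""
    api_monthly = tokens_per_month / 1_000_000 * API_COST_PER_1M_TOKENS
    monthly_savings = api_monthly - POWER_COST_PER_MONTH
    if monthly_savings <= 0:
        return float("inf")  # usage too low: the API stays cheaper forever
    return HARDWARE_COST / monthly_savings

# At 100M tokens/month the API would cost $300/mo; saving $270/mo,
# the hardware pays for itself in just over nine months.
print(round(breakeven_months(100_000_000), 1))  # → 9.3
```

The shape of the result is what matters: below some monthly volume the break-even horizon is infinite, and above it the payback period shrinks quickly — which is why the decision hinges on sustained usage, not peak usage.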
Ollama: Local Model Serving
The easiest way to run open-weight models locally — setup, model library, API compatibility, and practical performance expectations across hardware.
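The API compatibility mentioned above means a running Ollama instance can be called over plain HTTP. A minimal sketch against Ollama's documented `/api/generate` endpoint on its default port 11434 (the model name `llama3` is an assumption — use whatever model you have pulled):

```python
import json
import urllib.request

def build_generate_request(prompt: str, model: str = "llama3") -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": model,     # assumes this model was pulled via `ollama pull`
        "prompt": prompt,
        "stream": False,    # ask for one complete JSON reply, not a stream
    }

def ask_ollama(prompt: str, model: str = "llama3") -> str:
    """Send a prompt to a local Ollama server and return the completion text."""
    payload = json.dumps(build_generate_request(prompt, model)).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",  # Ollama's default address
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the interface is just HTTP and JSON, swapping a hosted API for a local model is often a one-line change to the base URL rather than a rewrite.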
LM Studio & Alternatives
LM Studio for desktop GUI inference, plus vLLM and llama.cpp for production workloads — when to use which tool.