Local & Self-Hosted Inference
Running models locally means no per-token costs, full data privacy, and offline capability. The trade-off is an upfront hardware investment and a smaller set of usable models. This section covers the when and how of self-hosted inference, from a developer laptop to production servers.
In This Section
When to Self-Host
The decision framework — data residency requirements, cost at scale, latency constraints, and the hardware math that determines whether local inference makes sense.
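The hardware math often reduces to a break-even calculation: how many months of API spend does the hardware cost amortize against? A minimal sketch, using hypothetical prices (the per-token rate, hardware cost, and power cost below are placeholder assumptions, not quotes — substitute your own numbers):

```python
# Break-even sketch for self-hosting vs. a hosted API.
# All three constants are illustrative assumptions, not real prices.
API_COST_PER_1M_TOKENS = 3.00   # assumed blended $/1M tokens for a hosted API
HARDWARE_COST = 2500.00         # assumed one-time cost of a GPU workstation
POWER_COST_PER_MONTH = 30.00    # assumed electricity for running it

def breakeven_months(tokens_per_month: float) -> float:
    """Months until self-hosting is cheaper than the hosted API."""
    api_monthly = tokens_per_month / 1_000_000 * API_COST_PER_1M_TOKENS
    monthly_savings = api_monthly - POWER_COST_PER_MONTH
    if monthly_savings <= 0:
        return float("inf")  # usage too low: the API stays cheaper forever
    return HARDWARE_COST / monthly_savings

# At 100M tokens/month the API would cost $300/mo; saving $270/mo,
# the hardware pays for itself in just over nine months.
print(round(breakeven_months(100_000_000), 1))  # → 9.3
```

The shape of the result is what matters: below some monthly volume the break-even horizon is infinite, and above it the payback period shrinks quickly — which is why the decision hinges on sustained usage, not peak usage.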
Ollama: Local Model Serving
The easiest way to run open-weight models locally — setup, model library, API compatibility, and practical performance expectations across hardware.
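The API compatibility mentioned above means a running Ollama instance can be called over plain HTTP. A minimal sketch against Ollama's documented `/api/generate` endpoint on its default port 11434 (the model name `llama3` is an assumption — use whatever model you have pulled):

```python
import json
import urllib.request

def build_generate_request(prompt: str, model: str = "llama3") -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": model,     # assumes this model was pulled via `ollama pull`
        "prompt": prompt,
        "stream": False,    # ask for one complete JSON reply, not a stream
    }

def ask_ollama(prompt: str, model: str = "llama3") -> str:
    """Send a prompt to a local Ollama server and return the completion text."""
    payload = json.dumps(build_generate_request(prompt, model)).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",  # Ollama's default address
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the interface is just HTTP and JSON, swapping a hosted API for a local model is often a one-line change to the base URL rather than a rewrite.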
LM Studio & Alternatives
LM Studio for desktop GUI inference, plus vLLM and llama.cpp for production workloads — when to use which tool.