LM Studio & Local Inference Tools
Ollama is the easiest path to local inference, but it's not the only one. LM Studio provides a richer desktop experience, vLLM powers high-throughput production serving, and llama.cpp sits underneath everything as the portable inference engine. This page covers the full toolkit and when to use each.
LM Studio: Desktop GUI for Local Models
LM Studio is a free desktop application (macOS, Windows, Linux) for discovering, downloading, and running GGUF models locally. It targets non-technical users who want a ChatGPT-like experience without sending data to the cloud.
LM Studio strengths
- Visual model browser (search Hugging Face, filter by size)
- Download manager with quantisation selection
- Chat UI with system prompt and parameter controls
- Local server at localhost:1234 (OpenAI-compatible)
- GGUF format; uses llama.cpp under the hood
- GPU and CPU inference; auto-detects hardware
LM Studio limitations
- Single user — not designed for multi-user concurrent workloads
- Only GGUF format (no PyTorch/Safetensors weights)
- No batch inference, streaming API, or fine-tuning
- Desktop app — not suitable for server deployments
Best use case: Individuals who want to experiment with local models via a polished UI without using the command line. Good for non-technical team members trying AI workflows with sensitive data.
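Because LM Studio's local server speaks the OpenAI API, any HTTP client can talk to it. A minimal sketch using only the Python standard library (assumes the server is running on the default port 1234 with a model loaded; the `local-model` name is a placeholder — LM Studio routes requests to whichever model you have loaded):

```python
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_payload(prompt: str, model: str = "local-model") -> dict:
    """Standard OpenAI chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

def chat(prompt: str) -> str:
    """POST the prompt to LM Studio's local server and return the reply text."""
    req = urllib.request.Request(
        LMSTUDIO_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The same payload shape works against Ollama, vLLM, and TGI endpoints, which is what makes these tools interchangeable from the client's point of view.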
GGUF Format
GGUF is the single-file model format used by llama.cpp, Ollama, and LM Studio for quantised model weights; it replaced the older GGML format. Key characteristics:
- Single file containing model weights + metadata + tokenizer
- Supports mixed-precision quantisation (different layers at different precision)
- Efficient for CPU inference and can offload layers to GPU as available
- Most popular quantisations: `Q4_K_M` (recommended default), `Q8_0` (near-full quality)
- Files available on Hugging Face (search "GGUF" for any model)
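To make "single file containing weights + metadata" concrete: every GGUF file starts with a fixed header — the magic bytes `GGUF`, a format version, a tensor count, and a metadata key/value count — before the metadata and tensor data. A sketch that parses just that header (the synthetic byte string stands in for a real file, and the counts in it are made up for illustration):

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    """Read the fixed-size GGUF header: magic, version, tensor count, KV count.
    All fields are little-endian per the GGUF spec."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": n_tensors,
        "metadata_kv_count": n_kv,
    }

# Synthetic header for demonstration; a real .gguf file begins the same way.
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(parse_gguf_header(header))
# {'version': 3, 'tensor_count': 291, 'metadata_kv_count': 24}
```

The metadata block that follows the header carries the tokenizer, architecture, and quantisation details, which is why a GGUF file is self-contained and needs no companion config files.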
llama.cpp: The Portable Inference Engine
llama.cpp is the C++ library that powers both Ollama and LM Studio under the hood. You can use it directly for maximum control:
- Runs on CPU (no GPU required) — portable to any Linux/macOS/Windows machine
- Uses Metal (Apple), CUDA (NVIDIA), or ROCm (AMD) when available
- Python bindings via `llama-cpp-python`
- OpenAI-compatible server: `python -m llama_cpp.server --model model.gguf`
- Supports multi-GPU layer offloading
Best use case: Embedding local inference in a Python application, running on servers without Docker/Ollama infrastructure, or when you need fine-grained control over inference parameters.
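A sketch of embedding inference directly via `llama-cpp-python` (assumes `pip install llama-cpp-python` and a GGUF file on disk; the parameter values are illustrative defaults, not required settings):

```python
def run_local_completion(model_path: str, prompt: str, max_tokens: int = 128) -> str:
    """Load a GGUF model and run one chat completion in-process."""
    # Imported lazily so the rest of the application works without the package.
    from llama_cpp import Llama

    llm = Llama(
        model_path=model_path,
        n_ctx=4096,        # context window in tokens
        n_gpu_layers=-1,   # offload all layers to GPU when one is available
    )
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return out["choices"][0]["message"]["content"]
```

Loading the model once and reusing the `Llama` object across requests is the key design point: model load is the expensive step, while each completion afterwards reuses the weights already in memory.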
vLLM: Production High-Throughput Serving
vLLM is the production inference server of choice for teams serving open-weight models to many concurrent users at scale. Its key innovation is PagedAttention, a memory-management technique that stores the KV cache in fixed-size pages (analogous to OS virtual memory), eliminating fragmentation and allowing far more requests to be batched onto the same GPU.
| Feature | Ollama | vLLM |
|---|---|---|
| Target user | Individual developers | Production teams |
| Concurrent users | 1–3 (sequential) | Hundreds (continuous batching) |
| Throughput | Moderate | 3–5× higher via PagedAttention |
| Model format | GGUF (quantised) | HuggingFace weights (FP16/BF16/INT8) |
| Setup complexity | Simple (one command) | Moderate (Python, GPU drivers) |
| API compatibility | OpenAI-compatible | OpenAI-compatible |
Install and serve Llama 3.1 8B with vLLM:

```shell
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype auto \
  --api-key token-abc123
```

vLLM's OpenAI-compatible server means existing clients work without code changes. Best for internal APIs serving a team, or external APIs with moderate user loads.
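To illustrate the "no code changes" claim, here is a sketch that points the official `openai` Python client at a local vLLM server started with the command above (assumes `pip install openai`; the base URL uses vLLM's default port 8000, and the API key matches the `--api-key` flag):

```python
def ask(prompt: str) -> str:
    """Send one chat request to a local vLLM server via the standard OpenAI client."""
    # Imported lazily so this sketch stays self-contained.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # vLLM's default listen address
        api_key="token-abc123",               # must match the --api-key flag
    )
    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # must match --model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Swapping between a cloud provider and local vLLM is then a matter of changing `base_url` and `api_key`, not rewriting client code.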
Hugging Face TGI
Text Generation Inference (TGI) from Hugging Face is an alternative to vLLM for production serving, packaged as a Docker container:
```shell
docker run --gpus all --shm-size 1g \
  -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct
```

TGI is well integrated with the Hugging Face ecosystem and supports most model architectures on the Hub. Choose vLLM for maximum throughput; TGI for tighter HF integration and simpler Docker-based deployment.
Tool Selection Matrix
| Scenario | Recommended tool |
|---|---|
| Individual developer experimenting with models | Ollama |
| Non-technical user wanting a chat UI | LM Studio |
| Production API serving 50+ concurrent users | vLLM |
| Team using Docker-based infrastructure | HF TGI or vLLM Docker |
| Embedding inference in a Python application | llama-cpp-python |
| CPU-only deployment (no GPU) | Ollama or llama.cpp |
| Maximum throughput on large GPU cluster | vLLM with tensor parallelism |
Checklist: Do You Understand This?
- What is GGUF format and which tools use it?
- When is LM Studio the right tool compared to Ollama?
- What is PagedAttention and why does it give vLLM higher throughput?
- What is the main difference between vLLM and Hugging Face TGI?
- For a production API serving 100+ concurrent users, which tool would you choose?