🧠 All Things AI
Intermediate

LM Studio & Local Inference Tools

Ollama is the easiest path to local inference, but it's not the only one. LM Studio provides a richer desktop experience, vLLM powers high-throughput production serving, and llama.cpp sits underneath everything as the portable inference engine. This page covers the full toolkit and when to use each.

LM Studio: Desktop GUI for Local Models

LM Studio is a free desktop application (macOS, Windows, Linux) for discovering, downloading, and running GGUF models locally. It targets non-technical users who want a ChatGPT-like experience without sending data to the cloud.

LM Studio strengths

  • Visual model browser (search Hugging Face, filter by size)
  • Download manager with quantisation selection
  • Chat UI with system prompt and parameter controls
  • Local server at localhost:1234 (OpenAI-compatible)
  • GGUF format; uses llama.cpp under the hood
  • GPU and CPU inference; auto-detects hardware
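The local server mentioned above speaks the OpenAI chat-completions protocol, so any HTTP client works. A minimal sketch, assuming LM Studio's server is running on its default port 1234; the model name "local-model" is a placeholder (LM Studio accepts whatever model is currently loaded):

```python
# Sketch: calling LM Studio's OpenAI-compatible local server with the
# standard library only. Port 1234 is LM Studio's default; "local-model"
# is a placeholder name.
import json
import urllib.request


def chat_request(prompt, model="local-model", temperature=0.7):
    """Build an OpenAI-style chat-completion payload for the local server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


if __name__ == "__main__":
    payload = chat_request("Summarise GGUF in one sentence.")
    req = urllib.request.Request(
        "http://localhost:1234/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, the same payload works unchanged against Ollama or vLLM by swapping the URL.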

LM Studio limitations

  • Single user — not designed for multi-user concurrent workloads
  • Only GGUF format (no PyTorch/Safetensors weights)
  • No batch inference, streaming API, or fine-tuning
  • Desktop app — not suitable for server deployments

Best use case: Individuals who want to experiment with local models via a polished UI without using the command line. Good for non-technical team members trying AI workflows with sensitive data.

GGUF Format

GGUF is the successor to the older GGML file format and is used by llama.cpp, Ollama, and LM Studio for quantised model weights. Key characteristics:

  • Single file containing model weights + metadata + tokenizer
  • Supports mixed-precision quantisation (different layers at different precision)
  • Efficient for CPU inference and can offload layers to GPU as available
  • Most popular quantisation: Q4_K_M (recommended default), Q8_0 (near-full quality)
  • Files available on Hugging Face (search "GGUF" for any model)
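The "single file" property is easy to verify: a GGUF file begins with a fixed binary header (magic bytes "GGUF", format version, tensor count, metadata key-value count), followed by the metadata and weights. A minimal sketch of parsing that header, based on the GGUF spec's layout; the dictionary field names are my own:

```python
# Sketch: reading the fixed 24-byte GGUF header. Layout per the GGUF
# spec: 4-byte magic "GGUF", uint32 version, uint64 tensor count,
# uint64 metadata key-value count (all little-endian).
import struct


def parse_gguf_header(data: bytes) -> dict:
    """Unpack the fixed header fields from the first 24 bytes of a GGUF file."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    (version,) = struct.unpack_from("<I", data, 4)
    tensor_count, kv_count = struct.unpack_from("<QQ", data, 8)
    return {"version": version, "tensors": tensor_count, "metadata_kvs": kv_count}
```

Usage: `parse_gguf_header(open("model.gguf", "rb").read(24))` is a quick sanity check that a downloaded file really is GGUF before handing it to a runtime.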

llama.cpp: The Portable Inference Engine

llama.cpp is the C++ inference library that both Ollama and LM Studio build on. You can also use it directly for maximum control:

  • Runs on CPU (no GPU required) — portable to any Linux/macOS/Windows machine
  • Uses Metal (Apple), CUDA (NVIDIA), or ROCm (AMD) when available
  • Python bindings via llama-cpp-python
  • OpenAI-compatible server: python -m llama_cpp.server --model model.gguf
  • Supports multi-GPU offloading

Best use case: Embedding local inference in a Python application, running on servers without Docker/Ollama infrastructure, or when you need fine-grained control over inference parameters.

vLLM: Production High-Throughput Serving

vLLM is the production inference server of choice for teams serving open-weight models to many concurrent users at scale. Its key innovation is PagedAttention, a memory-management technique that dramatically increases throughput.

Feature           | Ollama                | vLLM
Target user       | Individual developers | Production teams
Concurrent users  | 1–3 (sequential)      | Hundreds (continuous batching)
Throughput        | Moderate              | 3–5× higher via PagedAttention
Model format      | GGUF (quantised)      | Hugging Face weights (FP16/BF16/INT8)
Setup complexity  | Simple (one command)  | Moderate (Python, GPU drivers)
API compatibility | OpenAI-compatible     | OpenAI-compatible
# Install and serve Llama 3.1 8B with vLLM:
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype auto \
  --api-key token-abc123

vLLM's OpenAI-compatible server means existing clients work without code changes. Best for internal APIs serving a team, or external APIs with moderate user loads.
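A minimal client sketch, assuming the vLLM server launched above is listening on its default port 8000 with the API key from the launch command:

```python
# Sketch: pointing the official openai client at a local vLLM server.
# Port 8000 is vLLM's default; the API key matches the launch command.

def client_config(host="localhost", port=8000, api_key="token-abc123"):
    """Connection settings for an OpenAI-compatible vLLM endpoint."""
    return {"base_url": f"http://{host}:{port}/v1", "api_key": api_key}


if __name__ == "__main__":
    from openai import OpenAI  # pip install openai

    client = OpenAI(**client_config())
    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Hello from vLLM."}],
    )
    print(resp.choices[0].message.content)
```

Only `base_url` and `api_key` differ from a cloud OpenAI setup, which is what "existing clients work without code changes" means in practice.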

Hugging Face TGI

Text Generation Inference (TGI) from Hugging Face is an alternative to vLLM for production serving, packaged as a Docker container:

docker run --gpus all --shm-size 1g \
  -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct

TGI is well integrated with the Hugging Face ecosystem and supports most model architectures hosted on the Hub. Choose vLLM for maximum throughput; choose TGI for tighter HF integration and simpler Docker-based deployment.

Tool Selection Matrix

Scenario                                       | Recommended tool
Individual developer experimenting with models | Ollama
Non-technical user wanting a chat UI           | LM Studio
Production API serving 50+ concurrent users    | vLLM
Team using Docker-based infrastructure         | HF TGI or vLLM Docker
Embedding inference in a Python application    | llama-cpp-python
CPU-only deployment (no GPU)                   | Ollama or llama.cpp
Maximum throughput on large GPU cluster        | vLLM with tensor parallelism

Checklist: Do You Understand This?

  • What is GGUF format and which tools use it?
  • When is LM Studio the right tool compared to Ollama?
  • What is PagedAttention and why does it give vLLM higher throughput?
  • What is the main difference between vLLM and Hugging Face TGI?
  • For a production API serving 100+ concurrent users, which tool would you choose?