🧠 All Things AI
Intermediate

LM Studio & Local Inference Tools

Ollama is the easiest path to local inference, but it's not the only one. LM Studio provides a richer desktop experience, vLLM powers high-throughput production serving, and llama.cpp sits underneath everything as the portable inference engine. This page covers the full toolkit and when to use each.

LM Studio: Desktop GUI for Local Models

LM Studio is a free desktop application (macOS, Windows, Linux) for discovering, downloading, and running GGUF models locally. It targets non-technical users who want a ChatGPT-like experience without sending data to the cloud.

LM Studio strengths

  • Visual model browser (search Hugging Face, filter by size)
  • Download manager with quantisation selection
  • Chat UI with system prompt and parameter controls
  • Local server at localhost:1234 (OpenAI-compatible)
  • GGUF format; uses llama.cpp under the hood
  • GPU and CPU inference; auto-detects hardware
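The local server mentioned above speaks the OpenAI chat-completions protocol, so any HTTP client works. A minimal sketch, assuming LM Studio's server is running on its default port 1234; the model name "local-model" is a placeholder (LM Studio accepts whatever model is currently loaded):

```python
# Sketch: calling LM Studio's OpenAI-compatible local server with the
# standard library only. Port 1234 is LM Studio's default; "local-model"
# is a placeholder name.
import json
import urllib.request


def chat_request(prompt, model="local-model", temperature=0.7):
    """Build an OpenAI-style chat-completion payload for the local server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


if __name__ == "__main__":
    payload = chat_request("Summarise GGUF in one sentence.")
    req = urllib.request.Request(
        "http://localhost:1234/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, the same payload works unchanged against Ollama or vLLM by swapping the URL.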

LM Studio limitations

  • Single user — not designed for multi-user concurrent workloads
  • Only GGUF format (no PyTorch/Safetensors weights)
  • No batch inference, streaming API, or fine-tuning
  • Desktop app — not suitable for server deployments

Best use case: Individuals who want to experiment with local models via a polished UI without using the command line. Good for non-technical team members trying AI workflows with sensitive data.

GGUF Format

GGUF is the successor to the older GGML file format and is used by llama.cpp, Ollama, and LM Studio for quantised model weights. Key characteristics:

  • Single file containing model weights + metadata + tokenizer
  • Supports mixed-precision quantisation (different layers at different precision)
  • Efficient for CPU inference and can offload layers to GPU as available
  • Most popular quantisation: Q4_K_M (recommended default), Q8_0 (near-full quality)
  • Files available on Hugging Face (search "GGUF" for any model)
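The "single file" property is easy to verify: a GGUF file begins with a fixed binary header (magic bytes "GGUF", format version, tensor count, metadata key-value count), followed by the metadata and weights. A minimal sketch of parsing that header, based on the GGUF spec's layout; the dictionary field names are my own:

```python
# Sketch: reading the fixed 24-byte GGUF header. Layout per the GGUF
# spec: 4-byte magic "GGUF", uint32 version, uint64 tensor count,
# uint64 metadata key-value count (all little-endian).
import struct


def parse_gguf_header(data: bytes) -> dict:
    """Unpack the fixed header fields from the first 24 bytes of a GGUF file."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    (version,) = struct.unpack_from("<I", data, 4)
    tensor_count, kv_count = struct.unpack_from("<QQ", data, 8)
    return {"version": version, "tensors": tensor_count, "metadata_kvs": kv_count}
```

Usage: `parse_gguf_header(open("model.gguf", "rb").read(24))` is a quick sanity check that a downloaded file really is GGUF before handing it to a runtime.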

llama.cpp: The Portable Inference Engine

llama.cpp is the C++ inference library that both Ollama and LM Studio build on. You can also use it directly for maximum control:

  • Runs on CPU (no GPU required) — portable to any Linux/macOS/Windows machine
  • Uses Metal (Apple), CUDA (NVIDIA), or ROCm (AMD) when available
  • Python bindings via llama-cpp-python
  • OpenAI-compatible server: python -m llama_cpp.server --model model.gguf
  • Supports multi-GPU offloading

Best use case: Embedding local inference in a Python application, running on servers without Docker/Ollama infrastructure, or when you need fine-grained control over inference parameters.

vLLM: Production High-Throughput Serving

vLLM is the production inference server of choice for teams serving open-weight models to many concurrent users at scale. Its key innovation is PagedAttention, a memory-management technique that dramatically increases throughput.

Feature           | Ollama                | vLLM
Target user       | Individual developers | Production teams
Concurrent users  | 1–3 (sequential)      | Hundreds (continuous batching)
Throughput        | Moderate              | 3–5× higher via PagedAttention
Model format      | GGUF (quantised)      | Hugging Face weights (FP16/BF16/INT8)
Setup complexity  | Simple (one command)  | Moderate (Python, GPU drivers)
API compatibility | OpenAI-compatible     | OpenAI-compatible
# Install and serve Llama 3.1 8B with vLLM:
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype auto \
  --api-key token-abc123

vLLM's OpenAI-compatible server means existing clients work without code changes. Best for internal APIs serving a team, or external APIs with moderate user loads.
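A minimal client sketch, assuming the vLLM server launched above is listening on its default port 8000 with the API key from the launch command:

```python
# Sketch: pointing the official openai client at a local vLLM server.
# Port 8000 is vLLM's default; the API key matches the launch command.

def client_config(host="localhost", port=8000, api_key="token-abc123"):
    """Connection settings for an OpenAI-compatible vLLM endpoint."""
    return {"base_url": f"http://{host}:{port}/v1", "api_key": api_key}


if __name__ == "__main__":
    from openai import OpenAI  # pip install openai

    client = OpenAI(**client_config())
    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Hello from vLLM."}],
    )
    print(resp.choices[0].message.content)
```

Only `base_url` and `api_key` differ from a cloud OpenAI setup, which is what "existing clients work without code changes" means in practice.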

Hugging Face TGI

Text Generation Inference (TGI) from Hugging Face is an alternative to vLLM for production serving, packaged as a Docker container:

docker run --gpus all --shm-size 1g \
  -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct

TGI is well integrated with the Hugging Face ecosystem and supports most model architectures hosted on the Hub. Choose vLLM for maximum throughput; choose TGI for tighter HF integration and simpler Docker-based deployment.

Tool Selection Matrix

Scenario                                       | Recommended tool
Individual developer experimenting with models | Ollama
Non-technical user wanting a chat UI           | LM Studio
Production API serving 50+ concurrent users    | vLLM
Team using Docker-based infrastructure         | HF TGI or vLLM Docker
Embedding inference in a Python application    | llama-cpp-python
CPU-only deployment (no GPU)                   | Ollama or llama.cpp
Maximum throughput on large GPU cluster        | vLLM with tensor parallelism

Checklist: Do You Understand This?

  • What is GGUF format and which tools use it?
  • When is LM Studio the right tool compared to Ollama?
  • What is PagedAttention and why does it give vLLM higher throughput?
  • What is the main difference between vLLM and Hugging Face TGI?
  • For a production API serving 100+ concurrent users, which tool would you choose?