Running Models with Ollama
Ollama is the simplest way to run open-weight AI models locally. One command to install, one command to run a model. It provides an OpenAI-compatible API at localhost so your existing code works without changes.
Installation
Ollama supports macOS, Windows, and Linux. Platform-specific installers handle GPU driver detection and model storage automatically.
# macOS — download from ollama.com or:
brew install ollama
# Linux (most distributions):
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download installer from ollama.com
# Supports NVIDIA (CUDA) and AMD (ROCm) GPUs automatically

After installation, the Ollama service runs in the background (macOS: menu bar icon; Linux: systemd service; Windows: system tray). No configuration is needed for basic use.
Running Your First Model
# Pull and run llama3.2 (3B, ~2GB download)
ollama run llama3.2
# Run a specific quantisation
ollama run llama3.1:8b-instruct-q4_K_M
# Non-interactive: pipe input
echo "Explain RAG in one paragraph" | ollama run llama3.2
# List downloaded models
ollama list
# Remove a model
ollama rm llama3.2

Model Library
Ollama's library (ollama.com/library) includes 100+ models. Key ones:
| Model | Pull command | Size (Q4) | Good for |
|---|---|---|---|
| Llama 3.2 3B | ollama pull llama3.2 | ~2 GB | Fast, runs anywhere, simple tasks |
| Llama 3.1 8B | ollama pull llama3.1 | ~5 GB | General purpose, coding, chat |
| Llama 3.1 70B | ollama pull llama3.1:70b | ~40 GB | Near-frontier quality, complex tasks |
| Mistral 7B | ollama pull mistral | ~5 GB | Efficient, strong coding |
| DeepSeek-R1 7B | ollama pull deepseek-r1:7b | ~5 GB | Reasoning, shows thinking trace |
| DeepSeek-R1 32B | ollama pull deepseek-r1:32b | ~20 GB | Strong reasoning, near o1 quality |
| Phi-3 Mini | ollama pull phi3:mini | ~2 GB | Efficient, runs on low-end hardware |
| Gemma 2 9B | ollama pull gemma2 | ~6 GB | Google model, strong instruction following |
| Qwen2.5 7B | ollama pull qwen2.5 | ~5 GB | Multilingual, strong on Asian languages |
| nomic-embed-text | ollama pull nomic-embed-text | ~300 MB | Local embeddings for RAG |
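The last entry, nomic-embed-text, is served over Ollama's local HTTP API rather than the interactive CLI. The sketch below is a minimal example; it assumes the /api/embeddings endpoint with a {"model", "prompt"} payload returning {"embedding": [...]}, and the cosine helper shows how the resulting vectors would be compared in a simple RAG retriever.

```python
import json
import math
import urllib.request

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embed(text, model="nomic-embed-text", host="http://localhost:11434"):
    """Fetch an embedding from the local Ollama service.
    Assumes the /api/embeddings endpoint and response shape."""
    req = urllib.request.Request(
        f"{host}/api/embeddings",
        data=json.dumps({"model": model, "prompt": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

# Usage (requires the service running and the model pulled):
#   query = embed("What is retrieval-augmented generation?")
#   doc = embed("RAG combines a retriever with a generator model.")
#   print(cosine(query, doc))
```

In a real RAG pipeline you would embed all documents once, store the vectors, and rank them by cosine similarity against each query embedding.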
OpenAI-Compatible API
Ollama exposes an OpenAI-compatible REST API at http://localhost:11434/v1. Use it as a drop-in replacement for OpenAI in your code:
# Using the OpenAI Python SDK:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # any string works
)
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

This means existing code written against the OpenAI SDK works locally with Ollama after changing only the base URL; no rewriting is needed.
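The same endpoint also supports token-by-token streaming through the SDK's standard stream=True flag. A sketch, where join_deltas is just an illustrative accumulator (the SDK import is deferred into the helper so the rest of the file works without the package installed):

```python
def join_deltas(deltas):
    """Concatenate streamed content pieces, skipping None/empty chunks."""
    return "".join(piece for piece in deltas if piece)

def stream_chat(prompt, model="llama3.1"):
    """Yield content deltas from a streamed chat completion against
    local Ollama (requires `pip install openai` and a running service)."""
    from openai import OpenAI  # deferred so helpers above import offline
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Each chunk carries an incremental delta; content may be None.
        yield chunk.choices[0].delta.content

# Usage (with the service running):
#   reply = join_deltas(stream_chat("Hello!"))
```

Streaming matters for local models in particular: at tens of tokens per second, showing partial output keeps the interface responsive.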
Quantisation Options
Ollama model names include quantisation variants. The format is model:size-quant. Common options:
- q4_0 — Smallest file; fastest; lowest quality. Use for very limited hardware.
- q4_K_M — Best quality/speed trade-off at 4-bit. Default recommendation.
- q5_K_M — Slightly better quality than Q4 at modest size increase.
- q8_0 — Near-FP16 quality; requires twice the VRAM of Q4.
- latest (default) — the library's default tag, which typically points to a 4-bit build such as q4_K_M rather than a hardware-specific choice.
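As a rule of thumb, the download sizes quoted earlier follow from parameters × bits per weight. A back-of-the-envelope sketch; the bits-per-weight figures are rough approximations that fold in quantisation metadata overhead, not official numbers:

```python
# Approximate bits per weight for common llama.cpp-style quants,
# including scale/metadata overhead (rough figures, not exact).
BITS_PER_WEIGHT = {
    "q4_0": 4.5,
    "q4_K_M": 4.85,
    "q5_K_M": 5.7,
    "q8_0": 8.5,
    "fp16": 16.0,
}

def approx_size_gb(params_billions, quant):
    """Estimate model file size / VRAM footprint in GB:
    parameters (billions) x bits per weight / 8 bits per byte."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

# An 8B model at q4_K_M lands near the ~5 GB quoted in the table;
# a 70B model at q4_K_M lands near ~40 GB.
```

This also explains the q8_0 note above: doubling the bits per weight roughly doubles the memory footprint.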
# Pull specific quantisation:
ollama pull llama3.1:8b-instruct-q4_K_M
ollama pull llama3.1:70b-instruct-q4_0

Performance by Hardware
| Hardware | Llama 3.1 8B (Q4) | Llama 3.1 70B (Q4) |
|---|---|---|
| M1 MacBook Pro (16GB) | ~50–80 tokens/sec | CPU only, ~5–10 tokens/sec |
| M3 Max Mac Studio (64GB) | ~150–200 tokens/sec | ~30–50 tokens/sec |
| RTX 3080 (10GB VRAM) | ~120–160 tokens/sec | Partial GPU offload |
| RTX 4090 (24GB VRAM) | ~200–250 tokens/sec | ~60–80 tokens/sec (with RAM offload) |
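You can measure figures like these on your own hardware: Ollama's native /api/generate response reports eval_count (tokens generated) and eval_duration (nanoseconds spent generating), from which throughput follows directly. A sketch, assuming that response shape:

```python
import json
import urllib.request

def tokens_per_second(eval_count, eval_duration_ns):
    """Generation throughput from Ollama's reported counters."""
    return eval_count / eval_duration_ns * 1e9

def benchmark(prompt, model="llama3.1", host="http://localhost:11434"):
    """One-shot, non-streaming generation; returns tokens/sec.
    Assumes /api/generate returns eval_count and eval_duration."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(
            {"model": model, "prompt": prompt, "stream": False}
        ).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return tokens_per_second(body["eval_count"], body["eval_duration"])

# Usage (service running):
#   print(benchmark("Write a haiku about GPUs"))
```

Run it a few times and average: the first call includes model load time, so later calls better reflect steady-state generation speed.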
Modelfile: Custom Models
Modelfiles let you create custom model variants with baked-in system prompts, temperature settings, or base model overrides:
FROM llama3.1
SYSTEM """
You are a Python expert. All your responses must include working code.
Always explain what the code does after the code block.
"""
PARAMETER temperature 0.3
PARAMETER num_ctx 4096

# Build the custom model from the Modelfile:
ollama create mymodel -f ./Modelfile
ollama run mymodel

Key Integrations
- Open WebUI — ChatGPT-style UI for Ollama models;
docker run -p 3000:8080 ghcr.io/open-webui/open-webui:ollama
- LangChain / LlamaIndex — Both have native Ollama integrations for building RAG and agent applications
- Continue.dev — VS Code extension; use Ollama as a local coding assistant (Copilot alternative)
- Docker — Run Ollama in a container for reproducible environments:
docker run -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
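Once the container (or the local service) is up, a quick programmatic health check is to list the installed models. The sketch below assumes /api/tags returns {"models": [{"name": ...}, ...]}, mirroring what `ollama list` prints:

```python
import json
import urllib.request

def model_names(tags_payload):
    """Extract model names from an /api/tags response payload
    (assumed shape: {"models": [{"name": ...}, ...]})."""
    return [m["name"] for m in tags_payload.get("models", [])]

def list_models(host="http://localhost:11434"):
    """Query a running Ollama instance for its downloaded models."""
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        return model_names(json.load(resp))

# Usage (service running):
#   print(list_models())
```

An empty list usually just means nothing has been pulled yet; a connection error means the service or container is not reachable on port 11434.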
Checklist: Do You Understand This?
- How do you install Ollama and pull your first model?
- What does the OpenAI-compatible API mean in practice for your existing code?
- What is the recommended default quantisation for a balance of quality and speed?
- How do you create a custom Modelfile and what can you configure in it?
- Name two Ollama integrations for building chat UIs or RAG applications.