🧠 All Things AI
Intermediate

Running Models with Ollama

Ollama is the simplest way to run open-weight AI models locally. One command to install, one command to run a model. It provides an OpenAI-compatible API at localhost so your existing code works without changes.

Installation

Ollama supports macOS, Windows, and Linux. Platform-specific installers handle GPU driver detection and model storage automatically.

# macOS — download from ollama.com or:
brew install ollama

# Linux (most distributions):
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer from ollama.com
# Supports NVIDIA (CUDA) and AMD (ROCm) GPUs automatically

After installation, the Ollama service runs in the background (macOS: menubar icon; Linux: systemd service; Windows: system tray). No configuration needed for basic use.
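To confirm the background service is actually up, you can hit its HTTP port (11434 by default) — Ollama answers a plain GET at the root with "Ollama is running". A minimal stdlib-only sketch; the `ollama_running` helper name is our own:

```python
from urllib.request import urlopen
from urllib.error import URLError

def ollama_running(base_url: str = "http://localhost:11434",
                   timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url."""
    try:
        with urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False
```

If this returns False, start the service first (on Linux, `ollama serve` in a terminal works if systemd isn't managing it).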

Running Your First Model

# Pull and run llama3.2 (3B, ~2GB download)
ollama run llama3.2

# Run a specific quantisation
ollama run llama3.1:8b-instruct-q4_K_M

# Non-interactive: pipe input
echo "Explain RAG in one paragraph" | ollama run llama3.2

# List downloaded models
ollama list

# Remove a model
ollama rm llama3.2
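The non-interactive pipe above can also be driven from Python via `subprocess` — `ollama run MODEL PROMPT` prints the completion to stdout and exits. A hedged sketch (the `ask` helper is our own, and it assumes the `ollama` binary is on your PATH):

```python
import subprocess

def ollama_cmd(model: str, prompt: str) -> list[str]:
    # `ollama run MODEL PROMPT` runs once, non-interactively.
    return ["ollama", "run", model, prompt]

def ask(model: str, prompt: str) -> str:
    result = subprocess.run(ollama_cmd(model, prompt),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Requires a local Ollama install; uncomment to try:
# print(ask("llama3.2", "Explain RAG in one paragraph"))
```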

Model Library

Ollama's library (ollama.com/library) includes 100+ models. Key ones:

| Model | Pull command | Size (Q4) | Good for |
|---|---|---|---|
| Llama 3.2 3B | ollama pull llama3.2 | ~2 GB | Fast, runs anywhere, simple tasks |
| Llama 3.1 8B | ollama pull llama3.1 | ~5 GB | General purpose, coding, chat |
| Llama 3.1 70B | ollama pull llama3.1:70b | ~40 GB | Near-frontier quality, complex tasks |
| Mistral 7B | ollama pull mistral | ~5 GB | Efficient, strong coding |
| DeepSeek-R1 7B | ollama pull deepseek-r1:7b | ~5 GB | Reasoning, shows thinking trace |
| DeepSeek-R1 32B | ollama pull deepseek-r1:32b | ~20 GB | Strong reasoning, near o1 quality |
| Phi-3 Mini | ollama pull phi3:mini | ~2 GB | Efficient, runs on low-end hardware |
| Gemma 2 9B | ollama pull gemma2 | ~6 GB | Google model, strong instruction following |
| Qwen2.5 7B | ollama pull qwen2.5 | ~5 GB | Multilingual, strong on Asian languages |
| nomic-embed-text | ollama pull nomic-embed-text | ~300 MB | Local embeddings for RAG |
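The rough sizes in the table follow from parameter count × bits per weight: q4_K_M averages about 4.5 bits/weight (some tensors stay at higher precision). A back-of-envelope estimator — the ~10% overhead factor for embeddings and metadata is our own assumption:

```python
def approx_gguf_size_gb(params_billion: float,
                        bits_per_weight: float = 4.5,
                        overhead: float = 1.1) -> float:
    """Very rough on-disk size (decimal GB) for a quantised model."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8 * overhead
    return bytes_total / 1e9

# 8B at ~4.5 bits/weight lands near the ~5 GB figure in the table:
print(approx_gguf_size_gb(8))
```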

OpenAI-Compatible API

Ollama exposes an OpenAI-compatible REST API at http://localhost:11434/v1. Use it as a drop-in replacement for OpenAI in your code:

# Using the OpenAI Python SDK:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # any string works
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

This means code you have already written against the OpenAI SDK runs locally against Ollama with nothing more than a base-URL change — no rewriting needed.
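Under the hood, that SDK call is a plain HTTP POST. The same request built with only the standard library — the network call itself is left commented out since it needs the local server running:

```python
import json
from urllib.request import Request, urlopen

def chat_request(model: str, user_msg: str) -> Request:
    # Same JSON shape the OpenAI SDK sends to /v1/chat/completions.
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    }).encode()
    return Request(
        "http://localhost:11434/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer ollama"},  # any token works
    )

req = chat_request("llama3.1", "Hello!")
# with urlopen(req) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```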

Quantisation Options

Ollama model names include quantisation variants in the format model:size-quant. Common options:

  • q4_0 — Smallest file; fastest; lowest quality. Use for very limited hardware.
  • q4_K_M — Best quality/speed trade-off at 4-bit. Default recommendation.
  • q5_K_M — Slightly better quality than Q4 at modest size increase.
  • q8_0 — Near-FP16 quality; requires twice the VRAM of Q4.
  • latest (default) — the tag you get when no quantisation is specified; it points to the publisher's default build (usually a 4-bit quant), not a per-hardware choice.

# Pull a specific quantisation:
ollama pull llama3.1:8b-instruct-q4_K_M
ollama pull llama3.1:70b-instruct-q4_0
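Because the tag format is regular (model, then an optional size/variant tag, then an optional quant suffix), it is easy to split programmatically. A small sketch — the `parse_tag` helper is our own:

```python
def parse_tag(name: str) -> dict:
    """Split an Ollama reference like 'llama3.1:8b-instruct-q4_K_M'."""
    model, _, tag = name.partition(":")
    tag = tag or "latest"  # no tag given -> 'latest'
    # A quant suffix, when present, starts at the last '-q' in the tag.
    quant = None
    idx = tag.rfind("-q")
    if idx != -1:
        quant = tag[idx + 1:]
        tag = tag[:idx]
    return {"model": model, "tag": tag, "quant": quant}

print(parse_tag("llama3.1:8b-instruct-q4_K_M"))
```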

Performance by Hardware

| Hardware | Llama 3.1 8B (Q4) | Llama 3.1 70B (Q4) |
|---|---|---|
| M1 MacBook Pro (16GB) | ~50–80 tokens/sec | CPU only, ~5–10 tok/sec |
| M3 Max Mac Studio (64GB) | ~150–200 tokens/sec | ~30–50 tokens/sec |
| RTX 3080 (10GB VRAM) | ~120–160 tokens/sec | Partial GPU offload |
| RTX 4090 (24GB VRAM) | ~200–250 tokens/sec | ~60–80 tokens/sec (with RAM offload) |
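These throughput numbers translate directly into wall-clock latency: time ≈ tokens ÷ tok/sec (ignoring prompt processing, which is typically much faster than generation). A trivial calculator:

```python
def generation_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Estimated wall-clock time to generate n_tokens at a given throughput."""
    return n_tokens / tokens_per_sec

# A ~500-token answer: M1 MacBook Pro (~50 tok/sec) vs RTX 4090 (~200 tok/sec)
m1 = generation_seconds(500, 50)     # 10.0 seconds
rtx = generation_seconds(500, 200)   # 2.5 seconds
```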

Modelfile: Custom Models

Modelfiles let you create custom model variants with baked-in system prompts, temperature settings, or base model overrides:

FROM llama3.1

SYSTEM """
You are a Python expert. All your responses must include working code.
Always explain what the code does after the code block.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 4096

# Build and run the custom model:
ollama create mymodel -f ./Modelfile
ollama run mymodel
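A Modelfile like the one above is just text, so it can also be rendered programmatically before calling ollama create. A sketch under that assumption; the `make_modelfile` helper is our own:

```python
def make_modelfile(base: str, system: str, **params: object) -> str:
    """Render a minimal Modelfile: FROM, SYSTEM, and PARAMETER lines."""
    lines = [f"FROM {base}", "", 'SYSTEM """', system, '"""', ""]
    lines += [f"PARAMETER {k} {v}" for k, v in params.items()]
    return "\n".join(lines)

text = make_modelfile(
    "llama3.1",
    "You are a Python expert. All your responses must include working code.",
    temperature=0.3, num_ctx=4096,
)
# Write it out, then: ollama create mymodel -f ./Modelfile
# open("Modelfile", "w").write(text)
```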

Key Integrations

  • Open WebUI — ChatGPT-style UI for Ollama models; docker run -p 3000:8080 ghcr.io/open-webui/open-webui:ollama
  • LangChain / LlamaIndex — Both have native Ollama integrations for building RAG and agent applications
  • Continue.dev — VS Code extension; use Ollama as a local coding assistant (Copilot alternative)
  • Docker — Run Ollama in a container for reproducible environments: docker run -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

Checklist: Do You Understand This?

  • How do you install Ollama and pull your first model?
  • What does the OpenAI-compatible API mean in practice for your existing code?
  • What is the recommended default quantisation for a balance of quality and speed?
  • How do you create a custom Modelfile and what can you configure in it?
  • Name two Ollama integrations for building chat UIs or RAG applications.