🧠 All Things AI
Intermediate

Running Models with Ollama

Ollama is the simplest way to run open-weight AI models locally. One command to install, one command to run a model. It provides an OpenAI-compatible API at localhost so your existing code works without changes.

Installation

Ollama supports macOS, Windows, and Linux. Platform-specific installers handle GPU driver detection and model storage automatically.

# macOS — download from ollama.com or:
brew install ollama

# Linux (most distributions):
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer from ollama.com
# Supports NVIDIA (CUDA) and AMD (ROCm) GPUs automatically

After installation, the Ollama service runs in the background (macOS: menubar icon; Linux: systemd service; Windows: system tray). No configuration needed for basic use.
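To confirm the background service is actually up, you can hit its HTTP port (11434 by default) — Ollama answers a plain GET at the root with "Ollama is running". A minimal stdlib-only sketch; the `ollama_running` helper name is our own:

```python
from urllib.request import urlopen
from urllib.error import URLError

def ollama_running(base_url: str = "http://localhost:11434",
                   timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url."""
    try:
        with urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False
```

If this returns False, start the service first (on Linux, `ollama serve` in a terminal works if systemd isn't managing it).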

Running Your First Model

# Pull and run llama3.2 (3B, ~2GB download)
ollama run llama3.2

# Run a specific quantisation
ollama run llama3.1:8b-instruct-q4_K_M

# Non-interactive: pipe input
echo "Explain RAG in one paragraph" | ollama run llama3.2

# List downloaded models
ollama list

# Remove a model
ollama rm llama3.2
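The non-interactive pipe above can also be driven from Python via `subprocess` — `ollama run MODEL PROMPT` prints the completion to stdout and exits. A hedged sketch (the `ask` helper is our own, and it assumes the `ollama` binary is on your PATH):

```python
import subprocess

def ollama_cmd(model: str, prompt: str) -> list[str]:
    # `ollama run MODEL PROMPT` runs once, non-interactively.
    return ["ollama", "run", model, prompt]

def ask(model: str, prompt: str) -> str:
    result = subprocess.run(ollama_cmd(model, prompt),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Requires a local Ollama install; uncomment to try:
# print(ask("llama3.2", "Explain RAG in one paragraph"))
```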

Model Library

Ollama's library (ollama.com/library) includes 100+ models. Key ones:

| Model | Pull command | Size (Q4) | Good for |
|---|---|---|---|
| Llama 3.2 3B | ollama pull llama3.2 | ~2 GB | Fast, runs anywhere, simple tasks |
| Llama 3.1 8B | ollama pull llama3.1 | ~5 GB | General purpose, coding, chat |
| Llama 3.1 70B | ollama pull llama3.1:70b | ~40 GB | Near-frontier quality, complex tasks |
| Mistral 7B | ollama pull mistral | ~5 GB | Efficient, strong coding |
| DeepSeek-R1 7B | ollama pull deepseek-r1:7b | ~5 GB | Reasoning, shows thinking trace |
| DeepSeek-R1 32B | ollama pull deepseek-r1:32b | ~20 GB | Strong reasoning, near o1 quality |
| Phi-3 Mini | ollama pull phi3:mini | ~2 GB | Efficient, runs on low-end hardware |
| Gemma 2 9B | ollama pull gemma2 | ~6 GB | Google model, strong instruction following |
| Qwen2.5 7B | ollama pull qwen2.5 | ~5 GB | Multilingual, strong on Asian languages |
| nomic-embed-text | ollama pull nomic-embed-text | ~300 MB | Local embeddings for RAG |
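The rough sizes in the table follow from parameter count × bits per weight: q4_K_M averages about 4.5 bits/weight (some tensors stay at higher precision). A back-of-envelope estimator — the ~10% overhead factor for embeddings and metadata is our own assumption:

```python
def approx_gguf_size_gb(params_billion: float,
                        bits_per_weight: float = 4.5,
                        overhead: float = 1.1) -> float:
    """Very rough on-disk size (decimal GB) for a quantised model."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8 * overhead
    return bytes_total / 1e9

# 8B at ~4.5 bits/weight lands near the ~5 GB figure in the table:
print(approx_gguf_size_gb(8))
```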

OpenAI-Compatible API

Ollama exposes an OpenAI-compatible REST API at http://localhost:11434/v1. Use it as a drop-in replacement for OpenAI in your code:

# Using the OpenAI Python SDK:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # any string works
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

This means code you have already written against the OpenAI SDK runs locally against Ollama with nothing more than a base-URL change — no rewriting needed.
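Under the hood, that SDK call is a plain HTTP POST. The same request built with only the standard library — the network call itself is left commented out since it needs the local server running:

```python
import json
from urllib.request import Request, urlopen

def chat_request(model: str, user_msg: str) -> Request:
    # Same JSON shape the OpenAI SDK sends to /v1/chat/completions.
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    }).encode()
    return Request(
        "http://localhost:11434/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer ollama"},  # any token works
    )

req = chat_request("llama3.1", "Hello!")
# with urlopen(req) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```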

Quantisation Options

Ollama model names include quantisation variants in the format model:size-quant. Common options:

  • q4_0 — Smallest file; fastest; lowest quality. Use for very limited hardware.
  • q4_K_M — Best quality/speed trade-off at 4-bit. Default recommendation.
  • q5_K_M — Slightly better quality than Q4 at modest size increase.
  • q8_0 — Near-FP16 quality; requires twice the VRAM of Q4.
  • latest (default) — the tag you get when no quantisation is specified; it points to the publisher's default build (usually a 4-bit quant), not a per-hardware choice.

# Pull a specific quantisation:
ollama pull llama3.1:8b-instruct-q4_K_M
ollama pull llama3.1:70b-instruct-q4_0
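Because the tag format is regular (model, then an optional size/variant tag, then an optional quant suffix), it is easy to split programmatically. A small sketch — the `parse_tag` helper is our own:

```python
def parse_tag(name: str) -> dict:
    """Split an Ollama reference like 'llama3.1:8b-instruct-q4_K_M'."""
    model, _, tag = name.partition(":")
    tag = tag or "latest"  # no tag given -> 'latest'
    # A quant suffix, when present, starts at the last '-q' in the tag.
    quant = None
    idx = tag.rfind("-q")
    if idx != -1:
        quant = tag[idx + 1:]
        tag = tag[:idx]
    return {"model": model, "tag": tag, "quant": quant}

print(parse_tag("llama3.1:8b-instruct-q4_K_M"))
```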

Performance by Hardware

| Hardware | Llama 3.1 8B (Q4) | Llama 3.1 70B (Q4) |
|---|---|---|
| M1 MacBook Pro (16GB) | ~50–80 tokens/sec | CPU only, ~5–10 tok/sec |
| M3 Max Mac Studio (64GB) | ~150–200 tokens/sec | ~30–50 tokens/sec |
| RTX 3080 (10GB VRAM) | ~120–160 tokens/sec | Partial GPU offload |
| RTX 4090 (24GB VRAM) | ~200–250 tokens/sec | ~60–80 tokens/sec (with RAM offload) |
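These throughput numbers translate directly into wall-clock latency: time ≈ tokens ÷ tok/sec (ignoring prompt processing, which is typically much faster than generation). A trivial calculator:

```python
def generation_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Estimated wall-clock time to generate n_tokens at a given throughput."""
    return n_tokens / tokens_per_sec

# A ~500-token answer: M1 MacBook Pro (~50 tok/sec) vs RTX 4090 (~200 tok/sec)
m1 = generation_seconds(500, 50)     # 10.0 seconds
rtx = generation_seconds(500, 200)   # 2.5 seconds
```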

Modelfile: Custom Models

Modelfiles let you create custom model variants with baked-in system prompts, temperature settings, or base model overrides:

FROM llama3.1

SYSTEM """
You are a Python expert. All your responses must include working code.
Always explain what the code does after the code block.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 4096

# Build and run the custom model:
ollama create mymodel -f ./Modelfile
ollama run mymodel
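A Modelfile like the one above is just text, so it can also be rendered programmatically before calling ollama create. A sketch under that assumption; the `make_modelfile` helper is our own:

```python
def make_modelfile(base: str, system: str, **params: object) -> str:
    """Render a minimal Modelfile: FROM, SYSTEM, and PARAMETER lines."""
    lines = [f"FROM {base}", "", 'SYSTEM """', system, '"""', ""]
    lines += [f"PARAMETER {k} {v}" for k, v in params.items()]
    return "\n".join(lines)

text = make_modelfile(
    "llama3.1",
    "You are a Python expert. All your responses must include working code.",
    temperature=0.3, num_ctx=4096,
)
# Write it out, then: ollama create mymodel -f ./Modelfile
# open("Modelfile", "w").write(text)
```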

Key Integrations

  • Open WebUI — ChatGPT-style UI for Ollama models; docker run -p 3000:8080 ghcr.io/open-webui/open-webui:ollama
  • LangChain / LlamaIndex — Both have native Ollama integrations for building RAG and agent applications
  • Continue.dev — VS Code extension; use Ollama as a local coding assistant (Copilot alternative)
  • Docker — Run Ollama in a container for reproducible environments: docker run -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

Checklist: Do You Understand This?

  • How do you install Ollama and pull your first model?
  • What does the OpenAI-compatible API mean in practice for your existing code?
  • What is the recommended default quantisation for a balance of quality and speed?
  • How do you create a custom Modelfile and what can you configure in it?
  • Name two Ollama integrations for building chat UIs or RAG applications.