Getting Started with Ollama
From zero to running a local LLM in about five minutes. No cloud account, no API key, no complicated setup.
Step 1 — Install
- macOS: Download the .dmg from ollama.com, drag Ollama to Applications, and open it. Ollama appears in your menu bar.
- Linux: Run `curl -fsSL https://ollama.com/install.sh | sh`. The script installs and starts the Ollama service automatically.
- Windows: Download the .exe installer from ollama.com. Ollama runs as a background service after install. A native ARM64 build is available for Copilot+ PCs.
After install, Ollama runs as a background service listening on http://localhost:11434. You don't need to start it manually — it's always ready.
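One way to confirm the service is up is to request the base URL and see whether anything answers. The sketch below wraps that check in a hypothetical helper, `ollama_up`, which is not part of Ollama; it treats any successful HTTP response as "up" and assumes `curl` is installed.

```shell
# Hypothetical helper (not part of Ollama): succeeds if something answers
# HTTP on the given base URL, defaulting to the address mentioned above.
ollama_up() {
  base_url="${1:-http://localhost:11434}"
  # -s: quiet, -f: treat HTTP errors as failure, --max-time: do not hang
  curl -sf --max-time 2 "$base_url" >/dev/null 2>&1
}

if ollama_up; then
  echo "Ollama is reachable"
else
  echo "Ollama is not reachable; try starting it with: ollama serve"
fi
```

If the check fails right after installation, starting the server manually with `ollama serve` (see the command table) is a reasonable first thing to try.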
Step 2 — Pull a Model
Open a terminal and pull your first model with `ollama pull <model>`, where `<model>` is a name from the Ollama library.
The first pull downloads the model weights (typically 2–8 GB for small models). Later runs skip the download because the model is cached locally.
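As a concrete sketch, the commands below pull one small model and then list what is stored locally. The model name `llama3.2` is an assumption chosen for illustration; substitute any model from the Ollama library.

```shell
# Assumed model name, for illustration only; replace with any library model.
MODEL="llama3.2"

if command -v ollama >/dev/null 2>&1; then
  ollama pull "$MODEL"   # downloads the weights unless they are already cached
  ollama list            # the pulled model should now appear in this list
else
  echo "ollama not found on PATH; complete Step 1 first"
fi
```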
Step 3 — Run Interactively
Start a chat session with `ollama run <model>`, using the model you pulled in Step 2. Type `/bye` or press Ctrl+D to exit. Type `/help` inside the chat for commands like `/clear` (reset context) and `/show info` (model details).
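A minimal sketch of starting a session, again assuming the illustrative model name `llama3.2`. The terminal check is an addition of this sketch, so the interactive chat is only opened when there is a keyboard attached.

```shell
MODEL="llama3.2"   # assumed model name, for illustration only

# Open a chat only when ollama is installed and stdin is a terminal.
if command -v ollama >/dev/null 2>&1 && [ -t 0 ]; then
  ollama run "$MODEL"    # opens the chat prompt; leave with /bye or Ctrl+D
else
  echo "Run this from an interactive terminal with ollama installed"
fi
```

`ollama run` downloads the model first if it is not already cached, so the explicit pull in Step 2 is optional.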
Key CLI Commands
| Command | What it does |
|---|---|
| ollama pull <model> | Download a model from the library |
| ollama run <model> | Start an interactive chat session |
| ollama list | Show all locally downloaded models |
| ollama show <model> | Show model details (size, context, Modelfile) |
| ollama ps | Show models currently loaded in memory |
| ollama rm <model> | Delete a model from disk |
| ollama serve | Start the Ollama API server manually (runs automatically on install) |
| ollama run <model> <prompt> | One-shot prompt without entering interactive mode |
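The one-shot form in the last row is the easiest one to script. The sketch below wraps it in a hypothetical `ask` function; the function name, the `OLLAMA_BIN` variable, and the model name `llama3.2` are all assumptions of this example rather than anything Ollama defines.

```shell
# Hypothetical wrapper around the one-shot form: ask MODEL PROMPT
# OLLAMA_BIN is an assumption of this sketch, not an Ollama setting; it lets
# you point the wrapper at a different binary (or a stub when testing).
ask() {
  "${OLLAMA_BIN:-ollama}" run "$1" "$2"
}

# Example use, skipped when ollama is not installed.
if command -v ollama >/dev/null 2>&1; then
  ask llama3.2 "Summarize what a context window is in one sentence."
  ollama ps    # shows which models are loaded in memory right now
fi
```

Because the reply goes to standard output, it can be captured with `answer="$(ask llama3.2 "...")"` and reused later in a script.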
Hardware Requirements
Ollama will run on almost any machine, but GPU matters for speed. Here's a practical guide:
| Hardware | What you can run | Speed |
|---|---|---|
| CPU only (16 GB RAM) | 3B–7B models (Q4) | 5–15 tokens/sec — usable but slow |
| 8 GB VRAM (e.g. RTX 3060) | 7B–8B models comfortably | 40–80 tokens/sec |
| 16 GB VRAM (e.g. RTX 4080) | 13B–14B models | 60–120 tokens/sec |
| 24 GB VRAM (e.g. RTX 4090) | 32B models, or 70B partially offloaded | 100–200 tokens/sec on 32B |
| Apple M2/M3 (16 GB unified) | 13B–27B models using shared memory | 60–100 tokens/sec — excellent value |
| Apple M3 Max / M4 Max (128 GB) | 70B+ models fully in memory | 50–80 tokens/sec on 70B |
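The model sizes in the table track a simple rule of thumb: weight memory is roughly the parameter count times bits per weight, divided by 8. The sketch below is a back-of-the-envelope estimate only. It assumes exactly 4 bits per weight by default (real Q4 variants typically use somewhat more) and ignores the context cache and runtime overhead, so treat the result as a lower bound. `estimate_gb` is a hypothetical helper and requires `awk`.

```shell
# Rough weight-memory estimate:
#   billions of parameters * bits per weight / 8 = gigabytes of weights
# Ignores context cache and runtime overhead (lower bound only).
estimate_gb() {
  awk -v params_b="$1" -v bits="${2:-4}" \
    'BEGIN { printf "%.1f\n", params_b * bits / 8 }'
}

estimate_gb 7      # 7B model at 4 bits per weight
estimate_gb 13     # 13B model at 4 bits per weight
estimate_gb 70     # 70B model at 4 bits per weight
estimate_gb 7 16   # 7B model at 16 bits per weight
```

On this estimate a 7B model at 4 bits needs about 3.5 GB for weights alone, which is consistent with it fitting in 8 GB of VRAM with room left for context.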
Modelfile — Customizing a Model
A Modelfile lets you customize any base model with a system prompt, parameter settings, and example conversations. It works like a Dockerfile — you define a base and layer on modifications, then create a named custom model.
Key Modelfile instructions:
- FROM: base model to build on
- SYSTEM: persistent system prompt (baked into the model config)
- PARAMETER temperature: creativity (0 = deterministic, 1 = creative)
- PARAMETER num_ctx: context window size in tokens
- MESSAGE: pre-load conversation history (few-shot examples)
- ADAPTER: attach a LoRA adapter to the base model
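Putting several of these instructions together, the sketch below writes a small Modelfile and builds a named model from it with `ollama create`. The base model `llama3.2`, the custom name `concise-helper`, the prompt text, and every parameter value are illustrative assumptions, not recommendations.

```shell
# Write a small Modelfile; the base model and all values are illustrative.
cat > Modelfile <<'EOF'
FROM llama3.2
SYSTEM """You are a concise assistant that answers in plain English."""
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
MESSAGE user Does Ollama run models locally?
MESSAGE assistant Yes.
EOF

# Build a named custom model from the file, then send it a one-shot prompt.
# "concise-helper" is a hypothetical name chosen for this example.
if command -v ollama >/dev/null 2>&1; then
  ollama create concise-helper -f Modelfile
  ollama run concise-helper "What is a Modelfile?"
fi
```

Once created, the custom model shows up in `ollama list` like any other and can be removed with `ollama rm concise-helper`.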
Next Steps
- Browse available models → Models in Ollama
- Use Ollama from your own app → Using the Ollama API
- Add a chat UI → install Open WebUI (`docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main`)
Checklist: Do You Understand This?
- Can you install Ollama and pull a model without looking up the commands?
- Do you know the difference between `ollama pull` and `ollama run`?
- Can you create a Modelfile with a custom system prompt and temperature?
- Do you know what hardware you'd need to run a 13B model at reasonable speed?