Intermediate
Models in Ollama
The Ollama library contains 4,500+ models as of May 2026. This page helps you find the right one for your use case, understand the naming conventions, and pick a sensible quantization level.
Finding Models
Browse at ollama.com/library or search from the CLI:
# Search the library
ollama search llama
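# Inspect a model you have already pulled (architecture, parameter count, quantization).
# The full list of available tags is on the model's Tags page at ollama.com/library.
ollama show llama3.2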
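Once a few models are pulled, their size and quantization can also be read programmatically. The sketch below assumes a local Ollama server on the default port 11434 and uses its `GET /api/tags` endpoint, which lists installed models. The helper names `summarize_models` and `fetch_installed` are hypothetical, and the sample payload is made up to mirror the response shape.

```python
import json
import urllib.request


def summarize_models(payload: dict) -> list[tuple[str, str, str]]:
    """Return (name, parameter_size, quantization_level) for each listed model."""
    rows = []
    for model in payload.get("models", []):
        details = model.get("details", {})
        rows.append((
            model.get("name", "?"),
            details.get("parameter_size", "?"),
            details.get("quantization_level", "?"),
        ))
    return rows


def fetch_installed(base_url: str = "http://localhost:11434") -> dict:
    """Call GET /api/tags on a running Ollama server (assumed default address)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Illustrative payload shaped like an /api/tags response; the values are invented.
sample = {
    "models": [
        {
            "name": "llama3.2:latest",
            "size": 2019393189,
            "details": {"parameter_size": "3.2B", "quantization_level": "Q4_K_M"},
        }
    ]
}

print(summarize_models(sample))  # [('llama3.2:latest', '3.2B', 'Q4_K_M')]
```

With a server running, `summarize_models(fetch_installed())` should give the same kind of listing for whatever is installed locally.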
Recommended Models by Use Case
| Use Case | Recommended Model | Size | Notes |
|---|---|---|---|
| General chat / first model | llama3.2 | 1B / 3B | Fast, capable, great starter. 3B runs on almost any machine. |
| Best overall quality (local) | llama4:scout | 109B MoE (17B active) | Meta's flagship local model. Only 17B parameters are active per token, so generation is fast for its size, but all 109B weights must still fit in memory. |
| Reasoning / math / logic | deepseek-r1:7b | 7B | Chain-of-thought reasoning model. Significantly better at math than same-sized standard models. |
| Reasoning (larger) | deepseek-r1:14b | 14B | Step up from 7B if you have 16 GB VRAM. |
| Coding (best agentic) | devstral | 24B | Mistral's agentic coding model — built for multi-file edits and tool use. |
| Coding (fast) | qwen2.5-coder:7b | 7B | Qwen's coding specialist. Excellent at Python, TypeScript, and structured outputs. |
| Vision (image understanding) | llama3.2-vision | 11B | Accepts image + text. Good for OCR, image Q&A, and diagram description. |
| Vision (smaller) | gemma4:9b | 9B | Gemma 4 is natively multimodal — text, image, and audio. |
| Embedding / RAG | nomic-embed-text | 137M (~274 MB download) | Fast, high-quality text embeddings for RAG pipelines. Pairs with a vector store such as pgvector or Chroma. |
| Smallest / low-RAM | phi4-mini | 3.8B | Microsoft's efficient model. Runs on 4 GB RAM. Good for simple tasks. |
| Multilingual | qwen3:8b | 8B | Strong multilingual support (100+ languages and dialects). Also good for code. |
Understanding Model Names
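Ollama model names follow the pattern name:tag. The tag usually encodes the parameter size and, optionally, the quantization level; if you omit it, Ollama uses the latest tag. Exact tag strings vary by model, so confirm them on the model's Tags page. Examples: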
llama3.2 ← default tag (usually latest/recommended)
llama3.2:3b ← explicit size
llama3.2:3b-q4_K_M ← size + quantization level
deepseek-r1:7b ← model:size
qwen2.5-coder:14b-q8_0 ← 8-bit quantization (higher quality)
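As a minimal sketch of the naming pattern, the snippet below splits a reference into name, size, and quantization. The function `parse_model_ref` and both regular expressions are hypothetical and written only for illustration; they are not part of Ollama, and tag parts they do not recognize (for example a variant label) are simply ignored.

```python
import re
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    name: str
    size: Optional[str]
    quantization: Optional[str]


# Assumed token shapes: sizes like "3b" or "1.5b", quantization like "q4_K_M" or "fp16".
_SIZE = re.compile(r"^\d+(\.\d+)?[bm]$", re.IGNORECASE)
_QUANT = re.compile(r"^(q\d\w*|fp16|f16)$", re.IGNORECASE)


def parse_model_ref(ref: str) -> ModelRef:
    """Split 'name:tag', then pick the size and quantization parts out of the tag."""
    name, _, tag = ref.partition(":")
    size = quant = None
    for part in tag.split("-") if tag else []:
        if _SIZE.match(part):
            size = part
        elif _QUANT.match(part):
            quant = part
    return ModelRef(name, size, quant)


for ref in ("llama3.2", "llama3.2:3b", "llama3.2:3b-q4_K_M", "qwen2.5-coder:14b-q8_0"):
    print(ref, "->", parse_model_ref(ref))
```

Note that the split happens on the first colon only, so hyphens in the model name itself (as in qwen2.5-coder) are left alone.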
Quantization: Quality vs Speed vs VRAM
Quantization compresses model weights from 16-bit floats to lower-precision representations, which cuts VRAM use and speeds up inference at a small cost in quality. The level names come from GGUF, the file format Ollama uses:
| Level | Bits | VRAM (7B) | Quality | Recommendation |
|---|---|---|---|---|
| Q2_K | 2-bit | ~2.5 GB | Noticeably degraded | Avoid — only for severely RAM-limited devices |
| Q4_K_M | 4-bit (mixed) | ~4.1 GB | Good — near full quality | ✅ Default choice for most users |
| Q5_K_M | 5-bit (mixed) | ~5.0 GB | Very good | Use if you have headroom; marginal gain over Q4_K_M |
| Q6_K | 6-bit | ~5.9 GB | Excellent | Worth it if VRAM allows |
| Q8_0 | 8-bit | ~7.7 GB | Near-lossless | Highest-quality quantized option; on an 8 GB GPU the weights of a 7B model leave very little room for context |
| FP16 | 16-bit | ~14 GB | Full quality | Only on 16+ GB VRAM; rarely needed for local use |
Practical rule of thumb
Use Q4_K_M as your default. It's the sweet spot: small enough to fit 7B in 5 GB VRAM, quality good enough for almost all tasks. Only go lower if you absolutely can't fit the model in RAM.
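The 7B figures in the table can be scaled to other model sizes with some rough arithmetic. The sketch below assumes weight memory grows linearly with parameter count and reserves a fixed headroom for context and runtime overhead; both the linear scaling and the 1 GB default headroom are simplifying assumptions, and the function names are hypothetical.

```python
# Approximate weight sizes for a 7B model in GB, copied from the table above.
VRAM_7B_GB = {
    "Q2_K": 2.5,
    "Q4_K_M": 4.1,
    "Q5_K_M": 5.0,
    "Q6_K": 5.9,
    "Q8_0": 7.7,
    "FP16": 14.0,
}


def estimate_weights_gb(params_billion: float, level: str) -> float:
    """Scale the 7B figure linearly by parameter count (an assumption)."""
    return VRAM_7B_GB[level] * params_billion / 7.0


def best_level_that_fits(params_billion: float, vram_gb: float,
                         headroom_gb: float = 1.0):
    """Return the highest-quality level whose estimate plus headroom fits, or None."""
    best = None
    for level in VRAM_7B_GB:  # dict order runs from smallest to largest
        if estimate_weights_gb(params_billion, level) + headroom_gb <= vram_gb:
            best = level
    return best


print(round(estimate_weights_gb(14, "Q4_K_M"), 1))  # 8.2
print(best_level_that_fits(7, 8))                   # Q6_K
print(best_level_that_fits(7, 3))                   # None
```

Treat the result as a starting point only: real memory use also depends on context length and the runtime, so check actual usage after loading the model.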
GPU Acceleration by Platform
NVIDIA CUDA
Driver 531+ required. Compute Capability 5.0+ (Maxwell+). Ollama bundles its own CUDA runtime — no separate CUDA Toolkit install needed. Fastest option for Windows/Linux.
AMD ROCm
ROCm v7 on Linux fully supported. Windows ROCm is v6.1 and experimental. If you have an AMD GPU on Linux, it works well. Windows users: check compatibility first.
Apple Metal
Automatic on all Apple Silicon (M1–M4). No configuration needed: install Ollama and Metal kicks in. Because the GPU shares unified system memory, a Mac with plenty of RAM can load models larger than a typical discrete GPU's VRAM would hold.
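To work out which of the three backends likely applies to a given machine, a rough heuristic is enough. The sketch below is not how Ollama itself detects hardware; it assumes that `nvidia-smi` on the PATH indicates an NVIDIA driver and `rocminfo` indicates a Linux ROCm install, and the function names are hypothetical.

```python
import platform
import shutil


def guess_backend(system: str, machine: str,
                  has_nvidia_smi: bool, has_rocminfo: bool) -> str:
    """Map basic platform facts to the backend described in this section."""
    if system == "Darwin" and machine == "arm64":
        return "Metal"  # Apple Silicon
    if has_nvidia_smi:
        return "CUDA"
    if has_rocminfo:
        return "ROCm"
    return "CPU"  # no supported GPU tooling found


def guess_local_backend() -> str:
    """Apply the heuristic to the machine this script runs on."""
    return guess_backend(
        platform.system(),
        platform.machine(),
        shutil.which("nvidia-smi") is not None,
        shutil.which("rocminfo") is not None,
    )


print(guess_local_backend())
```

A "CUDA" or "ROCm" answer only means the vendor tooling is present; the driver and compute-capability requirements listed above still apply.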
Major Model Families Available
The Ollama library spans all major open-weight model families:
- Llama 4 (Meta) — Scout (109B MoE) and Maverick (400B MoE). Natively multimodal.
- Llama 3.x (Meta) — 3.2 3B/1B (fast, small), 3.2 Vision 11B (images), 3.3 70B (highest Llama 3 quality)
- Gemma 4 (Google) — 1B/4B/12B/27B. Natively multimodal (text + image + audio). Strong across all sizes.
- DeepSeek-R1 — 7B/14B/32B/70B variants. Chain-of-thought reasoning specialist.
- Qwen3 (Alibaba) — 8B/32B and MoE variants. Strong multilingual + coding support.
- Phi-4 (Microsoft) — 3.8B mini, 14B standard. Efficient, good instruction following.
- Mistral / Devstral — Mistral 7B, Mistral Small 24B, Devstral 24B (agentic coding)
- Kimi K2 (Moonshot) — MoE coding model with strong agentic capabilities
- nomic-embed-text — Dedicated embedding model for RAG pipelines
Checklist: Do You Understand This?
- Can you pick a model for general chat, coding, and reasoning from the table above?
- Do you understand what Q4_K_M means and why it's the default choice?
- Can you read a model tag like qwen2.5-coder:14b-q8_0 and know what it means?
- Do you know which GPU backend applies to your machine (CUDA / ROCm / Metal)?