Intermediate

Models in Ollama

The Ollama library contains 4,500+ models as of May 2026. This page helps you find the right one for your use case, understand the naming conventions, and pick a sensible quantization level.

Finding Models

Browse at ollama.com/library or search from the CLI:

# Search the library
ollama search llama
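# Inspect a model you have already pulled (architecture, parameter count, quantization).
# Available tags are listed on the model's page at ollama.com/library.
ollama show llama3.2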

| Use Case | Recommended Model | Size | Notes |
|---|---|---|---|
| General chat / first model | llama3.2 | 1B / 3B | Fast, capable, great starter. 3B runs on almost any machine. |
| Best overall quality (local) | llama4:scout | 109B MoE (17B active) | Meta's flagship local model. MoE means only 17B parameters are active per token, so generation is fast, but all 109B weights must still fit in memory. |
| Reasoning / math / logic | deepseek-r1:7b | 7B | Chain-of-thought reasoning model. Significantly better at math than same-sized standard models. |
| Reasoning (larger) | deepseek-r1:14b | 14B | Step up from 7B if you have 16 GB VRAM. |
| Coding (best agentic) | devstral | 24B | Mistral's agentic coding model, built for multi-file edits and tool use. |
| Coding (fast) | qwen2.5-coder:7b | 7B | Qwen's coding specialist. Excellent at Python, TypeScript, and structured outputs. |
| Vision (image understanding) | llama3.2-vision | 11B | Accepts image + text. Good for OCR, image Q&A, and diagram description. |
| Vision (smaller) | gemma4:9b | 9B | Gemma 4 is natively multimodal: text, image, and audio. |
| Embedding / RAG | nomic-embed-text | ~274 MB (137M parameters) | Fast, high-quality text embeddings for RAG pipelines. Used with pgvector or Chroma. |
| Smallest / low-RAM | phi4-mini | 3.8B | Microsoft's efficient model. Runs on 4 GB RAM. Good for simple tasks. |
| Multilingual | qwen3:8b | 8B | Strong multilingual support (over 100 languages and dialects). Also good for code. |
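
To make the table easier to use from a script, the lookup can be sketched as a small shell function. This is only an illustration: the function name `recommend_model` and the use-case keywords are invented here and are not part of Ollama; the model names are the ones listed in the table above.

```shell
# Hypothetical helper: map a use case to a model named in the table above.
# The keywords (chat, coding, ...) are made up for this sketch.
recommend_model() {
  case "$1" in
    chat)         echo "llama3.2" ;;
    reasoning)    echo "deepseek-r1:7b" ;;
    coding)       echo "qwen2.5-coder:7b" ;;
    agentic)      echo "devstral" ;;
    vision)       echo "llama3.2-vision" ;;
    embedding)    echo "nomic-embed-text" ;;
    low-ram)      echo "phi4-mini" ;;
    multilingual) echo "qwen3:8b" ;;
    *)            echo "unknown use case: $1" >&2; return 1 ;;
  esac
}

# Example (needs Ollama installed, so it is left commented out):
#   model="$(recommend_model coding)"
#   ollama pull "$model"
#   ollama run "$model" "Write a function that reverses a string."
```

`ollama pull` downloads the model and `ollama run` starts a prompt against it; swap the keyword to try a different row of the table.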

Understanding Model Names

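Ollama model names follow a consistent pattern: name:tag. The tag usually specifies the parameter size and, optionally, the quantization level; if you omit it, Ollama uses the latest tag. Some library tags also include a variant such as instruct between the size and the quantization. Examples: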

llama3.2 ← default tag (usually latest/recommended)
llama3.2:3b ← explicit size
llama3.2:3b-q4_K_M ← size + quantization level
deepseek-r1:7b ← model:size
qwen2.5-coder:14b-q8_0 ← 8-bit quantization (higher quality)
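
As a rough illustration of how these names decompose, the sketch below splits a reference into name, size, and quantization using plain shell parameter expansion. The function name `parse_model_ref` is hypothetical, and it assumes the simplified `<size>[-<quantization>]` tag layout shown in the examples above; tags with extra segments are not handled.

```shell
# Split an Ollama model reference of the form name:tag into its parts.
# Assumes the simplified tag layout <size>[-<quantization>], e.g. 14b-q8_0.
# Tags that carry extra segments (a variant name, for instance) are not handled.
parse_model_ref() {
  ref="$1"
  case "$ref" in
    *:*) name="${ref%%:*}"; tag="${ref#*:}" ;;
    *)   name="$ref"; tag="latest" ;;          # no tag given: "latest" is implied
  esac
  case "$tag" in
    *-*)    size="${tag%%-*}"; quant="${tag#*-}" ;;
    latest) size="default"; quant="default" ;;
    *)      size="$tag"; quant="default" ;;    # size only, default quantization
  esac
  printf 'name=%s size=%s quant=%s\n' "$name" "$size" "$quant"
}

parse_model_ref "llama3.2"
parse_model_ref "deepseek-r1:7b"
parse_model_ref "qwen2.5-coder:14b-q8_0"
```

The last call prints `name=qwen2.5-coder size=14b quant=q8_0`, which matches the reading given in the example list.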

Quantization: Quality vs Speed vs VRAM

Quantization compresses model weights from 16-bit floats to smaller representations. The result needs less VRAM and usually runs faster, at the cost of a small loss in quality. The level names come from GGUF, the file format Ollama uses:

| Level | Bits | VRAM (7B) | Quality | Recommendation |
|---|---|---|---|---|
| Q2_K | 2-bit | ~2.5 GB | Noticeably degraded | Avoid; only for severely RAM-limited devices |
| Q4_K_M | 4-bit (mixed) | ~4.1 GB | Good, near full quality | ✅ Default choice for most users |
| Q5_K_M | 5-bit (mixed) | ~5.0 GB | Very good | Use if you have headroom; marginal gain over Q4_K_M |
| Q6_K | 6-bit | ~5.9 GB | Excellent | Worth it if VRAM allows |
| Q8_0 | 8-bit | ~7.7 GB | Near-lossless | Best quality that still fits 8 GB VRAM for 7B |
| FP16 | 16-bit | ~14 GB | Full quality | Only on 16+ GB VRAM; rarely needed for local use |

Practical rule of thumb
Use Q4_K_M as your default. It's the sweet spot: small enough to fit 7B in 5 GB VRAM, quality good enough for almost all tasks. Only go lower if you absolutely can't fit the model in RAM.
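
The arithmetic behind these figures can be sketched as parameters × bits per weight ÷ 8. The helper below is a back-of-the-envelope estimate, not how Ollama reports memory: the name `estimate_weights_gb` is hypothetical, it counts weights only, and it uses nominal bit widths. Real GGUF files tend to come out somewhat larger (compare the table's ~4.1 GB for Q4_K_M), since mixed-precision levels keep some tensors at higher precision and store extra scaling data.

```shell
# Rough weight-memory estimate: parameters (billions) x bits per weight / 8.
# Weights only: the KV cache and runtime buffers need extra room on top,
# so treat the result as an approximate lower bound, in GB (10^9 bytes).
estimate_weights_gb() {
  params_b="$1"   # parameter count in billions, e.g. 7
  bits="$2"       # average bits per weight, e.g. 4, 8, or 16
  awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_weights_gb 7 16   # FP16
estimate_weights_gb 7 8    # nominal 8-bit
estimate_weights_gb 7 4    # nominal 4-bit
```

For a 7B model this prints 14.0, 7.0, and 3.5, which brackets the table values from below and shows why dropping from FP16 to a 4-bit level cuts memory to roughly a quarter.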

GPU Acceleration by Platform

NVIDIA CUDA
Driver 531+ required. Compute Capability 5.0+ (Maxwell+). Ollama bundles its own CUDA runtime — no separate CUDA Toolkit install needed. Fastest option for Windows/Linux.
AMD ROCm
ROCm v7 is fully supported on Linux, where AMD GPUs work well. Windows support uses ROCm v6.1 and is experimental, so Windows users should check GPU compatibility first.
Apple Metal
Automatic on all Apple Silicon (M1–M4). No configuration needed: install Ollama and Metal is used automatically. Because Apple's unified memory is shared between CPU and GPU, a Mac can often load larger models than a discrete GPU with a fixed amount of VRAM.
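
To work out which of the three backends above likely applies to a given machine, a heuristic along these lines may help. It is a sketch under stated assumptions: the function name `detect_backend` is hypothetical, and it only checks the platform and whether the vendor tools `nvidia-smi` or `rocminfo` are on the PATH. It does not verify driver versions or confirm that Ollama will actually use the GPU.

```shell
# Guess which GPU backend applies to this machine.
# Heuristic only: looks at the platform and at vendor tools on PATH.
detect_backend() {
  if [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
    echo "metal"    # Apple Silicon
  elif command -v nvidia-smi >/dev/null 2>&1; then
    echo "cuda"     # NVIDIA driver tools present
  elif command -v rocminfo >/dev/null 2>&1; then
    echo "rocm"     # AMD ROCm tools present
  else
    echo "cpu"      # nothing detected; expect CPU inference
  fi
}

detect_backend
```

Once a model is loaded, `ollama ps` shows whether it is running on GPU or CPU, which is a more reliable check than this guess.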

Major Model Families Available

The Ollama library spans all major open-weight model families:

  • Llama 4 (Meta) — Scout (109B MoE) and Maverick (400B MoE). Natively multimodal.
  • Llama 3.x (Meta) — 3.2 3B/1B (fast, small), 3.2 Vision 11B (images), 3.3 70B (highest Llama 3 quality)
  • Gemma 4 (Google) — 1B/4B/12B/27B. Natively multimodal (text + image + audio). Strong across all sizes.
  • DeepSeek-R1 — 7B/14B/32B/70B variants. Chain-of-thought reasoning specialist.
  • Qwen3 (Alibaba) — 8B/32B and MoE variants. Strong multilingual + coding support.
  • Phi-4 (Microsoft) — 3.8B mini, 14B standard. Efficient, good instruction following.
  • Mistral / Devstral — Mistral 7B, Mistral Small 24B, Devstral 24B (agentic coding)
  • Kimi K2 (Moonshot) — MoE coding model with strong agentic capabilities
  • nomic-embed-text — Dedicated embedding model for RAG pipelines

Checklist: Do You Understand This?

  • Can you pick a model for general chat, coding, and reasoning from the table above?
  • Do you understand what Q4_K_M means and why it's the default choice?
  • Can you read a model tag like qwen2.5-coder:14b-q8_0 and know what it means?
  • Do you know which GPU backend applies to your machine (CUDA / ROCm / Metal)?

Page built: 01 Jun 2026