Intermediate

Models in Ollama

The Ollama library contains 4,500+ models as of May 2026. This page helps you find the right one for your use case, understand the naming conventions, and pick a sensible quantization level.

Finding Models

Browse at ollama.com/library or search from the CLI:

# Search the library

ollama search llama

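# Inspect a model you have already pulled (architecture, parameter count, quantization)

ollama show llama3.2

The full list of tags for a model is on its library page, for example ollama.com/library/llama3.2/tags.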

Recommended Models by Use Case

Use Case	Recommended Model	Size	Notes
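General chat / first model	llama3.2	1B / 3B	Fast and capable; a good first model. The 3B variant runs on almost any machine.
Best overall quality (local)	llama4:scout	109B MoE (17B active)	Meta's flagship local model. As a mixture-of-experts model it activates only 17B parameters per token, so it generates faster than a dense 109B model, but all 109B parameters must still fit in memory.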
Reasoning / math / logic	deepseek-r1:7b	7B	Chain-of-thought reasoning model. Significantly better at math than same-sized standard models.
Reasoning (larger)	deepseek-r1:14b	14B	Step up from 7B if you have 16 GB VRAM.
Coding (best agentic)	devstral	24B	Mistral's agentic coding model — built for multi-file edits and tool use.
Coding (fast)	qwen2.5-coder:7b	7B	Qwen's coding specialist. Excellent at Python, TypeScript, and structured outputs.
Vision (image understanding)	llama3.2-vision	11B	Accepts image + text. Good for OCR, image Q&A, and diagram description.
Vision (smaller)	gemma4:9b	9B	Gemma 4 is natively multimodal — text, image, and audio.
Embedding / RAG	nomic-embed-text	137M (~274 MB download)	Fast, high-quality text embeddings for RAG pipelines. Commonly paired with a vector store such as pgvector or Chroma.
Smallest / low-RAM	phi4-mini	3.8B	Microsoft's efficient model. Runs on 4 GB RAM. Good for simple tasks.
Multilingual	qwen3:8b	8B	Strong multilingual support (119 languages and dialects). Also good for code.
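To make the Embedding / RAG row more concrete, here is a minimal sketch of how text embeddings are typically requested and compared. It assumes a local Ollama server on its default port (11434) with nomic-embed-text already pulled; the helper names embed and cosine_similarity are illustrative, not part of any Ollama library.

```python
import json
import math
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_EMBED_URL = "http://localhost:11434/api/embed"


def embed(texts, model="nomic-embed-text", url=OLLAMA_EMBED_URL):
    """Request one embedding vector per input string from a local Ollama server."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["embeddings"]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

With the server running, something like `q, d = embed(["how do I pull a model?", "Use ollama pull to download a model."])` followed by `cosine_similarity(q, d)` gives a relevance score. A vector store such as pgvector or Chroma performs the same comparison at scale.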

Understanding Model Names

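Ollama model names follow a consistent pattern: name:tag. The tag usually encodes the parameter size, sometimes a variant (such as instruct), and optionally a quantization level. If you omit the tag, Ollama uses latest. Examples: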

llama3.2 ← default tag (usually latest/recommended)

llama3.2:3b ← explicit size

llama3.2:3b-q4_K_M ← size + quantization level

deepseek-r1:7b ← model:size

qwen2.5-coder:14b-q8_0 ← 8-bit quantization (higher quality)
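The naming pattern can be sketched as a small parser. This is an illustration of the convention described above, not Ollama's own parsing logic: the function name parse_model_ref and the regular expressions are assumptions, and references that include a registry host or namespace are not handled.

```python
import re
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    name: str             # e.g. "qwen2.5-coder"
    tag: str              # e.g. "14b-q8_0", or "latest" when omitted
    size: Optional[str]   # e.g. "14b", if the tag contains a size token
    quant: Optional[str]  # e.g. "q8_0", if the tag contains a quantization token


# Assumed token shapes: sizes like "3b", "1.5b", "270m";
# quantization levels like "q4_K_M", "q8_0", "fp16".
_SIZE = re.compile(r"^\d+(\.\d+)?[bm]$", re.IGNORECASE)
_QUANT = re.compile(r"^(q\d\w*|fp16|f16|bf16)$", re.IGNORECASE)


def parse_model_ref(ref: str) -> ModelRef:
    """Split 'name:tag' and pick out size and quantization tokens from the tag."""
    name, _, tag = ref.partition(":")
    tag = tag or "latest"
    parts = tag.split("-")
    size = next((p for p in parts if _SIZE.match(p)), None)
    quant = next((p for p in parts if _QUANT.match(p)), None)
    return ModelRef(name, tag, size, quant)
```

For example, `parse_model_ref("qwen2.5-coder:14b-q8_0")` yields the name qwen2.5-coder, size 14b, and quantization q8_0, while `parse_model_ref("llama3.2")` falls back to the latest tag with no size or quantization.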

Quantization: Quality vs Speed vs VRAM

Quantization compresses model weights from 16-bit floats to smaller representations. The result uses less VRAM and runs faster, at the cost of a small loss in quality. The level names come from GGUF (the file format Ollama uses):

Level	Bits	Approx. VRAM (7B, weights only)	Quality	Recommendation
Q2_K	2-bit	~2.5 GB	Noticeably degraded	Avoid — only for severely RAM-limited devices
Q4_K_M	4-bit (mixed)	~4.1 GB	Good — near full quality	✅ Default choice for most users
Q5_K_M	5-bit (mixed)	~5.0 GB	Very good	Use if you have headroom; marginal gain over Q4_K_M
Q6_K	6-bit	~5.9 GB	Excellent	Worth it if VRAM allows
Q8_0	8-bit	~7.7 GB	Near-lossless	Highest-quality quantized option; a 7B model only barely fits in 8 GB VRAM, with little room left for context
FP16	16-bit	~14 GB	Full quality	Only on 16+ GB VRAM; rarely needed for local use
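The sizes in the table can be sanity-checked with simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below treats the nominal bit width as a lower bound and compares it with the table's figures. The helper names are illustrative, and 1 GB is taken as 10^9 bytes.

```python
# Figures quoted in the table above for a 7B model, in GB.
TABLE_7B_GB = {
    "Q2_K": 2.5, "Q4_K_M": 4.1, "Q5_K_M": 5.0,
    "Q6_K": 5.9, "Q8_0": 7.7, "FP16": 14.0,
}

# Nominal bits per weight implied by each level name.
NOMINAL_BITS = {
    "Q2_K": 2, "Q4_K_M": 4, "Q5_K_M": 5,
    "Q6_K": 6, "Q8_0": 8, "FP16": 16,
}


def nominal_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the weights alone: params * bits / 8, expressed in GB."""
    return params_billion * bits_per_weight / 8


def compare_with_table():
    """Return (level, nominal GB, table GB) rows for a 7B model."""
    return [
        (level, nominal_weight_gb(7.0, bits), TABLE_7B_GB[level])
        for level, bits in NOMINAL_BITS.items()
    ]


for level, nominal, quoted in compare_with_table():
    print(f"{level:7s} nominal {nominal:5.2f} GB   table ~{quoted} GB")
```

FP16 comes out at exactly 14 GB, matching the table. The quantized levels land somewhat below the quoted figures, which is expected: quantized formats store per-block scale factors alongside the weights, the mixed (_M) variants keep some tensors at higher precision, and many "7B" models have slightly more than 7 billion parameters.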

Practical rule of thumb

Use Q4_K_M as your default. It's the sweet spot: a 7B model fits in about 5 GB of VRAM, and quality is good enough for almost all tasks. Only go lower if the model otherwise won't fit in memory.
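The rule of thumb can be sketched as a small helper. It scales the 7B figures from the table linearly with parameter count, which is an approximation, and reserves a hypothetical 1 GB of headroom for context and runtime buffers. The function names and the headroom value are assumptions for illustration only.

```python
# 7B figures from the quantization table, in GB, highest quality first.
TABLE_7B_GB = {
    "FP16": 14.0, "Q8_0": 7.7, "Q6_K": 5.9,
    "Q5_K_M": 5.0, "Q4_K_M": 4.1, "Q2_K": 2.5,
}


def estimate_gb(level: str, params_billion: float) -> float:
    """Scale the table's 7B figure linearly to another model size (approximate)."""
    return TABLE_7B_GB[level] * params_billion / 7.0


def levels_that_fit(params_billion: float, vram_gb: float, headroom_gb: float = 1.0):
    """All levels whose estimated size plus headroom fits in the given VRAM."""
    return [
        level for level in TABLE_7B_GB
        if estimate_gb(level, params_billion) + headroom_gb <= vram_gb
    ]


def recommend(params_billion: float, vram_gb: float, headroom_gb: float = 1.0):
    """Q4_K_M if it fits, otherwise the best lower level that fits, else None."""
    fitting = levels_that_fit(params_billion, vram_gb, headroom_gb)
    if "Q4_K_M" in fitting:
        return "Q4_K_M"
    return fitting[0] if fitting else None
```

Under these assumptions, `recommend(7, 8)` returns Q4_K_M, `recommend(14, 8)` falls back to Q2_K, and `recommend(70, 8)` returns None, meaning the model does not fit at any listed level. `levels_that_fit(7, 8)` also shows which higher levels are available if you want to trade headroom for quality.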

GPU Acceleration by Platform

NVIDIA CUDA

Driver 531+ required. Compute Capability 5.0+ (Maxwell+). Ollama bundles its own CUDA runtime — no separate CUDA Toolkit install needed. Fastest option for Windows/Linux.

AMD ROCm

ROCm v7 on Linux fully supported. Windows ROCm is v6.1 and experimental. If you have an AMD GPU on Linux, it works well. Windows users: check compatibility first.

Apple Metal

Automatic on all Apple Silicon (M1–M4). No configuration needed: install Ollama and Metal kicks in. Because the GPU shares unified memory with the CPU, a Mac with plenty of RAM can load models that would not fit in a typical discrete GPU's VRAM.
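If you are unsure which of the three backends applies to your machine, a rough heuristic is to check the platform and look for the vendor tools on your PATH. The sketch below is an assumption-laden guess, not how Ollama itself detects GPUs: it treats the presence of nvidia-smi as a sign of an NVIDIA driver and rocminfo as a sign of a ROCm install, and the function name guess_backend is illustrative.

```python
import platform
import shutil


def guess_backend(system=None, machine=None, which=shutil.which) -> str:
    """Guess the likely Ollama GPU backend: 'Metal', 'CUDA', 'ROCm', or 'CPU'."""
    system = system or platform.system()
    machine = machine or platform.machine()

    # Apple Silicon Macs report Darwin / arm64.
    if system == "Darwin" and machine == "arm64":
        return "Metal"
    # nvidia-smi is installed alongside the NVIDIA driver.
    if which("nvidia-smi"):
        return "CUDA"
    # rocminfo is installed with ROCm on Linux.
    if which("rocminfo"):
        return "ROCm"
    # No GPU tooling found; assume CPU-only inference.
    return "CPU"


print(guess_backend())
```

The system, machine, and which parameters exist only so the function can be exercised without the real hardware. Once Ollama is running a model, `ollama ps` reports whether it is actually loaded on the GPU or the CPU.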

Major Model Families Available

The Ollama library spans all major open-weight model families:

Llama 4 (Meta) — Scout (109B MoE) and Maverick (400B MoE). Natively multimodal.
Llama 3.x (Meta) — 3.2 3B/1B (fast, small), 3.2 Vision 11B (images), 3.3 70B (highest Llama 3 quality)
Gemma 4 (Google) — 1B/4B/12B/27B. Natively multimodal (text + image + audio). Strong across all sizes.
DeepSeek-R1 — 7B/14B/32B/70B variants. Chain-of-thought reasoning specialist.
Qwen3 (Alibaba) — 8B/32B and MoE variants. Strong multilingual + coding support.
Phi-4 (Microsoft) — 3.8B mini, 14B standard. Efficient, good instruction following.
Mistral / Devstral — Mistral 7B, Mistral Small 24B, Devstral 24B (agentic coding)
Kimi K2 (Moonshot) — MoE coding model with strong agentic capabilities
nomic-embed-text — Dedicated embedding model for RAG pipelines

Checklist: Do You Understand This?

Can you pick a model for general chat, coding, and reasoning from the table above?
Do you understand what Q4_K_M means and why it's the default choice?
Can you read a model tag like qwen2.5-coder:14b-q8_0 and know what it means?
Do you know which GPU backend applies to your machine (CUDA / ROCm / Metal)?