🧠 All Things AI
Intermediate

The Major Model Families

The AI model landscape in 2025–2026 is dominated by a handful of major families, each with distinct strengths, licensing models, and use-case sweet spots. This page covers each family in enough depth to make informed selection decisions.

Closed API — frontier quality, no weight access

  • GPT-4o / GPT-5 — OpenAI; broadest ecosystem
  • Claude Sonnet / Opus — Anthropic; long context, coding
  • Gemini Flash / Pro — Google; 1M context, multimodal

Reasoning models — extended thinking, harder tasks

  • o3 / o4-mini — OpenAI; native tool use in trace
  • Claude extended thinking — Anthropic; 200K + thinking
  • DeepSeek-R1 — open-weight; matches o1

Open-weight — download & self-host

  • Llama 4 Maverick — Meta; 10M context, MoE
  • Qwen 2.5 (72B) — Alibaba; multilingual, coding
  • Mistral / Mixtral — Mistral; EU, compact, efficient
  • Phi-4 (14B) — Microsoft; edge / mobile

Closed API = pay-per-token, zero ops. Open-weight = fixed infra cost, full control. Most production systems use both.
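The "use both" pattern usually reduces to a routing decision. Here is a minimal sketch with an entirely hypothetical helper; the 50M tokens/day threshold is illustrative, not a real break-even figure:

```python
# Hypothetical routing helper illustrating the closed-API vs open-weight trade-off.
# The volume threshold below is illustrative only.
def choose_backend(tokens_per_day: int, data_sensitive: bool, needs_frontier: bool) -> str:
    if data_sensitive:
        return "open_weight"   # full data control: weights run on your own infra
    if needs_frontier:
        return "closed_api"    # frontier-quality models are API-only
    # At high volume, a fixed infra cost can beat pay-per-token pricing.
    return "open_weight" if tokens_per_day > 50_000_000 else "closed_api"
```

Real systems add latency, compliance, and fallback considerations, but the shape of the decision is the same.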

OpenAI — GPT and o-Series

OpenAI maintains two parallel model lines for different needs:

GPT Series

| Model | Context | Strengths | Best for |
|---|---|---|---|
| GPT-4o | 128K | Multimodal, fast, broad capability | Default workhorse for most tasks |
| GPT-4o mini | 128K | Fast, very cheap | High-volume simple tasks |
| GPT-5 | 400K | Highest general capability, reduced hallucination | Professional knowledge work, hardest general tasks |

o-Series (Reasoning)

OpenAI's separate reasoning-focused family. These models spend additional compute "thinking" before answering, and are dramatically better at maths, formal logic, and complex code. See the Reasoning Models section for full detail.

  • o3 — Full reasoning with native tool use; best quality on hardest problems
  • o4-mini — Cost-efficient reasoning; on benchmarks often matches o3 at 1/9th the cost

OpenAI's core strengths: Largest developer ecosystem, broadest tool integration (Assistants API, function calling, fine-tuning), most mature production infrastructure.
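Function calling is worth seeing concretely. The sketch below only builds a chat completions request body with one tool attached; the `get_weather` tool and its schema are invented for illustration, and no request is actually sent:

```python
# Shape of a chat completions request with function calling (tool use).
# The tool name and schema here are hypothetical examples.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "model": "o4-mini",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
}
# With the openai SDK this would be passed to client.chat.completions.create(**request).
```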

Anthropic — Claude Family

Anthropic's Claude models are known for instruction-following quality, long-context accuracy, coding reliability, and safety-focused behaviour. The naming convention runs Haiku (fast/cheap) → Sonnet (balanced) → Opus (most capable).

| Model | Context | Strengths |
|---|---|---|
| Claude Haiku 4.5 | 200K | Fastest Claude, very cheap, good for routing and classification |
| Claude Sonnet 4.5 | 200K | Best all-round value: coding, analysis, writing, agentic tasks |
| Claude Opus 4.6 | 200K | Most capable Claude; extended thinking mode; research-grade tasks |

Claude's core strengths: Extremely long and accurate context handling (200K native), strong coding reliability (consistently top-rated on SWE-bench), "computer use" for browser/desktop automation, and strong agentic tool-calling behaviour. Claude Code (Anthropic's agentic coding tool) is built on Claude Sonnet/Opus.
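Claude's tool calling uses a slightly different schema convention from OpenAI's: each tool is declared with a top-level `name` and an `input_schema` (JSON Schema) rather than a nested `function` object. A sketch of a Messages API request body; the `run_tests` tool and the model id are illustrative, and nothing is sent:

```python
# Shape of an Anthropic Messages API request with one tool attached.
# Tool name, schema, and model id are illustrative; check current docs for exact ids.
payload = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "tools": [{
        "name": "run_tests",
        "description": "Run the project's test suite and report the results",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    }],
    "messages": [{"role": "user", "content": "Run the tests in ./src"}],
}
```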

Google — Gemini Family

Google's Gemini family is defined by its multimodal capability and extreme context lengths.

| Model | Context | Strengths |
|---|---|---|
| Gemini Flash 2.5 | 1M | Very fast, 1M context, multimodal, cheap |
| Gemini Pro 2.5 | 1M | Strong coding, reasoning, multimodal; 1M context |
| Gemini Ultra / 3.x | 1M+ | Frontier capability, visual and audio reasoning |

Gemini's core strengths: The 1M token context window is the largest in production; invaluable for whole-codebase analysis, long book summarisation, or processing hundreds of documents at once. Strong multimodal — handles images, audio, and (for some models) video natively. Deep Google Workspace integration.
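To make "1M tokens" concrete, here is a back-of-envelope estimate using the common rule of thumb of roughly 4 characters per token for English prose (the exact ratio varies by tokenizer and language):

```python
def approx_tokens(text_chars: int) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    return text_chars // 4

# A ~10-page document is very roughly 30,000 characters.
doc_tokens = approx_tokens(30_000)            # ~7,500 tokens
docs_per_context = 1_000_000 // doc_tokens    # ~133 such documents per 1M-token window
```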

Meta — LLaMA Family (Open-Weight)

LLaMA is the world's most widely used open-weight model family. Meta releases model weights freely for research and commercial use (check specific version licences).

| Model | Parameters | Key features |
|---|---|---|
| Llama 3.1 8B | 8B | Runs on consumer GPU; good general reasoning |
| Llama 3.1 70B | 70B | Near-frontier quality; runs on 2× consumer GPUs |
| Llama 3.1 405B | 405B | Best open-weight general model; requires data centre GPUs |
| Llama 4 Scout/Maverick | MoE, ~400B total | Multimodal (vision), up to 10M token context, MoE efficiency |

LLaMA's core strengths: no per-token cost, full data control, and fine-tuning freedom. The massive open-source ecosystem (Ollama, llama.cpp, vLLM) makes deployment straightforward. Llama 4 achieves 85–86% MMLU-Pro — matching or approaching proprietary frontier models on many benchmarks.
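As a concrete example of that deployment path, Ollama exposes a local HTTP API once a model has been pulled. The sketch below only builds the request body; the model tag assumes `ollama pull llama3.1:8b` has been run and may differ on your install:

```python
import json

# Request body for Ollama's local generate endpoint
# (POST http://localhost:11434/api/generate on a default install).
body = {
    "model": "llama3.1:8b",   # assumes this tag has been pulled locally
    "prompt": "Explain the trade-offs of open-weight models in two sentences.",
    "stream": False,          # return one JSON object instead of a token stream
}
payload = json.dumps(body)

# To actually call it (not executed here):
#   import urllib.request
#   req = urllib.request.Request("http://localhost:11434/api/generate",
#                                data=payload.encode(), method="POST")
#   print(json.load(urllib.request.urlopen(req))["response"])
```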

Mistral — Compact European Models

Mistral AI (Paris) makes high-efficiency open-weight models with a strong commercial presence in Europe:

  • Mistral 7B / Nemo (12B) — Punch well above their weight; particularly strong on coding and instruction-following
  • Mixtral 8x7B / 8x22B — Mixture-of-Experts models; only a subset of experts active per token, making them cost-efficient
  • Mistral Large — Closed API model; competitive with GPT-4 tier on European-language tasks

Mistral's niche: European regulatory comfort (French company, EU-hosted options), strong multilingual European languages, and the most efficient open models for their quality tier.

DeepSeek — Chinese Open-Weight Leader

DeepSeek has produced the most impactful open-weight releases of 2024–2025:

  • DeepSeek-V3 / V3.2 — Non-reasoning model; 671B MoE, 37B active; strong on coding and general tasks; very cheap API ($0.27/1M input tokens for V3)
  • DeepSeek-R1 — Open-weight reasoning model that matches o1; full reasoning trace visible; distilled versions available for local deployment

DeepSeek's impact: R1's January 2025 release caused a "global cost reset" — demonstrating frontier reasoning capability is achievable without massive closed-source infrastructure. However: data privacy and safety alignment considerations apply (Chinese training and governance).

Microsoft — Phi (Small Language Models)

Microsoft's Phi family focuses on efficiency at small scale:

  • Phi-3 Mini (3.8B) — Runs on phones; strong reasoning per parameter
  • Phi-3 Small (7B) / Medium (14B) — Strong coding; can run on laptop GPU
  • Phi-4 — Improved quality; strong on STEM tasks for its size

Phi's niche: Edge deployment (Android, iOS, embedded), offline applications, environments where even a 7B model is too large.

Alibaba — Qwen (Multilingual Open Models)

Qwen (from Alibaba Cloud) is the leading open-weight model family for multilingual tasks, especially Chinese and Asian languages:

  • Qwen2.5 7B / 14B / 32B / 72B — Strong instruction-following, coding, and math at each size tier
  • Qwen2.5-Coder — Code-specialised variant; competitive with DeepSeek-Coder
  • QwQ-32B — Reasoning-capable model; comparable to o1 on some tasks

Qwen's niche: Applications serving Chinese or East Asian markets; base models for DeepSeek-R1 distillation (several of R1's distilled models use Qwen2.5 as the base architecture).

Reading Model Naming Conventions

Common patterns you'll encounter across families:

  • Size suffix (7B, 70B, 405B) — Billions of parameters. Larger = more capable but more compute to run
  • Instruct / Chat / Base — "Instruct" or "Chat" = fine-tuned to follow instructions. "Base" = raw pre-trained weights (not for end users)
  • Q4, Q8 (quantisation) — Weight precision reduced to save memory. Q4 = 4-bit; sacrifices some quality for much smaller file size
  • GGUF — File format for local inference via llama.cpp/Ollama
  • MoE (Mixture of Experts) — Total params / Active params notation (e.g., 8x7B = 8 experts of 7B each; only 2 active per token = 14B active)
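The size and quantisation suffixes translate directly into memory arithmetic. A quick sanity check, counting weight memory only (KV cache and activations add more on top):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    # Memory for the weights alone: params * (bits / 8) bytes each.
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

fp16_7b = weight_memory_gb(7, 16)  # 14.0 GB: too large for most consumer GPUs
q4_7b   = weight_memory_gb(7, 4)   #  3.5 GB: fits comfortably on an 8 GB card

# MoE active parameters, per the 8x7B example above: 2 experts of 7B each.
active_params_b = 2 * 7            # 14B parameters active per token
```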

Checklist: Do You Understand This?

  • What is OpenAI's o-series and how does it differ from the GPT series?
  • What is Claude's main differentiator versus GPT-4o for production use cases?
  • When would you choose Gemini over Claude or GPT-4o?
  • What does "open-weight" mean in the context of Llama 4 or Qwen?
  • Why did DeepSeek-R1's release matter beyond just being another model?
  • What does "Q4" mean in a model file name?