🧠 All Things AI
Intermediate

The Major Model Families

The AI model landscape in 2025–2026 is dominated by a handful of major families, each with distinct strengths, licensing models, and use-case sweet spots. This page covers each family in enough depth to make informed selection decisions.

Closed API — frontier quality, no weight access

  • GPT-4o / GPT-5 — OpenAI; broadest ecosystem
  • Claude Sonnet / Opus — Anthropic; long context, coding
  • Gemini Flash / Pro — Google; 1M context, multimodal

Reasoning models — extended thinking, harder tasks

  • o3 / o4-mini — OpenAI; native tool use in trace
  • Claude extended thinking — Anthropic; 200K + thinking
  • DeepSeek-R1 — open-weight; matches o1

Open-weight — download & self-host

  • Llama 4 Maverick — Meta; 10M context, MoE
  • Qwen 2.5 (72B) — Alibaba; multilingual, coding
  • Mistral / Mixtral — Mistral; EU, compact, efficient
  • Phi-4 (14B) — Microsoft; edge / mobile

Closed API = pay-per-token, zero ops. Open-weight = fixed infra cost, full control. Most production systems use both.
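The "use both" pattern usually reduces to a routing decision. Here is a minimal sketch with an entirely hypothetical helper; the 50M tokens/day threshold is illustrative, not a real break-even figure:

```python
# Hypothetical routing helper illustrating the closed-API vs open-weight trade-off.
# The volume threshold below is illustrative only.
def choose_backend(tokens_per_day: int, data_sensitive: bool, needs_frontier: bool) -> str:
    if data_sensitive:
        return "open_weight"   # full data control: weights run on your own infra
    if needs_frontier:
        return "closed_api"    # frontier-quality models are API-only
    # At high volume, a fixed infra cost can beat pay-per-token pricing.
    return "open_weight" if tokens_per_day > 50_000_000 else "closed_api"
```

Real systems add latency, compliance, and fallback considerations, but the shape of the decision is the same.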

OpenAI — GPT and o-Series

OpenAI maintains two parallel model lines for different needs:

GPT Series

| Model | Context | Strengths | Best for |
|---|---|---|---|
| GPT-4o | 128K | Multimodal, fast, broad capability | Default workhorse for most tasks |
| GPT-4o mini | 128K | Fast, very cheap | High-volume simple tasks |
| GPT-5 | 400K | Highest general capability, reduced hallucination | Professional knowledge work, hardest general tasks |

o-Series (Reasoning)

OpenAI's separate reasoning-focused family. These models spend additional compute "thinking" before answering, and are dramatically better at maths, formal logic, and complex code. See the Reasoning Models section for full detail.

  • o3 — Full reasoning with native tool use; best quality on hardest problems
  • o4-mini — Cost-efficient reasoning; on benchmarks often matches o3 at 1/9th the cost

OpenAI's core strengths: Largest developer ecosystem, broadest tool integration (Assistants API, function calling, fine-tuning), most mature production infrastructure.
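Function calling is worth seeing concretely. The sketch below only builds a chat completions request body with one tool attached; the `get_weather` tool and its schema are invented for illustration, and no request is actually sent:

```python
# Shape of a chat completions request with function calling (tool use).
# The tool name and schema here are hypothetical examples.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "model": "o4-mini",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
}
# With the openai SDK this would be passed to client.chat.completions.create(**request).
```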

Anthropic — Claude Family

Anthropic's Claude models are known for instruction-following quality, long-context accuracy, coding reliability, and safety-focused behaviour. The naming convention runs Haiku (fast/cheap) → Sonnet (balanced) → Opus (most capable).

| Model | Context | Strengths |
|---|---|---|
| Claude Haiku 4.5 | 200K | Fastest Claude, very cheap, good for routing and classification |
| Claude Sonnet 4.5 | 200K | Best all-round value: coding, analysis, writing, agentic tasks |
| Claude Opus 4.6 | 200K | Most capable Claude; extended thinking mode; research-grade tasks |

Claude's core strengths: Extremely long and accurate context handling (200K native), strong coding reliability (consistently top-rated on SWE-bench), "computer use" for browser/desktop automation, and strong agentic tool-calling behaviour. Claude Code (Anthropic's agentic coding tool) is built on Claude Sonnet/Opus.
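Claude's tool calling uses a slightly different schema convention from OpenAI's: each tool is declared with a top-level `name` and an `input_schema` (JSON Schema) rather than a nested `function` object. A sketch of a Messages API request body; the `run_tests` tool and the model id are illustrative, and nothing is sent:

```python
# Shape of an Anthropic Messages API request with one tool attached.
# Tool name, schema, and model id are illustrative; check current docs for exact ids.
payload = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "tools": [{
        "name": "run_tests",
        "description": "Run the project's test suite and report the results",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    }],
    "messages": [{"role": "user", "content": "Run the tests in ./src"}],
}
```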

Google — Gemini Family

Google's Gemini family is defined by its multimodal capability and extreme context lengths.

| Model | Context | Strengths |
|---|---|---|
| Gemini Flash 2.5 | 1M | Very fast, 1M context, multimodal, cheap |
| Gemini Pro 2.5 | 1M | Strong coding, reasoning, multimodal; 1M context |
| Gemini Ultra / 3.x | 1M+ | Frontier capability, visual and audio reasoning |

Gemini's core strengths: The 1M token context window is the largest in production; invaluable for whole-codebase analysis, long book summarisation, or processing hundreds of documents at once. Strong multimodal — handles images, audio, and (for some models) video natively. Deep Google Workspace integration.
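To make "1M tokens" concrete, here is a back-of-envelope estimate using the common rule of thumb of roughly 4 characters per token for English prose (the exact ratio varies by tokenizer and language):

```python
def approx_tokens(text_chars: int) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    return text_chars // 4

# A ~10-page document is very roughly 30,000 characters.
doc_tokens = approx_tokens(30_000)            # ~7,500 tokens
docs_per_context = 1_000_000 // doc_tokens    # ~133 such documents per 1M-token window
```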

Meta — LLaMA Family (Open-Weight)

LLaMA is the world's most widely used open-weight model family. Meta releases model weights freely for research and commercial use (check specific version licences).

| Model | Parameters | Key features |
|---|---|---|
| Llama 3.1 8B | 8B | Runs on consumer GPU; good general reasoning |
| Llama 3.1 70B | 70B | Near-frontier quality; runs on 2× consumer GPUs |
| Llama 3.1 405B | 405B | Best open-weight general model; requires data centre GPUs |
| Llama 4 Scout/Maverick | MoE, ~400B total | Multimodal (vision), up to 10M token context, MoE efficiency |

LLaMA's core strengths: no per-token cost, full data control, and fine-tuning freedom. The massive open-source ecosystem (Ollama, llama.cpp, vLLM) makes deployment straightforward. Llama 4 achieves 85–86% MMLU-Pro — matching or approaching proprietary frontier models on many benchmarks.
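As a concrete example of that deployment path, Ollama exposes a local HTTP API once a model has been pulled. The sketch below only builds the request body; the model tag assumes `ollama pull llama3.1:8b` has been run and may differ on your install:

```python
import json

# Request body for Ollama's local generate endpoint
# (POST http://localhost:11434/api/generate on a default install).
body = {
    "model": "llama3.1:8b",   # assumes this tag has been pulled locally
    "prompt": "Explain the trade-offs of open-weight models in two sentences.",
    "stream": False,          # return one JSON object instead of a token stream
}
payload = json.dumps(body)

# To actually call it (not executed here):
#   import urllib.request
#   req = urllib.request.Request("http://localhost:11434/api/generate",
#                                data=payload.encode(), method="POST")
#   print(json.load(urllib.request.urlopen(req))["response"])
```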

Mistral — Compact European Models

Mistral AI (Paris) makes high-efficiency open-weight models with a strong commercial presence in Europe:

  • Mistral 7B / Nemo (12B) — Punch well above their weight; particularly strong on coding and instruction-following
  • Mixtral 8x7B / 8x22B — Mixture-of-Experts models; only a subset of experts active per token, making them cost-efficient
  • Mistral Large — Closed API model; competitive with GPT-4 tier on European-language tasks

Mistral's niche: European regulatory comfort (French company, EU-hosted options), strong multilingual European languages, and the most efficient open models for their quality tier.

DeepSeek — Chinese Open-Weight Leader

DeepSeek has produced the most impactful open-weight releases of 2024–2025:

  • DeepSeek-V3 / V3.2 — Non-reasoning model; 671B MoE, 37B active; strong on coding and general tasks; very cheap API ($0.27/1M input tokens for V3)
  • DeepSeek-R1 — Open-weight reasoning model that matches o1; full reasoning trace visible; distilled versions available for local deployment

DeepSeek's impact: R1's January 2025 release caused a "global cost reset" — demonstrating frontier reasoning capability is achievable without massive closed-source infrastructure. However: data privacy and safety alignment considerations apply (Chinese training and governance).

Microsoft — Phi (Small Language Models)

Microsoft's Phi family focuses on efficiency at small scale:

  • Phi-3 Mini (3.8B) — Runs on phones; strong reasoning per parameter
  • Phi-3 Small (7B) / Medium (14B) — Strong coding; can run on laptop GPU
  • Phi-4 — Improved quality; strong on STEM tasks for its size

Phi's niche: Edge deployment (Android, iOS, embedded), offline applications, environments where even a 7B model is too large.

Alibaba — Qwen (Multilingual Open Models)

Qwen (from Alibaba Cloud) is the leading open-weight model family for multilingual tasks, especially Chinese and Asian languages:

  • Qwen2.5 7B / 14B / 32B / 72B — Strong instruction-following, coding, and math at each size tier
  • Qwen2.5-Coder — Code-specialised variant; competitive with DeepSeek-Coder
  • QwQ-32B — Reasoning-capable model; comparable to o1 on some tasks

Qwen's niche: Applications serving Chinese or East Asian markets; base models for DeepSeek-R1 distillation (several of R1's distilled models use Qwen2.5 as the base architecture).

Reading Model Naming Conventions

Common patterns you'll encounter across families:

  • Size suffix (7B, 70B, 405B) — Billions of parameters. Larger = more capable but more compute to run
  • Instruct / Chat / Base — "Instruct" or "Chat" = fine-tuned to follow instructions. "Base" = raw pre-trained weights (not for end users)
  • Q4, Q8 (quantisation) — Weight precision reduced to save memory. Q4 = 4-bit; sacrifices some quality for much smaller file size
  • GGUF — File format for local inference via llama.cpp/Ollama
  • MoE (Mixture of Experts) — Total params / Active params notation (e.g., 8x7B = 8 experts of 7B each; only 2 active per token = 14B active)
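The size and quantisation suffixes translate directly into memory arithmetic. A quick sanity check, counting weight memory only (KV cache and activations add more on top):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    # Memory for the weights alone: params * (bits / 8) bytes each.
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

fp16_7b = weight_memory_gb(7, 16)  # 14.0 GB: too large for most consumer GPUs
q4_7b   = weight_memory_gb(7, 4)   #  3.5 GB: fits comfortably on an 8 GB card

# MoE active parameters, per the 8x7B example above: 2 experts of 7B each.
active_params_b = 2 * 7            # 14B parameters active per token
```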

Checklist: Do You Understand This?

  • What is OpenAI's o-series and how does it differ from the GPT series?
  • What is Claude's main differentiator versus GPT-4o for production use cases?
  • When would you choose Gemini over Claude or GPT-4o?
  • What does "open-weight" mean in the context of Llama 4 or Qwen?
  • Why did DeepSeek-R1's release matter beyond just being another model?
  • What does "Q4" mean in a model file name?