Choosing the Right Model
With dozens of capable models across multiple providers, model selection is a genuine engineering decision, not just defaulting to "use GPT-4." This page gives you a systematic framework for matching your task requirements to the right model tier, provider, and configuration.
The Model Tier Framework
Think of models in four capability tiers. Most selection decisions reduce to picking the right tier first, then the specific model within that tier:
Efficient / Mini Tier
Examples: Claude Haiku, Gemini Flash, GPT-4o mini
Best for: Classification, simple extraction, routing decisions, high-volume tasks where cost matters more than peak quality
Cost: $0.10–$0.50 per million input tokens
Latency: Very fast (sub-second to first token)
Flagship / Balanced Tier
Examples: Claude Sonnet, Gemini Pro, GPT-4o
Best for: General-purpose work – writing, coding, analysis, chat, document processing. Default tier for most production applications.
Cost: $2–$5 per million input tokens
Latency: Fast (1–3 seconds to first token)
Frontier Tier
Examples: Claude Opus, GPT-5, Gemini Ultra
Best for: Hardest general-purpose tasks, research-grade work, highest-quality output requirements
Cost: $10–$30+ per million input tokens
Latency: Moderate (2–8 seconds to first token)
Reasoning Tier
Examples: o3, o4-mini, DeepSeek-R1, Claude extended thinking
Best for: Maths, formal logic, complex code, multi-step planning. Dramatically better on hard reasoning; expensive and slow for simple tasks.
Cost: $1–$40+ per million tokens (varies enormously with thinking budget)
Latency: Slow (15 seconds to 5 minutes per query)
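The tier framework above can be sketched as a small lookup table plus a selection helper. The per-tier figures below are illustrative mid-range values drawn from the ranges quoted above, not live pricing, and the function name and structure are my own:

```python
# Indicative per-tier figures taken from the ranges above (assumptions, not live pricing).
TIERS = {
    "efficient": {"usd_per_m_input": 0.30, "first_token_s": 0.5},
    "flagship":  {"usd_per_m_input": 3.50, "first_token_s": 2.0},
    "frontier":  {"usd_per_m_input": 20.0, "first_token_s": 5.0},
    "reasoning": {"usd_per_m_input": 10.0, "first_token_s": 60.0},
}

# Rough ordering by capability on hard tasks, least to most capable.
TIER_ORDER = ["efficient", "flagship", "frontier", "reasoning"]

def most_capable_tier(max_first_token_s: float, max_usd_per_m_input: float):
    """Pick the most capable tier that fits both the latency and cost budgets.

    Returns None when no tier fits, signalling the budgets are too tight.
    """
    for tier in reversed(TIER_ORDER):
        v = TIERS[tier]
        if v["first_token_s"] <= max_first_token_s and v["usd_per_m_input"] <= max_usd_per_m_input:
            return tier
    return None
```

For example, an interactive UI with a 3-second latency budget and a $5 input-cost ceiling lands on the flagship tier, which matches the "default tier for most production applications" guidance above.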
Task Routing by Query Type
| Task type | Recommended tier | Notes |
|---|---|---|
| Classification / routing | Efficient | Often a 7B local model works; no need for frontier |
| Simple extraction / summarisation | Efficient–Flagship | Depends on document complexity |
| Writing / editing | Flagship | Frontier adds only marginal quality for most writing |
| Code generation (simple) | Flagship | Claude Sonnet or GPT-4o suffice for most code |
| Code generation (complex algorithms) | Frontier or Reasoning | Try Frontier first; escalate to Reasoning if needed |
| Hard maths / logic | Reasoning | Reasoning models are dramatically better here |
| Document Q&A (large docs) | Flagship with long context | Gemini 2.5 Pro at 1M tokens; or RAG for very large corpora |
| Multi-step agentic planning | Reasoning or Frontier | Reasoning models reduce cascading errors in agents |
| Image analysis | Flagship | All flagship models handle images; Gemini for video |
| Real-time voice | Efficient + fast TTS | Latency is paramount; use fastest available STT+LLM+TTS |
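The routing table above amounts to a dictionary lookup in code. A minimal sketch (the task-type keys and default are my own naming, transcribed from the table):

```python
# Task-type -> recommended starting tier, transcribed from the routing table above.
ROUTING = {
    "classification": "efficient",
    "extraction": "efficient",
    "writing": "flagship",
    "code_simple": "flagship",
    "code_complex": "frontier",    # escalate to "reasoning" if results fall short
    "math": "reasoning",
    "doc_qa": "flagship",          # long-context variant, or RAG for very large corpora
    "agentic_planning": "reasoning",
    "image": "flagship",
    "voice": "efficient",
}

def route(task_type: str) -> str:
    """Return the tier to try first; unknown task types default to flagship."""
    return ROUTING.get(task_type, "flagship")
```

Defaulting unknown tasks to the flagship tier mirrors the advice above that it is the default for most production applications.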
Context Window Requirements
Context window = the total tokens (input + output) the model can process at once. Choose based on your actual data size:
- Standard (128K–200K) – Most flagship models. Handles documents up to ~300–500 pages, long conversation histories, moderate codebases.
- Extended (1M tokens) – Gemini 2.5 Pro. Handles entire codebases (100K+ lines), very long books, hundreds of documents simultaneously.
- Ultra-long (10M tokens) – Llama 4 Scout (experimental). Full document libraries; most use cases don't need this.
Longer context ≠ better quality. Models often degrade on information in the middle of very long contexts (the "lost in the middle" problem). For large knowledge bases, RAG frequently outperforms stuffing everything into context, and is much cheaper.
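A quick way to decide between the three options is to estimate your corpus size in tokens. The sketch below uses the common ~4-characters-per-token heuristic for English text (an approximation; use a real tokenizer such as tiktoken or your provider's token counter in production), and the function names are my own:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text.
    Not a real tokenizer -- use the provider's counter for billing-accurate numbers."""
    return len(text) // 4

def context_strategy(corpus_chars: int,
                     standard_window: int = 200_000,
                     extended_window: int = 1_000_000) -> str:
    """Pick a strategy based on estimated corpus size in tokens."""
    tokens = corpus_chars // 4
    if tokens <= standard_window:
        return "standard-context"
    if tokens <= extended_window:
        return "extended-context"  # e.g. Gemini 2.5 Pro; watch for lost-in-the-middle
    return "rag"                   # retrieve relevant chunks instead of stuffing context
```

Note the thresholds are the window sizes from the list above; in practice you should leave headroom for the output tokens and any conversation history.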
Latency Constraints
If your application has latency requirements, filter models accordingly before considering quality:
- Real-time voice/chat (<500ms to first token): Gemini Flash, GPT-4o mini, Claude Haiku, or a fast local model
- Interactive UI (<3 seconds): Any flagship model works; avoid reasoning models
- Async/background (>30 seconds acceptable): Any tier including reasoning models
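Filtering by latency budget can be done mechanically. In this sketch, the per-model first-token latencies are illustrative assumptions (not measurements), and the budget keys come from the list above:

```python
# Latency budgets from the list above (seconds to first token).
BUDGETS = {
    "realtime_voice": 0.5,
    "interactive_ui": 3.0,
    "async_background": float("inf"),
}

# Assumed typical first-token latencies per model (illustrative, not measured).
MODEL_LATENCY = {
    "gemini-flash": 0.4,
    "gpt-4o-mini": 0.4,
    "claude-haiku": 0.5,
    "claude-sonnet": 1.5,
    "gpt-4o": 1.5,
    "o3": 30.0,
}

def eligible_models(use_case: str) -> list[str]:
    """Return the models whose assumed latency fits the use case's budget."""
    budget = BUDGETS[use_case]
    return sorted(m for m, s in MODEL_LATENCY.items() if s <= budget)
```

Applying the filter first, then choosing the best-quality model among the survivors, matches the "filter before considering quality" advice above.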
Cost Modelling
When comparing models, calculate the effective cost per task, not just the per-token rate:
- Token count matters as much as rate – A model that generates more verbose output at a lower rate may cost more than a concise model at a higher rate
- Cached input tokens – Most providers offer significant discounts (50–90%) for cached input tokens (e.g., system prompts, static context). Structure your prompts to maximise cache hits.
- Reasoning tokens – Count as output tokens; a single hard reasoning query can generate 10,000–30,000 thinking tokens
- Batch APIs – Most providers offer 50% cost reduction for non-time-sensitive batch processing (results returned within 24 hours)
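Putting the factors above together, effective cost per task can be computed directly. A sketch, assuming a 90% cache discount and a 50% batch discount (both provider-dependent; check your provider's actual rates):

```python
def cost_per_task(input_tokens: int,
                  output_tokens: int,
                  usd_per_m_input: float,
                  usd_per_m_output: float,
                  cached_input_tokens: int = 0,
                  cache_discount: float = 0.9,  # assumed 90% off cached input
                  batch: bool = False) -> float:
    """Effective USD cost for one request.

    Reasoning/thinking tokens should be included in output_tokens, since
    providers bill them at the output rate.
    """
    fresh = input_tokens - cached_input_tokens
    cost = (fresh * usd_per_m_input
            + cached_input_tokens * usd_per_m_input * (1 - cache_discount)
            + output_tokens * usd_per_m_output) / 1_000_000
    if batch:
        cost *= 0.5  # assumed typical batch-API discount
    return cost
```

For example, at $3/$15 per million input/output tokens, a 10K-input, 1K-output request costs $0.045 uncached; caching 8K of the input drops it to $0.0234, roughly halving the cost, which is why cache-friendly prompt structure matters at volume.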
Data Sensitivity Decision
If your data cannot leave your infrastructure (HIPAA, GDPR, trade secrets, classified data), the model choice becomes straightforward:
- Use a VPC-isolated offering from a cloud provider (Azure OpenAI, AWS Bedrock, GCP Vertex AI)
- Or deploy an open-weight model on your own infrastructure (Ollama, vLLM)
- Verify the specific data processing agreements, not just the marketing claims
Benchmarks as Selection Signals
Common benchmarks and what they tell you:
- MMLU / MMLU-Pro – General knowledge breadth across 57 subjects. Good proxy for general intelligence but saturating at top models.
- GPQA Diamond – Graduate-level science questions. Discriminates between frontier models where MMLU no longer does.
- AIME 2024/2025 – Hard maths olympiad. Best benchmark for reasoning model comparison.
- SWE-bench Verified – Real GitHub issue resolution. Best proxy for coding agent capability in production.
- Chatbot Arena / LMSYS – Human preference ratings from blind head-to-head comparisons. Best proxy for "which model do humans prefer" rather than academic accuracy.
Most important benchmark: your own evaluation
No published benchmark tells you which model performs best on your specific task. Build a small evaluation set (50–200 representative examples) and measure quality on your actual use case before committing to a model in production.
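The evaluation loop itself can be very small. A minimal sketch, assuming exact-match scoring (real tasks often need fuzzier scoring, e.g. an LLM judge or task-specific metrics); `model_fn` stands in for whatever function calls your candidate model:

```python
def evaluate(model_fn, examples) -> float:
    """Tiny accuracy harness.

    model_fn: maps an input string to an output string (e.g. wraps an API call).
    examples: list of (input, expected_output) pairs from your own task.
    Returns the fraction of exact (whitespace-insensitive) matches.
    """
    correct = sum(1 for x, gold in examples if model_fn(x).strip() == gold.strip())
    return correct / len(examples)
```

Run the same example set against each candidate model and pick the winner by measured score; this is the only number that reflects your actual use case rather than a public benchmark's distribution.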
Decision Matrix: Quick Reference
| Need | Start with | If insufficient |
|---|---|---|
| Lowest cost, simple tasks | Claude Haiku / Gemini Flash | Claude Sonnet |
| Best general quality | Claude Sonnet or GPT-4o | Claude Opus or GPT-5 |
| Hard reasoning / maths | o4-mini medium effort | o3 high effort |
| Very long documents | Gemini 2.5 Pro | RAG + Flagship |
| Data must stay on-prem | Llama 3.1 70B via Ollama | DeepSeek-V3 or fine-tuned Llama |
| Lowest latency | Gemini Flash / Haiku | Local 7B model |
| Best open-weight reasoning | DeepSeek-R1 API | DeepSeek-R1 32B local |
Checklist: Do You Understand This?
- What are the four model tiers and which type of task fits each?
- When should you use a 1M token context window model vs RAG?
- How do cached input tokens affect your cost modelling?
- Name two benchmarks relevant to reasoning model selection and two relevant to general model selection.
- What is the most reliable way to know which model is best for your specific application?
- If your data cannot leave your infrastructure, what are your two main options?