Choosing the Right Model
With dozens of capable models across multiple providers, model selection is a genuine engineering decision, not just defaulting to "use GPT-4." This page gives you a systematic framework for matching your task requirements to the right model tier, provider, and configuration.
The Model Tier Framework
Think of models in four capability tiers. Most selection decisions reduce to picking the right tier first, then the specific model within that tier:
Efficient / Mini Tier
Examples: Claude Haiku, Gemini Flash, GPT-4o mini
Best for: Classification, simple extraction, routing decisions, high-volume tasks where cost matters more than peak quality
Cost: $0.10–$0.50 per million input tokens
Latency: Very fast (sub-second to first token)
Flagship / Balanced Tier
Examples: Claude Sonnet, Gemini Pro, GPT-4o
Best for: General-purpose work – writing, coding, analysis, chat, document processing. Default tier for most production applications.
Cost: $2–$5 per million input tokens
Latency: Fast (1–3 seconds to first token)
Frontier Tier
Examples: Claude Opus, GPT-5, Gemini Ultra
Best for: Hardest general-purpose tasks, research-grade work, highest-quality output requirements
Cost: $10–$30+ per million input tokens
Latency: Moderate (2–8 seconds to first token)
Reasoning Tier
Examples: o3, o4-mini, DeepSeek-R1, Claude extended thinking
Best for: Maths, formal logic, complex code, multi-step planning. Dramatically better on hard reasoning; expensive and slow for simple tasks.
Cost: $1–$40+ per million tokens (varies enormously with thinking budget)
Latency: Slow (15 seconds to 5 minutes per query)
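The tier framework above can be sketched as a small lookup table plus a selection helper. The per-tier figures below are illustrative mid-range values drawn from the ranges quoted above, not live pricing, and the function name and structure are my own:

```python
# Indicative per-tier figures taken from the ranges above (assumptions, not live pricing).
TIERS = {
    "efficient": {"usd_per_m_input": 0.30, "first_token_s": 0.5},
    "flagship":  {"usd_per_m_input": 3.50, "first_token_s": 2.0},
    "frontier":  {"usd_per_m_input": 20.0, "first_token_s": 5.0},
    "reasoning": {"usd_per_m_input": 10.0, "first_token_s": 60.0},
}

# Rough ordering by capability on hard tasks, least to most capable.
TIER_ORDER = ["efficient", "flagship", "frontier", "reasoning"]

def most_capable_tier(max_first_token_s: float, max_usd_per_m_input: float):
    """Pick the most capable tier that fits both the latency and cost budgets.

    Returns None when no tier fits, signalling the budgets are too tight.
    """
    for tier in reversed(TIER_ORDER):
        v = TIERS[tier]
        if v["first_token_s"] <= max_first_token_s and v["usd_per_m_input"] <= max_usd_per_m_input:
            return tier
    return None
```

For example, an interactive UI with a 3-second latency budget and a $5 input-cost ceiling lands on the flagship tier, which matches the "default tier for most production applications" guidance above.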
Task Routing by Query Type
| Task type | Recommended tier | Notes |
|---|---|---|
| Classification / routing | Efficient | Often a 7B local model works; no need for frontier |
| Simple extraction / summarisation | Efficient–Flagship | Depends on document complexity |
| Writing / editing | Flagship | Frontier adds only marginal quality for most writing |
| Code generation (simple) | Flagship | Claude Sonnet or GPT-4o suffice for most code |
| Code generation (complex algorithms) | Frontier or Reasoning | Try Frontier first; escalate to Reasoning if needed |
| Hard maths / logic | Reasoning | Reasoning models are dramatically better here |
| Document Q&A (large docs) | Flagship with long context | Gemini 2.5 Pro at 1M tokens; or RAG for very large corpora |
| Multi-step agentic planning | Reasoning or Frontier | Reasoning models reduce cascading errors in agents |
| Image analysis | Flagship | All flagship models handle images; Gemini for video |
| Real-time voice | Efficient + fast TTS | Latency is paramount; use fastest available STT+LLM+TTS |
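The routing table above amounts to a dictionary lookup in code. A minimal sketch (the task-type keys and default are my own naming, transcribed from the table):

```python
# Task-type -> recommended starting tier, transcribed from the routing table above.
ROUTING = {
    "classification": "efficient",
    "extraction": "efficient",
    "writing": "flagship",
    "code_simple": "flagship",
    "code_complex": "frontier",    # escalate to "reasoning" if results fall short
    "math": "reasoning",
    "doc_qa": "flagship",          # long-context variant, or RAG for very large corpora
    "agentic_planning": "reasoning",
    "image": "flagship",
    "voice": "efficient",
}

def route(task_type: str) -> str:
    """Return the tier to try first; unknown task types default to flagship."""
    return ROUTING.get(task_type, "flagship")
```

Defaulting unknown tasks to the flagship tier mirrors the advice above that it is the default for most production applications.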
Context Window Requirements
Context window = the total tokens (input + output) the model can process at once. Choose based on your actual data size:
- Standard (128K–200K) – Most flagship models. Handles documents up to ~300–500 pages, long conversation histories, moderate codebases.
- Extended (1M tokens) – Gemini 2.5 Pro. Handles entire codebases (100K+ lines), very long books, hundreds of documents simultaneously.
- Ultra-long (10M tokens) – Llama 4 Scout (experimental). Full document libraries; most use cases don't need this.
Longer context ≠ better quality. Models often degrade on information in the middle of very long contexts (the "lost in the middle" problem). For large knowledge bases, RAG frequently outperforms stuffing everything into context, and is much cheaper.
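A quick way to decide between the three options is to estimate your corpus size in tokens. The sketch below uses the common ~4-characters-per-token heuristic for English text (an approximation; use a real tokenizer such as tiktoken or your provider's token counter in production), and the function names are my own:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text.
    Not a real tokenizer -- use the provider's counter for billing-accurate numbers."""
    return len(text) // 4

def context_strategy(corpus_chars: int,
                     standard_window: int = 200_000,
                     extended_window: int = 1_000_000) -> str:
    """Pick a strategy based on estimated corpus size in tokens."""
    tokens = corpus_chars // 4
    if tokens <= standard_window:
        return "standard-context"
    if tokens <= extended_window:
        return "extended-context"  # e.g. Gemini 2.5 Pro; watch for lost-in-the-middle
    return "rag"                   # retrieve relevant chunks instead of stuffing context
```

Note the thresholds are the window sizes from the list above; in practice you should leave headroom for the output tokens and any conversation history.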
Latency Constraints
If your application has latency requirements, filter models accordingly before considering quality:
- Real-time voice/chat (<500ms to first token): Gemini Flash, GPT-4o mini, Claude Haiku, or a fast local model
- Interactive UI (<3 seconds): Any flagship model works; avoid reasoning models
- Async/background (>30 seconds acceptable): Any tier including reasoning models
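Filtering by latency budget can be done mechanically. In this sketch, the per-model first-token latencies are illustrative assumptions (not measurements), and the budget keys come from the list above:

```python
# Latency budgets from the list above (seconds to first token).
BUDGETS = {
    "realtime_voice": 0.5,
    "interactive_ui": 3.0,
    "async_background": float("inf"),
}

# Assumed typical first-token latencies per model (illustrative, not measured).
MODEL_LATENCY = {
    "gemini-flash": 0.4,
    "gpt-4o-mini": 0.4,
    "claude-haiku": 0.5,
    "claude-sonnet": 1.5,
    "gpt-4o": 1.5,
    "o3": 30.0,
}

def eligible_models(use_case: str) -> list[str]:
    """Return the models whose assumed latency fits the use case's budget."""
    budget = BUDGETS[use_case]
    return sorted(m for m, s in MODEL_LATENCY.items() if s <= budget)
```

Applying the filter first, then choosing the best-quality model among the survivors, matches the "filter before considering quality" advice above.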
Cost Modelling
When comparing models, calculate the effective cost per task, not just the per-token rate:
- Token count matters as much as rate – A model that generates more verbose output at a lower rate may cost more than a concise model at a higher rate
- Cached input tokens – Most providers offer significant discounts (50–90%) for cached input tokens (e.g., system prompts, static context). Structure your prompts to maximise cache hits.
- Reasoning tokens – Count as output tokens; a single hard reasoning query can generate 10,000–30,000 thinking tokens
- Batch APIs – Most providers offer 50% cost reduction for non-time-sensitive batch processing (results returned within 24 hours)
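Putting the factors above together, effective cost per task can be computed directly. A sketch, assuming a 90% cache discount and a 50% batch discount (both provider-dependent; check your provider's actual rates):

```python
def cost_per_task(input_tokens: int,
                  output_tokens: int,
                  usd_per_m_input: float,
                  usd_per_m_output: float,
                  cached_input_tokens: int = 0,
                  cache_discount: float = 0.9,  # assumed 90% off cached input
                  batch: bool = False) -> float:
    """Effective USD cost for one request.

    Reasoning/thinking tokens should be included in output_tokens, since
    providers bill them at the output rate.
    """
    fresh = input_tokens - cached_input_tokens
    cost = (fresh * usd_per_m_input
            + cached_input_tokens * usd_per_m_input * (1 - cache_discount)
            + output_tokens * usd_per_m_output) / 1_000_000
    if batch:
        cost *= 0.5  # assumed typical batch-API discount
    return cost
```

For example, at $3/$15 per million input/output tokens, a 10K-input, 1K-output request costs $0.045 uncached; caching 8K of the input drops it to $0.0234, roughly halving the cost, which is why cache-friendly prompt structure matters at volume.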
Data Sensitivity Decision
If your data cannot leave your infrastructure (HIPAA, GDPR, trade secrets, classified data), the model choice becomes straightforward:
- Use a VPC-isolated offering from a cloud provider (Azure OpenAI, AWS Bedrock, GCP Vertex AI)
- Or deploy an open-weight model on your own infrastructure (Ollama, vLLM)
- Verify the specific data processing agreements, not just the marketing claims
Benchmarks as Selection Signals
Common benchmarks and what they tell you:
- MMLU / MMLU-Pro – General knowledge breadth across 57 subjects. Good proxy for general intelligence but saturating at top models.
- GPQA Diamond – Graduate-level science questions. Discriminates between frontier models where MMLU no longer does.
- AIME 2024/2025 – Hard maths olympiad. Best benchmark for reasoning model comparison.
- SWE-bench Verified – Real GitHub issue resolution. Best proxy for coding agent capability in production.
- Chatbot Arena / LMSYS – Human preference ratings from blind head-to-head comparisons. Best proxy for "which model do humans prefer" rather than academic accuracy.
Most important benchmark: your own evaluation
No published benchmark tells you which model performs best on your specific task. Build a small evaluation set (50–200 representative examples) and measure quality on your actual use case before committing to a model in production.
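The evaluation loop itself can be very small. A minimal sketch, assuming exact-match scoring (real tasks often need fuzzier scoring, e.g. an LLM judge or task-specific metrics); `model_fn` stands in for whatever function calls your candidate model:

```python
def evaluate(model_fn, examples) -> float:
    """Tiny accuracy harness.

    model_fn: maps an input string to an output string (e.g. wraps an API call).
    examples: list of (input, expected_output) pairs from your own task.
    Returns the fraction of exact (whitespace-insensitive) matches.
    """
    correct = sum(1 for x, gold in examples if model_fn(x).strip() == gold.strip())
    return correct / len(examples)
```

Run the same example set against each candidate model and pick the winner by measured score; this is the only number that reflects your actual use case rather than a public benchmark's distribution.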
Decision Matrix: Quick Reference
| Need | Start with | If insufficient |
|---|---|---|
| Lowest cost, simple tasks | Claude Haiku / Gemini Flash | Claude Sonnet |
| Best general quality | Claude Sonnet or GPT-4o | Claude Opus or GPT-5 |
| Hard reasoning / maths | o4-mini medium effort | o3 high effort |
| Very long documents | Gemini 2.5 Pro | RAG + Flagship |
| Data must stay on-prem | Llama 3.1 70B via Ollama | DeepSeek-V3 or fine-tuned Llama |
| Lowest latency | Gemini Flash / Haiku | Local 7B model |
| Best open-weight reasoning | DeepSeek-R1 API | DeepSeek-R1 32B local |
Checklist: Do You Understand This?
- What are the four model tiers and which type of task fits each?
- When should you use a 1M token context window model vs RAG?
- How do cached input tokens affect your cost modelling?
- Name two benchmarks relevant to reasoning model selection and two relevant to general model selection.
- What is the most reliable way to know which model is best for your specific application?
- If your data cannot leave your infrastructure, what are your two main options?