🧠 All Things AI
Intermediate

Choosing the Right Model

With dozens of capable models across multiple providers, model selection is a genuine engineering decision, not just defaulting to "use GPT-4." This page gives you a systematic framework for matching your task requirements to the right model tier, provider, and configuration.

The Model Tier Framework

Think of models in four capability tiers. Most selection decisions reduce to picking the right tier first, then the specific model within that tier:

Efficient / Mini Tier

Examples: Claude Haiku, Gemini Flash, GPT-4o mini
Best for: Classification, simple extraction, routing decisions, high-volume tasks where cost matters more than peak quality
Cost: $0.10–$0.50 per million input tokens
Latency: Very fast (sub-second to first token)

Flagship / Balanced Tier

Examples: Claude Sonnet, Gemini Pro, GPT-4o
Best for: General-purpose work – writing, coding, analysis, chat, document processing. Default tier for most production applications.
Cost: $2–5 per million input tokens
Latency: Fast (1–3 seconds to first token)

Frontier Tier

Examples: Claude Opus, GPT-5, Gemini Ultra
Best for: Hardest general-purpose tasks, research-grade work, highest-quality output requirements
Cost: $10–30+ per million input tokens
Latency: Moderate (2–8 seconds to first token)

Reasoning Tier

Examples: o3, o4-mini, DeepSeek-R1, Claude extended thinking
Best for: Maths, formal logic, complex code, multi-step planning. Dramatically better on hard reasoning; expensive and slow for simple tasks.
Cost: $1–40+ per million tokens (varies enormously with thinking budget)
Latency: Slow (15 seconds to 5 minutes per query)

Task Routing by Query Type

Task type → recommended tier, with notes:

  • Classification / routing → Efficient. Often a 7B local model works; no need for frontier.
  • Simple extraction / summarisation → Efficient–Flagship. Depends on document complexity.
  • Writing / editing → Flagship. Frontier adds marginal quality for most writing.
  • Code generation (simple) → Flagship. Claude Sonnet or GPT-4o suffice for most code.
  • Code generation (complex algorithms) → Frontier or Reasoning. Try Frontier first; escalate to Reasoning if needed.
  • Hard maths / logic → Reasoning. Reasoning models are dramatically better here.
  • Document Q&A (large docs) → Flagship with long context. Gemini 2.5 Pro at 1M tokens; or RAG for very large corpora.
  • Multi-step agentic planning → Reasoning or Frontier. Reasoning models reduce cascading errors in agents.
  • Image analysis → Flagship. All flagship models handle images; Gemini for video.
  • Real-time voice → Efficient + fast TTS. Latency is paramount; use the fastest available STT + LLM + TTS.
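In code, this kind of routing often reduces to a pair of lookup tables. The sketch below is illustrative only: the task labels, tier names, and model ids are placeholders, not real API identifiers.

```python
# Hypothetical tier-based router. Task labels and model ids are
# placeholders - substitute your provider's real model identifiers.
TIER_BY_TASK = {
    "classification": "efficient",
    "extraction": "efficient",
    "writing": "flagship",
    "code_simple": "flagship",
    "code_complex": "frontier",
    "math": "reasoning",
    "planning": "reasoning",
}

MODEL_BY_TIER = {
    "efficient": "claude-haiku",   # placeholder id
    "flagship": "claude-sonnet",   # placeholder id
    "frontier": "claude-opus",     # placeholder id
    "reasoning": "o4-mini",        # placeholder id
}

def pick_model(task_type: str) -> str:
    """Map a task type to a tier, then to a concrete model id.

    Unknown task types default to the flagship tier, mirroring the
    advice that flagship is the default for most production work.
    """
    tier = TIER_BY_TASK.get(task_type, "flagship")
    return MODEL_BY_TIER[tier]
```

A real router would usually classify the incoming query first (often with an efficient-tier model) before this lookup.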

Context Window Requirements

Context window = the total tokens (input + output) the model can process at once. Choose based on your actual data size:

  • Standard (128K–200K) – Most flagship models. Handles documents up to ~300–500 pages, long conversation histories, moderate codebases.
  • Extended (1M tokens) – Gemini 2.5 Pro. Handles entire codebases (100K+ lines), very long books, hundreds of documents simultaneously.
  • Ultra-long (10M tokens) – Llama 4 Scout (experimental). Full document libraries; most use cases don't need this.

Longer context ≠ better quality. Models often degrade on information in the middle of very long contexts (the "lost in the middle" problem). For large knowledge bases, RAG frequently outperforms stuffing everything into context, and is much cheaper.
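A rough context-fit check can use the common ~4-characters-per-token heuristic for English prose. This is a sketch under that assumption; use your provider's tokenizer for accurate counts.

```python
# Rough context-fit check. Assumes ~4 characters per token for English
# prose - a heuristic only; use the provider's tokenizer for real counts.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_context(text: str, context_window: int, output_budget: int = 4096) -> bool:
    """The window counts input + output tokens together, so reserve
    an output budget before comparing against the input size."""
    return estimate_tokens(text) + output_budget <= context_window
```

For example, a ~400,000-character document fits comfortably in a 200K window, while a ~1,000,000-character one does not.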

Latency Constraints

If your application has latency requirements, filter models accordingly before considering quality:

  • Real-time voice/chat (<500ms to first token): Gemini Flash, GPT-4o mini, Claude Haiku, or a fast local model
  • Interactive UI (<3 seconds): Any flagship model works; avoid reasoning models
  • Async/background (>30 seconds acceptable): Any tier including reasoning models
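The filter-by-latency-first idea can be expressed as a table lookup. The first-token numbers below are illustrative placeholders, not measured values; real latency varies by provider, region, and load, so benchmark your own stack.

```python
# Illustrative first-token latency estimates in milliseconds.
# These are made-up placeholder figures - measure your own deployment.
FIRST_TOKEN_MS = {
    "gemini-flash": 300,
    "gpt-4o-mini": 350,
    "claude-sonnet": 1500,
    "gpt-4o": 1200,
    "o3": 30000,
}

def candidates(budget_ms: int) -> list[str]:
    """Return models meeting the latency budget, before ranking on quality."""
    return sorted(m for m, ms in FIRST_TOKEN_MS.items() if ms <= budget_ms)
```

Only after this filter would you compare the surviving candidates on quality and cost.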

Cost Modelling

When comparing models, calculate the effective cost per task, not just the per-token rate:

  • Token count matters as much as rate – A model that generates more verbose output at a lower rate may cost more than a concise model at a higher rate
  • Cached input tokens β€” Most providers offer significant discounts (50–90%) for cached input tokens (e.g., system prompts, static context). Structure your prompts to maximise cache hits.
  • Reasoning tokens β€” Count as output tokens; a single hard reasoning query can generate 10,000–30,000 thinking tokens
  • Batch APIs β€” Most providers offer 50% cost reduction for non-time-sensitive batch processing (results returned within 24 hours)
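Putting the points above together, a per-task cost model might look like the sketch below. The cache and batch discounts are parameters you should set from your provider's actual pricing, not fixed facts.

```python
def cost_per_task(
    input_tokens: int,
    output_tokens: int,          # include reasoning/thinking tokens here
    input_rate: float,           # $ per million input tokens
    output_rate: float,          # $ per million output tokens
    cached_fraction: float = 0.0,  # share of input served from cache
    cache_discount: float = 0.9,   # e.g. 90% off cached input (provider-specific)
    batch: bool = False,           # assume a flat 50% batch discount
) -> float:
    """Effective dollar cost of one task, not just the per-token rate."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    cost = (
        fresh * input_rate
        + cached * input_rate * (1 - cache_discount)
        + output_tokens * output_rate
    ) / 1_000_000
    return cost * (0.5 if batch else 1.0)
```

For example, at $3/M input and $15/M output, a task with 10,000 input and 1,000 output tokens costs $0.045; with 80% of the input cached at a 90% discount it drops to $0.0234.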

Data Sensitivity Decision

If your data cannot leave your infrastructure (HIPAA, GDPR, trade secrets, classified data), the model choice becomes straightforward:

  1. Use a VPC-isolated offering from a cloud provider (Azure OpenAI, AWS Bedrock, GCP Vertex AI)
  2. Or deploy an open-weight model on your own infrastructure (Ollama, vLLM)
  3. Verify the specific data processing agreements, not just the marketing claims
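As a sketch of option 2, a call to a locally hosted model via Ollama's /api/generate endpoint might look like this, assuming Ollama is running on its default port (11434) with the model already pulled. The model name is a placeholder.

```python
import json
import urllib.request

# Call a locally hosted open-weight model via Ollama, so data never
# leaves your infrastructure. Assumes Ollama on localhost:11434.

def build_request(model: str, prompt: str) -> urllib.request.Request:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload.encode(),
        headers={"Content-Type": "application/json"},
    )

def generate(model: str, prompt: str) -> str:
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]
```

The same pattern applies to vLLM's OpenAI-compatible server, with a different URL and payload shape.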

Benchmarks as Selection Signals

Common benchmarks and what they tell you:

  • MMLU / MMLU-Pro – General knowledge breadth across 57 subjects. Good proxy for general capability, but saturating at the top end.
  • GPQA Diamond – Graduate-level science questions. Discriminates between frontier models where MMLU no longer does.
  • AIME 2024/2025 – Hard competition maths. Best benchmark for reasoning model comparison.
  • SWE-bench Verified – Real GitHub issue resolution. Best proxy for coding agent capability in production.
  • Chatbot Arena / LMSYS – Human preference ratings from blind head-to-head comparisons. Best proxy for "which model do humans prefer" rather than academic accuracy.

Most important benchmark: your own evaluation

No published benchmark tells you which model performs best on your specific task. Build a small evaluation set (50–200 representative examples) and measure quality on your actual use case before committing to a model in production.
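A minimal version of such an evaluation harness is sketched below, using exact-match scoring for simplicity. Real tasks often need fuzzier metrics or an LLM judge; `call_model` is a stand-in for your provider's API client, not a real function.

```python
from typing import Callable

def evaluate(call_model: Callable[[str], str],
             examples: list[tuple[str, str]]) -> float:
    """Return exact-match accuracy of call_model over (prompt, expected) pairs.

    call_model is a placeholder for whatever client wraps your chosen
    model; run the same examples against each candidate and compare.
    """
    correct = sum(
        call_model(prompt).strip() == expected.strip()
        for prompt, expected in examples
    )
    return correct / len(examples)
```

Running the same 50–200 examples against each candidate model gives you a like-for-like comparison that no published leaderboard can.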

Decision Matrix: Quick Reference

Need → start with; if insufficient, escalate:

  • Lowest cost, simple tasks → Claude Haiku / Gemini Flash; if insufficient, Claude Sonnet.
  • Best general quality → Claude Sonnet or GPT-4o; if insufficient, Claude Opus or GPT-5.
  • Hard reasoning / maths → o4-mini at medium effort; if insufficient, o3 at high effort.
  • Very long documents → Gemini 2.5 Pro; if insufficient, RAG + Flagship.
  • Data must stay on-prem → Llama 3.1 70B via Ollama; if insufficient, DeepSeek-V3 or a fine-tuned Llama.
  • Lowest latency → Gemini Flash / Haiku; if insufficient, a local 7B model.
  • Best open-weight reasoning → DeepSeek-R1 API; if insufficient, DeepSeek-R1 32B local.

Checklist: Do You Understand This?

  • What are the four model tiers and which type of task fits each?
  • When should you use a 1M token context window model vs RAG?
  • How do cached input tokens affect your cost modelling?
  • Name two benchmarks relevant to reasoning model selection and two relevant to general model selection.
  • What is the most reliable way to know which model is best for your specific application?
  • If your data cannot leave your infrastructure, what are your two main options?