Model Selection Cheat Sheet

A task-by-task guide to choosing the right model tier. Use this as a starting point — always validate with your specific prompts and data. The “default first try” recommendation is the most cost-efficient model that handles the task well in most cases. Escalate to the premium tier only when your evaluation shows it's needed.
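The "start cheap, escalate on evidence" rule above can be sketched as a small selection loop. This is a minimal illustration, not a prescribed method: `pick_tier`, the `evaluate` callable, and the example scores are all hypothetical, and `evaluate` stands in for whatever evaluation you run on your own prompts and data.

```python
from typing import Callable, Optional, Sequence

# Tiers ordered from cheapest to most expensive, as named in this guide.
TIER_ORDER = ("T3", "T2", "T1")


def pick_tier(
    evaluate: Callable[[str], float],
    threshold: float,
    tiers: Sequence[str] = TIER_ORDER,
) -> Optional[str]:
    """Return the cheapest tier whose evaluation score meets the threshold.

    `evaluate` is a hypothetical callable: it runs your own test set against
    a model from the given tier and returns a score between 0 and 1.
    """
    for tier in tiers:
        if evaluate(tier) >= threshold:
            return tier
    return None  # no tier passed; revisit prompts, data, or the threshold


# Illustrative scores only, not measurements.
example_scores = {"T3": 0.82, "T2": 0.93, "T1": 0.96}
chosen = pick_tier(lambda tier: example_scores[tier], threshold=0.90)
print(chosen)  # T2
```

Because the loop stops at the first passing tier, the more expensive tiers are never evaluated unless the cheaper ones fall short.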

Tiers (prices are per 1M output tokens): T1 = Frontier ($2–$25) · T2 = Capable ($1–$15) · T3 = Efficient ($0.10–$5) · T4 = Local/On-device
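To make the price ranges concrete, here is a rough sketch that turns them into a cost range for a batch of requests. The ranges are copied from the tier line above. The workload (10,000 requests at 500 output tokens each) and the function name are assumptions for illustration. Input-token costs are ignored, and T4 is left out because the guide gives no per-token price for local models.

```python
from typing import Tuple

# Output price ranges in USD per 1M output tokens, copied from the tier line above.
TIER_OUTPUT_PRICE = {
    "T1": (2.00, 25.00),
    "T2": (1.00, 15.00),
    "T3": (0.10, 5.00),
}


def output_cost_range(tier: str, output_tokens: int) -> Tuple[float, float]:
    """Return the (low, high) output cost in USD for a token volume at a tier."""
    low, high = TIER_OUTPUT_PRICE[tier]
    millions = output_tokens / 1_000_000
    return (low * millions, high * millions)


# Hypothetical workload: 10,000 requests at 500 output tokens each.
total_tokens = 10_000 * 500  # 5,000,000 output tokens
for tier in ("T3", "T2", "T1"):
    low, high = output_cost_range(tier, total_tokens)
    print(f"{tier}: ${low:.2f} to ${high:.2f}")
```

The ranges overlap, so the tier label alone does not fix the cost. Check the actual price of the specific model you pick.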

Text Processing Tasks

Task	Default first try	Premium tier	Notes
Summarization (short docs)	T3 — Haiku / Flash-Lite / GPT-4o-mini	T2 if nuance required	T3 handles 95% of summaries well; escalate for legal or medical docs where precision is critical
Summarization (long docs, 50k+ tokens)	T2 — Sonnet 4.6 / Flash	T1 for highest fidelity	Long-context tasks benefit from stronger models; check context window limits for your model tier
Email/message drafting	T3 — Haiku / GPT-4o-mini	T2 for high-stakes comms	Most email drafts don't need frontier capability; T3 is plenty
Translation	T3 — Flash-Lite / GPT-4o-mini	T2 for rare languages or high formality	Major languages (EN/ES/FR/DE/ZH/JA) translate well at T3. Minor languages may need T2.
Classification / intent detection	T3 — Haiku / Flash-Lite	T3 almost always sufficient	Classification is one of the clearest cases where T3 matches T1 quality. Fine-tuning T3 makes it even better.
Named entity extraction	T3 — Haiku / GPT-4o-mini	T2 for complex nested schemas	Structured output mode + T3 = reliable extraction at low cost
Sentiment analysis	T3 — any efficient model	T3 always sufficient	Even T4 local models handle sentiment well. No need for T1 or T2.

Coding Tasks

Task	Default first try	Premium tier	Notes
Inline code completion	T3 or T4 — fast local model	T3 for cloud	Latency is critical; use local models (Phi-4-mini, Codestral) or T3 fast models (Haiku)
Code explanation / review	T3 — Haiku / Flash	T2 for large codebases	T3 explains and reviews code well. Upgrade to T2 for complex multi-file architecture review.
Feature implementation (chat)	T2 — Sonnet / GPT-4.1 / Flash	T1 for hardest tasks	Multi-file feature work benefits from T2 capability. T1 for complex algorithmic challenges.
Debugging complex errors	T2 — Sonnet / Flash	T1 or reasoning model for hard bugs	Multi-step reasoning helps. For obscure crashes or concurrency bugs, reasoning models (o3) are worth the cost.
Agentic coding (agent mode)	T2 — Sonnet / GPT-4.1	T1 for complex refactors	Agent mode makes many API calls per task. T1 costs compound quickly — T2 is usually sufficient and much cheaper overall.
Unit test generation	T3 — Haiku / Flash-Lite	T2 for edge case depth	Standard unit tests are pattern-heavy; T3 handles them well

RAG & Knowledge Tasks

Task	Default first try	Premium tier	Notes
Simple Q&A over documents	T3 — Haiku / Flash	T2 for multi-hop reasoning	When retrieval is good, even T3 can answer accurately. Quality of chunks matters more than model tier.
Multi-hop RAG (answer requires synthesizing multiple docs)	T2 — Sonnet / Flash	T1 for complex synthesis	Multi-document synthesis benefits from stronger reasoning. T3 may miss connections between sources.
Embedding generation	Dedicated embedding model	—	Use text-embedding-3 (OpenAI) or Gemini embedding. Don't use chat models for embeddings — dedicated models are cheaper and better.
Reranking retrieved chunks	Dedicated reranker (Cohere, cross-encoder)	T3 LLM reranker if needed	Dedicated rerankers are faster and cheaper than asking an LLM to rank. Use LLM reranking only for final top-3 selection on complex tasks.
Groundedness checking (hallucination detection)	T2 — Sonnet / Flash	T1 for high-stakes outputs	Use as a pipeline step to verify generated answers against source documents
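The reranking row above suggests a two-stage shape: a cheap dedicated scorer ranks everything, and an LLM, if used at all, only reorders a short list to pick the final top 3. The sketch below assumes that shape. `two_stage_rerank` and the `Scorer` type are hypothetical, and `word_overlap` is a toy stand-in for a real reranker, not the API of any reranking service.

```python
from typing import Callable, List, Optional

Scorer = Callable[[str, str], float]  # (query, chunk) -> relevance score


def two_stage_rerank(
    query: str,
    chunks: List[str],
    fast_score: Scorer,
    llm_score: Optional[Scorer] = None,
    shortlist_size: int = 10,
    final_size: int = 3,
) -> List[str]:
    """Rank all chunks with a cheap scorer, then optionally refine the shortlist."""
    shortlist = sorted(chunks, key=lambda c: fast_score(query, c), reverse=True)
    shortlist = shortlist[:shortlist_size]
    if llm_score is None:
        return shortlist[:final_size]
    # The costlier scorer only ever sees `shortlist_size` chunks.
    refined = sorted(shortlist, key=lambda c: llm_score(query, c), reverse=True)
    return refined[:final_size]


def word_overlap(query: str, chunk: str) -> float:
    """Toy stand-in for a dedicated reranker: count of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


docs = [
    "refund policy for annual plans",
    "how to reset a password",
    "refund timeline and policy details",
    "office opening hours",
]
print(two_stage_rerank("refund policy", docs, word_overlap, final_size=2))
```

Passing `llm_score=None` keeps the pipeline on the dedicated scorer alone, which matches the default recommendation in the table.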

Agentic Tasks

Task	Default first try	Notes
Tool routing / intent triage	T3 — Haiku / Flash-Lite	First step in any agent pipeline — use the cheapest model to decide which tool to call
Tool execution (structured outputs)	T2 — Sonnet / GPT-4.1	T2 is reliable for structured JSON tool calls. T3 can work but has higher error rates on complex schemas.
Multi-step autonomous task	T2 — Sonnet / GPT-4.1	T1 costs explode across many agent steps. T2 is the right balance for production agentic workflows.
Final answer synthesis (planner/reviewer)	T1 — Opus / o3	Use T1 only for the final synthesis/review step in a multi-model pipeline, not for every step
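One way to read the table above is as a per-stage tier assignment inside a single agent task. The sketch below counts how many calls land on each tier under an assumed loop shape: one routing call and one execution call per tool step, then one synthesis call at the end. The stage names and the loop shape are hypothetical. Real agent frameworks structure their loops differently.

```python
from collections import Counter

# Stage-to-tier assignment taken from the table above.
STAGE_TIER = {
    "route": "T3",       # tool routing / intent triage
    "execute": "T2",     # structured tool calls
    "synthesize": "T1",  # final answer synthesis / review
}


def calls_per_tier(num_tool_steps: int) -> Counter:
    """Count model calls per tier for one task with `num_tool_steps` tool steps.

    Assumed loop shape (hypothetical): every tool step is one routing call plus
    one execution call, and the task ends with a single synthesis call.
    """
    calls = Counter()
    for _ in range(num_tool_steps):
        calls[STAGE_TIER["route"]] += 1
        calls[STAGE_TIER["execute"]] += 1
    calls[STAGE_TIER["synthesize"]] += 1
    return calls


print(dict(calls_per_tier(12)))  # {'T3': 12, 'T2': 12, 'T1': 1}
```

With this split, the T1 call count stays at one per task no matter how many tool steps the agent takes, which is why reserving T1 for the final step keeps costs from compounding.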

Creative & Strategic Tasks

Task	Default first try	Notes
Long-form content (blogs, reports)	T2 — Sonnet / GPT-4.1	Quality shows in long-form writing. T2 is the sweet spot — meaningfully better than T3 without T1 cost.
Short-form copy (social, ads)	T3 — Haiku / Flash	Short creative tasks don't need frontier models. You can often generate 10 variants with T3 for less than the cost of 1 variant with T1.
Strategic analysis / research	T1 — Opus / o3	Deep strategic reasoning is one area where frontier capability genuinely shows. Worth the cost for high-stakes decisions.
Math and science problems	T1 reasoning — o3 / o4-mini	Reasoning models dramatically outperform standard models on hard math. For simple calculations, T3 is fine.
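The short-form copy note compares 10 T3 variants against 1 T1 variant. The arithmetic below is a quick check of when that comparison holds, assuming all variants have the same output length and counting output tokens only. The function names, the 200-token length, and the specific prices are illustrative assumptions. The prices are chosen from within the tier ranges given in this guide.

```python
def variants_cost(n_variants: int, tokens_per_variant: int, price_per_million: float) -> float:
    """Output cost in USD for generating n variants of a given length."""
    return n_variants * tokens_per_variant * price_per_million / 1_000_000


def ten_t3_cheaper_than_one_t1(t3_price: float, t1_price: float, tokens: int = 200) -> bool:
    """True when 10 T3 variants cost less than 1 T1 variant of the same length."""
    return variants_cost(10, tokens, t3_price) < variants_cost(1, tokens, t1_price)


# Prices are USD per 1M output tokens, picked from within the tier ranges in this guide.
print(ten_t3_cheaper_than_one_t1(t3_price=0.40, t1_price=25.00))  # True
print(ten_t3_cheaper_than_one_t1(t3_price=5.00, t1_price=25.00))  # False
```

The variant length cancels out, so the comparison holds whenever the T1 price is more than 10 times the T3 price. At the top of the T3 range it does not, so it is worth checking against the actual models you are comparing.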

Multimodal Tasks

Task	Default first try	Notes
OCR / document image extraction	T2 with vision — Sonnet / Flash	Vision models at T2 handle most document OCR well. T3 with vision also viable for simple layouts.
Image captioning / description	T3 — Flash / Haiku	Standard image descriptions don't need frontier models
Visual Q&A (complex scene understanding)	T2 — Flash / Sonnet	T1 for specialized domains (medical imaging, technical schematics)
Speech-to-text	Dedicated STT (Whisper, Deepgram)	Use specialized speech models — not LLMs. Much cheaper and better for transcription.
Audio + text multimodal	T2 or dedicated (Phi-4-multimodal)	Phi-4-multimodal is a strong on-device option for voice + text on edge hardware

Quick Reference Card

Always start with T3:

Classification, labeling, intent detection
Sentiment analysis
Summarization (short docs)
Translation (major languages)
Entity extraction (simple schemas)
Email drafting
Short creative copy
Unit test generation

Use T2 by default:

Feature coding, debugging
RAG Q&A with reasoning
Document analysis (complex)
Multi-step tool calling
Long-form content writing
Customer support agents
Visual document extraction

Only use T1 when needed:

Complex strategic analysis
Hard math / science problems
Multi-doc synthesis requiring deep reasoning
Final QA / evaluation step in pipelines
Long-horizon agentic tasks (final reviewer)
Creative work where T2 quality is visibly insufficient

Use dedicated models, not LLMs:

Embeddings → text-embedding-3, Gemini embedding
Reranking → Cohere Rerank, cross-encoders
Speech-to-text → Whisper, Deepgram
Text-to-speech → ElevenLabs, Azure Speech
Image generation → DALL-E 3, Stable Diffusion
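If you want the card in code, a plain lookup table is probably enough. The sketch below transcribes the four lists above into a dictionary. The short task keys and the `default_tier` helper are hypothetical labels, not part of any library. Entries marked T1 are the "only when needed" cases, so treat them as escalation targets rather than automatic choices.

```python
from typing import Optional

# Defaults transcribed from the Quick Reference Card; the short keys are hypothetical labels.
DEFAULT_TIER = {
    # Always start with T3
    "classification": "T3",
    "sentiment_analysis": "T3",
    "short_summarization": "T3",
    "translation_major_languages": "T3",
    "entity_extraction_simple": "T3",
    "email_drafting": "T3",
    "short_creative_copy": "T3",
    "unit_test_generation": "T3",
    # Use T2 by default
    "feature_coding": "T2",
    "debugging": "T2",
    "rag_qa_with_reasoning": "T2",
    "complex_document_analysis": "T2",
    "multi_step_tool_calling": "T2",
    "long_form_writing": "T2",
    "customer_support_agent": "T2",
    "visual_document_extraction": "T2",
    # Only use T1 when needed
    "strategic_analysis": "T1",
    "hard_math_science": "T1",
    "deep_multi_doc_synthesis": "T1",
    "final_pipeline_qa": "T1",
    # Dedicated models, not LLMs
    "embeddings": "dedicated",
    "reranking": "dedicated",
    "speech_to_text": "dedicated",
    "text_to_speech": "dedicated",
    "image_generation": "dedicated",
}


def default_tier(task: str) -> Optional[str]:
    """Return the card's starting tier for a task, or None if the card does not list it."""
    return DEFAULT_TIER.get(task)


print(default_tier("classification"))
print(default_tier("embeddings"))
print(default_tier("video_editing"))
```

Returning `None` for unlisted tasks is a deliberate choice in this sketch: it forces an explicit decision instead of silently falling back to a tier the card never recommended.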

Checklist: Do You Understand This?

T3 first for: classification, summarization, translation, entity extraction, email drafts, short copy
T2 by default for: coding, RAG Q&A, customer support agents, long-form content, visual document extraction
T1 only when needed: hard reasoning, complex strategic analysis, math, final QA in pipelines
Reasoning models (o3) for: hard math, science problems, complex multi-step logic — not for interactive tasks
Use dedicated models for: embeddings, reranking, STT/TTS, image generation — not general LLMs