Intermediate
Model Selection Cheat Sheet
A task-by-task guide to choosing the right model tier. Use this as a starting point; always validate with your specific prompts and data. The "default first try" recommendation is the most cost-efficient model that handles the task well in most cases. Escalate to the premium tier only when your evaluation shows it's needed.
Tiers (price per 1M output tokens): T1 = Frontier ($2-$25) · T2 = Capable ($1-$15) · T3 = Efficient ($0.10-$5) · T4 = Local/On-device
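To make the tier legend concrete, here is a minimal sketch that turns the output-price ranges above into a cost range for a given volume of output tokens. The function name `output_cost_range` is an illustrative assumption. The sketch ignores input-token costs, which the legend does not list, and omits T4 because local models have no per-token price.

```python
from typing import Tuple

# USD per 1M output tokens, copied from the tier legend above.
# T4 (local/on-device) is omitted: it has no per-token price.
TIER_PRICE_RANGE = {
    "T1": (2.00, 25.00),
    "T2": (1.00, 15.00),
    "T3": (0.10, 5.00),
}


def output_cost_range(tier: str, output_tokens: int) -> Tuple[float, float]:
    """Return the (low, high) output cost in USD for a tier and token count."""
    low, high = TIER_PRICE_RANGE[tier]
    scale = output_tokens / 1_000_000
    return (low * scale, high * scale)


if __name__ == "__main__":
    for tier in TIER_PRICE_RANGE:
        low, high = output_cost_range(tier, 500_000)
        print(f"{tier}: ${low:.2f} to ${high:.2f} for 500k output tokens")
```

The ranges overlap, so the specific model's price matters more than the tier label when you estimate a budget.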
Text Processing Tasks
| Task | Default first try | Premium tier | Notes |
|---|---|---|---|
| Summarization (short docs) | T3: Haiku / Flash-Lite / GPT-4o-mini | T2 if nuance required | T3 handles 95% of summaries well; escalate for legal or medical docs where precision is critical |
| Summarization (long docs, 50k+ tokens) | T2: Sonnet 4.6 / Flash | T1 for highest fidelity | Long-context tasks benefit from stronger models; check context window limits for your model tier |
| Email/message drafting | T3: Haiku / GPT-4o-mini | T2 for high-stakes comms | Most email drafts don't need frontier capability; T3 is plenty |
| Translation | T3: Flash-Lite / GPT-4o-mini | T2 for rare languages or high formality | Major languages (EN/ES/FR/DE/ZH/JA) translate well at T3. Minor languages may need T2. |
| Classification / intent detection | T3: Haiku / Flash-Lite | T3 almost always sufficient | Classification is one of the clearest cases where T3 matches T1 quality. Fine-tuning T3 makes it even better. |
| Named entity extraction | T3: Haiku / GPT-4o-mini | T2 for complex nested schemas | Structured output mode + T3 = reliable extraction at low cost |
| Sentiment analysis | T3: any efficient model | T3 always sufficient | Even T4 local models handle sentiment well. No need for T1 or T2. |
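The "default first try, escalate only when evaluation shows it's needed" rule can be sketched as a small selection loop. This is an illustration, not a prescribed method. The names `cheapest_passing_tier`, `stub_t3`, and `stub_t2` are hypothetical, and the stubs stand in for real API calls. They are deliberately crude and say nothing about real model quality.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Model = Callable[[str], str]   # prompt -> predicted label
Example = Tuple[str, str]      # (prompt, expected label)

# Cheapest to most expensive cloud tier.
ESCALATION_ORDER = ["T3", "T2", "T1"]


def accuracy(model: Model, examples: Sequence[Example]) -> float:
    """Fraction of examples where the model output equals the expected label."""
    if not examples:
        raise ValueError("need at least one labeled example")
    hits = sum(1 for prompt, expected in examples if model(prompt) == expected)
    return hits / len(examples)


def cheapest_passing_tier(
    models: Dict[str, Model], examples: Sequence[Example], threshold: float
) -> Optional[str]:
    """Return the first tier in escalation order that meets the threshold."""
    for tier in ESCALATION_ORDER:
        model = models.get(tier)
        if model is not None and accuracy(model, examples) >= threshold:
            return tier
    return None  # no tier met the bar; revisit prompts or data


# Hypothetical stand-ins for real model calls (assumption, for illustration).
def stub_t3(prompt: str) -> str:
    return "billing"


def stub_t2(prompt: str) -> str:
    return "technical" if "error" in prompt or "crash" in prompt else "billing"


EXAMPLES = [
    ("refund my last invoice", "billing"),
    ("app shows an error on login", "technical"),
    ("update my card details", "billing"),
    ("crash when exporting a file", "technical"),
]
STUB_MODELS = {"T3": stub_t3, "T2": stub_t2}

if __name__ == "__main__":
    print(cheapest_passing_tier(STUB_MODELS, EXAMPLES, threshold=0.9))  # T2
```

In practice the examples would be a sample of your own prompts with known-good answers, and the threshold would reflect how costly an error is for the task.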
Coding Tasks
| Task | Default first try | Premium tier | Notes |
|---|---|---|---|
| Inline code completion | T3 or T4: fast local model | T3 for cloud | Latency is critical; use local models (Phi-4-mini, Codestral) or fast T3 models (Haiku) |
| Code explanation / review | T3: Haiku / Flash | T2 for large codebases | T3 explains and reviews code well. Upgrade to T2 for complex multi-file architecture review. |
| Feature implementation (chat) | T2: Sonnet / GPT-4.1 / Flash | T1 for hardest tasks | Multi-file feature work benefits from T2 capability. T1 for complex algorithmic challenges. |
| Debugging complex errors | T2: Sonnet / Flash | T1 or reasoning model for hard bugs | Multi-step reasoning helps. For obscure crashes or concurrency bugs, reasoning models (o3) are worth the cost. |
| Agentic coding (agent mode) | T2: Sonnet / GPT-4.1 | T1 for complex refactors | Agent mode makes many API calls per task. T1 costs compound quickly; T2 is usually sufficient and much cheaper overall. |
| Unit test generation | T3: Haiku / Flash-Lite | T2 for edge case depth | Standard unit tests are pattern-heavy; T3 handles them well |
RAG & Knowledge Tasks
| Task | Default first try | Premium tier | Notes |
|---|---|---|---|
| Simple Q&A over documents | T3: Haiku / Flash | T2 for multi-hop reasoning | When retrieval is good, even T3 can answer accurately. Quality of chunks matters more than model tier. |
| Multi-hop RAG (answer requires synthesizing multiple docs) | T2: Sonnet / Flash | T1 for complex synthesis | Multi-document synthesis benefits from stronger reasoning. T3 may miss connections between sources. |
| Embedding generation | Dedicated embedding model | n/a | Use text-embedding-3 (OpenAI) or Gemini embedding. Don't use chat models for embeddings; dedicated models are cheaper and better. |
| Reranking retrieved chunks | Dedicated reranker (Cohere, cross-encoder) | T3 LLM reranker if needed | Dedicated rerankers are faster and cheaper than asking an LLM to rank. Use LLM reranking only for final top-3 selection on complex tasks. |
| Groundedness checking (hallucination detection) | T2: Sonnet / Flash | T1 for high-stakes outputs | Use as a pipeline step to verify generated answers against source documents |
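The reranking advice in the table (dedicated reranker first, LLM only for the final top-3) can be sketched as a two-stage function. This is a minimal sketch under stated assumptions. `two_stage_rerank` and `word_overlap` are hypothetical names, and `word_overlap` is a toy scorer standing in for a real dedicated reranker such as a cross-encoder. The optional `llm_scorer` would wrap a T3 model call in a real system.

```python
from typing import Callable, List, Optional

Scorer = Callable[[str, str], float]  # (query, chunk) -> relevance score


def two_stage_rerank(
    query: str,
    chunks: List[str],
    fast_scorer: Scorer,
    llm_scorer: Optional[Scorer] = None,
    shortlist_size: int = 10,
    final_k: int = 3,
) -> List[str]:
    """Shortlist with the cheap scorer, then optionally re-order with the LLM."""
    # Stage 1: the dedicated (cheap, fast) reranker scores every chunk.
    shortlist = sorted(
        chunks, key=lambda c: fast_scorer(query, c), reverse=True
    )[:shortlist_size]
    if llm_scorer is None:
        return shortlist[:final_k]
    # Stage 2: the expensive scorer only ever sees the shortlist.
    return sorted(
        shortlist, key=lambda c: llm_scorer(query, c), reverse=True
    )[:final_k]


def word_overlap(query: str, chunk: str) -> float:
    """Toy stand-in for a dedicated reranker: count shared lowercase words."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


CHUNKS = [
    "refund policy for orders",
    "shipping times",
    "refund window is 30 days",
    "office hours",
]

if __name__ == "__main__":
    print(two_stage_rerank("refund policy", CHUNKS, word_overlap, final_k=2))
```

The design point is that the expensive scorer is called at most `shortlist_size` times per query, regardless of how many chunks retrieval returns.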
Agentic Tasks
| Task | Default first try | Notes |
|---|---|---|
| Tool routing / intent triage | T3: Haiku / Flash-Lite | First step in any agent pipeline; use the cheapest model to decide which tool to call |
| Tool execution (structured outputs) | T2: Sonnet / GPT-4.1 | T2 is reliable for structured JSON tool calls. T3 can work but has higher error rates on complex schemas. |
| Multi-step autonomous task | T2: Sonnet / GPT-4.1 | T1 costs explode across many agent steps. T2 is the right balance for production agentic workflows. |
| Final answer synthesis (planner/reviewer) | T1: Opus / o3 | Use T1 only for the final synthesis/review step in a multi-model pipeline, not for every step |
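To see why the table reserves T1 for the final step, the sketch below compares the output cost of a mixed-tier pipeline with the same pipeline run entirely on T1. The prices are the upper ends of the tier ranges in the legend. The stage names, call counts, and token counts are made-up assumptions for illustration, and input-token costs are ignored.

```python
from dataclasses import dataclass
from typing import List, Optional

# USD per 1M output tokens: upper end of each tier range in the legend.
PRICE_PER_1M_OUT = {"T1": 25.00, "T2": 15.00, "T3": 5.00}


@dataclass(frozen=True)
class Stage:
    name: str
    tier: str
    calls: int
    output_tokens_per_call: int


def pipeline_cost(stages: List[Stage], force_tier: Optional[str] = None) -> float:
    """Total output cost in USD; force_tier overrides every stage's tier."""
    total = 0.0
    for stage in stages:
        tier = force_tier or stage.tier
        tokens = stage.calls * stage.output_tokens_per_call
        total += tokens * PRICE_PER_1M_OUT[tier] / 1_000_000
    return total


# Hypothetical agent run: one routing call, 20 tool steps, one final review.
PIPELINE = [
    Stage("route", "T3", calls=1, output_tokens_per_call=50),
    Stage("execute", "T2", calls=20, output_tokens_per_call=500),
    Stage("review", "T1", calls=1, output_tokens_per_call=1000),
]

if __name__ == "__main__":
    mixed = pipeline_cost(PIPELINE)
    all_t1 = pipeline_cost(PIPELINE, force_tier="T1")
    print(f"mixed tiers: ${mixed:.5f}  all T1: ${all_t1:.5f}")
```

The gap widens with more execution steps or with cheaper T2 and T3 models than the upper-bound prices assumed here.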
Creative & Strategic Tasks
| Task | Default first try | Notes |
|---|---|---|
| Long-form content (blogs, reports) | T2: Sonnet / GPT-4.1 | Quality shows in long-form writing. T2 is the sweet spot: meaningfully better than T3 without T1 cost. |
| Short-form copy (social, ads) | T3: Haiku / Flash | Short creative tasks don't need frontier models. At the low end of T3 pricing, ten T3 variants cost less than a single T1 variant. |
| Strategic analysis / research | T1: Opus / o3 | Deep strategic reasoning is one area where frontier capability genuinely shows. Worth the cost for high-stakes decisions. |
| Math and science problems | T1 reasoning: o3 / o4-mini | Reasoning models dramatically outperform standard models on hard math. For simple calculations, T3 is fine. |
Multimodal Tasks
| Task | Default first try | Notes |
|---|---|---|
| OCR / document image extraction | T2 with vision: Sonnet / Flash | Vision models at T2 handle most document OCR well. T3 with vision is also viable for simple layouts. |
| Image captioning / description | T3: Flash / Haiku | Standard image descriptions don't need frontier models |
| Visual Q&A (complex scene understanding) | T2: Flash / Sonnet | T1 for specialized domains (medical imaging, technical schematics) |
| Speech-to-text | Dedicated STT (Whisper, Deepgram) | Use specialized speech models, not LLMs. They are much cheaper and better for transcription. |
| Audio + text multimodal | T2 or dedicated (Phi-4-multimodal) | Phi-4-multimodal is a strong on-device option for voice + text on edge hardware |
Quick Reference Card
Always start with T3:
- Classification, labeling, intent detection
- Sentiment analysis
- Summarization (short docs)
- Translation (major languages)
- Entity extraction (simple schemas)
- Email drafting
- Short creative copy
- Unit test generation
Use T2 by default:
- Feature coding, debugging
- RAG Q&A with reasoning
- Document analysis (complex)
- Multi-step tool calling
- Long-form content writing
- Customer support agents
- Visual document extraction
Only use T1 when needed:
- Complex strategic analysis
- Hard math / science problems
- Multi-doc synthesis requiring deep reasoning
- Final QA / evaluation step in pipelines
- Long-horizon agentic tasks (final reviewer)
- Creative work where T2 quality is visibly insufficient
Use dedicated models, not LLMs:
- Embeddings: text-embedding-3, Gemini embedding
- Reranking: Cohere Rerank, cross-encoders
- Speech-to-text: Whisper, Deepgram
- Text-to-speech: ElevenLabs, Azure Speech
- Image generation: DALL-E 3, Stable Diffusion
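The quick reference card can be encoded as a lookup table so an application picks a default tier from a task label. This is a sketch that covers a subset of the card. The task keys and the `default_tier` function are hypothetical identifiers, and returning `None` for unknown tasks is an assumption, since the card does not say what to do with tasks it doesn't list.

```python
from typing import Optional

# Subset of the quick reference card; keys are illustrative task labels.
DEFAULT_TIER = {
    # Always start with T3
    "classification": "T3",
    "sentiment_analysis": "T3",
    "short_summarization": "T3",
    "translation_major_languages": "T3",
    "email_drafting": "T3",
    "unit_test_generation": "T3",
    # Use T2 by default
    "feature_coding": "T2",
    "debugging": "T2",
    "long_form_content": "T2",
    "customer_support_agent": "T2",
    # Only use T1 when needed
    "strategic_analysis": "T1",
    "hard_math": "T1",
    # Use dedicated models, not LLMs
    "embeddings": "dedicated",
    "reranking": "dedicated",
    "speech_to_text": "dedicated",
}


def default_tier(task: str) -> Optional[str]:
    """Return the card's default tier for a task, or None if it isn't listed."""
    return DEFAULT_TIER.get(task)


if __name__ == "__main__":
    for task in ("classification", "debugging", "embeddings", "poetry"):
        print(task, "->", default_tier(task))
```

The table only supplies a starting tier; whether to escalate is still decided by evaluating on your own prompts and data.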
Checklist: Do You Understand This?
- T3 first for: classification, summarization, translation, entity extraction, email drafts, short copy
- T2 by default for: coding, RAG Q&A, customer support agents, long-form content, visual document extraction
- T1 only when needed: hard reasoning, complex strategic analysis, math, final QA in pipelines
- Reasoning models (o3) for: hard math, science problems, complex multi-step logic; not for interactive tasks
- Use dedicated models for: embeddings, reranking, STT/TTS, image generation β not general LLMs