Intermediate

Model Selection Cheat Sheet

A task-by-task guide to choosing the right model tier. Use this as a starting point, and always validate with your specific prompts and data. The "default first try" recommendation is the most cost-efficient model that handles the task well in most cases. Escalate to the premium tier only when your evaluation shows it's needed.

Tiers (prices per 1M output tokens): T1 = Frontier ($2–$25) · T2 = Capable ($1–$15) · T3 = Efficient ($0.10–$5) · T4 = Local/On-device
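
To make the tier ranges concrete, here is a minimal sketch that converts an output-token count into a cost range per tier. It assumes the dollar figures above are prices per 1M output tokens and ignores input-token charges. The names `TIER_OUTPUT_PRICE_USD` and `output_cost_range` are illustrative, and T4 is omitted because local models have no per-token price.

```python
from typing import Tuple

# Output price ranges (USD per 1M output tokens), copied from the tier line above.
# Assumption: input-token costs are ignored; T4 (local) has no per-token price.
TIER_OUTPUT_PRICE_USD = {
    "T1": (2.00, 25.00),
    "T2": (1.00, 15.00),
    "T3": (0.10, 5.00),
}


def output_cost_range(tier: str, output_tokens: int) -> Tuple[float, float]:
    """Return the (low, high) cost in USD for generating `output_tokens` tokens."""
    low, high = TIER_OUTPUT_PRICE_USD[tier]
    millions = output_tokens / 1_000_000
    return (low * millions, high * millions)


if __name__ == "__main__":
    for tier in TIER_OUTPUT_PRICE_USD:
        low, high = output_cost_range(tier, 500_000)
        print(f"{tier}: ${low:.2f} to ${high:.2f} for 500k output tokens")
```

The ranges overlap considerably, so the real saving from dropping a tier depends on the specific models being compared rather than on the tier label alone.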

Text Processing Tasks

| Task | Default first try | Premium tier | Notes |
| --- | --- | --- | --- |
| Summarization (short docs) | T3: Haiku / Flash-Lite / GPT-4o-mini | T2 if nuance required | T3 handles 95% of summaries well; escalate for legal or medical docs where precision is critical |
| Summarization (long docs, 50k+ tokens) | T2: Sonnet 4.6 / Flash | T1 for highest fidelity | Long-context tasks benefit from stronger models; check context window limits for your model tier |
| Email/message drafting | T3: Haiku / GPT-4o-mini | T2 for high-stakes comms | Most email drafts don't need frontier capability; T3 is plenty |
| Translation | T3: Flash-Lite / GPT-4o-mini | T2 for rare languages or high formality | Major languages (EN/ES/FR/DE/ZH/JA) translate well at T3. Minor languages may need T2. |
| Classification / intent detection | T3: Haiku / Flash-Lite | T3 almost always sufficient | Classification is one of the clearest cases where T3 matches T1 quality. Fine-tuning T3 makes it even better. |
| Named entity extraction | T3: Haiku / GPT-4o-mini | T2 for complex nested schemas | Structured output mode + T3 = reliable extraction at low cost |
| Sentiment analysis | T3: any efficient model | T3 always sufficient | Even T4 local models handle sentiment well. No need for T1 or T2. |
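
The "default first try, then escalate when evaluation shows it's needed" approach can be sketched as a loop over tiers from cheapest to most capable. This is only an illustration: `score_for_tier` is a hypothetical hook standing in for your own evaluation harness, and the scores in the demo are made up.

```python
from typing import Callable, Sequence

# Cheapest tier first; escalation walks this sequence left to right.
ESCALATION_ORDER = ("T3", "T2", "T1")


def pick_tier(
    score_for_tier: Callable[[str], float],
    threshold: float,
    tiers: Sequence[str] = ESCALATION_ORDER,
) -> str:
    """Return the cheapest tier whose evaluation score meets `threshold`.

    `score_for_tier` is a hypothetical hook: it should run your own eval set
    against a model from the given tier and return a score in [0, 1].
    Falls back to the last (most capable) tier if none meets the threshold.
    """
    for tier in tiers:
        if score_for_tier(tier) >= threshold:
            return tier
    return tiers[-1]


if __name__ == "__main__":
    # Made-up scores purely for illustration; replace with real eval results.
    example_scores = {"T3": 0.81, "T2": 0.93, "T1": 0.96}
    print(pick_tier(example_scores.__getitem__, threshold=0.90))
```

The threshold is the design decision that matters here: set it from the quality your task actually needs, not from the best score any tier can reach.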

Coding Tasks

| Task | Default first try | Premium tier | Notes |
| --- | --- | --- | --- |
| Inline code completion | T3 or T4: fast local model | T3 for cloud | Latency is critical; use local models (Phi-4-mini, Codestral) or T3 fast models (Haiku) |
| Code explanation / review | T3: Haiku / Flash | T2 for large codebases | T3 explains and reviews code well. Upgrade to T2 for complex multi-file architecture review. |
| Feature implementation (chat) | T2: Sonnet / GPT-4.1 / Flash | T1 for hardest tasks | Multi-file feature work benefits from T2 capability. T1 for complex algorithmic challenges. |
| Debugging complex errors | T2: Sonnet / Flash | T1 or reasoning model for hard bugs | Multi-step reasoning helps. For obscure crashes or concurrency bugs, reasoning models (o3) are worth the cost. |
| Agentic coding (agent mode) | T2: Sonnet / GPT-4.1 | T1 for complex refactors | Agent mode makes many API calls per task. T1 costs compound quickly; T2 is usually sufficient and much cheaper overall. |
| Unit test generation | T3: Haiku / Flash-Lite | T2 for edge case depth | Standard unit tests are pattern-heavy; T3 handles them well |

RAG & Knowledge Tasks

| Task | Default first try | Premium tier | Notes |
| --- | --- | --- | --- |
| Simple Q&A over documents | T3: Haiku / Flash | T2 for multi-hop reasoning | When retrieval is good, even T3 can answer accurately. Quality of chunks matters more than model tier. |
| Multi-hop RAG (answer requires synthesizing multiple docs) | T2: Sonnet / Flash | T1 for complex synthesis | Multi-document synthesis benefits from stronger reasoning. T3 may miss connections between sources. |
| Embedding generation | Dedicated embedding model | n/a | Use text-embedding-3 (OpenAI) or Gemini embedding. Don't use chat models for embeddings; dedicated models are cheaper and better. |
| Reranking retrieved chunks | Dedicated reranker (Cohere, cross-encoder) | T3 LLM reranker if needed | Dedicated rerankers are faster and cheaper than asking an LLM to rank. Use LLM reranking only for final top-3 selection on complex tasks. |
| Groundedness checking (hallucination detection) | T2: Sonnet / Flash | T1 for high-stakes outputs | Use as a pipeline step to verify generated answers against source documents |
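
One way to picture how these stages fit together is a pipeline that retrieves, reranks, generates, and then verifies the draft against its sources. The sketch below is provider-agnostic: every stage is passed in as a callable, and all four callables (`retrieve`, `rerank`, `generate`, `is_grounded`) are hypothetical placeholders rather than real library APIs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RagAnswer:
    text: str
    grounded: bool
    sources: List[str]


def answer_with_groundedness_check(
    question: str,
    retrieve: Callable[[str], List[str]],           # e.g. vector search over embeddings
    rerank: Callable[[str, List[str]], List[str]],  # e.g. a dedicated reranker
    generate: Callable[[str, List[str]], str],      # e.g. a T3 or T2 chat model
    is_grounded: Callable[[str, List[str]], bool],  # e.g. a T2 verifier model
    top_k: int = 3,
) -> RagAnswer:
    """Run retrieve -> rerank -> generate -> verify and report the outcome."""
    candidates = retrieve(question)
    chunks = rerank(question, candidates)[:top_k]
    draft = generate(question, chunks)
    return RagAnswer(text=draft, grounded=is_grounded(draft, chunks), sources=chunks)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any external service.
    corpus = ["refunds take 5 days", "shipping is free", "support is 24/7"]
    result = answer_with_groundedness_check(
        "How long do refunds take?",
        retrieve=lambda q: corpus,
        rerank=lambda q, docs: sorted(docs, key=lambda d: "refund" not in d),
        generate=lambda q, docs: docs[0],
        is_grounded=lambda draft, docs: draft in docs,
        top_k=2,
    )
    print(result)
```

Keeping the stages as separate callables makes it straightforward to assign a different model tier, or a dedicated non-LLM model, to each one.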

Agentic Tasks

| Task | Default first try | Notes |
| --- | --- | --- |
| Tool routing / intent triage | T3: Haiku / Flash-Lite | First step in any agent pipeline: use the cheapest model to decide which tool to call |
| Tool execution (structured outputs) | T2: Sonnet / GPT-4.1 | T2 is reliable for structured JSON tool calls. T3 can work but has higher error rates on complex schemas. |
| Multi-step autonomous task | T2: Sonnet / GPT-4.1 | T1 costs explode across many agent steps. T2 is the right balance for production agentic workflows. |
| Final answer synthesis (planner/reviewer) | T1: Opus / o3 | Use T1 only for the final synthesis/review step in a multi-model pipeline, not for every step |
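
The per-step assignment above can be compared against running every step on T1 with a rough calculation. The sketch below assumes each tier is priced at the upper bound of its range from the tier definitions and that every step emits the same number of output tokens. Both are simplifications, and the step names are illustrative.

```python
from typing import List, Optional

# Step-to-tier assignment taken from the table above.
STEP_TIER = {
    "route": "T3",       # tool routing / intent triage
    "execute": "T2",     # structured tool calls
    "synthesize": "T1",  # final answer synthesis / review
}

# Assumption: each tier priced at the upper bound of its range
# (USD per 1M output tokens); real model prices will differ.
ASSUMED_PRICE_USD = {"T1": 25.00, "T2": 15.00, "T3": 5.00}


def pipeline_cost(
    steps: List[str], tokens_per_step: int, force_tier: Optional[str] = None
) -> float:
    """Output-token cost of running `steps`, optionally forcing one tier for all."""
    total = 0.0
    for step in steps:
        tier = force_tier or STEP_TIER[step]
        total += ASSUMED_PRICE_USD[tier] * tokens_per_step / 1_000_000
    return total


if __name__ == "__main__":
    # One routing call, eight tool executions, one final synthesis.
    run = ["route"] + ["execute"] * 8 + ["synthesize"]
    print(f"mixed tiers: ${pipeline_cost(run, 1_000):.4f}")
    print(f"all T1:      ${pipeline_cost(run, 1_000, force_tier='T1'):.4f}")
```

Under these assumptions the gap widens as the number of execution steps grows, because those steps are the ones moved off T1.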

Creative & Strategic Tasks

| Task | Default first try | Notes |
| --- | --- | --- |
| Long-form content (blogs, reports) | T2: Sonnet / GPT-4.1 | Quality shows in long-form writing. T2 is the sweet spot: meaningfully better than T3 without T1 cost. |
| Short-form copy (social, ads) | T3: Haiku / Flash | Short creative tasks don't need frontier models. Generate 10 variants with T3 for less than 1 variant with T1. |
| Strategic analysis / research | T1: Opus / o3 | Deep strategic reasoning is one area where frontier capability genuinely shows. Worth the cost for high-stakes decisions. |
| Math and science problems | T1 reasoning: o3 / o4-mini | Reasoning models dramatically outperform standard models on hard math. For simple calculations, T3 is fine. |
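
The "10 variants with T3 for less than 1 variant with T1" rule of thumb can be checked against the tier price ranges. The sketch below assumes every variant has the same output length and compares output-token prices only. The function name is illustrative.

```python
# Output price ranges in USD per 1M output tokens, from the tier definitions.
T3_RANGE = (0.10, 5.00)
T1_RANGE = (2.00, 25.00)


def t3_variants_cheaper(t3_price: float, t1_price: float, variants: int = 10) -> bool:
    """True if `variants` same-length T3 outputs cost less than one T1 output."""
    return variants * t3_price < t1_price


if __name__ == "__main__":
    # Cheapest T3 model against cheapest T1 model.
    print(t3_variants_cheaper(T3_RANGE[0], T1_RANGE[0]))
    # Most expensive T3 model against most expensive T1 model.
    print(t3_variants_cheaper(T3_RANGE[1], T1_RANGE[1]))
```

The rule holds only when the T1 price is more than ten times the T3 price, so it is worth checking with the actual prices of the two models you are comparing.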

Multimodal Tasks

| Task | Default first try | Notes |
| --- | --- | --- |
| OCR / document image extraction | T2 with vision: Sonnet / Flash | Vision models at T2 handle most document OCR well. T3 with vision also viable for simple layouts. |
| Image captioning / description | T3: Flash / Haiku | Standard image descriptions don't need frontier models |
| Visual Q&A (complex scene understanding) | T2: Flash / Sonnet | T1 for specialized domains (medical imaging, technical schematics) |
| Speech-to-text | Dedicated STT (Whisper, Deepgram) | Use specialized speech models, not LLMs. Much cheaper and better for transcription. |
| Audio + text multimodal | T2 or dedicated (Phi-4-multimodal) | Phi-4-multimodal is a strong on-device option for voice + text on edge hardware |

Quick Reference Card

Always start with T3:

  • Classification, labeling, intent detection
  • Sentiment analysis
  • Summarization (short docs)
  • Translation (major languages)
  • Entity extraction (simple schemas)
  • Email drafting
  • Short creative copy
  • Unit test generation

Use T2 by default:

  • Feature coding, debugging
  • RAG Q&A with reasoning
  • Document analysis (complex)
  • Multi-step tool calling
  • Long-form content writing
  • Customer support agents
  • Visual document extraction

Only use T1 when needed:

  • Complex strategic analysis
  • Hard math / science problems
  • Multi-doc synthesis requiring deep reasoning
  • Final QA / evaluation step in pipelines
  • Long-horizon agentic tasks (final reviewer)
  • Creative work where T2 quality is visibly insufficient

Use dedicated models, not LLMs:

  • Embeddings → text-embedding-3, Gemini embedding
  • Reranking → Cohere Rerank, cross-encoders
  • Speech-to-text → Whisper, Deepgram
  • Text-to-speech → ElevenLabs, Azure Speech
  • Image generation → DALL-E 3, Stable Diffusion
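
If you want the reference card in code, one option is a plain lookup table that a router can consult before falling back to a default. The task keys below are illustrative labels rather than an official taxonomy, and the `T2` fallback for unknown tasks is an assumption, not a recommendation from this guide.

```python
# Starting tier per task, transcribed from the quick reference card above.
# Keys are illustrative labels; "dedicated" means a specialized non-LLM model.
DEFAULT_TIER = {
    "classification": "T3",
    "sentiment_analysis": "T3",
    "short_summarization": "T3",
    "translation_major_languages": "T3",
    "entity_extraction_simple": "T3",
    "email_drafting": "T3",
    "short_copy": "T3",
    "unit_tests": "T3",
    "feature_coding": "T2",
    "debugging": "T2",
    "rag_qa_with_reasoning": "T2",
    "multi_step_tool_calling": "T2",
    "long_form_content": "T2",
    "customer_support_agent": "T2",
    "visual_document_extraction": "T2",
    "strategic_analysis": "T1",
    "hard_math_science": "T1",
    "multi_doc_synthesis": "T1",
    "final_pipeline_qa": "T1",
    "embeddings": "dedicated",
    "reranking": "dedicated",
    "speech_to_text": "dedicated",
    "text_to_speech": "dedicated",
    "image_generation": "dedicated",
}


def starting_tier(task: str, fallback: str = "T2") -> str:
    """Look up the suggested starting tier; unknown tasks get `fallback` (assumed)."""
    return DEFAULT_TIER.get(task, fallback)


if __name__ == "__main__":
    for task in ("classification", "debugging", "embeddings", "something_new"):
        print(task, "->", starting_tier(task))
```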

Checklist: Do You Understand This?

  • T3 first for: classification, summarization, translation, entity extraction, email drafts, short copy
  • T2 by default for: coding, RAG Q&A, customer support agents, long-form content, visual document extraction
  • T1 only when needed: hard reasoning, complex strategic analysis, math, final QA in pipelines
  • Reasoning models (o3) for: hard math, science problems, complex multi-step logic; not for interactive tasks
  • Use dedicated models for: embeddings, reranking, STT/TTS, image generation β€” not general LLMs

Page built: 01 Jun 2026