Multi-Model Routing
Not every request needs the most capable model. Routing requests to the right model for the task — based on complexity, latency requirements, cost, and provider availability — is one of the highest-leverage architectural decisions in an enterprise AI system. Done well, routing cuts costs by 60-80% with no user-visible quality degradation.
Routing Dimensions
| Dimension | What it controls | Example routing rule |
|---|---|---|
| Task complexity | Simple → cheap fast model; complex → capable expensive model | Classify as "simple" → Haiku; "complex" → Sonnet; "reasoning" → Opus |
| Latency SLA | User-facing interactive → low latency model; async batch → higher latency acceptable | Interactive chat → GPT-4o-mini (fast); overnight batch → GPT-4o (thorough) |
| Capability match | Route to model with the strongest capability for the task type | Code tasks → Claude Sonnet (strong coding); vision → GPT-4o (multimodal) |
| Cost budget | When daily budget approaches ceiling, route to cheaper model | If daily spend > 80% of budget → downgrade all non-critical requests to cheaper tier |
| Provider availability | If primary provider is down or rate-limited, route to secondary | Anthropic 5xx → failover to OpenAI equivalent; Anthropic 429 → failover to Bedrock |
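The cost-budget row in the table above is the easiest dimension to express in code. A minimal sketch of the downgrade rule, assuming a hypothetical `route_for_budget` helper and using the model names from this document as tier labels:

```python
def route_for_budget(daily_spend: float, daily_budget: float, critical: bool) -> str:
    """Budget-ceiling rule: once spend passes 80% of the daily budget,
    downgrade non-critical traffic to the cheap tier.
    Function name and tier mapping are illustrative."""
    if not critical and daily_spend > 0.8 * daily_budget:
        return "claude-haiku-4-5"   # cheap tier
    return "claude-sonnet-4-6"      # standard tier
```

In practice the spend counter would come from your metering system, and "critical" from a request-level tag or feature flag.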
Classification-Based Routing
A small, fast classifier decides which model handles each request before the request reaches the LLM. The classifier itself should be cheap — a rule-based heuristic, a small embedding classifier, or a sub-100ms LLM call.
```python
def classify_and_route(request: LLMRequest) -> str:
    # Rule-based fast path — no ML classifier overhead
    if request.token_count < 500 and "summarise" in request.system_prompt:
        return "claude-haiku-4-5"  # fast + cheap for simple summarisation
    if request.requires_code_execution or request.context_length > 50_000:
        return "claude-sonnet-4-6"  # capable model for complex tasks
    if request.use_case == "financial_analysis" and request.requires_reasoning:
        return "claude-opus-4-6"  # highest capability for high-stakes reasoning
    # Default: mid-tier model for everything else
    return "claude-sonnet-4-6"
```
Rule-based classification (recommended first)
- No latency overhead — runs in microseconds
- Deterministic and auditable — easy to debug misrouting
- Based on: token count, use case tag, system prompt pattern, feature flags
- Start here before adding ML classifiers — simpler rules catch 80% of cases
ML classifier (when rules are insufficient)
- Embed the query; classify using a lightweight model (BERT-class)
- Adds 10-50ms latency for the classification step
- Useful when task complexity cannot be determined from metadata alone
- Requires labelled training data; maintain eval set to catch classifier drift
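One lightweight shape for such a classifier is nearest-centroid over embeddings: compute a mean embedding per class offline from labelled examples, then assign each query to the closest centroid. A sketch with toy 3-d vectors standing in for real embedding-model output (the centroid values and class names are illustrative):

```python
import math

# Per-class mean embeddings, computed offline from labelled examples.
# Real embeddings would come from your embedding provider; these toy
# vectors just make the mechanics visible.
CENTROIDS = {
    "simple": [1.0, 0.0, 0.0],
    "complex": [0.0, 1.0, 0.0],
}

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def classify_query(query_embedding: list[float]) -> str:
    """Assign the class whose mean embedding is most similar to the query."""
    return max(CENTROIDS, key=lambda c: _cosine(query_embedding, CENTROIDS[c]))
```

The embedding call itself is what contributes most of the 10-50ms overhead; the centroid comparison is negligible.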
Cascade Pattern
The cascade pattern tries the fast cheap model first. If the response does not meet a quality threshold, it falls back to the capable expensive model. The quality check must be fast — a heuristic or small classifier, not another LLM call.
| Stage | Action | Quality gate |
|---|---|---|
| Stage 1 | Send to fast cheap model (e.g., Haiku / GPT-4o-mini) | Check: response length reasonable? No refusal? Confidence score above threshold? |
| Stage 2 (if gate fails) | Send to capable model (e.g., Sonnet / GPT-4o) | Serve this response; log that fallback was triggered |
Cascade adds latency on fallback
When the Stage 1 model fails the quality gate, the total latency is Stage 1 time + quality gate time + Stage 2 time. This is higher than going directly to Stage 2. Cascade is worth it when the majority of requests pass Stage 1 — if fallback rate exceeds 40%, route directly to Stage 2 instead.
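The two-stage flow above fits in a few lines. A minimal sketch, with the model names, `call_model` callable, and `quality_gate` predicate all standing in for your own implementations:

```python
def cascade(prompt, call_model, quality_gate,
            cheap="fast-cheap-model", capable="capable-model"):
    """Try the cheap model first; escalate only if the gate rejects it.
    Expected cost per request is c_cheap + p_fail * c_capable, so the
    cascade beats always-capable only while p_fail < 1 - c_cheap / c_capable."""
    draft = call_model(cheap, prompt)
    if quality_gate(draft):
        return draft, "stage1"
    # Gate failed: this request pays Stage 1 + gate + Stage 2 latency.
    return call_model(capable, prompt), "stage2"
```

Logging the returned stage label gives you the fallback rate directly, which is the number that decides whether the cascade is still paying for itself.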
Provider Failover
A provider abstraction layer makes failover transparent to the rest of your application. The application calls a single interface; the router handles provider selection and failover.
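The core of such a layer is a small loop: try providers in priority order, fall through on transient errors, and re-raise anything that retrying cannot fix. A minimal sketch, with the exception class and provider tuples as illustrative stand-ins for your client library's error types:

```python
class RetryableProviderError(Exception):
    """Transient failure (5xx, 429) that should trigger failover."""

def call_with_failover(prompt, providers):
    """providers: ordered list of (name, callable) pairs.
    Transient errors fall through to the next provider; anything else
    (e.g. an invalid request) is re-raised immediately by not being caught."""
    last_err = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except RetryableProviderError as err:
            last_err = err
    raise RuntimeError("all providers failed") from last_err
```

Libraries like LiteLLM package this loop together with retries, load balancing, and budget tracking, which is why a dedicated router is usually preferable to hand-rolling it.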
LiteLLM (open source, recommended)
- Unified OpenAI-compatible API for 100+ models and providers
- Built-in fallback lists: if primary fails, try secondary in order
- Built-in retry logic, load balancing, and budget management
- Runs as a proxy service or as a Python library
- 2025 status: widely adopted in enterprise; supports Anthropic/OpenAI/Bedrock/Vertex/Cohere/local
LiteLLM failover config
```yaml
model_list:
  - model_name: claude-primary
    litellm_params:
      model: anthropic/claude-sonnet-4-6
  - model_name: gpt4o-fallback          # fallback gets its own group name —
    litellm_params:                      # reusing "claude-primary" would load-
      model: openai/gpt-4o               # balance the two instead of failing over
router_settings:
  routing_strategy: simple-shuffle
  num_retries: 2
  fallbacks: [{"claude-primary": ["gpt4o-fallback"]}]
```
Quality Gate Implementation
A quality gate checks whether the fast model's output is good enough before serving it. It must be significantly cheaper and faster than the LLM call it is evaluating.
- Heuristic checks: response is non-empty, above minimum length, does not start with "I cannot" or "I'm sorry"
- Format check: if structured output (JSON) was requested, validate schema compliance
- Confidence from logprobs: if provider exposes log-probabilities, low mean logprob signals uncertainty
- Small classifier: binary "acceptable / not acceptable" classifier trained on labelled examples of good and bad responses for your use case
- What not to use as quality gate: another LLM call — this defeats the cost savings and doubles latency
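The heuristic and format checks above combine into a single cheap predicate. A minimal sketch, using a plain JSON parse as a stand-in for full schema validation; the length threshold and refusal prefixes are illustrative and should be tuned per use case:

```python
import json

REFUSAL_PREFIXES = ("I cannot", "I can't", "I'm sorry")

def passes_quality_gate(text: str, min_length: int = 20,
                        expect_json: bool = False) -> bool:
    """Cheap heuristic gate: non-trivial length, no refusal prefix,
    and (optionally) parseable JSON when structured output was requested."""
    t = text.strip()
    if len(t) < min_length:
        return False
    if t.startswith(REFUSAL_PREFIXES):
        return False
    if expect_json:
        try:
            json.loads(t)
        except json.JSONDecodeError:
            return False
    return True
```

All of these checks run in microseconds, which keeps the gate far cheaper than the LLM call it guards.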
Checklist: Do You Understand This?
- Name four dimensions you can route on — and give a concrete rule for each.
- Why should you start with rule-based classification before adding a machine learning classifier?
- In the cascade pattern, when does cascading add more latency than going directly to the capable model?
- What is LiteLLM and what problem does it solve in a multi-provider routing architecture?
- What makes a good quality gate — and what makes a bad one?
- Design a routing policy for a customer support chatbot that needs to handle both simple FAQ questions and complex billing disputes.