🧠 All Things AI
Advanced

Multi-Model Routing

Not every request needs the most capable model. Routing requests to the right model for the task — based on complexity, latency requirements, cost, and provider availability — is one of the highest-leverage architectural decisions in an enterprise AI system. Done well, routing can cut inference costs by 60–80% with no user-visible quality degradation.

Routing Dimensions

| Dimension | What it controls | Example routing rule |
|---|---|---|
| Task complexity | Simple → cheap fast model; complex → capable expensive model | Classify as "simple" → Haiku; "complex" → Sonnet; "reasoning" → Opus |
| Latency SLA | User-facing interactive → low-latency model; async batch → higher latency acceptable | Interactive chat → GPT-4o-mini (fast); overnight batch → GPT-4o (thorough) |
| Capability match | Route to the model with the strongest capability for the task type | Code tasks → Claude Sonnet (strong coding); vision → GPT-4o (multimodal) |
| Cost budget | When daily budget approaches its ceiling, route to a cheaper model | If daily spend > 80% of budget → downgrade all non-critical requests to cheaper tier |
| Provider availability | If the primary provider is down or rate-limited, route to secondary | Anthropic 5xx → failover to OpenAI equivalent; Anthropic 429 → failover to Bedrock |

Classification-Based Routing

A small, fast classifier decides which model handles each request before the request reaches the LLM. The classifier itself should be cheap — a rule-based heuristic, a small embedding classifier, or a sub-100ms LLM call.

def classify_and_route(request: LLMRequest) -> str:
    # Rule-based fast path — no ML classifier overhead
    if request.token_count < 500 and "summarise" in request.system_prompt:
        return "claude-haiku-4-5"  # fast + cheap for simple summarisation
    if request.requires_code_execution or request.context_length > 50_000:
        return "claude-sonnet-4-6"  # capable model for complex tasks
    if request.use_case == "financial_analysis" and request.requires_reasoning:
        return "claude-opus-4-6"  # highest capability for high-stakes reasoning
    # Default: mid-tier model for everything else
    return "claude-sonnet-4-6"

Rule-based classification (recommended first)

  • No latency overhead — runs in microseconds
  • Deterministic and auditable — easy to debug misrouting
  • Based on: token count, use case tag, system prompt pattern, feature flags
  • Start here before adding ML classifiers — simpler rules catch 80% of cases

ML classifier (when rules are insufficient)

  • Embed the query; classify using a lightweight model (BERT-class)
  • Adds 10-50ms latency for the classification step
  • Useful when task complexity cannot be determined from metadata alone
  • Requires labelled training data; maintain eval set to catch classifier drift
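The embedding-classifier idea above can be sketched as a nearest-centroid classifier. This is a toy illustration: `CENTROIDS` uses made-up 2-d vectors, whereas in practice each centroid would be the mean embedding of labelled "simple" / "complex" examples from a real embedding model:

```python
import math

# Illustrative 2-d centroids; real ones come from averaging the
# embeddings of labelled training queries per class.
CENTROIDS = {
    "simple":  [0.9, 0.1],
    "complex": [0.2, 0.8],
}

def classify(embedding: list[float]) -> str:
    """Return the label of the nearest centroid (Euclidean distance)."""
    return min(
        CENTROIDS,
        key=lambda label: math.dist(embedding, CENTROIDS[label]),
    )
```

Nearest-centroid keeps the classification step to a few microseconds once the query embedding exists; the 10–50ms cost quoted above is dominated by the embedding call itself.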

Cascade Pattern

The cascade pattern tries the fast cheap model first. If the response does not meet a quality threshold, it falls back to the capable expensive model. The quality check must be fast — a heuristic or small classifier, not another LLM call.

| Stage | Action | Quality gate |
|---|---|---|
| Stage 1 | Send to fast cheap model (e.g., Haiku / GPT-4o-mini) | Check: response length reasonable? No refusal? Confidence score above threshold? |
| Stage 2 (if gate fails) | Send to capable model (e.g., Sonnet / GPT-4o) | Serve this response; log that fallback was triggered |
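The two stages can be sketched as a single function. `call_model` and `passes_quality_gate` are placeholders for your provider client and gate logic, and the model names are illustrative:

```python
def cascade(prompt, call_model, passes_quality_gate,
            cheap="claude-haiku-4-5", capable="claude-sonnet-4-6"):
    """Try the cheap model first; escalate only if the gate fails.

    Returns (response, model_used) so callers can log fallback rate.
    """
    response = call_model(cheap, prompt)
    if passes_quality_gate(response):
        return response, cheap
    # Gate failed: pay for the capable model and record the fallback.
    return call_model(capable, prompt), capable
```

Returning the model actually used is deliberate: the fallback rate is the key metric for deciding whether the cascade is still paying for itself.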

Cascade adds latency on fallback

When the Stage 1 model fails the quality gate, the total latency is Stage 1 time + quality gate time + Stage 2 time. This is higher than going directly to Stage 2. Cascade is worth it when the majority of requests pass Stage 1 — if fallback rate exceeds 40%, route directly to Stage 2 instead.
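The break-even argument above is simple expected-value arithmetic. A sketch with illustrative timings (seconds), not measured values:

```python
def cascade_latency(p, t_cheap=0.4, t_gate=0.01, t_capable=1.2):
    """Expected latency given Stage 1 pass rate p (0.0-1.0).

    Every request pays Stage 1 + gate; only failures also pay Stage 2.
    """
    return t_cheap + t_gate + (1 - p) * t_capable

# With these numbers: p = 0.9 averages ~0.53s, well under the 1.2s of
# going direct; p = 0.5 averages ~1.01s, nearly as slow as direct,
# while each individual fallback request takes ~1.61s.
```

The same formula with per-request prices instead of timings gives the expected-cost version, which is why both the fallback rate and the cheap/capable price ratio belong on the routing dashboard.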

Provider Failover

A provider abstraction layer makes failover transparent to the rest of your application. The application calls a single interface; the router handles provider selection and failover.

LiteLLM (open source, recommended)

  • Unified OpenAI-compatible API for 100+ models and providers
  • Built-in fallback lists: if primary fails, try secondary in order
  • Built-in retry logic, load balancing, and budget management
  • Runs as a proxy service or as a Python library
  • 2025 status: widely adopted in enterprise; supports Anthropic/OpenAI/Bedrock/Vertex/Cohere/local

LiteLLM failover config

model_list:
  - model_name: claude-primary
    litellm_params:
      model: anthropic/claude-sonnet-4-6
  - model_name: gpt-fallback
    litellm_params:
      model: openai/gpt-4o  # fallback deployment

router_settings:
  routing_strategy: simple-shuffle
  num_retries: 2
  fallbacks: [{"claude-primary": ["gpt-fallback"]}]

Quality Gate Implementation

A quality gate checks whether the fast model's output is good enough before serving it. It must be significantly cheaper and faster than the LLM call it is evaluating.

  • Heuristic checks: response is non-empty, above minimum length, does not start with "I cannot" or "I'm sorry"
  • Format check: if structured output (JSON) was requested, validate schema compliance
  • Confidence from logprobs: if provider exposes log-probabilities, low mean logprob signals uncertainty
  • Small classifier: binary "acceptable / not acceptable" classifier trained on labelled examples of good and bad responses for your use case
  • What not to use as quality gate: another LLM call — this defeats the cost savings and doubles latency
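The heuristic and format checks above combine into one gate function. A minimal sketch — the refusal prefixes, 20-character minimum, and bare `json.loads` check are illustrative and should be tuned (and, for real schemas, replaced with proper schema validation) per use case:

```python
import json

REFUSAL_PREFIXES = ("I cannot", "I'm sorry")

def passes_quality_gate(response: str, min_length: int = 20,
                        expect_json: bool = False) -> bool:
    """Cheap deterministic checks only — never another LLM call."""
    text = response.strip()
    if len(text) < min_length:
        return False
    if text.startswith(REFUSAL_PREFIXES):
        return False
    if expect_json:
        try:
            json.loads(text)  # full schema validation would go here
        except json.JSONDecodeError:
            return False
    return True
```

Every check here runs in microseconds, which is what keeps the cascade's Stage 1 + gate overhead negligible relative to the model calls themselves.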

Checklist: Do You Understand This?

  • Name four dimensions you can route on — and give a concrete rule for each.
  • Why should you start with rule-based classification before adding a machine learning classifier?
  • In the cascade pattern, when does cascading add more latency than going directly to the capable model?
  • What is LiteLLM and what problem does it solve in a multi-provider routing architecture?
  • What makes a good quality gate — and what makes a bad one?
  • Design a routing policy for a customer support chatbot that needs to handle both simple FAQ questions and complex billing disputes.