Hybrid Router Pattern
Not every query needs your most expensive model. A hybrid router dynamically directs requests to the right model based on task complexity, cost constraints, data sensitivity, and latency requirements: it delivers frontier quality where it matters while keeping average cost per query low.
Why Hybrid Routing
The cost spread between the cheapest and most expensive frontier models is roughly 100×. Claude Haiku and Gemini Flash process ~1M tokens for under $1; Claude Opus and GPT-5 cost $15–$75 per million tokens. For applications that handle thousands of queries per day, routing even 70% of requests to a cheaper tier can reduce model costs by 60–80% with negligible quality loss on routine tasks.
| Tier | Models | Cost (input/M tokens, approx.) | Best for |
|---|---|---|---|
| Cheap / Fast | Claude Haiku, Gemini Flash, GPT-4o mini, Llama 3.1 8B | $0.08–$0.30 | Simple Q&A, classification, extraction, short summaries |
| Mid-tier | Claude Sonnet, GPT-4o, Gemini Pro, Llama 3.1 70B | $1–$5 | Most everyday tasks: coding help, analysis, longer documents |
| Premium | Claude Opus, GPT-5, o3, Gemini Ultra | $15–$75 | Complex reasoning, hard coding problems, research synthesis |
| Local / On-premise | Llama 3.1 8B/70B, Mistral, Phi-3, DeepSeek-R1 | Hardware + electricity only | Privacy-sensitive data, offline requirements, high-volume cost control |
Routing Dimensions
A router evaluates each incoming request across multiple dimensions simultaneously:
Task complexity
Is this a factual lookup, a simple rewrite, or a multi-step reasoning problem? Simple tasks (FAQ answers, entity extraction, classification) route to the cheap tier; complex reasoning, code generation, and ambiguous multi-constraint tasks route to the premium tier. The classifier learns to distinguish these categories.
Data sensitivity
Queries containing PII, financial data, health records, or confidential IP must not be sent to external cloud APIs. Route these to an on-premise or private cloud model regardless of complexity. This is a hard constraint that overrides cost optimisation.
Latency requirements
Real-time interfaces (chat, voice, autocomplete) require low-latency models. Background batch jobs (nightly report generation, email digests) can use slower premium models without affecting user experience. Route based on the calling context, not just the query.
Cost budget
Budgets can be per-user, per-day, or per-tenant. When a user has exhausted their budget, route to cheaper models; enterprise plans get access to the premium tier, while free-tier users get the cheap tier only. The router enforces these policies programmatically, as sketched below.
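A minimal sketch of budget-aware capping, assuming hypothetical plan names, budget figures, and an in-memory spend tracker:

```python
# Hypothetical sketch: cap the routing tier by plan and remaining daily budget.
# Plan names, budget figures, and the spend store are illustrative assumptions.
TIER_ORDER = ["cheap", "mid", "premium"]
MAX_TIER_BY_PLAN = {"free": "cheap", "pro": "mid", "enterprise": "premium"}
DAILY_BUDGET_USD = {"free": 0.10, "pro": 2.00, "enterprise": 50.00}

daily_spend_usd: dict[str, float] = {}  # user_id -> spend so far today

def cap_tier(requested_tier: str, user_id: str, plan: str) -> str:
    """Never route above the plan's ceiling; drop to cheap once the budget is spent."""
    ceiling = MAX_TIER_BY_PLAN[plan]
    tier = min(requested_tier, ceiling, key=TIER_ORDER.index)  # lower of the two
    if daily_spend_usd.get(user_id, 0.0) >= DAILY_BUDGET_USD[plan]:
        tier = "cheap"
    return tier
```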
Classifier-Based Routing
The most reliable routing approach uses a lightweight classifier to predict which model tier is appropriate for a given query:
The PII check is a hard gate: data sensitivity routing overrides all cost-optimisation decisions.
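A minimal sketch of this flow; `contains_pii` and `classify_complexity` are naive stand-ins for the detectors and classifier options covered below, and the model names are illustrative:

```python
# Sketch of the overall flow. contains_pii() and classify_complexity() are
# placeholders for the real detectors and classifiers discussed below.
MODEL_BY_TIER = {          # illustrative model choices per tier
    "cheap": "claude-haiku",
    "mid": "claude-sonnet",
    "premium": "claude-opus",
    "local": "llama-3.1-70b",
}

def contains_pii(query: str) -> bool:
    return "@" in query    # naive stand-in; see the data-sensitivity section

def classify_complexity(query: str) -> str:
    return "cheap" if len(query.split()) < 30 else "mid"  # naive stand-in

def route(query: str) -> str:
    # Hard gate first: sensitive queries stay on the private deployment,
    # whatever the complexity classifier would prefer.
    if contains_pii(query):
        return MODEL_BY_TIER["local"]
    return MODEL_BY_TIER[classify_complexity(query)]
```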
Options for the complexity classifier
| Approach | Accuracy | Latency overhead | Notes |
|---|---|---|---|
| Small LLM call (Haiku / Phi-3) | High | 100–300ms | Prompt the small model to classify the query; adds cost but works well out of the box (sketched below) |
| Fine-tuned text classifier (BERT-size) | Very high | 5–20ms | Train on labelled routing examples; near-zero overhead; best for production |
| Embedding similarity | Medium | 20–50ms | Embed query; find nearest labelled examples; decent for well-separated categories |
| Heuristic rules | Low–medium | <1ms | Query length, keyword presence, structured vs free-form; useful as a fast first pass |
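As a concrete example of the first option, the sketch below prompts a small model through LiteLLM's OpenAI-compatible `completion` call (introduced later in this section); the model identifier, prompt wording, and fallback default are illustrative:

```python
from litellm import completion  # pip install litellm; needs the provider's API key

CLASSIFY_PROMPT = (
    "Classify the user query into exactly one of: cheap, mid, premium.\n"
    "cheap = lookup/extraction/short rewrite; mid = everyday coding/analysis;\n"
    "premium = multi-step reasoning or hard synthesis.\n"
    "Reply with the single word only.\n\nQuery: {query}"
)

def classify_with_small_llm(query: str) -> str:
    # Model identifier is illustrative; any cheap, fast model works here.
    resp = completion(
        model="claude-3-haiku-20240307",
        messages=[{"role": "user", "content": CLASSIFY_PROMPT.format(query=query)}],
        max_tokens=5,
    )
    answer = resp.choices[0].message.content.strip().lower()
    return answer if answer in {"cheap", "mid", "premium"} else "mid"  # safe default
```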
RouteLLM (open-source, from LMSYS) is a purpose-built routing framework with pre-trained classifiers. It is trained on human preference data from Chatbot Arena and can reduce model costs by 40–85% while maintaining 95%+ of the quality of always using the premium model.
Rule-Based Routing
For predictable task categories, deterministic rules are simpler, faster, and more auditable than a classifier (a combined sketch follows the list):
- By endpoint or feature: code completion always uses mid-tier; customer support FAQ uses cheap tier; technical analysis uses premium tier
- By query length: under 50 tokens → cheap; 50–500 tokens → mid; over 500 tokens or document-length → premium
- By keyword/pattern: queries matching "calculate", "compare", "analyse" route up; simple question patterns ("what is", "define", "how many") route down
- By user tier: the API key's plan determines the maximum tier available
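A sketch combining these rules; the patterns, thresholds, and precedence order (route-down checks before route-up checks) are illustrative design choices, not fixed prescriptions:

```python
import re

# Deterministic first-pass rules; patterns and thresholds are illustrative.
UP_PATTERNS = re.compile(r"\b(calculate|compare|analyse|prove|optimi[sz]e)\b", re.I)
DOWN_PATTERNS = re.compile(r"^\s*(what is|define|how many)\b", re.I)

def rule_based_tier(query: str, plan_max_tier: str = "premium") -> str:
    n_tokens = len(query.split())  # crude token estimate
    if DOWN_PATTERNS.search(query) or n_tokens < 50:
        tier = "cheap"
    elif UP_PATTERNS.search(query) or n_tokens > 500:
        tier = "premium"
    else:
        tier = "mid"
    # User tier caps the result (cheap < mid < premium).
    order = ["cheap", "mid", "premium"]
    return tier if order.index(tier) <= order.index(plan_max_tier) else plan_max_tier
```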
Data Sensitivity Routing
Data sensitivity routing is the most critical routing dimension from a compliance perspective and must run before any other routing decision; a detection sketch follows the table:
| Data type | Detection method | Required routing |
|---|---|---|
| PII (names, emails, SSN, phone numbers) | Regex + NER model (spaCy, AWS Comprehend) | On-premise model or approved private cloud region |
| Financial data (account numbers, transactions) | Pattern matching + context classification | On-premise or vendor with BAA/DPA agreement |
| Health records (PHI under HIPAA) | PHI entity recognition | HIPAA-compliant vendor only (Azure OpenAI, AWS Bedrock) |
| Confidential IP / trade secrets | Document classification, user-flagged content | On-premise or enterprise contract with data non-training guarantee |
| Non-sensitive | No flags triggered | Any tier, cost-optimise freely |
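A minimal PII gate using Microsoft Presidio's analyzer; the entity list and score threshold are illustrative choices:

```python
# Minimal PII gate using Microsoft Presidio.
# Requires: pip install presidio-analyzer, plus a spaCy model
# (e.g. python -m spacy download en_core_web_lg).
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()  # loads spaCy NER under the hood

def contains_pii(text: str, threshold: float = 0.5) -> bool:
    # Entity list and threshold are illustrative; tune both for your data.
    results = analyzer.analyze(
        text=text,
        entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"],
        language="en",
    )
    return any(r.score >= threshold for r in results)
```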
Fallback Chains
A fallback chain handles model unavailability, rate limits, and quality failures gracefully rather than returning an error to the user. This requires tracking which models are currently degraded: maintain a circuit breaker per model, and if a model returns 3+ consecutive errors, mark it degraded and skip directly to its fallback for subsequent requests (reset after 60 seconds).
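A minimal per-model breaker plus fallback loop, using the thresholds from the text; `call_model` is a placeholder for your actual provider call:

```python
import time

# Per-model circuit breaker: 3+ consecutive errors -> degraded for 60 seconds.
class CircuitBreaker:
    def __init__(self, error_threshold: int = 3, cooldown_s: float = 60.0):
        self.error_threshold = error_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_errors = 0
        self.degraded_until = 0.0

    def available(self) -> bool:
        return time.monotonic() >= self.degraded_until

    def record_success(self) -> None:
        self.consecutive_errors = 0

    def record_error(self) -> None:
        self.consecutive_errors += 1
        if self.consecutive_errors >= self.error_threshold:
            self.degraded_until = time.monotonic() + self.cooldown_s
            self.consecutive_errors = 0  # breaker resets after the cooldown

breakers: dict[str, CircuitBreaker] = {}

def call_with_fallbacks(chain: list[str], call_model) -> str:
    """Try each model in the chain; call_model(name) is the actual API call."""
    for model in chain:
        breaker = breakers.setdefault(model, CircuitBreaker())
        if not breaker.available():
            continue  # model is degraded; skip straight to the fallback
        try:
            result = call_model(model)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_error()
    raise RuntimeError("all models in the fallback chain failed or are degraded")
```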
Local + Cloud Hybrid
The most cost-effective production pattern combines a local open-weight model for routine traffic with cloud APIs for tasks that require frontier capability (a dispatch sketch follows the lists):
Local model handles
- Simple Q&A, classification, extraction
- All queries containing sensitive data
- High-volume, low-stakes tasks
- Offline or air-gapped requirements
- Real-time latency requirements (avoid network)
Cloud API handles
- Complex reasoning and multi-step analysis
- Long-document understanding (>64K tokens)
- Hard coding and debugging tasks
- Tasks where local model outputs have been poor
- Multimodal inputs if local model lacks capability
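A dispatch sketch under these rules; the predicates and model names are placeholders for the checks described above:

```python
# Sketch of the local/cloud split; predicates and model names are illustrative.
LOCAL_MODEL = "ollama/llama3.1:70b"   # served locally, e.g. via Ollama
CLOUD_MODEL = "claude-sonnet"         # placeholder frontier model name

def choose_backend(query: str, *, sensitive: bool,
                   needs_long_context: bool, is_complex: bool) -> str:
    if sensitive:
        return LOCAL_MODEL    # hard gate: sensitive data stays local
    if needs_long_context or is_complex:
        return CLOUD_MODEL    # frontier capability required
    return LOCAL_MODEL        # routine traffic stays on cheap local compute
```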
A GPU machine running Llama 3.1 70B can handle ~50–100 concurrent requests at ~$0.001 per request in electricity cost. Compared to Claude Sonnet at $3/M tokens, the break-even is roughly 10,000 requests per month, achievable for any moderate-traffic application.
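To make the break-even arithmetic explicit: with per-request costs c_cloud and c_local and a fixed monthly cost F for the local machine, the break-even volume is N = F / (c_cloud − c_local). The tokens-per-request and amortised hardware figures below are assumptions chosen to illustrate the formula, not numbers from the text or vendor pricing pages:

```python
# Break-even: monthly volume N where cloud cost equals local cost.
# N * cloud_per_req = N * local_per_req + fixed_local_monthly
# => N = fixed_local_monthly / (cloud_per_req - local_per_req)
tokens_per_request = 10_000          # assumption: long-ish prompts
cloud_per_req = 3.0 / 1_000_000 * tokens_per_request  # $3/M tokens -> $0.03
local_per_req = 0.001                # electricity, per the text
fixed_local_monthly = 300.0          # assumption: amortised GPU server cost

break_even = fixed_local_monthly / (cloud_per_req - local_per_req)
print(f"break-even ≈ {break_even:,.0f} requests/month")  # ≈ 10,345
```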
Cost Measurement and Quality Testing
A routing system must be continuously monitored to validate that it is making the right trade-offs:
Cost tracking
- Log every routing decision: query ID, routed tier, model used, token counts, cost estimate (a record sketch follows this list)
- Track the cost-per-query distribution, not just the average (outliers can dominate total cost)
- Track routing distribution: what % of queries go to each tier, and how this changes over time
- Alert when a routing tier's share shifts significantly; this may indicate classifier drift or changed traffic patterns
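A sketch of the per-decision record; the field names and the JSON-to-stdout sink are illustrative, and usage field names vary by provider:

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict

# One structured record per routing decision; field names are illustrative.
@dataclass
class RoutingRecord:
    query_id: str
    tier: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    ts: float

def log_routing_decision(tier: str, model: str, usage: dict, cost_usd: float) -> None:
    record = RoutingRecord(
        query_id=str(uuid.uuid4()),
        tier=tier,
        model=model,
        input_tokens=usage["input_tokens"],     # key names vary by provider
        output_tokens=usage["output_tokens"],
        cost_usd=cost_usd,
        ts=time.time(),
    )
    print(json.dumps(asdict(record)))  # stand-in for your telemetry sink
```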
Quality monitoring
- A/B test routing against a single-model baseline on a sample of queries
- Use LLM-as-judge to score routed outputs vs premium model outputs on the same queries
- Track user feedback signals (thumbs down, regenerate clicks) broken down by routed tier
- Set a quality floor: if the cheap tier's rejection rate exceeds 15%, move the routing boundary up to mid-tier (a check is sketched below)
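The quality-floor rule reduces to a small check; the counter handling here is illustrative:

```python
# Quality floor per the rule above: if the cheap tier's negative-feedback
# rate exceeds 15%, shift the routing boundary up to mid-tier.
REJECTION_FLOOR = 0.15

def minimum_tier(rejections: int, total: int) -> str:
    """Return the lowest tier that simple queries should route to."""
    if total > 0 and rejections / total > REJECTION_FLOOR:
        return "mid"    # cheap tier is underperforming; raise the boundary
    return "cheap"
```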
Implementation Reference
| Component | Option | Notes |
|---|---|---|
| Routing framework | RouteLLM (open-source), custom middleware | RouteLLM has pre-trained classifiers ready to use |
| PII detection | spaCy NER, AWS Comprehend, Microsoft Presidio | Presidio is open-source, enterprise-grade, configurable |
| Circuit breaker | resilience4j (Java), pybreaker (Python), custom | Per-model circuit breaker; reset after 60s |
| Local model hosting | Ollama, vLLM, llama.cpp | vLLM for high-throughput production; Ollama for simpler setups |
| Cost tracking | Langfuse, LangSmith, custom telemetry | Tag all LLM calls with routing tier and decision metadata |
| Multi-provider SDK | LiteLLM | Single API surface across 100+ models; built-in fallback and retry |
LiteLLM: the practical foundation
LiteLLM provides a unified OpenAI-compatible API across Anthropic, Google, Azure, Cohere, local Ollama, and 100+ other providers. It handles retries, fallbacks, and load balancing out of the box. Building a hybrid router on top of LiteLLM means you only write the routing decision logic, not the per-provider integration code.
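A minimal illustration of the unified call shape; the model identifiers are examples (the Ollama one assumes a local Ollama server is running), so check LiteLLM's docs for current names:

```python
# Same call shape regardless of provider; only the model string changes.
# Model identifiers are illustrative; each needs its provider configured
# (API key for Anthropic, a running local server for Ollama).
from litellm import completion

messages = [{"role": "user", "content": "Summarise this in one sentence: ..."}]

cloud = completion(model="claude-3-haiku-20240307", messages=messages)
local = completion(model="ollama/llama3.1", messages=messages)

print(cloud.choices[0].message.content)
print(local.choices[0].message.content)
```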
Checklist: Do You Understand This?
- Why does hybrid routing reduce model costs by 60–80% on most applications?
- What are the four routing dimensions and which one is a hard constraint that overrides the others?
- What is a complexity classifier and what are three ways to implement one?
- How does a fallback chain differ from a simple retry, and what is a circuit breaker?
- At what request volume does a local model become cost-competitive with a $3/M token cloud API?
- What metrics should you track to validate that routing decisions are not degrading quality?
- What is LiteLLM and why is it a useful foundation for a hybrid router implementation?