🧠 All Things AI
Advanced

Hybrid Router Pattern

Not every query needs your most expensive model. A hybrid router dynamically directs requests to the right model based on task complexity, cost constraints, data sensitivity, and latency requirements, delivering frontier quality where it matters while keeping average cost per query low.

Why Hybrid Routing

The cost spread between the cheapest and most expensive frontier models is roughly 100×. Claude Haiku and Gemini Flash process ~1M tokens for under $1; Claude Opus and GPT-5 cost $15–$75 per million tokens. For applications that handle thousands of queries per day, routing even 70% of requests to a cheaper tier can reduce model costs by 60–80% with negligible quality loss on routine tasks.

| Tier | Models | Cost (input/M tokens, approx.) | Best for |
|---|---|---|---|
| Cheap / Fast | Claude Haiku, Gemini Flash, GPT-4o mini, Llama 3.1 8B | $0.08–$0.30 | Simple Q&A, classification, extraction, short summaries |
| Mid-tier | Claude Sonnet, GPT-4o, Gemini Pro, Llama 3.1 70B | $1–$5 | Most everyday tasks: coding help, analysis, longer documents |
| Premium | Claude Opus, GPT-5, o3, Gemini Ultra | $15–$75 | Complex reasoning, hard coding problems, research synthesis |
| Local / On-premise | Llama 3.1 8B/70B, Mistral, Phi-3, DeepSeek-R1 | Hardware + electricity only | Privacy-sensitive data, offline requirements, high-volume cost control |
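The 60–80% savings figure follows from simple arithmetic over the tier prices. A sketch, using illustrative per-tier prices consistent with the table (the exact $0.25 cheap-tier price and the 70/30 split are assumptions, not measurements):

```python
# Illustrative blended-cost calculation for a two-tier routing split.
PREMIUM_PER_M = 15.00   # $/M input tokens, premium tier (low end of range)
CHEAP_PER_M = 0.25      # $/M input tokens, cheap tier (assumed)

def blended_cost(cheap_share: float) -> float:
    """Average $/M input tokens when `cheap_share` of traffic goes cheap."""
    return cheap_share * CHEAP_PER_M + (1 - cheap_share) * PREMIUM_PER_M

baseline = blended_cost(0.0)   # everything premium: $15.00/M
routed = blended_cost(0.7)     # 70% cheap, 30% premium: $4.675/M
savings = 1 - routed / baseline
print(f"blended ${routed:.3f}/M, savings {savings:.0%}")  # savings 69%
```

Even this crude split lands inside the 60–80% band; routing more traffic down, or adding a mid-tier, shifts the blend further.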

Routing Dimensions

A router evaluates each incoming request across multiple dimensions simultaneously:

Task complexity

Is this a factual lookup, a simple rewrite, or a multi-step reasoning problem? Simple tasks (FAQ answers, entity extraction, classification) route to cheap tier. Complex reasoning, code generation, or ambiguous multi-constraint tasks route to premium tier. The classifier learns to distinguish these.

Data sensitivity

Queries containing PII, financial data, health records, or confidential IP must not be sent to external cloud APIs. Route these to an on-premise or private cloud model regardless of complexity. This is a hard constraint that overrides cost optimisation.

Latency requirements

Real-time interfaces (chat, voice, autocomplete) require low-latency models. Background batch jobs (nightly report generation, email digests) can use slower premium models without affecting user experience. Route based on the calling context, not just the query.

Cost budget

Per-user, per-day, or per-tenant cost budgets. When a user has exceeded their budget tier, route to cheaper models. Enterprise plans get access to premium tier; free tier users get cheap-tier only. The router enforces these policies programmatically.
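The four dimensions can be combined into a single decision function. A minimal sketch; the request fields, tier names, and the real-time policy (capping real-time callers at mid-tier) are illustrative choices, not a prescribed design:

```python
from dataclasses import dataclass

@dataclass
class Request:
    complexity: str     # "simple" | "medium" | "complex" (from a classifier)
    contains_pii: bool  # from the PII detector
    realtime: bool      # chat/voice/autocomplete vs background batch
    max_plan_tier: str  # "cheap" | "mid" | "premium" (from the user's plan)

TIER_ORDER = ["cheap", "mid", "premium"]

def route(req: Request) -> str:
    # Hard gate: sensitive data never leaves the boundary, regardless of cost.
    if req.contains_pii:
        return "local"
    # Complexity sets the desired tier...
    desired = {"simple": "cheap", "medium": "mid", "complex": "premium"}[req.complexity]
    # ...but the caller's plan caps it.
    cap = TIER_ORDER.index(req.max_plan_tier)
    tier = TIER_ORDER[min(TIER_ORDER.index(desired), cap)]
    # Illustrative latency policy: real-time callers avoid the slowest tier.
    if req.realtime and tier == "premium":
        tier = "mid"
    return tier
```

Note that the PII branch returns before any cost or complexity logic runs, which is exactly the "hard constraint overrides cost optimisation" rule above.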

Classifier-Based Routing

The most reliable routing approach uses a lightweight classifier to predict which model tier is appropriate for a given query:

1. Incoming query: the user request or API call
2. PII detector: regex + NER; force the local route if PII is detected
3. Complexity classifier: a small LLM or BERT-size model labels the query simple / medium / complex
4. Budget check: is the user within their plan's tier limits?
5. Route: Cheap / Mid / Premium / Local

PII check is a hard gate: data sensitivity routing overrides all cost optimisation decisions.

Options for the complexity classifier

| Approach | Accuracy | Latency overhead | Notes |
|---|---|---|---|
| Small LLM call (Haiku / Phi-3) | High | 100–300ms | Prompt the small model to classify the query; adds cost but works well out of the box |
| Fine-tuned text classifier (BERT-size) | Very high | 5–20ms | Train on labelled routing examples; near-zero overhead; best for production |
| Embedding similarity | Medium | 20–50ms | Embed the query and find the nearest labelled examples; decent for well-separated categories |
| Heuristic rules | Low–medium | <1ms | Query length, keyword presence, structured vs free-form; useful as a fast first pass |
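The heuristic-rules row can be made concrete in a few lines. A sketch of a sub-millisecond first-pass classifier; the keyword lists and word-count thresholds are illustrative, not tuned values:

```python
import re

# Fast heuristic first pass: keyword and length signals only.
# Thresholds and keyword lists are illustrative placeholders.
COMPLEX_HINTS = re.compile(
    r"\b(analyse|analyze|compare|calculate|prove|debug|refactor)\b", re.I)
SIMPLE_HINTS = re.compile(r"^\s*(what is|define|how many|who is)\b", re.I)

def classify(query: str) -> str:
    words = len(query.split())
    if SIMPLE_HINTS.search(query) and words < 30:
        return "simple"
    if COMPLEX_HINTS.search(query) or words > 300:
        return "complex"
    return "medium"
```

In production this would typically sit in front of a trained classifier, short-circuiting the obvious cases and passing ambiguous queries through for a real prediction.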

RouteLLM (open-source, LMSys) is a purpose-built routing framework with pre-trained classifiers. Its routers are trained on human preference data from Chatbot Arena, and it can reduce model costs by 40–85% while maintaining 95%+ of the quality of always using the premium model.

Rule-Based Routing

For predictable task categories, deterministic rules are simpler, faster, and more auditable than a classifier:

  • By endpoint or feature: code completion always uses mid-tier; customer support FAQ uses cheap tier; technical analysis uses premium tier
  • By query length: under 50 tokens, cheap; 50–500 tokens, mid; over 500 tokens or document-length, premium
  • By keyword/pattern: queries matching "calculate", "compare", "analyse" route up; simple question patterns ("what is", "define", "how many") route down
  • By user tier: the API key's plan determines the maximum tier available
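The endpoint and user-tier rules compose naturally as a lookup capped by the caller's plan. A sketch; the endpoint names and tier assignments are illustrative:

```python
# Deterministic routing by feature endpoint, capped by the caller's plan.
# Endpoint names and tier assignments below are illustrative examples.
ENDPOINT_TIER = {
    "code_completion": "mid",
    "support_faq": "cheap",
    "technical_analysis": "premium",
}
TIERS = ["cheap", "mid", "premium"]

def route_endpoint(endpoint: str, plan_max: str) -> str:
    desired = ENDPOINT_TIER.get(endpoint, "mid")  # default for unknown endpoints
    return TIERS[min(TIERS.index(desired), TIERS.index(plan_max))]
```

Because the mapping is a plain table, every routing decision is trivially auditable: the answer for any (endpoint, plan) pair can be read off without running a model.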

Data Sensitivity Routing

Data sensitivity routing is the most critical routing dimension from a compliance perspective. It must run before any other routing decision:

| Data type | Detection method | Required routing |
|---|---|---|
| PII (names, emails, SSN, phone numbers) | Regex + NER model (spaCy, AWS Comprehend) | On-premise model or approved private cloud region |
| Financial data (account numbers, transactions) | Pattern matching + context classification | On-premise or vendor with BAA/DPA agreement |
| Health records (PHI under HIPAA) | PHI entity recognition | HIPAA-compliant vendor only (Azure OpenAI, AWS Bedrock) |
| Confidential IP / trade secrets | Document classification, user-flagged content | On-premise or enterprise contract with data non-training guarantee |
| Non-sensitive | No flags triggered | Any tier, cost-optimise freely |
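The regex half of the regex + NER stack is straightforward. A minimal sketch covering a few PII shapes; a real deployment layers an NER model (spaCy, Presidio) on top, because regex alone misses names and free-text PII:

```python
import re

# Minimal regex gate for a few structured PII shapes. This is the cheap
# first layer only; names and unstructured PII require an NER model.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # US SSN shape
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),    # email address
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),         # card-number-like digit run
]

def contains_pii(text: str) -> bool:
    return any(p.search(text) for p in PII_PATTERNS)
```

The gate is deliberately biased toward false positives: routing a clean query to the local model costs a little quality, while routing an SSN to an external API is a compliance incident.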

Fallback Chains

A fallback chain handles model unavailability, rate limits, and quality failures gracefully rather than returning an error to the user:

Fallback chain, tried in order until one succeeds:

1. Claude Sonnet (primary; retry once on 429/5xx)
2. GPT-4o (fallback 1; used after the primary's second failure)
3. Gemini Pro (fallback 2)
4. Llama 70B, local (fallback 3; no network required)
5. Error response (all failed; return with retry guidance)

Circuit breaker per model: 3+ consecutive errors → mark degraded for 60s → skip to fallback.

Fallback chains require tracking which models are currently experiencing degraded performance. Maintain a circuit breaker per model: if a model returns 3+ consecutive errors, mark it as degraded and skip directly to the fallback for subsequent requests (reset after 60 seconds).
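The breaker plus chain can be sketched in a few dozen lines. This simplified version omits the per-call retry on 429/5xx and injects a clock so the 60-second cooldown is testable; the model callables are hypothetical stand-ins that raise on failure:

```python
import time

class CircuitBreaker:
    """Per-model breaker: 3+ consecutive errors marks the model degraded
    for 60 seconds, during which the chain skips it."""

    def __init__(self, threshold: int = 3, cooldown: float = 60.0,
                 clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None   # timestamp when the breaker opened, or None

    def available(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            self.opened_at, self.failures = None, 0   # cooldown elapsed: reset
            return True
        return False

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

def call_with_fallback(chain, breakers, query):
    """`chain` is a list of (name, callable); callables raise on failure."""
    for name, model in chain:
        if not breakers[name].available():
            continue                      # degraded: skip straight to fallback
        try:
            result = model(query)
            breakers[name].record(ok=True)
            return name, result
        except Exception:
            breakers[name].record(ok=False)
    raise RuntimeError("all models failed; retry later")
```

Production libraries (pybreaker, resilience4j, listed in the reference table below) add half-open probing and metrics, but the control flow is the same.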

Local + Cloud Hybrid

The most cost-effective production pattern combines a local open-weight model for routine traffic with cloud APIs for tasks that require frontier capability:

Local model handles

  • Simple Q&A, classification, extraction
  • All queries containing sensitive data
  • High-volume, low-stakes tasks
  • Offline or air-gapped requirements
  • Real-time latency requirements (avoid network)

Cloud API handles

  • Complex reasoning and multi-step analysis
  • Long-document understanding (>64K tokens)
  • Hard coding and debugging tasks
  • Tasks where local model outputs have been poor
  • Multimodal inputs if local model lacks capability

A GPU machine running Llama 3.1 70B can handle ~50–100 concurrent requests at ~$0.001 per request in electricity cost. Compared to Claude Sonnet at $3/M tokens, the break-even is roughly 10,000 requests per month, achievable for any moderate-traffic application.
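The break-even arithmetic is worth making explicit. A back-of-envelope sketch using the figures above plus one assumption, an average request size of 1,000 input tokens; under those numbers, the ~10,000 requests/month figure corresponds to amortising roughly $20/month of fixed local-hosting cost:

```python
# Back-of-envelope break-even. AVG_TOKENS is an assumption; the $3/M and
# $0.001/request figures come from the text above.
AVG_TOKENS = 1000                  # assumed average input tokens per request
CLOUD_PER_M = 3.00                 # Claude Sonnet, $/M input tokens
LOCAL_PER_REQ = 0.001              # electricity cost per local request

cloud_per_req = CLOUD_PER_M * AVG_TOKENS / 1_000_000   # $0.003 per request
savings_per_req = cloud_per_req - LOCAL_PER_REQ        # $0.002 per request

# Fixed monthly cost amortised at a 10,000 requests/month break-even:
implied_fixed = round(10_000 * savings_per_req, 2)
print(implied_fixed)   # 20.0
```

Larger average requests or pricier cloud tiers push the break-even volume down; the useful habit is re-running this arithmetic with your own traffic numbers rather than trusting any single published figure.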

Cost Measurement and Quality Testing

A routing system must be continuously monitored to validate that it is making the right trade-offs:

Cost tracking

  • Log every routing decision: query ID, routed tier, model used, token counts, cost estimate
  • Track the cost-per-query distribution, not just the average (outliers can dominate total cost)
  • Track routing distribution: what % of queries go to each tier, and how this changes over time
  • Alert when a routing tier's share shifts significantly; this may indicate classifier drift or changed traffic patterns
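The drift alert in the last bullet reduces to comparing tier shares against a baseline. A sketch; the 10-percentage-point threshold is an illustrative default, not a recommendation:

```python
from collections import Counter

def drift_alerts(baseline: dict, current_counts: Counter,
                 threshold: float = 0.10):
    """Flag tiers whose traffic share moved more than `threshold`
    (absolute share) away from the baseline distribution."""
    total = sum(current_counts.values())
    alerts = []
    for tier, base_share in baseline.items():
        share = current_counts.get(tier, 0) / total
        if abs(share - base_share) > threshold:
            alerts.append((tier, base_share, round(share, 3)))
    return alerts
```

Run this over a rolling window (daily or weekly) so that a retrained classifier or a new traffic source shows up as an alert rather than as a silent cost increase.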

Quality monitoring

  • A/B test routing against a single-model baseline on a sample of queries
  • Use LLM-as-judge to score routed outputs vs premium model outputs on the same queries
  • Track user feedback signals (thumbs down, regenerate clicks) broken down by routed tier
  • Set a quality floor: if cheap-tier rejection rate exceeds 15%, move the boundary up to mid-tier
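The quality-floor rule in the last bullet is mechanical enough to encode directly. A sketch, using the 15% rejection-rate floor from the bullet above; the tier list and escalation-by-one-step policy are illustrative:

```python
TIERS = ["cheap", "mid", "premium"]

def apply_quality_floor(current_min: str, rejection_rate: float,
                        floor: float = 0.15) -> str:
    """If the current minimum tier's rejection rate (thumbs-down plus
    regenerate clicks over total responses) breaches the floor, move the
    routing boundary up one tier."""
    if rejection_rate > floor and current_min != TIERS[-1]:
        return TIERS[TIERS.index(current_min) + 1]
    return current_min
```

Running this check on a schedule turns the quality floor into a self-correcting control loop rather than a dashboard number someone has to notice.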

Implementation Reference

| Component | Option | Notes |
|---|---|---|
| Routing framework | RouteLLM (open-source), custom middleware | RouteLLM has pre-trained classifiers ready to use |
| PII detection | spaCy NER, AWS Comprehend, Microsoft Presidio | Presidio is open-source, enterprise-grade, configurable |
| Circuit breaker | resilience4j (Java), pybreaker (Python), custom | Per-model circuit breaker; reset after 60s |
| Local model hosting | Ollama, vLLM, Llama.cpp | vLLM for high-throughput production; Ollama for simpler setups |
| Cost tracking | Langfuse, LangSmith, custom telemetry | Tag all LLM calls with routing tier and decision metadata |
| Multi-provider SDK | LiteLLM | Single API surface across 100+ models; built-in fallback and retry |

LiteLLM: the practical foundation

LiteLLM provides a unified OpenAI-compatible API across Anthropic, Google, Azure, Cohere, local Ollama, and 100+ other providers. It handles retries, fallbacks, and load balancing out of the box. Building a hybrid router on top of LiteLLM means you only write the routing decision logic, not the per-provider integration code.
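Concretely, the router's output becomes a model-ID string and LiteLLM does the rest. A sketch; the ID strings follow LiteLLM's "provider/model" convention, but the specific IDs below are illustrative and should be checked against your providers' current model lists:

```python
# Tier-to-model mapping for a LiteLLM-backed router. The exact model IDs
# are illustrative; verify them against LiteLLM's provider docs.
MODEL_FOR_TIER = {
    "cheap": "anthropic/claude-3-haiku-20240307",
    "mid": "anthropic/claude-3-5-sonnet-20240620",
    "premium": "openai/gpt-4o",
    "local": "ollama/llama3.1",
}

def pick_model(tier: str) -> str:
    return MODEL_FOR_TIER[tier]

# With LiteLLM installed, the routed call is the same function regardless
# of provider (not executed here; requires API keys / a running Ollama):
#
#   from litellm import completion
#   resp = completion(model=pick_model("cheap"),
#                     messages=[{"role": "user", "content": "What is RAG?"}])
```

Swapping a tier's backing model then becomes a one-line config change rather than a new integration.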

Checklist: Do You Understand This?

  • Why does hybrid routing reduce model costs by 60–80% on most applications?
  • What are the four routing dimensions and which one is a hard constraint that overrides the others?
  • What is a complexity classifier and what are three ways to implement one?
  • How does a fallback chain differ from a simple retry, and what is a circuit breaker?
  • At what request volume does a local model become cost-competitive with a $3/M token cloud API?
  • What metrics should you track to validate that routing decisions are not degrading quality?
  • What is LiteLLM and why is it a useful foundation for a hybrid router implementation?