Hybrid Router Pattern
Not every query needs your most expensive model. A hybrid router dynamically directs requests to the right model based on task complexity, cost constraints, data sensitivity, and latency requirements: it delivers frontier quality where it matters while keeping average cost per query low.
Why Hybrid Routing
The cost spread between the cheapest and most expensive frontier models is roughly 100×. Claude Haiku and Gemini Flash process ~1M tokens for under $1; Claude Opus and GPT-5 cost $15–$75 per million tokens. For applications that handle thousands of queries per day, routing even 70% of requests to a cheaper tier can reduce model costs by 60–80% with negligible quality loss on routine tasks.
| Tier | Models | Cost (input/M tokens, approx.) | Best for |
|---|---|---|---|
| Cheap / Fast | Claude Haiku, Gemini Flash, GPT-4o mini, Llama 3.1 8B | $0.08–$0.30 | Simple Q&A, classification, extraction, short summaries |
| Mid-tier | Claude Sonnet, GPT-4o, Gemini Pro, Llama 3.1 70B | $1–$5 | Most everyday tasks: coding help, analysis, longer documents |
| Premium | Claude Opus, GPT-5, o3, Gemini Ultra | $15–$75 | Complex reasoning, hard coding problems, research synthesis |
| Local / On-premise | Llama 3.1 8B/70B, Mistral, Phi-3, DeepSeek-R1 | Hardware + electricity only | Privacy-sensitive data, offline requirements, high-volume cost control |
Routing Dimensions
A router evaluates each incoming request across multiple dimensions simultaneously:
Task complexity
Is this a factual lookup, a simple rewrite, or a multi-step reasoning problem? Simple tasks (FAQ answers, entity extraction, classification) route to the cheap tier; complex reasoning, code generation, and ambiguous multi-constraint tasks route to the premium tier. The classifier learns to distinguish these categories.
Data sensitivity
Queries containing PII, financial data, health records, or confidential IP must not be sent to external cloud APIs. Route these to an on-premise or private cloud model regardless of complexity. This is a hard constraint that overrides cost optimisation.
Latency requirements
Real-time interfaces (chat, voice, autocomplete) require low-latency models. Background batch jobs (nightly report generation, email digests) can use slower premium models without affecting user experience. Route based on the calling context, not just the query.
Cost budget
Budgets can be per-user, per-day, or per-tenant. When a user has exhausted their budget, route to cheaper models; enterprise plans get access to the premium tier, while free-tier users get the cheap tier only. The router enforces these policies programmatically, as sketched below.
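A minimal sketch of budget-aware capping, assuming hypothetical plan names, budget figures, and an in-memory spend tracker:

```python
# Hypothetical sketch: cap the routing tier by plan and remaining daily budget.
# Plan names, budget figures, and the spend store are illustrative assumptions.
TIER_ORDER = ["cheap", "mid", "premium"]
MAX_TIER_BY_PLAN = {"free": "cheap", "pro": "mid", "enterprise": "premium"}
DAILY_BUDGET_USD = {"free": 0.10, "pro": 2.00, "enterprise": 50.00}

daily_spend_usd: dict[str, float] = {}  # user_id -> spend so far today

def cap_tier(requested_tier: str, user_id: str, plan: str) -> str:
    """Never route above the plan's ceiling; drop to cheap once the budget is spent."""
    ceiling = MAX_TIER_BY_PLAN[plan]
    tier = min(requested_tier, ceiling, key=TIER_ORDER.index)  # lower of the two
    if daily_spend_usd.get(user_id, 0.0) >= DAILY_BUDGET_USD[plan]:
        tier = "cheap"
    return tier
```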
Classifier-Based Routing
The most reliable routing approach uses a lightweight classifier to predict which model tier is appropriate for a given query:
The PII check is a hard gate: data sensitivity routing overrides all cost-optimisation decisions.
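A minimal sketch of this flow; `contains_pii` and `classify_complexity` are naive stand-ins for the detectors and classifier options covered below, and the model names are illustrative:

```python
# Sketch of the overall flow. contains_pii() and classify_complexity() are
# placeholders for the real detectors and classifiers discussed below.
MODEL_BY_TIER = {          # illustrative model choices per tier
    "cheap": "claude-haiku",
    "mid": "claude-sonnet",
    "premium": "claude-opus",
    "local": "llama-3.1-70b",
}

def contains_pii(query: str) -> bool:
    return "@" in query    # naive stand-in; see the data-sensitivity section

def classify_complexity(query: str) -> str:
    return "cheap" if len(query.split()) < 30 else "mid"  # naive stand-in

def route(query: str) -> str:
    # Hard gate first: sensitive queries stay on the private deployment,
    # whatever the complexity classifier would prefer.
    if contains_pii(query):
        return MODEL_BY_TIER["local"]
    return MODEL_BY_TIER[classify_complexity(query)]
```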
Options for the complexity classifier
| Approach | Accuracy | Latency overhead | Notes |
|---|---|---|---|
| Small LLM call (Haiku / Phi-3) | High | 100–300ms | Prompt the small model to classify the query; adds cost but works well out of the box (sketched below) |
| Fine-tuned text classifier (BERT-size) | Very high | 5–20ms | Train on labelled routing examples; near-zero overhead; best for production |
| Embedding similarity | Medium | 20–50ms | Embed query; find nearest labelled examples; decent for well-separated categories |
| Heuristic rules | Low–medium | <1ms | Query length, keyword presence, structured vs free-form; useful as a fast first pass |
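As a concrete example of the first option, the sketch below prompts a small model through LiteLLM's OpenAI-compatible `completion` call (introduced later in this section); the model identifier, prompt wording, and fallback default are illustrative:

```python
from litellm import completion  # pip install litellm; needs the provider's API key

CLASSIFY_PROMPT = (
    "Classify the user query into exactly one of: cheap, mid, premium.\n"
    "cheap = lookup/extraction/short rewrite; mid = everyday coding/analysis;\n"
    "premium = multi-step reasoning or hard synthesis.\n"
    "Reply with the single word only.\n\nQuery: {query}"
)

def classify_with_small_llm(query: str) -> str:
    # Model identifier is illustrative; any cheap, fast model works here.
    resp = completion(
        model="claude-3-haiku-20240307",
        messages=[{"role": "user", "content": CLASSIFY_PROMPT.format(query=query)}],
        max_tokens=5,
    )
    answer = resp.choices[0].message.content.strip().lower()
    return answer if answer in {"cheap", "mid", "premium"} else "mid"  # safe default
```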
RouteLLM (open-source, from LMSYS) is a purpose-built routing framework with pre-trained classifiers. It is trained on human preference data from Chatbot Arena and can reduce model costs by 40–85% while maintaining 95%+ of the quality of always using the premium model.
Rule-Based Routing
For predictable task categories, deterministic rules are simpler, faster, and more auditable than a classifier (a combined sketch follows the list):
- By endpoint or feature: code completion always uses mid-tier; customer support FAQ uses cheap tier; technical analysis uses premium tier
- By query length: under 50 tokens → cheap; 50–500 tokens → mid; over 500 tokens or document-length → premium
- By keyword/pattern: queries matching "calculate", "compare", "analyse" route up; simple question patterns ("what is", "define", "how many") route down
- By user tier: the API key's plan determines the maximum tier available
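A sketch combining these rules; the patterns, thresholds, and precedence order (route-down checks before route-up checks) are illustrative design choices, not fixed prescriptions:

```python
import re

# Deterministic first-pass rules; patterns and thresholds are illustrative.
UP_PATTERNS = re.compile(r"\b(calculate|compare|analyse|prove|optimi[sz]e)\b", re.I)
DOWN_PATTERNS = re.compile(r"^\s*(what is|define|how many)\b", re.I)

def rule_based_tier(query: str, plan_max_tier: str = "premium") -> str:
    n_tokens = len(query.split())  # crude token estimate
    if DOWN_PATTERNS.search(query) or n_tokens < 50:
        tier = "cheap"
    elif UP_PATTERNS.search(query) or n_tokens > 500:
        tier = "premium"
    else:
        tier = "mid"
    # User tier caps the result (cheap < mid < premium).
    order = ["cheap", "mid", "premium"]
    return tier if order.index(tier) <= order.index(plan_max_tier) else plan_max_tier
```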
Data Sensitivity Routing
Data sensitivity routing is the most critical routing dimension from a compliance perspective and must run before any other routing decision; a detection sketch follows the table:
| Data type | Detection method | Required routing |
|---|---|---|
| PII (names, emails, SSN, phone numbers) | Regex + NER model (spaCy, AWS Comprehend) | On-premise model or approved private cloud region |
| Financial data (account numbers, transactions) | Pattern matching + context classification | On-premise or vendor with BAA/DPA agreement |
| Health records (PHI under HIPAA) | PHI entity recognition | HIPAA-compliant vendor only (Azure OpenAI, AWS Bedrock) |
| Confidential IP / trade secrets | Document classification, user-flagged content | On-premise or enterprise contract with data non-training guarantee |
| Non-sensitive | No flags triggered | Any tier, cost-optimise freely |
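A minimal PII gate using Microsoft Presidio's analyzer; the entity list and score threshold are illustrative choices:

```python
# Minimal PII gate using Microsoft Presidio.
# Requires: pip install presidio-analyzer, plus a spaCy model
# (e.g. python -m spacy download en_core_web_lg).
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()  # loads spaCy NER under the hood

def contains_pii(text: str, threshold: float = 0.5) -> bool:
    # Entity list and threshold are illustrative; tune both for your data.
    results = analyzer.analyze(
        text=text,
        entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"],
        language="en",
    )
    return any(r.score >= threshold for r in results)
```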
Fallback Chains
A fallback chain handles model unavailability, rate limits, and quality failures gracefully rather than returning an error to the user. This requires tracking which models are currently degraded: maintain a circuit breaker per model, and if a model returns 3+ consecutive errors, mark it degraded and skip directly to its fallback for subsequent requests (reset after 60 seconds).
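A minimal per-model breaker plus fallback loop, using the thresholds from the text; `call_model` is a placeholder for your actual provider call:

```python
import time

# Per-model circuit breaker: 3+ consecutive errors -> degraded for 60 seconds.
class CircuitBreaker:
    def __init__(self, error_threshold: int = 3, cooldown_s: float = 60.0):
        self.error_threshold = error_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_errors = 0
        self.degraded_until = 0.0

    def available(self) -> bool:
        return time.monotonic() >= self.degraded_until

    def record_success(self) -> None:
        self.consecutive_errors = 0

    def record_error(self) -> None:
        self.consecutive_errors += 1
        if self.consecutive_errors >= self.error_threshold:
            self.degraded_until = time.monotonic() + self.cooldown_s
            self.consecutive_errors = 0  # breaker resets after the cooldown

breakers: dict[str, CircuitBreaker] = {}

def call_with_fallbacks(chain: list[str], call_model) -> str:
    """Try each model in the chain; call_model(name) is the actual API call."""
    for model in chain:
        breaker = breakers.setdefault(model, CircuitBreaker())
        if not breaker.available():
            continue  # model is degraded; skip straight to the fallback
        try:
            result = call_model(model)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_error()
    raise RuntimeError("all models in the fallback chain failed or are degraded")
```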
Local + Cloud Hybrid
The most cost-effective production pattern combines a local open-weight model for routine traffic with cloud APIs for tasks that require frontier capability (a dispatch sketch follows the lists):
Local model handles
- Simple Q&A, classification, extraction
- All queries containing sensitive data
- High-volume, low-stakes tasks
- Offline or air-gapped requirements
- Real-time latency requirements (avoid network)
Cloud API handles
- Complex reasoning and multi-step analysis
- Long-document understanding (>64K tokens)
- Hard coding and debugging tasks
- Tasks where local model outputs have been poor
- Multimodal inputs if local model lacks capability
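A dispatch sketch under these rules; the predicates and model names are placeholders for the checks described above:

```python
# Sketch of the local/cloud split; predicates and model names are illustrative.
LOCAL_MODEL = "ollama/llama3.1:70b"   # served locally, e.g. via Ollama
CLOUD_MODEL = "claude-sonnet"         # placeholder frontier model name

def choose_backend(query: str, *, sensitive: bool,
                   needs_long_context: bool, is_complex: bool) -> str:
    if sensitive:
        return LOCAL_MODEL    # hard gate: sensitive data stays local
    if needs_long_context or is_complex:
        return CLOUD_MODEL    # frontier capability required
    return LOCAL_MODEL        # routine traffic stays on cheap local compute
```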
A GPU machine running Llama 3.1 70B can handle ~50–100 concurrent requests at ~$0.001 per request in electricity cost. Compared to Claude Sonnet at $3/M tokens, the break-even is roughly 10,000 requests per month, achievable for any moderate-traffic application.
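To make the break-even arithmetic explicit: with per-request costs c_cloud and c_local and a fixed monthly cost F for the local machine, the break-even volume is N = F / (c_cloud − c_local). The tokens-per-request and amortised hardware figures below are assumptions chosen to illustrate the formula, not numbers from the text or vendor pricing pages:

```python
# Break-even: monthly volume N where cloud cost equals local cost.
# N * cloud_per_req = N * local_per_req + fixed_local_monthly
# => N = fixed_local_monthly / (cloud_per_req - local_per_req)
tokens_per_request = 10_000          # assumption: long-ish prompts
cloud_per_req = 3.0 / 1_000_000 * tokens_per_request  # $3/M tokens -> $0.03
local_per_req = 0.001                # electricity, per the text
fixed_local_monthly = 300.0          # assumption: amortised GPU server cost

break_even = fixed_local_monthly / (cloud_per_req - local_per_req)
print(f"break-even ≈ {break_even:,.0f} requests/month")  # ≈ 10,345
```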
Cost Measurement and Quality Testing
A routing system must be continuously monitored to validate that it is making the right trade-offs:
Cost tracking
- Log every routing decision: query ID, routed tier, model used, token counts, cost estimate (a record sketch follows this list)
- Track the cost-per-query distribution, not just the average (outliers can dominate total cost)
- Track routing distribution: what % of queries go to each tier, and how this changes over time
- Alert when a routing tier's share shifts significantly; this may indicate classifier drift or changed traffic patterns
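A sketch of the per-decision record; the field names and the JSON-to-stdout sink are illustrative, and usage field names vary by provider:

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict

# One structured record per routing decision; field names are illustrative.
@dataclass
class RoutingRecord:
    query_id: str
    tier: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    ts: float

def log_routing_decision(tier: str, model: str, usage: dict, cost_usd: float) -> None:
    record = RoutingRecord(
        query_id=str(uuid.uuid4()),
        tier=tier,
        model=model,
        input_tokens=usage["input_tokens"],     # key names vary by provider
        output_tokens=usage["output_tokens"],
        cost_usd=cost_usd,
        ts=time.time(),
    )
    print(json.dumps(asdict(record)))  # stand-in for your telemetry sink
```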
Quality monitoring
- A/B test routing against a single-model baseline on a sample of queries
- Use LLM-as-judge to score routed outputs vs premium model outputs on the same queries
- Track user feedback signals (thumbs down, regenerate clicks) broken down by routed tier
- Set a quality floor: if the cheap tier's rejection rate exceeds 15%, move the routing boundary up to mid-tier (a check is sketched below)
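The quality-floor rule reduces to a small check; the counter handling here is illustrative:

```python
# Quality floor per the rule above: if the cheap tier's negative-feedback
# rate exceeds 15%, shift the routing boundary up to mid-tier.
REJECTION_FLOOR = 0.15

def minimum_tier(rejections: int, total: int) -> str:
    """Return the lowest tier that simple queries should route to."""
    if total > 0 and rejections / total > REJECTION_FLOOR:
        return "mid"    # cheap tier is underperforming; raise the boundary
    return "cheap"
```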
Implementation Reference
| Component | Option | Notes |
|---|---|---|
| Routing framework | RouteLLM (open-source), custom middleware | RouteLLM has pre-trained classifiers ready to use |
| PII detection | spaCy NER, AWS Comprehend, Microsoft Presidio | Presidio is open-source, enterprise-grade, configurable |
| Circuit breaker | resilience4j (Java), pybreaker (Python), custom | Per-model circuit breaker; reset after 60s |
| Local model hosting | Ollama, vLLM, llama.cpp | vLLM for high-throughput production; Ollama for simpler setups |
| Cost tracking | Langfuse, LangSmith, custom telemetry | Tag all LLM calls with routing tier and decision metadata |
| Multi-provider SDK | LiteLLM | Single API surface across 100+ models; built-in fallback and retry |
LiteLLM: the practical foundation
LiteLLM provides a unified OpenAI-compatible API across Anthropic, Google, Azure, Cohere, local Ollama, and 100+ other providers. It handles retries, fallbacks, and load balancing out of the box. Building a hybrid router on top of LiteLLM means you only write the routing decision logic, not the per-provider integration code.
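A minimal illustration of the unified call shape; the model identifiers are examples (the Ollama one assumes a local Ollama server is running), so check LiteLLM's docs for current names:

```python
# Same call shape regardless of provider; only the model string changes.
# Model identifiers are illustrative; each needs its provider configured
# (API key for Anthropic, a running local server for Ollama).
from litellm import completion

messages = [{"role": "user", "content": "Summarise this in one sentence: ..."}]

cloud = completion(model="claude-3-haiku-20240307", messages=messages)
local = completion(model="ollama/llama3.1", messages=messages)

print(cloud.choices[0].message.content)
print(local.choices[0].message.content)
```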
Checklist: Do You Understand This?
- Why does hybrid routing reduce model costs by 60–80% on most applications?
- What are the four routing dimensions and which one is a hard constraint that overrides the others?
- What is a complexity classifier and what are three ways to implement one?
- How does a fallback chain differ from a simple retry, and what is a circuit breaker?
- At what request volume does a local model become cost-competitive with a $3/M token cloud API?
- What metrics should you track to validate that routing decisions are not degrading quality?
- What is LiteLLM and why is it a useful foundation for a hybrid router implementation?