Model Routing: Cheap-First Cascades That Actually Work
We were spending $18,000/month on Claude Sonnet for a developer assistant tool. Ninety percent of the queries were things like "what does this error mean?" and "how do I do X in Python?" — questions Haiku answers correctly. We were paying Sonnet prices for Haiku-level work.
After implementing a routing layer, monthly spend dropped to $4,200. The degradation in user-reported quality: undetectable. The catch: we got the routing logic wrong twice before it worked, and the second failure was worse than not routing at all.
This post is about building cascades that work, not just cascades that save money.
The Core Idea
Model routing is the practice of sending queries to the cheapest model that can answer them correctly, and escalating to more capable (more expensive) models only when needed.
```mermaid
flowchart TD
    R{Router} -->|Simple| H[Haiku<br/>~$0.25/M tokens]
    R -->|Medium complexity| S[Sonnet<br/>~$3/M tokens]
    R -->|Complex / High stakes| O[Opus<br/>~$15/M tokens]
    H -->|Confidence too low| S
    S -->|Confidence too low| O
    H --> A[Answer]
    S --> A
    O --> A
```
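To see why the economics work, a quick back-of-envelope sketch using the per-tier prices from the diagram (the 80/15/5 traffic split is illustrative, not a claim about your workload):

```python
# Illustrative $/M input-token prices matching the diagram above
PRICES = {"haiku": 0.25, "sonnet": 3.0, "opus": 15.0}

def blended_price(mix: dict[str, float]) -> float:
    """Expected $/M tokens for a given traffic split across tiers."""
    return sum(share * PRICES[tier] for tier, share in mix.items())

cascade = blended_price({"haiku": 0.80, "sonnet": 0.15, "opus": 0.05})  # ≈ 1.40
all_sonnet = blended_price({"sonnet": 1.0})                             # 3.00
```

Even with 5% of traffic on Opus, the blended rate is under half of all-Sonnet. Escalated queries pay for two calls, so escalation rate belongs in any serious version of this model.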
The router can be a separate classifier model, a set of heuristics, or the cheap model itself (self-routing). Each approach has different tradeoffs.
Routing Strategies
1. Heuristic Routing
The simplest approach — classify queries before they hit any LLM based on measurable properties.
````python
from dataclasses import dataclass
from enum import Enum
import re

class Tier(Enum):
    HAIKU = "claude-haiku-3-5"
    SONNET = "claude-sonnet-4-5"
    OPUS = "claude-opus-4"

@dataclass
class RoutingSignals:
    token_count: int
    has_code_block: bool
    has_multi_step_instruction: bool
    query_type: str  # "factual" | "analytical" | "creative" | "coding"

def heuristic_route(signals: RoutingSignals) -> Tier:
    # Immediate Opus escalation triggers
    if signals.token_count > 4000:
        return Tier.OPUS
    if signals.query_type == "analytical" and signals.has_multi_step_instruction:
        return Tier.OPUS
    # Sonnet triggers
    if signals.query_type in ("coding", "analytical"):
        return Tier.SONNET
    if signals.has_code_block and signals.token_count > 500:
        return Tier.SONNET
    # Default: Haiku
    return Tier.HAIKU

def extract_signals(query: str) -> RoutingSignals:
    return RoutingSignals(
        token_count=int(len(query.split()) * 1.3),  # rough approximation
        has_code_block=bool(re.search(r"```", query)),
        has_multi_step_instruction=bool(
            re.search(r"\b(then|after that|next|finally|step \d)\b", query, re.I)
        ),
        query_type=classify_query_type(query),  # small classifier or keyword match
    )
````

Heuristic routing is fast (no LLM call overhead), predictable, and debuggable. Its weakness: it can't see query difficulty, only query shape. A simple-looking question about obscure kernel behavior will route to Haiku and come back wrong.
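`extract_signals` assumes a `classify_query_type` helper. A minimal keyword-match version might look like the sketch below; the patterns are invented for illustration and would need tuning against your actual traffic:

```python
import re

# Ordered from most to least specific; first match wins.
_PATTERNS = [
    ("coding", re.compile(r"\b(code|function|bug|error|traceback|compile)\b", re.I)),
    ("analytical", re.compile(r"\b(compare|analyze|trade-?offs?|why|evaluate)\b", re.I)),
    ("creative", re.compile(r"\b(write|draft|brainstorm|story)\b", re.I)),
]

def classify_query_type(query: str) -> str:
    for label, pattern in _PATTERNS:
        if pattern.search(query):
            return label
    return "factual"  # default bucket routes to the cheap tier
```

When keyword matching stops being enough, swap this function for the small trained classifier described below without touching the rest of the routing code.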
2. Self-Routing (Cheap Model Confidence)
Let Haiku attempt the query and escalate based on its own expressed confidence.
```python
import json
import re

import anthropic

client = anthropic.Anthropic()

CONFIDENCE_PROMPT = """Answer the following query. At the end of your response,
output a JSON block with your confidence:
{"confidence": <0.0-1.0>, "needs_escalation": <true|false>, "reason": "<why>"}
Only set needs_escalation to true if you're genuinely uncertain or the question
requires capabilities you don't have. Be honest."""

def self_routing_query(user_query: str) -> dict:
    haiku_response = client.messages.create(
        model="claude-haiku-3-5",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": f"{CONFIDENCE_PROMPT}\n\nQuery: {user_query}"}
        ],
    )
    content = haiku_response.content[0].text
    # Parse the trailing confidence block; fall back to neutral defaults
    try:
        json_match = re.search(r'\{[^}]*"confidence"[^}]*\}', content, re.DOTALL)
        confidence_data = json.loads(json_match.group()) if json_match else {}
    except (json.JSONDecodeError, AttributeError):
        confidence_data = {"confidence": 0.5, "needs_escalation": False}
    needs_escalation = (
        confidence_data.get("needs_escalation", False)
        or confidence_data.get("confidence", 1.0) < 0.75
    )
    if needs_escalation:
        # escalate_to_sonnet re-issues the query at the next tier (definition elided)
        return escalate_to_sonnet(user_query, haiku_response=content)
    return {"answer": content, "model_used": "haiku", "cost_tier": "cheap"}
```

The problem with self-routing: models are poorly calibrated on their own uncertainty. Haiku will confidently give a wrong answer without flagging escalation. You need an external calibration step.
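That calibration step is empirical: log Haiku's stated confidence alongside a ground-truth correctness label (from evals or user feedback), then bucket the pairs to see what a stated 0.8 actually means. A minimal sketch:

```python
from collections import defaultdict

def calibration_table(log: list[tuple[float, bool]], n_bins: int = 5) -> dict:
    """Map stated-confidence buckets to empirical accuracy.
    log entries: (stated_confidence, was_correct)."""
    bins = defaultdict(list)
    for conf, correct in log:
        b = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the top bin
        bins[b].append(correct)
    return {
        (b / n_bins, (b + 1) / n_bins): sum(v) / len(v)
        for b, v in sorted(bins.items())
    }
```

Pick your escalation threshold from this table, not from the model's raw number: escalate whenever a bucket's empirical accuracy falls below your quality bar.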
3. Classifier-Based Routing (Recommended for Production)
Train or fine-tune a small classifier on labeled examples from your actual query distribution. The classifier predicts the minimum model tier needed to answer correctly.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Training data structure: (query, minimum_tier_label)
# Tier labels: 0=haiku, 1=sonnet, 2=opus
# Generate labels by running queries through all tiers and evaluating outputs

def build_routing_classifier(training_data: list[tuple[str, int]]) -> Pipeline:
    queries, labels = zip(*training_data)
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(
            ngram_range=(1, 2),
            max_features=10000,
            sublinear_tf=True,
        )),
        ("clf", LogisticRegression(
            C=1.0,
            class_weight="balanced",  # important: escalation errors cost more than over-escalation
            max_iter=1000,
        )),
    ])
    pipeline.fit(list(queries), list(labels))
    return pipeline

def route_with_classifier(query: str, classifier: Pipeline) -> Tier:
    probs = classifier.predict_proba([query])[0]
    # Conservative routing: even a modest Opus probability forces Opus.
    # Misrouting to a cheaper model is worse than misrouting to an expensive one.
    if probs[2] > 0.15:  # Opus probability
        return Tier.OPUS
    if probs[1] > 0.30 or probs[2] > 0.05:  # Sonnet probability
        return Tier.SONNET
    return Tier.HAIKU
```

The key insight: asymmetric error costs. Routing a hard query to Haiku (false negative) costs you accuracy. Routing an easy query to Opus (false positive) costs you money. Money is recoverable. Trust is not. Design your decision boundaries accordingly.
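One way to make that asymmetry concrete when tuning the thresholds in `route_with_classifier` is a cost matrix over (true minimum tier, routed tier) pairs. The weights below are invented for illustration; derive yours from measured accuracy loss and real prices:

```python
import numpy as np

# Rows: true minimum tier; cols: routed tier (0=haiku, 1=sonnet, 2=opus).
# Above-diagonal entries (over-routing) waste money; below-diagonal entries
# (under-routing) cost accuracy and are weighted much more heavily.
COSTS = np.array([
    [0.0,  1.0,  3.0],
    [10.0, 0.0,  2.0],
    [25.0, 10.0, 0.0],
])

def total_routing_cost(true_tiers: list[int], routed_tiers: list[int]) -> float:
    return float(sum(COSTS[t, r] for t, r in zip(true_tiers, routed_tiers)))
```

Sweep the two probability thresholds on a held-out set and keep the pair that minimizes this cost, rather than optimizing raw classification accuracy.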
Escalation Logic
Pure static routing isn't enough. You need runtime escalation when confidence signals indicate a problem.
```mermaid
sequenceDiagram
    participant User
    participant Router
    participant Haiku
    participant EvalChecker
    participant Sonnet
    User->>Router: Query
    Router->>Haiku: Route to cheap tier
    Haiku-->>EvalChecker: Candidate response
    Note over EvalChecker: Correct? Complete?<br/>Contradicts known facts?
    alt Passes eval
        EvalChecker-->>User: Return Haiku response
    else Fails eval
        EvalChecker->>Sonnet: Escalate with context
        Sonnet-->>EvalChecker: Response
        EvalChecker-->>User: Return Sonnet response
        EvalChecker->>Router: Log: this query type needs escalation
    end
```
```python
import json

from anthropic import Anthropic
from sklearn.pipeline import Pipeline

client = Anthropic()

EVAL_PROMPT = """You are a quality evaluator. Given a query and a response,
determine if the response is:
1. Factually correct (to the best of your knowledge)
2. Complete (actually answers what was asked)
3. Coherent (not internally contradictory)
Respond with JSON only: {"passes": true/false, "issues": ["list of problems"]}"""

def eval_response_quality(query: str, response: str) -> dict:
    # Use a mid-tier model for evaluation — not Haiku (same capability,
    # same blind spots) and not Opus (too expensive for every call)
    eval_result = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": f"{EVAL_PROMPT}\n\nQuery: {query}\n\nResponse: {response}"
        }],
    )
    try:
        return json.loads(eval_result.content[0].text)
    except json.JSONDecodeError:
        return {"passes": False, "issues": ["Eval parse error — escalating to be safe"]}

def cascade_query(query: str, router: Pipeline) -> dict:
    initial_tier = route_with_classifier(query, router)
    if initial_tier == Tier.HAIKU:
        haiku_resp = call_model(Tier.HAIKU, query)  # thin wrapper over messages.create
        eval_result = eval_response_quality(query, haiku_resp)
        if eval_result["passes"]:
            return {"answer": haiku_resp, "model_used": "haiku", "escalated": False}
        # Escalate to Sonnet, passing the failed attempt along for context
        sonnet_resp = call_model(Tier.SONNET, query, prior_attempt=haiku_resp)
        return {"answer": sonnet_resp, "model_used": "sonnet", "escalated": True,
                "escalation_reason": eval_result["issues"]}
    # ... handle other initial tiers
```

Note the evaluation model choice: using Sonnet to evaluate Haiku's responses is intentional. If you use Haiku to evaluate itself, you inherit the same blind spots.
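The `call_model` helper is elided above; the interesting part is what `prior_attempt` buys you on escalation. One way to use it (the wording below is an assumption, not from any SDK) is to hand the stronger model the draft to verify rather than making it start cold:

```python
def build_escalation_prompt(query: str, prior_attempt: str) -> str:
    """Prompt for the escalation tier: include the cheaper model's draft
    so the stronger model can verify and correct instead of re-deriving."""
    return (
        f"{query}\n\n"
        "A smaller model produced this draft answer. Verify it, fix any "
        "errors, and answer the original question directly:\n\n"
        f"{prior_attempt}"
    )
```

This tends to reduce the escalated call's output length (the draft is often mostly right), which partially offsets the cost of the second call.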
Cost vs. Quality Curves
Before building a router, instrument your current system to understand your baseline.
```python
import statistics
from collections import defaultdict

def analyze_query_distribution(query_log: list[dict]) -> dict:
    """
    query_log entries: {query, model_used, cost_usd, user_rating (1-5), tokens}
    """
    by_model = defaultdict(list)
    for entry in query_log:
        by_model[entry["model_used"]].append(entry)
    report = {}
    for model, entries in by_model.items():
        costs = [e["cost_usd"] for e in entries]
        ratings = [e["user_rating"] for e in entries if e.get("user_rating")]
        token_counts = sorted(e["tokens"] for e in entries)
        report[model] = {
            "query_count": len(entries),
            "pct_of_total": len(entries) / len(query_log) * 100,
            "avg_cost": statistics.mean(costs),
            "total_cost": sum(costs),
            "avg_rating": statistics.mean(ratings) if ratings else None,
            "p95_tokens": token_counts[min(int(len(entries) * 0.95), len(entries) - 1)],
        }
    return report
```

Run this for two weeks before building a router. You'll likely find that 60-80% of queries in most B2B apps fall in the "easy" category that Haiku can handle. That's your routing opportunity.
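With those numbers in hand you can sanity-check the business case before writing any routing code. A back-of-envelope projection (the overhead factor stands in for eval calls and escalations, and is a guess you should replace with measurements):

```python
def projected_monthly_spend(
    current_spend: float,
    pct_cheap: float,        # fraction of traffic the cheap tier can absorb
    cheap_ratio: float,      # cheap-tier price / current-tier price
    overhead: float = 1.10,  # assumed eval + escalation overhead
) -> float:
    routed = (1 - pct_cheap) + pct_cheap * cheap_ratio
    return current_spend * routed * overhead
```

If the projected number isn't at least 2-3x below current spend, the engineering and maintenance cost of a router probably isn't worth it.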
When the Cascade Fails
Three failure modes that will hurt you:
1. Cascading latency on escalation. If Haiku takes 1.5s, the eval takes 0.8s, and then Sonnet takes 2.5s, your p95 latency on escalated queries is 4.8s. Users notice.
```python
# Mitigation: parallel-speculate for borderline queries
import asyncio

async def speculative_cascade(query: str, confidence: float, router) -> str:
    if confidence < 0.6:  # borderline — run both in parallel, keep Sonnet only if needed
        haiku_task = asyncio.create_task(call_model_async(Tier.HAIKU, query))
        sonnet_task = asyncio.create_task(call_model_async(Tier.SONNET, query))
        haiku_result = await haiku_task
        eval_result = eval_response_quality(query, haiku_result)
        if eval_result["passes"]:
            sonnet_task.cancel()  # you pay for the wasted call; you save the latency
            return haiku_result
        return await sonnet_task  # already running, minimal additional latency
    return await call_model_async(route_with_classifier(query, router), query)
```

2. Distribution shift after deployment. Your router was trained on last month's queries. This month, users start asking a new type of question your router routes to Haiku incorrectly. Set up weekly accuracy sampling.
3. Eval model cost creep. If you're eval-ing 100% of Haiku responses with Sonnet, your eval cost can exceed your routing savings. Sample: run evals on 20% of Haiku responses, stratified by query type.
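A sketch of that stratified sampling, assuming each log entry carries a `query_type` field (the field name is an assumption carried over from the routing signals earlier):

```python
import random
from collections import defaultdict

def stratified_eval_sample(entries: list[dict], rate: float = 0.2,
                           key: str = "query_type", seed: int = 0) -> list[dict]:
    """Pick ~rate of entries for eval, per query type, so rare categories
    still get coverage instead of being drowned out by the head."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for e in entries:
        by_type[e.get(key, "unknown")].append(e)
    sample = []
    for group in by_type.values():
        k = max(1, round(len(group) * rate))  # at least one per category
        sample.extend(rng.sample(group, k))
    return sample
```

The `max(1, ...)` floor matters: a query type with twenty entries a week would otherwise get sampled at zero and become a blind spot.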
Debugging the Cascade
The hardest part of routing systems is debugging silent degradation — cases where queries route incorrectly but users don't complain loudly.
```python
def routing_audit_report(query_log: list[dict], sample_size: int = 100) -> dict:
    """
    Retrospectively checks if routed queries were handled at the right tier.
    Re-runs a sample through the full eval pipeline.
    """
    haiku_queries = [q for q in query_log if q["model_used"] == "haiku"]
    # Most recent Haiku-answered queries, assuming the log is chronological
    sample = haiku_queries[-sample_size:]
    misrouted = []
    for entry in sample:
        eval_result = eval_response_quality(entry["query"], entry["response"])
        if not eval_result["passes"]:
            misrouted.append({
                "query": entry["query"][:100],
                "issues": eval_result["issues"],
                "date": entry["timestamp"],
            })
    return {
        "sample_size": len(sample),
        "misrouted_count": len(misrouted),
        "misroute_rate": len(misrouted) / len(sample) if sample else 0.0,
        "examples": misrouted[:10],
    }
```

Run this weekly. If the misroute rate climbs above 5%, retrain your classifier on the new examples.
Key Takeaways
- Heuristic routing is the right starting point — ship something simple, measure, then add complexity.
- Classifier-based routing with asymmetric cost weighting (escalation errors are worse than conservative routing) outperforms self-routing by a significant margin in production.
- Parallel speculation on borderline queries prevents the latency spike from dominating your p95 metrics.
- Never use the same model tier to evaluate responses from that tier — use one tier up for calibrated quality signals.
- Distribution shift is the silent killer of routing systems: schedule weekly accuracy audits on randomly sampled routed responses.
- The business case is real — 60-80% of queries in most B2B apps can route to the cheap tier without detectable quality loss, yielding 3-5x cost reduction.