
Model Routing: Cheap-First Cascades That Actually Work

Ravinder · 8 min read
AI · LLM · Cost Optimization · Architecture · Routing

We were spending $18,000/month on Claude Sonnet for a developer assistant tool. Ninety percent of the queries were things like "what does this error mean?" and "how do I do X in Python?" — questions Haiku answers correctly. We were paying Sonnet prices for Haiku-level work.

After implementing a routing layer, monthly spend dropped to $4,200. The degradation in user-reported quality: undetectable. The catch: we got the routing logic wrong twice before it worked, and the second failure was worse than not routing at all.

This post is about building cascades that work, not just cascades that save money.

The Core Idea

Model routing is the practice of sending queries to the cheapest model that can answer them correctly, and escalating to more capable (more expensive) models only when needed.

graph LR
    Q[User Query] --> R[Router]
    R -->|Simple / Low complexity| H["Haiku<br/>~$0.25/M tokens"]
    R -->|Medium complexity| S["Sonnet<br/>~$3/M tokens"]
    R -->|Complex / High stakes| O["Opus<br/>~$15/M tokens"]
    H -->|Confidence too low| S
    S -->|Confidence too low| O
    H --> A[Answer]
    S --> A
    O --> A

The router can be a separate classifier model, a set of heuristics, or the cheap model itself (self-routing). Each approach has different tradeoffs.

Routing Strategies

1. Heuristic Routing

The simplest approach — classify queries before they hit any LLM based on measurable properties.

from dataclasses import dataclass
from enum import Enum
import re
 
class Tier(Enum):
    HAIKU = "claude-haiku-3-5"
    SONNET = "claude-sonnet-4-5"
    OPUS = "claude-opus-4"
 
@dataclass
class RoutingSignals:
    token_count: int
    has_code_block: bool
    has_multi_step_instruction: bool
    query_type: str  # "factual" | "analytical" | "creative" | "coding"
 
def heuristic_route(signals: RoutingSignals) -> Tier:
    # Immediate Opus escalation triggers
    if signals.token_count > 4000:
        return Tier.OPUS
    if signals.query_type == "analytical" and signals.has_multi_step_instruction:
        return Tier.OPUS
 
    # Sonnet triggers
    if signals.query_type in ("coding", "analytical"):
        return Tier.SONNET
    if signals.has_code_block and signals.token_count > 500:
        return Tier.SONNET
 
    # Default: Haiku
    return Tier.HAIKU
 
def extract_signals(query: str) -> RoutingSignals:
    return RoutingSignals(
        token_count=int(len(query.split()) * 1.3),  # rough tokens-per-word approximation
        has_code_block=bool(re.search(r"```", query)),
        has_multi_step_instruction=bool(
            re.search(r"\b(then|after that|next|finally|step \d)\b", query, re.I)
        ),
        query_type=classify_query_type(query),  # small classifier or keyword match
    )
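
The classify_query_type helper above does real work, so here's a minimal keyword-match sketch to make the block runnable; the categories match the RoutingSignals field, and the keyword lists are illustrative placeholders, not a tuned taxonomy:

# Rough keyword-match sketch for classify_query_type. The keyword lists
# below are illustrative placeholders; tune them on your own traffic.
CODING_HINTS = ("error", "traceback", "function", "compile", "debug", "bug")
ANALYTICAL_HINTS = ("compare", "tradeoff", "why does", "evaluate", "design", "architecture")
CREATIVE_HINTS = ("write a", "draft", "brainstorm", "slogan", "name for")

def classify_query_type(query: str) -> str:
    q = query.lower()
    if "```" in query or any(k in q for k in CODING_HINTS):
        return "coding"
    if any(k in q for k in ANALYTICAL_HINTS):
        return "analytical"
    if any(k in q for k in CREATIVE_HINTS):
        return "creative"
    return "factual"  # default bucket routes to the cheap tier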

Heuristic routing is fast (no LLM call overhead), predictable, and debuggable. Its weakness: it can't see query difficulty, only query shape. A simple-looking question about obscure kernel behavior will route to Haiku and come back wrong.

2. Self-Routing (Cheap Model Confidence)

Let Haiku attempt the query and escalate based on its own expressed confidence.

import anthropic
import json
import re
 
client = anthropic.Anthropic()
 
CONFIDENCE_PROMPT = """Answer the following query. At the end of your response, 
output a JSON block with your confidence:
 
{"confidence": <0.0-1.0>, "needs_escalation": <true|false>, "reason": "<why>"}
 
Only set needs_escalation to true if you're genuinely uncertain or the question 
requires capabilities you don't have. Be honest."""
 
def self_routing_query(user_query: str) -> dict:
    haiku_response = client.messages.create(
        model="claude-haiku-3-5",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": f"{CONFIDENCE_PROMPT}\n\nQuery: {user_query}"}
        ],
    )
 
    content = haiku_response.content[0].text
 
    # Parse confidence block
    try:
        json_match = re.search(r'\{[^}]*"confidence"[^}]*\}', content, re.DOTALL)
        confidence_data = json.loads(json_match.group()) if json_match else {}
    except (json.JSONDecodeError, AttributeError):
        confidence_data = {"confidence": 0.5, "needs_escalation": False}
 
    needs_escalation = (
        confidence_data.get("needs_escalation", False)
        or confidence_data.get("confidence", 1.0) < 0.75
    )
 
    if needs_escalation:
        return escalate_to_sonnet(user_query, haiku_response=content)
 
    return {"answer": content, "model_used": "haiku", "cost_tier": "cheap"}

The problem with self-routing: models are poorly calibrated on their own uncertainty. Haiku will confidently give a wrong answer without flagging escalation. You need an external calibration step.
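
A cheap external step: bin Haiku's self-reported confidence against measured accuracy on a labeled sample, then route on the calibrated number instead of the raw one. A minimal sketch, assuming you've collected (reported confidence, was-correct) pairs from offline evals:

import numpy as np

def build_calibration_map(reported: list[float], correct: list[bool], n_bins: int = 10):
    """Map raw self-reported confidence to empirical accuracy per bin."""
    reported_arr = np.asarray(reported)
    correct_arr = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(reported_arr, bins) - 1, 0, n_bins - 1)
    # Empirical accuracy per confidence bin; empty bins fall back to the
    # bin midpoint (i.e., trust the raw score there)
    accuracy = np.array([
        correct_arr[idx == b].mean() if (idx == b).any() else (bins[b] + bins[b + 1]) / 2
        for b in range(n_bins)
    ])

    def calibrate(raw_confidence: float) -> float:
        b = min(int(raw_confidence * n_bins), n_bins - 1)
        return float(accuracy[b])

    return calibrate

# Usage: escalate on calibrated confidence, not what Haiku claims
# calibrate = build_calibration_map(reported, correct)
# needs_escalation = calibrate(confidence_data["confidence"]) < 0.75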

3. Classifier-Based Routing

Train or fine-tune a small classifier on labeled examples from your actual query distribution. The classifier predicts the minimum model tier needed to answer correctly.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
 
# Training data structure: (query, minimum_tier_label)
# Tier labels: 0=haiku, 1=sonnet, 2=opus
# Generate labels by running queries through all tiers and evaluating outputs
 
def build_routing_classifier(training_data: list[tuple[str, int]]) -> Pipeline:
    queries, labels = zip(*training_data)
    
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(
            ngram_range=(1, 2),
            max_features=10000,
            sublinear_tf=True,
        )),
        ("clf", LogisticRegression(
            C=1.0,
            class_weight="balanced",  # important: escalation errors cost more than over-escalation
            max_iter=1000,
        )),
    ])
    
    pipeline.fit(queries, labels)
    return pipeline
 
def route_with_classifier(query: str, classifier: Pipeline) -> Tier:
    probs = classifier.predict_proba([query])[0]
    
    # Conservative routing: if Opus probability > 15%, use Sonnet at minimum
    # Misrouting to a cheaper model is worse than misrouting to an expensive one
    if probs[2] > 0.15:  # Opus probability
        return Tier.OPUS
    if probs[1] > 0.30 or probs[2] > 0.05:  # meaningful Sonnet-or-higher probability
        return Tier.SONNET
    return Tier.HAIKU

The key insight: asymmetric error costs. Routing a hard query to Haiku (false negative) costs you accuracy. Routing an easy query to Opus (false positive) costs you money. Money is recoverable. Trust is not. Design your decision boundaries accordingly.
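
You can make that asymmetry explicit by routing on expected cost instead of hand-tuned thresholds. A sketch, with the dollar figures as placeholders you'd set from your own data on what a bad answer costs:

# Expected-cost routing: API cost of the call plus a penalty weighted by
# the probability that the tier answers incorrectly. All dollar figures
# here are illustrative placeholders.
CALL_COST = {Tier.HAIKU: 0.0005, Tier.SONNET: 0.006, Tier.OPUS: 0.03}  # $/query
WRONG_ANSWER_PENALTY = 0.50  # dollar value you assign to a bad answer

def route_by_expected_cost(probs) -> Tier:
    # probs[i] = P(tier i is the minimum tier that answers correctly),
    # as produced by route_with_classifier's predict_proba call
    p_wrong = {
        Tier.HAIKU: probs[1] + probs[2],   # wrong if the query needed Sonnet or Opus
        Tier.SONNET: probs[2],             # wrong if the query needed Opus
        Tier.OPUS: 0.0,                    # assume the top tier gets it right
    }
    expected = {
        tier: CALL_COST[tier] + WRONG_ANSWER_PENALTY * p_wrong[tier]
        for tier in Tier
    }
    return min(expected, key=expected.get)

Raising WRONG_ANSWER_PENALTY pushes every decision toward the expensive tiers; the hand-tuned thresholds in route_with_classifier are a simpler version of the same tradeoff.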

Escalation Logic

Pure static routing isn't enough. You need runtime escalation when confidence signals indicate a problem.

sequenceDiagram
    participant User
    participant Router
    participant Haiku
    participant Sonnet
    participant EvalChecker
    User->>Router: Query
    Router->>Haiku: Route to cheap tier
    Haiku-->>EvalChecker: Response
    EvalChecker->>EvalChecker: Check: coherent? complete? contradicts known facts?
    alt Passes eval
        EvalChecker-->>User: Return Haiku response
    else Fails eval
        EvalChecker->>Sonnet: Escalate with context
        Sonnet-->>EvalChecker: Response
        EvalChecker-->>User: Return Sonnet response
        EvalChecker->>Router: Log: this query type needs escalation
    end

from anthropic import Anthropic
import json
 
client = Anthropic()
 
EVAL_PROMPT = """You are a quality evaluator. Given a query and a response, 
determine if the response is:
1. Factually correct (to the best of your knowledge)
2. Complete (actually answers what was asked)
3. Coherent (not internally contradictory)
 
Respond with JSON only: {"passes": true/false, "issues": ["list of problems"]}"""
 
def eval_response_quality(query: str, response: str) -> dict:
    # Use a mid-tier model for evaluation — not Haiku (same capability) 
    # not Opus (too expensive for every call)
    eval_result = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": f"{EVAL_PROMPT}\n\nQuery: {query}\n\nResponse: {response}"
        }],
    )
    try:
        return json.loads(eval_result.content[0].text)
    except json.JSONDecodeError:
        return {"passes": False, "issues": ["Eval parse error — escalating to be safe"]}
 
def cascade_query(query: str, router: Pipeline) -> dict:
    initial_tier = route_with_classifier(query, router)
    
    if initial_tier == Tier.HAIKU:
        haiku_resp = call_model(Tier.HAIKU, query)
        eval_result = eval_response_quality(query, haiku_resp)
        
        if eval_result["passes"]:
            return {"answer": haiku_resp, "model_used": "haiku", "escalated": False}
        
        # Escalate to Sonnet
        sonnet_resp = call_model(Tier.SONNET, query, prior_attempt=haiku_resp)
        return {"answer": sonnet_resp, "model_used": "sonnet", "escalated": True,
                "escalation_reason": eval_result["issues"]}
    
    # ... handle other initial tiers

Note the evaluation model choice: using Sonnet to evaluate Haiku's responses is intentional. If you use Haiku to evaluate itself, you inherit the same blind spots.
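
For completeness, cascade_query assumes a call_model helper the post doesn't show. A minimal version, reusing the client and Tier enum from earlier (the prior_attempt framing is one reasonable choice, not gospel), plus the async wrapper the speculative path below will need:

import asyncio

def call_model(tier: Tier, query: str, prior_attempt: str | None = None) -> str:
    """One completion call against the given tier."""
    content = query
    if prior_attempt:
        # Hand the stronger model the failed draft so it can correct it
        # instead of starting cold
        content = (f"{query}\n\nA cheaper model produced this draft, which failed "
                   f"quality checks:\n{prior_attempt}\n\nProduce a better answer.")
    resp = client.messages.create(
        model=tier.value,
        max_tokens=1024,
        messages=[{"role": "user", "content": content}],
    )
    return resp.content[0].text

async def call_model_async(tier: Tier, query: str) -> str:
    # The SDK call is blocking; push it to a thread so asyncio can overlap calls
    return await asyncio.to_thread(call_model, tier, query)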

Cost vs. Quality Curves

Before building a router, instrument your current system to understand your baseline.

import statistics
from collections import defaultdict
 
def analyze_query_distribution(query_log: list[dict]) -> dict:
    """
    query_log entries: {query, model_used, cost_usd, user_rating (1-5), tokens}
    """
    by_model = defaultdict(list)
    for entry in query_log:
        by_model[entry["model_used"]].append(entry)
    
    report = {}
    for model, entries in by_model.items():
        costs = [e["cost_usd"] for e in entries]
        ratings = [e["user_rating"] for e in entries if e.get("user_rating")]
        report[model] = {
            "query_count": len(entries),
            "pct_of_total": len(entries) / len(query_log) * 100,
            "avg_cost": statistics.mean(costs),
            "total_cost": sum(costs),
            "avg_rating": statistics.mean(ratings) if ratings else None,
            "p95_tokens": sorted([e["tokens"] for e in entries])[int(len(entries) * 0.95)],
        }
    
    return report

Run this for two weeks before building a router. You'll likely find that 60-80% of queries in most B2B apps fall in the "easy" category that Haiku can handle. That's your routing opportunity.
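
As a quick worked example of reading the report: take Sonnet's total spend, assume some share of it is Haiku-answerable, and price the move at the per-token figures from the diagram up top. The 70% default below is a placeholder; plug in your own measured share:

def estimate_routing_savings(report: dict, easy_share: float = 0.70) -> float:
    """Rough monthly savings if `easy_share` of Sonnet traffic moved to Haiku.
    Uses the ~$0.25/M vs ~$3/M input pricing from the diagram (about 1/12)."""
    sonnet = report.get("sonnet")
    if not sonnet:
        return 0.0
    movable = sonnet["total_cost"] * easy_share
    return movable * (1 - 0.25 / 3.0)  # the moved spend now bills at Haiku rates

# Sanity check against the numbers at the top of this post:
# $18,000/mo of Sonnet with 90% easy traffic -> 18000 * 0.9 * (1 - 1/12) ≈ $14,850
# saved, the same ballpark as our observed $18,000 -> $4,200 drop.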

When the Cascade Fails

Three failure modes that will hurt you:

1. Cascading latency on escalation. If Haiku takes 1.5s, the eval takes 0.8s, and then Sonnet takes 2.5s, your p95 latency on escalated queries is 4.8s. Users notice.

# Mitigation: parallel-speculate for borderline queries
import asyncio
 
async def speculative_cascade(query: str, confidence: float, router: Pipeline) -> str:
    if confidence < 0.6:  # borderline — run both in parallel, use Sonnet if needed
        haiku_task = asyncio.create_task(call_model_async(Tier.HAIKU, query))
        sonnet_task = asyncio.create_task(call_model_async(Tier.SONNET, query))
        
        haiku_result = await haiku_task
        # eval_response_quality is a blocking call; run it in a thread so the
        # event loop can keep the speculative Sonnet request moving
        eval_result = await asyncio.to_thread(eval_response_quality, query, haiku_result)
        
        if eval_result["passes"]:
            sonnet_task.cancel()  # discard the speculative Sonnet call
            return haiku_result
        return await sonnet_task  # already running, minimal additional latency
    
    return await call_model_async(route_with_classifier(query, router), query)

2. Distribution shift after deployment. Your router was trained on last month's queries. This month, users start asking a new type of question that your router incorrectly sends to Haiku. Set up weekly accuracy sampling; the audit report in the Debugging section below does exactly this.

3. Eval model cost creep. If you're evaluating 100% of Haiku responses with Sonnet, your eval cost can exceed your routing savings. Sample instead: run evals on 20% of Haiku responses, stratified by query type (sketched below).
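
A sketch of that stratified sampling, assuming each logged response carries the query_type from its routing signals:

import random
from collections import defaultdict

def stratified_eval_sample(responses: list[dict], rate: float = 0.20) -> list[dict]:
    """Sample ~rate of Haiku responses per query_type stratum, so rare
    query types still get eval coverage."""
    by_type = defaultdict(list)
    for r in responses:
        by_type[r["query_type"]].append(r)
    sampled = []
    for stratum in by_type.values():
        k = max(1, round(len(stratum) * rate))  # at least one per stratum
        sampled.extend(random.sample(stratum, min(k, len(stratum))))
    return sampled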

Debugging the Cascade

The hardest part of routing systems is debugging silent degradation — cases where queries route incorrectly but users don't complain loudly.

import random

def routing_audit_report(query_log: list[dict], sample_size: int = 100) -> dict:
    """
    Retrospectively checks if routed queries were handled at the right tier.
    Re-runs a sample through the full eval pipeline.
    """
    haiku_queries = [q for q in query_log if q["model_used"] == "haiku"]
    # Randomly sample rather than taking the head of the log, so the audit
    # reflects the current query mix
    sample = random.sample(haiku_queries, min(sample_size, len(haiku_queries)))
    
    misrouted = []
    for entry in sample:
        eval_result = eval_response_quality(entry["query"], entry["response"])
        if not eval_result["passes"]:
            misrouted.append({
                "query": entry["query"][:100],
                "issues": eval_result["issues"],
                "date": entry["timestamp"],
            })
    
    return {
        "sample_size": len(sample),
        "misrouted_count": len(misrouted),
        "misroute_rate": len(misrouted) / len(sample),
        "examples": misrouted[:10],
    }

Run this weekly. If the misroute rate climbs above 5%, retrain your classifier on the new examples.
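
Closing the loop: the audit's flagged queries become new training examples. A sketch; it assumes you pull the full query text from the log (the report truncates it for display) and that someone confirms the correct minimum tier for each:

def retrain_from_audit(flagged: list[tuple[str, int]],
                       training_data: list[tuple[str, int]]) -> Pipeline:
    """flagged: (full_query_text, human-confirmed minimum tier) pairs from
    the audit failures. Rebuilds the classifier with them appended."""
    return build_routing_classifier(training_data + flagged)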

Key Takeaways

  • Heuristic routing is the right starting point — ship something simple, measure, then add complexity.
  • Classifier-based routing with asymmetric cost weighting (escalation errors are worse than conservative routing) outperforms self-routing by a significant margin in production.
  • Parallel speculation on borderline queries prevents the latency spike from dominating your p95 metrics.
  • Never use the same model tier to evaluate responses from that tier — use one tier up for calibrated quality signals.
  • Distribution shift is the silent killer of routing systems: schedule weekly accuracy audits on randomly sampled routed responses.
  • The business case is real — 60-80% of queries in most B2B apps can route to the cheap tier without detectable quality loss, yielding 3-5x cost reduction.