Agent Engineering

Production Rollout Patterns

Ravinder · 7 min read
Agents · AI · LLM · Production · Deployment · DevOps

Getting an agent to work in a demo is a skill. Getting it to production and keeping it working across model updates, prompt changes, and traffic growth is a different discipline entirely. Most teams treat "deploy the agent" as a one-time event. The teams that operate agents successfully treat it as a continuous process with the same rigor they apply to any production service — canary traffic, automated quality gates, and documented runbooks for when things go wrong.

The Rollout Lifecycle

Agent rollouts are riskier than typical software deploys for one reason: the failure mode is often wrong output rather than an error or crash. A broken API returns a 500. A regressed agent returns a plausible-looking wrong answer. You cannot rely on error rates alone to detect problems.

flowchart LR
    A[Code / Prompt Change] --> B[Run Eval Suite\non Regression Set]
    B -- score below threshold --> C[Block Merge]
    B -- score ok --> D[Merge to main]
    D --> E[Deploy to Canary\n5% traffic]
    E --> F[Monitor Production Metrics\n24–48h]
    F -- metrics degraded --> G[Rollback Canary]
    F -- metrics healthy --> H[Ramp to 50%]
    H --> I[Monitor 24h]
    I -- healthy --> J[Full Rollout 100%]
    I -- degraded --> G

Three gates: eval gate before merge, canary health gate before ramp, full traffic gate before 100%. Skip any of them and you are flying blind.

Eval-Gated Deploys

The eval gate (covered in depth in post 7) must run in CI against every pull request that touches agent code, prompts, or model configuration. The gate condition is a minimum score threshold plus a regression check against the last known-good baseline, not a binary pass/fail assertion.

# scripts/check_eval_threshold.py
import argparse
import json
import sys
 
def check_threshold(results_path: str, min_score: float, baseline_path: str | None = None):
    with open(results_path) as f:
        results = json.load(f)
    score = results["mean_score"]
 
    print(f"Eval score: {score:.4f} (threshold: {min_score:.4f})")
 
    if score < min_score:
        print(f"FAIL: Score {score:.4f} below minimum {min_score:.4f}")
        sys.exit(1)
 
    if baseline_path:
        with open(baseline_path) as f:
            baseline = json.load(f)
        baseline_score = baseline["mean_score"]
        regression = baseline_score - score
        if regression > 0.05:   # more than a 0.05 drop from the baseline score
            print(f"FAIL: Regression of {regression:.4f} from baseline {baseline_score:.4f}")
            sys.exit(1)
        print(f"Regression check passed (delta: {-regression:+.4f})")
 
    print("PASS")
 
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--results", required=True)
    parser.add_argument("--min-score", type=float, required=True)
    parser.add_argument("--baseline", default=None)
    args = parser.parse_args()
    check_threshold(args.results, args.min_score, args.baseline)
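
In CI this might be invoked right after the eval run finishes, for example (the threshold and file names here are illustrative):

python scripts/check_eval_threshold.py \
  --results eval_results.json \
  --min-score 0.85 \
  --baseline eval_baseline.json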

Store the baseline eval results as a file committed to the repo. Each successful deploy updates the baseline. This catches both absolute quality failures (below minimum) and relative regressions (worse than the last known-good deploy).
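
A sketch of the promotion step, assuming the deploy pipeline commits the new baseline back to the repo (the paths and commit message are illustrative):

# scripts/update_eval_baseline.py (illustrative)
import json
import shutil
import subprocess

def update_baseline(results_path: str = "eval_results.json",
                    baseline_path: str = "eval_baseline.json"):
    """Promote the latest eval results to the new known-good baseline."""
    with open(results_path) as f:
        results = json.load(f)
    print(f"Promoting baseline: mean_score={results['mean_score']:.4f}")
    shutil.copyfile(results_path, baseline_path)
    # Commit the updated baseline so future PRs are compared against it
    subprocess.run(["git", "add", baseline_path], check=True)
    subprocess.run(["git", "commit", "-m", "chore: update eval baseline"], check=True)

if __name__ == "__main__":
    update_baseline()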

Canary Traffic Routing

A canary deploy sends a fraction of real production traffic to the new agent version while the old version continues serving the majority. For agents, this requires a traffic splitter that can route at the session or request level.

import hashlib
from typing import Callable
 
class CanaryRouter:
    def __init__(
        self,
        canary_pct: float,                   # 0.0–1.0
        stable_agent: Callable,
        canary_agent: Callable,
        log_fn: Callable = print,
    ):
        self.canary_pct = canary_pct
        self.stable = stable_agent
        self.canary = canary_agent
        self.log = log_fn
 
    def route(self, session_id: str, task: str) -> str:
        # Deterministic routing by session so a user always hits the same version
        bucket = int(hashlib.md5(session_id.encode()).hexdigest(), 16) % 100
        use_canary = bucket < (self.canary_pct * 100)
        version = "canary" if use_canary else "stable"
        self.log({"event": "traffic_route", "session": session_id,
                  "version": version, "canary_pct": self.canary_pct})
        agent = self.canary if use_canary else self.stable
        return agent(task)

Deterministic routing by session ID ensures a consistent experience per user — the same user does not see the old agent on one request and the new agent on the next.
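
A usage sketch, with placeholder callables standing in for the real stable and canary agent versions:

# Illustrative wiring: stable_agent_v1 / canary_agent_v2 are stand-ins
# for whatever callables actually run your agent versions.
def stable_agent_v1(task: str) -> str:
    return f"[stable] {task}"

def canary_agent_v2(task: str) -> str:
    return f"[canary] {task}"

router = CanaryRouter(
    canary_pct=0.05,              # start with 5% of sessions
    stable_agent=stable_agent_v1,
    canary_agent=canary_agent_v2,
)
answer = router.route(session_id="sess-1234", task="Summarise this support ticket")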

Production Metrics for Agent Health

Standard error-rate and latency dashboards are necessary but not sufficient. You need agent-specific metrics:

from dataclasses import dataclass, field
from collections import defaultdict
import time
 
@dataclass
class AgentMetrics:
    """Rolling metrics window for one agent version."""
    version: str
    window_seconds: float = 3600.0
    _events: list[dict] = field(default_factory=list)
 
    def record(self, event: str, value: float = 1.0, tags: dict | None = None):
        self._events.append({
            "ts": time.time(),
            "event": event,
            "value": value,
            "tags": tags or {},
        })
        self._prune()
 
    def _prune(self):
        cutoff = time.time() - self.window_seconds
        self._events = [e for e in self._events if e["ts"] > cutoff]
 
    def rate(self, event: str) -> float:
        count = sum(1 for e in self._events if e["event"] == event)
        return count / (self.window_seconds / 3600)   # events per hour
 
    def mean(self, event: str) -> float:
        vals = [e["value"] for e in self._events if e["event"] == event]
        return sum(vals) / len(vals) if vals else 0.0
 
# Key metrics to track per version
AGENT_METRICS_SPEC = {
    "task_success_rate":      "% of tasks completing without error or budget overrun",
    "mean_llm_calls":         "average LLM calls per completed task (efficiency proxy)",
    "mean_cost_usd":          "average cost per task (budget health)",
    "checkpoint_trigger_rate":"% of tasks hitting human-in-the-loop checkpoints",
    "rollback_rate":          "% of destructive actions triggering rollback",
    "p95_latency_sec":        "95th percentile wall-clock time per task",
}

Compare these metrics between the stable and canary versions. A canary with a lower task success rate or higher mean cost than stable is a regression, even if the eval suite passed.
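
A minimal comparison sketch built on the AgentMetrics class above; the event names ("task_success", "cost_usd") and thresholds are illustrative and would be tuned per deployment:

def compare_canary(stable: AgentMetrics, canary: AgentMetrics) -> list[str]:
    """Return regression warnings for the canary relative to stable."""
    warnings = []
    # Success rate: flag a drop of more than 2 percentage points
    stable_success = stable.mean("task_success")
    canary_success = canary.mean("task_success")
    if canary_success < stable_success - 0.02:
        warnings.append(
            f"success rate {canary_success:.2%} vs stable {stable_success:.2%}")
    # Cost: flag a mean cost per task more than 10% above stable
    stable_cost = stable.mean("cost_usd")
    canary_cost = canary.mean("cost_usd")
    if stable_cost and canary_cost > stable_cost * 1.10:
        warnings.append(
            f"mean cost ${canary_cost:.4f} vs stable ${stable_cost:.4f}")
    return warnings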

Runbooks

A runbook is the answer to: "It is 2am, an alert fired, you are the on-call, what do you do?" For agents, you need runbooks for at least the core failure scenarios. One example:

Runbook: Agent Task Success Rate Drops Below 80%

  1. Check recent deploys — was there a model or prompt change in the last 24h? Roll back if yes.
  2. Check LLM provider status pages — is this a provider outage?
  3. Pull the last 50 failed task logs. Is there a common failure pattern (specific tool, specific task type)?
  4. If isolated to canary: roll canary traffic to 0%, page the agent team.
  5. If on stable: scale down agent workers, redirect tasks to a fallback (simpler rule-based system or human queue).

# Minimal rollback script. Wire it into your actual deploy system; the
# kubectl invocation below assumes a Kubernetes canary deployment.
import subprocess

def log_incident(service: str, event: str, detail: str):
    """Placeholder: record the incident wherever your on-call log lives."""
    print(f"INCIDENT {service} {event} {detail}")

def rollback_canary(service: str, previous_version: str):
    """Emergency rollback to the previous stable version."""
    print(f"Rolling back {service} canary to {previous_version}")
    # Example for Kubernetes
    result = subprocess.run(
        ["kubectl", "set", "image",
         f"deployment/{service}-canary",
         f"{service}={previous_version}"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"Rollback failed: {result.stderr}")
    print(f"Rollback complete: {result.stdout}")
    # Log the rollback event for postmortem
    log_incident(service, "canary_rollback", previous_version)

Runbooks should be in the repo, not in someone's head or a wiki that goes stale. Treat them as code: review them, update them when the system changes, test the rollback script in staging.
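
One way to exercise the rollback script without touching a real cluster is to stub out the kubectl call. A minimal sketch with pytest, assuming the script above lives in a module named runbooks (the module name and image tag are made up for illustration):

# test_rollback.py (illustrative)
import subprocess
from runbooks import rollback_canary

class FakeResult:
    returncode = 0
    stdout = "deployment.apps/agent-canary image updated"
    stderr = ""

def test_rollback_invokes_kubectl(monkeypatch):
    calls = []

    def fake_run(cmd, **kwargs):
        calls.append(cmd)
        return FakeResult()

    # Patch subprocess.run so no real kubectl is invoked
    monkeypatch.setattr(subprocess, "run", fake_run)
    rollback_canary("agent", "registry.example.com/agent:v41")

    assert calls[0][:3] == ["kubectl", "set", "image"]
    assert "deployment/agent-canary" in calls[0]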

Model Update Cadence

LLM providers update models continuously. A model that worked perfectly in December may behave differently in March. Your eval-gated deploy pipeline handles code and prompt changes — but you also need a process for model updates.

The discipline: treat every model update (e.g., gpt-4o-2024-11-20 → gpt-4o-2025-01-31) as a canary deploy. Run the eval suite against the new model version, compare metrics to the current model, then canary-route 5% of traffic to the new version before full adoption.

MODEL_REGISTRY = {
    "planning":    {"model": "claude-opus-4-5",  "version": "20240229"},
    "parsing":     {"model": "claude-haiku-4-5", "version": "20240307"},
    "synthesis":   {"model": "claude-opus-4-5",  "version": "20240229"},
}

# Populated only while a new model version is being canaried for a step;
# an empty dict means no model canary is in progress.
CANARY_MODEL_OVERRIDES: dict[str, str] = {}

def get_model(step: str, canary: bool = False) -> str:
    entry = MODEL_REGISTRY[step]
    model = entry["model"]
    # In canary mode, steps with an override resolve to the new model version
    if canary and step in CANARY_MODEL_OVERRIDES:
        model = CANARY_MODEL_OVERRIDES[step]
    return model

Never hard-code model strings throughout the codebase. Centralizing model selection in a registry means a model update is a one-line change, not a grep-and-replace across 30 files.
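
A sketch of how a model canary might be wired through the registry, reusing the deterministic session-hash idea from the CanaryRouter above (the override value and percentage are illustrative):

import hashlib

# Kick off a model canary by registering an override for one step.
# The value is whatever new model identifier you are evaluating.
CANARY_MODEL_OVERRIDES["planning"] = "new-planning-model-id"   # placeholder

def resolve_model_for_request(step: str, session_id: str, canary_pct: float = 0.05) -> str:
    """Route a small, deterministic slice of sessions to the canary model."""
    bucket = int(hashlib.md5(session_id.encode()).hexdigest(), 16) % 100
    return get_model(step, canary=bucket < canary_pct * 100)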

Key Takeaways

  • Agent rollouts need three gates: eval gate before merge, canary health gate before traffic ramp, full traffic gate before 100% — skip any one and you are relying on user complaints to detect regressions.
  • Canary routing should be deterministic by session ID so individual users always hit the same version during a rollout window.
  • Track agent-specific metrics (task success rate, mean LLM calls, cost per task) alongside standard latency and error rates — wrong answers do not show up in error rates.
  • Store eval baseline results in the repo and gate on regression from the previous known-good deploy, not just an absolute threshold.
  • Runbooks belong in the repo alongside the code they document — treat them as production artifacts, not optional documentation.
  • Treat every model version update as a canary deploy: run evals against the new version, compare to current, then route incrementally.