Cost and Latency Budgets
Series
Agent EngineeringAn agent with no cost ceiling is a liability, not a feature. A single runaway task — looping on a bad tool response, re-planning repeatedly on ambiguous input, or hitting a pagination bug that generates infinite LLM calls — can cost hundreds of dollars before anyone notices. Cost and latency budgets are not optional niceties; they are safety constraints that belong in the core agent loop from day one.
Budget as a First-Class Primitive
Every task execution should carry a budget object. Not a global config value — a per-task, per-session object that the agent checks before every LLM call.
from dataclasses import dataclass, field
from time import time
@dataclass
class TaskBudget:
max_tokens: int = 50_000 # total input+output tokens
max_llm_calls: int = 20 # hard cap on LLM invocations
max_wall_seconds: float = 120.0 # wall-clock deadline
max_cost_usd: float = 0.50 # dollar ceiling
# runtime accumulators
tokens_used: int = 0
llm_calls: int = 0
cost_usd: float = 0.0
started_at: float = field(default_factory=time)
def charge(self, input_tokens: int, output_tokens: int,
cost_per_1k_in: float = 0.003,
cost_per_1k_out: float = 0.015):
self.tokens_used += input_tokens + output_tokens
self.llm_calls += 1
self.cost_usd += (input_tokens / 1000) * cost_per_1k_in
self.cost_usd += (output_tokens / 1000) * cost_per_1k_out
def elapsed(self) -> float:
return time() - self.started_at
def exceeded(self) -> str | None:
if self.tokens_used > self.max_tokens:
return f"Token limit: {self.tokens_used}/{self.max_tokens}"
if self.llm_calls > self.max_llm_calls:
return f"Call limit: {self.llm_calls}/{self.max_llm_calls}"
if self.elapsed() > self.max_wall_seconds:
return f"Time limit: {self.elapsed():.1f}s/{self.max_wall_seconds}s"
if self.cost_usd > self.max_cost_usd:
return f"Cost limit: ${self.cost_usd:.3f}/${self.max_cost_usd}"
return NoneCheck budget.exceeded() before every LLM call. If it triggers, escalate — do not silently discard work.
The Budget Check Loop
The escalation handler is the critical piece most implementations skip. "Exceeded budget" without an escalation policy means the task silently fails, the user gets nothing, and you have no signal about which tasks routinely exhaust their budgets.
Model Tiering for Cost Control
The highest-leverage cost control is model selection per step. Not every step in an agent loop needs the best available model.
from enum import Enum
class StepType(Enum):
PLAN = "plan" # needs best reasoning
TOOL_PARSE = "parse" # simple extraction
SYNTHESIZE = "synth" # final answer, needs quality
REFLECT = "reflect" # intermediate check
MODEL_TIER = {
StepType.PLAN: "claude-opus-4-5",
StepType.TOOL_PARSE: "claude-haiku-4-5",
StepType.SYNTHESIZE: "claude-opus-4-5",
StepType.REFLECT: "claude-haiku-4-5",
}
# Rough cost ratio: Opus = 1.0x, Haiku = 0.025x
# Routing tool parsing and reflection to Haiku cuts 60-70% of token cost
# on typical 10-step ReAct loops.
def get_model(step: StepType, budget: TaskBudget) -> str:
"""Downgrade model if budget is running low."""
base = MODEL_TIER[step]
cost_ratio = budget.cost_usd / budget.max_cost_usd
if cost_ratio > 0.7 and base == "claude-opus-4-5":
# Degrade gracefully rather than hard-fail
return "claude-haiku-4-5"
return baseAdaptive downgrading is a pragmatic middle ground: when you have used 70% of the cost budget, switch to cheaper models for the remaining steps rather than aborting. The result quality may drop slightly, but partial results are usually better than no results.
Latency Budgets and Timeout Strategy
Cost budgets protect your wallet. Latency budgets protect user experience. They are separate constraints that need separate enforcement.
import asyncio
from typing import TypeVar, Callable, Awaitable
T = TypeVar("T")
async def with_timeout(
coro: Awaitable[T],
timeout_sec: float,
fallback: T,
label: str = "agent_call",
) -> T:
"""Run a coroutine with a hard timeout; return fallback on expiry."""
try:
return await asyncio.wait_for(coro, timeout=timeout_sec)
except asyncio.TimeoutError:
# Emit a metric here in production
print(f"[TIMEOUT] {label} exceeded {timeout_sec}s — returning fallback")
return fallback
# Usage in the agent loop
async def safe_llm_call(prompt: str, budget: TaskBudget) -> str:
remaining = budget.max_wall_seconds - budget.elapsed()
if remaining <= 0:
return "[BUDGET EXCEEDED: time]"
return await with_timeout(
coro=llm_call(prompt),
timeout_sec=min(remaining, 30.0), # per-call cap too
fallback="[TIMEOUT]",
label="llm_call",
)Set both a per-call timeout and a task-level wall-clock deadline. A single slow LLM call should not burn the entire task budget.
Monitoring and Alerting
Budgets are only useful if you know when they are being hit. The minimum monitoring setup:
import dataclasses
def emit_budget_metrics(task_id: str, budget: TaskBudget, outcome: str):
"""Emit to your metrics system (Datadog, Prometheus, etc.)"""
metrics = {
"task_id": task_id,
"outcome": outcome, # "success", "budget_exceeded", "error"
"tokens_used": budget.tokens_used,
"llm_calls": budget.llm_calls,
"cost_usd": round(budget.cost_usd, 4),
"wall_seconds": round(budget.elapsed(), 2),
"token_utilization": budget.tokens_used / budget.max_tokens,
"cost_utilization": budget.cost_usd / budget.max_cost_usd,
}
# Replace with your metrics client
print(f"[METRICS] {metrics}")Alert on: cost_utilization > 0.9 (tasks regularly hitting ceiling — ceiling is wrong), outcome == "budget_exceeded" rate > 5% (budgets are too tight or tasks are poorly scoped), average cost_usd drifting upward over time (model behavior or task complexity change).
Key Takeaways
- Treat cost and latency budgets as first-class runtime constraints, not global config values — attach them to individual task sessions.
- Check the budget before every LLM call; do not let the loop discover an overrun only after the damage is done.
- Model tiering is the highest-leverage cost control: route parsing and reflection steps to cheaper models, reserve expensive models for planning and synthesis.
- Adaptive downgrading (switch to a cheaper model when 70% of cost budget is used) is better than hard-failing — partial quality beats no result.
- Set both per-call timeouts and task-level wall-clock deadlines; they are different failure modes requiring separate enforcement.
- Monitor
budget_exceededrate and cost utilization trends — budgets that are never hit are too loose; budgets hit >5% of the time are miscalibrated.