Agent Engineering

Cost and Latency Budgets

Ravinder · 5 min read
Agents · AI · LLM · Cost · Latency · Monitoring

An agent with no cost ceiling is a liability, not a feature. A single runaway task — looping on a bad tool response, re-planning repeatedly on ambiguous input, or hitting a pagination bug that generates infinite LLM calls — can cost hundreds of dollars before anyone notices. Cost and latency budgets are not optional niceties; they are safety constraints that belong in the core agent loop from day one.

Budget as a First-Class Primitive

Every task execution should carry a budget object. Not a global config value — a per-task, per-session object that the agent checks before every LLM call.

from dataclasses import dataclass, field
from time import time
 
@dataclass
class TaskBudget:
    max_tokens: int = 50_000          # total input+output tokens
    max_llm_calls: int = 20           # hard cap on LLM invocations
    max_wall_seconds: float = 120.0   # wall-clock deadline
    max_cost_usd: float = 0.50        # dollar ceiling
 
    # runtime accumulators
    tokens_used: int = 0
    llm_calls: int = 0
    cost_usd: float = 0.0
    started_at: float = field(default_factory=time)
 
    def charge(self, input_tokens: int, output_tokens: int,
               cost_per_1k_in: float = 0.003,
               cost_per_1k_out: float = 0.015):
        self.tokens_used += input_tokens + output_tokens
        self.llm_calls += 1
        self.cost_usd += (input_tokens / 1000) * cost_per_1k_in
        self.cost_usd += (output_tokens / 1000) * cost_per_1k_out
 
    def elapsed(self) -> float:
        return time() - self.started_at
 
    def exceeded(self) -> str | None:
        if self.tokens_used > self.max_tokens:
            return f"Token limit: {self.tokens_used}/{self.max_tokens}"
        if self.llm_calls > self.max_llm_calls:
            return f"Call limit: {self.llm_calls}/{self.max_llm_calls}"
        if self.elapsed() > self.max_wall_seconds:
            return f"Time limit: {self.elapsed():.1f}s/{self.max_wall_seconds}s"
        if self.cost_usd > self.max_cost_usd:
            return f"Cost limit: ${self.cost_usd:.3f}/${self.max_cost_usd}"
        return None

Check budget.exceeded() before every LLM call. If it triggers, escalate — do not silently discard work.

The Budget Check Loop

flowchart TD
    A[Receive Task + Budget] --> B[Check Budget]
    B -- exceeded --> C[Escalation Handler]
    B -- ok --> D[LLM Call]
    D --> E[Charge Budget]
    E --> F{Task Complete?}
    F -- yes --> G[Return Result]
    F -- no --> B
    C --> H{Escalation Policy}
    H -- retry-smaller --> I[Reduce Scope & Retry]
    H -- human-review --> J[Queue for Human]
    H -- fail-fast --> K[Return Partial + Error]

The escalation handler is the critical piece most implementations skip. "Exceeded budget" without an escalation policy means the task silently fails, the user gets nothing, and you have no signal about which tasks routinely exhaust their budgets.

Model Tiering for Cost Control

The highest-leverage cost control is model selection per step. Not every step in an agent loop needs the best available model.

from enum import Enum
 
class StepType(Enum):
    PLAN = "plan"           # needs best reasoning
    TOOL_PARSE = "parse"    # simple extraction
    SYNTHESIZE = "synth"    # final answer, needs quality
    REFLECT = "reflect"     # intermediate check
 
MODEL_TIER = {
    StepType.PLAN:       "claude-opus-4-5",
    StepType.TOOL_PARSE: "claude-haiku-4-5",
    StepType.SYNTHESIZE: "claude-opus-4-5",
    StepType.REFLECT:    "claude-haiku-4-5",
}
 
# Rough cost ratio: Opus = 1.0x, Haiku = 0.025x
# Routing tool parsing and reflection to Haiku cuts 60-70% of token cost
# on typical 10-step ReAct loops.
 
def get_model(step: StepType, budget: TaskBudget) -> str:
    """Downgrade model if budget is running low."""
    base = MODEL_TIER[step]
    cost_ratio = budget.cost_usd / budget.max_cost_usd
    if cost_ratio > 0.7 and base == "claude-opus-4-5":
        # Degrade gracefully rather than hard-fail
        return "claude-haiku-4-5"
    return base

Adaptive downgrading is a pragmatic middle ground: when you have used 70% of the cost budget, switch to cheaper models for the remaining steps rather than aborting. The result quality may drop slightly, but partial results are usually better than no results.
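The 60-70% figure quoted in the comment above is easy to sanity-check. Here is a back-of-envelope sketch assuming equal token counts per step; in practice the parse steps carry the bulkiest tool outputs, which pushes real savings toward the upper end of that range.

```python
# 10-step loop: 4 Opus steps (plan + synthesis) and 6 Haiku steps (parse + reflect).
OPUS_UNIT = 1.0      # relative cost per step on Opus
HAIKU_UNIT = 0.025   # rough Haiku cost at the ratio quoted above

all_opus = 10 * OPUS_UNIT
tiered = 4 * OPUS_UNIT + 6 * HAIKU_UNIT
savings = 1 - tiered / all_opus   # roughly 58% with equal-sized steps
```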

Latency Budgets and Timeout Strategy

Cost budgets protect your wallet. Latency budgets protect user experience. They are separate constraints that need separate enforcement.

import asyncio
from typing import TypeVar, Callable, Awaitable
 
T = TypeVar("T")
 
async def with_timeout(
    coro: Awaitable[T],
    timeout_sec: float,
    fallback: T,
    label: str = "agent_call",
) -> T:
    """Run a coroutine with a hard timeout; return fallback on expiry."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_sec)
    except asyncio.TimeoutError:
        # Emit a metric here in production
        print(f"[TIMEOUT] {label} exceeded {timeout_sec}s — returning fallback")
        return fallback
 
# Usage in the agent loop (llm_call stands in for your provider's async completion call)
async def safe_llm_call(prompt: str, budget: TaskBudget) -> str:
    remaining = budget.max_wall_seconds - budget.elapsed()
    if remaining <= 0:
        return "[BUDGET EXCEEDED: time]"
    return await with_timeout(
        coro=llm_call(prompt),
        timeout_sec=min(remaining, 30.0),   # per-call cap too
        fallback="[TIMEOUT]",
        label="llm_call",
    )

Set both a per-call timeout and a task-level wall-clock deadline. A single slow LLM call should not burn the entire task budget.
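The interaction between the two limits reduces to taking the minimum before each call. A small sketch with illustrative numbers (the 30-second per-call cap mirrors the one in safe_llm_call above):

```python
def effective_timeout(task_deadline: float, elapsed: float,
                      per_call_cap: float = 30.0) -> float:
    """Timeout to pass to the next call: whichever limit is closer."""
    remaining = task_deadline - elapsed
    return max(0.0, min(remaining, per_call_cap))

print(effective_timeout(120.0, 10.0))    # early in the task: per-call cap wins -> 30.0
print(effective_timeout(120.0, 115.0))   # near the deadline: remaining time wins -> 5.0
print(effective_timeout(120.0, 130.0))   # past the deadline: 0.0, skip the call
```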

Monitoring and Alerting

Budgets are only useful if you know when they are being hit. The minimum monitoring setup:

def emit_budget_metrics(task_id: str, budget: TaskBudget, outcome: str):
    """Emit to your metrics system (Datadog, Prometheus, etc.)"""
    metrics = {
        "task_id": task_id,
        "outcome": outcome,            # "success", "budget_exceeded", "error"
        "tokens_used": budget.tokens_used,
        "llm_calls": budget.llm_calls,
        "cost_usd": round(budget.cost_usd, 4),
        "wall_seconds": round(budget.elapsed(), 2),
        "token_utilization": budget.tokens_used / budget.max_tokens,
        "cost_utilization": budget.cost_usd / budget.max_cost_usd,
    }
    # Replace with your metrics client
    print(f"[METRICS] {metrics}")

Alert on:

  • cost_utilization > 0.9 — tasks regularly hitting the ceiling means the ceiling is wrong.
  • outcome == "budget_exceeded" rate above 5% — budgets are too tight or tasks are poorly scoped.
  • Average cost_usd drifting upward over time — model behavior or task complexity has changed.
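The first two rules are cheap to encode against a window of emit_budget_metrics payloads. A hypothetical sketch (the function name is illustrative; field names match the metrics dict above; the cost-drift rule needs a time series and is omitted):

```python
def check_alerts(window: list[dict]) -> list[str]:
    """Evaluate recent budget metrics against the alert rules."""
    alerts: list[str] = []
    if not window:
        return alerts
    n = len(window)
    # Rule 1: too many tasks exhausting their budgets outright
    exceeded_rate = sum(m["outcome"] == "budget_exceeded" for m in window) / n
    if exceeded_rate > 0.05:
        alerts.append(f"budget_exceeded rate {exceeded_rate:.0%}: budgets too tight or tasks mis-scoped")
    # Rule 2: too many tasks finishing right at the cost ceiling
    ceiling_rate = sum(m["cost_utilization"] > 0.9 for m in window) / n
    if ceiling_rate > 0.05:
        alerts.append(f"{ceiling_rate:.0%} of tasks above 90% cost utilization: recalibrate the ceiling")
    return alerts
```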

Key Takeaways

  • Treat cost and latency budgets as first-class runtime constraints, not global config values — attach them to individual task sessions.
  • Check the budget before every LLM call; do not let the loop discover an overrun only after the damage is done.
  • Model tiering is the highest-leverage cost control: route parsing and reflection steps to cheaper models, reserve expensive models for planning and synthesis.
  • Adaptive downgrading (switch to a cheaper model when 70% of cost budget is used) is better than hard-failing — partial quality beats no result.
  • Set both per-call timeouts and task-level wall-clock deadlines; they are different failure modes requiring separate enforcement.
  • Monitor budget_exceeded rate and cost utilization trends — budgets that are never hit are too loose; budgets hit >5% of the time are miscalibrated.