Agent Engineering

Evals for Agentic Workflows

Ravinder · 6 min read
Agents · AI · LLM · Evals · Testing · Benchmarking

Evaluating agents is not like evaluating classifiers. A classifier either labels correctly or it does not. An agent can reach the right final answer via a wildly inefficient path, or fail on step 3 and produce a plausible-looking answer that is completely wrong. Final-output metrics alone will lie to you. You need evals that measure the journey, not just the destination.

Three Levels of Agentic Evals

Think of agent evals in three layers, each catching different failure classes:

flowchart LR
    A[Task Success\nDid it complete?] --> B[Trajectory Quality\nDid it take the right path?]
    B --> C[Regression Set\nDoes it still work after changes?]
    A -- "Catches" --> A1[Catastrophic failures]
    B -- "Catches" --> B1[Inefficiency\nHallucinated steps\nWrong tool use]
    C -- "Catches" --> C1[Regressions from\nmodel/prompt changes]

Most teams only implement the first layer. Teams that operate agents in production need all three.

Task Success Evals

The baseline: did the agent complete the task and produce a correct result? For tasks with deterministic outputs, this is straightforward.

from dataclasses import dataclass
from typing import Callable
 
@dataclass
class EvalCase:
    task_id: str
    input: str
    expected_output: str
    grade_fn: Callable[[str, str], float]  # returns 0.0–1.0
 
def exact_match(expected: str, actual: str) -> float:
    return 1.0 if expected.strip() == actual.strip() else 0.0
 
def substring_match(expected: str, actual: str) -> float:
    return 1.0 if expected.strip().lower() in actual.strip().lower() else 0.0
 
def llm_grade(expected: str, actual: str, llm_client=None) -> float:
    """Use an LLM judge for open-ended tasks."""
    if llm_client is None:
        raise ValueError("llm_grade requires an llm_client")
    prompt = f"""
    Expected answer: {expected}
    Agent answer: {actual}
    Score 0-10 for correctness. Reply with only the integer.
    """
    resp = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        score = int(resp.choices[0].message.content.strip())
    except ValueError:
        return 0.0
    # Clamp in case the judge replies with an out-of-range integer.
    return max(0.0, min(1.0, score / 10))
 
def run_eval_suite(
    agent_fn: Callable[[str], str],
    cases: list[EvalCase],
) -> dict:
    results = []
    for case in cases:
        actual = agent_fn(case.input)
        score = case.grade_fn(case.expected_output, actual)
        results.append({"id": case.task_id, "score": score, "actual": actual})
    avg = sum(r["score"] for r in results) / max(len(results), 1)
    return {"mean_score": avg, "results": results}

LLM-as-judge is necessary for open-ended tasks but adds cost and latency. Use exact/substring match wherever possible and reserve LLM grading for the cases that genuinely require semantic understanding.
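One way to operationalize that is a tiered grader that tries the cheap deterministic checks first and only pays for the judge when they fail. A minimal sketch built on the functions above (the fallback order is an illustrative choice, not a fixed recipe):

def tiered_grade(expected: str, actual: str, llm_client) -> float:
    """Try cheap deterministic checks before paying for an LLM judge."""
    if exact_match(expected, actual) == 1.0:
        return 1.0
    if substring_match(expected, actual) == 1.0:
        return 1.0
    # Only semantically ambiguous cases reach the expensive judge.
    return llm_grade(expected, actual, llm_client=llm_client)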

Trajectory Evals

A trajectory is the sequence of tool calls and decisions the agent made to reach its output. A good trajectory eval checks: were the right tools called? In the right order? With the right arguments? Were any steps redundant?

from dataclasses import dataclass
 
@dataclass
class TrajectoryStep:
    tool: str
    args: dict
    result: str
 
@dataclass
class TrajectoryCase:
    task_id: str
    input: str
    expected_steps: list[dict]   # [{"tool": "search_web", "arg_contains": ["python"]}]
 
def eval_trajectory(
    actual_steps: list[TrajectoryStep],
    expected_steps: list[dict],
) -> dict:
    """Check that expected tool calls appear in the right order."""
    score = 0
    details = []
    expected_idx = 0
    for step in actual_steps:
        if expected_idx >= len(expected_steps):
            break
        exp = expected_steps[expected_idx]
        tool_match = step.tool == exp["tool"]
        arg_match = all(
            v in str(step.args)
            for v in exp.get("arg_contains", [])
        )
        if tool_match and arg_match:
            score += 1
            expected_idx += 1
            details.append({"step": step.tool, "matched": True})
        else:
            details.append({"step": step.tool, "matched": False,
                            "expected": exp["tool"]})
 
    return {
        "score": score / len(expected_steps) if expected_steps else 1.0,
        "steps_matched": score,
        "total_expected": len(expected_steps),
        "extra_steps": len(actual_steps) - len(expected_steps),
        "details": details,
    }

Track extra_steps carefully. An agent that takes 12 steps when the reference trajectory takes 4 is burning 3× the cost. Trajectory efficiency is a cost metric masquerading as a quality metric.
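If you want inefficiency to show up in the headline number rather than in a side channel, fold it into the trajectory score. A minimal sketch over the eval_trajectory result above (the per-step penalty is an illustrative choice):

def efficiency_weighted_score(traj_result: dict, penalty: float = 0.05) -> float:
    """Discount the trajectory match score by extra steps beyond the reference."""
    extra = max(traj_result["extra_steps"], 0)
    return max(0.0, traj_result["score"] - penalty * extra)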

Building a Regression Set

A regression set is a fixed collection of eval cases that you run on every code or prompt change. It is your CI for agent behavior.

import json
from pathlib import Path
from typing import Callable
 
class RegressionSet:
    def __init__(self, path: str):
        self.path = Path(path)
        self.cases: list[EvalCase] = []
 
    def add(self, case: EvalCase):
        self.cases.append(case)
        self._persist()
 
    def _persist(self):
        """Save regression set to disk for version control."""
        data = [
            {
                "task_id": c.task_id,
                "input": c.input,
                "expected_output": c.expected_output,
                "grade_fn": c.grade_fn.__name__,
            }
            for c in self.cases
        ]
        self.path.write_text(json.dumps(data, indent=2))
 
    def run(self, agent_fn: Callable[[str], str]) -> dict:
        results = run_eval_suite(agent_fn, self.cases)
        print(f"Regression score: {results['mean_score']:.2%}")
        return results

Populate the regression set from two sources: (1) curated golden examples representing critical task types, and (2) real production failures — whenever an agent produces a wrong answer in production and you catch it, add it to the regression set. This makes your eval set adversarial over time.
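In practice the second source can be as small as a hook in your failure-triage path that converts a logged production case into an EvalCase. A minimal sketch built on the classes above (production_trace is a hypothetical record with the fields shown):

def capture_failure(regression_set: RegressionSet, production_trace: dict):
    """Turn a triaged production failure into a permanent regression case."""
    regression_set.add(EvalCase(
        task_id=f"prod-{production_trace['trace_id']}",
        input=production_trace["user_input"],
        expected_output=production_trace["corrected_answer"],  # set during triage
        grade_fn=substring_match,
    ))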

Eval-Gated Deploys

Evals are only useful if they block bad deploys. Wire the regression set into CI:

# .github/workflows/agent-evals.yml
name: Agent Regression Evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run regression evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python -m pytest tests/evals/ -v --tb=short
      - name: Check score threshold
        run: python scripts/check_eval_threshold.py --min-score 0.85

The threshold (0.85 here) is a team decision. Start permissive and tighten as your eval set matures and your agent stabilizes. A score drop of more than 5 percentage points from the baseline should block the merge.
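The workflow above references a scripts/check_eval_threshold.py; here is one hedged sketch of what it could look like, assuming the eval run writes its summary to eval_results.json and the last accepted score lives in eval_baseline.json (both file names are illustrative):

# scripts/check_eval_threshold.py (illustrative sketch)
import argparse
import json
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--min-score", type=float, required=True)
parser.add_argument("--max-drop", type=float, default=0.05)
args = parser.parse_args()

score = json.load(open("eval_results.json"))["mean_score"]
baseline = json.load(open("eval_baseline.json"))["mean_score"]

if score < args.min_score:
    sys.exit(f"FAIL: score {score:.2%} below floor {args.min_score:.2%}")
if baseline - score > args.max_drop:
    sys.exit(f"FAIL: score dropped {baseline - score:.2%} from baseline")
print(f"PASS: score {score:.2%} (baseline {baseline:.2%})")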

Common Eval Anti-Patterns

Evaluating only happy-path inputs: Your regression set will be dominated by easy cases. Deliberately add adversarial inputs — ambiguous instructions, malformed tool responses, conflicting constraints.
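For example, a couple of illustrative adversarial cases built on the EvalCase dataclass from earlier (the inputs and expected substrings are invented for demonstration):

adversarial_cases = [
    # Ambiguous instruction: a correct answer surfaces the ambiguity
    # rather than guessing, so we check for a clarifying phrase.
    EvalCase(
        task_id="adv-ambiguous-report",
        input="Summarize the Q3 report",  # no report is specified
        expected_output="which report",
        grade_fn=substring_match,
    ),
    # Conflicting constraints: the agent should flag the conflict.
    EvalCase(
        task_id="adv-conflicting-budget",
        input="Find a flight under $100 that is also business class",
        expected_output="conflict",
        grade_fn=substring_match,
    ),
]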

LLM judge without calibration: An LLM judge that gives 8/10 to wrong answers is worse than no eval. Calibrate your judge against human labels on a sample set before trusting it.
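A minimal calibration check, assuming you have a hand-labeled set of (expected, actual, human_score) triples with human scores on the same 0.0-1.0 scale (the tolerance and agreement bar are illustrative choices):

def calibrate_judge(labeled: list[tuple[str, str, float]], llm_client,
                    tolerance: float = 0.2) -> float:
    """Fraction of human-labeled cases where the judge lands within tolerance."""
    agree = sum(
        abs(llm_grade(expected, actual, llm_client=llm_client) - human) <= tolerance
        for expected, actual, human in labeled
    )
    return agree / len(labeled)

# Only trust the judge once agreement clears a bar you choose, e.g. 0.9.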

No baseline tracking: Running evals once is table stakes. Track scores over time. A slow drift from 0.92 to 0.84 over six model updates is invisible without a time series.
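Tracking can be as simple as appending each run's score to a JSONL file in version control or your metrics store. A minimal sketch (the file name is illustrative):

import datetime
import json

def record_score(mean_score: float, path: str = "eval_history.jsonl"):
    """Append one timestamped data point so slow drift is visible over time."""
    with open(path, "a") as f:
        f.write(json.dumps({
            "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "mean_score": mean_score,
        }) + "\n")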

Key Takeaways

  • Agentic evals require three layers: task success (did it finish?), trajectory quality (did it take the right path?), and regression sets (does it still work after changes?).
  • LLM-as-judge is necessary for open-ended tasks but must be calibrated against human labels before you trust its scores.
  • Track extra_steps in trajectory evals — path inefficiency is a cost metric disguised as a quality metric.
  • Populate your regression set from production failures, not just curated golden examples, to make it adversarial over time.
  • Gate deploys on eval score thresholds in CI; a drop of more than 5 percentage points from baseline should block merges.
  • Deliberately include adversarial inputs — ambiguous instructions, malformed tool responses — or your eval set will only catch regressions on easy cases.