Tool Design for Autonomy
Series: Agent Engineering
The model is not the bottleneck. The tools are. A capable LLM paired with a poorly designed tool set will produce an agent that hallucinates parameters, misinterprets results, retries destructive operations, and generally behaves unpredictably in ways that are hard to trace back to a root cause.
Tool design for autonomous systems is a discipline in its own right—closer to API design than to prompt engineering—and most teams treat it as an afterthought. That is a mistake you pay for in debugging hours and production incidents.
The Contract a Tool Must Fulfill
Every tool exposed to an agent is a contract. The agent reads the contract (the schema and description), decides to invoke the tool, and expects the contract to hold. When it does not, the agent's reasoning goes sideways in ways that compound over multiple iterations.
A well-formed tool contract has four parts:
- A name that signals intent and scope
- A schema that types and constrains every parameter
- A side-effect classification that governs whether a call can be retried
- An error contract that tells the model how to recover or when to stop
Neglect any one of these and you have introduced a latent defect that will manifest unpredictably depending on the agent's reasoning trace.
Naming: Verb-Noun, No Overloading
Tool names are prompts. The model reads them to decide which tool fits the current intent. Ambiguous names create ambiguous decisions.
Bad names:
- `process` — process what? how?
- `data_tool` — entirely opaque
- `handle_request` — could be anything
- `get` — overloaded to the point of uselessness
Good names:
- `search_web` — unambiguous action and domain
- `read_file` — clear, single responsibility
- `send_email` — obvious side effect
- `delete_record` — the word "delete" is a signal to the model that this is destructive
Use consistent verb vocabulary across your tool set. Pick one word for retrieval (`get_`, `fetch_`, `read_`), one for mutation (`update_`, `write_`, `set_`), one for creation (`create_`, `add_`, `insert_`), and stick to it. The model will learn the pattern within a session.
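For illustration, here is a hypothetical tool set that follows one such convention, with read_ for all retrieval, update_ for all mutation, and create_ for all creation (the names are invented for this example):
# Hypothetical CRM agent tool set with one verb per operation class.
CRM_TOOLS = [
    "read_customer",          # retrieval
    "read_invoice",
    "update_customer_email",  # mutation of existing state
    "update_invoice_status",
    "create_support_ticket",  # creation of new state
    "create_refund_request",
]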
Schemas: Typed, Constrained, Exemplified
JSON Schema is your primary communication channel with the model. Write it as if the model has never seen your domain before—because for every new task, it effectively has not.
# Weak schema — too loose, model must guess
weak_schema = {
"name": "search_records",
"description": "Search for records.",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"},
"options": {"type": "object"}
}
}
}
# Strong schema — constrained, documented, exemplified
strong_schema = {
"name": "search_customer_records",
"description": (
"Search the customer database by name, email, or account ID. "
"Returns at most `limit` matching records ordered by relevance. "
"Use this when you need to look up a specific customer before "
"reading or modifying their data."
),
"parameters": {
"type": "object",
"required": ["query"],
"properties": {
"query": {
"type": "string",
"description": "Search term: customer name, email address, or account ID (e.g. 'ACC-1234').",
"minLength": 2,
"maxLength": 200,
},
"limit": {
"type": "integer",
"description": "Maximum number of results to return. Defaults to 10, max 50.",
"default": 10,
"minimum": 1,
"maximum": 50,
},
"include_inactive": {
"type": "boolean",
"description": "If true, include deactivated accounts in results. Default false.",
"default": False,
},
},
"additionalProperties": False,
},
}
The description fields are not just documentation. They are part of the reasoning input. Write them as instructions to an intelligent but uninformed junior engineer, because that is essentially what you are doing.
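A schema this strict is also machine-checkable: arguments can be validated before the tool function ever runs, so raw model output never reaches your code or your APIs directly. A minimal sketch, assuming the third-party jsonschema package is available:
from jsonschema import ValidationError, validate

def validate_tool_args(tool_schema: dict, args: dict) -> dict | None:
    # Check model-provided arguments against the tool's parameter schema
    # before execution. Returns None on success, or a structured error
    # the agent can observe and correct on its next iteration.
    try:
        validate(instance=args, schema=tool_schema["parameters"])
        return None
    except ValidationError as exc:
        return {
            "error": "INVALID_ARGUMENTS",
            "message": exc.message,
            "retryable": False,
        }

# Example: the model omitted the required "query" field.
error = validate_tool_args(strong_schema, {"limit": 5})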
Side Effects: Classify Every Tool
Agents retry. When a tool call times out or returns an ambiguous error, the model will attempt to call it again. If that tool has side effects—especially destructive ones—retries are dangerous.
Classify every tool by its side-effect profile:
| Class | Description | Retry Safe? |
|---|---|---|
| Read | Returns data, no mutation | Yes |
| Idempotent Write | Repeating the call leaves the same end state | Yes |
| Non-idempotent Write | Creates new state each call | No |
| Destructive | Deletes or irreversibly mutates | Never auto-retry |
Encode this classification in your tool registry and enforce it in your loop:
from enum import Enum
from dataclasses import dataclass, field
from typing import Callable
class SideEffectClass(Enum):
READ = "read"
IDEMPOTENT_WRITE = "idempotent_write"
NON_IDEMPOTENT_WRITE = "non_idempotent_write"
DESTRUCTIVE = "destructive"
@dataclass
class Tool:
name: str
description: str
schema: dict
fn: Callable
side_effect: SideEffectClass
max_retries: int = field(init=False)
def __post_init__(self):
self.max_retries = {
SideEffectClass.READ: 3,
SideEffectClass.IDEMPOTENT_WRITE: 2,
SideEffectClass.NON_IDEMPOTENT_WRITE: 0,
SideEffectClass.DESTRUCTIVE: 0,
        }[self.side_effect]
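The dataclass handles the registry side. The loop side has to respect the same budget before re-invoking a failed tool; below is a minimal sketch, assuming tool functions raise a hypothetical TransientToolError on recoverable failures:
import time

class TransientToolError(Exception):
    """Hypothetical exception raised by a tool function on a transient failure."""

def execute_with_retry(tool: Tool, args: dict):
    # Invoke a tool, retrying only within its side-effect-derived budget.
    attempts = 0
    while True:
        try:
            return tool.fn(**args)
        except TransientToolError as exc:
            attempts += 1
            if attempts > tool.max_retries:
                # Budget exhausted: surface the failure as an observation
                # rather than silently re-invoking a write or delete.
                return {"error": "RETRIES_EXHAUSTED", "message": str(exc),
                        "retryable": False}
            time.sleep(2 ** attempts)  # simple exponential backoff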
Error Contracts: What the Model Should Do Next
When a tool fails, the error message is the next observation. A vague error sends the agent off in the wrong direction. A structured error tells the agent exactly how to recover—or whether to stop.
Design three error categories:
Nominal errors — Expected failures in the normal operating envelope. The agent should handle these autonomously.
{
"error": "NOT_FOUND",
"message": "No customer found matching 'john@example.com'",
"retryable": false,
"suggestion": "Try searching by name or account ID instead."
}
Tool errors — The tool itself failed due to a transient condition. The agent may retry.
{
"error": "SERVICE_UNAVAILABLE",
"message": "Customer database is temporarily unavailable.",
"retryable": true,
"retry_after_seconds": 5
}
Fatal errors — Stop the agent and escalate to a human.
{
"error": "PERMISSION_DENIED",
"message": "Agent does not have write access to production customer records.",
"retryable": false,
"escalate": true
}
The `escalate` flag is particularly important. When the agent sees it, the loop should terminate and hand off to a human-in-the-loop checkpoint rather than attempting to find a workaround.
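Inside the loop, the three categories reduce to a small dispatch on the error payload. A sketch, assuming the error shape shown above and the Tool class from the previous section:
def handle_tool_error(error: dict, tool: Tool, attempts: int) -> str:
    # Decide the loop's next move from a structured error observation:
    # "stop" escalates to a human, "retry" re-invokes within the tool's
    # budget, "continue" hands the error back to the model as context.
    if error.get("escalate"):
        # Fatal: terminate and hand off, never search for a workaround.
        return "stop"
    if error.get("retryable") and attempts < tool.max_retries:
        return "retry"
    # Nominal errors (and exhausted retries) go back to the model; the
    # "suggestion" field, when present, guides its next step.
    return "continue"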
The Minimal Tool Surface Principle
More tools is not better. Each additional tool increases the probability that the model picks the wrong one, especially when tool names or descriptions overlap. Define the minimal set of tools that covers the task space, and resist the temptation to expose every capability of your underlying APIs.
If you find yourself with more than 15–20 tools in a single agent's registry, the agent probably needs to be decomposed into specialized sub-agents—each with a focused tool set that fits its domain.
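As an illustration, such a decomposition might give each sub-agent only the slice of the registry that matches its domain (the tool names here are hypothetical):
# Hypothetical split of an oversized registry into focused sub-agents.
BILLING_AGENT_TOOLS = [
    "search_customer_records", "read_invoice", "create_credit_note",
]
SUPPORT_AGENT_TOOLS = [
    "search_customer_records", "read_ticket", "create_support_ticket",
    "send_email",
]
# An orchestrator routes each task to the sub-agent whose tool set fits,
# rather than exposing every tool to a single model call.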
Key Takeaways
- Tool names are prompts—use unambiguous verb-noun pairs and maintain consistent verb vocabulary across your entire tool set.
- Schema descriptions are reasoning inputs, not documentation; write them as instructions to an intelligent but uninformed colleague.
- Classify every tool by its side-effect profile and enforce retry limits accordingly—auto-retrying a destructive tool is a production incident.
- Design structured error responses with explicit `retryable` and `escalate` flags so the agent knows what to do next without guessing.
- Minimize the tool surface per agent; more than 15–20 tools signals you need specialized sub-agents with focused registries.
- Validate all tool inputs against the schema before execution—never pass raw model output directly to a function or external API.