Tool Design for Autonomy
Series: Agent Engineering
The model is not the bottleneck. The tools are. A capable LLM paired with a poorly designed tool set will produce an agent that hallucinates parameters, misinterprets results, retries destructive operations, and generally behaves unpredictably in ways that are hard to trace back to a root cause.
Tool design for autonomous systems is a discipline in its own right—closer to API design than to prompt engineering—and most teams treat it as an afterthought. That is a mistake you pay for in debugging hours and production incidents.
The Contract a Tool Must Fulfill
Every tool exposed to an agent is a contract. The agent reads the contract (the schema and description), decides to invoke the tool, and expects the contract to hold. When it does not, the agent's reasoning goes sideways in ways that compound over multiple iterations.
A well-formed tool contract has four parts:
- A name that signals intent and scope
- A schema that types and constrains every parameter
- A side-effect classification that governs whether a call can be retried
- An error contract that tells the model how to recover or when to stop
Neglect any one of these and you have introduced a latent defect that will manifest unpredictably depending on the agent's reasoning trace.
Naming: Verb-Noun, No Overloading
Tool names are prompts. The model reads them to decide which tool fits the current intent. Ambiguous names create ambiguous decisions.
Bad names:
- `process` — process what? how?
- `data_tool` — entirely opaque
- `handle_request` — could be anything
- `get` — overloaded to the point of uselessness
Good names:
- `search_web` — unambiguous action and domain
- `read_file` — clear, single responsibility
- `send_email` — obvious side effect
- `delete_record` — the word "delete" is a signal to the model that this is destructive
Use consistent verb vocabulary across your tool set. Pick one word for retrieval (`get_`, `fetch_`, `read_`), one for mutation (`update_`, `write_`, `set_`), one for creation (`create_`, `add_`, `insert_`), and stick to it. The model will learn the pattern within a session.
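For illustration, here is a hypothetical tool set that follows one such convention, with read_ for all retrieval, update_ for all mutation, and create_ for all creation (the names are invented for this example):
# Hypothetical CRM agent tool set with one verb per operation class.
CRM_TOOLS = [
    "read_customer",          # retrieval
    "read_invoice",
    "update_customer_email",  # mutation of existing state
    "update_invoice_status",
    "create_support_ticket",  # creation of new state
    "create_refund_request",
]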
Schemas: Typed, Constrained, Exemplified
JSON Schema is your primary communication channel with the model. Write it as if the model has never seen your domain before—because for every new task, it effectively has not.
# Weak schema — too loose, model must guess
weak_schema = {
"name": "search_records",
"description": "Search for records.",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"},
"options": {"type": "object"}
}
}
}
# Strong schema — constrained, documented, exemplified
strong_schema = {
"name": "search_customer_records",
"description": (
"Search the customer database by name, email, or account ID. "
"Returns at most `limit` matching records ordered by relevance. "
"Use this when you need to look up a specific customer before "
"reading or modifying their data."
),
"parameters": {
"type": "object",
"required": ["query"],
"properties": {
"query": {
"type": "string",
"description": "Search term: customer name, email address, or account ID (e.g. 'ACC-1234').",
"minLength": 2,
"maxLength": 200,
},
"limit": {
"type": "integer",
"description": "Maximum number of results to return. Defaults to 10, max 50.",
"default": 10,
"minimum": 1,
"maximum": 50,
},
"include_inactive": {
"type": "boolean",
"description": "If true, include deactivated accounts in results. Default false.",
"default": False,
},
},
"additionalProperties": False,
},
}
The description fields are not just documentation. They are part of the reasoning input. Write them as instructions to an intelligent but uninformed junior engineer, because that is essentially what you are doing.
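A schema this strict is also machine-checkable: arguments can be validated before the tool function ever runs, so raw model output never reaches your code or your APIs directly. A minimal sketch, assuming the third-party jsonschema package is available:
from jsonschema import ValidationError, validate

def validate_tool_args(tool_schema: dict, args: dict) -> dict | None:
    # Check model-provided arguments against the tool's parameter schema
    # before execution. Returns None on success, or a structured error
    # the agent can observe and correct on its next iteration.
    try:
        validate(instance=args, schema=tool_schema["parameters"])
        return None
    except ValidationError as exc:
        return {
            "error": "INVALID_ARGUMENTS",
            "message": exc.message,
            "retryable": False,
        }

# Example: the model omitted the required "query" field.
error = validate_tool_args(strong_schema, {"limit": 5})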
Side Effects: Classify Every Tool
Agents retry. When a tool call times out or returns an ambiguous error, the model will attempt to call it again. If that tool has side effects—especially destructive ones—retries are dangerous.
Classify every tool by its side-effect profile:
| Class | Description | Retry Safe? |
|---|---|---|
| Read | Returns data, no mutation | Yes |
| Idempotent Write | Repeating the call leaves the same end state | Yes |
| Non-idempotent Write | Creates new state each call | No |
| Destructive | Deletes or irreversibly mutates | Never auto-retry |
Encode this classification in your tool registry and enforce it in your loop:
from enum import Enum
from dataclasses import dataclass, field
from typing import Callable
class SideEffectClass(Enum):
READ = "read"
IDEMPOTENT_WRITE = "idempotent_write"
NON_IDEMPOTENT_WRITE = "non_idempotent_write"
DESTRUCTIVE = "destructive"
@dataclass
class Tool:
name: str
description: str
schema: dict
fn: Callable
side_effect: SideEffectClass
max_retries: int = field(init=False)
def __post_init__(self):
self.max_retries = {
SideEffectClass.READ: 3,
SideEffectClass.IDEMPOTENT_WRITE: 2,
SideEffectClass.NON_IDEMPOTENT_WRITE: 0,
SideEffectClass.DESTRUCTIVE: 0,
        }[self.side_effect]
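The dataclass handles the registry side. The loop side has to respect the same budget before re-invoking a failed tool; below is a minimal sketch, assuming tool functions raise a hypothetical TransientToolError on recoverable failures:
import time

class TransientToolError(Exception):
    """Hypothetical exception raised by a tool function on a transient failure."""

def execute_with_retry(tool: Tool, args: dict):
    # Invoke a tool, retrying only within its side-effect-derived budget.
    attempts = 0
    while True:
        try:
            return tool.fn(**args)
        except TransientToolError as exc:
            attempts += 1
            if attempts > tool.max_retries:
                # Budget exhausted: surface the failure as an observation
                # rather than silently re-invoking a write or delete.
                return {"error": "RETRIES_EXHAUSTED", "message": str(exc),
                        "retryable": False}
            time.sleep(2 ** attempts)  # simple exponential backoff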
Error Contracts: What the Model Should Do Next
When a tool fails, the error message is the next observation. A vague error sends the agent off in the wrong direction. A structured error tells the agent exactly how to recover—or whether to stop.
Design three error categories:
Nominal errors — Expected failures in the normal operating envelope. The agent should handle these autonomously.
{
"error": "NOT_FOUND",
"message": "No customer found matching 'john@example.com'",
"retryable": false,
"suggestion": "Try searching by name or account ID instead."
}
Tool errors — The tool itself failed due to a transient condition. The agent may retry.
{
"error": "SERVICE_UNAVAILABLE",
"message": "Customer database is temporarily unavailable.",
"retryable": true,
"retry_after_seconds": 5
}
Fatal errors — Stop the agent and escalate to a human.
{
"error": "PERMISSION_DENIED",
"message": "Agent does not have write access to production customer records.",
"retryable": false,
"escalate": true
}
The `escalate` flag is particularly important. When the agent sees it, the loop should terminate and hand off to a human-in-the-loop checkpoint rather than attempting to find a workaround.
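Inside the loop, the three categories reduce to a small dispatch on the error payload. A sketch, assuming the error shape shown above and the Tool class from the previous section:
def handle_tool_error(error: dict, tool: Tool, attempts: int) -> str:
    # Decide the loop's next move from a structured error observation:
    # "stop" escalates to a human, "retry" re-invokes within the tool's
    # budget, "continue" hands the error back to the model as context.
    if error.get("escalate"):
        # Fatal: terminate and hand off, never search for a workaround.
        return "stop"
    if error.get("retryable") and attempts < tool.max_retries:
        return "retry"
    # Nominal errors (and exhausted retries) go back to the model; the
    # "suggestion" field, when present, guides its next step.
    return "continue"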
The Minimal Tool Surface Principle
More tools is not better. Each additional tool increases the probability that the model picks the wrong one, especially when tool names or descriptions overlap. Define the minimal set of tools that covers the task space, and resist the temptation to expose every capability of your underlying APIs.
If you find yourself with more than 15–20 tools in a single agent's registry, the agent probably needs to be decomposed into specialized sub-agents—each with a focused tool set that fits its domain.
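As an illustration, such a decomposition might give each sub-agent only the slice of the registry that matches its domain (the tool names here are hypothetical):
# Hypothetical split of an oversized registry into focused sub-agents.
BILLING_AGENT_TOOLS = [
    "search_customer_records", "read_invoice", "create_credit_note",
]
SUPPORT_AGENT_TOOLS = [
    "search_customer_records", "read_ticket", "create_support_ticket",
    "send_email",
]
# An orchestrator routes each task to the sub-agent whose tool set fits,
# rather than exposing every tool to a single model call.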
Key Takeaways
- Tool names are prompts—use unambiguous verb-noun pairs and maintain consistent verb vocabulary across your entire tool set.
- Schema descriptions are reasoning inputs, not documentation; write them as instructions to an intelligent but uninformed colleague.
- Classify every tool by its side-effect profile and enforce retry limits accordingly—auto-retrying a destructive tool is a production incident.
- Design structured error responses with explicit `retryable` and `escalate` flags so the agent knows what to do next without guessing.
- Minimize the tool surface per agent; more than 15–20 tools signals you need specialized sub-agents with focused registries.
- Validate all tool inputs against the schema before execution—never pass raw model output directly to a function or external API.