Multi-Agent Systems vs One Good Prompt
The Complexity Trap
There is a category of AI engineering mistake that looks like sophistication: taking a task that a single well-crafted prompt handles fine, decomposing it into five sub-agents, wiring them together with a message bus, adding retry logic, handling partial failures — and ending up with a system that is slower, more expensive, harder to debug, and no more accurate than the original prompt.
Multi-agent systems are genuinely powerful. They are also genuinely overused. The framework hype cycle is real, and the default answer to "should I use agents?" is still "probably not yet."
What Multi-Agent Actually Costs
Before the capability comparison, get clear on what you pay for orchestration.
Latency Tax
Every agent boundary adds a round-trip. A three-agent pipeline where each agent takes 2 seconds adds 6 seconds of sequential latency minimum — before accounting for orchestrator calls, serialization, and error handling. Parallel agents help, but coordination overhead is real.
Single prompt: ~3 seconds. Sequential three-agent pipeline: ~6 seconds. Parallel fan-out: ~2 seconds plus an aggregation call. For interactive UX, the latency math often kills the multi-agent approach before you even get to accuracy.
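A quick back-of-the-envelope model makes the tradeoff concrete. The per-agent times, orchestration overhead, and aggregation cost below are illustrative assumptions, not benchmarks:

# Rough wall-clock model (illustrative numbers, not measurements)
def sequential_wall_time(agent_seconds: list[float], overhead_per_hop: float = 0.3) -> float:
    # Each agent waits on the previous one; every boundary adds orchestration overhead
    return sum(agent_seconds) + overhead_per_hop * len(agent_seconds)

def parallel_wall_time(agent_seconds: list[float], aggregator_seconds: float = 1.0) -> float:
    # Fan-out: wall time is bounded by the slowest agent, plus one aggregation call
    return max(agent_seconds) + aggregator_seconds

print(sequential_wall_time([2.0, 2.0, 2.0]))  # ~6.9s
print(parallel_wall_time([2.0, 2.0, 2.0]))    # ~3.0s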
Cost Tax
Each agent gets its own context. If your orchestrator summarizes the task and sends it to three sub-agents, you are paying for the orchestrator's input tokens plus three separate sub-agent contexts. With long system prompts, this multiplies fast.
# Rough cost model
def estimate_pipeline_cost(
    orchestrator_tokens: int,
    agent_contexts: list[int],  # input tokens per sub-agent
    output_tokens_per_agent: int,
    n_requests: int,
) -> float:
    # Illustrative Opus-class pricing in USD per token; check current rates for your model
    INPUT_PRICE = 15.0 / 1_000_000
    OUTPUT_PRICE = 75.0 / 1_000_000
    orchestrator_cost = orchestrator_tokens * INPUT_PRICE + output_tokens_per_agent * OUTPUT_PRICE
    agent_cost = sum(ctx * INPUT_PRICE + output_tokens_per_agent * OUTPUT_PRICE for ctx in agent_contexts)
    total_per_request = orchestrator_cost + agent_cost
    return total_per_request * n_requests

# 3-agent pipeline vs single prompt, 10,000 requests per month
single_prompt = estimate_pipeline_cost(8_000, [], 1_000, 10_000)
pipeline = estimate_pipeline_cost(8_000, [6_000, 6_000, 6_000], 1_000, 10_000)
print(f"Single: ${single_prompt:,.0f}/month")    # ~$1,950
print(f"Pipeline: ${pipeline:,.0f}/month")       # ~$6,900
A 3-agent pipeline costs 3–5× more than an equivalent single prompt for the same task.
Debug Tax
When a multi-agent pipeline gives a wrong answer, the failure could be in any agent. You need logging at every boundary to diagnose. Trace correlation, input/output capture, timing — all of this is boilerplate that does not exist in a single prompt invocation where you have exactly one input and one output.
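To make that boilerplate concrete, here is a minimal sketch of the boundary instrumentation you end up writing. The wrapper, field names, and stand-in agent function are illustrative, not from any particular framework:

import json
import logging
import time
import uuid

logger = logging.getLogger("pipeline")

def traced_agent_call(agent_name: str, trace_id: str, call_fn, payload: str) -> str:
    # Wrap every agent boundary: capture timing, input/output sizes, and a trace_id to correlate hops
    start = time.perf_counter()
    try:
        result = call_fn(payload)
        logger.info(json.dumps({
            "trace_id": trace_id,
            "agent": agent_name,
            "elapsed_s": round(time.perf_counter() - start, 2),
            "input_chars": len(payload),
            "output_chars": len(result),
        }))
        return result
    except Exception:
        logger.exception("agent %s failed (trace_id=%s)", agent_name, trace_id)
        raise

# Usage with a stand-in agent function; a real one would call the model
def extractor(payload: str) -> str:
    return payload.upper()

result = traced_agent_call("extractor", str(uuid.uuid4()), extractor, "raw document text")

Every hop in the pipeline needs something like this. A single prompt needs none of it.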
When One Prompt Wins
Case 1: The Task is Sequential Reasoning
Code review, document summarization, Q&A, classification, extraction — all of these are sequential reasoning tasks. One model reading a document and producing an output is exactly what LLMs are trained to do. Breaking this into "Extractor Agent → Analyzer Agent → Formatter Agent" adds latency and cost, and each intermediate agent sees less of the original context than a single model would in one prompt.
Test: Can you describe the task in a single coherent system prompt with clear output format instructions? If yes, use a single prompt.
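For comparison, the single-prompt version of the extract → analyze → format pipeline above is one call with explicit output-format instructions. The system prompt, model id, and token limit here are illustrative:

import anthropic

client = anthropic.Anthropic()

SYSTEM = (
    "You review a document and respond with three clearly labeled sections:\n"
    "1. Extracted facts\n"
    "2. Analysis\n"
    "3. Executive summary (under 200 words)"
)

def review_document(document: str) -> str:
    # One call, one input, one output: the whole "pipeline" in a single prompt
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=SYSTEM,
        messages=[{"role": "user", "content": document}],
    )
    return response.content[0].text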
Case 2: Latency Matters
Real-time applications cannot absorb 6-second multi-agent pipelines. If your p50 latency target is under 3 seconds, you almost certainly cannot use a sequential multi-agent approach. Measure before you architect.
Case 3: You Need to Debug Quickly
Single prompts are trivially debuggable. Input, output, done. When you ship a feature under time pressure and it does not work in production, you want one LLM call to inspect, not a distributed trace across five agents.
Case 4: The Subtasks Share Context
If Agent B needs to know what Agent A decided, you cannot cleanly separate them. Passing summaries between agents loses information. The model works best when it holds the full context simultaneously — that is a single prompt.
When Sub-Agents Earn Their Keep
Case 1: True Parallelism with Independent Outputs
Analyze 50 customer reviews simultaneously. Each review is independent — the analysis of review 1 does not affect the analysis of review 2. Running the reviews in parallel cuts wall time from fifty sequential calls to roughly the latency of one, plus aggregation. This is the canonical "agents worth it" case.
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()

async def analyze_review(review: str, agent_id: int) -> dict:
    # Each review gets its own independent call; a cheaper model is enough here
    response = await client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": f"Analyze sentiment and key themes:\n\n{review}"}],
    )
    return {"id": agent_id, "analysis": response.content[0].text}

async def analyze_reviews_parallel(reviews: list[str]) -> list[dict]:
    tasks = [analyze_review(r, i) for i, r in enumerate(reviews)]
    return await asyncio.gather(*tasks)
Note: use the cheapest model that can handle the parallel subtasks. Haiku costs a fraction of Opus per token, and the quality difference is small for fan-out work that does not need deep reasoning.
Case 2: Context Window Cannot Hold the Full Task
A task that requires reasoning over 2M tokens of data cannot fit in any single context window. The solution is hierarchical: agents process chunks, a synthesizer aggregates. This is the "map-reduce" pattern for LLMs and it is legitimate.
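A minimal sketch of that map-reduce shape, built on the same async client as the earlier example. The chunk size, prompts, and model choices are assumptions to adjust for your data:

import asyncio
import anthropic

client = anthropic.AsyncAnthropic()

async def summarize_chunk(chunk: str) -> str:
    # Map step: compress one chunk independently; cheap model, no cross-chunk context needed
    response = await client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": f"Summarize the key facts in this excerpt:\n\n{chunk}"}],
    )
    return response.content[0].text

async def map_reduce_answer(corpus: str, question: str, chunk_chars: int = 200_000) -> str:
    chunks = [corpus[i:i + chunk_chars] for i in range(0, len(corpus), chunk_chars)]
    summaries = await asyncio.gather(*(summarize_chunk(c) for c in chunks))
    # Reduce step: the synthesizer reasons over compressed summaries instead of the raw corpus
    joined = "\n\n".join(summaries)
    response = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"Using these chunk summaries:\n\n{joined}\n\nAnswer this question: {question}"}],
    )
    return response.content[0].text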
Case 3: Specialized Capabilities Require Different Models
A code generation agent (Sonnet) feeding into a code execution agent (running in a sandbox) feeding into a test evaluation agent (Haiku) — each step uses the model and environment appropriate to its task. This is not just organizational preference; it can actually reduce cost and improve quality by matching model capability to subtask complexity.
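One way to express that matching is a simple stage-to-model assignment. Everything below — the stage names, model ids, and placeholder functions — is a hypothetical sketch of the shape, not a prescribed structure:

from dataclasses import dataclass
from typing import Callable

# Placeholder stage implementations; real ones would call a model or a sandboxed runner
def generate_code(task: str) -> str:
    return f"<code for: {task}>"              # e.g. a Sonnet-class call

def run_in_sandbox(code: str) -> str:
    return f"<execution output of: {code}>"   # no LLM here, just an isolated runner

def evaluate_tests(results: str) -> str:
    return f"<pass/fail verdict for: {results}>"  # e.g. a Haiku-class call

@dataclass
class Stage:
    name: str
    run: Callable[[str], str]

PIPELINE = [
    Stage("generate", generate_code),
    Stage("execute", run_in_sandbox),
    Stage("evaluate", evaluate_tests),
]

def run_pipeline(task: str) -> str:
    output = task
    for stage in PIPELINE:
        output = stage.run(output)
    return output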
Case 4: Human-in-the-Loop Between Steps
If your workflow requires human approval between stages (draft → review → publish, or plan → human approves → execute), agent boundaries map naturally to those approval gates. The agents are not just organizational — they represent real checkpoints in the workflow.
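A sketch of how an approval gate maps onto an agent boundary, using a console prompt as a stand-in for whatever approval UI your workflow actually has; the stage functions are placeholders for real agent calls:

def plan_stage(task: str) -> str:
    # Stand-in for a planning agent call
    return f"Proposed plan for: {task}"

def execute_stage(plan: str) -> str:
    # Stand-in for an execution agent call
    return f"Executed: {plan}"

def run_with_approval(task: str) -> str:
    plan = plan_stage(task)
    print(plan)
    # The agent boundary doubles as the checkpoint: nothing executes until a human approves
    if input("Approve this plan? [y/N] ").strip().lower() != "y":
        return "Aborted before execution."
    return execute_stage(plan)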
The "Multi-Agent" Prompt Pattern
Before building real agent infrastructure, try this: give a single model a persona for each "agent" and ask it to simulate the pipeline.
You will approach this task as three specialists working sequentially:
**Researcher**: Extract all relevant facts from the provided documents.
**Analyst**: Given the extracted facts, identify patterns and implications.
**Writer**: Given the analysis, draft a clear executive summary.
Work through each role in order. Label each section clearly.
Documents: [...]
Task: [...]
This covers 40% of the cases where people reach for multi-agent. It is slower than parallel agents but faster to build, cheaper, and easier to debug. If the simulated pipeline gives good results, consider whether you actually need the real infrastructure.
Decision Checklist
Before adding an agent to your architecture, answer these:
- Does this subtask need to run in parallel with others? If no, question whether the boundary is necessary.
- Is the input context too large for one model? If no, a single call probably works.
- Does this subtask have a different latency budget? If no, sequential agents just add latency.
- Will a wrong result from this agent cascade and corrupt downstream agents? If yes, you need robust error handling at every boundary — that is real engineering work.
- Can you test and debug this agent independently? If no, the abstraction is wrong.
- Does this subtask use a different model or tool environment? This is the strongest justification for a boundary.
Count the answers that favor a boundary: 0–2 → use a single prompt. 3–4 → consider agents carefully. 5–6 → multi-agent is probably the right call.
The Honest Recommendation
Start with a single prompt. Get it working. Measure quality and latency. If quality is limited by context (the model cannot see all relevant information simultaneously), add RAG. If latency is limited by sequential work that could be parallelized, add parallel sub-agents for those specific operations. If a stage genuinely requires a different model or execution environment, add that agent boundary.
Build incrementally. Each added agent should have a measured justification: "adding this parallel agent reduced p50 latency from 8s to 2s" or "splitting the extraction step reduced error rate from 15% to 3%." If you cannot state the measured improvement, you do not know if the agent is earning its keep.
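If you do not already have those numbers, a few lines around the call site are enough to compare latency percentiles before and after an architectural change. This harness is a minimal sketch; fn stands in for whichever pipeline variant you are evaluating:

import statistics
import time

def measure_latency(fn, inputs: list, percentiles=(50, 95)) -> dict:
    # Run the candidate pipeline over representative inputs and report latency percentiles
    samples = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        samples.append(time.perf_counter() - start)
    cuts = statistics.quantiles(samples, n=100)
    return {f"p{p}": round(cuts[p - 1], 2) for p in percentiles}

# Compare variants on the same representative inputs:
# measure_latency(single_prompt_fn, eval_inputs) vs measure_latency(pipeline_fn, eval_inputs)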
The teams with the best AI systems I have seen are not the ones with the most agents. They are the ones who kept it simple until simplicity was provably insufficient.
Key Takeaways
- A 3-agent sequential pipeline costs 3–5× more in tokens and adds 2–4 seconds of latency compared to an equivalent single prompt — both require explicit justification.
- Single prompts win for sequential reasoning tasks, latency-sensitive UX, fast debugging needs, and tasks where subtasks share context.
- Sub-agents earn their keep for true parallelism over independent inputs, map-reduce over corpora exceeding context limits, and workflows with real human-in-the-loop checkpoints.
- Try the "simulated agents" single-prompt pattern before building real orchestration — it covers a surprising fraction of use cases.
- Use cheaper models (Haiku instead of Opus) for parallel subtasks that do not require deep reasoning; the quality difference is usually minimal and the per-token savings are substantial.
- Add agents incrementally with measured justification: each boundary should correspond to a specific, quantified improvement in latency, cost, or quality.