Multi-Agent Systems vs One Good Prompt
The Complexity Trap
There is a category of AI engineering mistake that looks like sophistication: taking a task that a single well-crafted prompt handles fine, decomposing it into five sub-agents, wiring them together with a message bus, adding retry logic, handling partial failures — and ending up with a system that is slower, more expensive, harder to debug, and no more accurate than the original prompt.
Multi-agent systems are genuinely powerful. They are also genuinely overused. The framework hype cycle is real, and the default answer to "should I use agents?" is still "probably not yet."
What Multi-Agent Actually Costs
Before the capability comparison, get clear on what you pay for orchestration.
Latency Tax
Every agent boundary adds a round-trip. A three-agent pipeline where each agent takes 2 seconds adds 6 seconds of sequential latency minimum — before accounting for orchestrator calls, serialization, and error handling. Parallel agents help, but coordination overhead is real.
Single prompt: ~3 seconds. Sequential three-agent pipeline: ~6 seconds. Parallel fan-out: ~2 seconds plus an aggregation call. For interactive UX, the latency math often kills the multi-agent approach before you even get to accuracy.
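A quick back-of-the-envelope model makes the tradeoff concrete. The per-agent times, orchestration overhead, and aggregation cost below are illustrative assumptions, not benchmarks:

# Rough wall-clock model (illustrative numbers, not measurements)
def sequential_wall_time(agent_seconds: list[float], overhead_per_hop: float = 0.3) -> float:
    # Each agent waits on the previous one; every boundary adds orchestration overhead
    return sum(agent_seconds) + overhead_per_hop * len(agent_seconds)

def parallel_wall_time(agent_seconds: list[float], aggregator_seconds: float = 1.0) -> float:
    # Fan-out: wall time is bounded by the slowest agent, plus one aggregation call
    return max(agent_seconds) + aggregator_seconds

print(sequential_wall_time([2.0, 2.0, 2.0]))  # ~6.9s
print(parallel_wall_time([2.0, 2.0, 2.0]))    # ~3.0s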
Cost Tax
Each agent gets its own context. If your orchestrator summarizes the task and sends it to three sub-agents, you are paying for the orchestrator's input tokens plus three separate sub-agent contexts. With long system prompts, this multiplies fast.
# Rough cost model
def estimate_pipeline_cost(
    orchestrator_tokens: int,
    agent_contexts: list[int],  # input tokens per sub-agent
    output_tokens_per_agent: int,
    n_requests: int,
) -> float:
    # Illustrative Opus-class pricing in USD per token; check current rates for your model
    INPUT_PRICE = 15.0 / 1_000_000
    OUTPUT_PRICE = 75.0 / 1_000_000
    orchestrator_cost = orchestrator_tokens * INPUT_PRICE + output_tokens_per_agent * OUTPUT_PRICE
    agent_cost = sum(ctx * INPUT_PRICE + output_tokens_per_agent * OUTPUT_PRICE for ctx in agent_contexts)
    total_per_request = orchestrator_cost + agent_cost
    return total_per_request * n_requests

# 3-agent pipeline vs single prompt, 10,000 requests per month
single_prompt = estimate_pipeline_cost(8_000, [], 1_000, 10_000)
pipeline = estimate_pipeline_cost(8_000, [6_000, 6_000, 6_000], 1_000, 10_000)
print(f"Single: ${single_prompt:,.0f}/month")    # ~$1,950
print(f"Pipeline: ${pipeline:,.0f}/month")       # ~$6,900
A 3-agent pipeline costs 3–5× more than an equivalent single prompt for the same task.
Debug Tax
When a multi-agent pipeline gives a wrong answer, the failure could be in any agent. You need logging at every boundary to diagnose. Trace correlation, input/output capture, timing — all of this is boilerplate that does not exist in a single prompt invocation where you have exactly one input and one output.
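To make that boilerplate concrete, here is a minimal sketch of the boundary instrumentation you end up writing. The wrapper, field names, and stand-in agent function are illustrative, not from any particular framework:

import json
import logging
import time
import uuid

logger = logging.getLogger("pipeline")

def traced_agent_call(agent_name: str, trace_id: str, call_fn, payload: str) -> str:
    # Wrap every agent boundary: capture timing, input/output sizes, and a trace_id to correlate hops
    start = time.perf_counter()
    try:
        result = call_fn(payload)
        logger.info(json.dumps({
            "trace_id": trace_id,
            "agent": agent_name,
            "elapsed_s": round(time.perf_counter() - start, 2),
            "input_chars": len(payload),
            "output_chars": len(result),
        }))
        return result
    except Exception:
        logger.exception("agent %s failed (trace_id=%s)", agent_name, trace_id)
        raise

# Usage with a stand-in agent function; a real one would call the model
def extractor(payload: str) -> str:
    return payload.upper()

result = traced_agent_call("extractor", str(uuid.uuid4()), extractor, "raw document text")

Every hop in the pipeline needs something like this. A single prompt needs none of it.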
When One Prompt Wins
Case 1: The Task is Sequential Reasoning
Code review, document summarization, Q&A, classification, extraction — all of these are sequential reasoning tasks. One model reading a document and producing an output is exactly what LLMs are trained to do. Breaking this into "Extractor Agent → Analyzer Agent → Formatter Agent" adds latency and cost, and each intermediate agent sees less of the original context than a single model would in one prompt.
Test: Can you describe the task in a single coherent system prompt with clear output format instructions? If yes, use a single prompt.
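For comparison, the single-prompt version of the extract → analyze → format pipeline above is one call with explicit output-format instructions. The system prompt, model id, and token limit here are illustrative:

import anthropic

client = anthropic.Anthropic()

SYSTEM = (
    "You review a document and respond with three clearly labeled sections:\n"
    "1. Extracted facts\n"
    "2. Analysis\n"
    "3. Executive summary (under 200 words)"
)

def review_document(document: str) -> str:
    # One call, one input, one output: the whole "pipeline" in a single prompt
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=SYSTEM,
        messages=[{"role": "user", "content": document}],
    )
    return response.content[0].text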
Case 2: Latency Matters
Real-time applications cannot absorb 6-second multi-agent pipelines. If your p50 latency target is under 3 seconds, you almost certainly cannot use a sequential multi-agent approach. Measure before you architect.
Case 3: You Need to Debug Quickly
Single prompts are trivially debuggable. Input, output, done. When you ship a feature under time pressure and it does not work in production, you want one LLM call to inspect, not a distributed trace across five agents.
Case 4: The Subtasks Share Context
If Agent B needs to know what Agent A decided, you cannot cleanly separate them. Passing summaries between agents loses information. The model works best when it holds the full context simultaneously — that is a single prompt.
When Sub-Agents Earn Their Keep
Case 1: True Parallelism with Independent Outputs
Analyze 50 customer reviews simultaneously. Each review is independent — the analysis of review 1 does not affect the analysis of review 2. Running the reviews in parallel cuts wall time from fifty sequential calls to roughly the latency of one, plus aggregation. This is the canonical "agents worth it" case.
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()

async def analyze_review(review: str, agent_id: int) -> dict:
    # Each review gets its own independent call; a cheaper model is enough here
    response = await client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": f"Analyze sentiment and key themes:\n\n{review}"}],
    )
    return {"id": agent_id, "analysis": response.content[0].text}

async def analyze_reviews_parallel(reviews: list[str]) -> list[dict]:
    tasks = [analyze_review(r, i) for i, r in enumerate(reviews)]
    return await asyncio.gather(*tasks)
Note: use the cheapest model that can handle the parallel subtasks. Haiku costs a fraction of Opus per token, and the quality difference is small for fan-out work that does not need deep reasoning.
Case 2: Context Window Cannot Hold the Full Task
A task that requires reasoning over 2M tokens of data cannot fit in any single context window. The solution is hierarchical: agents process chunks, a synthesizer aggregates. This is the "map-reduce" pattern for LLMs and it is legitimate.
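A minimal sketch of that map-reduce shape, built on the same async client as the earlier example. The chunk size, prompts, and model choices are assumptions to adjust for your data:

import asyncio
import anthropic

client = anthropic.AsyncAnthropic()

async def summarize_chunk(chunk: str) -> str:
    # Map step: compress one chunk independently; cheap model, no cross-chunk context needed
    response = await client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": f"Summarize the key facts in this excerpt:\n\n{chunk}"}],
    )
    return response.content[0].text

async def map_reduce_answer(corpus: str, question: str, chunk_chars: int = 200_000) -> str:
    chunks = [corpus[i:i + chunk_chars] for i in range(0, len(corpus), chunk_chars)]
    summaries = await asyncio.gather(*(summarize_chunk(c) for c in chunks))
    # Reduce step: the synthesizer reasons over compressed summaries instead of the raw corpus
    joined = "\n\n".join(summaries)
    response = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"Using these chunk summaries:\n\n{joined}\n\nAnswer this question: {question}"}],
    )
    return response.content[0].text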
Case 3: Specialized Capabilities Require Different Models
A code generation agent (Sonnet) feeding into a code execution agent (running in a sandbox) feeding into a test evaluation agent (Haiku) — each step uses the model and environment appropriate to its task. This is not just organizational preference; it can actually reduce cost and improve quality by matching model capability to subtask complexity.
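One way to express that matching is a simple stage-to-model assignment. Everything below — the stage names, model ids, and placeholder functions — is a hypothetical sketch of the shape, not a prescribed structure:

from dataclasses import dataclass
from typing import Callable

# Placeholder stage implementations; real ones would call a model or a sandboxed runner
def generate_code(task: str) -> str:
    return f"<code for: {task}>"              # e.g. a Sonnet-class call

def run_in_sandbox(code: str) -> str:
    return f"<execution output of: {code}>"   # no LLM here, just an isolated runner

def evaluate_tests(results: str) -> str:
    return f"<pass/fail verdict for: {results}>"  # e.g. a Haiku-class call

@dataclass
class Stage:
    name: str
    run: Callable[[str], str]

PIPELINE = [
    Stage("generate", generate_code),
    Stage("execute", run_in_sandbox),
    Stage("evaluate", evaluate_tests),
]

def run_pipeline(task: str) -> str:
    output = task
    for stage in PIPELINE:
        output = stage.run(output)
    return output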
Case 4: Human-in-the-Loop Between Steps
If your workflow requires human approval between stages (draft → review → publish, or plan → human approves → execute), agent boundaries map naturally to those approval gates. The agents are not just organizational — they represent real checkpoints in the workflow.
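A sketch of how an approval gate maps onto an agent boundary, using a console prompt as a stand-in for whatever approval UI your workflow actually has; the stage functions are placeholders for real agent calls:

def plan_stage(task: str) -> str:
    # Stand-in for a planning agent call
    return f"Proposed plan for: {task}"

def execute_stage(plan: str) -> str:
    # Stand-in for an execution agent call
    return f"Executed: {plan}"

def run_with_approval(task: str) -> str:
    plan = plan_stage(task)
    print(plan)
    # The agent boundary doubles as the checkpoint: nothing executes until a human approves
    if input("Approve this plan? [y/N] ").strip().lower() != "y":
        return "Aborted before execution."
    return execute_stage(plan)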
The "Multi-Agent" Prompt Pattern
Before building real agent infrastructure, try this: give a single model a persona for each "agent" and ask it to simulate the pipeline.
You will approach this task as three specialists working sequentially:
**Researcher**: Extract all relevant facts from the provided documents.
**Analyst**: Given the extracted facts, identify patterns and implications.
**Writer**: Given the analysis, draft a clear executive summary.
Work through each role in order. Label each section clearly.
Documents: [...]
Task: [...]
This covers 40% of the cases where people reach for multi-agent. It is slower than parallel agents but faster to build, cheaper, and easier to debug. If the simulated pipeline gives good results, consider whether you actually need the real infrastructure.
Decision Checklist
Before adding an agent to your architecture, answer these:
- Does this subtask need to run in parallel with others? If no, question whether the boundary is necessary.
- Is the input context too large for one model? If no, a single call probably works.
- Does this subtask have a different latency budget? If no, sequential agents just add latency.
- Will a wrong result from this agent cascade and corrupt downstream agents? If yes, you need robust error handling at every boundary — that is real engineering work.
- Can you test and debug this agent independently? If no, the abstraction is wrong.
- Does this subtask use a different model or tool environment? This is the strongest justification for a boundary.
Count the answers that favor a boundary: 0–2 → use a single prompt. 3–4 → consider agents carefully. 5–6 → multi-agent is probably the right call.
The Honest Recommendation
Start with a single prompt. Get it working. Measure quality and latency. If quality is limited by context (the model cannot see all relevant information simultaneously), add RAG. If latency is limited by sequential work that could be parallelized, add parallel sub-agents for those specific operations. If a stage genuinely requires a different model or execution environment, add that agent boundary.
Build incrementally. Each added agent should have a measured justification: "adding this parallel agent reduced p50 latency from 8s to 2s" or "splitting the extraction step reduced error rate from 15% to 3%." If you cannot state the measured improvement, you do not know if the agent is earning its keep.
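If you do not already have those numbers, a few lines around the call site are enough to compare latency percentiles before and after an architectural change. This harness is a minimal sketch; fn stands in for whichever pipeline variant you are evaluating:

import statistics
import time

def measure_latency(fn, inputs: list, percentiles=(50, 95)) -> dict:
    # Run the candidate pipeline over representative inputs and report latency percentiles
    samples = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        samples.append(time.perf_counter() - start)
    cuts = statistics.quantiles(samples, n=100)
    return {f"p{p}": round(cuts[p - 1], 2) for p in percentiles}

# Compare variants on the same representative inputs:
# measure_latency(single_prompt_fn, eval_inputs) vs measure_latency(pipeline_fn, eval_inputs)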
The teams with the best AI systems I have seen are not the ones with the most agents. They are the ones who kept it simple until simplicity was provably insufficient.
Key Takeaways
- A 3-agent sequential pipeline costs 3–5× more in tokens and adds 2–4 seconds of latency compared to an equivalent single prompt — both require explicit justification.
- Single prompts win for sequential reasoning tasks, latency-sensitive UX, fast debugging needs, and tasks where subtasks share context.
- Sub-agents earn their keep for true parallelism over independent inputs, map-reduce over corpora exceeding context limits, and workflows with real human-in-the-loop checkpoints.
- Try the "simulated agents" single-prompt pattern before building real orchestration — it covers a surprising fraction of use cases.
- Use cheaper models (Haiku instead of Opus) for parallel subtasks that do not require deep reasoning; the quality difference is usually minimal and the per-token savings are substantial.
- Add agents incrementally with measured justification: each boundary should correspond to a specific, quantified improvement in latency, cost, or quality.