In our previous posts, we broke down the individual components of production AI agents: how the tool-calling loop works, how system prompts govern behavior, how MCP connects agents to business systems, and how to configure extension points in practice. Each of those posts examined a single agent doing a single job.
This post is about what happens when one agent isn’t enough.
2025 was the year of single AI agents. 2026 is the year they start working together. The AI agent market is growing at 46% year over year, and Gartner projects that 40% of enterprise applications will feature task-specific AI agents by the end of this year — up from less than 5% in 2025. But Gartner also predicts that over 40% of agentic AI projects will be canceled by 2027, and the primary killers are cost overruns, coordination complexity, and inadequate governance.
Multi-agent systems are powerful. They’re also where most of the money gets wasted. Understanding the architecture — what works, what breaks, and what it actually costs — is the difference between a system that compounds value and one that compounds expenses.
When You Actually Need Multiple Agents (And When You Don’t)
Before we get into patterns and code, a critical question: do you need a multi-agent system at all?
The honest answer for most use cases is no. A well-designed single agent with good tools handles the majority of business workflows. Multi-agent complexity is justified only when you hit specific architectural limits:
You need multiple agents when:
- A single prompt can’t hold all the context, instructions, and role constraints for the task (the agent needs to be a specialist, not a generalist)
- The workflow requires genuinely different reasoning strategies at different stages (classification vs. research vs. generation)
- You need parallel execution — multiple independent subtasks that can run simultaneously to reduce latency
- Different stages require different models (expensive frontier models for complex reasoning, cheaper models for high-frequency execution)
- Security boundaries demand isolation — the agent handling customer PII shouldn’t also have access to financial systems
You don’t need multiple agents when:
- A single agent with well-designed tools can handle the workflow end to end
- The complexity is in the tools, not in the reasoning (add more MCP servers, not more agents)
- You’re adding agents because the architecture diagram looks better with more boxes
The principle from our scaling post applies here: start with a single agent, add tools, and only introduce multi-agent patterns when you hit clear architectural limits. LangChain’s own recommendation is blunt: “Start with single agents and good prompt engineering. Introduce tools before adding multiple agents.”
The Five Orchestration Patterns
The multi-agent landscape has converged on five well-defined patterns. Each solves a different coordination problem. Choosing the wrong one is the most common architectural mistake — and it’s usually not recoverable without a rewrite.
Pattern 1: Orchestrator-Worker (Hierarchical)
┌──────────────┐
│ Orchestrator │
│ (Planner) │
└──────┬───────┘
│
┌────────────┼────────────┐
│ │ │
┌─────▼─────┐ ┌───▼────┐ ┌────▼─────┐
│ Worker A │ │Worker B│ │ Worker C │
│(Researcher)│ │(Writer)│ │(Reviewer) │
└────────────┘ └────────┘ └──────────┘
A central orchestrator receives the user’s request, decomposes it into subtasks, delegates to specialized workers, monitors progress, and synthesizes a final response. The orchestrator holds the plan; workers execute independently and report back.
When to use it: Complex, multi-step workflows where you need reasoning transparency and the ability to re-plan mid-execution. This is the pattern Anthropic uses in their own multi-agent research system — a lead agent (Claude Opus) coordinates 3-5 specialized subagents (Claude Sonnet) running in parallel. Their internal evaluation showed it outperformed a single Opus agent by 90.2% on research tasks.
Tradeoff: The orchestrator is a single point of failure and a latency bottleneck. If it makes a bad plan, every worker executes the wrong thing efficiently.
Pattern 2: Sequential Pipeline (Assembly Line)
┌─────────┐ ┌──────────┐ ┌───────────┐ ┌──────────┐
│Classifier│───▶│ Enricher │───▶│ Validator │───▶│ Executor │
└─────────┘ └──────────┘ └───────────┘ └──────────┘
Agents arranged like a manufacturing assembly line. Each agent performs one transformation and passes its output to the next. Linear, deterministic, and easy to debug.
When to use it: Document processing, data pipelines, and any workflow where the stages are fixed and always run in the same order. Google Cloud calls this the “digital assembly line” — a paradigm where specialized agents orchestrate business workflows end to end, like a physical assembly line applied to knowledge work.
Tradeoff: No parallelism. Total latency equals the sum of all stages. If the third agent in a five-stage pipeline fails, stages four and five are blocked.
Pattern 3: Parallel Fan-Out / Gather
┌──────────┐
│ Splitter │
└────┬─────┘
│
┌────────────┼────────────┐
│ │ │
┌─────▼─────┐ ┌───▼────┐ ┌────▼─────┐
│ Style │ │Security│ │Perf │
│ Reviewer │ │Auditor │ │Analyzer │
└─────┬──────┘ └───┬────┘ └────┬─────┘
│ │ │
└────────────┼────────────┘
│
┌──────▼──────┐
│ Synthesizer │
└─────────────┘
Multiple agents operate simultaneously on independent aspects of the same task. A splitter distributes the work; a synthesizer aggregates the results. This pattern can reduce processing time by 60-80% compared to sequential execution.
When to use it: When subtasks are genuinely independent. Code review (style, security, performance can run in parallel), multi-source research, simultaneous data analysis across different dimensions.
Tradeoff: Requires a robust synthesizer that can reconcile potentially conflicting outputs. The splitter must correctly identify independent subtasks — if there are hidden dependencies between them, parallel execution produces inconsistent results.
Pattern 4: Router (Dispatch)
┌────────┐
│ Router │
└───┬────┘
│
┌─────────────┼──────────────┐
│ │ │
┌─────▼──────┐ ┌───▼────┐ ┌──────▼───────┐
│ Billing │ │Technical│ │ General │
│ Specialist │ │ Support │ │ Inquiry │
└────────────┘ └────────┘ └──────────────┘
A routing step classifies input and directs it to the appropriate specialist. Each specialist handles one category independently. The router is typically the cheapest component — a small model or even a rule-based classifier.
When to use it: Customer support triage, request categorization, any workflow where the first decision is “what kind of problem is this?” and each type requires fundamentally different handling.
Tradeoff: Routing errors cascade. If the router sends a billing question to technical support, the specialist will either fail or give a wrong answer. Routing accuracy is the ceiling on system accuracy.
Pattern 5: Generator-Critic
┌───────────┐ ┌────────┐
│ Generator │────▶│ Critic │──── Output
└─────┬─────┘ └───┬────┘
│ │
└───── Retry ◀───┘
One agent generates content; another evaluates it against defined criteria. If the output fails review, it cycles back to the generator with feedback. This separates creation from validation — the generator can be creative while the critic enforces constraints.
When to use it: Content generation, code writing, any workflow where quality standards are well-defined and verifiable. Particularly effective when the generation and evaluation tasks benefit from different prompting strategies or even different models.
Tradeoff: Can loop indefinitely if the generator can’t satisfy the critic’s standards. Always set a maximum retry count and a fallback path.
Building a Multi-Agent System: A Practical Example
Let’s build something concrete — a three-agent customer support triage system that a growing business could actually deploy. This example uses the orchestrator-worker pattern with the Anthropic API, though the architecture translates to any LLM provider.
The system handles incoming support tickets: a router classifies them, a knowledge agent searches documentation for relevant answers, and a response agent drafts a reply.
# multi_agent_support.py
import anthropic
import json
from dataclasses import dataclass
client = anthropic.Anthropic()
# Use different models for different roles —
# expensive reasoning where it matters, cheap execution elsewhere
ROUTER_MODEL = "claude-haiku-4-5-20251001" # Fast, cheap classification
KNOWLEDGE_MODEL = "claude-sonnet-4-6" # Balanced for retrieval
RESPONSE_MODEL = "claude-sonnet-4-6" # Balanced for generation
@dataclass
class Ticket:
id: str
subject: str
body: str
customer_email: str
@dataclass
class AgentResult:
agent: str
success: bool
output: str
tokens_used: int
def run_agent(
model: str,
system_prompt: str,
user_message: str,
tools: list | None = None
) -> AgentResult:
"""Execute a single agent call with error handling."""
try:
kwargs = {
"model": model,
"max_tokens": 1024,
"system": system_prompt,
"messages": [{"role": "user", "content": user_message}],
}
if tools:
kwargs["tools"] = tools
response = client.messages.create(**kwargs)
# Extract text content
text_parts = [
block.text for block in response.content
if block.type == "text"
]
output = "\n".join(text_parts)
return AgentResult(
agent=model,
success=True,
output=output,
tokens_used=response.usage.input_tokens
+ response.usage.output_tokens,
)
except Exception as e:
return AgentResult(
agent=model,
success=False,
output=f"Agent error: {str(e)}",
tokens_used=0,
)
# --- Agent 1: Router ---
ROUTER_PROMPT = """You are a support ticket classifier. Categorize each
ticket into exactly one category and respond with valid JSON only.
Categories:
- billing: payment issues, invoices, subscription changes, refunds
- technical: bugs, errors, integration problems, API issues
- general: feature requests, how-to questions, account setup
- urgent: service outages, security concerns, data loss
Respond with: {"category": "<category>", "confidence": <0.0-1.0>}"""
def route_ticket(ticket: Ticket) -> dict:
"""Classify a ticket using the router agent."""
result = run_agent(
model=ROUTER_MODEL,
system_prompt=ROUTER_PROMPT,
user_message=f"Subject: {ticket.subject}\nBody: {ticket.body}",
)
if not result.success:
# Fallback: route to general queue for human review
return {"category": "general", "confidence": 0.0,
"tokens": result.tokens_used}
try:
classification = json.loads(result.output)
classification["tokens"] = result.tokens_used
return classification
except json.JSONDecodeError:
return {"category": "general", "confidence": 0.0,
"tokens": result.tokens_used}
# --- Agent 2: Knowledge Retrieval ---
KNOWLEDGE_PROMPT = """You are a knowledge retrieval agent for a SaaS
company. Given a support ticket category and content, search the
documentation and return the most relevant information.
You have access to a documentation search tool. Use it to find relevant
articles, then summarize the key information the response agent will
need to draft a reply.
Be concise. Return only the information needed to solve the customer's
problem — not the entire article."""
def search_knowledge(ticket: Ticket, category: str) -> AgentResult:
"""Retrieve relevant documentation for the ticket."""
# In production, this agent would call MCP servers
# connected to your knowledge base, CRM, and order system.
# See: /lab/anatomy-of-ai-agent-mcp/
tools = [
{
"name": "search_docs",
"description": (
"Search the documentation knowledge base. "
"Returns relevant articles and their content."
),
"input_schema": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Search query for documentation",
}
},
"required": ["query"],
},
}
]
return run_agent(
model=KNOWLEDGE_MODEL,
system_prompt=KNOWLEDGE_PROMPT,
user_message=(
f"Category: {category}\n"
f"Subject: {ticket.subject}\n"
f"Body: {ticket.body}"
),
tools=tools,
)
# --- Agent 3: Response Drafting ---
RESPONSE_PROMPT = """You are a customer support response agent. Draft a
helpful, professional reply to the customer's ticket using the knowledge
base information provided.
Rules:
- Address the customer's specific issue directly
- Include relevant steps or documentation links
- Use a warm but professional tone
- If the knowledge base doesn't cover the issue, acknowledge this
and explain that a specialist will follow up
- Keep responses under 200 words
- Never make up information that wasn't in the knowledge base"""
def draft_response(
ticket: Ticket,
category: str,
knowledge: str,
) -> AgentResult:
"""Draft a customer response using retrieved knowledge."""
return run_agent(
model=RESPONSE_MODEL,
system_prompt=RESPONSE_PROMPT,
user_message=(
f"Customer email: {ticket.customer_email}\n"
f"Category: {category}\n"
f"Subject: {ticket.subject}\n"
f"Body: {ticket.body}\n\n"
f"--- Knowledge Base Results ---\n{knowledge}"
),
)
# --- Orchestrator ---
def process_ticket(ticket: Ticket) -> dict:
"""
Full pipeline: route → retrieve knowledge → draft response.
Returns a dict with the classification, draft response,
total tokens used, and whether human review is needed.
"""
total_tokens = 0
# Step 1: Route
classification = route_ticket(ticket)
total_tokens += classification.get("tokens", 0)
category = classification["category"]
confidence = classification["confidence"]
# Circuit breaker: low confidence or urgent → human queue
if confidence < 0.7 or category == "urgent":
return {
"ticket_id": ticket.id,
"category": category,
"confidence": confidence,
"action": "escalate_to_human",
"reason": "low confidence" if confidence < 0.7
else "urgent category",
"draft_response": None,
"total_tokens": total_tokens,
}
# Step 2: Retrieve knowledge
knowledge_result = search_knowledge(ticket, category)
total_tokens += knowledge_result.tokens_used
if not knowledge_result.success:
return {
"ticket_id": ticket.id,
"category": category,
"confidence": confidence,
"action": "escalate_to_human",
"reason": "knowledge retrieval failed",
"draft_response": None,
"total_tokens": total_tokens,
}
# Step 3: Draft response
response_result = draft_response(
ticket, category, knowledge_result.output
)
total_tokens += response_result.tokens_used
return {
"ticket_id": ticket.id,
"category": category,
"confidence": confidence,
"action": "send_draft" if response_result.success
else "escalate_to_human",
"draft_response": response_result.output,
"total_tokens": total_tokens,
}
Several design decisions in this code are worth calling out:
Model tiering is built in. The router uses Haiku (fast, cheap) because classification doesn’t require deep reasoning. The knowledge and response agents use Sonnet (balanced). In a more complex system, you might use Opus for the orchestrator and Haiku for high-frequency workers — matching model cost to reasoning complexity. This single decision can cut multi-agent inference costs by 40-60%.
Circuit breakers prevent cascading failures. Low-confidence classifications and urgent tickets route directly to humans instead of flowing through agents that might handle them poorly. This is the same principle from our system prompt analysis — define what the agent should not attempt, not just what it should do.
Every agent returns a structured result. The AgentResult dataclass includes token counts, success status, and the output. This makes cost tracking and debugging trivial. In production, you’d extend this with latency, model version, and a trace ID that links all three agents’ calls for a single ticket.
Failures degrade gracefully. If any agent fails, the system escalates to a human instead of crashing or producing garbage. The worst outcome is a ticket that takes the same path it would have without any AI — not a wrong answer sent to a customer.
The Protocol Layer: MCP and A2A
Two protocols are shaping how multi-agent systems communicate in production, and they solve different problems:
MCP (Model Context Protocol) handles agent-to-tool communication — how an agent accesses external systems. We covered this in depth in our MCP architecture post. In a multi-agent system, each specialized agent connects to its own set of MCP servers. The knowledge agent connects to the documentation server. The billing agent connects to the payment system server. MCP ensures each agent speaks a consistent protocol regardless of what backend system it’s accessing.
A2A (Agent-to-Agent Protocol) handles agent-to-agent communication — how agents discover each other, negotiate capabilities, and hand off work. Launched by Google in April 2025, A2A enables peer-to-peer coordination without a centralized orchestrator. Each agent publishes an “agent card” describing its capabilities, and other agents can discover and invoke those capabilities at runtime.
Both protocols are now under the Linux Foundation’s Agentic AI Foundation, co-founded by OpenAI, Anthropic, Google, Microsoft, AWS, and Block. This matters because it signals that agent interoperability is becoming infrastructure, not a competitive differentiator.
The practical takeaway: MCP and A2A are complementary, not competing. MCP standardizes how each agent connects to its tools. A2A standardizes how agents connect to each other. Together, they create an interoperable foundation for multi-agent workflows — but adopting them consistently across your agent fleet is what makes them valuable. Inconsistent adoption creates the same integration mess they were designed to eliminate.
The Cost Reality
Multi-agent systems are expensive. Not theoretically — the numbers are specific and predictable if you measure them.
Anthropic’s own data shows that agents use approximately 4x more tokens than a standard chat interaction. Multi-agent systems use roughly 15x more. Here’s why:
Each agent in the system receives a system prompt (200-2,000 tokens), the relevant context (500-5,000 tokens), and produces a response (200-1,000 tokens). A three-agent pipeline processing one ticket consumes 3,000-24,000 tokens. At production scale:
| Scale | Agents | Daily Tickets | Daily Tokens | Monthly Cost (est.) |
|---|---|---|---|---|
| Pilot | 3 | 100 | 1.5M | $15-45 |
| Growth | 3 | 1,000 | 15M | $150-450 |
| Scale | 5 | 5,000 | 125M | $1,250-3,750 |
| Enterprise | 10 | 20,000 | 1B+ | $10,000-30,000 |
The jump from pilot to production is where budgets explode. At 100 tickets per day, a three-agent system is practically free. At 5,000 tickets per day, you’re paying for a junior employee. At 20,000, you’re paying for a team — and you’d better be measuring the ROI to justify it.
Cost optimization strategies that actually work:
Model tiering. Use the cheapest model that meets quality requirements for each role. Classification rarely needs a frontier model. Synthesis often does. The code example above demonstrates this pattern — Haiku for routing, Sonnet for knowledge and response.
Aggressive caching. If 30% of support tickets ask about the same five topics, cache the knowledge retrieval results. This alone can cut token consumption by 20-40% at scale.
Token budgets per agent. Set hard limits on context and response length for each agent. A router that produces 50 tokens per classification doesn’t need a 4,096-token max. Reducing max tokens reduces wasted computation on runaway responses.
Short-circuit common paths. If the router classifies a ticket as a known FAQ category with high confidence, skip the knowledge agent entirely and route to a cached response template. Not every ticket needs the full pipeline.
Watchdog agents. A small, cheap model that monitors the expensive agents for runaway behavior — token budget exceeded, excessive retries, or anomalous patterns. Adds 10-20% to compute cost but prevents the tail-risk scenarios that blow up monthly bills.
The Seven Failure Modes
Academic research (ICLR 2025) identified 14 distinct failure modes in multi-agent systems across 150+ execution traces. We’ve condensed these into the seven patterns that matter most in production, based on what we’ve seen in real deployments.
1. Coordination Tax
Complexity grows combinatorially, not linearly. Three agents have 3 interaction paths. Five agents have 10. Ten agents have 45. Each interaction path is a potential failure point, and debugging requires understanding the full conversation history across all agents.
Mitigation: Keep teams small — 3-5 agents per workflow. If you need more, use hierarchical structures where a team leader agent manages a subgroup, and only team leaders communicate with each other.
2. Token Cost Explosion
The demo costs $6. Production costs $18,000 per month. This isn’t hypothetical — it’s the math for three agents handling 10,000 requests daily at typical token consumption rates.
Mitigation: Model tiering, caching, token budgets, and real-time cost monitoring with threshold alerts. Budget 15-25% of your initial build cost annually for ongoing operational expenses.
3. Latency Cascades
Sequential agents compound latency. If agents A, B, and C take 3, 4, and 5 seconds respectively, the total response time is 12 seconds. Research shows 53% of users abandon interactions after 3 seconds.
Mitigation: Run independent agents in parallel wherever possible. Use async processing for non-critical analysis. Set timeout limits per agent — a hung knowledge retrieval shouldn’t hold up the entire pipeline.
4. The Reliability Paradox
Individual reliability multiplies into system unreliability. Five agents at 95% reliability each produce a system that’s 77% reliable (0.95^5). At 10,000 daily interactions, that’s 2,300 failures instead of 500.
Mitigation: Circuit breakers that gracefully degrade to human handling or single-agent fallbacks. Health checks on each agent. Design the system so that any single agent failing doesn’t produce a wrong answer — it produces an escalation.
5. Role Confusion
Agents exceed their designated responsibilities. A pricing agent approves a contract it should have flagged for review. A research agent executes an action it should have recommended. This happens because LLMs are optimized to be helpful, and “staying in your lane” requires explicit, repeated reinforcement in the system prompt.
Mitigation: Explicit role boundaries in every agent’s system prompt. Static action allowlists — define exactly which tools each agent can call. The system prompt patterns we analyzed in the Claude Code breakdown show how to encode these boundaries effectively: tell the agent what it must never do, not just what it should do.
6. Observability Black Box
When a multi-agent system produces a bad output, which agent was responsible? Debugging multi-agent failures takes 3-5x longer than single-agent failures because you need to trace the full chain of reasoning across multiple agents.
Mitigation: Structured logging at every agent boundary. Unique trace IDs that link all agent calls for a single request. Log the complete input, output, model version, and latency for every agent invocation. Build dashboards that visualize the flow, not just the endpoints.
7. Prompt Injection Across Boundaries
In a multi-agent system, agents trust each other’s outputs by default. If one agent’s output contains injected instructions — whether from a malicious customer input that survived the first agent or from a compromised MCP server response — downstream agents may follow those instructions. Five agents create 20 possible attack vectors across trust boundaries.
Mitigation: Input validation at every agent boundary, not just the first one. Treat every inter-agent message with the same caution you’d apply to external input. The MCP security considerations we covered — server-side sanitization, system prompt hardening, output validation — apply at every agent handoff, not just at the tool-calling layer.
Best Practices for Production Multi-Agent Systems
These are the patterns that separate systems that survive production from the ones that get abandoned:
1. Start with a single agent and add complexity only when you hit limits. This is the most ignored advice in the space. A single well-prompted agent with good tools handles 80% of use cases. The process-first approach applies doubly here — redesign the workflow before deciding how many agents it needs.
2. Treat multi-agent systems as distributed systems. Apply the same rigor you’d use for microservices: circuit breakers, health checks, fallback paths, structured logging, timeout handling, and retry budgets. If your team doesn’t have distributed systems experience, this is a gap that will bite you in production.
3. Use typed schemas between agents. Define explicit data contracts for inter-agent communication. Don’t pass free-text between agents and hope they parse it correctly. JSON schemas with validation at each boundary prevent the most common class of multi-agent failures.
4. Design for failure, not just success. Every agent boundary is a potential failure point. The question isn’t “what happens when this works?” — it’s “what happens when the second agent fails?” If the answer is “the system crashes” or “the customer gets a wrong answer,” redesign the failure path before deploying.
5. Instrument everything from day one. You cannot debug a multi-agent system after the fact. Token counts, latency, routing decisions, confidence scores, error rates — per agent, per request, with trace IDs that connect the full chain. The cost of instrumentation is trivial compared to the cost of debugging a production incident without it.
6. Budget for 2-3x your pilot costs. Every team underestimates the operational cost of multi-agent systems. Token consumption, monitoring infrastructure, escalation handling, prompt iteration cycles, and the engineering time to debug distributed failures. If your ROI calculation only accounts for the pilot-phase cost, it’s wrong.
Where This Is Heading
Multi-agent systems in 2026 are roughly where microservices were in 2015 — the architecture is sound, the tooling is immature, and most teams are learning expensive lessons that will become standard knowledge in two years.
The trajectory is clear: standardized protocols (MCP for tools, A2A for agent coordination), framework consolidation (LangGraph and CrewAI emerging as the production leaders), and model tiering strategies that make multi-agent economics viable for growing businesses — not just enterprises burning through VC funding.
For teams building today, the practical path is:
- Build your first agents as singles on solid foundations — tool calling, good system prompts, MCP integration.
- When single-agent limits emerge, add a second agent with explicit boundaries and typed schemas between them.
- Expand to 3-5 agents only after you’ve built the observability and failure-handling infrastructure.
- Measure relentlessly — the ROI framework should cover the full cost, not just the API bill.
88% of early adopters of agentic AI report positive ROI on at least one use case. The key is the discipline to get one system right before scaling to many — the same compounding strategy that works at the organizational level works at the architectural level.
We build production multi-agent systems for growing businesses. If you’re at the point where a single agent isn’t enough and want to avoid the failure modes that take down most multi-agent projects — let’s talk.