
The Dance of AI Agents: How Multi-Agent Systems Actually Work

Multi-agent AI isn't about making systems smarter — it's about making them smaller, specialized, and composable. A practitioner's guide to the patterns that work.


Something interesting happens when you watch a great team. Each person knows their role, understands when to step in, and — more importantly — when to step back. They move with a fluid efficiency that makes complex work look simple.

I’ve been trying to build AI systems that work the same way. Not a single monolithic model that does everything, but a network of specialized agents that collaborate, delegate, and route around each other’s limitations. The results have been surprising — both in what works and in what fails spectacularly.

The most important lesson: multi-agent AI isn’t about making systems smarter. It’s about making them smaller. A network of focused agents, each with a narrow responsibility and clear handoff protocols, consistently outperforms a single agent trying to do everything. Not because the individual agents are more capable, but because the architecture constrains them to do less — and do it well.

The Single-Agent Ceiling

Most AI applications today are single-agent systems: one model, one system prompt, one set of tools. You give it a complex task, it reasons through it, calls some tools, and produces output. This works well for simple tasks. It breaks down predictably for complex ones.

The failure mode is always the same: context pollution. As the task gets more complex, the system prompt grows. More tools are added. The model must hold more state in its context window. And as we saw with the 14K Token Debt, attention degrades as context grows — the model starts losing track of its instructions, its tools interfere with each other, and output quality drops.

Anthropic documented this pattern precisely in their “Building Effective Agents” guide (December 2024). Their central recommendation: start simple, and only add complexity when simpler solutions fall short. But when you do need complexity, the answer isn’t a bigger single agent — it’s multiple agents working together.

The insight maps to a broader principle from software engineering: the single responsibility principle. A function that does one thing well is easier to test, debug, and maintain than a function that does ten things. The same applies to agents.

Five Workflow Patterns That Actually Work

Anthropic’s taxonomy of agent collaboration patterns is the clearest framework I’ve seen. They distinguish between workflows (LLMs orchestrated through predefined code paths) and agents (LLMs that dynamically direct their own processes). Most production systems are workflows, not agents — and that’s a good thing.

Here are the five workflow patterns, ordered by complexity:

1. Prompt Chaining

The simplest pattern: break a task into sequential steps, where each step’s output feeds the next step’s input. Each step can have its own focused prompt, its own validation gate, and its own error handling.

[Generate outline] → [Gate: outline valid?] → [Write section 1] → [Write section 2] → [Edit]

When to use: Tasks that decompose into fixed, predictable subtasks. Translation pipelines, content generation with review steps, multi-stage data processing.

When not to use: Tasks where the steps can’t be known in advance, or where later steps might invalidate earlier ones.
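The chain above can be sketched in a few lines. Here `call_llm` and `outline_is_valid` are stand-ins (hypothetical, not any particular SDK) for a real model call and a real validation gate:

```python
def call_llm(prompt):
    """Stand-in for a real model call -- replace with your provider's client."""
    return f"output for: {prompt}"

def outline_is_valid(outline):
    """Gate: reject empty or trivially short outlines before writing begins."""
    return len(outline.strip()) > 10

def run_chain(topic):
    outline = call_llm(f"Outline an article about {topic}")
    if not outline_is_valid(outline):  # validation gate between steps
        raise ValueError("Outline failed validation; stop before wasting calls")
    draft = call_llm(f"Write the article from this outline: {outline}")
    return call_llm(f"Edit for clarity: {draft}")

article = run_chain("multi-agent systems")
```

The gate is the point: each step gets its own error handling, so a bad outline fails fast instead of propagating into a bad draft.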

2. Routing

A classifier examines the input and directs it to the appropriate specialized handler. Like a hospital triage nurse — assess, classify, route.

[User message] → [Classifier] → Sales agent
                              → Support agent
                              → Technical agent

When to use: Inputs that fall into distinct categories requiring different handling. Customer service, document processing, multi-domain assistants.

The trap: Over-routing. If your classifier needs to distinguish between 15 categories, it will misclassify frequently. Keep categories to 3-5 for reliable routing.
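A minimal routing sketch, with a keyword matcher standing in (hypothetically) for the LLM classifier, and a default route as a safety net for misclassifications:

```python
ROUTES = {"sales": "sales_agent", "support": "support_agent",
          "technical": "technical_agent"}

def classify(message):
    """Stand-in classifier -- in production this is an LLM call constrained
    to output exactly one label from a small, fixed set."""
    text = message.lower()
    if "buy" in text or "price" in text:
        return "sales"
    if "error" in text or "broken" in text:
        return "technical"
    return "support"

def route(message):
    label = classify(message)
    return ROUTES.get(label, "support_agent")  # default route as a safety net
```

Note the category count: three routes plus a default, well inside the 3-5 band where classification stays reliable.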

3. Parallelization

Run multiple agents simultaneously on different aspects of the same input, then combine results. Two variants: sectioning (divide work by subtask) and voting (same task, multiple attempts, pick the best).

                 ┌→ [Grammar check]  ─┐
[Document] ──────┼→ [Fact check]     ─┼→ [Merge results]
                 └→ [Style check]    ─┘

When to use: Tasks where independent subtasks can run concurrently, or where confidence is improved by multiple attempts. Code review, content moderation, multi-criteria evaluation.

Cost awareness: Parallelization multiplies your API costs linearly. Three parallel agents cost 3x. Make sure the quality improvement justifies the expense.
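The sectioning variant can be sketched with a thread pool; the three check functions are stubs standing in for real agent calls:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for real agent calls -- each would be an LLM request in production.
def grammar_check(doc): return {"check": "grammar", "issues": 0}
def fact_check(doc):    return {"check": "facts", "issues": 1}
def style_check(doc):   return {"check": "style", "issues": 2}

def review(doc):
    """Run the three checks concurrently, then merge the results."""
    checks = (grammar_check, fact_check, style_check)
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = [pool.submit(check, doc) for check in checks]
        return [f.result() for f in futures]  # merge step: collect in order

reports = review("Q3 launch draft")
```

With real LLM calls, the pool brings wall-clock time down to the slowest check, but the API bill is still all three.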

4. Orchestrator-Workers

A central orchestrator agent dynamically breaks down the task and delegates to worker agents. Unlike prompt chaining, the subtasks aren’t predetermined — the orchestrator decides what to delegate based on the specific input.

[Orchestrator] → "I need these 3 things done"
     ├→ [Worker A: research]
     ├→ [Worker B: analysis]
     └→ [Worker C: writing]

[Orchestrator] → "Now combine the results"

When to use: Complex tasks where the decomposition varies per input. Coding agents that need to read files, plan changes, implement, and test — but which files and what changes depend on the task.

The danger: Orchestrator drift. The orchestrator agent is itself an LLM, subject to the same attention degradation as any other model. If it loses track of the overall plan, workers execute correctly on the wrong tasks. Keep orchestrator context small and plan explicit.
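A minimal sketch of the pattern. `plan` stands in for the orchestrator's LLM call (here it trivially always picks all three workers; a real orchestrator varies the plan per input), and the workers are stubs:

```python
# Stand-ins for worker agents -- each would be its own focused LLM call.
WORKERS = {
    "research": lambda task: f"research notes on {task}",
    "analysis": lambda task: f"analysis of {task}",
    "writing":  lambda task: f"draft about {task}",
}

def plan(task):
    """Stand-in for the orchestrator's planning call: decide at runtime
    which workers this input needs, and in what order."""
    return ["research", "analysis", "writing"]

def orchestrate(task):
    explicit_plan = plan(task)  # keep the plan explicit, not implicit in context
    results = {role: WORKERS[role](task) for role in explicit_plan}
    # Second orchestrator step: combine worker outputs in plan order.
    return " | ".join(results[role] for role in explicit_plan)

report = orchestrate("agent handoffs")
```

Keeping `explicit_plan` as a concrete value rather than something the orchestrator must remember in its context window is one defense against orchestrator drift.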

5. Evaluator-Optimizer

One agent generates output; another evaluates it. The evaluation feeds back, and the generator iterates. This loop continues until the evaluator is satisfied or a maximum iteration count is hit.

[Generator] → [Output] → [Evaluator: score 6/10] → [Generator: improve] → [Evaluator: 9/10] → Done

When to use: Tasks with clear, measurable quality criteria. Code that must pass tests, content that must meet a rubric, translations that must match a reference style.

Always set a maximum iteration count. Without it, a generator-evaluator pair can loop indefinitely, each iteration making marginal changes that never satisfy the evaluator. I’ve seen this consume $50+ in API calls before anyone noticed.
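The loop with its hard cap can be sketched as follows; the generator and evaluator are deterministic stubs for illustration:

```python
def generate(prompt, feedback=None):
    """Stand-in generator -- a real one would call an LLM with the feedback."""
    return prompt if feedback is None else prompt + " (revised)"

def evaluate(output):
    """Stand-in evaluator -- scores out of 10, higher for revised drafts."""
    return min(10, 6 + output.count("(revised)") * 2)

def refine(prompt, target=9, max_iters=5):
    output = generate(prompt)
    score = evaluate(output)
    while score < target and max_iters > 0:  # hard cap prevents runaway spend
        output = generate(output, feedback=f"score {score}, improve")
        score = evaluate(output)
        max_iters -= 1
    return output, score
```

The `max_iters` guard is the $50 lesson in one line: the loop terminates even if the evaluator is never satisfied.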

The Handoff Problem

The patterns above describe how agents collaborate. But the hardest engineering problem in multi-agent systems isn’t collaboration — it’s handoffs: the moment when one agent passes control, context, and responsibility to another.

A bad handoff looks like this: Agent A has been helping a user with a billing issue for 5 turns. Agent A realizes the problem is actually technical and routes to Agent B. Agent B has no context. It asks the user to re-explain the problem. The user is frustrated. This is worse than having no handoff at all.

A good handoff preserves three things:

  1. Context — the full conversation history, not just a summary
  2. State — what has been tried, what has failed, what the user’s emotional state is
  3. Intent — why the handoff is happening and what the receiving agent should do
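Those three items can be enforced mechanically. A hypothetical handoff payload (my own sketch, not any specific SDK's format) that refuses context-free transfers:

```python
from dataclasses import dataclass

@dataclass
class HandoffPayload:
    """Everything the receiving agent needs. An empty field here usually
    means the user will be asked to repeat themselves."""
    history: list    # context: the full conversation, not a summary
    attempted: list  # state: what has been tried and what failed
    reason: str      # intent: why control is being transferred
    target: str      # which agent should take over

def make_handoff(history, attempted, reason, target):
    if not history:
        raise ValueError("refusing a context-free handoff")
    return HandoffPayload(history, attempted, reason, target)
```

Validating the payload at construction time turns the bad-handoff failure mode into a loud error instead of a frustrated user.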

OpenAI’s Agents SDK (evolved from the experimental Swarm project) implements two handoff patterns:

The Manager Pattern — a central agent invokes sub-agents as tools, retaining control of the conversation. The sub-agent executes a focused task and returns results to the manager. The user only ever talks to the manager.

The Handoff Pattern — one agent fully transfers control to another. The receiving agent takes over the conversation directly. The original agent is no longer involved.

The manager pattern is safer (the manager maintains continuity) but creates a bottleneck. The handoff pattern is more flexible but requires careful context transfer. In practice, most production systems use the manager pattern for reliability, with handoffs reserved for clear domain boundaries (e.g., sales → support → technical).

Here’s what a clean handoff implementation looks like:

from agents import Agent, Runner

sales_agent = Agent(
    name="Sales",
    instructions="Handle purchase inquiries. If the user has a technical "
                 "issue, transfer back to triage with context.",
)

support_agent = Agent(
    name="Support",
    instructions="Resolve customer issues. Offer refunds only after "
                 "attempting a fix. Escalate to a human if unresolved "
                 "after 3 attempts.",
)

triage = Agent(
    name="Triage",
    instructions="Classify the user's need. Route to sales for purchases, "
                 "support for issues. Always transfer full context.",
    handoffs=[sales_agent, support_agent],
)

# Back-handoffs are wired after construction to avoid forward references.
sales_agent.handoffs = [triage]
support_agent.handoffs = [triage]

result = Runner.run_sync(triage, "My last invoice looks wrong.")
print(result.final_output)

The key design choice: bounded handoff depth. An agent should hand off to at most 2-3 specialists, and handoff chains should never exceed 3 hops. Triage → Sales → Triage is fine. Triage → Sales → Support → Technical → Billing → Triage is a system that will produce circular delegation, infinite loops, and confused users.
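A hop budget is easy to enforce mechanically. A sketch of the idea (the hop counter and the `human_escalation` fallback are my assumptions, not a feature of any SDK):

```python
MAX_HOPS = 3  # assumption: beyond this, escalate rather than keep transferring

def transfer(current_agent, target_agent, hops):
    """Return (next_agent, new_hop_count), refusing over-budget transfers.
    'human_escalation' is a hypothetical fallback route for exhausted chains."""
    if hops >= MAX_HOPS:
        return "human_escalation", hops
    return target_agent, hops + 1
```

Threading a hop counter through every handoff payload is a cheap way to make circular delegation impossible by construction.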

Where Multi-Agent Systems Fail

Multi-agent architectures introduce failure modes that don’t exist in single-agent systems. Understanding these is more important than understanding the successes.

A systematic analysis of multi-agent failures (Cemri et al., 2025) studied 150+ conversation traces across five production frameworks (MetaGPT, ChatDev, AutoGen, and others) and found failure rates ranging from 41% to 86.7%. The analysis identified 14 distinct failure modes in three categories: specification errors (42%), inter-agent misalignment (37%), and verification failures (21%). The finding that should keep you up at night: unstructured multi-agent networks can amplify errors up to 17.2x compared to single-agent baselines.

Cascading errors. When Agent A produces subtly wrong output and passes it to Agent B, Agent B has no way to know the input is wrong. It processes it faithfully, amplifying the error. Agent C does the same. By the end of the chain, the output is confidently, thoroughly wrong — and every agent’s logs show correct execution. This is the 17.2x amplification in action.

Cost explosion. Each agent in the network makes its own LLM API calls. A 5-agent workflow processing 1,000 requests per day, with an average of 3 turns per agent, generates 15,000 API calls daily. At $0.01 per call (GPT-4o-mini), that’s $150/day, or $4,500/month. With GPT-4o or Claude Opus, costs are 10-50x higher. Multi-agent systems can be dramatically more expensive than single-agent alternatives, and the cost scales with the number of agents, not with the quality of the output.
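That arithmetic generalizes into a one-line estimator worth running before committing to an architecture:

```python
def daily_cost(num_agents, requests_per_day, turns_per_agent, price_per_call):
    """Estimated daily API spend for a multi-agent workflow:
    every agent makes `turns_per_agent` calls per request."""
    calls = num_agents * turns_per_agent * requests_per_day
    return calls * price_per_call

# The example above: 5 agents, 1,000 requests/day, 3 turns each, $0.01/call.
spend = daily_cost(5, 1000, 3, 0.01)
```

Run it with your real per-call price before and after adding an agent; the marginal agent's cost is rarely matched by a marginal quality gain.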

Accountability gaps. When five agents collaborate on a decision and the decision is wrong, which agent is responsible? In practice, no one debugs a multi-agent failure by reading one agent’s logs — you need to trace the full decision chain across all agents. This requires end-to-end tracing infrastructure, and most teams don’t build it until after their first expensive failure.

Coordination overhead. Every handoff, every parallel merge, every evaluator loop adds latency. A prompt chain of 4 agents, each taking 2 seconds, adds 8 seconds of sequential latency — enough to make interactive applications feel sluggish. Parallelization helps but doesn’t eliminate the coordination cost.

The “too many agents” trap. Teams that discover multi-agent patterns often over-apply them. A 3-agent customer service system (triage, sales, support) is clean and maintainable. A 15-agent system with specialized agents for “greeting,” “empathy,” “product recommendation,” “upselling,” and “farewell” is over-engineered and harder to debug than the monolithic system it replaced. Andrew Ng’s practical guidance on agentic design patterns emphasizes starting with the simplest agent architecture that solves the problem.

When to Use What

The decision tree is simpler than most frameworks suggest:

For context on what “works in production” actually looks like: Salesforce’s Agentforce, the largest documented agentic deployment, handled 1.5 million+ support requests in its first year with an 84% resolution rate without human intervention and only a 4% handoff rate. But it’s a carefully constrained workflow system, not an autonomous agent swarm.

Use a single agent when: the task is straightforward, fits in one context window, requires one domain of expertise, and doesn’t need collaboration with other systems. This covers 80% of real-world use cases.

Use prompt chaining when: the task has clear sequential steps with validation gates between them. This is the first multi-agent pattern you should reach for.

Use routing when: inputs fall into 3-5 distinct categories that need genuinely different handling. Not different tones — different tools, different knowledge, different procedures.

Use parallelization when: you need speed (run subtasks concurrently) or confidence (multiple attempts at the same task). Be prepared for the cost multiplier.

Use orchestrator-workers when: the task decomposition varies per input and can’t be predetermined. This is the most powerful pattern and the most dangerous — orchestrator drift is real.

Use evaluator-optimizer when: you have clear, measurable quality criteria and the cost of iteration is justified by the cost of a wrong answer.

Use full handoff-based multi-agent systems when: you need agents with genuinely different capabilities (different tools, different models, different system prompts) that must transfer control based on runtime decisions. This is the most complex pattern and should be your last resort, not your first.

Building for Collaboration

If you’re building multi-agent systems today, a few hard-won lessons:

Design the handoffs first. Before you build any agent, design the handoff protocol. What context gets transferred? What state is preserved? What triggers a handoff? The handoff design determines whether your system feels seamless or fragmented.

Trace everything. Build end-to-end tracing before you build the agents. Every handoff, every tool call, every decision should be logged with full context. You will need this when things go wrong — and they will go wrong.

Start with workflows, not agents. Predefined code paths with LLM steps are more reliable, more debuggable, and cheaper than autonomous agents. Only use true autonomous agents when the task genuinely can’t be decomposed in advance.

Measure the right thing. Don’t measure how many agents you have. Measure task completion rate, average cost per task, and time-to-resolution. A 2-agent system that solves 95% of problems for $0.05 each is better than a 10-agent system that solves 98% for $2.50 each.

The dance of AI agents isn’t about choreographing ever-more-complex routines. It’s about knowing when a simple two-step is better than a waltz — and having the discipline to keep it simple until simplicity genuinely isn’t enough.


Sharad Jain builds agentic AI systems in Bengaluru. He previously worked on data infrastructure at Meta and founded autoscreen.ai, a production voice AI platform. He writes about agent architecture at sharadja.in.


#AI #agents #orchestration #multi-agent #handoffs #OpenAI #Anthropic #architecture