Why multi-agent AI systems fail at context

Jitpal Kocher · Updated May 12, 2026 · 8 min read

Key takeaway

Up to 86.7% of multi-agent AI runs fail according to UC Berkeley's MAST study of 1,600+ traces across seven frameworks, and most failures map to three context problems: bleed (one agent's state contaminates another's reasoning), explosion (cascading inheritance of full histories), and drift (agents acting on stale state). Google, Anthropic, and LangChain have converged on the same fix: agents should communicate compressed results, not raw context. Anthropic's multi-agent research system outperformed single-agent Claude Opus 4 by 90.2% by giving each sub-agent its own scoped window and returning only summaries to the parent.

Researchers at UC Berkeley analyzed over 1,600 traces across seven popular multi-agent frameworks, including OpenHands, MetaGPT, and CrewAI. They found failure rates as high as 86.7%. The failures clustered into 14 distinct modes across three categories: system design issues, inter-agent misalignment, and task verification failures.

The instinct is to blame the models or the frameworks. But the pattern across these failures tells a different story. Most breakdowns happen not because individual agents are incapable, but because the system doesn’t manage how context flows between them. This is a context engineering problem, and solving it is the difference between multi-agent systems that demo well and ones that work in production.

The coordination problem

Multi-agent systems break in ways single-agent systems never can, because every handoff between agents creates a new opportunity for too much, too little, or stale context to be passed forward. Single-agent systems have one context window to manage; multi-agent systems have many, and the interactions between them create failure modes that don’t exist in simpler architectures.

When Agent A completes a task and hands off to Agent B, several things can go wrong. Agent B might receive Agent A’s entire conversation history (thousands of irrelevant tokens) instead of a focused summary. Agent B might miss critical context that Agent A discovered but didn’t surface. Or Agent B might receive stale information from an earlier stage that has since been superseded.
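One way to make these three failure cases visible is to give the handoff an explicit shape. The sketch below is illustrative only: the `Handoff` class, its field names, and the 2,000-character limit are assumptions made for this example, not part of any framework mentioned in this article.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Handoff:
    """Hypothetical message passed from Agent A to Agent B."""
    task: str                      # what B is being asked to do
    summary: str                   # focused result, not A's transcript
    key_findings: tuple[str, ...]  # facts A discovered that B must not miss
    # When the underlying information was current (UTC), so B can judge staleness.
    as_of: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def problems(self, max_summary_chars: int = 2000) -> list[str]:
        """List reasons this handoff is likely to mislead the receiver."""
        issues = []
        if len(self.summary) > max_summary_chars:
            issues.append("summary too long: looks like raw history")
        if not self.key_findings:
            issues.append("no key findings surfaced")
        return issues
```

A receiving agent (or the orchestrator) could call `problems()` before acting and reject or repair a handoff that comes back with a non-empty list.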

Gartner reports a 1,445% surge in multi-agent system inquiries from Q1 2024 to Q2 2025, and predicts that 40% of enterprise applications will feature task-specific agents by 2026 (up from less than 5% in 2025). But Gartner also predicts that over 40% of agentic AI projects will be canceled by 2027. The gap between adoption and success is a context management gap.

Three ways multi-agent context breaks

The UC Berkeley MAST taxonomy identifies 14 failure modes. Most map to three underlying context problems.

Context bleed

Context bleed occurs when one agent’s state contaminates another’s reasoning. A planning agent that sees raw tool outputs meant for an execution agent may get confused by irrelevant details. A summarization agent that inherits a research agent’s full search history processes thousands of tokens that have nothing to do with its task.

This is the multi-agent version of context rot. In single-agent systems, performance degrades as the context window fills with noise. In multi-agent systems, context rot compounds: each agent adds its own noise, and downstream agents inherit the accumulated mess.

Context explosion

Multi-agent systems amplify token consumption through cascading context. When a root agent passes its full history to a sub-agent, and that sub-agent does the same to its own sub-agents, every level of delegation carries all the history above it, and token counts grow exponentially as the tree gets deeper. The overhead is not limited to tokens: five agents with 50-message histories can consume 400-500 MB of memory in frameworks like AutoGen. The cost adds up, because every misinterpretation or hallucination caused by an overloaded context window is wasted compute.
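To make the cascade concrete, here is a rough back-of-the-envelope model. It assumes a delegation tree in which every agent spawns the same number of sub-agents (`fanout`) and generates the same number of tokens of its own (`own_tokens`). These parameters and the numbers used are illustrative assumptions, not measurements from any framework.

```python
from typing import Optional


def total_tokens(depth: int, fanout: int, own_tokens: int,
                 handoff_tokens: Optional[int] = None) -> int:
    """Total tokens held across all agents in a delegation tree.

    handoff_tokens=None models full inheritance: each child starts with
    every ancestor's complete history. Otherwise each child starts with a
    fixed-size brief of handoff_tokens.
    """
    total = 0
    for level in range(depth + 1):
        agents_at_level = fanout ** level
        if handoff_tokens is None:
            inherited = level * own_tokens  # full history of each ancestor
        else:
            inherited = handoff_tokens if level > 0 else 0
        total += agents_at_level * (inherited + own_tokens)
    return total


full = total_tokens(depth=3, fanout=2, own_tokens=10_000)
scoped = total_tokens(depth=3, fanout=2, own_tokens=10_000, handoff_tokens=1_000)
print(full, scoped)  # 490000 164000
```

In this toy configuration, 15 agents hold 490,000 tokens under full inheritance versus 164,000 when each child receives a 1,000-token brief. The gap widens with every extra level, because inherited history grows with depth while a brief stays the same size.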

JetBrains and NeurIPS 2025 research found that applying context management techniques (summarization, observation masking) reduces costs by roughly 50% without significantly degrading task performance. The problem is that most multi-agent deployments don’t apply these techniques at all.

Context drift

Context drift happens when old goals and task state linger after conditions have changed. An agent spawned to handle a sub-task may complete its work, but the orchestrating agent continues operating on assumptions from before the sub-task’s results came back. Or an agent receives instructions based on data that was current when the workflow started but has since been updated.

In long-running agent systems, context drift compounds with every step. The agent solves yesterday’s problem with today’s data, or today’s problem with yesterday’s context. Without explicit mechanisms to validate freshness, multi-agent workflows silently diverge from reality.

What the leading frameworks are doing

Google, Anthropic, and LangChain have all converged on the same fix: agents should communicate compressed results, not raw context. The teams building production multi-agent systems treat handoffs as compression boundaries, not as a place to dump the parent’s full state.

Google ADK: compiled context views

Google’s Agent Development Kit organizes context into four tiers: working context (the immediate prompt), session (the conversation log), memory (long-term knowledge), and artifacts (large data files referenced by name, not pasted inline). The key design principle is that working context is a “compiled view” over these sources. Every model call sees the minimum context required, and agents reach for additional information explicitly via tools rather than being flooded by default.
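The "compiled view" idea can be sketched in a few lines. This is not ADK's API: `ContextStore`, `compile_working_context`, and their parameters are hypothetical names chosen for this example. It only shows the general shape of building a minimal prompt from larger stores.

```python
from dataclasses import dataclass, field


@dataclass
class ContextStore:
    """Hypothetical container for the session, memory, and artifact tiers."""
    session: list[str] = field(default_factory=list)           # conversation log
    memory: dict[str, str] = field(default_factory=dict)       # long-term knowledge
    artifacts: dict[str, bytes] = field(default_factory=dict)  # large files, by name


def compile_working_context(store: ContextStore, instruction: str,
                            recent_turns: int = 3,
                            memory_keys: tuple[str, ...] = ()) -> str:
    """Build the prompt for one model call from the minimum needed context."""
    parts = [f"INSTRUCTION: {instruction}"]
    # Only the tail of the session log, not the whole conversation.
    tail = store.session[-recent_turns:] if recent_turns > 0 else []
    parts += [f"TURN: {turn}" for turn in tail]
    # Only the memory entries this step explicitly asked for.
    parts += [f"MEMORY[{key}]: {store.memory[key]}"
              for key in memory_keys if key in store.memory]
    # Artifacts appear by name and size; contents would be fetched via a tool.
    parts += [f"ARTIFACT: {name} ({len(data)} bytes)"
              for name, data in sorted(store.artifacts.items())]
    return "\n".join(parts)
```

The design choice worth noting is that nothing enters the prompt by default: each tier contributes only what the caller asks for, and large artifacts contribute a reference rather than their contents.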

Their multi-agent guidance is explicit: “Share memory by communicating, don’t communicate by sharing memory.” When Agent A hands off to Agent B, the context is re-cast and scoped, not dumped wholesale.

Anthropic: write, select, compress, isolate

Anthropic’s context engineering guide distills multi-agent context management into four operations. Write context to external storage (scratchpads, files) so it doesn’t consume the window. Select only what’s relevant for the current step. Compress by summarizing intermediate results. Isolate by giving each sub-agent its own context window scoped to its specific task.

The results are striking. In their multi-agent research system, each sub-agent explores extensively (tens of thousands of tokens) but returns only a condensed summary (1,000-2,000 tokens). The system outperformed single-agent Claude Opus 4 by 90.2% on internal research evaluations.
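The isolate-and-compress pairing can be sketched as follows. This is a simplified illustration, not Anthropic's implementation: `Model` is a stand-in for a real LLM call, the function names are hypothetical, and capping the summary by character count is a crude proxy for a token budget.

```python
from typing import Callable

Model = Callable[[list[str]], str]  # stand-in for a real LLM call


def run_subagent(task: str, steps: list[str], model: Model,
                 max_summary_chars: int = 400) -> str:
    """Run a sub-task in a private context and return only a condensed result."""
    private_context = [f"TASK: {task}"]  # isolate: fresh window per sub-agent
    for step in steps:
        private_context.append(f"OBSERVATION: {step}")  # exploration stays local
    summary = model(private_context + ["Summarize the findings for the lead agent."])
    return summary[:max_summary_chars]  # compress: bounded handoff size


def run_lead(question: str, subtasks: dict[str, list[str]], model: Model) -> list[str]:
    """Lead agent context holds the question plus one summary per sub-agent."""
    lead_context = [f"QUESTION: {question}"]
    for task, steps in subtasks.items():
        lead_context.append(f"RESULT[{task}]: {run_subagent(task, steps, model)}")
    return lead_context
```

However many observations a sub-agent accumulates, the lead agent's context grows by exactly one bounded entry per sub-task.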

LangChain: filesystem offloading

LangChain’s Deep Agents framework addresses context explosion by giving agents filesystem tools to offload large context outside the active window. Sub-agents handle exploratory work independently and return only their final output to the parent agent. The parent never sees the 20 tool calls that produced the result.
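The offloading idea itself needs nothing more than the standard library. The sketch below is a generic illustration and does not use the Deep Agents API: `offload` and `read_slice` are hypothetical helper names, and the preview length is an arbitrary assumption.

```python
import tempfile
from pathlib import Path


def offload(workdir: Path, name: str, content: str, preview_chars: int = 80) -> str:
    """Write large content to disk and return a short reference for the prompt."""
    path = workdir / name
    path.write_text(content, encoding="utf-8")
    preview = content[:preview_chars].replace("\n", " ")
    return f"[saved {len(content)} chars to {name}] {preview}..."


def read_slice(workdir: Path, name: str, start: int, length: int) -> str:
    """Tool the agent can call to pull back only the part it needs."""
    return (workdir / name).read_text(encoding="utf-8")[start:start + length]


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    big_output = "row,value\n" + "\n".join(f"{i},{i * i}" for i in range(5000))
    reference = offload(workdir, "query_result.csv", big_output)
    # The prompt carries the short reference; the full output stays on disk.
    print(len(big_output), len(reference))
```

Only the reference string enters the active window. The agent retrieves specific slices with `read_slice` when a later step actually needs them.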

Patterns that work in production

Five patterns recur across the production multi-agent systems that hold up: scope context per agent, summarize at handoff boundaries, use tiered storage, enforce permission boundaries, and validate context freshness. Each one addresses a specific failure mode from the MAST taxonomy, and together they map the gap between systems that demo well and systems that survive real traffic.

Scope context per agent. Each agent should receive only the context relevant to its task. This is the multi-agent application of context isolation, one of the core context engineering techniques. Instead of inheriting the full conversation, each agent gets scoped instructions, relevant data, and nothing else.

Summarize at handoff boundaries. When one agent passes work to another, the handoff should include a summary, not the raw history. Anthropic’s 90.2% improvement over single-agent comes largely from this pattern: sub-agents compress tens of thousands of tokens into focused results.

Use tiered storage. Not all context belongs in the active window. Long-term knowledge, large artifacts, and historical data should live in external storage that agents query as needed. This is the same principle behind RAG, applied at the agent architecture level rather than the retrieval level.

Enforce permission boundaries. In multi-agent systems, context scoping is also a security concern. An agent processing customer data shouldn’t see engineering infrastructure. An agent handling billing shouldn’t access HR records. Context containers with per-agent access controls prevent both context bleed and unauthorized access. Platforms like Wire apply this through isolated containers, each with its own access controls and MCP tools, so agents query only what they’re authorized to see.
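A per-agent access check can be as small as an allow-list in front of each store. The sketch below is a generic illustration, not Wire's API or any other platform's: `ContextContainer` and its `query` method are hypothetical names for this example.

```python
class ContextContainer:
    """Hypothetical container that only serves agents on its allow-list."""

    def __init__(self, name: str, allowed_agents: set[str], documents: dict[str, str]):
        self.name = name
        self._allowed = frozenset(allowed_agents)
        self._documents = dict(documents)

    def query(self, agent_id: str, key: str) -> str:
        """Return a document only if the calling agent is authorized."""
        if agent_id not in self._allowed:
            raise PermissionError(f"{agent_id} may not read container {self.name!r}")
        return self._documents[key]


billing = ContextContainer("billing", {"billing-agent"}, {"invoice-42": "paid"})
print(billing.query("billing-agent", "invoice-42"))  # paid
```

Because the check sits at the storage boundary rather than in the prompt, an unauthorized agent never receives the data at all, which addresses bleed and access control with the same mechanism.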

Validate context freshness. Before acting on inherited context, agents should verify that the information is still current. This can be as simple as timestamp checks on retrieved data, or as sophisticated as re-querying sources when the context is older than a threshold.
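The timestamp-check variant might look like the sketch below. `ContextItem`, `fresh_value`, and the `refetch` callback are hypothetical names for this example, and the right `max_age` depends entirely on how quickly the underlying data changes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


@dataclass
class ContextItem:
    value: str
    fetched_at: datetime  # timezone-aware UTC timestamp


def fresh_value(item: ContextItem, max_age: timedelta,
                refetch: Callable[[], str],
                now: Optional[datetime] = None) -> ContextItem:
    """Return the item if it is recent enough, otherwise re-query the source."""
    if now is None:
        now = datetime.now(timezone.utc)
    if now - item.fetched_at <= max_age:
        return item
    # Stale: replace the inherited value instead of acting on it.
    return ContextItem(value=refetch(), fetched_at=now)
```

Passing `now` explicitly keeps the check deterministic in tests. In production code it can be omitted so the current UTC time is used.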

What this means for your architecture

The 86.7% failure rate in multi-agent systems is not a verdict on the technology. It reflects how early most teams are in treating context as an architectural concern rather than an afterthought.

The teams getting multi-agent right (Google, Anthropic, LangChain) have all arrived at the same conclusion: the hard problem is not building individual agents but designing how context flows between them. Scope it, compress it, isolate it, validate it. The agents themselves are the easy part.

Frequently asked questions

Why does multi-agent context break in ways single-agent doesn't?

Single-agent systems have one context window to manage. Multi-agent systems have many, and every handoff creates the chance for too much, too little, or stale context to be passed forward. The 14 failure modes UC Berkeley's MAST taxonomy identifies are mostly inter-agent problems, not problems any single agent has on its own.

What's the difference between context bleed and context explosion?

Context bleed is qualitative: one agent's irrelevant state contaminates another's reasoning even though the token count is manageable. Context explosion is quantitative: full conversation histories cascade from parent to sub-agents to their sub-agents until token counts grow exponentially. Bleed degrades accuracy; explosion drives up cost and triggers context rot.

How do I prevent one agent's history from polluting another's reasoning?

Pass summaries at handoff boundaries instead of raw history. Anthropic's multi-agent research system has each sub-agent explore in tens of thousands of tokens but return only a 1,000-2,000 token condensed summary to the parent. The parent never sees the intermediate tool calls or scratch work. This single pattern accounts for most of the 90.2% improvement over single-agent Claude Opus 4 they report.

Does adding more agents always cost more tokens?

By default, yes, and exponentially. Five agents in AutoGen with 50-message histories can consume 400-500 MB of memory. JetBrains and NeurIPS 2025 research found context-management techniques like summarization and observation masking cut costs roughly 50% with no significant accuracy loss. Without those techniques, multi-agent systems pay full cost on every handoff.

When should I use a single agent instead of a multi-agent system?

When the task has one clear objective, one data scope, and no genuine specialization across roles. Multi-agent systems pay coordination cost on every step, and that cost is only worth it when scoped sub-tasks let each agent operate in a smaller, cleaner context than a single agent could. If a single agent with good context engineering can do the work, the multi-agent setup is usually overhead.
