Researchers at UC Berkeley analyzed over 1,600 traces across seven popular multi-agent frameworks, including OpenHands, MetaGPT, and CrewAI. They found failure rates as high as 86.7%. The failures clustered into 14 distinct modes across three categories: system design issues, inter-agent misalignment, and task verification failures.
The instinct is to blame the models or the frameworks. But the pattern across these failures tells a different story. Most breakdowns happen not because individual agents are incapable, but because the system doesn’t manage how context flows between them. This is a context engineering problem, and solving it is the difference between multi-agent systems that demo well and ones that work in production.
The coordination problem
Single-agent systems have one context window to manage. Multi-agent systems have many, and the interactions between them create failure modes that don’t exist in simpler architectures.
When Agent A completes a task and hands off to Agent B, several things can go wrong. Agent B might receive Agent A’s entire conversation history (thousands of irrelevant tokens) instead of a focused summary. Agent B might miss critical context that Agent A discovered but didn’t surface. Or Agent B might receive stale information from an earlier stage that has since been superseded.
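The fix for all three failure modes starts at the boundary itself. A minimal sketch of passing a scoped handoff instead of a transcript (all names here are hypothetical, not from any particular framework):

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """What Agent B actually needs: a result, not a transcript."""
    task: str                 # what Agent B should do next
    summary: str              # condensed findings from Agent A
    artifacts: list[str] = field(default_factory=list)  # references, not inline data

def hand_off(full_history: list[str], task: str, summarize) -> Handoff:
    # Compress A's history once, at the boundary, instead of
    # forwarding thousands of irrelevant tokens downstream.
    return Handoff(task=task, summary=summarize(full_history))

# Toy summarizer stand-in; in practice this would be a model call.
handoff = hand_off(
    ["searched docs", "found rate limit is 100 rps"],
    task="write the retry policy",
    summarize=lambda history: history[-1],
)
```

Agent B receives `handoff`, never `full_history`; the compression happens exactly once, at the point where ownership of the task changes.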
Gartner reports a 1,445% surge in multi-agent system inquiries from Q1 2024 to Q2 2025, and predicts that 40% of enterprise applications will feature task-specific agents by 2026 (up from less than 5% in 2025). But they also predict over 40% of agentic AI projects will be canceled by 2027. The gap between adoption and success is a context management gap.
Three ways multi-agent context breaks
The UC Berkeley MAST taxonomy identifies 14 failure modes. Most map to three underlying context problems.
Context bleed
Context bleed occurs when one agent’s state contaminates another’s reasoning. A planning agent that sees raw tool outputs meant for an execution agent may get confused by irrelevant details. A summarization agent that inherits a research agent’s full search history processes thousands of tokens that have nothing to do with its task.
This is the multi-agent version of context rot. In single-agent systems, performance degrades as the context window fills with noise. In multi-agent systems, context rot compounds: each agent adds its own noise, and downstream agents inherit the accumulated mess.
Context explosion
Multi-agent systems amplify token consumption through cascading context. When a root agent passes its full history to a sub-agent, and that sub-agent does the same to its own sub-agents, token counts grow exponentially. Five agents with 50-message histories can consume 400-500 MB of memory in frameworks like AutoGen. The cost adds up: every misinterpretation or hallucination caused by an overloaded context window is wasted compute.
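The cascade is easy to quantify with a toy model (illustrative numbers, not measurements from any framework): if every agent forwards its full context to each sub-agent, token volume multiplies by the fan-out at every level, while fixed-size summaries keep it linear in the number of agents.

```python
def cascaded_tokens(base: int, fanout: int, depth: int) -> int:
    """Total tokens processed when every agent forwards its full
    inherited context to `fanout` sub-agents, `depth` levels deep."""
    total, ctx = 0, base
    for _ in range(depth):
        ctx *= fanout          # each level re-reads the whole inherited context
        total += ctx
    return total

def summarized_tokens(base: int, fanout: int, depth: int, summary: int) -> int:
    """Same agent tree, but each handoff passes a fixed-size summary."""
    total, agents = 0, 1
    for _ in range(depth):
        agents *= fanout
        total += agents * summary
    return total

# 10k-token root context, 3 sub-agents per agent, 3 levels deep:
full = cascaded_tokens(10_000, 3, 3)            # 30k + 90k + 270k = 390,000
lean = summarized_tokens(10_000, 3, 3, 1_500)   # (3 + 9 + 27) * 1.5k = 58,500
```

Even at this small scale, summarized handoffs cut token volume by roughly 85%, which is consistent in spirit with the ~50% cost reductions reported in the JetBrains research below.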
JetBrains and NeurIPS 2025 research found that applying context management techniques (summarization, observation masking) reduces costs by roughly 50% without significantly degrading task performance. The problem is that most multi-agent deployments don’t apply these techniques at all.
Context drift
Context drift happens when old goals and task state linger after conditions have changed. An agent spawned to handle a sub-task may complete its work, but the orchestrating agent continues operating on assumptions from before the sub-task’s results came back. Or an agent receives instructions based on data that was current when the workflow started but has since been updated.
In long-running agent systems, context drift compounds with every step. The agent solves yesterday’s problem with today’s data, or today’s problem with yesterday’s context. Without explicit mechanisms to validate freshness, multi-agent workflows silently diverge from reality.
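One lightweight guard against drift is a generation counter on shared state: agents record the generation when they read, and refuse to act if it has moved. A sketch under assumed, hypothetical types:

```python
class SharedState:
    """Shared task state with a generation counter. Agents snapshot the
    generation when they read, and re-validate before they act."""
    def __init__(self):
        self.generation = 0
        self.data: dict = {}

    def update(self, **changes):
        self.data.update(changes)
        self.generation += 1   # any write invalidates older snapshots

    def snapshot(self):
        return self.generation, dict(self.data)

def act_on(state: SharedState, snap_gen: int, snap_data: dict) -> str:
    if snap_gen != state.generation:
        # Context drifted: conditions changed since this agent was briefed.
        raise RuntimeError("stale context; re-read before acting")
    return f"proceeding with {snap_data}"

state = SharedState()
state.update(goal="draft report")
gen, data = state.snapshot()          # sub-agent is briefed here
state.update(goal="cancel report")    # orchestrator changes course
# act_on(state, gen, data) now raises: the briefing is stale.
```

The raise forces a re-read rather than letting the agent quietly solve yesterday's problem.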
What the leading frameworks are doing
The teams building production multi-agent systems have converged on a shared insight: agents should communicate results, not raw context.
Google ADK: compiled context views
Google’s Agent Development Kit organizes context into four tiers: working context (the immediate prompt), session (the conversation log), memory (long-term knowledge), and artifacts (large data files referenced by name, not pasted inline). The key design principle is that working context is a “compiled view” over these sources. Every model call sees the minimum context required, and agents reach for additional information explicitly via tools rather than being flooded by default.
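The "compiled view" idea can be sketched as a function that assembles the minimum prompt from the tiers (a schematic with hypothetical types and a naive relevance filter; ADK's actual API differs):

```python
def compile_context(task: str, session: list[str], memory: dict,
                    artifacts: dict, recent: int = 3) -> str:
    """Build the minimum working context for one model call: the task,
    the last few session turns, memory entries the task mentions, and
    artifact *names* rather than their contents."""
    relevant_memory = {k: v for k, v in memory.items() if k in task}
    parts = [f"Task: {task}"]
    parts += [f"Recent: {turn}" for turn in session[-recent:]]
    parts += [f"Known: {k} = {v}" for k, v in relevant_memory.items()]
    parts += [f"Artifact available (fetch by name): {name}" for name in artifacts]
    return "\n".join(parts)

prompt = compile_context(
    task="summarize the billing incident",
    session=["user: what happened?", "agent: checking logs", "user: any impact?"],
    memory={"billing": "Stripe, monthly cycle", "hr": "unrelated records"},
    artifacts={"incident_log.txt": "...20k lines of raw logs..."},
)
# The prompt names incident_log.txt but never inlines its contents,
# and the irrelevant "hr" memory entry never enters the window.
```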
Their multi-agent guidance is explicit: “Share memory by communicating, don’t communicate by sharing memory.” When Agent A hands off to Agent B, the context is re-cast and scoped, not dumped wholesale.
Anthropic: write, select, compress, isolate
Anthropic’s context engineering guide distills multi-agent context management into four operations. Write context to external storage (scratchpads, files) so it doesn’t consume the window. Select only what’s relevant for the current step. Compress by summarizing intermediate results. Isolate by giving each sub-agent its own context window scoped to its specific task.
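The four operations can be sketched as plain functions (a schematic to show the shape of each operation, not Anthropic's code; the truncating `compress` stands in for a real summarization call):

```python
scratchpad: dict[str, str] = {}          # external storage, outside the window

def write(key: str, content: str) -> str:
    """Write: offload bulky context; keep only a pointer in the window."""
    scratchpad[key] = content
    return f"[stored as {key}]"

def select(notes: dict[str, str], step: str) -> dict[str, str]:
    """Select: pull in only entries relevant to the current step."""
    return {k: v for k, v in notes.items() if step in k}

def compress(result: str, limit: int = 200) -> str:
    """Compress: cap intermediate results (a real system would summarize)."""
    return result if len(result) <= limit else result[:limit] + "..."

def isolate(task: str, shared: dict[str, str]) -> dict:
    """Isolate: a sub-agent gets a fresh context scoped to its own task."""
    return {"task": task, "context": select(shared, task)}

pointer = write("search:results", "x" * 10_000)   # 10k chars leave the window
sub_ctx = isolate("search", {"search:results": pointer, "billing:plan": "pro"})
```

Each helper is trivial on its own; the discipline is applying all four at every agent boundary rather than letting raw context flow by default.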
The results are striking. In their multi-agent research system, each sub-agent explores extensively (tens of thousands of tokens) but returns only a condensed summary (1,000-2,000 tokens). The system outperformed single-agent Claude Opus 4 by 90.2% on internal research evaluations.
LangChain: filesystem offloading
LangChain’s Deep Agents framework addresses context explosion by giving agents filesystem tools to offload large context outside the active window. Sub-agents handle exploratory work independently and return only their final output to the parent agent. The parent never sees the 20 tool calls that produced the result.
Patterns that work in production
Across these frameworks and the research, a set of practical patterns emerges for managing context in multi-agent systems.
Scope context per agent. Each agent should receive only the context relevant to its task. This is the multi-agent application of context isolation, one of the core context engineering techniques. Instead of inheriting the full conversation, each agent gets scoped instructions, relevant data, and nothing else.
Summarize at handoff boundaries. When one agent passes work to another, the handoff should include a summary, not the raw history. Anthropic’s 90.2% improvement over single-agent comes largely from this pattern: sub-agents compress tens of thousands of tokens into focused results.
Use tiered storage. Not all context belongs in the active window. Long-term knowledge, large artifacts, and historical data should live in external storage that agents query as needed. This is the same principle behind RAG, applied at the agent architecture level rather than the retrieval level.
Enforce permission boundaries. In multi-agent systems, context scoping is also a security concern. An agent processing customer data shouldn’t see engineering infrastructure. An agent handling billing shouldn’t access HR records. Context containers with per-agent access controls prevent both context bleed and unauthorized access. Platforms like Wire apply this through isolated containers, each with its own access controls and MCP tools, so agents query only what they’re authorized to see.
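A permission-scoped container can be sketched as a store that only materializes the namespaces an agent was granted (a hypothetical design to illustrate the pattern, not any specific platform's API):

```python
class ContextContainer:
    """Namespaced context store. Each agent sees a view limited to the
    namespaces it was explicitly granted; everything else is invisible."""
    def __init__(self, store: dict[str, dict]):
        self._store = store

    def view_for(self, agent: str, grants: dict[str, set[str]]) -> dict:
        allowed = grants.get(agent, set())   # no grant means no access
        return {ns: data for ns, data in self._store.items() if ns in allowed}

store = ContextContainer({
    "billing": {"plan": "enterprise"},
    "hr": {"salaries": "confidential"},
})
grants = {"billing-agent": {"billing"}, "hr-agent": {"hr"}}

billing_view = store.view_for("billing-agent", grants)
# billing_view contains "billing" data but never sees "hr".
```

Deny-by-default grants mean a misconfigured or compromised agent leaks nothing it was not explicitly given, which closes both the bleed path and the security path in one mechanism.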
Validate context freshness. Before acting on inherited context, agents should verify that the information is still current. This can be as simple as timestamp checks on retrieved data, or as sophisticated as re-querying sources when the context is older than a threshold.
What this means for your architecture
The 86.7% failure rate in multi-agent systems is not a verdict on the technology. It reflects how early most teams are in treating context as an architectural concern rather than an afterthought.
The teams getting multi-agent right (Google, Anthropic, LangChain) have all arrived at the same conclusion: the hard problem is not building individual agents, it is designing how context flows between them. Scope it, compress it, isolate it, validate it. The agents themselves are the easy part.
References
- UC Berkeley MAST: Why Do Multi-Agent LLM Systems Fail? (arXiv:2503.13657)
- Google: Architecting Efficient Context-Aware Multi-Agent Framework
- Google ADK: Context Documentation
- Anthropic: Effective Context Engineering for AI Agents
- Anthropic: How We Built Our Multi-Agent Research System
- LangChain: Building Multi-Agent Applications with Deep Agents
- Gartner: Multiagent Systems in Enterprise AI
- JetBrains: Efficient Context Management (NeurIPS 2025)
- Inkeep: Context Engineering: Why Agents Fail
- Multi-Agent Memory Architecture (arXiv:2603.10062)