Agent drift: why long-running AI agents lose the plot
Key takeaway
Sub-agent context isolation gives each agent its own context window scoped to a single task, so the orchestrator's main thread never fills with the intermediate tool calls and scratch work that cause context rot. Anthropic's multi-agent research system used this to beat single-agent Claude Opus 4 by 90.2% on an internal research evaluation, and AOrchestra reports a 16.28% relative gain from spawning context-isolated executors on demand. The catch is that isolation can strand shared context: Cognition argues that sub-agents acting without each other's traces make conflicting decisions, so the boundary has to be designed, not assumed.
Anthropic’s multi-agent research system beat single-agent Claude Opus 4 by 90.2% on the company’s internal research evaluation. The architectural choice doing most of that work is not smarter agents. It is that each sub-agent runs in its own context window, may spend tens of thousands of tokens exploring, and returns only a 1,000 to 2,000 token summary to the parent. The parent never sees the scratch work.
That pattern has a name: sub-agent context isolation. It is the deliberate decision to give each unit of work a fresh, scoped context instead of letting one thread accumulate everything. Done well, it is the most direct fix we have for context rot. Done naively, it trades one failure mode for another. This post covers what isolation actually buys you, the two ways to implement it, and the specific case where it backfires.
Sub-agent context isolation means each agent gets a context window scoped to a single task, separate from the orchestrator’s main thread and from its siblings. Anthropic’s context engineering guide lists “isolate” as one of four core operations, alongside write, select, and compress: give each sub-agent its own window scoped to its specific task. The orchestrator delegates, the sub-agent works in private, and only a compressed result crosses back.
The point is not parallelism for speed. The point is that a context window stays clean when it only ever holds what one task needs. A research sub-agent can read twenty documents and run a dozen tool calls without any of that landing in the orchestrator’s window or in a sibling agent’s reasoning. Isolation turns the handoff into a compression boundary rather than a place to dump state, which is the same conclusion the teams behind production multi-agent systems keep arriving at independently.
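A minimal sketch of that handoff, assuming stand-in functions in place of real tools and a real model. `SubAgent`, `Orchestrator`, `fake_search`, and `fake_summarize` are hypothetical names for illustration, not part of any framework mentioned here. The point it shows is structural: observations accumulate in the sub-agent's private list, and only the summary is appended to the parent's.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Tool = Callable[[str], str]


def fake_search(task: str) -> str:
    # Stand-in for a real tool call; returns a long, noisy observation.
    return f"raw notes about {task}: " + "detail " * 200


def fake_summarize(observations: List[str], limit: int = 80) -> str:
    # Stand-in for a model-written summary; here it is just a truncation.
    head = observations[0][:limit] if observations else ""
    return f"{len(observations)} observations; {head}"


@dataclass
class SubAgent:
    task: str
    context: List[str] = field(default_factory=list)  # private window

    def run(self, tools: List[Tool]) -> str:
        # Every observation lands in the private context only.
        for tool in tools:
            self.context.append(tool(self.task))
        # Only the compressed result crosses the boundary.
        return fake_summarize(self.context)


@dataclass
class Orchestrator:
    context: List[str] = field(default_factory=list)  # plan + distilled results

    def delegate(self, task: str, tools: List[Tool]) -> SubAgent:
        agent = SubAgent(task)                 # fresh window per task
        self.context.append(agent.run(tools))  # parent stores the summary only
        return agent


orchestrator = Orchestrator()
agent = orchestrator.delegate("context rot benchmarks", [fake_search, fake_search])
parent_chars = sum(len(entry) for entry in orchestrator.context)
child_chars = sum(len(entry) for entry in agent.context)
print(parent_chars, child_chars)
```

In a real system the summary would be written by the sub-agent's model rather than truncated, but the shape of the boundary is the same: one short entry in the parent per delegated task.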
A single growing thread degrades because attention dilutes as the window fills, and most of what fills it is noise the model no longer needs. Every tool call, every intermediate observation, every abandoned branch stays in context and gets rebilled on the next step. By step twenty, the model is reasoning over a transcript where the signal it needs is buried under thousands of tokens of its own history.
This is context rot: accuracy falls as input length grows, even when the underlying task does not get harder. In a single-agent loop it shows up as the agent forgetting an early instruction or contradicting a decision it made ten steps ago. In a long-running agent it compounds into agent drift, where the agent keeps acting on goals and state that have since been superseded. The shared thread is the mechanism that lets stale context survive long enough to do damage.
Isolation attacks the root cause. If a sub-agent’s exploration never enters the main thread, it cannot rot the main thread. The orchestrator’s window holds the plan and a handful of distilled results, not the full trace of every sub-task. That is the case for isolation over compression alone: compression shrinks the noise after the fact, while isolation keeps the noise out of the window that matters.
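The rebilling effect is easy to put numbers on. The sketch below assumes a deliberately simple model: every step adds a fixed number of tokens, and each step re-reads everything before it. The step counts and token sizes are illustrative assumptions, not measurements. It counts only what the orchestrator's window re-reads; the sub-agents' own token usage is not included, and total spend across all agents can be higher with isolation.

```python
def single_thread_input_tokens(steps: int, tokens_per_step: int) -> int:
    # Step i re-reads the output of every earlier step plus its own input,
    # so the window at step i holds i * tokens_per_step tokens.
    return sum(i * tokens_per_step for i in range(1, steps + 1))


def isolated_parent_input_tokens(steps: int, summary_tokens: int) -> int:
    # With isolation, the parent re-reads only the summaries returned so far.
    return sum(i * summary_tokens for i in range(1, steps + 1))


shared = single_thread_input_tokens(steps=20, tokens_per_step=2000)
isolated = isolated_parent_input_tokens(steps=20, summary_tokens=200)
print(shared)    # 420000
print(isolated)  # 42000
```

Both curves grow quadratically in the number of steps. Isolation does not change the shape; it shrinks the per-step constant from the size of a full trace to the size of a summary.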
There are two ways to build sub-agents, and they make different trade-offs between specialization and adaptability. Static-role sub-agents are fixed specialists you define in advance: a planner, a coder, a reviewer, each with a hand-written prompt and tool set. Context-isolated sub-agents are spawned per task, each with a context, tools, and model chosen for that specific step. AOrchestra, an orchestration framework from early 2026, frames an agent as a compositional tuple of instruction, context, tools, and model, which lets a central orchestrator “spawn specialized executors for each task on demand” rather than maintaining a fixed roster. Across GAIA, SWE-Bench, and Terminal-Bench, that on-demand approach reported a 16.28% relative improvement over the strongest baseline.
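One way to picture the four-part tuple is as a small record that the orchestrator fills in per task. This is a sketch of the idea only, not AOrchestra's actual API: `AgentSpec`, `spawn_executor`, the note keys, the tool names, and the model name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class AgentSpec:
    """The four-part tuple: instruction, context, tools, model."""
    instruction: str
    context: Tuple[str, ...]
    tools: Tuple[str, ...]
    model: str


def spawn_executor(task: str, needs: Sequence[str], notes: Dict[str, str],
                   tools: Sequence[str], model: str) -> AgentSpec:
    # Copy in only the notes this task needs; everything else stays out.
    context = tuple(notes[key] for key in needs if key in notes)
    return AgentSpec(f"Complete this task: {task}", context, tuple(tools), model)


# Hypothetical shared notes held by the orchestrator.
notes = {
    "repo_layout": "src/ holds the service, tests/ holds the test suites.",
    "style_guide": "Use type hints and small functions.",
    "release_plan": "Ship behind a feature flag.",
}

fix_test = spawn_executor(
    task="fix the failing date parser test",
    needs=["repo_layout", "style_guide"],
    notes=notes,
    tools=["read_file", "run_tests"],
    model="small-model",  # placeholder name, chosen per task
)
print(fix_test)
```

A static-role design would define a handful of such records once and reuse them. The on-demand design builds a new one for each step, which is where the adaptability comes from.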
The table below compares the common approaches to handling context across agents.
| Approach | How context flows | Strength | Failure mode |
|---|---|---|---|
| Single growing thread | Everything accumulates in one window | Full shared context, simple to build | Context rot, quadratic token cost |
| Static-role sub-agents | Each fixed role gets scoped context | Predictable, specialized behavior | Inflexible, heavy human engineering |
| Context-isolated sub-agents | Per-task window, compressed result returned | Adapts to the task, keeps the parent small | Sub-agents lose sight of each other |
| Linear single thread (Cognition) | One agent, full trace, no parallel branches | No conflicting decisions | Inherits all of context rot’s limits |
The recent literature treats isolation as a first-class property rather than an implementation detail. A March 2026 paper, “Context Engineering: From Prompts to Corporate Multi-Agent Architecture,” lists isolation as one of five context quality criteria, alongside relevance, sufficiency, economy, and provenance, and frames context as the agent’s operating system. The framing matters: if isolation is a quality dimension you measure, you design for it, instead of discovering its absence when a run fails.
Isolation breaks down when sub-agents need each other’s decisions and only receive each other’s messages. Cognition, the team behind Devin, made this the centerpiece of its essay “Don’t Build Multi-Agents.” Their argument is that every action an agent takes carries an implicit decision, and when sub-agents act in parallel without seeing each other’s full traces, those decisions conflict. Their example is a builder agent producing a game background in one visual style while a second sub-agent builds a character asset in a clashing style, because neither agent ever saw the other’s choices.
This is the real tension. The same boundary that keeps the orchestrator’s window clean also hides each sub-agent’s reasoning from its peers. Pass too much across the boundary and you reintroduce the rot you were trying to prevent. Pass too little and sub-agents make locally sensible, globally incoherent decisions. Cognition’s conclusion is to prefer a single linear thread and share full context rather than fragment it, which avoids conflicting decisions at the cost of inheriting every limitation of the shared thread.
Both camps are right about different workloads. Anthropic’s research task is read-heavy and decomposes cleanly: sub-agents gather independent facts that rarely conflict, so isolation is almost pure upside. Cognition’s coding task is write-heavy and tightly coupled: sub-agents are making interdependent design decisions, so isolation strands exactly the context they need. The lesson is not which side wins. It is that isolation helps in proportion to how independent the sub-tasks actually are.
The way to get isolation’s upside without stranding shared context is to make the boundary an explicit artifact, not a prompt instruction, and to control exactly what crosses it. Three patterns do this in production.
Return compressed results, not traces. The handoff should carry conclusions and the few facts the parent needs, while the intermediate reasoning stays in the sub-agent’s window. Anthropic’s 90.2% improvement comes largely from this: tens of thousands of tokens of exploration distilled into a short summary.
Offload detail to storage the parent references by name. When a sub-agent produces a large artifact, write it to external storage and pass a pointer rather than the content. This is context offloading applied at the agent boundary, and it keeps the main thread small while preserving access to the full result.
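A minimal sketch of pointer passing, assuming a local temporary directory as the external store. The `offload` and `load` helpers and the pointer's field names are hypothetical; a production system would more likely use object storage or a database.

```python
import hashlib
import tempfile
from pathlib import Path


def offload(artifact: str, store_dir: Path, name: str) -> dict:
    # Write the full artifact to storage and return a small pointer to it.
    data = artifact.encode("utf-8")
    (store_dir / name).write_bytes(data)
    return {
        "ref": name,                                # what the parent refers to
        "bytes": len(data),                         # lets the parent budget a later read
        "sha256": hashlib.sha256(data).hexdigest(), # detects a changed artifact
    }


def load(pointer: dict, store_dir: Path) -> str:
    # Dereference the pointer only when the full content is actually needed.
    return (store_dir / pointer["ref"]).read_bytes().decode("utf-8")


store_dir = Path(tempfile.mkdtemp())
report = "finding\n" * 5000                  # large sub-agent artifact
pointer = offload(report, store_dir, "research_report.txt")
print(pointer["ref"], pointer["bytes"])      # research_report.txt 40000
```

The parent's context carries the pointer, which is a few dozen tokens, while the 40,000-byte report stays retrievable by name.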
Give each sub-agent a scoped data boundary, not a hand-maintained filter. Instead of the orchestrator slicing one shared store and trusting each sub-agent to read only its slice, give each sub-agent its own isolated context container and let it pull only what it is scoped to through that container’s own tools. With Wire, each container is a separately permissioned environment with its own MCP server, so the isolation boundary is the container itself rather than a filter the orchestrator has to enforce by hand. That removes a whole class of context bleed, because a sub-agent has no path into a sibling’s container.
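The difference between a filter and a boundary can be sketched in a few lines. This is a generic illustration, not Wire's API or an MCP implementation: `make_read_tool` and the document keys are hypothetical. Each sub-agent is handed a read tool bound to its own container, so it holds no reference through which sibling data could be reached.

```python
from typing import Callable, Dict


def make_read_tool(documents: Dict[str, str]) -> Callable[[str], str]:
    # Private copy: the returned tool can reach nothing outside this container.
    scoped = dict(documents)

    def read(key: str) -> str:
        if key not in scoped:
            raise KeyError(f"{key!r} is not in this container")
        return scoped[key]

    return read


# One container per sub-agent, each with its own tool.
research_read = make_read_tool({"papers": "notes on context rot papers"})
billing_read = make_read_tool({"invoices": "Q3 invoice export"})

print(research_read("papers"))  # notes on context rot papers

try:
    research_read("invoices")   # sibling data is not reachable from here
    leaked = True
except KeyError:
    leaked = False
print(leaked)                   # False
```

With a shared store and a filter, a bug in the filter leaks data. Here there is nothing to filter, because the research agent's tool was never given the billing documents in the first place.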
These are the same context engineering techniques that work for single agents, applied at the seams between agents. Scope what each agent sees, compress what crosses boundaries, and offload what does not need to be inline.
Reach for sub-agent context isolation when a task decomposes into scoped sub-tasks that each fit in a smaller, cleaner context than the whole job would, and when those sub-tasks are genuinely independent. Research, retrieval, broad search, and parallel verification fit this shape well, which is why isolation shows the largest gains there.
Keep a single thread when the work is tightly coupled, when sub-agents would be making interdependent design decisions, or when the task has one objective over one data scope with no real specialization. In those cases the coordination cost and the risk of conflicting decisions outweigh what you save on the context window. Isolation is a context engineering tool, not a default. The question to ask before splitting a task is whether each piece can succeed knowing only its own slice. When the answer is yes, give it its own window. When the answer is no, keep it in the room.
Sources: Anthropic: How We Built Our Multi-Agent Research System · Anthropic: Effective Context Engineering for AI Agents · AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration (arXiv:2602.03786) · Context Engineering: From Prompts to Corporate Multi-Agent Architecture (arXiv:2603.09619) · Cognition: Don’t Build Multi-Agents · UC Berkeley MAST: Why Do Multi-Agent LLM Systems Fail? (arXiv:2503.13657)