Agent drift: why long-running AI agents lose the plot
Key takeaway
Context offloading is the practice of keeping an AI agent's working context window small by moving state to an external destination and pulling it back only when a step needs it. The three production patterns are structured note-taking, sub-agent delegation, and just-in-time retrieval, each sending a different kind of bulk somewhere the agent can recover it. Offloading matters because transformer attention is a fixed budget: Chroma's study across 18 frontier models found accuracy dropping from 95% to 60-70% as input length grew. It differs from reduction techniques like compaction, which shrink the window in place rather than relocating anything.
A long-running AI agent fills its own context window. Every tool call returns output, every step adds reasoning, every retrieved document stays in the history. By the time the agent is fifty turns deep, most of the window is exhaust: superseded plans, raw tool dumps, observations that mattered once and never again.
Context offloading is how production teams keep the state that matters out of that pile: the practice of keeping an agent’s working context window small by moving state to a destination outside it and pulling it back only when a step needs it. It is one operational answer to a measured problem: transformer attention is a fixed budget, and accuracy falls as that budget spreads thinner. This guide covers the three offloading patterns used in production, where each one sends its bulk, and the cost of getting each one wrong.
Context offloading keeps an agent’s working window small by moving state to a destination outside it and bringing it back on demand. The agent carries lightweight references, such as file paths, container IDs, or search queries, instead of the full payloads, and reconstructs detail only for the step that needs it. This is different from buying a bigger window: the goal is a smaller working set, not a larger ceiling.
Offloading is one of two ways to keep a window small, and it pays to be precise about which one you are doing. Reduction techniques, such as compaction, shrink what is already in the window in place: they summarize a long history into a short one, and nothing leaves for a destination the agent can return to. Offloading relocates: the state lives in a file, a sub-agent, or a retrievable store, and the window holds only a reference to it. Context compression covers the reduction side. This guide is about offloading, where the defining question is not how small you made the context but where you put it and how reliably you get it back.
The reason to bother is that context rot is real and measured. Chroma’s study across 18 frontier models found accuracy dropping from 95% to 60-70% as input length grew, even on simple tasks. Separate analysis of leading models puts effective context capacity at roughly 60-70% of the advertised maximum, so a model claiming 200,000 tokens becomes unreliable well before it is full. The window is not free storage. Every token an agent carries that it does not need is a token of attention pulled away from the tokens it does.
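As a rough worked example of that arithmetic, the sketch below applies the 60-70% effective-capacity range to an advertised window size. The function name and the idea of using the range as a planning budget are illustrative assumptions, not a measured rule for any particular model.

```python
def effective_budget(advertised_tokens: int, low_pct: int = 60, high_pct: int = 70) -> tuple[int, int]:
    """Return the (low, high) token range in which the window is assumed to stay reliable."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return advertised_tokens * low_pct // 100, advertised_tokens * high_pct // 100


low, high = effective_budget(200_000)
print(f"Plan for roughly {low:,} to {high:,} usable tokens")
```

Under these assumptions, a 200,000-token window should be budgeted as if it held 120,000 to 140,000 tokens, and anything the agent does not need counts against that smaller number.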
Three offloading patterns have converged in production AI systems: structured note-taking, sub-agent delegation, and just-in-time retrieval. Each sends a different kind of bulk to a different destination: durable facts to a file, exploration work to a sub-agent, and reference data to a store the agent queries on demand. What unites them is that the bulk leaves the window for somewhere the agent can recover it, rather than being summarized away. They are not mutually exclusive, and most serious agents run all three together.
| Pattern | What it offloads, and where | Best for | Main cost |
|---|---|---|---|
| Structured note-taking | Durable facts and decisions, to a file | Goals that must survive many steps | Notes go stale if not updated |
| Sub-agent delegation | Exploration and search work, to a sub-agent | Wide tasks with parallel branches | Coordination overhead, handoff loss |
| Just-in-time retrieval | Reference data, to a store queried on demand | Large corpora the agent works against | Retrieval latency, wrong-chunk risk |
Structured note-taking has the agent write durable facts to a file outside the window and read them back when needed. Anthropic calls this agentic memory: a technique where the agent regularly writes notes persisted to memory outside of the context window. In Anthropic’s Pokemon-playing agent, a single note tracked an objective like “for the last 1,234 steps I’ve been training my Pokemon in Route 1, Pikachu has gained 8 levels toward the target of 10,” one line standing in for thousands of turns of history.
The pattern works because the note is the agent’s own summary, written deliberately at the moment the agent knows what matters. It survives context resets, sub-agent handoffs, and full session restarts, because the file outlives any single window. The failure mode is staleness: a note written once and never revised becomes context poisoning waiting to happen, because the agent treats it as ground truth long after it stops being true. Notes need an update discipline, not just a write step. Structured context, with explicit sections for goal, decisions, and open questions, reads back far more reliably than a freeform log, because the agent can update one section without rewriting the whole file.
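A minimal sketch of that update discipline, assuming a JSON notes file with the three sections named above. The file layout, the `updated_at_step` field, the `max_age` threshold, and the example goal are all hypothetical choices for illustration, not a format any vendor prescribes.

```python
import json
import tempfile
from pathlib import Path

SECTIONS = ("goal", "decisions", "open_questions")


def write_section(path: Path, section: str, content, step: int) -> None:
    """Update one section of the notes file and record the step it was written at."""
    if section not in SECTIONS:
        raise ValueError(f"unknown section: {section}")
    notes = json.loads(path.read_text()) if path.exists() else {}
    notes[section] = {"content": content, "updated_at_step": step}
    path.write_text(json.dumps(notes, indent=2))


def read_notes(path: Path, current_step: int, max_age: int = 200) -> dict:
    """Read notes back, flagging any section not revised within max_age steps."""
    notes = json.loads(path.read_text())
    return {
        name: {**entry, "stale": current_step - entry["updated_at_step"] > max_age}
        for name, entry in notes.items()
    }


notes_path = Path(tempfile.mkdtemp()) / "notes.json"
write_section(notes_path, "goal", "Migrate the billing service to the new API", step=1)
write_section(notes_path, "decisions", ["Keep the old endpoint until cutover"], step=900)

notes = read_notes(notes_path, current_step=1000)
for name, entry in notes.items():
    print(name, "stale" if entry["stale"] else "fresh")
```

Flagging staleness at read time, rather than trusting the file blindly, gives the agent a cue to re-verify an old note before treating it as ground truth.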
Sub-agent delegation moves exploration work into a separate agent with its own clean window, returning only a distilled result to the caller. Anthropic’s guidance is concrete: each subagent might explore extensively, using tens of thousands of tokens or more, but returns only a condensed summary of its work, often 1,000 to 2,000 tokens. The parent agent never sees the search dead-ends, the discarded files, or the raw tool output. It sees the conclusion.
This is the highest-leverage offloading pattern for wide tasks. Anthropic’s multi-agent research system outperformed single-agent Claude Opus 4 by 90.2% on its internal evaluation, largely because each sub-agent got a scoped window instead of one agent drowning in every branch at once. The cost is coordination: sub-agents that share context badly hit the failure modes documented in why multi-agent AI systems fail at context, where one agent’s state bleeds into another’s reasoning or a handoff drops a constraint. The rule that makes delegation an offloading technique is the same rule that makes it work at all: agents communicate compressed results, never raw context.
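The boundary rule can be sketched as follows. Everything here is a stand-in: `SubAgent`, `delegate`, the `FINDING:` marker, and the word-count cap are hypothetical, and a real system would have the model write the summary rather than filter strings.

```python
from dataclasses import dataclass, field


@dataclass
class SubAgent:
    """A worker with its own private history; only the summary leaves."""
    task: str
    history: list[str] = field(default_factory=list)

    def explore(self, observations: list[str]) -> None:
        # Stand-in for real tool calls: every raw observation stays in this window.
        self.history.extend(observations)

    def summarize(self, max_words: int = 50) -> str:
        # Placeholder distillation: keep lines marked as findings, then cap length.
        findings = [
            h.removeprefix("FINDING: ") for h in self.history if h.startswith("FINDING: ")
        ]
        words = " ".join(findings).split()
        return " ".join(words[:max_words])


def delegate(task: str, observations: list[str], parent_context: list[str]) -> None:
    worker = SubAgent(task)
    worker.explore(observations)
    # Only the condensed result crosses the boundary, never worker.history.
    parent_context.append(f"[{task}] {worker.summarize()}")


parent_context: list[str] = []
delegate(
    "find auth bug",
    [
        "opened login.py, nothing relevant",
        "grep for token: 412 matches",
        "FINDING: expiry check uses local time in session.py",
    ],
    parent_context,
)
print(parent_context)
```

The parent ends up with one short line, while the dead ends stay in a history that is discarded with the worker.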
Just-in-time retrieval keeps reference data out of the window entirely and loads it at runtime through tools. Instead of stuffing documents into the prompt, the agent holds lightweight identifiers, such as file paths, URLs, or record IDs, and calls a tool to fetch content for the step that needs it. Anthropic frames this as mirroring how people work: nobody memorizes an entire corpus, they retrieve from it on demand. Claude Code applies the same idea with file paths, reading a file only when the current task touches it.
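A minimal sketch of the file-path variant, assuming a hypothetical `read_file_tool` and two stand-in documents written to a temporary directory. The names and the character cap are illustrative and do not reflect how any specific agent implements its file tool.

```python
import tempfile
from pathlib import Path


def read_file_tool(path: str, max_chars: int = 2_000) -> str:
    """Hypothetical tool: load a file's content only when a step asks for it."""
    return Path(path).read_text()[:max_chars]


# Set up two stand-in reference documents on disk.
workdir = Path(tempfile.mkdtemp())
(workdir / "billing.md").write_text("Invoices are issued on the 1st of each month.")
(workdir / "auth.md").write_text("Sessions expire after 30 minutes of inactivity.")

# The agent's window holds identifiers, not payloads.
references = {
    "billing": str(workdir / "billing.md"),
    "auth": str(workdir / "auth.md"),
}


def run_step(topic: str) -> str:
    # Fetch only the document the current step touches; the other stays on disk.
    content = read_file_tool(references[topic])
    return f"Working with: {content}"


print(run_step("auth"))
```

The `max_chars` cap is one simple guard against a single fetch flooding the window; a production tool would more likely page or search within the file.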
This is where a context platform earns its place. An agent connected to a Wire container holds only a container reference and calls wire_search or wire_navigate to pull the entries a step needs. Because the container processes files into structured entries at upload time, a retrieval returns a handful of relevant entries rather than a whole document. The working window stays in the low thousands of tokens even when the container holds millions of entries. The same principle applied to tool definitions is progressive tool loading, which can cut a 150,000-token tool catalog to around 2,000 by deferring definitions until they are needed. The cost of just-in-time retrieval is latency and precision: a slow or wrong fetch is worse than carrying the data, so the retrieval layer has to be both fast and accurate.
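Progressive tool loading can be sketched generically as a catalog that exposes tool names up front and loads each full definition on first use. The `LazyToolCatalog` class and the two example tools are hypothetical; this is not the Wire or MCP API, only an illustration of the deferral idea.

```python
from typing import Callable


class LazyToolCatalog:
    """Keeps only tool names in view; full definitions load on first use."""

    def __init__(self, loaders: dict[str, Callable[[], dict]]):
        self._loaders = loaders
        self._loaded: dict[str, dict] = {}

    def names(self) -> list[str]:
        # This short list is all the agent carries up front.
        return sorted(self._loaders)

    def definition(self, name: str) -> dict:
        # Load and cache the full definition only when a step needs the tool.
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def loaded_count(self) -> int:
        return len(self._loaded)


catalog = LazyToolCatalog({
    "search_docs": lambda: {"name": "search_docs", "params": {"query": "string"}},
    "send_email": lambda: {"name": "send_email", "params": {"to": "string", "body": "string"}},
})

print(catalog.names(), catalog.loaded_count())
catalog.definition("search_docs")
print(catalog.loaded_count())
```

Only the definitions the task actually touches ever enter the window, which is where the token savings come from.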
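Every offloading pattern trades token savings for a specific risk, and the risks are predictable enough to budget for. The costs worth planning around before you turn a pattern on are the ones the comparison table names: notes going stale, coordination overhead and handoff loss between agents, retrieval latency, and wrong-chunk retrieval.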
These costs are why offloading is an engineering decision, not a default. Each pattern is a bet that the round trip is cheaper than the attention tax, and that bet is wrong for short tasks where the whole job fits in the window with room to spare.
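Choose an offloading pattern by identifying which kind of bulk dominates the agent’s window. The three patterns map to three distinct sources of bloat: durable facts and decisions that must survive many steps call for structured note-taking, exploration and search work calls for sub-agent delegation, and reference data from a large corpus calls for just-in-time retrieval.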
One common source of bloat is deliberately not on that list: a long conversation history. That is a reduction problem, not an offloading one, and the tool for it is compaction, covered in context compression. Offloading and reduction are complementary, and most long-running agents need both.
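One way to make that choice mechanical is to measure where the tokens in the window come from and pick the technique that matches the largest source. The category keys and the sample token counts below are hypothetical; how an agent attributes tokens to each source is left as an assumption.

```python
PATTERN_FOR_BULK = {
    "durable_facts": "structured note-taking",
    "exploration": "sub-agent delegation",
    "reference_data": "just-in-time retrieval",
    "conversation_history": "compaction (reduction, not offloading)",
}


def dominant_pattern(tokens_by_source: dict[str, int]) -> str:
    """Pick the technique matching whichever recognized source holds the most tokens."""
    known = {k: v for k, v in tokens_by_source.items() if k in PATTERN_FOR_BULK}
    if not known:
        raise ValueError("no recognized bulk sources")
    biggest = max(known, key=known.get)
    return PATTERN_FOR_BULK[biggest]


window = {
    "durable_facts": 1_500,
    "exploration": 42_000,
    "reference_data": 18_000,
    "conversation_history": 9_000,
}
print(dominant_pattern(window))
```

With these sample numbers, exploration dominates, so the sketch points to sub-agent delegation first; the other sources can be handled afterwards in order of size.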
Most production agents use all three offloading patterns at once: note-taking for durable goals, sub-agents for exploration, and retrieval for reference data. The patterns compose because they target non-overlapping bulk and send it to non-overlapping destinations. The discipline that ties them together is treating the window as a context budget with a fixed ceiling, then deciding deliberately what earns a place inside it. Offloading is one technique in a larger toolkit; the broader set of context engineering techniques covers the retrieval and validation work that makes offloaded state trustworthy when it comes back.
The instinct when an agent struggles is to give it more room. The teams shipping the most capable agents in 2026 do the opposite. They keep the working window small on purpose and treat every token in it as something that earned its place.
Context offloading is one half of how that discipline becomes mechanical. Note-taking sends durable goals to a file, sub-agents send exploration to their own windows, and just-in-time retrieval keeps reference data in a store until the moment it is needed. Reduction techniques like compaction handle the other half, the running conversation itself. Neither family is exotic, and neither is free: offloading trades a round trip and a lossy handoff for a window that stays in the range where attention actually holds. The agent that works best is not the one carrying the most context. It is the one carrying exactly what the current step needs, with everything else one fetch away.
Sources: Anthropic: Effective Context Engineering for AI Agents · Anthropic: Code Execution with MCP · Anthropic: Multi-Agent Research System · Chroma: Context Rot · AI Context Windows: Why Bigger Isn’t Always Better