
Context offloading: 3 patterns for AI agents

Jitpal Kocher · May 22, 2026 · 10 min read

Key takeaway

Context offloading is the practice of keeping an AI agent’s working context window small by moving state to an external destination and pulling it back only when a step needs it. The three production patterns are structured note-taking, sub-agent delegation, and just-in-time retrieval, each sending a different kind of bulk somewhere the agent can recover it. Offloading matters because transformer attention is a fixed budget: Chroma’s study of 18 frontier models found accuracy dropping from 95% to 60-70% as input length grew and that budget was spread across more tokens. It is different from reduction techniques like compaction, which shrink the window in place rather than relocating anything.

A long-running AI agent fills its own context window. Every tool call returns output, every step adds reasoning, every retrieved document stays in the history. By the time the agent is fifty turns deep, most of the window is exhaust: superseded plans, raw tool dumps, observations that mattered once and never again.

Context offloading is how production teams keep the state that matters out of that pile: they move it to a destination outside the window and pull it back only when a step needs it. It is one operational answer to a measured problem: transformer attention is a fixed budget, and accuracy falls as that budget spreads thinner. This guide covers the three offloading patterns used in production, where each one sends its bulk, and the cost of getting each one wrong.

What context offloading is

Context offloading keeps an agent’s working window small by moving state to a destination outside it and bringing it back on demand. The agent carries lightweight references, such as file paths, container IDs, or search queries, instead of the full payloads, and reconstructs detail only for the step that needs it. This is different from buying a bigger window: the goal is a smaller working set, not a larger ceiling.

Offloading is one of two ways to keep a window small, and it pays to be precise about which one you are doing. Reduction techniques, such as compaction, shrink what is already in the window in place: they summarize a long history into a short one, and nothing leaves for a destination the agent can return to. Offloading relocates: the state lives in a file, a sub-agent, or a retrievable store, and the window holds only a reference to it. The reduction side is covered separately under context compression. This guide is about offloading, where the defining question is not how small you made the context but where you put it and how reliably you get it back.

The reason to bother is that context rot is real and measured. Chroma’s study across 18 frontier models found accuracy dropping from 95% to 60-70% as input length grew, even on simple tasks. Separate analysis of leading models puts effective context capacity at roughly 60-70% of the advertised maximum, so a model claiming 200,000 tokens becomes unreliable well before it is full. The window is not free storage. Every token an agent carries that it does not need is a token of attention pulled away from the tokens it does.
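The arithmetic behind the effective-capacity claim is worth making concrete. The sketch below applies the 60-70% range to an advertised window size. The function name and its percentage parameters are illustrative assumptions, and the range is the rough estimate quoted above, not a measured property of any specific model.

```python
def effective_capacity(advertised_tokens: int,
                       low_pct: int = 60,
                       high_pct: int = 70) -> tuple[int, int]:
    """Estimate the token range where a window is likely to stay reliable.

    The 60-70% default is the rough figure quoted in the text, not a
    constant for any particular model.
    """
    return (advertised_tokens * low_pct // 100,
            advertised_tokens * high_pct // 100)


low, high = effective_capacity(200_000)
print(f"Plan for roughly {low:,} to {high:,} usable tokens")
```

For a 200,000-token window this gives a planning range of 120,000 to 140,000 tokens, which is the ceiling an offloading strategy should be budgeting against.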

The three context offloading patterns

Three offloading patterns have converged in production AI systems: structured note-taking, sub-agent delegation, and just-in-time retrieval. Each sends a different kind of bulk to a different destination: durable facts to a file, exploration work to a sub-agent, and reference data to a store the agent queries on demand. What unites them is that the bulk leaves the window for somewhere the agent can recover it, rather than being summarized away. They are not mutually exclusive, and most serious agents run all three together.

Pattern	What it offloads, and where	Best for	Main cost
Structured note-taking	Durable facts and decisions, to a file	Goals that must survive many steps	Notes go stale if not updated
Sub-agent delegation	Exploration and search work, to a sub-agent	Wide tasks with parallel branches	Coordination overhead, handoff loss
Just-in-time retrieval	Reference data, to a store queried on demand	Large corpora the agent works against	Retrieval latency, wrong-chunk risk

Structured note-taking

Structured note-taking has the agent write durable facts to a file outside the window and read them back when needed. Anthropic calls this agentic memory: a technique where the agent regularly writes notes persisted to memory outside of the context window. In Anthropic’s Pokemon-playing agent, a single note tracked an objective like “for the last 1,234 steps I’ve been training my Pokemon in Route 1, Pikachu has gained 8 levels toward the target of 10,” one line standing in for thousands of turns of history.

The pattern works because the note is the agent’s own summary, written deliberately at the moment the agent knows what matters. It survives context resets, sub-agent handoffs, and full session restarts, because the file outlives any single window. The failure mode is staleness: a note written once and never revised becomes context poisoning waiting to happen, because the agent treats it as ground truth long after it stops being true. Notes need an update discipline, not just a write step. Structured context, with explicit sections for goal, decisions, and open questions, reads back far more reliably than a freeform log, because the agent can update one section without rewriting the whole file.
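A minimal sketch of that update discipline, assuming a JSON notes file with three sections. The file layout, the function names, and the `max_age` threshold are all hypothetical choices made for illustration, not a format used by Anthropic or any particular framework.

```python
import json
import tempfile
from pathlib import Path

SECTIONS = ("goal", "decisions", "open_questions")  # hypothetical layout


def load_notes(path: Path) -> dict:
    """Read the notes file, or start with empty sections if it does not exist."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {name: {"text": "", "step": 0} for name in SECTIONS}


def update_section(path: Path, section: str, text: str, step: int) -> dict:
    """Rewrite one section and record the step at which it was last revised."""
    if section not in SECTIONS:
        raise KeyError(f"unknown section: {section}")
    notes = load_notes(path)
    notes[section] = {"text": text, "step": step}
    path.write_text(json.dumps(notes, indent=2), encoding="utf-8")
    return notes


def stale_sections(notes: dict, current_step: int, max_age: int = 100) -> list[str]:
    """List non-empty sections not revised within the last max_age steps."""
    return [name for name, body in notes.items()
            if body["text"] and current_step - body["step"] > max_age]


with tempfile.TemporaryDirectory() as tmp:
    notes_path = Path(tmp) / "notes.json"
    update_section(notes_path, "goal", "Migrate billing to the v2 API", step=1)
    update_section(notes_path, "decisions", "Keep the v1 webhook format", step=180)
    notes = load_notes(notes_path)  # what a fresh window would read back
    stale = stale_sections(notes, current_step=200)

print(stale)
```

Recording the step alongside each section is what turns staleness from a silent failure into something the agent can check: a section flagged as stale is a prompt to re-verify it before trusting it.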

Sub-agent delegation

Sub-agent delegation moves exploration work into a separate agent with its own clean window, returning only a distilled result to the caller. Anthropic’s guidance is concrete: each subagent might explore extensively, using tens of thousands of tokens or more, but returns only a condensed summary of its work, often 1,000 to 2,000 tokens. The parent agent never sees the search dead-ends, the discarded files, or the raw tool output. It sees the conclusion.

This is the highest-leverage offloading pattern for wide tasks. Anthropic’s multi-agent research system, with Claude Opus 4 as the lead agent and Claude Sonnet 4 sub-agents, outperformed single-agent Claude Opus 4 by 90.2% on its internal evaluation, in part because each sub-agent got a scoped window instead of one agent drowning in every branch at once. The cost is coordination: sub-agents that share context badly hit the failure modes documented in the guide on why multi-agent AI systems fail at context, where one agent’s state bleeds into another’s reasoning or a handoff drops a constraint. The rule that makes delegation an offloading technique is the same rule that makes it work at all: agents communicate compressed results, never raw context.
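The rule that agents exchange compressed results rather than raw context can be sketched as a single function boundary. Everything here is an assumption for illustration: `delegate`, `fake_explore`, and `fake_summarize` are hypothetical names, the stand-in functions replace real model and tool calls, and a word count is used as a crude proxy for tokens.

```python
from typing import Callable


def delegate(task: str,
             explore: Callable[[str], list[str]],
             summarize: Callable[[list[str]], str],
             budget_words: int = 1500) -> str:
    """Run a sub-agent in its own scratch window and return only a capped summary."""
    scratch = explore(task)          # private to the sub-agent, never returned
    summary = summarize(scratch)
    words = summary.split()
    if len(words) > budget_words:    # crude cap; word count stands in for tokens
        summary = " ".join(words[:budget_words])
    return summary


# Stand-ins for model and tool calls, so the sketch runs without any API.
def fake_explore(task: str) -> list[str]:
    return [f"raw tool output {i} for {task}" for i in range(500)]


def fake_summarize(scratch: list[str]) -> str:
    return f"Inspected {len(scratch)} results; the config loader is the likely cause."


parent_window = ["goal: find why startup is slow"]
parent_window.append(delegate("profile startup", fake_explore, fake_summarize))
print(len(parent_window))
```

The parent window grows by one entry no matter how much the sub-agent explored. Truncating an over-long summary is itself lossy, so a real system would more likely ask the sub-agent to re-summarize than cut it off mid-thought.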

Just-in-time retrieval

Just-in-time retrieval keeps reference data out of the window entirely and loads it at runtime through tools. Instead of stuffing documents into the prompt, the agent holds lightweight identifiers, such as file paths, URLs, or record IDs, and calls a tool to fetch content for the step that needs it. Anthropic frames this as mirroring how people work: nobody memorizes an entire corpus; they retrieve from it on demand. Claude Code applies the same idea with file paths, reading a file only when the current task touches it.

This is where a context platform earns its place. An agent connected to a Wire container holds only a container reference and calls wire_search or wire_navigate to pull the entries a step needs. Because the container processes files into structured entries at upload time, a retrieval returns a handful of relevant entries rather than a whole document. The working window stays in the low thousands of tokens even when the container holds millions of entries. The same principle applied to tool definitions is progressive tool loading, which can cut a 150,000-token tool catalog to around 2,000 by deferring definitions until they are called. The cost of just-in-time retrieval is latency and precision: a slow or wrong fetch is worse than carrying the data, so the retrieval layer has to be both fast and accurate.
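The shape of just-in-time retrieval can be sketched with an in-memory store. This is not the Wire API: `EntryStore`, `search`, `fetch`, and `build_step_context` are hypothetical names, and the keyword-count scoring is a deliberately naive stand-in for a real retrieval layer.

```python
from dataclasses import dataclass


@dataclass
class EntryStore:
    """Toy store: the agent keeps a reference to it, not its contents."""
    entries: dict[str, str]

    def search(self, query: str, limit: int = 3) -> list[str]:
        """Return IDs of the best-matching entries by naive term counting."""
        terms = query.lower().split()
        scored = []
        for entry_id, text in self.entries.items():
            score = sum(text.lower().count(term) for term in terms)
            if score:
                scored.append((score, entry_id))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [entry_id for _, entry_id in scored[:limit]]

    def fetch(self, entry_id: str) -> str:
        return self.entries[entry_id]


def build_step_context(store: EntryStore, query: str, limit: int = 2) -> list[str]:
    """Load only the entries the current step needs into the window."""
    return [store.fetch(entry_id) for entry_id in store.search(query, limit)]


store = EntryStore({
    "billing/refunds": "Refund policy: refunds are issued within 14 days.",
    "billing/invoices": "Invoices are emailed monthly.",
    "legal/privacy": "Privacy policy covers data retention.",
    "ops/oncall": "On-call rotation changes weekly.",
})
ids = store.search("refund policy", limit=2)
step_context = build_step_context(store, "refund policy")
print(ids)
```

The window receives two short entries out of four, and the same call pattern holds if the store grows by orders of magnitude. What changes at scale is the part this sketch fakes: ranking quality and fetch latency.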

What offloading costs you

Every offloading pattern trades token savings for a specific risk, and the risks are predictable enough to budget for. Four costs are worth planning around before you turn a pattern on:

Retrieval latency. State at a destination has to be fetched back. Just-in-time retrieval and sub-agent delegation both add round trips, and an agent that waits on a slow store for every step can end up slower overall than one that carried the data. The fix is a retrieval layer fast enough that the round trip is cheaper than the attention tax of a bloated window. Note-taking mostly avoids this, since reading a local file is close to free.
Lossy handoffs. A note and a sub-agent summary are both compressions, and compression discards. The danger is not that they drop tokens; it is that they drop the wrong ones. A lost file path, a missing error message, or a ruled-out approach that goes unrecorded sends the agent to repeat work it already finished. Test what survives the handoff, not just how small the result is.
Conflicting copies. Offloading creates more than one place where state lives: the window, a file, a sub-agent’s report, an index. Those copies can disagree. A note says an API takes two arguments, a freshly retrieved doc says three, and the agent has no built-in way to know which is current. The more you offload, the more you need a single source of truth and a clear rule for which destination wins.
The offload-and-forget failure. The worst outcome is state that gets sent to a destination and never retrieved: a note the agent stops consulting, a sub-agent summary that omits the one detail the parent needed, or a document that is indexed but never surfaced by a query. Offloading only helps if the retrieval path is as reliable as the write path, which means the destination has to be searchable and the agent needs a reason to look.
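Of the four costs above, conflicting copies is the one that most directly calls for an explicit rule. The sketch below shows one possible rule: a fixed precedence between destinations, with recency as the tie-breaker. The precedence order, the `Claim` type, and the `resolve` function are hypothetical; which destination should win is a policy decision for each system, not something the patterns themselves dictate.

```python
from dataclasses import dataclass

# Hypothetical policy: a freshly retrieved document outranks the agent's own
# note, which outranks a sub-agent's report. Higher number wins.
PRECEDENCE = {"retrieved_doc": 3, "note": 2, "subagent_report": 1}


@dataclass(frozen=True)
class Claim:
    key: str      # what the claim is about
    value: str    # what it asserts
    source: str   # which destination it came from
    step: int     # agent step at which it was written or fetched


def resolve(claims: list[Claim]) -> dict[str, Claim]:
    """Pick one winning claim per key by source precedence, then recency."""
    winners: dict[str, Claim] = {}
    for claim in claims:
        current = winners.get(claim.key)
        rank = (PRECEDENCE[claim.source], claim.step)
        if current is None or rank > (PRECEDENCE[current.source], current.step):
            winners[claim.key] = claim
    return winners


claims = [
    Claim("create_user.arity", "2", source="note", step=10),
    Claim("create_user.arity", "3", source="retrieved_doc", step=40),
]
winner = resolve(claims)["create_user.arity"]
print(winner.value, winner.source)
```

Here the retrieved document wins and the note should be rewritten to match it, which closes the loop back to the update discipline that note-taking depends on.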

These costs are why offloading is an engineering decision, not a default. Each pattern is a bet that the round trip is cheaper than the attention tax, and that bet is wrong for short tasks where the whole job fits in the window with room to spare.

How to choose a pattern

Choose an offloading pattern by identifying which kind of bulk dominates the agent’s window. The three patterns map to three distinct sources of bloat:

If the agent loses the thread of its goal across steps or resets, add structured note-taking so the objective survives outside the window.
If the task is wide, with many independent branches, delegate the branches to sub-agents.
If the agent works against a large corpus, use just-in-time retrieval so the corpus never enters the window at all.

One common source of bloat is deliberately not on that list: a long conversation history. That is a reduction problem, not an offloading one, and the tool for it is compaction, covered in the separate guide on context compression. Offloading and reduction are complementary, and most long-running agents need both.

Most production agents use all three offloading patterns at once: note-taking for durable goals, sub-agents for exploration, and retrieval for reference data. The patterns compose because they target non-overlapping bulk and send it to non-overlapping destinations. The discipline that ties them together is treating the window as a context budget with a fixed ceiling, then deciding deliberately what earns a place inside it. Offloading is one technique in a larger toolkit; the broader set of context engineering techniques covers the retrieval and validation work that makes offloaded state trustworthy when it comes back.

The window is not storage

The instinct when an agent struggles is to give it more room. The teams shipping the most capable agents in 2026 do the opposite. They keep the working window small on purpose and treat every token in it as something that earned its place.

Context offloading is one half of how that discipline becomes mechanical. Note-taking sends durable goals to a file, sub-agents send exploration to their own windows, and just-in-time retrieval keeps reference data in a store until the moment it is needed. Reduction techniques like compaction handle the other half, the running conversation itself. Neither family is exotic, and neither is free: offloading trades a round trip and a lossy handoff for a window that stays in the range where attention actually holds. The agent that works best is not the one carrying the most context. It is the one carrying exactly what the current step needs, with everything else one fetch away.

Sources: Anthropic: Effective Context Engineering for AI Agents · Anthropic: Code Execution with MCP · Anthropic: Multi-Agent Research System · Chroma: Context Rot · AI Context Windows: Why Bigger Isn’t Always Better

Frequently asked questions

Is compaction a form of context offloading?

No. Compaction summarizes a conversation in place and restarts the window with the summary, so nothing moves to a destination the agent can return to. Context offloading relocates state to a file, a sub-agent, or an external store. Compaction reduces the window; offloading relocates what fills it.

Which context offloading pattern should I use?

It depends on which kind of bulk fills the window. Use structured note-taking when the agent loses track of its goal across steps, sub-agent delegation when a task has many independent branches, and just-in-time retrieval when the agent works against a large corpus. Most production agents combine all three.

Does context offloading reduce AI agent accuracy?

Done well, it raises accuracy, because a smaller working window concentrates attention on high-signal tokens. The risk is a bad handoff: a note, a sub-agent summary, or a retrieval that drops the detail the next step needs forces the agent to rediscover it. Accuracy depends on what the offload preserves, not on how many tokens it removes.

How much can context offloading cut token usage?

It depends on the pattern. A sub-agent can explore using tens of thousands of tokens and return a summary of just 1,000 to 2,000, and progressive tool loading, a form of just-in-time retrieval, can cut a 150,000-token tool catalog to around 2,000. Savings compound when several patterns run together.

When does context offloading make an agent slower?

Just-in-time retrieval and sub-agent delegation both add round trips: the agent has to fetch data or wait for a sub-agent before it can continue. If the retrieval layer is slow or returns the wrong content, offloading can cost more time than it saves. Structured note-taking adds almost no latency, because reading a local file is close to free.
