Why AI agents forget mid-task (and how to fix it)

Jitpal Kocher · 6 min read

Key takeaway

Context compression is the practice of reducing token count in long-running AI agent sessions while preserving the information agents need to complete tasks. Research shows 65% of enterprise agent failures come from context drift during multi-step reasoning, not from hitting token limits. Effective compression techniques like anchored summarization and tool response offloading keep agents focused by curating what stays in the context window rather than expanding it.

Your AI agent starts a complex task, works through the first few steps perfectly, then somewhere around step 15 it starts repeating work it already did. By step 25, it has forgotten its own plan entirely.

This is not a hallucination problem. It is not a capability problem. According to Zylos Research, 65% of enterprise AI failures in 2025 were attributed to context drift or memory loss during multi-step reasoning, not raw context exhaustion. The agent had plenty of room in its context window. It just couldn’t effectively use what was in it.

The real problem: context drift

Every action an AI agent takes generates tokens. Tool calls produce outputs. Observations get logged. Error messages pile up. A coding agent that starts with a clear 500-token task description might accumulate 80,000 tokens of history after 30 tool calls, with the original instructions now buried deep in the context.

This is where context rot kicks in. Chroma’s research across 18 frontier models found that when the context window is less than half full, models tend to lose information from the middle of the context; when it is more than half full, the earliest tokens are what get lost. Either way, the information that matters most (the original goal, key decisions, constraints) is exactly what disappears.

The gap between advertised and effective context capacity makes this worse. A model claiming 200,000 tokens typically becomes unreliable around 130,000. That is roughly 60-70% of the advertised maximum. Your agent hits degradation well before it hits the limit.
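That gap suggests a practical rule: trigger compression well before the advertised limit, not at it. A minimal sketch of such a trigger, where the 0.65 ratio is an assumption drawn from the rough 60-70% degradation point above:

```python
def should_compress(history_tokens: int, advertised_max: int,
                    effective_ratio: float = 0.65) -> bool:
    """Trigger compression at the model's effective capacity,
    not its advertised one."""
    return history_tokens >= int(advertised_max * effective_ratio)

# A 200,000-token model becomes a ~130,000-token effective budget:
should_compress(120_000, 200_000)  # False: still under budget
should_compress(135_000, 200_000)  # True: compress now
```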

For multi-agent systems, the problem compounds. Each agent accumulates its own context, and handoffs between agents add even more tokens. Without compression, multi-agent orchestrations can burn through effective context capacity in minutes.

Why naive solutions fail

The obvious answer, bigger context windows, does not help. Models with 1M+ token windows still suffer from context rot. More space does not fix the attention dilution problem; it just delays when you notice it. (For more on why, see our primer on what a context window is.)

Simple truncation is equally dangerous. Dropping the oldest messages means losing the original task description, early decisions, and constraints that subsequent work depends on. An agent that forgets it was told “don’t modify the database schema” three steps ago can do real damage.

Naive summarization, compressing everything into a single summary, flattens the signal. A one-paragraph summary of 50 tool calls cannot preserve the specific file paths, error codes, and branching decisions that the agent needs to continue working. The summary tells the agent what happened but not the details it needs to act on.

What actually works

Recent work from Factory.ai, Google, Anthropic, and academic researchers has converged on a set of techniques that preserve task performance while dramatically reducing token usage.

Anchored iterative summarization

Factory.ai’s approach maintains a structured, persistent summary with explicit sections: session intent, file modifications, decisions made, and next steps. When compression triggers, only the newly truncated span is summarized and merged with the existing summary.

The key insight is that sections act as a checklist. The summarizer must populate each one or explicitly leave it empty, so critical details like file paths and decisions cannot silently disappear. In testing across 36,000 production messages, Factory’s approach scored 3.70 overall versus 3.44 for Anthropic’s method and 3.35 for OpenAI’s. The biggest gap was context awareness: Factory scored 4.01 versus Anthropic’s 3.56.
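The checklist idea can be sketched as a fixed set of sections that every merge must account for. This is an illustrative sketch, not Factory’s implementation: the section names follow the article, and in a real system the `new_facts` dict would come from an LLM summarizing the newly truncated span.

```python
from dataclasses import dataclass, field

SECTIONS = ("session_intent", "file_modifications", "decisions_made", "next_steps")

@dataclass
class AnchoredSummary:
    sections: dict = field(default_factory=lambda: {s: [] for s in SECTIONS})

    def merge(self, new_facts: dict) -> None:
        """Merge facts from the newly truncated span into the persistent
        summary, section by section. Every section key must be present
        (possibly empty), so nothing silently disappears."""
        missing = set(SECTIONS) - set(new_facts)
        if missing:
            raise ValueError(f"summarizer omitted sections: {missing}")
        for name in SECTIONS:
            self.sections[name].extend(new_facts[name])

summary = AnchoredSummary()
summary.merge({
    "session_intent": ["Migrate auth module to OAuth2"],
    "file_modifications": ["src/auth/login.py"],
    "decisions_made": ["Do not modify the database schema"],
    "next_steps": ["Update token refresh logic"],
})
```

Because only the new span is summarized and merged, earlier entries persist verbatim across compression cycles instead of being re-summarized and degraded each time.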

Tool response offloading

LangChain’s Deep Agents system takes a different approach for the biggest token consumers: tool outputs. When a tool response exceeds 20,000 tokens, the system offloads it to the filesystem and substitutes a file path reference with a 10-line preview. The agent can re-read or search the content as needed.

This is particularly effective for coding agents, where a single file read or search result can consume tens of thousands of tokens that the agent only needs to reference once.
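A minimal sketch of the offloading pattern, with the 20,000-token threshold and 10-line preview taken from the numbers above. The whitespace-based token estimate and the scratch directory are stand-in assumptions; a real system would use the model's tokenizer and its own artifact store.

```python
import os
import uuid

OFFLOAD_DIR = "/tmp/agent_artifacts"  # assumption: local scratch space
TOKEN_LIMIT = 20_000
PREVIEW_LINES = 10

def maybe_offload(tool_output: str) -> str:
    """Return small outputs unchanged; write large ones to disk and
    return a path reference plus a short preview the agent can act on."""
    if len(tool_output.split()) <= TOKEN_LIMIT:  # crude token estimate
        return tool_output
    os.makedirs(OFFLOAD_DIR, exist_ok=True)
    path = os.path.join(OFFLOAD_DIR, f"{uuid.uuid4().hex}.txt")
    with open(path, "w") as f:
        f.write(tool_output)
    preview = "\n".join(tool_output.splitlines()[:PREVIEW_LINES])
    return f"[offloaded to {path}; re-read as needed]\n{preview}"
```

The agent keeps a cheap, actionable reference in context while the full artifact stays retrievable on disk.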

Provider-native compaction

Both Anthropic and Google’s ADK now ship built-in compaction APIs. Anthropic’s compaction is available on the Claude API and AWS Bedrock, with Google ADK offering its own compaction module. These are simpler to adopt than custom solutions, though Factory’s evaluation suggests they sacrifice some context awareness for convenience.

Failure-driven compression optimization

The ACON framework from academic research takes a different approach entirely. It analyzes cases where full context succeeds but compressed context fails, then updates compression guidelines to avoid those failure patterns. Across benchmarks on AppWorld, OfficeBench, and multi-objective QA tasks, ACON reduced memory usage by 26-54% while largely preserving task performance.

The framework is gradient-free and works with any LLM, including API-based models, making it practical for production use.

Practical takeaways

If you are building or running long-lived agent sessions, three principles matter most:

Compress incrementally, not all at once. Merge new summaries into persistent state rather than regenerating from scratch. This is the core insight behind Factory’s 0.45-point lead in context awareness.

Preserve structure, not just content. A summary that maintains explicit sections for goals, decisions, and file paths outperforms a narrative paragraph. The agent needs to act on compressed context, not just read it.

Offload large artifacts. Tool outputs, file contents, and search results are the biggest token consumers and the easiest to externalize. Keep references in context and full content on disk. Context containers, like those Wire provides, take a complementary approach by externalizing context into queryable stores that agents access on demand, keeping the active window lean.

The aim of context engineering is not to maximize what fits in the window. It is to minimize what needs to be there while preserving everything the agent needs to stay on track.


Sources: Chroma: Context Rot Research · Factory.ai: Evaluating Context Compression · Factory.ai: Compressing Context · Zylos Research: AI Agent Context Compression · ACON: Optimizing Context Compression for Long-horizon LLM Agents · LangChain: Context Management for Deep Agents · Google ADK: Context Compaction · Elvex: Context Length Comparison 2026

Frequently asked questions

What is context compression in AI?
Context compression reduces the number of tokens in an AI agent's context window while preserving the information needed to complete tasks. Techniques range from structured summarization to embedding-based approaches that achieve 80-90% token reduction. The goal is to keep the working context small and focused so the model's attention isn't diluted across irrelevant history.
Why do AI agents lose track during long tasks?
AI agents accumulate conversation history, tool outputs, and observations as they work. As this context grows, transformer attention gets diluted across more tokens, causing the model to lose track of earlier decisions and instructions. Research shows this degradation begins well before the context window is full, typically around 60-70% of advertised capacity.
What causes context drift in AI agents?
Context drift occurs when accumulated history gradually pushes critical information, like the original task goal or key decisions, out of the model's effective attention range. Each new tool call or observation adds tokens, and the model increasingly favors recent context over earlier, often more important, information.
How do you compress AI context without losing important information?
The most effective approaches use structured summarization that maintains explicit sections for different information types: session intent, file modifications, decisions made, and next steps. Factory.ai's anchored iterative summarization scored 3.70 overall versus 3.44 for Anthropic's approach and 3.35 for OpenAI's, largely because merging into persistent state prevents key details from drifting across compression cycles.
What is the difference between context compression and RAG?
RAG retrieves relevant information from external sources on demand, keeping the context window lean from the start. Context compression reduces accumulated session history after it has already been generated. They solve different problems: RAG manages what enters the window, while compression manages what stays in it over time. Most production systems use both together.

Ready to give your AI agents better context?

Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.

Create Your First Container