Context compression: why less context means better AI
Key takeaway
Context compression is the practice of reducing token count in long-running AI agent sessions while preserving the information agents need to complete tasks. Research shows 65% of enterprise agent failures come from context drift during multi-step reasoning, not from hitting token limits. Effective compression techniques like anchored summarization and tool response offloading keep agents focused by curating what stays in the context window rather than expanding it.
Your AI agent starts a complex task, works through the first few steps perfectly, then somewhere around step 15 it starts repeating work it already did. By step 25, it has forgotten its own plan entirely.
This is not a hallucination problem. It is not a capability problem. According to Zylos Research, 65% of enterprise AI failures in 2025 were attributed to context drift or memory loss during multi-step reasoning, not raw context exhaustion. The agent had plenty of room in its context window. It just couldn’t effectively use what was in it.
Every action an AI agent takes generates tokens. Tool calls produce outputs. Observations get logged. Error messages pile up. A coding agent that starts with a clear 500-token task description might accumulate 80,000 tokens of history after 30 tool calls, with the original instructions now buried deep in the context.
This is where context rot kicks in. Chroma’s research across 18 frontier models found that when the context window is less than 50% full, models lose track of information in the middle. When it is more than 50% full, they lose the earliest tokens. Either way, the information that matters most (the original goal, key decisions, constraints) is exactly what gets lost.
The gap between advertised and effective context capacity makes this worse. A model claiming 200,000 tokens typically becomes unreliable around 130,000. That is roughly 60-70% of the advertised maximum. Your agent hits degradation well before it hits the limit.
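In practice, that means compression should trigger on an effective budget, not the advertised limit. A minimal sketch, assuming a 65% effective ratio (the figures above; tune per model) and a placeholder tokenizer:

```python
# Sketch: trigger compression well before the advertised limit.
# The 0.65 ratio reflects the ~60-70% effective capacity described above.
# `count_tokens` is a placeholder; a real agent would use the model's tokenizer.

ADVERTISED_LIMIT = 200_000
EFFECTIVE_RATIO = 0.65  # degradation typically begins around here

def count_tokens(messages: list[str]) -> int:
    # Placeholder heuristic: one token per whitespace-separated word.
    return sum(len(m.split()) for m in messages)

def needs_compression(messages: list[str]) -> bool:
    budget = int(ADVERTISED_LIMIT * EFFECTIVE_RATIO)  # 130,000 tokens
    return count_tokens(messages) >= budget
```

With these numbers, compression fires at 130,000 tokens, well before the 200,000-token ceiling the model advertises.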
For multi-agent systems, the problem compounds. Each agent accumulates its own context, and handoffs between agents add even more tokens. Without compression, multi-agent orchestrations can burn through effective context capacity in minutes.
The obvious answer, bigger context windows, does not help. Models with 1M+ token windows still suffer from context rot. More space does not fix the attention dilution problem. It just delays when you notice it. (For more on why, see our guide to what a context window is.)
Simple truncation is equally dangerous. Dropping the oldest messages means losing the original task description, early decisions, and constraints that subsequent work depends on. An agent that forgets it was told “don’t modify the database schema” three steps ago can do real damage.
Naive summarization, compressing everything into a single summary, flattens the signal. A one-paragraph summary of 50 tool calls cannot preserve the specific file paths, error codes, and branching decisions that the agent needs to continue working. The summary tells the agent what happened but not the details it needs to act on.
Recent work from Factory.ai, Google, Anthropic, and academic researchers has converged on a set of techniques that preserve task performance while dramatically reducing token usage.
Factory.ai’s approach maintains a structured, persistent summary with explicit sections: session intent, file modifications, decisions made, and next steps. When compression triggers, only the newly truncated span is summarized and merged with the existing summary.
The key insight is that sections act as a checklist. The summarizer must populate each one or explicitly leave it empty, so critical details like file paths and decisions cannot silently disappear. Testing across 36,000 production messages, Factory’s approach scored 3.70 overall versus 3.44 for Anthropic’s method and 3.35 for OpenAI’s. The biggest gap was context awareness: Factory scored 4.01 versus Anthropic’s 3.56.
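The section-as-checklist idea can be sketched as a fixed-schema merge. The section names and data shapes below are illustrative, not Factory.ai’s actual schema; in a real system the new span’s summary would come from an LLM prompted to populate each section or explicitly leave it empty.

```python
# Sketch of anchored summarization (illustrative section names, not
# Factory.ai's actual schema). The persistent summary is a fixed set of
# sections; each compression pass summarizes only the newly truncated
# span and merges it in, rather than regenerating everything.

SECTIONS = ("session_intent", "file_modifications", "decisions", "next_steps")

def empty_summary() -> dict[str, list[str]]:
    return {section: [] for section in SECTIONS}

def merge_summary(existing: dict[str, list[str]],
                  new_span_summary: dict[str, list[str]]) -> dict[str, list[str]]:
    """Merge the summary of the newly truncated span into the persistent one."""
    merged = {}
    for section in SECTIONS:
        # The checklist property: every section must exist in both inputs,
        # so details like file paths cannot silently disappear.
        merged[section] = existing[section] + new_span_summary[section]
    return merged
```

The fixed iteration over `SECTIONS` is what makes omissions loud: a summarizer that drops a section breaks the merge instead of quietly losing data.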
LangChain’s Deep Agents system takes a different approach for the biggest token consumers: tool outputs. When a tool response exceeds 20,000 tokens, the system offloads it to the filesystem and substitutes a file path reference with a 10-line preview. The agent can re-read or search the content as needed.
This is particularly effective for coding agents, where a single file read or search result can consume tens of thousands of tokens that the agent only needs to reference once.
Both Anthropic and Google’s ADK now ship built-in compaction APIs. Anthropic’s compaction is available on the Claude API and AWS Bedrock, with Google ADK offering its own compaction module. These are simpler to adopt than custom solutions, though Factory’s evaluation suggests they sacrifice some context awareness for convenience.
The ACON framework from academic research takes a different approach entirely. It analyzes cases where full context succeeds but compressed context fails, then updates compression guidelines to avoid those failure patterns. Across benchmarks on AppWorld, OfficeBench, and multi-objective QA tasks, ACON reduced memory usage by 26-54% while largely preserving task performance.
The framework is gradient-free and works with any LLM, including API-based models, making it practical for production use.
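ACON’s refinement loop can be caricatured in a few lines. Every name below is a placeholder, not ACON’s API: the point is only the shape of the loop, in which an LLM rewrites the compression guideline whenever compression loses something a task needed.

```python
# Caricature of ACON-style guideline refinement (all names are
# placeholders; ACON's actual pipeline is more involved). Idea: find
# tasks where full context succeeds but compressed context fails, then
# have an LLM revise the compression guideline to cover those cases.

def refine_guideline(guideline, tasks, run_task, compress, analyze):
    failures = []
    for task in tasks:
        full_ok = run_task(task, context=task.full_context)
        compressed_ok = run_task(task, context=compress(task.full_context, guideline))
        if full_ok and not compressed_ok:
            # Compression dropped something this task depended on.
            failures.append(task)
    if failures:
        # `analyze` prompts an LLM to explain the failures and rewrite
        # the guideline. No gradients are involved, which is why this
        # works with API-only models.
        guideline = analyze(guideline, failures)
    return guideline
```

Because the loop only compares task outcomes, it treats the underlying model as a black box, matching the gradient-free property described above.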
If you are building or running long-lived agent sessions, three principles matter most:
Compress incrementally, not all at once. Merge new summaries into persistent state rather than regenerating from scratch. This is the core insight behind Factory’s 0.45-point lead in context awareness.
Preserve structure, not just content. A summary that maintains explicit sections for goals, decisions, and file paths outperforms a narrative paragraph. The agent needs to act on compressed context, not just read it.
Offload large artifacts. Tool outputs, file contents, and search results are the biggest token consumers and the easiest to externalize. Keep references in context and full content on disk. Context containers, like those Wire provides, take a complementary approach by externalizing context into queryable stores that agents access on demand, keeping the active window lean.
The aim of context engineering is not to maximize what fits in the window. It is to minimize what needs to be there while preserving everything the agent needs to stay on track.
Sources: Chroma: Context Rot Research · Factory.ai: Evaluating Context Compression · Factory.ai: Compressing Context · Zylos Research: AI Agent Context Compression · ACON: Optimizing Context Compression for Long-horizon LLM Agents · LangChain: Context Management for Deep Agents · Google ADK: Context Compaction · Elvex: Context Length Comparison 2026
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container