Context compression: why less context means better AI

Jitpal Kocher · 7 min read

Key takeaway

Context compression reduces the tokens in an AI agent's working memory while preserving the information it needs to complete tasks. Research shows techniques like structured summarization and selective retention cut memory usage by 26-54% while maintaining over 95% task accuracy. The shift from expanding context windows to compressing them marks a turning point in how production AI systems are built.

The context window race produced impressive numbers. By April 2026, five major models support 1 million tokens, one reaches 2 million, and Meta’s Llama 4 Scout claims 10 million. The assumption was simple: give agents more room, and they will perform better.

The assumption was wrong. The teams building the most capable AI agents in 2026 are not chasing larger windows. They are compressing context, deliberately reducing what agents carry forward so they can work with less noise and more signal. The shift from “fit more in” to “keep less, keep it better” may be the most important operational change in production AI this year.

Why bigger windows made things worse

The lost-in-the-middle problem shows why. Liu et al. measured a 30%+ accuracy drop on multi-document question answering when the answer document moved from position 1 to position 10 in a 20-document context. The U-shaped attention curve, caused by positional encoding biases in the transformer architecture, means models attend most strongly to the beginning and end of their input. Everything in the middle gets less attention.

Larger context windows do not fix this. They make it worse by giving agents more room to accumulate irrelevant history. Research analyzing 22 leading AI models found that effective context capacity is typically 60-70% of the advertised maximum. A model claiming 200,000 tokens becomes unreliable around 130,000. Performance drops suddenly rather than gradually.

This is the context rot problem at scale. As input length grows, the signal-to-noise ratio falls. Token count goes up. Attention per relevant token goes down. The model has access to everything and pays attention to less of it.

What context compression actually does

Context compression reduces the tokens in an agent’s working memory while preserving the information needed to complete the current task. Three main approaches have emerged:

Structured summarization replaces full conversation history with organized summaries. Factory.ai’s approach maintains explicit sections for session intent, file modifications, decisions made, and next steps. When the agent needs to continue working after compression, it has a structured map of what happened rather than a wall of raw messages.
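A structured summary of this kind can be modeled as a small data structure rather than free-form text. The sketch below is illustrative, not Factory.ai's implementation; the section names mirror the ones described above, and the example values are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class SessionSummary:
    """Structured replacement for raw conversation history."""
    intent: str                                          # what the session is trying to accomplish
    file_modifications: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)   # each entry: the decision and why
    next_steps: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the summary as a compact context block for the agent."""
        sections = [
            ("Session intent", [self.intent]),
            ("Files modified", self.file_modifications),
            ("Decisions", self.decisions),
            ("Next steps", self.next_steps),
        ]
        lines: list[str] = []
        for title, items in sections:
            lines.append(f"## {title}")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)

summary = SessionSummary(
    intent="Migrate auth module to OAuth2",
    file_modifications=["src/auth/oauth.py (new)", "src/auth/legacy.py (deprecated)"],
    decisions=["Chose PKCE flow because clients are public SPAs"],
    next_steps=["Add token refresh handling"],
)
print(summary.render())
```

Because the summary is typed, the agent (or an evaluation harness) can check that every section is populated before the raw history is discarded.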

Tool result offloading moves large outputs out of the active context. LangChain’s Deep Agents framework offloads tool results exceeding 20,000 tokens to the filesystem, replacing them with a file path reference and a 10-line preview. The agent can re-read the full content when needed, but it does not carry it in working memory between steps.
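The offloading pattern is simple to sketch. This is not LangChain's code; the 20,000-token threshold and 10-line preview follow the Deep Agents behavior described above, while the storage path and the 4-characters-per-token heuristic are assumptions:

```python
import hashlib
from pathlib import Path

TOKEN_LIMIT = 20_000   # offload threshold, per the Deep Agents behavior described above
PREVIEW_LINES = 10

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token.
    return len(text) // 4

def maybe_offload(result: str, store: Path) -> str:
    """Keep small tool results inline; write large ones to disk and
    replace them with a file-path reference plus a short preview."""
    if estimate_tokens(result) <= TOKEN_LIMIT:
        return result
    store.mkdir(parents=True, exist_ok=True)
    path = store / f"{hashlib.sha1(result.encode()).hexdigest()[:12]}.txt"
    path.write_text(result)
    preview = "\n".join(result.splitlines()[:PREVIEW_LINES])
    return f"[full output saved to {path}]\n{preview}"
```

The key property is that offloading is lossless: the reference in context tells the agent where to re-read the full output when a later step actually needs it.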

Provider-native compaction handles compression automatically at the API level. Anthropic’s compaction API compresses 150,000 tokens to 25,000 (an 83% reduction) by summarizing older conversation turns when the context approaches a configured threshold. The feature works across AWS Bedrock, Google Vertex AI, and Microsoft Foundry.

Each approach makes a different trade-off between precision and convenience. Structured summarization is the most surgical, preserving the highest-value details. Offloading is cheapest, requiring no LLM call. Native compaction is easiest to implement, requiring a single API parameter.
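Anthropic's compaction runs server-side, but the control flow it automates can be sketched client-side. The thresholds below echo the 150,000-to-25,000 figures reported above; the `summarize` callable is a hypothetical caller-supplied function (in practice, an LLM call), and the 4-chars-per-token estimate is an assumption:

```python
from typing import Callable

def compact(history: list[str],
            summarize: Callable[..., str],
            max_tokens: int = 150_000,
            target_tokens: int = 25_000,
            keep_recent: int = 5) -> list[str]:
    """When the context nears max_tokens, summarize older turns and
    keep only the summary plus the most recent turns verbatim."""
    total = sum(len(turn) // 4 for turn in history)   # ~4 chars/token heuristic
    if total < max_tokens:
        return history                                 # below threshold: no-op
    older, recent = history[:-keep_recent], history[-keep_recent:]
    summary = summarize(older, budget=target_tokens)
    return [f"[summary of earlier turns]\n{summary}", *recent]
```

Keeping the most recent turns verbatim matters: they are the turns the model attends to most strongly, and they anchor whatever the agent does next.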

What the research shows

The ACON framework (Agent Context Optimization) provides the most rigorous evaluation. Researchers tested across AppWorld, OfficeBench, and Multi-objective QA benchmarks. ACON reduced peak token usage by 26-54% while largely preserving task performance. When the researchers distilled the compression logic into smaller models, those models preserved over 95% of the original accuracy.

The key insight from ACON is that compression is not one-size-fits-all. The framework uses paired trajectory analysis: it compares cases where full context succeeds but compressed context fails, identifies what information was lost, and updates the compression guidelines. This iterative process means the compression strategy improves over time for specific agent workflows.

Factory.ai’s evaluation framework tested a different dimension: what happens to agents after compression. They compared three approaches using probe-based evaluation that tests factual retention, file tracking, task planning, and reasoning chains. Their structured summarization scored 3.70 overall versus 3.44 for Anthropic’s native compression and 3.35 for OpenAI’s. The difference was in technical details: file paths, error messages, and specific decisions that generic summarization dropped.

LangChain’s autonomous compression takes a different approach entirely. Rather than compressing at fixed token thresholds, agents choose when to compress based on task boundaries, recognizing natural breakpoints like completing a research phase or finishing a code review. The full conversation history persists on disk as a safety net, allowing recovery if compression discards something important.
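A boundary-triggered compressor can be sketched as follows. This is not LangChain's implementation: in their design the agent itself decides when a phase has ended, whereas this sketch uses hypothetical marker phrases; the archive path and the `summarize` callable are also assumptions. The disk archive is the safety-net step described above:

```python
import json
from pathlib import Path
from typing import Callable

# Illustrative boundary markers; a real agent recognizes these semantically.
BOUNDARY_MARKERS = ("research phase complete", "code review finished", "subtask done")

def compress_at_boundary(history: list[dict],
                         last_message: str,
                         archive: Path,
                         summarize: Callable[[list[dict]], str]) -> list[dict]:
    """At a natural task boundary, archive the full history to disk
    (lossless safety net) and replace it with a summary; otherwise
    leave the history untouched."""
    if not any(m in last_message.lower() for m in BOUNDARY_MARKERS):
        return history
    archive.write_text(json.dumps(history))   # recoverable if the summary drops something
    return [{"role": "system", "content": summarize(history)}]
```

Compressing only at boundaries keeps each phase's working context intact while it is still in use, which is why this tends to preserve more coherence than a fixed token threshold.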

What good compression preserves

Not all tokens are created equal. The difference between effective and destructive compression is what gets kept:

  • Entities and relationships: names, IDs, file paths, configuration values
  • Decisions and their reasoning: not just what the agent decided, but why
  • Constraints and errors: what failed, what is not allowed, what was ruled out
  • Current task state: where the agent is in its workflow and what comes next

What gets discarded: filler, superseded states, verbose tool outputs where only the conclusion matters, and redundant context that has already been incorporated into later decisions.
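A crude version of this keep/drop decision can be expressed as a filter over context entries. The patterns below are illustrative stand-ins for the four categories listed above; production systems use semantic classification rather than regexes:

```python
import re

# One pattern per retention category; all hypothetical examples.
KEEP_PATTERNS = [
    r"\b[\w./-]+\.(py|ts|json|yaml|md)\b",          # entities: file paths, config files
    r"\b(decided|chose|because)\b",                  # decisions and their reasoning
    r"\b(error|failed|not allowed|ruled out)\b",     # constraints and errors
    r"\b(next step|todo|remaining)\b",               # current task state
]

def selective_retention(entries: list[str]) -> list[str]:
    """Keep entries matching a high-value pattern; drop the rest."""
    keep = re.compile("|".join(KEEP_PATTERNS), re.IGNORECASE)
    return [e for e in entries if keep.search(e)]
```

Even this naive filter illustrates the asymmetry: a dropped pleasantry costs nothing, while a dropped file path or error message can derail every step that follows.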

This is why structure matters before compression happens. When context is already organized into entities with relationships, structured entries with metadata, or explicit knowledge graphs, compression can operate at the semantic level rather than the token level. Tools like Wire take this approach by processing raw input into structured entries with relationships and embeddings at write time, so the compression question becomes “which entries are still relevant” rather than “which tokens can I remove.”

The principle applies regardless of tooling: the more structure your context has before compression, the more surgical the compression can be.

The trade-off with caching

Context compression and prompt caching work against each other. When you compress earlier conversation turns, you change the cached prefix, which invalidates the cache and forces full recomputation of everything after.

For short sessions (under 20 turns), caching usually wins. The prefix stays stable, and the per-token savings from cache hits are large. For long sessions where context rot degrades quality, compression becomes necessary even at the cost of cache invalidation.

Some teams solve this by compressing at fixed intervals (every 50 turns) rather than continuously, minimizing cache resets while still managing context growth. Others combine both: cache the stable prefix (system prompt and tool definitions), compress conversation history, and accept a partial cache hit on each request.
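The interval-based policy reduces to a small scheduling predicate. The 50-turn interval comes from the example above; the 130,000-token hard limit reflects the ~65% effective-capacity figure for a 200,000-token model cited earlier, and both defaults are assumptions to tune per workload:

```python
def should_compress(turn: int,
                    est_tokens: int,
                    interval: int = 50,
                    hard_limit: int = 130_000) -> bool:
    """Compress at fixed intervals to limit cache invalidations,
    but always compress before the effective context limit."""
    if est_tokens >= hard_limit:          # safety valve: never hit the rot zone
        return True
    return turn > 0 and turn % interval == 0
```

Each `True` costs a cache reset; the interval is the knob that trades cache-hit savings against context growth between compressions.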

What this means for production

The context window arms race is winding down. The teams building the most capable production agents have all converged on the same conclusion: managing what is in the window matters more than making the window bigger.

The practical playbook:

  1. Start with structure. Structured context compresses better than raw text because compression can operate on semantic units rather than raw tokens.
  2. Offload before summarizing. Large tool results should be moved to external storage, not summarized into the context.
  3. Compress at task boundaries. Natural breakpoints (completing a subtask, finishing research) preserve more coherence than fixed token thresholds.
  4. Keep a full history externally. Compression is lossy. Store the uncompressed conversation on disk or in a database so agents can recover if needed.
  5. Measure what matters. Traditional metrics like ROUGE do not tell you if an agent can continue working. Test factual retention, file tracking, and reasoning chains after compression.

The agents that work best in 2026 are not the ones with the most context. They are the ones that carry exactly what they need and nothing more.


Sources: ACON: Optimizing Context Compression (arXiv) · Factory.ai: Evaluating Context Compression · Anthropic Compaction API · LangChain: Context Management for Deep Agents · Lost in the Middle (Liu et al.) · AI Context Windows: Why Bigger Isn’t Always Better · Factory.ai: Compressing Context · LangChain: Autonomous Context Compression

Frequently asked questions

What is context compression in AI?
Context compression is the practice of reducing the volume of information in an AI agent's working memory while preserving everything it needs to complete its task. Techniques include structured summarization, tool result offloading, and selective retention. Research shows effective compression reduces token usage by 26-54% without meaningful accuracy loss.
Why do AI agents need context compression?
AI agents accumulate conversation history, tool outputs, and observations that dilute attention and degrade performance. Nearly 65% of enterprise AI failures in 2025 were attributed to context drift or memory loss during multi-step reasoning. Compression keeps the working context focused on what matters for the current task.
Does a bigger context window solve context problems?
No. Research shows models experience a 30%+ accuracy drop when relevant information sits in the middle of long contexts, and effective context capacity is typically 60-70% of the advertised maximum. Bigger windows give agents more room to accumulate noise, which makes the attention dilution problem worse.
How does structured summarization differ from truncation?
Truncation discards the oldest messages regardless of their importance, often losing critical decisions and constraints. Structured summarization maintains organized sections for session intent, file modifications, decisions made, and next steps. In Factory.ai's evaluation, structured summarization scored 3.70 overall versus 3.44 for the best generic approach.
What is Anthropic's compaction API?
Anthropic's compaction API automatically summarizes conversation history when it approaches the context window limit. In testing, it compressed 150,000 tokens to 25,000, an 83% reduction. The feature is available via the Messages API and works across AWS Bedrock, Google Vertex AI, and Microsoft Foundry.

Ready to give your AI agents better context?

Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.

Create Your First Container