Context engineering: the end of prompt engineering
Key takeaway
Context compression reduces the tokens in an AI agent's working memory while preserving the information it needs to complete tasks. Research shows techniques like structured summarization and selective retention cut peak token usage by 26-54% while maintaining over 95% task accuracy. The shift from expanding context windows to compressing them marks a turning point in how production AI systems are built.
The context window race produced impressive numbers. By April 2026, five major models support 1 million tokens, one reaches 2 million, and Meta’s Llama 4 Scout claims 10 million. The assumption was simple: give agents more room, and they will perform better.
The assumption was wrong. The teams building the most capable AI agents in 2026 are not chasing larger windows. They are compressing context, deliberately reducing what agents carry forward so they can work with less noise and more signal. The shift from “fit more in” to “keep less, keep it better” may be the most important operational change in production AI this year.
The lost-in-the-middle problem demonstrated this clearly. Liu et al. measured a 30%+ accuracy drop on multi-document question answering when the answer document moved from position 1 to position 10 in a 20-document context. The U-shaped attention curve, caused by positional encoding biases in the transformer architecture, means models attend most strongly to the beginning and end of their input. Everything in the middle gets less attention.
Larger context windows do not fix this. They make it worse by giving agents more room to accumulate irrelevant history. Research analyzing 22 leading AI models found that effective context capacity is typically 60-70% of the advertised maximum. A model claiming 200,000 tokens becomes unreliable around 130,000. Performance drops suddenly rather than gradually.
This is the context rot problem at scale. As input length grows, the signal-to-noise ratio falls. Token count goes up. Attention per relevant token goes down. The model has access to everything and pays attention to less of it.
Context compression reduces the tokens in an agent’s working memory while preserving the information needed to complete the current task. Three main approaches have emerged:
Structured summarization replaces full conversation history with organized summaries. Factory.ai’s approach maintains explicit sections for session intent, file modifications, decisions made, and next steps. When the agent needs to continue working after compression, it has a structured map of what happened rather than a wall of raw messages.
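A structured summary of this kind can be sketched as a small data structure. The section names below mirror the ones Factory.ai describes (session intent, file modifications, decisions, next steps), but the class itself and its `render` method are hypothetical, not Factory.ai's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class SessionSummary:
    """Hypothetical structured summary that replaces raw conversation history."""
    session_intent: str
    file_modifications: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the summary as the compact context the agent carries forward."""
        sections = [
            "## Session intent\n" + self.session_intent,
            "## File modifications\n" + "\n".join(f"- {f}" for f in self.file_modifications),
            "## Decisions\n" + "\n".join(f"- {d}" for d in self.decisions),
            "## Next steps\n" + "\n".join(f"- {s}" for s in self.next_steps),
        ]
        return "\n\n".join(sections)

summary = SessionSummary(
    session_intent="Fix the failing auth tests",
    file_modifications=["src/auth/token.py"],
    decisions=["Rotate refresh tokens instead of extending expiry"],
    next_steps=["Re-run the integration suite"],
)
compact_context = summary.render()
```

The point of the structure is that a continuing agent gets a map, not a transcript: each section answers a question the agent will actually ask ("what was I doing?", "what did I change?").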
Tool result offloading moves large outputs out of the active context. LangChain’s Deep Agents framework offloads tool results exceeding 20,000 tokens to the filesystem, replacing them with a file path reference and a 10-line preview. The agent can re-read the full content when needed, but it does not carry it in working memory between steps.
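The offloading pattern itself is simple. The sketch below is not LangChain's implementation; it illustrates the idea with a crude characters-per-token estimate and a fixed output path, both of which are assumptions:

```python
import tempfile
from pathlib import Path

TOKEN_LIMIT = 20_000
PREVIEW_LINES = 10

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return len(text) // 4

def offload_if_large(tool_output: str, workdir: Path) -> str:
    """Replace an oversized tool result with a file reference plus a short preview."""
    if estimate_tokens(tool_output) <= TOKEN_LIMIT:
        return tool_output
    path = workdir / "tool_output.txt"
    path.write_text(tool_output)
    preview = "\n".join(tool_output.splitlines()[:PREVIEW_LINES])
    return (
        f"[Output offloaded to {path} "
        f"(~{estimate_tokens(tool_output)} tokens). Re-read the file for full content.]\n"
        + preview
    )

workdir = Path(tempfile.mkdtemp())
small = offload_if_large("short result", workdir)      # passes through unchanged
big = offload_if_large("line\n" * 50_000, workdir)     # replaced by reference + preview
```

The agent pays the full token cost only on the steps that actually need the content, instead of on every step that follows.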
Provider-native compaction handles compression automatically at the API level. Anthropic’s compaction API compresses 150,000 tokens to 25,000 (an 83% reduction) by summarizing older conversation turns when the context approaches a configured threshold. The feature works across AWS Bedrock, Google Vertex AI, and Microsoft Foundry.
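Anthropic's actual API surface is not reproduced here; the sketch below shows the generic shape of threshold-triggered compaction that such features implement, with a stub in place of the summarization LLM call:

```python
def compact_history(messages: list[dict], summarize, token_count,
                    threshold: int = 150_000, keep_recent: int = 10) -> list[dict]:
    """Generic compaction: when history exceeds a token threshold, summarize
    older turns and keep only the most recent messages verbatim.
    `summarize` would be an LLM call in a real system; here it is injected."""
    total = sum(token_count(m["content"]) for m in messages)
    if total <= threshold:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [{"role": "system",
             "content": f"Summary of earlier turns: {summarize(older)}"}] + recent

# 800 messages of ~250 tokens each (~200k total) trips the 150k threshold.
msgs = [{"role": "user", "content": "x" * 1_000} for _ in range(800)]
compacted = compact_history(
    msgs,
    summarize=lambda older: f"{len(older)} earlier messages elided",
    token_count=lambda text: len(text) // 4,
)
```

Provider-native compaction does exactly this kind of replacement for you, which is why it costs one API parameter rather than an engineering project.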
Each approach makes a different trade-off between precision and convenience. Structured summarization is the most surgical, preserving the highest-value details. Offloading is cheapest, requiring no LLM call. Native compaction is easiest to implement, requiring a single API parameter.
The ACON framework (Agent Context Optimization) provides the most rigorous evaluation. Researchers tested across AppWorld, OfficeBench, and Multi-objective QA benchmarks. ACON reduced peak token usage by 26-54% while largely preserving task performance. When they distilled the compression logic into smaller models, it preserved over 95% of accuracy.
The key insight from ACON is that compression is not one-size-fits-all. The framework uses paired trajectory analysis: it compares cases where full context succeeds but compressed context fails, identifies what information was lost, and updates the compression guidelines. This iterative process means the compression strategy improves over time for specific agent workflows.
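The refinement loop can be expressed schematically. This is a simplification of ACON, not its implementation: `analyze` stands in for the LLM that diagnoses what information the compression lost, and the trajectory records are hypothetical:

```python
def refine_guidelines(guidelines: str, paired_trajectories, analyze) -> str:
    """Schematic ACON-style refinement: for each task where full context
    succeeded but compressed context failed, ask an analyzer (an LLM in the
    real framework) what was lost, and fold it back into the guidelines."""
    for full_traj, compressed_traj in paired_trajectories:
        if full_traj["success"] and not compressed_traj["success"]:
            lost = analyze(full_traj, compressed_traj)
            guidelines += f"\n- Preserve: {lost}"
    return guidelines

pairs = [
    ({"success": True, "notes": "kept file paths"},
     {"success": False, "notes": "dropped file paths"}),
]
refined = refine_guidelines(
    "Keep task intent.",
    pairs,
    analyze=lambda full, compressed: "exact file paths",
)
```

Each failure pair tightens the guidelines, which is why the strategy specializes to a given agent workflow over time.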
Factory.ai’s evaluation framework tested a different dimension: what happens to agents after compression. They compared three approaches using probe-based evaluation that tests factual retention, file tracking, task planning, and reasoning chains. Their structured summarization scored 3.70 overall versus 3.44 for Anthropic’s native compression and 3.35 for OpenAI’s. The difference was in technical details: file paths, error messages, and specific decisions that generic summarization dropped.
LangChain’s autonomous compression takes a different approach entirely. Rather than compressing at fixed token thresholds, agents choose when to compress based on task boundaries, recognizing natural breakpoints like completing a research phase or finishing a code review. The full conversation history persists on disk as a safety net, allowing recovery if compression discards something important.
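The boundary-triggered pattern with a disk safety net looks roughly like this. The sketch is an illustration of the idea, not LangChain's code; the boundary signal and summarizer are injected stubs:

```python
import json
import tempfile
from pathlib import Path

def maybe_compress(messages: list[dict], task_boundary_reached: bool,
                   summarize, log: Path) -> list[dict]:
    """Compress at natural task boundaries rather than token thresholds,
    persisting the full history to disk first as a recovery safety net."""
    if not task_boundary_reached:
        return messages
    log.write_text(json.dumps(messages))  # full history survives on disk
    return [{"role": "system", "content": summarize(messages)}]

log = Path(tempfile.mkdtemp()) / "history.json"
messages = [
    {"role": "user", "content": "review the PR"},
    {"role": "assistant", "content": "done: two issues found"},
]
compressed = maybe_compress(
    messages,
    task_boundary_reached=True,  # e.g. the code review just finished
    summarize=lambda m: f"Code review finished; {len(m)} turns archived",
    log=log,
)
```

Because the trigger is semantic ("the review is done") rather than numeric ("we hit 150k tokens"), the summary lands at a point where a coherent summary is actually possible.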
Not all tokens are created equal. The difference between effective and destructive compression is what gets kept:

What gets kept: the session's intent, decisions and their rationale, file paths, error messages, and the current state of work in progress.

What gets discarded: filler, superseded states, verbose tool outputs where only the conclusion matters, and redundant context that has already been incorporated into later decisions.
This is why structure matters before compression happens. When context is already organized into entities with relationships, structured entries with metadata, or explicit knowledge graphs, compression can operate at the semantic level rather than the token level. Tools like Wire take this approach by processing raw input into structured entries with relationships and embeddings at write time, so the compression question becomes “which entries are still relevant” rather than “which tokens can I remove.”
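Wire's internals are not public, so the sketch below shows the general pattern only: score whole structured entries against the current task and keep or drop them as units. Tag overlap stands in for the embedding similarity a real system would use:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    text: str
    tags: set[str]  # structure attached at write time

def relevant_entries(entries: list[Entry], task_tags: set[str]) -> list[Entry]:
    """Semantic-level compression: keep or drop whole entries by relevance
    to the current task, instead of trimming tokens out of an unstructured blob.
    (Real systems would score with embeddings; tag overlap is a stand-in.)"""
    return [e for e in entries if e.tags & task_tags]

entries = [
    Entry("Auth flow uses rotating refresh tokens", {"auth"}),
    Entry("CI runs on every push to main", {"ci"}),
]
kept = relevant_entries(entries, task_tags={"auth"})
```

Dropping a whole irrelevant entry never mangles a relevant one, which is the advantage over token-level truncation.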
The principle applies regardless of tooling: the more structure your context has before compression, the more surgical the compression can be.
Context compression and prompt caching work against each other. When you compress earlier conversation turns, you change the cached prefix, which invalidates the cache and forces full recomputation of everything after.
For short sessions (under 20 turns), caching usually wins. The prefix stays stable, and the per-token savings from cache hits are large. For long sessions where context rot degrades quality, compression becomes necessary even at the cost of cache invalidation.
Some teams solve this by compressing at fixed intervals (every 50 turns) rather than continuously, minimizing cache resets while still managing context growth. Others combine both: cache the stable prefix (system prompt and tool definitions), compress conversation history, and accept a partial cache hit on each request.
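The interval-plus-stable-prefix combination can be sketched as follows. The function, interval value, and message shapes are illustrative assumptions, not any provider's API:

```python
def request_messages(stable_prefix: list[dict], history: list[dict], turn: int,
                     summarize, interval: int = 50):
    """Compress history only every `interval` turns, so the cacheable prefix
    (system prompt + tool definitions) stays byte-identical between resets."""
    if turn > 0 and turn % interval == 0:
        history = [{"role": "system", "content": summarize(history)}]
    return stable_prefix + history, history

prefix = [{"role": "system", "content": "You are a coding agent."}]
history = [{"role": "user", "content": f"turn {i}"} for i in range(50)]
msgs, history = request_messages(
    prefix, history, turn=50,
    summarize=lambda h: f"{len(h)} turns summarized",
)
```

Between compressions every request shares the same prefix and gets full cache hits; each reset invalidates the cache once rather than on every turn.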
The context window arms race is winding down. The teams building the most capable production agents have all converged on the same conclusion: managing what is in the window matters more than making the window bigger.
The practical playbook:

- Structure context at write time, so compression can operate on entries rather than tokens.
- Offload large tool outputs to the filesystem, keeping a reference and a short preview in context.
- Compress at natural task boundaries rather than fixed token thresholds, and persist the full history to disk as a safety net.
- Keep the stable prefix (system prompt and tool definitions) intact to preserve prompt-cache hits, and compress at intervals rather than continuously.
- Treat advertised context limits as optimistic: plan around 60-70% of the stated maximum.
The agents that work best in 2026 are not the ones with the most context. They are the ones that carry exactly what they need and nothing more.
Sources: ACON: Optimizing Context Compression (arXiv) · Factory.ai: Evaluating Context Compression · Anthropic Compaction API · LangChain: Context Management for Deep Agents · Lost in the Middle (Liu et al.) · AI Context Windows: Why Bigger Isn’t Always Better · Factory.ai: Compressing Context · LangChain: Autonomous Context Compression
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container