Context compression: why less context means better AI
Key takeaway
Most teams treat rising AI costs as a model pricing problem. The data tells a different story: context architecture decisions drive 60-70% of total AI spend. Token prices have fallen 280x in two years while enterprise budgets grew 320%, because agentic workflows multiply token volume 10-20x per task. The teams spending the least per task got there by restructuring context delivery, not by switching models.
Token prices have fallen roughly 280x in two years. Enterprise AI budgets have risen 320% over the same period, from an average of $1.2 million in 2024 to $7 million in 2026. The math should not work this way. Cheaper units should mean lower bills.
The math works this way because AI costs are not a pricing problem. They are a context problem. How you structure, deliver, and manage context determines 60-70% of your total AI spend, and most teams have never audited that architecture.
Per-token prices are falling fast, and that is real. Claude Opus 4.6 costs $5 per million input tokens. GPT-5.2 is at $1.75. Haiku 4.5 is at $1. Two years ago, comparable capabilities cost hundreds of dollars per million tokens. The price curve looks like a gift.
The illusion is assuming that total cost tracks per-token price. It does not, because agentic workflows changed the multiplier. A chatbot makes one LLM call per user message. An AI agent makes 10-20 calls per task: planning, tool selection, execution, verification, error recovery, response generation. Each call re-sends the full context: system instructions, tool definitions, conversation history, retrieved documents.
Inference accounts for 85% of the enterprise AI budget in 2026, and the FinOps Foundation reports that 98% of organizations are actively managing AI spend, up from 31% two years ago. The bills got people's attention. But most teams are optimizing the wrong variable.
An unconstrained AI agent solving a software engineering task costs $5-8 in API fees per task. A multi-step research agent can burn through $5-15 in minutes. To understand why, look at where the tokens go.
Manus, before its acquisition by Meta, reported an average input-to-output ratio of 100:1. For every token the model generates, it processes 100 tokens of context. That context is the system prompt, tool definitions, conversation history, retrieved documents, and prior tool results, all re-sent on every single API call.
The cost structure reinforces this. In 2026, the median output-to-input cost ratio across major providers sits at approximately 4:1. Output tokens are more expensive per unit, but input tokens dominate total spend because there are so many more of them. When your agent makes 15 calls per task and each call sends 50,000 tokens of context, you are paying to re-process 750,000 input tokens for what might produce 7,500 tokens of output.
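The arithmetic above can be made concrete. This sketch uses the document's example numbers (15 calls per task, 50,000 input tokens per call) plus hypothetical prices consistent with the 4:1 output-to-input cost ratio; roughly 500 output tokens per call is an assumption to reach the 7,500-token total.

```python
CALLS_PER_TASK = 15
INPUT_TOKENS_PER_CALL = 50_000
OUTPUT_TOKENS_PER_CALL = 500

INPUT_PRICE_PER_MTOK = 5.00    # hypothetical: $5 per million input tokens
OUTPUT_PRICE_PER_MTOK = 20.00  # 4:1 output-to-input price ratio

total_input = CALLS_PER_TASK * INPUT_TOKENS_PER_CALL    # 750,000 tokens
total_output = CALLS_PER_TASK * OUTPUT_TOKENS_PER_CALL  # 7,500 tokens

input_cost = total_input / 1_000_000 * INPUT_PRICE_PER_MTOK    # $3.75
output_cost = total_output / 1_000_000 * OUTPUT_PRICE_PER_MTOK # $0.15

print(f"input ${input_cost:.2f} vs output ${output_cost:.2f}")
```

Even though each output token costs 4x more, input ends up 25x the spend, which is why the rest of this piece focuses on the input side.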
This is the context rot problem expressed in dollars. As agents accumulate history, every subsequent call carries more weight. Token count goes up. Cost per task goes up. And most of those tokens are the same ones the model already processed on the previous call.
Context engineering is as much an economic decision as a technical one. Four architectural decisions determine where most of your AI budget goes.
Structured context carries the same information in fewer tokens than raw text. A well-organized JSON payload with labeled fields and relationships transmits meaning more densely than a wall of unprocessed prose. Research on structured versus raw text for AI shows that models extract information more reliably from structured input, which also means fewer retry cycles and less wasted compute.
The cost impact compounds across every API call. If structuring your context reduces input by even 30%, and your agent makes 15 calls per task, you save 30% across all 15 calls. Pre-processing context at upload time rather than re-processing it on every query is the cost-efficient pattern. Wire containers take this approach: files are processed into structured, queryable formats once, and agents retrieve only the relevant entries on each call rather than ingesting raw documents repeatedly.
Context compression reduces what agents carry forward between steps. Factory.ai’s structured summarization approach maintained task accuracy while cutting memory usage by 26-54%. LangChain’s Deep Agents framework offloads tool results exceeding 20,000 tokens to the filesystem, replacing them with a file path and a 10-line preview.
The arXiv paper on AgentDiet found that agent trajectories accumulate useless, redundant, and expired information that can be removed without harming performance. Another study, CigaR, demonstrated 73% token cost reduction through context-aware optimization that focuses the model on key information rather than processing everything. The pattern is consistent: agents perform as well or better with less context, and every removed token saves money. Our deep dive on context compression covers the techniques in detail.
Prompt caching is the single highest-leverage cost optimization for agents with stable prefixes. Anthropic offers 90% discounts on cached input tokens. Research across 500+ agent sessions found 41-80% cost reduction from system-prompt-only caching, with time-to-first-token improvements of 13-31%.
The key constraint is prefix stability. Caching only works when the beginning of each prompt stays identical. Dynamic timestamps, changing tool definitions, or compression that rewrites earlier turns all invalidate the cache. Structuring prompts with static content first and dynamic content last is the foundational pattern. Our prompt caching deep dive covers implementation in detail.
Model routing sends routine tasks to smaller, cheaper models and reserves frontier models for complex reasoning. 60-70% of LLM calls in a typical agent pipeline are routine tasks (classification, extraction, formatting) that small models handle well. A task routed to a frontier reasoning model may cost 190x more than the same task handled by a fast, smaller model.
Cascading approaches, where tasks start at a cheap model and escalate only when needed, regularly achieve 60-87% cost reduction because the expensive models only process what they must.
| Strategy | Savings | How it works | Trade-off |
|---|---|---|---|
| Model downgrade (e.g. Opus to Haiku) | 5x per token | Lower price per unit | Linear savings, potential quality loss |
| Prompt caching | 41-80% total | Reuse computed prefix | Requires stable prompt structure |
| Context compression | 26-54% total | Reduce tokens carried forward | May lose edge-case details |
| Structured context | 20-40% total | Denser information per token | Requires pre-processing pipeline |
| Model routing | 60-87% total | Match task to model | Needs classifier or complexity scoring |
| Combined context optimization | 70-90% total | All of the above | Multiplicative across all calls |
Switching from Opus to Haiku cuts the per-token price 5x. That is a meaningful reduction on any individual call. But it is linear: you pay less per token while processing the same volume of tokens.
Restructuring context architecture is multiplicative. Compression reduces the tokens. Caching avoids re-processing them. Structuring makes each token carry more information. Routing sends the reduced, structured, cached context to the right model. Each layer compounds on the others.
A team that switches to a cheaper model cuts its unit cost 5x. A team that restructures context and keeps the better model cuts total cost 70-90% while maintaining quality. The context engineering approach treats cost as a design outcome, not a procurement decision.
The teams with the lowest per-task costs share a pattern: they treat context architecture as the primary cost lever.
Manus made KV-cache hit rate their north star metric for production agents. By structuring every prompt so the prefix stayed stable across calls, they maximized caching savings on every request in a session. The 100:1 input-to-output ratio meant caching had an outsized impact on their total bill.
The AgentDiet research team built trajectory reduction into the agent loop itself, automatically identifying and removing expired, redundant, and irrelevant information from the agent’s accumulated history. The result was lower token counts without degraded performance, because the removed content was not contributing to task completion anyway.
The CigaR framework took a different angle: rather than removing context, it optimized how context was presented to maximize information density per token. The 73% cost reduction came not from sending less, but from sending the same information in a more token-efficient format.
These are not competing approaches. They are layers in the same stack: structure the context, compress what accumulates, cache what repeats, route what remains. Each one reduces the token volume that the next one operates on.
If your AI costs are climbing, the diagnostic is straightforward: audit the four levers above. Is your context structured or raw? What accumulates in the agent's history between steps? Do your prompt prefixes stay stable enough to cache? And does each call actually need the model it is routed to?
The 280x drop in token prices is real. But for teams running AI agents in production, the bill is set by architecture, not by pricing. The teams spending the least per task are not the ones who found the cheapest model. They are the ones who send the right context, in the right format, at the right time.
Sources: Deloitte: AI Token Spend Dynamics · Zylos Research: AI Agent Cost Optimization · Stevens: Hidden Economics of AI Agents · Manus: Context Engineering Lessons · Don’t Break the Cache (arXiv) · CigaR: Cost-Efficient Program Repair (arXiv) · AgentDiet: Trajectory Reduction (arXiv) · Factory.ai: Evaluating Compression · FinOps Foundation: GenAI Token Pricing · AnalyticsWeek: Inference Economics
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container