Context compression: why less context means better AI
Key takeaway
A context budget is an explicit allocation of tokens across the parts of an AI agent's prompt: system instructions, tools, retrieval, history, and buffer. Production teams typically allocate 10 to 15 percent to system prompts, 15 to 20 percent to tools, 30 to 40 percent to retrieval, 20 to 30 percent to history, and reserve 10 to 15 percent as a buffer. Treating the context window as a budget rather than storage prevents quality loss from attention degradation and can cut AI agent costs by more than half.
A customer service agent processing 10,000 conversations a day costs roughly $255,000 a year when context is unmanaged, and about $102,000 a year after a 60 percent context reduction. Same model, same task, same volume. The difference is whether the team treats the context window as storage or as a budget.
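The arithmetic behind those figures can be sketched in a few lines. The per-conversation token count and the per-million-token price below are illustrative assumptions chosen so the totals land near the article's figures, not numbers from any provider's price sheet:

```python
# Back-of-the-envelope agent cost model. Token counts and price are
# illustrative assumptions, not quoted figures.
def annual_cost(tokens_per_conversation: int,
                conversations_per_day: int,
                price_per_million_tokens: float) -> float:
    """Annual input-token spend for an agent at a given volume."""
    daily_tokens = tokens_per_conversation * conversations_per_day
    return daily_tokens / 1_000_000 * price_per_million_tokens * 365

# Assumed: ~23,000 input tokens per conversation at $3 per million tokens.
before = annual_cost(23_000, 10_000, 3.0)            # unmanaged context
after = annual_cost(int(23_000 * 0.4), 10_000, 3.0)  # after a 60% reduction
```

At these assumed rates the unmanaged agent lands in the $250,000-a-year range, and the 60 percent reduction scales the bill down proportionally.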
Most teams treat it as storage. They paste the entire conversation history, the full document, every tool definition, and the original system prompt into every API call, and then wonder why agent costs spiral and accuracy gets worse on long sessions. Teams running agents in production treat the context window the way a finance team treats cash: a finite pool that has to be allocated deliberately across competing claims.
This guide covers what a context budget is, the five categories of context spend, a reference allocation that works for most production agents, how that allocation should shift by agent type, and the five rules for spending wisely.
A context budget is an explicit allocation of the model’s token window across the parts of the prompt that compete for it. The five categories are system instructions, tool definitions, retrieved knowledge, conversation history, and a reserved buffer for output. Every token that goes into one category is a token that cannot go into another.
The reason this matters more than total window size is that bigger windows do not give you more usable tokens. Chroma tested 18 frontier models in 2025, including Claude Opus 4, GPT-4.1, and Gemini 2.5, and found that every single one degraded as input length grew. The drop was 30 percent or more on multi-document retrieval when the relevant content sat in the middle of the window. Practitioner reports estimate that models advertised at 200,000 tokens become unreliable around 130,000 tokens, a 30 to 35 percent gap between marketing and effective capacity.
Anthropic’s context engineering guide frames this as an “attention budget.” Their phrasing: “every new token introduced depletes this budget by some amount, increasing the need to carefully curate the tokens available.” The implication is that the goal of context engineering is not to fill the window but to find the smallest set of high-signal tokens that produces the desired result.
A budget makes that allocation explicit instead of accidental.
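One way to make it explicit is to give each category a token ceiling and flag overspend instead of silently absorbing it. A minimal sketch, using the balanced allocation described in this guide (category names and the `over_budget` helper are illustrative):

```python
# An explicit context budget: each category gets a token ceiling,
# and overspend is detected rather than silently absorbed.
REFERENCE_SHARES = {"system": 0.12, "tools": 0.17, "retrieval": 0.35,
                    "history": 0.25, "buffer": 0.11}

def build_budget(window: int, shares=REFERENCE_SHARES) -> dict[str, int]:
    """Turn percentage shares into absolute token ceilings."""
    return {cat: round(window * s) for cat, s in shares.items()}

def over_budget(usage: dict[str, int], budget: dict[str, int]) -> list[str]:
    """Categories whose actual token spend exceeds their allocation."""
    return [cat for cat, used in usage.items() if used > budget[cat]]
```

A category showing up in `over_budget` is the signal to compress, route, or split, not to reach for a bigger window.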
Every token in the prompt belongs to one of five categories. Each one has different growth dynamics and different consequences when it gets out of hand.
| Category | What it covers | Typical size |
|---|---|---|
| System prompt | Behavior, constraints, persona, output format | 500 to 5,000 tokens |
| Tool definitions | JSON schemas for callable tools, descriptions, examples | 200 to 500 tokens per tool |
| Retrieved knowledge | RAG passages, vector search results, structured data, or context containers (e.g. Wire) pulled at query time | Highly variable, often 1,000 to 30,000 tokens |
| Conversation history | Prior turns, tool outputs, intermediate reasoning | Grows linearly with session length |
| Output buffer | Reserved space for the model’s response and reasoning tokens | 2,000 tokens for non-reasoning, 10,000 or more for reasoning models |
The system prompt is the smallest category but has the strongest influence per token. Tools are usually small individually but add up when an agent has access to capabilities it does not need for the task at hand. Retrieved knowledge is the most volatile, swinging from a few hundred to tens of thousands of tokens depending on the query. History is the most dangerous over time because it compounds with every turn. The buffer exists because if the model runs out of room to respond, the request fails or the response gets truncated mid-sentence.
A balanced allocation for most production agents is 10 to 15 percent system, 15 to 20 percent tools, 30 to 40 percent retrieval, 20 to 30 percent history, and 10 to 15 percent buffer. This split, recommended by Maxim AI based on production benchmarks across customer-facing agents, is a starting point rather than a rule.
Applied to common context windows, the allocation looks like this:
| Category | 32k window | 200k window | 1M window |
|---|---|---|---|
| System prompt (12%) | 3,840 | 24,000 | 120,000 |
| Tool definitions (17%) | 5,440 | 34,000 | 170,000 |
| Retrieved knowledge (35%) | 11,200 | 70,000 | 350,000 |
| Conversation history (25%) | 8,000 | 50,000 | 250,000 |
| Output buffer (11%) | 3,520 | 22,000 | 110,000 |
The numbers in the 1M column are theoretical capacity, not what you should actually use. The Chroma findings on context rot mean that filling a 1M window with retrieval and history is the fastest way to make a frontier model perform like a much weaker one. A safer practice is to budget against the model’s effective context limit, typically 60 to 70 percent of the advertised maximum, and let the rest serve as headroom.
Two of the categories are also fixed-cost rather than truly proportional. A typical system prompt tops out around 5,000 tokens regardless of window size, and 15 tools at 500 tokens each is 7,500 tokens. As windows grow, the fixed-cost categories shrink as a percentage and the extra capacity flows to retrieval, history, and the buffer. Treat the percentages as a starting point for windows in the 32k to 128k range and convert to absolute token counts for anything larger.
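That conversion can be sketched as follows: budget against the effective limit, carve out the fixed-cost categories as absolute counts, and split the remainder across the variable categories. The 0.65 ratio reflects the 60 to 70 percent effective-capacity estimate; the fixed token counts are the typical sizes mentioned above:

```python
# Budget against the effective window, not the advertised one.
def effective_window(advertised_tokens: int, ratio: float = 0.65) -> int:
    return int(advertised_tokens * ratio)

def budget_large_window(window: int,
                        system_tokens: int = 5_000,
                        tool_tokens: int = 7_500) -> dict[str, int]:
    """Fixed-cost categories get absolute counts; the remainder is split
    across retrieval, history, and buffer in the 35/25/11 ratio."""
    usable = effective_window(window)
    remainder = usable - system_tokens - tool_tokens
    # Normalize the variable shares (0.35 + 0.25 + 0.11 = 0.71).
    return {
        "system": system_tokens,
        "tools": tool_tokens,
        "retrieval": round(remainder * 0.35 / 0.71),
        "history": round(remainder * 0.25 / 0.71),
        "buffer": round(remainder * 0.11 / 0.71),
    }
```

On a 200k-window model this budgets as if only about 130k tokens were usable, with the headroom above that left deliberately empty.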
The reasoning behind each band follows from the growth dynamics above: the system prompt and tools are fixed costs, retrieval and history are the variable spend that drives most of the bill, and the buffer is insurance against a truncated response.
Different agent types have different load profiles. A coding agent spends most of its budget on retrieval. A customer support agent spends most of it on history. A research agent uses heavy tools. A one-shot summarization agent has almost no history at all.
| Agent type | System | Tools | Retrieval | History | Buffer |
|---|---|---|---|---|---|
| Coding agent | 10% | 20% | 45% | 15% | 10% |
| Customer support | 12% | 12% | 25% | 40% | 11% |
| Research agent | 10% | 25% | 35% | 20% | 10% |
| Workflow automation | 15% | 25% | 20% | 30% | 10% |
| One-shot summarizer | 10% | 0% | 70% | 0% | 20% |
Coding agents lean heavy on retrieval because they need to load relevant files, dependencies, and call sites for each task. The history category stays smaller because most coding tasks reset between features. Customer support agents flip the ratio: a 30-message thread accumulates fast, and resolving the issue depends on what the customer said three turns ago more than on a large knowledge base.
Research agents balance retrieval with a generous tools budget because they invoke search, browse, and citation tools repeatedly. Workflow automation agents need bigger tool definitions because they orchestrate calls across multiple systems, each with its own schema. One-shot agents spend nothing on tools or history and reserve a large buffer because their entire job is producing the output.
The takeaway is that the reference 12/17/35/25/11 split is a default to deviate from. Pick the closest agent archetype, adjust based on your traffic, and measure.
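The archetype table translates directly into code. A sketch with the table's percentages as named profiles (the archetype keys are illustrative):

```python
# Per-archetype allocations from the table above, as shares summing to 1.
ARCHETYPES = {
    "coding":     {"system": 0.10, "tools": 0.20, "retrieval": 0.45,
                   "history": 0.15, "buffer": 0.10},
    "support":    {"system": 0.12, "tools": 0.12, "retrieval": 0.25,
                   "history": 0.40, "buffer": 0.11},
    "research":   {"system": 0.10, "tools": 0.25, "retrieval": 0.35,
                   "history": 0.20, "buffer": 0.10},
    "workflow":   {"system": 0.15, "tools": 0.25, "retrieval": 0.20,
                   "history": 0.30, "buffer": 0.10},
    "summarizer": {"system": 0.10, "tools": 0.00, "retrieval": 0.70,
                   "history": 0.00, "buffer": 0.20},
}

def token_budget(archetype: str, window: int) -> dict[str, int]:
    """Absolute token ceilings for a given archetype and window size."""
    return {cat: round(window * share)
            for cat, share in ARCHETYPES[archetype].items()}
```

Starting from the closest profile and then adjusting against measured traffic is usually faster than deriving an allocation from scratch.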
A budget is only as good as the discipline behind it. These five rules cover where most teams overspend.
System prompts and tool definitions rarely change between requests, so they should always be cached. Prompt caching cuts costs by 90 percent on the cached portion. Across a session, that turns the system-and-tools chunk from a recurring expense into a one-time setup cost. The catch is that the cache invalidates if anything in the cached prefix changes, so put dynamic content like timestamps and user IDs at the end of the prompt, never in the system block.
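In the Anthropic Messages API, caching is requested via the documented `cache_control` field on a content block. A sketch of a request payload that keeps the stable prefix cacheable and the dynamic values out of it (the model name, system text, and user-ID convention are placeholders, and the payload is built but not sent):

```python
# Sketch of a Messages API payload with prompt caching. The cache_control
# block marks the end of the stable prefix; dynamic per-request values go
# in the user turn so they never invalidate the cache.
def build_request(system_text: str, tools: list[dict],
                  user_message: str, user_id: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",   # placeholder model name
        "max_tokens": 2_000,
        "tools": tools,                 # stable across requests
        "system": [
            {
                "type": "text",
                "text": system_text,
                # Cache everything up to and including this block.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            # Dynamic content stays out of the cached prefix.
            {"role": "user", "content": f"[user:{user_id}] {user_message}"}
        ],
    }
```

The same payload shape works for any stable system-and-tools block; only the final `messages` list changes per request.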
Conversation history is where context bloat is hardest to spot. By turn 30, the original goal is buried under tool outputs, intermediate reasoning, and resolved sub-tasks. Summarize older turns into a compact running record that preserves decisions and constraints but drops process noise. Compression ratios of 3:1 to 5:1 work well for history; tool outputs can often compress 10:1 or more without losing actionable signal.
Pre-loading every file an agent might need is the context-window equivalent of buying groceries for the whole year. Instead, give the agent retrieval tools and let it pull what it needs when it needs it. Anthropic’s guidance for coding agents is to pre-load a small set of always-relevant instructions and dynamically retrieve everything else. The result is a smaller resident context with higher signal density.
Every tool definition sits near the front of the prompt and consumes 200 to 500 tokens on every request, whether the agent uses it or not. A customer support agent that answers knowledge base questions and looks up package delivery status does not need a weather tool, a calendar tool, or a code execution tool. Scope the tool set to the task. If an agent serves multiple tasks, pick the toolset at session start based on the user’s intent, or gate rarely used tools behind a routing layer that loads them only when the session enters a state that needs them. The same applies to tool descriptions: prune verbose examples until they earn their place.
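Gating by intent can be as simple as a lookup from detected intent to the tool names that intent actually needs. A sketch (the intents, tool names, and registry shape are hypothetical):

```python
# Intent-based tool gating: pick the toolset at session start instead of
# shipping every schema on every request. Names are hypothetical.
TOOLSETS = {
    "order_status": ["lookup_order", "track_package"],
    "kb_question":  ["search_knowledge_base"],
    "billing":      ["lookup_invoice", "issue_refund"],
}

def tools_for(intent: str, registry: dict[str, dict]) -> list[dict]:
    """Return only the tool definitions the detected intent needs."""
    names = TOOLSETS.get(intent, [])
    return [registry[n] for n in names if n in registry]
```

At 200 to 500 tokens per definition, dropping even five unused tools from every request recovers a few thousand tokens of budget per call.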
Reasoning models burn thinking tokens that count against the window. If a model needs 8,000 tokens to think through a problem and 2,000 to write the answer, but the budget left only 4,000 tokens of headroom, the response gets cut off or the request fails. Always reserve at least 10 percent of the window for output, and more for reasoning-heavy tasks.
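A headroom check makes that failure mode explicit instead of silent. A sketch, where the 2,000 and 10,000 token floors are the working assumptions from above and should be tuned per task:

```python
# Fail fast when the prompt leaves too little room for the response.
# The reasoning-mode floor is an assumption to tune per task.
def max_output_tokens(window: int, prompt_tokens: int,
                      reasoning: bool = False) -> int:
    """Headroom left for output, raising if it falls below the floor."""
    needed = 10_000 if reasoning else 2_000
    headroom = window - prompt_tokens
    if headroom < needed:
        raise ValueError(
            f"only {headroom} tokens of headroom, need at least {needed}")
    return headroom
```

Running this check before the API call turns a truncated mid-sentence response into a catchable error.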
An overspent context budget shows up as slower responses, higher bills, and quietly worse answers. The cost and latency symptoms are obvious; the accuracy symptoms are not, because the model rarely tells you it lost track of something halfway down a 150,000-token prompt.
A short diagnostic checklist:
- Per-session cost climbs steadily as conversations get longer.
- Latency grows with session length even though each question is no harder than the first.
- Accuracy drops on long sessions: the agent forgets earlier instructions or re-litigates resolved sub-tasks.
- Prompts regularly approach the model's effective context limit.
- A single category, usually history or retrieval, consumes well over its allocated share.
If two or more of these are present, the agent is over budget. Tighten allocations before adding a more expensive model, because a larger context window will not fix any of these symptoms and usually makes them worse.
Context budgeting sits one layer above prompt design and one layer below architecture choice. It is the discipline that makes the difference between an agent that works in a demo and an agent that holds up at 10,000 sessions a day.
The pattern that consistently works in production is to define the budget per agent type, instrument every category, and treat any single category exceeding its share as a signal to compress, route, or split rather than to expand the window.
The agents that scale are not the ones with the largest context windows. They are the ones with the most disciplined budgets.
Sources: Chroma: Context Rot · Anthropic: Effective Context Engineering for AI Agents · Maxim AI: Context Engineering for AI Agents · Liu et al. 2024: Lost in the Middle (TACL) · Anthropic: Prompt Caching Documentation · Stevens Online: Hidden Economics of AI Agents · Weaviate: Context Engineering
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container