Context compression: why less context means better AI
Key takeaway
A context budget is an explicit allocation of tokens across the parts of an AI agent's prompt: system instructions, tools, retrieval, history, and buffer. Production teams typically allocate 10 to 15 percent to system prompts, 15 to 20 percent to tools, 30 to 40 percent to retrieval, 20 to 30 percent to history, and reserve 10 to 15 percent as a buffer. Treating the context window as a budget rather than storage prevents quality loss from attention degradation and can cut AI agent costs by more than half.
A customer service agent processing 10,000 conversations a day costs roughly $255,000 a year when context is unmanaged, and about $102,000 a year after a 60 percent context reduction. Same model, same task, same volume. The difference is whether the team treats the context window as storage or as a budget.
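The arithmetic behind those figures can be sketched in a few lines. The per-conversation token count and the per-million-token price below are illustrative assumptions chosen so the totals land near the article's figures, not numbers from any provider's price sheet:

```python
# Back-of-the-envelope agent cost model. Token counts and price are
# illustrative assumptions, not quoted figures.
def annual_cost(tokens_per_conversation: int,
                conversations_per_day: int,
                price_per_million_tokens: float) -> float:
    """Annual input-token spend for an agent at a given volume."""
    daily_tokens = tokens_per_conversation * conversations_per_day
    return daily_tokens / 1_000_000 * price_per_million_tokens * 365

# Assumed: ~23,000 input tokens per conversation at $3 per million tokens.
before = annual_cost(23_000, 10_000, 3.0)            # unmanaged context
after = annual_cost(int(23_000 * 0.4), 10_000, 3.0)  # after a 60% reduction
```

At these assumed rates the unmanaged agent lands in the $250,000-a-year range, and the 60 percent reduction scales the bill down proportionally.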
Most teams treat it as storage. They paste the entire conversation history, the full document, every tool definition, and the original system prompt into every API call, and then wonder why agent costs spiral and accuracy gets worse on long sessions. Teams running agents in production treat the context window the way a finance team treats cash: a finite pool that has to be allocated deliberately across competing claims.
This guide covers what a context budget is, the five categories of context spend, a reference allocation that works for most production agents, how that allocation should shift by agent type, and the five rules for spending wisely.
A context budget is an explicit allocation of the model’s token window across the parts of the prompt that compete for it. The five categories are system instructions, tool definitions, retrieved knowledge, conversation history, and a reserved buffer for output. Every token that goes into one category is a token that cannot go into another.
The reason this matters more than total window size is that bigger windows do not give you more usable tokens. Chroma tested 18 frontier models in 2025, including Claude Opus 4, GPT-4.1, and Gemini 2.5, and found that every single one degraded as input length grew. The drop was 30 percent or more on multi-document retrieval when the relevant content sat in the middle of the window. Practitioner reports estimate that models advertised at 200,000 tokens become unreliable around 130,000 tokens, a 30 to 35 percent gap between marketing and effective capacity.
Anthropic’s context engineering guide frames this as an “attention budget.” Their phrasing: “every new token introduced depletes this budget by some amount, increasing the need to carefully curate the tokens available.” The implication is that the goal of context engineering is not to fill the window but to find the smallest set of high-signal tokens that produces the desired result.
A budget makes that allocation explicit instead of accidental.
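One way to make it explicit is to give each category a token ceiling and flag overspend instead of silently absorbing it. A minimal sketch, using the balanced allocation described in this guide (category names and the `over_budget` helper are illustrative):

```python
# An explicit context budget: each category gets a token ceiling,
# and overspend is detected rather than silently absorbed.
REFERENCE_SHARES = {"system": 0.12, "tools": 0.17, "retrieval": 0.35,
                    "history": 0.25, "buffer": 0.11}

def build_budget(window: int, shares=REFERENCE_SHARES) -> dict[str, int]:
    """Turn percentage shares into absolute token ceilings."""
    return {cat: round(window * s) for cat, s in shares.items()}

def over_budget(usage: dict[str, int], budget: dict[str, int]) -> list[str]:
    """Categories whose actual token spend exceeds their allocation."""
    return [cat for cat, used in usage.items() if used > budget[cat]]
```

A category showing up in `over_budget` is the signal to compress, route, or split, not to reach for a bigger window.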
Every token in the prompt belongs to one of five categories. Each one has different growth dynamics and different consequences when it gets out of hand.
| Category | What it covers | Typical size |
|---|---|---|
| System prompt | Behavior, constraints, persona, output format | 500 to 5,000 tokens |
| Tool definitions | JSON schemas for callable tools, descriptions, examples | 200 to 500 tokens per tool |
| Retrieved knowledge | RAG passages, vector search results, structured data, or context containers (e.g. Wire) pulled at query time | Highly variable, often 1,000 to 30,000 tokens |
| Conversation history | Prior turns, tool outputs, intermediate reasoning | Grows linearly with session length |
| Output buffer | Reserved space for the model’s response and reasoning tokens | 2,000 tokens for non-reasoning, 10,000 or more for reasoning models |
The system prompt is the smallest category but has the strongest influence per token. Tools are usually small individually but add up when an agent has access to capabilities it does not need for the task at hand. Retrieved knowledge is the most volatile, swinging from a few hundred to tens of thousands of tokens depending on the query. History is the most dangerous over time because it compounds with every turn. The buffer exists because if the model runs out of room to respond, the request fails or the response gets truncated mid-sentence.
A balanced allocation for most production agents is 10 to 15 percent system, 15 to 20 percent tools, 30 to 40 percent retrieval, 20 to 30 percent history, and 10 to 15 percent buffer. This split, recommended by Maxim AI based on production benchmarks across customer-facing agents, is a starting point rather than a rule.
Applied to common context windows, the allocation looks like this:
| Category | 32k window | 200k window | 1M window |
|---|---|---|---|
| System prompt (12%) | 3,840 | 24,000 | 120,000 |
| Tool definitions (17%) | 5,440 | 34,000 | 170,000 |
| Retrieved knowledge (35%) | 11,200 | 70,000 | 350,000 |
| Conversation history (25%) | 8,000 | 50,000 | 250,000 |
| Output buffer (11%) | 3,520 | 22,000 | 110,000 |
The numbers in the 1M column are theoretical capacity, not what you should actually use. The Chroma findings on context rot mean that filling a 1M window with retrieval and history is the fastest way to make a frontier model perform like a much weaker one. A safer practice is to budget against the model’s effective context limit, typically 60 to 70 percent of the advertised maximum, and let the rest serve as headroom.
Two of the categories are also fixed-cost rather than truly proportional. A typical system prompt tops out around 5,000 tokens regardless of window size, and 15 tools at 500 tokens each is 7,500 tokens. As windows grow, the fixed-cost categories shrink as a percentage and the extra capacity flows to retrieval, history, and the buffer. Treat the percentages as a starting point for windows in the 32k to 128k range and convert to absolute token counts for anything larger.
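That conversion can be sketched as follows: budget against the effective limit, carve out the fixed-cost categories as absolute counts, and split the remainder across the variable categories. The 0.65 ratio reflects the 60 to 70 percent effective-capacity estimate; the fixed token counts are the typical sizes mentioned above:

```python
# Budget against the effective window, not the advertised one.
def effective_window(advertised_tokens: int, ratio: float = 0.65) -> int:
    return int(advertised_tokens * ratio)

def budget_large_window(window: int,
                        system_tokens: int = 5_000,
                        tool_tokens: int = 7_500) -> dict[str, int]:
    """Fixed-cost categories get absolute counts; the remainder is split
    across retrieval, history, and buffer in the 35/25/11 ratio."""
    usable = effective_window(window)
    remainder = usable - system_tokens - tool_tokens
    # Normalize the variable shares (0.35 + 0.25 + 0.11 = 0.71).
    return {
        "system": system_tokens,
        "tools": tool_tokens,
        "retrieval": round(remainder * 0.35 / 0.71),
        "history": round(remainder * 0.25 / 0.71),
        "buffer": round(remainder * 0.11 / 0.71),
    }
```

On a 200k-window model this budgets as if only about 130k tokens were usable, with the headroom above that left deliberately empty.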
The reasoning behind each band follows from the growth dynamics above: the system prompt and tools are fixed costs, retrieval and history are the variable spend that drives most of the bill, and the buffer is insurance against a truncated response.
Different agent types have different load profiles. A coding agent spends most of its budget on retrieval. A customer support agent spends most of it on history. A research agent uses heavy tools. A one-shot summarization agent has almost no history at all.
| Agent type | System | Tools | Retrieval | History | Buffer |
|---|---|---|---|---|---|
| Coding agent | 10% | 20% | 45% | 15% | 10% |
| Customer support | 12% | 12% | 25% | 40% | 11% |
| Research agent | 10% | 25% | 35% | 20% | 10% |
| Workflow automation | 15% | 25% | 20% | 30% | 10% |
| One-shot summarizer | 10% | 0% | 70% | 0% | 20% |
Coding agents lean heavy on retrieval because they need to load relevant files, dependencies, and call sites for each task. The history category stays smaller because most coding tasks reset between features. Customer support agents flip the ratio: a 30-message thread accumulates fast, and resolving the issue depends on what the customer said three turns ago more than on a large knowledge base.
Research agents balance retrieval with a generous tools budget because they invoke search, browse, and citation tools repeatedly. Workflow automation agents need bigger tool definitions because they orchestrate calls across multiple systems, each with its own schema. One-shot agents spend nothing on tools or history and reserve a large buffer because their entire job is producing the output.
The takeaway is that the reference 12/17/35/25/11 split is a default to deviate from. Pick the closest agent archetype, adjust based on your traffic, and measure.
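The archetype table translates directly into code. A sketch with the table's percentages as named profiles (the archetype keys are illustrative):

```python
# Per-archetype allocations from the table above, as shares summing to 1.
ARCHETYPES = {
    "coding":     {"system": 0.10, "tools": 0.20, "retrieval": 0.45,
                   "history": 0.15, "buffer": 0.10},
    "support":    {"system": 0.12, "tools": 0.12, "retrieval": 0.25,
                   "history": 0.40, "buffer": 0.11},
    "research":   {"system": 0.10, "tools": 0.25, "retrieval": 0.35,
                   "history": 0.20, "buffer": 0.10},
    "workflow":   {"system": 0.15, "tools": 0.25, "retrieval": 0.20,
                   "history": 0.30, "buffer": 0.10},
    "summarizer": {"system": 0.10, "tools": 0.00, "retrieval": 0.70,
                   "history": 0.00, "buffer": 0.20},
}

def token_budget(archetype: str, window: int) -> dict[str, int]:
    """Absolute token ceilings for a given archetype and window size."""
    return {cat: round(window * share)
            for cat, share in ARCHETYPES[archetype].items()}
```

Starting from the closest profile and then adjusting against measured traffic is usually faster than deriving an allocation from scratch.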
A budget is only as good as the discipline behind it. These five rules cover where most teams overspend.
System prompts and tool definitions rarely change between requests, so they should always be cached. Prompt caching cuts costs by 90 percent on the cached portion. Across a session, that turns the system-and-tools chunk from a recurring expense into a one-time setup cost. The catch is that the cache invalidates if anything in the cached prefix changes, so put dynamic content like timestamps and user IDs at the end of the prompt, never in the system block.
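In the Anthropic Messages API, caching is requested via the documented `cache_control` field on a content block. A sketch of a request payload that keeps the stable prefix cacheable and the dynamic values out of it (the model name, system text, and user-ID convention are placeholders, and the payload is built but not sent):

```python
# Sketch of a Messages API payload with prompt caching. The cache_control
# block marks the end of the stable prefix; dynamic per-request values go
# in the user turn so they never invalidate the cache.
def build_request(system_text: str, tools: list[dict],
                  user_message: str, user_id: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",   # placeholder model name
        "max_tokens": 2_000,
        "tools": tools,                 # stable across requests
        "system": [
            {
                "type": "text",
                "text": system_text,
                # Cache everything up to and including this block.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            # Dynamic content stays out of the cached prefix.
            {"role": "user", "content": f"[user:{user_id}] {user_message}"}
        ],
    }
```

The same payload shape works for any stable system-and-tools block; only the final `messages` list changes per request.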
Conversation history is where context bloat is hardest to spot. By turn 30, the original goal is buried under tool outputs, intermediate reasoning, and resolved sub-tasks. Summarize older turns into a compact running record that preserves decisions and constraints but drops process noise. Compression ratios of 3:1 to 5:1 work well for history; tool outputs can often compress 10:1 or more without losing actionable signal.
Pre-loading every file an agent might need is the context-window equivalent of buying groceries for the whole year. Instead, give the agent retrieval tools and let it pull what it needs when it needs it. Anthropic’s guidance for coding agents is to pre-load a small set of always-relevant instructions and dynamically retrieve everything else. The result is a smaller resident context with higher signal density.
Every tool definition sits near the front of the prompt and consumes 200 to 500 tokens on every request, whether the agent uses it or not. A customer support agent that answers knowledge base questions and looks up package delivery status does not need a weather tool, a calendar tool, or a code execution tool. Scope the tool set to the task. If an agent serves multiple tasks, pick the toolset at session start based on the user’s intent, or gate rarely used tools behind a routing layer that loads them only when the session enters a state that needs them. The same applies to tool descriptions: prune verbose examples until they earn their place.
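Gating by intent can be as simple as a lookup from detected intent to the tool names that intent actually needs. A sketch (the intents, tool names, and registry shape are hypothetical):

```python
# Intent-based tool gating: pick the toolset at session start instead of
# shipping every schema on every request. Names are hypothetical.
TOOLSETS = {
    "order_status": ["lookup_order", "track_package"],
    "kb_question":  ["search_knowledge_base"],
    "billing":      ["lookup_invoice", "issue_refund"],
}

def tools_for(intent: str, registry: dict[str, dict]) -> list[dict]:
    """Return only the tool definitions the detected intent needs."""
    names = TOOLSETS.get(intent, [])
    return [registry[n] for n in names if n in registry]
```

At 200 to 500 tokens per definition, dropping even five unused tools from every request recovers a few thousand tokens of budget per call.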
Reasoning models burn thinking tokens that count against the window. If a model needs 8,000 tokens to think through a problem and 2,000 to write the answer, but the budget left only 4,000 tokens of headroom, the response gets cut off or the request fails. Always reserve at least 10 percent of the window for output, and more for reasoning-heavy tasks.
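A headroom check makes that failure mode explicit instead of silent. A sketch, where the 2,000 and 10,000 token floors are the working assumptions from above and should be tuned per task:

```python
# Fail fast when the prompt leaves too little room for the response.
# The reasoning-mode floor is an assumption to tune per task.
def max_output_tokens(window: int, prompt_tokens: int,
                      reasoning: bool = False) -> int:
    """Headroom left for output, raising if it falls below the floor."""
    needed = 10_000 if reasoning else 2_000
    headroom = window - prompt_tokens
    if headroom < needed:
        raise ValueError(
            f"only {headroom} tokens of headroom, need at least {needed}")
    return headroom
```

Running this check before the API call turns a truncated mid-sentence response into a catchable error.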
An overspent context budget shows up as slower responses, higher bills, and quietly worse answers. The cost and latency symptoms are obvious; the accuracy symptoms are not, because the model rarely tells you it lost track of something halfway down a 150,000-token prompt.
A short diagnostic checklist:
- Per-session cost climbs steadily as conversations get longer.
- Latency grows with session length even though each question is no harder than the first.
- Accuracy drops on long sessions: the agent forgets earlier instructions or re-litigates resolved sub-tasks.
- Prompts regularly approach the model's effective context limit.
- A single category, usually history or retrieval, consumes well over its allocated share.
If two or more of these are present, the agent is over budget. Tighten allocations before adding a more expensive model, because a larger context window will not fix any of these symptoms and usually makes them worse.
Context budgeting sits one layer above prompt design and one layer below architecture choice. It is the discipline that makes the difference between an agent that works in a demo and an agent that holds up at 10,000 sessions a day.
The pattern that consistently works in production is to define the budget per agent type, instrument every category, and treat any single category exceeding its share as a signal to compress, route, or split rather than to expand the window.
The agents that scale are not the ones with the largest context windows. They are the ones with the most disciplined budgets.
Sources: Chroma: Context Rot · Anthropic: Effective Context Engineering for AI Agents · Maxim AI: Context Engineering for AI Agents · Liu et al. 2024: Lost in the Middle (TACL) · Anthropic: Prompt Caching Documentation · Stevens Online: Hidden Economics of AI Agents · Weaviate: Context Engineering
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container