Context budgets: how to allocate tokens for AI agents
Every AI model advertises a number: 200K tokens, 1 million tokens, 2 million tokens. These numbers keep climbing, and marketing teams treat each jump like a breakthrough. But what does “context window” actually mean, and why does the number on the box matter less than you think?
A context window is the total amount of text a language model can process in a single request. Think of it as working memory: everything the model can “see” while generating a response. Once the window is full, older content gets pushed out or truncated.
The unit of measurement is tokens, not words. A token is roughly three-quarters of a word in English, so 1,000 tokens is about 750 words. A 200K token window holds roughly 150,000 words, or about 500 pages of text.
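The tokens-to-words arithmetic above can be sketched with the common four-characters-per-token heuristic. This is a rough approximation for English text, not an exact count; real tokenizers (such as OpenAI's tiktoken) split text differently per model.

```python
# Back-of-envelope token arithmetic. Both ratios are approximations
# for English prose; a real tokenizer gives exact, model-specific counts.

def estimate_tokens(text: str) -> int:
    """Estimate tokens using the ~4 characters-per-token heuristic."""
    return max(1, len(text) // 4)

def words_for_tokens(tokens: int) -> int:
    """Invert the 'a token is ~3/4 of a word' rule of thumb."""
    return int(tokens * 0.75)

print(f"{words_for_tokens(200_000):,} words")  # a 200K window ~ 150,000 words
```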
Everything competes for space in this window: your system prompt, the conversation history, any documents retrieved by RAG, tool definitions, and the model’s own output. The “context window size” you see advertised is the total budget for all of it.
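One way to make that competition concrete is to treat the window as a literal budget and check that every component fits before sending a request. All numbers below are illustrative, not drawn from any particular model or product:

```python
# Illustrative token budget for a single request. Every component listed
# here competes for the same fixed window, including the model's output.

CONTEXT_WINDOW = 200_000

budget = {
    "system_prompt": 2_000,
    "tool_definitions": 5_000,
    "conversation_history": 60_000,
    "retrieved_documents": 80_000,
    "reserved_for_output": 8_000,
}

used = sum(budget.values())
assert used <= CONTEXT_WINDOW, "over budget: trim history or retrieval"
print(f"{used:,} of {CONTEXT_WINDOW:,} tokens allocated "
      f"({CONTEXT_WINDOW - used:,} headroom)")
```

Reserving space for the model's output up front is the step most easily forgotten: a request that fills the entire window leaves no room for a response.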
Context windows have grown dramatically over the past two years:
| Model | Context window | Notes |
|---|---|---|
| Claude Opus 4.6 | 1M tokens | Beta; 200K standard |
| Gemini 3 Pro | 1-2M tokens | 2M via Vertex AI |
| GPT-5.2 | 400K tokens | Up from 128K in GPT-4 |
These are large numbers. By the same arithmetic as above, a 1 million token window holds roughly 750,000 words, or about 2,500 pages of text. In theory, you could load an entire codebase, a full legal discovery set, or years of customer support tickets into a single prompt.
In practice, that’s rarely a good idea.
Research consistently shows that models don’t use their full context window effectively. The gap between advertised capacity and effective capacity is substantial.
Elvex’s 2026 benchmarks found that effective capacity is roughly 60-70% of the advertised maximum. A model with a 200K token window typically becomes unreliable around 130K tokens. The drop is often sudden rather than gradual: performance holds steady, then falls off a cliff.
Stanford researchers documented the “lost in the middle” effect, showing that models handle information at the beginning and end of their context far better than information in the middle. Their experiments found a 15-20 percentage point accuracy drop for middle-positioned content. Where you place information in the window matters as much as whether it fits.
Chroma’s research on context rot tested 18 leading models on trivially simple tasks (basic retrieval, text replication, fact extraction) and found accuracy dropped from 95% to 60-70% as input length increased. The tasks didn’t get harder. The only variable was how much text surrounded the answer. More context meant worse results on the same question.
NVIDIA’s RULER benchmark tested models on tasks beyond simple needle-in-a-haystack retrieval and found that most models claiming 32K+ token windows couldn’t effectively handle even 32K tokens on realistic tasks. Only a handful maintained acceptable performance, and even the best (GPT-4 at the time) showed a 15-point degradation at 128K. Passing a needle-in-a-haystack test, where a model finds one isolated fact in padding text, does not mean a model can reason across a full window of real information.
If bigger windows don’t automatically mean better results, what does? The shift in thinking is from “how much fits” to “what should go in.” This is the core of context engineering: designing systems that deliver the right information at the right time.
**Selective retrieval.** Instead of loading everything into the window, retrieve only what’s relevant to the current query. This is what RAG systems do when implemented well. A focused 5,000-token context often outperforms a sprawling 100,000-token context because the model can attend to all of it effectively.
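A minimal sketch of the idea: score candidate chunks against the query and greedily pack only the best ones into a fixed token budget. Production systems score with embeddings; plain word overlap keeps this sketch self-contained, and the documents are invented for illustration.

```python
import re

def words(text: str) -> set[str]:
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def select_chunks(query: str, chunks: list[str], budget_tokens: int) -> list[str]:
    """Greedily pack the highest-overlap chunks into a token budget."""
    ranked = sorted(chunks, key=lambda c: len(words(query) & words(c)),
                    reverse=True)
    picked, used = [], 0
    for chunk in ranked:
        cost = len(chunk) // 4  # rough token estimate
        if words(query) & words(chunk) and used + cost <= budget_tokens:
            picked.append(chunk)
            used += cost
    return picked

docs = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping times vary by region and carrier.",
]
# With a tight budget, only the chunk that actually answers the query survives:
print(select_chunks("what is the refund policy", docs, budget_tokens=20))
```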
**Structured context.** Raw text dumps are harder for models to navigate than organized, structured information. JSON, XML, databases with queryable fields, or purpose-built context containers give models a clearer signal-to-noise ratio than walls of unformatted prose.
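To see the difference, compare the same invented customer record as an unstructured dump and as JSON with explicit fields the model can anchor on:

```python
import json

# The same facts, twice. All data here is made up for illustration.
raw_dump = (
    "Acme Corp, founded 2019, about 200 employees, HQ in Austin. "
    "Enterprise plan, renews March 2026. Contact Dana Lee, prefers email."
)

structured = {
    "customer": "Acme Corp",
    "founded": 2019,
    "employees": 200,
    "headquarters": "Austin",
    "plan": "Enterprise",
    "renewal": "2026-03",
    "contact": {"name": "Dana Lee", "preferred_channel": "email"},
}

# Explicit keys make each fact individually addressable instead of buried in prose.
print(json.dumps(structured, indent=2))
```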
**Strategic placement.** Given the lost-in-the-middle effect, put the most critical information at the beginning or end of the context. This is a free optimization that costs nothing to implement and can meaningfully improve output quality.
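In practice this can be as simple as how you assemble the prompt string: state the critical instruction first, put bulk reference material in the middle, and restate the instruction at the end. A sketch:

```python
def assemble(critical: str, bulk: list[str]) -> str:
    """Place the critical instruction at both edges of the context."""
    parts = [critical]                     # beginning: strongest attention
    parts.extend(bulk)                     # middle: supporting material
    parts.append(f"Reminder: {critical}")  # end: restate the key instruction
    return "\n\n".join(parts)

prompt = assemble(
    "Answer using only the provided documents.",
    ["<document 1 text>", "<document 2 text>"],
)
print(prompt)
```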
**External memory.** Not everything needs to live in the context window. Preferences, reference documents, and historical data can live in external systems that the model queries on demand. The window stays focused on the current task. (For more on why this matters, see Why does ChatGPT forget everything?)
The context window is a constraint worth understanding, but it’s one piece of a larger puzzle. The more interesting question isn’t “how many tokens can this model hold?” It’s “how do I make sure every token counts?”
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container