Every AI model advertises a number: 200K tokens, 1 million tokens, 2 million tokens. These numbers keep climbing, and marketing teams treat each jump like a breakthrough. But what does “context window” actually mean, and why does the number on the box matter less than you think?

The basics

A context window is the total amount of text a language model can process in a single request. Think of it as working memory: everything the model can “see” while generating a response. Once the window is full, older content gets pushed out or truncated.

The unit of measurement is tokens, not words. A token is roughly three-quarters of a word in English, so 1,000 tokens is about 750 words. A 200K token window holds roughly 150,000 words, or about 500 pages of text.
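These conversions are easy to sketch in code. A minimal rule-of-thumb calculator, assuming the article's approximations (0.75 words per token for English, ~300 words per page); real counts depend on the tokenizer:

```python
# Rough English-text conversions; actual token counts vary by tokenizer.
WORDS_PER_TOKEN = 0.75   # a token is ~3/4 of an English word
WORDS_PER_PAGE = 300     # a typical manuscript page

def tokens_to_words(tokens: int) -> int:
    return round(tokens * WORDS_PER_TOKEN)

def tokens_to_pages(tokens: int) -> int:
    return round(tokens_to_words(tokens) / WORDS_PER_PAGE)

print(tokens_to_words(200_000))  # -> 150000 words
print(tokens_to_pages(200_000))  # -> 500 pages
```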

Everything competes for space in this window: your system prompt, the conversation history, any documents retrieved by RAG, tool definitions, and the model’s own output. The “context window size” you see advertised is the total budget for all of it.
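It can help to think of this literally as a budget. The sketch below uses illustrative numbers (not measurements from any real deployment) to show how quickly the components add up against a 200K window:

```python
CONTEXT_WINDOW = 200_000  # advertised total, in tokens

# Hypothetical allocation for a single request; every entry draws
# from the same pool, including the model's own reply.
budget = {
    "system_prompt": 2_000,
    "tool_definitions": 5_000,
    "retrieved_documents": 80_000,
    "conversation_history": 40_000,
    "reserved_for_output": 8_000,
}

used = sum(budget.values())
print(f"used {used} of {CONTEXT_WINDOW} tokens, {CONTEXT_WINDOW - used} free")
```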

Where things stand today

Context windows have grown dramatically over the past two years:

| Model | Context window | Notes |
| --- | --- | --- |
| Claude Opus 4.6 | 1M tokens | Beta; 200K standard |
| Gemini 3 Pro | 1-2M tokens | 2M via Vertex AI |
| GPT-5.2 | 400K tokens | Up from 128K in GPT-4 |

These are large numbers. A 1 million token window can hold roughly 2,500 pages of text. In theory, you could load an entire codebase, a full legal discovery set, or years of customer support tickets into a single prompt.

In practice, that’s rarely a good idea.

Why size isn’t everything

Research consistently shows that models don’t use their full context window effectively. The gap between advertised capacity and effective capacity is substantial.

Models degrade before hitting their limit

Elvex’s 2026 benchmarks found that effective capacity is roughly 60-70% of the advertised maximum. A model with a 200K token window typically becomes unreliable around 130K tokens. The drop is often sudden rather than gradual: performance holds steady, then falls off a cliff.

Information in the middle gets lost

Stanford researchers documented the “lost in the middle” effect, showing that models handle information at the beginning and end of their context far better than information in the middle. Their experiments found a 15-20 percentage point accuracy drop for middle-positioned content. Where you place information in the window matters as much as whether it fits.

Simple tasks get harder with more context

Chroma’s research on context rot tested 18 leading models on trivially simple tasks (basic retrieval, text replication, fact extraction) and found accuracy dropped from 95% to 60-70% as input length increased. The tasks didn’t get harder. The only variable was how much text surrounded the answer. More context meant worse results on the same question.

Benchmarks overstate real-world performance

NVIDIA’s RULER benchmark tested models on tasks beyond simple needle-in-a-haystack retrieval and found that most models claiming 32K+ token windows couldn’t effectively handle even 32K tokens on realistic tasks. Only a handful maintained acceptable performance, and even the best (GPT-4 at the time) showed a 15-point degradation at 128K. Passing a needle-in-a-haystack test, where a model finds one isolated fact in padding text, does not mean a model can reason across a full window of real information.

What matters more than window size

If bigger windows don’t automatically mean better results, what does? The shift in thinking is from “how much fits” to “what should go in.” This is the core of context engineering: designing systems that deliver the right information at the right time.

Selective retrieval. Instead of loading everything into the window, retrieve only what’s relevant to the current query. This is what RAG systems do when implemented well. A focused 5,000-token context often outperforms a sprawling 100,000-token context because the model can attend to all of it effectively.
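A minimal sketch of budget-aware selection, assuming a hypothetical `score(chunk, query)` relevance function (in practice this would be BM25, embedding similarity, or a reranker) and precomputed chunk lengths:

```python
def select_context(chunks, query, score, budget_tokens):
    """Greedily pick the highest-scoring chunks that fit in the budget."""
    ranked = sorted(chunks, key=lambda c: score(c, query), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        if used + chunk["tokens"] <= budget_tokens:
            selected.append(chunk)
            used += chunk["tokens"]
    return selected

# Illustrative usage with a dummy relevance score:
docs = [
    {"id": "pricing", "tokens": 3_000, "relevance": 0.9},
    {"id": "history", "tokens": 4_000, "relevance": 0.5},
    {"id": "terms",   "tokens": 2_500, "relevance": 0.8},
]
picked = select_context(docs, "renewal pricing",
                        lambda c, q: c["relevance"], budget_tokens=6_000)
# picked -> the "pricing" and "terms" chunks (5,500 tokens total)
```

The greedy fill is deliberately simple; the point is that relevance, not availability, decides what enters the window.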

Structured context. Raw text dumps are harder for models to navigate than organized, structured information. JSON, XML, databases with queryable fields, or purpose-built context containers give models a clearer signal-to-noise ratio than walls of unformatted prose.
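As a small illustration (the note and field names are invented for this example), the same facts can be passed as a structured record instead of prose:

```python
import json

raw_note = ("Acme Corp renewal. Contract value $120k/yr. "
            "Renewal date 2025-03-01. Contact: J. Smith.")

# Instead of pasting the raw note, send the model a structured record
# with explicit, queryable fields.
structured = {
    "account": "Acme Corp",
    "contract_value_usd_per_year": 120_000,
    "renewal_date": "2025-03-01",
    "primary_contact": "J. Smith",
}

context_block = json.dumps(structured, indent=2)
```

The model no longer has to parse prose to find the renewal date; each fact arrives labeled.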

Strategic placement. Given the lost-in-the-middle effect, put the most critical information at the beginning or end of the context. This optimization costs nothing to implement and can meaningfully improve output quality.
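A sketch of placement-aware prompt assembly (the helper and its inputs are hypothetical, but the ordering reflects the lost-in-the-middle finding):

```python
def assemble_prompt(critical: str, background: list[str], question: str) -> str:
    # The edges of the window get the best recall: key facts go first,
    # the question goes last, and bulk reference material sits in the
    # middle, where recall is weakest anyway.
    return "\n\n".join([critical, *background, question])

prompt = assemble_prompt(
    critical="Policy: refunds are only valid within 30 days.",
    background=["...long product manual...", "...support transcripts..."],
    question="Is a 45-day-old purchase eligible for a refund?",
)
```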

External memory. Not everything needs to live in the context window. Preferences, reference documents, and historical data can live in external systems that the model queries on demand. The window stays focused on the current task. (For more on why this matters, see Why does ChatGPT forget everything?)
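A toy sketch of the pattern: stable facts live outside the window and are fetched only when the current task needs them. A real system would back this with a database or vector store rather than an in-process dict:

```python
class ExternalMemory:
    """Minimal key-value store the model queries on demand, so stable
    facts never occupy the context window between uses."""

    def __init__(self):
        self._store = {}

    def remember(self, key, value):
        self._store[key] = value

    def recall(self, key, default=None):
        return self._store.get(key, default)

memory = ExternalMemory()
memory.remember("user_timezone", "Europe/Berlin")

# The window stays clean; this lookup happens only when needed.
timezone = memory.recall("user_timezone")  # -> 'Europe/Berlin'
```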

Takeaways

  1. Treat advertised context windows as a ceiling, not a target. Effective capacity is 60-70% of the stated maximum. Plan accordingly.
  2. Curate what goes into the window. A smaller, focused context reliably outperforms a larger, noisy one. If you’re filling the window, you’re probably including too much.
  3. Design for context quality, not context volume. The organizations getting the best results from AI are spending more time on what enters the window than on finding models with bigger windows.

The context window is a constraint worth understanding, but it’s one piece of a larger puzzle. The more interesting question isn’t “how many tokens can this model hold?” It’s “how do I make sure every token counts?”


Ready to give your AI agents better context?

Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.

Get Started