What is a Context Window?

The maximum amount of text (measured in tokens) that a language model can process in a single inference call.

Think of it as working memory: everything the model can see while generating a response. Once the window is full, older content gets pushed out or truncated. System prompt, conversation history, retrieved documents, tool definitions, and the model's own output all compete for the same budget.

  • Measured in tokens, not words (a token is roughly three-quarters of a word in English).
  • Everything competes for space: system prompt, history, retrieved docs, tool outputs, model response.
  • Effective capacity is usually 60-70% of the advertised number, not 100%.
  • Information placement matters: content in the middle of the window gets less attention than the edges.
  • A focused small context often beats a sprawling large one on the same task.
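The token-to-word ratio above reduces to simple arithmetic. A minimal sketch of that estimate (a crude heuristic for English text, not a real tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: a token is ~3/4 of an English word, so ~4 tokens per 3 words."""
    return round(len(text.split()) * 4 / 3)
```

Real token counts vary by tokenizer and language, so treat this only as a planning figure.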

How context windows work

A context window is the total amount of text a language model can process in a single request. The unit is tokens, not words. Everything the model sees competes for this budget:

  • the system prompt
  • conversation history
  • retrieved documents (RAG results)
  • tool definitions and their returned outputs
  • long-term memory snippets
  • the model’s own response as it generates

Once the window fills, older content must be truncated, summarized, or dropped. That’s why long agent sessions can “forget” early instructions and why tool-heavy workflows can run out of room mid-task.
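The simplest of those strategies, truncation, can be sketched as follows: keep the system prompt fixed and drop the oldest turns until everything fits. The word-based `count_tokens` here is an assumed stand-in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    # Crude word-based estimate; a real system would use the model's tokenizer.
    return round(len(text.split()) * 4 / 3)

def fit_history(system_prompt: str, history: list[str], budget: int) -> list[str]:
    """Drop the oldest turns until system prompt + history fit the token budget."""
    used = count_tokens(system_prompt)
    kept: list[str] = []
    # Walk newest-to-oldest so the most recent turns survive truncation.
    for turn in reversed(history):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))
```

This is exactly why long sessions "forget": once the budget is exceeded, the earliest turns are the ones silently discarded.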

Modern windows have grown dramatically. Claude Opus 4.6 offers a 1M token beta. Gemini 3 Pro reaches 1-2M tokens via Vertex AI. GPT-5.2 sits at 400K. At roughly 750 words per 1,000 tokens, a 1M token window holds about 750,000 words, or roughly 2,500 pages of text.

Why window size isn’t the real bottleneck

Research consistently shows that models don’t use their full context window effectively.

  • Models degrade before the limit. Elvex’s 2026 benchmarks found effective capacity is roughly 60-70% of advertised maximum. The drop is often sudden rather than gradual.
  • Middle content gets lost. Stanford’s “lost in the middle” study showed a 15-20 percentage point accuracy drop for information placed mid-context versus at the edges.
  • Simple tasks get harder with more context. Chroma’s context rot research tested 18 leading models and found accuracy dropping from 95% to 60-70% on trivial retrieval tasks purely as input length grew.
  • Benchmarks overstate performance. NVIDIA’s RULER benchmark showed most models claiming 32K+ windows couldn’t effectively handle 32K on realistic tasks.

The shift in production thinking is from “how much fits” to “what should go in.” A focused 5,000-token context often outperforms a sprawling 100,000-token context because the model can attend to all of it.

Common misconceptions about context windows

  • “Bigger is always better.” Bigger raises the ceiling on what fits, not on what the model attends to. Cost scales with every token included, whether the model uses it or not.
  • “Passing the needle-in-a-haystack test means the window works.” Needle tests are easy. Realistic multi-step reasoning over full windows is much harder and is where most models fall apart.
  • “The order of information doesn’t matter.” It matters a lot. Place the most important content at the start or end of the input; don’t bury critical instructions or evidence in the middle.
  • “1M tokens means I can load my whole codebase.” You can load it. The model won’t reason over it well. Selective retrieval still wins.
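One way to act on the ordering advice above is to sandwich retrieved passages between the instructions, with the highest-ranked evidence nearest the edges of the window. A sketch under that assumption (function name and layout are illustrative, not a standard API):

```python
def assemble_prompt(instructions: str, passages: list[str]) -> str:
    """Place instructions first, restate them last, and order passages
    (ranked best-first) so the strongest evidence sits near the edges."""
    # Alternate the best passages toward the front and back, weakest in the middle.
    front: list[str] = []
    back: list[str] = []
    for i, passage in enumerate(passages):
        (front if i % 2 == 0 else back).append(passage)
    body = "\n\n".join(front + list(reversed(back)))
    return f"{instructions}\n\n{body}\n\nReminder: {instructions}"
```

Repeating the instructions at the end costs a few tokens but keeps the critical directive out of the low-attention middle.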

Context windows and Wire

Wire is designed around the reality that window size is a constraint, not a solution. Files uploaded to a container are chunked, embedded, and exposed through wire_search, so agents pull only the relevant passages into their context rather than loading whole documents. wire_explore returns structured summaries that keep tool outputs compact. The goal is to keep your agent’s window populated with the information that matters for the current step, not with every file you’ve ever given it.

Frequently asked questions

Common questions about context windows.

What is a token?
A token is the unit a language model reads and generates. In English, one token is roughly three-quarters of a word, so 1,000 tokens is about 750 words. A 200K context window holds roughly 150,000 words, or about 500 pages.
Do bigger context windows make RAG obsolete?
No. Long context helps at the margins, but accuracy still degrades as input length grows. Chroma's research shows model accuracy dropping from 95% to 60-70% on trivially simple tasks purely as input length increases. RAG remains valuable because selective retrieval produces focused context, which the model reasons over more reliably than a sprawling prompt.
What is the 'lost in the middle' effect?
Stanford researchers documented that models handle information at the beginning and end of their context much better than information in the middle, with a 15-20 percentage point accuracy drop for middle-positioned content. Where you place information in the window can matter as much as whether it fits.
Why does the advertised context window not match real performance?
Needle-in-a-haystack tests (finding one isolated fact in padding text) are easy and inflate reported capacity. NVIDIA's RULER benchmark showed that most models claiming 32K+ token windows couldn't effectively handle even 32K tokens on realistic multi-step tasks.
How much of my context window should I actually use?
A common heuristic is to plan for 60-70% of the advertised limit as effective capacity, leave headroom for the model's response, and keep the most important information at the start or end of the input rather than buried in the middle.
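That heuristic is simple arithmetic. A sketch of the split, where the 65% usable fraction and the response reserve are assumed planning values, not fixed limits:

```python
def plan_budget(advertised_window: int,
                usable_fraction: float = 0.65,   # midpoint of the 60-70% heuristic
                response_reserve: int = 4_000) -> dict[str, int]:
    """Split an advertised window into an input budget and response headroom."""
    effective = int(advertised_window * usable_fraction)
    input_budget = max(effective - response_reserve, 0)
    return {
        "effective": effective,
        "input_budget": input_budget,
        "response_reserve": response_reserve,
    }
```

For a 200K window this leaves roughly 126K tokens for input, which is still far more than most focused tasks need.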

Put context into practice

Create your first context container and connect it to your AI tools in minutes.

Create Your First Container