What is a Context Window?
The maximum amount of text (measured in tokens) that a language model can process in a single inference call.
Think of it as working memory: everything the model can see while generating a response. Once the window is full, older content gets pushed out or truncated. System prompt, conversation history, retrieved documents, tool definitions, and the model's own output all compete for the same budget.
- Measured in tokens, not words (a token is roughly three-quarters of a word in English).
- Everything competes for space: system prompt, history, retrieved docs, tool outputs, model response.
- Effective capacity is usually 60-70% of the advertised number, not 100%.
- Information placement matters: content in the middle of the window gets less attention than the edges.
- A focused small context often beats a sprawling large one on the same task.
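The three-quarters rule of thumb gives a quick way to estimate token counts without running a tokenizer. A minimal sketch (the ratio is an approximation for English prose; real tokenizers vary by model):

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: a token is ~3/4 of an English word,
    i.e. about 4/3 tokens per word. Real tokenizers vary."""
    words = len(text.split())
    return round(words * 4 / 3)

# By this estimate, ~1,000 words of prose is roughly 1,333 tokens.
```

For anything budget-critical, count with the actual tokenizer for your model; this estimate is only for back-of-envelope planning.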
How context windows work
A context window is the total amount of text a language model can process in a single request. The unit is tokens, not words. Everything the model sees competes for this budget:
- the system prompt
- conversation history
- retrieved documents (RAG results)
- tool definitions and their returned outputs
- long-term memory snippets
- the model’s own response as it generates
Once the window fills, older content must be truncated, summarized, or dropped. That’s why long agent sessions can “forget” early instructions and why tool-heavy workflows can run out of room mid-task.
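The truncate-or-drop step can be sketched as an oldest-first trim that always preserves the system prompt. This is a hypothetical helper under simplified assumptions (messages as plain strings); production systems usually summarize old turns rather than drop them outright:

```python
def trim_history(messages, budget, count_tokens):
    """Drop the oldest non-system messages until the total token
    count fits the budget. messages[0] is the system prompt and
    is never dropped; messages are plain strings here."""
    system, rest = messages[0], list(messages[1:])
    total = count_tokens(system) + sum(count_tokens(m) for m in rest)
    while rest and total > budget:
        total -= count_tokens(rest.pop(0))  # oldest turn goes first
    return [system] + rest
```

Note the failure mode this illustrates: once trimming starts, early instructions that lived only in dropped turns are gone, which is exactly why long agent sessions "forget."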
Modern windows have grown dramatically. Claude Opus 4.6 offers a 1M token beta. Gemini 3 Pro reaches 1-2M tokens via Vertex AI. GPT-5.2 sits at 400K. A 1M token window holds roughly 7,500 pages of text.
Why window size isn’t the real bottleneck
Research consistently shows that models don’t use their full context window effectively.
- Models degrade before the limit. Elvex’s 2026 benchmarks found effective capacity is roughly 60-70% of advertised maximum. The drop is often sudden rather than gradual.
- Middle content gets lost. Stanford’s “lost in the middle” study showed a 15-20 percentage point accuracy drop for information placed mid-context versus at the edges.
- Simple tasks get harder with more context. Chroma’s context rot research tested 18 leading models and found accuracy dropping from 95% to 60-70% on trivial retrieval tasks purely as input length grew.
- Benchmarks overstate performance. NVIDIA’s RULER benchmark showed most models claiming 32K+ windows couldn’t effectively handle 32K on realistic tasks.
The shift in production thinking is from “how much fits” to “what should go in.” A focused 5,000-token context often outperforms a sprawling 100,000-token context because the model can attend to all of it.
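One way to act on the 60-70% finding is to budget against the effective window rather than the advertised one. A sketch with illustrative component shares (the fractions are assumptions for illustration, not recommendations):

```python
def context_budget(advertised, effective_fraction=0.65, shares=None):
    """Split the effective window (~65% of the advertised size, per
    the benchmarks above) across components. Shares are illustrative."""
    shares = shares or {"system": 0.10, "tools": 0.15, "retrieval": 0.35,
                        "history": 0.25, "response": 0.15}
    effective = round(advertised * effective_fraction)
    return {name: round(effective * frac) for name, frac in shares.items()}

budget = context_budget(400_000)  # e.g. a 400K advertised window
# The remaining ~140K tokens of advertised capacity stay as headroom.
```

The unused headroom is deliberate: it absorbs degradation near the limit instead of betting the task on the model's worst-performing region.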
Common misconceptions about context windows
- “Bigger is always better.” Bigger raises the ceiling on what fits, not on what the model attends to. Cost scales with every token included, whether the model uses it or not.
- “Passing the needle-in-a-haystack test means the window works.” Needle tests are easy. Realistic multi-step reasoning over full windows is much harder and is where most models fall apart.
- “The order of information doesn’t matter.” It matters a lot. Place the most important content at the start or end of the input; don’t bury critical instructions or evidence in the middle.
- “1M tokens means I can load my whole codebase.” You can load it. The model won’t reason over it well. Selective retrieval still wins.
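The edge-placement advice can be applied when assembling retrieved documents: interleave by rank so the strongest passages land at the start and end of the input and the weakest sit in the middle. A minimal sketch (a hypothetical helper, not a library API):

```python
def order_for_edges(docs):
    """Reorder docs (most relevant first) so top-ranked items sit at
    the start and end of the context, counteracting the 'lost in the
    middle' effect. Even ranks go to the front, odd ranks to the back."""
    front, back = [], []
    for rank, doc in enumerate(docs):
        (front if rank % 2 == 0 else back).append(doc)
    return front + back[::-1]

order_for_edges(["A", "B", "C", "D", "E"])  # → ["A", "C", "E", "D", "B"]
```

With five ranked documents, the best ("A") opens the context, the second-best ("B") closes it, and the weakest ("E") absorbs the attention dip in the middle.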
Context windows and Wire
Wire is designed around the reality that window size is a constraint, not a solution. Files uploaded to a container are chunked, embedded, and exposed through wire_search, so agents pull only the relevant passages into their context rather than loading whole documents. wire_explore returns structured summaries that keep tool outputs compact. The goal is to keep your agent’s window populated with the information that matters for the current step, not with every file you’ve ever given it.
Frequently asked questions
Common questions about context windows.
What is a token?
Do bigger context windows make RAG obsolete?
What is the 'lost in the middle' effect?
Why does the advertised context window not match real performance?
How much of my context window should I actually use?
Further reading
Articles about context windows
Does AI token usage scale with knowledge base size?
AI token usage scales with knowledge base size only when the full corpus loads per query. The real variable is selective context delivery, not KB size.
Context budgets: how to allocate tokens for AI agents
A practical guide to context budgets for AI agents. How to allocate tokens across system prompts, tools, retrieval, history, and a buffer in production.
Long Context Didn't Kill RAG. Here's What the Data Shows.
Long context windows haven't replaced RAG. New 2026 benchmarks reveal the cost, speed, and accuracy tradeoffs, and when each approach wins in production.
Context compression: why less context means better AI
Context compression reduces AI agent memory usage by 26-54% while preserving task performance. Here's how it works and why bigger context windows aren't the answer.
Put context into practice
Create your first context container and connect it to your AI tools in minutes.
Create Your First Container