Long context tripled hallucinations in 35 open models

Jitpal Kocher · 9 min read

Key takeaway

A March 2026 study evaluated 35 open-weight LLMs across 172 billion tokens and found long context hallucinations triple from 32K to 128K, with every model exceeding 10% fabrication at 200K. The best model fabricates 1.19% at 32K and 7% at 200K. The worst collapses from 7% to 70%. Hardware platform barely matters; model choice swings accuracy by 72 percentage points; and counterintuitively, higher temperatures reduce fabrication in over half the cases. The takeaway for context engineering is direct: long context is not a free upgrade, and trimming context aggressively is the lever with the highest accuracy return.

Long context tripled hallucinations across 35 frontier open-weight models in the largest deterministic LLM hallucination evaluation published to date. A March 2026 study ran 172 billion tokens of evaluation across NVIDIA H200, AMD MI300X, and Intel Gaudi 3 hardware and measured fabrication rates at 32K, 128K, and 200K context lengths. The headline result: the median model lost about 10 percentage points of accuracy from 32K to 128K, the mean drop from 32K to 200K reached roughly 24 points, and at 200K not a single model held fabrication under 10%.

This is the strongest empirical case yet that context length is the dominant variable in production AI accuracy, ahead of temperature, hardware, and many model-choice decisions. For teams designing AI agents, it changes what “give the model more context” should mean by default.

What the 172-billion-token study measured

The study tested 35 open-weight LLMs spanning 1B to 480B parameters using a methodology called RIKER: a ground-truth-first evaluator that scores answers deterministically rather than relying on another LLM as judge. Authors ran 172 billion tokens of evaluation across three context lengths (32K, 128K, 200K), four temperature settings, and three hardware platforms covering the major inference accelerators in production today.

Models came from seven families, including DeepSeek, GLM, Granite, Llama, MiniMax, and Qwen. The evaluation focused on document question-answering, the workload most teams have actually deployed: read a long document, answer questions about it. RIKER classifies outputs into three failure modes: incorrect (got the wrong answer), fabricated (made up a detail not in the source), and coherence loss (looped, repeated, or otherwise broke down).
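To make that failure taxonomy concrete, here is a minimal sketch of what a ground-truth-first, deterministic scorer in the same spirit could look like. The study does not publish RIKER's code, so the function names, the normalization, and the crude fabrication heuristic below are illustrative assumptions, not the actual evaluator.

```python
# Minimal sketch of a deterministic, ground-truth-first scorer with RIKER-style
# failure modes. Everything here (normalization rules, the repetition window,
# the fabrication heuristic) is an illustrative assumption, not RIKER itself.
from dataclasses import dataclass

@dataclass
class Judgment:
    label: str  # "correct" | "incorrect" | "fabricated" | "coherence_loss"

def _normalize(text: str) -> str:
    return " ".join(text.lower().split())

def has_degenerate_repetition(answer: str, window: int = 8, repeats: int = 4) -> bool:
    """Flag coherence loss: the same n-gram repeated many times back to back."""
    tokens = answer.split()
    for i in range(max(0, len(tokens) - window * repeats)):
        chunk = tokens[i:i + window]
        if all(tokens[i + r * window : i + (r + 1) * window] == chunk for r in range(repeats)):
            return True
    return False

def judge(answer: str, gold_answer: str, source_document: str) -> Judgment:
    if has_degenerate_repetition(answer):
        return Judgment("coherence_loss")
    norm_answer = _normalize(answer)
    if _normalize(gold_answer) in norm_answer:
        return Judgment("correct")
    # Very crude fabrication check: the answer asserts long terms that never
    # occur anywhere in the source document.
    doc = _normalize(source_document)
    novel_terms = [w for w in set(norm_answer.split()) if len(w) > 6 and w not in doc]
    if novel_terms:
        return Judgment("fabricated")
    return Judgment("incorrect")
```

The point of the sketch is the determinism: the same answer always gets the same label, which is what lets the study compare 35 models across three context lengths without an LLM judge adding its own noise.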

Crucially, this is a single, internally consistent benchmark across many axes at once. Most prior long-context studies vary one dimension at a time. By holding methodology constant across 35 models and three context lengths, the study lets you compare how a model behaves at 200K versus 32K rather than comparing two different benchmarks.

Hallucinations triple from 32K to 128K, and exceed 10% at 200K for every model

The single most important number from the study: fabrication rates roughly triple as context expands from 32K to 128K, with median accuracy dropping 10.4 percentage points. Push to 200K and the mean accuracy drop balloons to 23.9 points. At 200K, no model in the 35-model sample held fabrication below 10%.

The contrast across models is dramatic:

| Model | 32K fabrication | 200K fabrication |
| --- | --- | --- |
| GLM 4.5 (best at 32K) | 1.19% | About 7% |
| GLM 4.6 | 7.04% | 69.53% |
| MiniMax M2.1 | 5.06% | Over 10% |
| Median model | About 25% | About 49% |
| Worst small model (Granite 4.0 H Tiny) | 78.27% | Higher |

GLM 4.6 is a particularly stark example. At 32K it fabricates 7% of the time, comparable to other strong models. At 200K it fabricates 69.53%: more than two-thirds of its answers contain made-up details not present in the source document. Same model, same questions, same hardware, roughly six times the context length, ten times the hallucination rate.

This is not a “lost in the middle” effect operating on individual passages. It is a system-level accuracy floor that every tested model hits as input length grows. The mechanism is well understood: transformer attention is a finite budget distributed across tokens, and adding more tokens dilutes the signal carried by any one of them. Long context does not give the model more attention; it stretches the same attention thinner.
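To see the dilution arithmetic, here is a toy softmax calculation, not taken from the study, showing how the attention weight available to a single relevant token shrinks as the distractor count grows from 32K toward 200K:

```python
# Toy illustration (not from the study): with softmax attention, the mass that
# lands on one relevant token shrinks as more distractor tokens compete for it.
# Scores of 4.0 vs 0.0 are arbitrary; only the trend matters.
import math

def attention_on_relevant(relevant_score: float, distractor_score: float, n_distractors: int) -> float:
    """Softmax weight on a single relevant token among n_distractors distractors."""
    num = math.exp(relevant_score)
    den = num + n_distractors * math.exp(distractor_score)
    return num / den

for n in (32_000, 128_000, 200_000):
    w = attention_on_relevant(relevant_score=4.0, distractor_score=0.0, n_distractors=n)
    print(f"{n:>7} distractor tokens -> weight on the relevant token ~ {w:.2e}")
```

The relevant token's weight drops by roughly 6x going from 32K to 200K distractors, even though nothing about the relevant token itself changed.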

Model choice swings accuracy by 72 percentage points

The study’s second-largest effect is model selection. Fabrication rates at 32K range from 1.19% (GLM 4.5) to 78.27% (Granite 4.0 H Tiny), and the study puts the overall model-choice swing at about 72 percentage points of accuracy. That gap dwarfs every other variable measured, including temperature and hardware.

For practitioners this matters because the long-context conversation often gets framed as “which model has the biggest window.” A 200K window on a model that fabricates 70% at that length is worse than a 32K window on a model that fabricates 1%. The headline window size is a marketing number; the realized accuracy at a given context length is the production number.

The study also surfaces a quieter point about model families: within the GLM family, accuracy varies sharply between releases (GLM 4.5 versus 4.6 versus 4.7), which means model evaluation has to be done at the version level, not the family level. Picking “GLM” without specifying which one is not a complete model decision.
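In practice that means the evaluation matrix should be pinned to exact releases. A minimal sketch, assuming a hypothetical run_doc_qa_eval() harness and purely illustrative model identifiers:

```python
# Sketch of an evaluation matrix pinned at the version level, not the family level.
# The model identifiers and run_doc_qa_eval() are hypothetical placeholders for
# whatever serving stack and evaluation harness you already run.
from itertools import product

MODEL_VERSIONS = [
    "glm-4.5",  # pin exact releases: the study shows 4.5, 4.6, and 4.7 behave very differently
    "glm-4.6",
    "glm-4.7",
]
CONTEXT_LENGTHS = [32_000, 128_000, 200_000]

def run_doc_qa_eval(model: str, context_length: int) -> dict:
    """Placeholder: run your own document-QA set and return failure rates."""
    return {"fabricated": None, "incorrect": None, "coherence_loss": None}

results = {
    (model, ctx): run_doc_qa_eval(model, ctx)
    for model, ctx in product(MODEL_VERSIONS, CONTEXT_LENGTHS)
}
```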

Higher temperature reduced fabrication in 53% of cases

The most counterintuitive finding in the study: T=0.0 was optimal for accuracy in only about 60% of model-context combinations, and higher temperatures reduced fabrication in 53% of cases. T=0.0 also produced up to 48 times more coherence loss (loops, repetition, format breakdown) at 200K context than T=1.0.

GLM 4.7 at 200K is a clean example: 2.59% coherence loss at T=0.0 versus 0.05% at T=1.0. Setting temperature to zero, the standard “make the model deterministic” advice, produces a model that locks into degenerate output patterns more often on long inputs.

The mechanism is plausible if you think about what greedy decoding does at long context. With diluted attention and weak signal, the highest-probability next token is often a repetition of something nearby, since repetition is a high-probability pattern when the model is uncertain. A small amount of stochasticity breaks that loop without meaningfully degrading factuality.

The practical takeaway is not “set temperature to 1.” It is “test temperature deliberately on your actual inputs,” especially at long context, because the conventional wisdom of T=0 for production was developed on shorter inputs and does not generalize cleanly to 200K-token prompts.
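A small sweep harness makes that testing cheap. The sketch below assumes an OpenAI-compatible endpoint (for example a local vLLM server); the base URL, temperature grid, and the crude repetition check are all placeholders to adapt to your own stack and prompts.

```python
# Sketch of a deliberate temperature sweep on your own long-context prompts,
# assuming an OpenAI-compatible chat endpoint. It counts responses that look
# degenerate (looping n-grams), the coherence-loss mode the study describes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # e.g. a local vLLM server

def looks_degenerate(text: str, window: int = 8, repeats: int = 4) -> bool:
    tokens = text.split()
    return any(
        all(tokens[i + r * window : i + (r + 1) * window] == tokens[i:i + window] for r in range(repeats))
        for i in range(max(0, len(tokens) - window * repeats))
    )

def sweep(model: str, prompts: list[str], temperatures=(0.0, 0.3, 0.7, 1.0)) -> None:
    for temp in temperatures:
        degenerate = 0
        for prompt in prompts:
            resp = client.chat.completions.create(
                model=model,
                temperature=temp,
                messages=[{"role": "user", "content": prompt}],
            )
            if looks_degenerate(resp.choices[0].message.content or ""):
                degenerate += 1
        print(f"T={temp}: {degenerate}/{len(prompts)} responses look degenerate")
```

Pair this with your own ground-truth scoring on the same prompts and you get both halves of the trade-off the study measured: fabrication and coherence loss, per temperature, on your actual input lengths.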

Hardware barely matters; context length is the dominant variable

The study evaluated every model on three hardware platforms: NVIDIA H200, AMD MI300X, and Intel Gaudi 3. Maximum cross-platform accuracy deltas ranged from 0.24 to 0.94 percentage points. In other words: hardware affects accuracy by less than 1 percentage point in the worst case, while context length affects it by 24 points in the average case.

This is reassuring for teams considering hardware portability and damning for narratives that frame model behavior as accelerator-specific. Numerical reproducibility is essentially a solved problem at this scale.

It also clarifies the leverage hierarchy for a developer trying to reduce hallucination risk. Roughly:

| Variable | Maximum effect on accuracy |
| --- | --- |
| Model choice | About 72 percentage points |
| Context length (32K to 200K) | About 24 percentage points |
| Temperature | Single-digit percentage points |
| Hardware | Under 1 percentage point |

Two of the three variables a developer can change in production (context length and model choice) account for nearly all the variance. Hardware does not move the needle.

What this means for context engineering in production

The 172-billion-token data turns long context from a “more is better” feature into an explicit cost-of-accuracy decision. If your agent currently sends 180K tokens of retrieved documents into every prompt because the window allows it, you are operating on the part of the curve where every model in the study produces double-digit fabrication.

Three concrete implications for context engineering in production:

Treat input length as a hyperparameter, not a maximum. The advertised window is a ceiling, not a target. Most production agents would benefit from running with 30K to 50K tokens of carefully selected context rather than 150K of “just retrieve more.” The accuracy curve in this study is steepest in the 32K to 128K range, which means trimming from 128K back to 32K typically recovers 10 points of accuracy on its own.

Structure beats volume. The study evaluated raw document Q&A. In production, the same length budget can hold either a prose dump or a structured set of typed records, and structured input concentrates signal into fewer, higher-density tokens. A structured context representation of the same information at 30K tokens almost always outperforms a 100K-token prose version, both because attention is less diluted and because the model can rely on schema as a parsing scaffold.

Retrieval quality matters more than retrieval quantity. Five high-relevance chunks beat fifty noisy ones. Beyond the obvious cost angle, the data here shows that retrieval that doubles the input length is actively harmful if it does not double the relevance density.
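The three implications compose into a single selection step: rank retrieved chunks by relevance, drop anything below a threshold, and stop at a hard token budget well under the advertised window. The sketch below assumes you already have an embed() function and a tokenizer-backed count_tokens(); both are placeholders, and the budget and threshold values are illustrative, not from the study.

```python
# Sketch of relevance-first context selection under a hard token budget.
# embed() and count_tokens() are placeholders for whatever embedding model
# and tokenizer you already use; 40K tokens and 0.35 similarity are examples.

def select_context(query_vec, chunks, embed, count_tokens,
                   budget_tokens: int = 40_000, min_similarity: float = 0.35):
    """Return the highest-relevance chunks that fit inside budget_tokens."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    scored = sorted(
        ((cosine(query_vec, embed(c)), c) for c in chunks),
        key=lambda pair: pair[0],
        reverse=True,
    )
    selected, used = [], 0
    for score, chunk in scored:
        if score < min_similarity:
            break  # fifty noisy chunks are worse than five relevant ones
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break  # hard budget: treat input length as a hyperparameter, not a maximum
        selected.append(chunk)
        used += cost
    return selected
```

The design choice worth noticing is that the budget is a cap you set deliberately, decoupled from whatever the model's window happens to allow.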

Wire containers stay on the high-accuracy part of this curve by design. A connected agent reads only the slice of a container that is relevant to its query, not the full corpus, so a container holding a million tokens of source material typically returns 5K to 10K tokens per query. That keeps effective context well inside the range where the 172-billion-token study shows accuracy is highest, regardless of how much material the container actually holds.

Why the long context narrative needs revising

The dominant industry narrative through 2024 and 2025 was that bigger context windows would gradually subsume retrieval as a design pattern. The data has consistently disagreed: Stanford’s lost in the middle paper, Chroma’s context rot evaluation, the 2025 long-context-vs-RAG benchmark, and now this 172-billion-token study all point in the same direction.

The pattern is not subtle: every careful empirical study of long-context behavior finds significant accuracy degradation as input length grows, with the degradation showing up earlier than the marketing copy suggests and more severely than benchmark headlines imply.

This study is the most rigorous version of that finding to date because of the scale (172 billion tokens), the breadth (35 models, three hardware platforms, four temperatures), and the determinism (RIKER ground-truth scoring rather than LLM-as-judge). It is the strongest single piece of evidence that context length is a first-class variable in AI accuracy, not a free dimension to expand.

For teams designing AI agents, the practical conclusion is unchanged from earlier research but now hard to dismiss: the highest-return context engineering move in 2026 is not buying access to a larger window. It is sending less to the window you already have.


Sources: How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study (arXiv 2603.08274) · Lost in the Middle: How Language Models Use Long Contexts (arXiv 2307.03172) · A Survey of Context Engineering for Large Language Models (arXiv 2507.13334) · Context Engineering: A Practitioner Methodology for Structured Human-AI Collaboration (arXiv 2604.04258)

Frequently asked questions

At what context length do long context hallucinations start increasing significantly?
The sharpest jump in the 172-billion-token study came between 32K and 128K tokens, where fabrication rates roughly tripled across the median model. By 200K every model tested exceeded 10% fabrication, even ones that scored under 2% at 32K. The practical implication is that once prompts grow much beyond the 32K to 64K range, accuracy starts paying a meaningful tax.
Does setting temperature to zero reduce hallucinations on long inputs?
Not as cleanly as most teams assume. T=0.0 produced the best accuracy in roughly 60% of cases, but it also produced up to 48 times more coherence loss (infinite loops, repetition) at 200K context, and higher temperatures reduced fabrication in 53% of model-context combinations. For agentic workloads on long inputs, T around 0.3 to 0.5 is often safer than zero.
Why does longer context cause more fabrication even when the answer is in the document?
Transformer attention is a finite resource that gets diluted across more tokens, so signal from the relevant span weakens as irrelevant tokens compete with it. Position effects compound this: middle-context content gets attended to less than content at either end. The model still confidently produces an answer, but it is increasingly anchored on training-data priors rather than the input.
How does this study differ from the original lost in the middle paper?
The 2023 lost in the middle paper measured how retrieval accuracy varies with the position of the relevant passage inside the prompt, on inputs far shorter than 32K tokens. The 172-billion-token study measures fabrication across 35 frontier open-weight models at 32K, 128K, and 200K, on document Q&A with deterministic ground-truth scoring. It is roughly two orders of magnitude larger and tests the modern long-context regime directly.
How do you reduce hallucination risk on long inputs without changing the model?
Cut the context. Retrieve fewer, higher-relevance chunks; structure inputs as typed records or JSON instead of prose; place the most important content at the start or end of the window; and treat 200K-token prompts as a last resort rather than a default. The 172-billion-token data shows the highest-leverage variable available to a developer is input length, not model choice or temperature.

Ready to give your AI agents better context?

Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.

Create Your First Container