Why AI Agent Memory Keeps Failing
Key takeaway
A March 2026 study evaluated 35 open-weight LLMs across 172 billion tokens and found that long-context hallucination rates roughly triple from 32K to 128K, with every model exceeding 10% fabrication at 200K. The best model fabricates 1.19% at 32K and about 7% at 200K. The steepest collapse goes from 7% to 70%. Hardware platform barely matters; model choice swings accuracy by 72 percentage points; and, counterintuitively, higher temperatures reduce fabrication in over half the cases. The takeaway for context engineering is direct: long context is not a free upgrade, and trimming context aggressively is the lever with the highest accuracy return.
Long context tripled hallucinations across 35 frontier open-weight models in the largest deterministic LLM hallucination evaluation ever published. A March 2026 study ran 172 billion tokens of evaluation across NVIDIA H200, AMD MI300X, and Intel Gaudi 3 hardware and measured fabrication rates at 32K, 128K, and 200K context lengths. The headline result: median models lost about 10 percentage points of accuracy from 32K to 128K, lost 24 points from 32K to 200K, and at 200K not a single model held fabrication under 10%.
This is the strongest empirical case yet that context length is the dominant variable in production AI accuracy, ahead of temperature, hardware, and many model-choice decisions. For teams designing AI agents, it changes what “give the model more context” should mean by default.
The study tested 35 open-weight LLMs spanning 1B to 480B parameters using a methodology called RIKER: a ground-truth-first evaluator that scores answers deterministically rather than relying on another LLM as judge. Authors ran 172 billion tokens of evaluation across three context lengths (32K, 128K, 200K), four temperature settings, and three hardware platforms covering the major inference accelerators in production today.
Models came from seven families: DeepSeek, GLM, Granite, Llama, MiniMax, Qwen, and a few others. The evaluation focused on document question-answering, the workload most teams have actually deployed: read a long document, answer questions about it. RIKER classifies outputs into three failure modes: incorrect (got the wrong answer), fabricated (made up a detail not in the source), and coherence loss (looped, repeated, or otherwise broke down).
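To make the three failure modes concrete, here is a minimal sketch of a RIKER-style deterministic classifier. This is an illustration of the three-way split described above, not the paper's actual scoring rules (which are more involved); the loop detector and vocabulary check are deliberately crude stand-ins.

```python
import re

def _words(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def classify_output(answer, ground_truth, source_text):
    # Illustrative sketch of a RIKER-style deterministic three-way split;
    # the study's actual scoring rules are more involved than this.
    words = _words(answer)
    # Coherence loss: an n-gram repeats back-to-back (crude loop detector).
    for n in (3, 4, 5):
        for i in range(max(0, len(words) - 2 * n + 1)):
            if words[i:i + n] == words[i + n:i + 2 * n]:
                return "coherence_loss"
    if words == _words(ground_truth):
        return "correct"
    # Fabricated: the answer uses content words absent from the source.
    if any(w not in set(_words(source_text)) for w in words):
        return "fabricated"
    return "incorrect"
```

The key design property is determinism: the same answer always gets the same label, which is what lets the study compare 35 models across billions of tokens without an LLM judge adding its own noise.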
Crucially, this is a single, internally consistent benchmark across many axes at once. Most prior long-context studies vary one dimension at a time. By holding methodology constant across 35 models and three context lengths, the study lets you compare how a model behaves at 200K versus 32K rather than comparing two different benchmarks.
The single most important number from the study: fabrication rates roughly triple as context expands from 32K to 128K, with median accuracy dropping 10.4 percentage points. Push to 200K and the mean accuracy drop balloons to 23.9 points. At 200K, no model in the 35-model sample held fabrication below 10%.
The contrast across models is dramatic:
| Model | 32K fabrication | 200K fabrication |
|---|---|---|
| GLM 4.5 (best at 32K) | 1.19% | About 7% |
| GLM 4.6 | 7.04% | 69.53% |
| MiniMax M2.1 | 5.06% | Over 10% |
| Median model | About 25% | About 49% |
| Worst small model (Granite 4.0 H Tiny) | 78.27% | Higher |
GLM 4.6 is a particularly stark example. At 32K it fabricates 7% of the time, comparable to other strong models. At 200K it fabricates 69.53%: more than two-thirds of its answers contain made-up details not present in the source document. Same model, same questions, same hardware, three times the context length, ten times the hallucination rate.
This is not a “lost in the middle” effect operating on individual passages. It is a system-level degradation that every tested model exhibits as input length grows. The mechanism is well understood: transformer attention is a finite budget distributed across tokens, and adding more tokens dilutes the signal carried by any one of them. Long context does not give the model more attention; it stretches the same attention thinner.
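The dilution effect can be seen in a toy softmax calculation. Assume one relevant token with a fixed logit advantage over a sea of baseline distractors (the logit value here is arbitrary, chosen only for illustration): its attention weight shrinks roughly in proportion to the number of distractors.

```python
import math

def relevant_token_weight(n_tokens, signal_logit=4.0):
    # One "relevant" token with a higher logit; all other tokens at
    # baseline logit 0. Its softmax weight shrinks as distractors grow.
    signal = math.exp(signal_logit)
    return signal / (signal + (n_tokens - 1) * math.exp(0.0))

for n in (32_000, 128_000, 200_000):
    print(f"{n:>7} tokens -> weight {relevant_token_weight(n):.6f}")
```

Going from 32K to 200K tokens cuts the relevant token's weight by roughly 6x in this toy model, with no change to the model or the question.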
The study’s second-largest effect is model selection. Best-to-worst fabrication rates at 32K span 72 percentage points: from GLM 4.5 at 1.19% to Granite 4.0 H Tiny at 78.27%. That gap dwarfs every other variable measured, including temperature and hardware.
For practitioners this matters because the long-context conversation often gets framed as “which model has the biggest window.” A 200K window on a model that fabricates 70% at that length is worse than a 32K window on a model that fabricates 1%. The headline window size is a marketing number; the realized accuracy at a given context length is the production number.
The study also surfaces a quieter point about model families: within the GLM family, accuracy varies sharply between releases (GLM 4.5 versus 4.6 versus 4.7), which means model evaluation has to be done at the version level, not the family level. Picking “GLM” without specifying which one is not a complete model decision.
The most counterintuitive finding in the study: T=0.0 was optimal for accuracy in only about 60% of model-context combinations, and higher temperatures reduced fabrication in 53% of cases. T=0.0 also produced up to 48 times more coherence loss (loops, repetition, format breakdown) at 200K context than T=1.0.
GLM 4.7 at 200K is a clean example: 2.59% coherence loss at T=0.0 versus 0.05% at T=1.0. Setting temperature to zero, the standard “make the model deterministic” advice, produces a model that locks into degenerate output patterns more often on long inputs.
The mechanism is plausible if you think about what greedy decoding does at long context. With diluted attention and weak signal, the highest-probability next token is often a repetition of something nearby, since repetition is a high-probability pattern when the model is uncertain. A small amount of stochasticity breaks that loop without meaningfully degrading factuality.
The practical takeaway is not “set temperature to 1.” It is “test temperature deliberately on your actual inputs,” especially at long context, because the conventional wisdom of T=0 for production was developed on shorter inputs and does not generalize cleanly to 200K-token prompts.
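A deliberate temperature test can be as simple as a sweep over your own evaluation set. The sketch below assumes a caller-supplied `run_eval` callback (hypothetical, not a real API) that runs your agent at a given temperature and returns measured fabrication and coherence-loss rates, then picks the temperature with the lowest combined failure rate.

```python
def sweep_temperature(run_eval, temperatures=(0.0, 0.3, 0.7, 1.0)):
    # run_eval(t) is a caller-supplied (hypothetical) function returning
    # {"fabrication": float, "coherence_loss": float}, measured on your
    # own long-context inputs.
    results = {t: run_eval(t) for t in temperatures}
    best = min(results, key=lambda t: results[t]["fabrication"]
                                      + results[t]["coherence_loss"])
    return best, results
```

The point of the harness is that the winning temperature is an empirical property of your inputs at your context length, not a constant to copy from shorter-context folklore.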
The study evaluated every model on three hardware platforms: NVIDIA H200, AMD MI300X, and Intel Gaudi 3. Maximum cross-platform accuracy deltas ranged from 0.24 to 0.94 percentage points. In other words: hardware affects accuracy by less than 1 percentage point in the worst case, while context length affects it by 24 points in the average case.
This is reassuring for teams considering hardware portability and damning for narratives that frame model behavior as accelerator-specific. Numerical reproducibility is essentially a solved problem at this scale.
It also clarifies the leverage hierarchy for a developer trying to reduce hallucination risk. Roughly:
| Variable | Maximum effect on accuracy |
|---|---|
| Model choice | About 72 percentage points |
| Context length (32K to 200K) | About 24 percentage points |
| Temperature | Single-digit percentage points |
| Hardware | Under 1 percentage point |
Of the variables a developer can actually change in production, two (model choice and context length) account for nearly all the variance. Hardware does not move the needle.
The 172-billion-token data turns long context from a “more is better” feature into an explicit cost-of-accuracy decision. If your agent currently sends 180K tokens of retrieved documents into every prompt because the window allows it, you are operating on the part of the curve where every model in the study produces double-digit fabrication.
Three concrete implications for context engineering in production:
Treat input length as a hyperparameter, not a maximum. The advertised window is a ceiling, not a target. Most production agents would benefit from running with 30K to 50K tokens of carefully selected context rather than 150K of “just retrieve more.” The accuracy curve in this study is steepest in the 32K to 128K range, which means trimming from 128K back to 32K typically recovers 10 points of accuracy on its own.
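Treating length as a hyperparameter means enforcing a token budget at assembly time. A minimal sketch, assuming retrieved chunks already carry relevance scores and using whitespace word count as a stand-in for a real tokenizer:

```python
def trim_context(chunks, budget_tokens=40_000,
                 count_tokens=lambda s: len(s.split())):
    # chunks: iterable of (relevance_score, text) pairs. Greedily keep the
    # most relevant chunks until the token budget is spent; the budget is
    # a tunable hyperparameter, not the model's advertised window.
    kept, used = [], 0
    for score, text in sorted(chunks, key=lambda c: -c[0]):
        cost = count_tokens(text)
        if used + cost > budget_tokens:
            continue
        kept.append(text)
        used += cost
    return kept
```

In production, `count_tokens` would be the model's own tokenizer, and the budget itself is the number to sweep against your accuracy metric.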
Structure beats volume. The study evaluated raw document Q&A. In production, the same length budget can hold either a prose dump or a structured set of typed records, and structured input concentrates signal into fewer, higher-density tokens. A structured context representation of the same information at 30K tokens almost always outperforms a 100K-token prose version, both because attention is less diluted and because the model can rely on schema as a parsing scaffold.
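One cheap way to get structure is to serialize records as compact JSON Lines instead of prose. The schema below is illustrative; any consistent typed format gives the model the same parsing scaffold.

```python
import json

def to_structured_context(records):
    # Compact JSON Lines: one typed record per line, stable key order,
    # no whitespace padding. The model can rely on a fixed schema
    # instead of re-parsing free prose. (Schema choice is illustrative.)
    return "\n".join(json.dumps(r, separators=(",", ":"), sort_keys=True)
                     for r in records)
```

The same facts land in far fewer, higher-density tokens than a narrative paragraph per record, which is exactly the trade the study's attention-dilution result rewards.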
Retrieval quality matters more than retrieval quantity. Five high-relevance chunks beat fifty noisy ones. Beyond the obvious cost angle, the data here shows that retrieval that doubles the input length is actively harmful if it does not double the relevance density.
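A relevance floor plus a hard cap encodes that principle directly. The threshold and cap below are illustrative defaults to tune against your own retriever's score distribution:

```python
def select_chunks(scored_chunks, min_score=0.6, max_chunks=5):
    # scored_chunks: iterable of (relevance_score, text) pairs.
    # Drop everything below the relevance floor first, then cap the
    # count; thresholds are illustrative and retriever-specific.
    ranked = sorted(scored_chunks, key=lambda c: -c[0])
    return [text for score, text in ranked if score >= min_score][:max_chunks]
```

The asymmetry to internalize: raising `max_chunks` only helps if the marginal chunk clears the floor; otherwise it just lengthens the prompt and moves you down the accuracy curve.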
Wire containers stay on the high-accuracy part of this curve by design. A connected agent reads only the slice of a container that is relevant to its query, not the full corpus, so a container holding a million tokens of source material typically returns 5K to 10K tokens per query. That keeps effective context well inside the range where the 172-billion-token study shows accuracy is highest, regardless of how much material the container actually holds.
The dominant industry narrative through 2024 and 2025 was that bigger context windows would gradually subsume retrieval as a design pattern. The data has consistently disagreed: Stanford's "Lost in the Middle" paper, Chroma's context-rot evaluation, the 2025 long-context-versus-RAG benchmark, and now this 172-billion-token study all point in the same direction.
The pattern is not subtle: every careful empirical study of long-context behavior finds significant accuracy degradation as input length grows, with the degradation showing up earlier than the marketing copy suggests and more severely than benchmark headlines imply.
This study is the most rigorous version of that finding to date because of the scale (172 billion tokens), the breadth (35 models, three hardware platforms, four temperatures), and the determinism (RIKER ground-truth scoring rather than LLM-as-judge). It is the strongest single piece of evidence that context length is a first-class variable in AI accuracy, not a free dimension to expand.
For teams designing AI agents, the practical conclusion is unchanged from earlier research but now hard to dismiss: the highest-return context engineering move in 2026 is not buying access to a larger window. It is sending less to the window you already have.
Sources: How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study (arXiv 2603.08274) · Lost in the Middle: How Language Models Use Long Contexts (arXiv 2307.03172) · A Survey of Context Engineering for Large Language Models (arXiv 2507.13334) · Context Engineering: A Practitioner Methodology for Structured Human-AI Collaboration (arXiv 2604.04258)
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container