The Reasoning Tax: Why Frontier Models Hallucinate More Than Their Smaller Siblings
Key takeaway
OpenAI's flagship reasoning model GPT-5.4-pro hallucinates at 8.3% on Vectara's 2026 grounded-summarization benchmark, while its smallest sibling, GPT-5.4-nano, stays at 3.1%. Every frontier reasoning model from OpenAI, Anthropic, Google, xAI, and DeepSeek exceeds 7% on the same benchmark, while the smaller, less inferential sibling in every family scores lower. The cause is a reasoning tax: models trained to infer beyond their inputs carry that behavior into tasks where staying inside the source is the entire point. Context engineering is how to claw back the accuracy that reasoning training spends.
OpenAI’s flagship reasoning model, GPT-5.4-pro, hallucinates at 8.3% on Vectara’s grounded-summarization benchmark. Its smallest sibling, GPT-5.4-nano, hallucinates at 3.1%. Same model family, same March 2026 release window, same training stack. The top-end model hallucinates at nearly 2.7 times the rate of its smallest sibling.
That result is not a quirk. Every frontier reasoning model on the same benchmark clusters between 7% and 23%. Claude Opus 4.5 sits at 10.9%. Gemini-3 Pro at 13.6%. DeepSeek-R1 at 11.3%. o3-Pro at 23.3%. Meanwhile, the top of the leaderboard is held by compact instruction-tuned models from Ant Group, OpenAI, Google, and Microsoft, none of which are marketed as frontier-tier reasoning systems.
There is a name for this pattern: the reasoning tax. Models trained to invest compute in multi-step inference carry that behavior into tasks where staying inside the source document is the entire point. OpenAI can honestly say GPT-5.4 has 33% fewer false claims than GPT-5.2 and still ship a flagship that hallucinates more than its own budget model. The numbers are not in conflict. They are measuring two different things.
The fix is not the next model release. It is context engineering.
GPT-5.4-pro hallucinates at 8.3% because reasoning training rewards producing inference that goes beyond the source. Vectara’s benchmark measures exactly the opposite behavior: staying strictly inside the source. When a reasoning model is asked to summarize a document “using only the information in the given passage,” its training pushes it in the other direction.
The mechanism is the same across vendors. Reasoning models are optimized through reinforcement learning on chains of thought, where longer and more inferential outputs are rewarded. That optimization generalizes. A model rewarded for deriving new conclusions will derive new conclusions even when the prompt forbids inference. Vectara’s own summary of the pattern is blunt: reasoning models “overthink and deviate from source material rather than sticking to what’s in the document.”
Within the GPT-5.4 family, the behavior shows up as a capability gradient. GPT-5.4-nano at 3.1% is instruction-tuned and stays close to the input. GPT-5.4-mini at 5.5% starts introducing more inferential language. GPT-5.4-pro at 8.3% reasons most aggressively and drifts furthest from source fidelity. None of these models is broken. They are doing exactly what their training selected for, and for grounded summarization that is the wrong thing.
The engineering implication is that model size and reasoning capability are the wrong axes to optimize when the job is to stay inside a document. Buying up the stack makes the problem worse, not better.
OpenAI’s marketing for GPT-5.4 says that individual claims are 33% less likely to be false than in GPT-5.2, and that full responses are 18% less likely to contain any errors. That claim is arithmetically consistent with Vectara’s independent measurement: GPT-5.2 tested at 10.8% on the same benchmark, and GPT-5.4-pro’s 8.3% is roughly a 23% reduction from that baseline.
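Both framings can be checked directly from the rates cited in this post. A quick sanity check:

```python
# Hallucination rates (%) from Vectara's grounded-summarization benchmark, as cited above.
gpt_5_2 = 10.8
gpt_5_4_pro = 8.3
gpt_5_4_nano = 3.1

# OpenAI's framing: improvement over the previous flagship.
vs_previous_flagship = (gpt_5_2 - gpt_5_4_pro) / gpt_5_2
print(f"{vs_previous_flagship:.0%}")  # 23% -- roughly matches the marketing claim

# The missing framing: the flagship against its own smallest sibling.
vs_nano = gpt_5_4_pro / gpt_5_4_nano
print(f"{vs_nano:.1f}x")  # 2.7x -- the flagship hallucinates 2.7 times as often
```

Both numbers are true at the same time. The marketing picks the flattering denominator.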
Where the marketing slips is in the comparison class. OpenAI is comparing GPT-5.4 to its previous flagship, GPT-5.2. They are not comparing to their smallest model. If the comparison were against GPT-5.4-nano (3.1%), the flagship would look like a regression, not an improvement.
This is the same pattern that appears across vendors. Anthropic shipped Claude Opus 4.7 in April 2026 with hallucination-reduction improvements highlighted in their launch post, all measured against their own previous flagship. Google positions Gemini-3 Pro’s capabilities against Gemini-2.5 Pro (7.0%), but Gemini-3 Pro sits at 13.6% on the same grounded-summarization benchmark that their smaller Flash Lite model handles at 3.3%.
None of these vendors are being misleading. They are reporting real reductions within their flagship line. The missing frame is that the flagship line is not the right line for grounded tasks. Smaller, less inferential models are.
The reasoning tax is not an OpenAI problem, a Google problem, or an Anthropic problem. It shows up wherever a vendor trains a frontier reasoning model and markets it as the most capable option. The data on Vectara’s March 2026 leaderboard snapshot is consistent across every major lab.
| Model family | Flagship reasoning | Rate | Smaller or non-reasoning | Rate |
|---|---|---|---|---|
| OpenAI | GPT-5.4-pro | 8.3% | GPT-5.4-nano | 3.1% |
| Google | Gemini-3 Pro | 13.6% | Gemini-2.5 Flash | 7.8% |
| Anthropic | Claude Opus 4.5 | 10.9% | Claude Haiku 4.5 | 9.8% |
| DeepSeek | DeepSeek-R1 | 11.3% | DeepSeek V3.2-Exp | 5.3% |
| OpenAI (legacy) | o3-Pro | 23.3% | GPT-4o family | ~5% |
Two observations fall out of this table: the smaller sibling beats the flagship in every family, and the gap tends to widen the more explicitly a model is positioned as a “thinking” or “reasoning” system.
The business implication is that choosing a model by top-line benchmark score is a poor strategy for grounded tasks. The most expensive frontier model is often the worst choice for summarization, extraction, and question-answering over a provided document. Picking the cheap model is not a cost compromise. It is the correct technical call.
Prompt instructions to avoid inference do not reliably override reasoning training. Vectara’s benchmark prompt is explicit: “Summarize using only the information in the given passage. Do not infer. Do not use your internal knowledge.” Reasoning models still hallucinate at 7% to 23% on that prompt.
A 2025 multi-model study found that prompt-based mitigation cut GPT-4o’s hallucination rate from 53% to 23%. That is a meaningful improvement but still far too high for most production systems. The underlying issue is that reasoning behavior is encoded in weights, not in prompt compliance. A few tokens of instruction cannot consistently suppress a behavior reinforced across millions of training examples.
This is the same reason that “act as a careful assistant” prompts fail to reliably reduce confident-sounding mistakes. The model is not ignoring the instruction. It is producing the behavior its training selected for, which happens to include confident, inferential outputs regardless of the surface-level instruction.
The implication is that prompts are a weak lever. If a grounded task is critical, the structural fixes are routing the task to a non-reasoning model and controlling the context it receives. Both are context engineering decisions, not prompt engineering decisions. (Our earlier post on why hallucinations are a context problem goes deeper on the broader mechanism.)
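One structural backstop, independent of the prompt, is checking the output against the source after generation. The sketch below uses bag-of-words overlap as a stand-in for a real entailment-based checker; the stop-word list and threshold are illustrative, not tuned:

```python
import re

STOP = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "are", "was", "were"}

def support_score(sentence: str, source: str) -> float:
    """Fraction of a sentence's content words that appear in the source."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    src = set(re.findall(r"[a-z']+", source.lower()))
    content = words - STOP
    if not content:
        return 1.0
    return len(content & src) / len(content)

def flag_unsupported(summary: str, source: str, threshold: float = 0.6) -> list[str]:
    """Return summary sentences whose vocabulary drifts away from the source."""
    sentences = re.split(r"(?<=[.!?])\s+", summary.strip())
    return [s for s in sentences if support_score(s, source) < threshold]

source = "The report covers Q3 revenue of $4.2M and a 12% rise in churn."
summary = "Q3 revenue was $4.2M. Churn rose 12%, likely due to pricing changes."
print(flag_unsupported(summary, source))
# -> ["Churn rose 12%, likely due to pricing changes."]
```

The second sentence gets flagged: “likely due to pricing changes” is exactly the kind of inference a reasoning model adds and a grounded task forbids.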
Context engineering reduces reasoning-model hallucination by changing what reaches the model, not what the model is told. Four moves do most of the work.
Route the task to the right model. For grounded summarization, extraction, and retrieval-augmented question answering, use a non-reasoning model. For multi-step planning, code generation, and open-ended synthesis, use a reasoning model. Treat the choice as a routing decision, not a capability ranking. The top of the Vectara leaderboard is your shortlist for grounded work. Pay the flagship price only for the tasks that actually reward flagship reasoning.
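In code, the routing decision can be as simple as a lookup keyed by task kind. The model identifiers below are illustrative; substitute your own shortlist from the leaderboard:

```python
from enum import Enum

class TaskKind(Enum):
    GROUNDED = "grounded"        # summarization, extraction, RAG over a provided doc
    INFERENTIAL = "inferential"  # planning, code generation, open-ended synthesis

# Illustrative routing table: a small instruction-tuned model for grounded work,
# the flagship reasoning model only where inference is the point.
ROUTES = {
    TaskKind.GROUNDED: "gpt-5.4-nano",
    TaskKind.INFERENTIAL: "gpt-5.4-pro",
}

def pick_model(kind: TaskKind) -> str:
    return ROUTES[kind]

print(pick_model(TaskKind.GROUNDED))  # gpt-5.4-nano
```

The point is not the two-line dictionary; it is that the choice lives in one explicit place instead of defaulting to “the most capable model” everywhere.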
Shrink the context window. Larger windows expose more attention surface area for the model to drift across. Chroma’s research shows accuracy falling from 95% to 60-70% as input length grows, even on trivial tasks. For grounded summarization, a 1,000-token input will typically outperform the same task at 30,000 tokens simply because there is less room for the model to invent. This compounds with the reasoning tax: reasoning models drift more, and longer windows give them more room to drift into.
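A simple way to enforce a tight window is to filter and budget passages before they reach the model. This sketch uses whitespace token counting as a placeholder for a real tokenizer:

```python
def tighten_context(passages: list[str], query_terms: set[str],
                    budget_tokens: int = 1000) -> str:
    """Keep only passages that mention the query, up to a hard token budget."""
    kept, used = [], 0
    for p in passages:
        if not query_terms & {w.strip(".,;:!?").lower() for w in p.split()}:
            continue  # irrelevant passage: more attention surface, no signal
        cost = len(p.split())  # crude count; swap in your tokenizer
        if used + cost > budget_tokens:
            break
        kept.append(p)
        used += cost
    return "\n\n".join(kept)

passages = [
    "Revenue grew 12% in Q3.",
    "The office moved to Berlin.",
    "Q3 churn held steady.",
]
print(tighten_context(passages, {"q3", "revenue", "churn"}))
```

The Berlin passage never reaches the model, so the model cannot drift into it.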
Pre-structure the source. Raw prose forces the model to parse a wall of text to find the relevant fact. Converting the source into typed records or structured fields leaves less room for invention. ETH Zurich found that concise, structured context files improved agent success rates by 4%, while verbose ones hurt performance by 3%. The format tradeoffs post has the full breakdown.
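Typed records make the structured-fields idea concrete. A minimal sketch, with illustrative field names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    entity: str
    attribute: str
    value: str

def render_context(records: list[Record]) -> str:
    """Render typed records as compact lines; each fact is one unambiguous slot."""
    return "\n".join(f"{r.entity}.{r.attribute} = {r.value}" for r in records)

facts = [
    Record("q3_report", "revenue", "$4.2M"),
    Record("q3_report", "churn_change", "+12%"),
]
print(render_context(facts))
```

Compared to a paragraph of prose, there is nothing here for the model to paraphrase, reorder, or embellish: the fact either appears in a slot or it does not.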
Track provenance. When an agent returns a claim, the system should know which source, which passage, and which offset the claim came from. Wire containers return entries with explicit provenance (source file, offset, timestamp), so an agent can verify whether the text it is reasoning over is present in the source rather than a paraphrase that drifted during retrieval. This is grounding enforced at the retrieval layer, not at the prompt.
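The verification step is mechanical once provenance fields exist. This generic sketch mirrors the fields named above (source file, offset, timestamp); it is not Wire's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    text: str
    source_file: str
    offset: int       # character offset of `text` within the source file
    timestamp: str    # when the entry was extracted

def is_verbatim(entry: Entry, sources: dict[str, str]) -> bool:
    """True only if the retrieved text is literally present at its recorded
    offset, i.e. it has not drifted into a paraphrase during retrieval."""
    doc = sources.get(entry.source_file, "")
    return doc[entry.offset : entry.offset + len(entry.text)] == entry.text

sources = {"report.txt": "Q3 revenue was $4.2M. Churn rose 12%."}
good = Entry("Churn rose 12%.", "report.txt", 22, "2026-03-01T09:00:00Z")
drifted = Entry("Churn increased by 12%.", "report.txt", 22, "2026-03-01T09:00:00Z")
print(is_verbatim(good, sources), is_verbatim(drifted, sources))  # True False
```

The check costs one string comparison and catches an entire class of silent retrieval drift.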
Together these moves address the cause, not the symptom. They do not depend on the model being better. They change the information environment around the model so its failure mode has less room to express itself.
Reasoning models are the right choice when inference beyond the source is the goal, not a failure mode. Tasks where they outperform non-reasoning models include multi-step problem solving, algorithm design, proof construction, code debugging, and open-ended research questions without a fixed source document.
A useful heuristic: if you would grade the output on whether every claim is traceable to a source passage, use a non-reasoning model. If you would grade it on whether the reasoning was sound, use a reasoning model. Most production systems contain both kinds of tasks, which is why model routing has become its own discipline.
The same system can use GPT-5.4-nano for grounded extraction steps and GPT-5.4-pro for downstream planning over the extracted facts. The cost profile improves, and the hallucination profile improves twice: the grounded stage is accurate because a non-reasoning model handled it, and the reasoning stage is accurate because the facts it reasons over were faithful to the source.
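The split can be expressed as a two-stage pipeline. Here the model calls are injected as plain functions so the structure is visible without tying it to any provider's API; the stub completions are placeholders:

```python
from typing import Callable

Model = Callable[[str], str]  # prompt -> completion

def grounded_then_reason(extractor: Model, planner: Model,
                         document: str, question: str) -> str:
    # Stage 1: a non-reasoning model extracts facts, staying inside the source.
    facts = extractor(
        f"Using only this document, list the facts relevant to '{question}':\n{document}"
    )
    # Stage 2: a reasoning model plans over the extracted facts, never the raw doc.
    return planner(
        f"Using only these extracted facts, answer '{question}':\n{facts}"
    )

# Stub models for illustration; in production these wrap real API calls
# (e.g. a nano-class model for stage 1, a pro-class model for stage 2).
extract_stub: Model = lambda prompt: "Q3 revenue: $4.2M; target: $4.0M"
plan_stub: Model = lambda prompt: "Yes: revenue exceeded the target by $0.2M."

print(grounded_then_reason(extract_stub, plan_stub,
                           "Q3 report text here", "Did Q3 hit the revenue target?"))
```

Note that the planner never sees the raw document, only the extracted facts, which is what makes the two hallucination profiles compose instead of compound.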
GPT-5.4-pro’s 8.3% hallucination rate and GPT-5.4-nano’s 3.1% are not a quality-control failure at OpenAI. They are what happens when reasoning capability and source fidelity are trained for in opposite directions and then shipped under the same product label. The same pattern holds at Anthropic, Google, xAI, and DeepSeek.
The fix is not waiting for the next release. It is treating model choice as a routing decision, shaping context shape and length, and tracking provenance so inferred claims can be caught before they ship. Context engineering is how you claw back the accuracy that reasoning training spends.
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container