When an AI confidently states something false, the instinct is to blame the model. A better model would hallucinate less, the thinking goes, so the fix is to wait for the next release.
The data tells a different story. Vectara’s hallucination leaderboard, the industry’s most widely cited benchmark, tests how often models introduce false information when summarizing documents they were explicitly given. Across 70+ models tested against 7,700 articles spanning law, medicine, finance, and technology, hallucination rates range from 1.8% to over 23%. And here’s what’s counterintuitive: the most capable reasoning models hallucinate more on this task, not less. Claude Sonnet 4 hallucinates at 10.3%, GPT-5.2 at 10.8%, and o3-pro at 23.3%, while simpler models like Gemini 2.5 Flash Lite (3.3%) and Qwen 3-8B (4.8%) stay under 5%.
The task is straightforward: summarize a document using only the facts it contains. The smarter models overthink it, and in doing so, introduce information that wasn’t there.
The biggest lever for reducing hallucinations has less to do with the model and more to do with the context you give it.
Three causes, one you can control
AI hallucinations have three root causes:
Training gaps. The model never learned a particular fact, or learned it incorrectly from noisy training data. OpenAI’s research on why language models hallucinate traces this to the fundamental mismatch between internet-scale training data and factual accuracy. You can’t fix this as a user.
Decoding randomness. Language models generate text probabilistically. Even with correct knowledge, sampling introduces variability that can produce incorrect outputs. You can tune temperature settings, but there’s a floor.
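To see why temperature reduces but never eliminates this variability, here is a minimal sketch of temperature-scaled sampling. The function and logit values are illustrative, not any particular model’s API; lowering the temperature sharpens the distribution toward the most likely token, but any temperature above zero leaves residual randomness.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample a token index from raw logits after temperature scaling.

    Lower temperature sharpens the distribution (approaching greedy
    decoding as it nears 0), but for any temperature > 0 the draw
    remains probabilistic -- hence the "floor" on decoding randomness.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1
```

At temperature 0.01 the highest logit wins essentially every time; at temperature 1.0 or above, lower-probability (and potentially wrong) tokens get sampled regularly.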
Context failures. The model had access to the right information but couldn’t use it effectively. The context was too long, poorly structured, buried in the wrong position, or fragmented across disconnected sources. This is the cause you can actually do something about.
Most hallucination discussion focuses on the first two. But for teams building AI-powered systems, the third cause is both the most common and the most fixable.
How context causes hallucinations
Three well-documented mechanisms connect poor context to hallucinated outputs.
Attention dilution
As context grows, model performance degrades. Research from Chroma shows accuracy drops from 95% to 60-70% as input length increases, even on trivially simple tasks. This phenomenon, called context rot, means that adding more information to a prompt can actually make outputs less accurate. Every additional paragraph of context dilutes the model’s attention across more tokens. (For a deeper look at the mechanism, see Context Rot: Why AI Performance Degrades With More Information.)
When attention is spread thin, the model is more likely to generate plausible-sounding text from its training data rather than grounding its response in the provided context. That’s a hallucination.
Positional blindness
Stanford’s “lost in the middle” research demonstrated that LLMs have a U-shaped attention curve. Information at the beginning and end of context gets 70-75% accuracy, while information in the middle drops to 55-60%. That’s a 15-20 percentage point gap based purely on where information appears, not what it says.
If the fact that prevents a hallucination happens to sit in the middle of a long prompt, the model may never effectively “see” it. The information is technically present in the context window, but functionally invisible.
Fragmented context
Many AI workflows pull information from multiple sources: documents, databases, conversation history, tool outputs. When relevant facts are scattered across these sources with no coherent structure, the model has to assemble a complete picture from fragments. Research on multi-hop reasoning shows accuracy drops from 84.4% to 31.9% when genuine synthesis across sources is required, as documented in our analysis of RAG failure modes.
The model fills gaps between fragments with its best guess. Sometimes that guess is wrong.
What doesn’t reduce hallucinations
Bigger context windows
Larger context windows (200K, 1M, even 10M tokens) don’t solve the attention dilution problem. They can make it worse. A model filling a 1M token window, suffering severe context rot, may produce more hallucinations than the same model given a focused 10K token input, because there’s more noise competing for attention.
Naive RAG
Retrieval-augmented generation is often presented as the hallucination fix. And it helps: a 2025 cancer research study found RAG with GPT-4 achieved 0% hallucination on medical queries, compared to 40% without retrieval. But naive RAG, where you vector-search documents and stuff retrieved chunks into the prompt, introduces its own problems. Roughly 70% of retrieved passages don’t contain the information needed to answer the query. Irrelevant chunks add noise and trigger the same attention dilution that causes hallucinations in the first place. (See RAG Is Not Enough for a full breakdown.)
“Don’t hallucinate” prompts
Telling a model to “only use the provided context” or “say I don’t know if you’re unsure” helps at the margins. A 2025 multi-model study found prompt-based mitigation cut GPT-4o’s hallucination rate from 53% to 23%. That’s meaningful, but a 23% hallucination rate is still too high for most production use cases. Prompt instructions are a band-aid on a structural problem.
What actually works: better context
The pattern across all the research points in the same direction. Hallucinations decrease when the context is smaller, better structured, and more relevant. This is context engineering in practice: designing the information that reaches the model, not just the instructions.
Pre-structure your context
Converting documents into structured formats (JSON, typed records, queryable databases) before they reach the model eliminates ambiguity. The model doesn’t have to parse a wall of prose to find the relevant fact. Structured context also makes it easier to include only what’s relevant, keeping the total context size small.
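As a minimal illustration of this idea, the sketch below pulls a hypothetical `Party / Clause / Value` record format out of prose and emits compact JSON. The record format and field names are invented for the example; real pipelines would use an extraction model or parser suited to their documents.

```python
import json
import re
from dataclasses import dataclass, asdict

@dataclass
class ContractFact:
    party: str
    clause: str
    value: str

def structure_contract(text: str) -> str:
    """Convert semi-structured prose into typed JSON records.

    The model then receives unambiguous facts instead of a wall of
    prose it has to parse, and only the relevant records need to be
    included in the context.
    """
    facts = []
    pattern = r"Party:\s*([^;]+);\s*Clause:\s*([^;]+);\s*Value:\s*([^\n]+)"
    for m in re.finditer(pattern, text):
        facts.append(ContractFact(m[1].strip(), m[2].strip(), m[3].strip()))
    return json.dumps([asdict(f) for f in facts])
```

The payoff is at query time: instead of sending the whole document, you send two or three JSON records that answer the question.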
Retrieve less, but better
The most effective RAG implementations are selective. Rather than retrieving 20 chunks and hoping the model finds the right one, they retrieve 3-5 highly relevant pieces and present them clearly. Focused context with high signal-to-noise ratio produces grounded outputs. More context often means more hallucinations.
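A simple way to implement this selectivity, assuming chunk embeddings are already computed, is to rank by cosine similarity and keep only a small number of chunks that clear a relevance floor. The threshold value here is illustrative; you would tune it against your own retrieval data.

```python
def select_context(query_vec, chunks, k=3, min_score=0.75):
    """Keep at most k chunks that clear a relevance floor.

    `chunks` is a list of (text, embedding) pairs. Rather than stuffing
    20 retrieved chunks into the prompt, return a few high-signal ones;
    chunks below `min_score` are dropped even if fewer than k remain.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    scored = sorted(((cosine(query_vec, emb), text) for text, emb in chunks),
                    reverse=True)
    return [text for score, text in scored[:k] if score >= min_score]
```

Returning fewer chunks when nothing clears the floor is a feature, not a bug: an honest “no relevant context found” beats padding the prompt with noise.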
Process at upload time, not query time
One approach to context engineering is to do the hard work upfront. When documents are uploaded, extract entities, categorize information, and build structured representations. At query time, the system returns pre-processed, organized context rather than raw text. Tools like Wire take this approach, transforming files into structured context at upload time so that agents receive clean, focused information on every query.
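Wire’s internals aren’t public, but the general shape of an upload-time pipeline can be sketched as two steps: run extraction once at ingest and persist structured facts, then make query time a cheap lookup. The table schema and the pluggable `extract_entities` callable are assumptions for illustration.

```python
import json
import sqlite3

def ingest(conn, doc_id, text, extract_entities):
    """Upload-time step: extract once, persist structured facts."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts (doc_id TEXT, entity TEXT, fact TEXT)"
    )
    for entity, fact in extract_entities(text):
        conn.execute("INSERT INTO facts VALUES (?, ?, ?)",
                     (doc_id, entity, json.dumps(fact)))
    conn.commit()

def query_context(conn, entity):
    """Query-time step: return only pre-structured facts, never raw text."""
    rows = conn.execute("SELECT fact FROM facts WHERE entity = ?",
                        (entity,)).fetchall()
    return [json.loads(r[0]) for r in rows]
```

The expensive, error-prone work (parsing, entity extraction, categorization) happens once per document instead of once per query, and every query sees the same clean representation.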
Put critical facts first or last
Given the lost-in-the-middle effect, position matters. Place the most important context at the beginning of the prompt, with supporting details at the end. Avoid burying critical facts in the middle of long inputs.
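A prompt-assembly helper can enforce this ordering mechanically. The section labels and structure below are one reasonable convention, not a prescribed format:

```python
def assemble_prompt(task, critical_facts, supporting_details):
    """Order prompt sections around the lost-in-the-middle effect.

    The task and critical facts go first, where attention is strongest;
    lower-priority supporting detail goes after, so nothing critical
    lands in the weak middle of a long input.
    """
    sections = [task, "Key facts (most important first):"]
    sections += [f"- {fact}" for fact in critical_facts]
    sections.append("Supporting details:")
    sections += [f"- {d}" for d in supporting_details]
    return "\n".join(sections)
```

For very long prompts, restating the task at the end as well exploits the other high-attention position in the U-shaped curve.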
Practical checklist
If you’re building AI systems and want to reduce hallucinations through better context:
- Audit your context length. If you’re routinely sending 50K+ tokens, you’re likely in the context rot zone. Can you reduce it?
- Check what the model actually sees. Retrieve your RAG chunks for real queries. Are they relevant and self-contained?
- Structure over prose. Where possible, give models structured data rather than raw documents.
- Position deliberately. Put critical context at the start. Don’t bury key facts in the middle of long inputs.
- Measure hallucination rate. Track how often your system produces unsupported claims. A baseline tells you if changes help.
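For the last item, even a crude grounding check is enough to establish a baseline. The sketch below flags output sentences containing content words absent from the provided context; it is far too blunt for production scoring (synonyms and paraphrases will false-positive), but it tells you whether a context change moved the number.

```python
import re

def unsupported_claim_rate(output: str, context: str) -> float:
    """Share of output sentences with a content word (4+ letters)
    that never appears in the provided context.

    A rough proxy for groundedness: good for tracking direction of
    change across context-engineering experiments, not for absolute
    hallucination measurement.
    """
    context_words = set(re.findall(r"[a-z]{4,}", context.lower()))
    sentences = [s for s in re.split(r"[.!?]+", output) if s.strip()]
    if not sentences:
        return 0.0
    unsupported = sum(
        1 for s in sentences
        if any(w not in context_words
               for w in re.findall(r"[a-z]{4,}", s.lower()))
    )
    return unsupported / len(sentences)
```

Run it over a fixed set of real queries before and after each change; a falling rate on the same query set is the signal that your context work is paying off.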
The next generation of models will likely hallucinate less on benchmarks. But for production systems today, the fastest path to fewer hallucinations is better context. Not bigger models, not longer context windows, not cleverer prompts. Better, more focused, more structured context.