Claude Opus 4.8 hallucinates less by answering less

Jitpal Kocher · 10 min read

Key takeaway

Claude Opus 4.8 reaches the lowest incorrect-rate of any frontier model on AA-Omniscience while its accuracy (46.6%) and raw hallucination rate (35.9%) barely move from Opus 4.7. The gain comes from abstaining on questions it is uncertain about rather than answering more of them correctly. AA-Omniscience rewards that behavior because it applies no penalty for refusing to answer, only for confident wrong answers. The result is a context engineering story: treating 'the answer is not here' as a valid retrieval outcome, rather than a gap to be filled, is what separates a reliable agent from a confident one.

Claude Opus 4.8 has the lowest incorrect-rate of any frontier model on the leading hallucination benchmark. It got there without getting much more accurate. On AA-Omniscience, Opus 4.8 answers 46.6% of questions correctly and hallucinates at 35.9%, both within a point of Opus 4.7. What changed is not how much the model knows. What changed is what it does when it does not know: it abstains instead of guessing.

That distinction is the whole story, and it is a context engineering story. A model that returns “I don’t have that” when the answer is not available is doing the same thing a well-designed retrieval system does when a query matches nothing. The reliability gain in Opus 4.8 comes from treating an empty answer as a valid outcome rather than a prompt to fabricate. Most launch coverage filed this under alignment or training. It is more useful to read it as the clearest single-model demonstration of why honesty about missing context beats raw recall.

Opus 4.8’s hallucination win came from abstaining, not knowing more

Opus 4.8 reaches the top tier of AA-Omniscience by saying “I don’t know” more often, not by answering more questions correctly. Its accuracy held at 46.6% and its hallucination rate held at 35.9%, essentially flat against Opus 4.7, yet it posts the lowest incorrect-rate of the six frontier models in the comparison. Artificial Analysis is explicit about the mechanism: Opus 4.8 “achieved this mainly by abstaining on questions about which it was uncertain rather than by answering more questions correctly.”

This is counterintuitive only if you assume hallucination is a knowledge problem. It is not. Hallucination is a calibration problem: the gap between what a model knows and what it is willing to assert. Two models with identical factual recall can have wildly different hallucination rates depending on how each handles the questions it cannot answer. Opus 4.8 narrowed that gap by getting more comfortable with “I don’t know,” which is exactly the behavior you want from an agent whose output feeds a downstream decision.

Anthropic frames the 4.8 release around three axes: honesty, agentic efficiency, and code quality. Notably absent from that list is a bigger context window or a new architecture. The model still runs a 1M-token window, the same as Opus 4.7. The reliability gains are behavioral, which is what makes them legible as context engineering rather than scale.

What AA-Omniscience actually measures

AA-Omniscience scores models on a bounded index from -100 to 100 that rewards correct answers, penalizes confident wrong ones, and applies no penalty for refusing to answer. A score of 0 means a model answers correctly as often as it answers incorrectly. The benchmark spans 6,000 questions across 42 topics in six economically relevant domains, from law to software engineering. Crucially, the hallucination rate it reports is the share of non-correct responses that were confident fabrications, not the share of all questions a model gets wrong.
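As a rough illustration of that scoring, the sketch below computes the index and the hallucination rate from raw tallies. The formula (correct minus incorrect, as a percentage of all questions) is inferred from the description above rather than taken from the benchmark's own code, and the `Tally` class and both example models are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tally:
    """Hypothetical per-model counts; not real benchmark data."""
    correct: int
    incorrect: int   # confident wrong answers
    abstained: int   # refusals such as "I don't know"

    @property
    def total(self) -> int:
        return self.correct + self.incorrect + self.abstained


def omniscience_index(t: Tally) -> float:
    """Assumed scoring: +1 per correct, -1 per incorrect, 0 per abstention, scaled to [-100, 100]."""
    return 100.0 * (t.correct - t.incorrect) / t.total


def hallucination_rate(t: Tally) -> float:
    """Share of non-correct responses that were wrong answers rather than abstentions."""
    not_correct = t.incorrect + t.abstained
    return t.incorrect / not_correct if not_correct else 0.0


# Two made-up models with identical recall (460 of 1,000 correct).
bluffer = Tally(correct=460, incorrect=540, abstained=0)
abstainer = Tally(correct=460, incorrect=190, abstained=350)

for name, tally in [("bluffer", bluffer), ("abstainer", abstainer)]:
    print(name, round(omniscience_index(tally), 1), round(hallucination_rate(tally), 3))
# bluffer -8.0 1.0
# abstainer 27.0 0.352
```

Both made-up models know exactly the same amount. The only difference is what each does with the 540 questions it cannot answer, and that alone moves the index from negative to clearly positive.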

That scoring design is why Opus 4.8 ranks where it does. The index structurally favors a model that abstains over one that bluffs, even when the bluffing model knows more facts. Here is the comparison that makes the point:

| Model | AA-Omniscience accuracy | Hallucination rate | Omniscience Index |
|---|---|---|---|
| Claude Opus 4.8 | 46.6% | 35.9% | 27.4 |
| Claude Opus 4.7 | ~46% | ~36% | ~26 |
| Gemini 3.1 Pro | ~56% | | 32.9 |
| GPT-5.5 (xhigh) | 57% | 86% | |

GPT-5.5 is the instructive contrast. It is the most accurate model in the table at 57%, and the least reliable: on 86% of the questions it fails to answer correctly, it gives a confident wrong answer instead of abstaining. We covered that split in detail in “GPT-5.5 didn’t cut hallucinations 60%”: the same model is simultaneously the most knowledgeable and the most overconfident, because reasoning-heavy post-training rewards producing an answer over admitting a gap. Opus 4.8 sits at the other end of that tradeoff. It knows less in raw terms and is trusted more, because it has learned where its knowledge ends.
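To make the contrast concrete, here is a small worked calculation that turns the published accuracy and hallucination rate into the share of all questions answered wrongly and the share abstained on. It assumes every response is either correct, incorrect, or an abstention, and that the index is accuracy minus the incorrect share; the function name is illustrative, and the GPT-5.5 index it prints is derived arithmetic, not a reported figure.

```python
def implied_breakdown(accuracy_pct: float, hallucination_pct: float) -> dict[str, float]:
    """Split the non-correct share into wrong answers and abstentions (all values in % of questions)."""
    not_correct = 100.0 - accuracy_pct
    incorrect = not_correct * hallucination_pct / 100.0
    abstained = not_correct - incorrect
    return {
        "incorrect": round(incorrect, 1),
        "abstained": round(abstained, 1),
        "index": round(accuracy_pct - incorrect, 1),  # assumed formula: correct minus incorrect
    }


opus_48 = implied_breakdown(46.6, 35.9)
gpt_55 = implied_breakdown(57.0, 86.0)
print(opus_48)  # {'incorrect': 19.2, 'abstained': 34.2, 'index': 27.4}
print(gpt_55)   # {'incorrect': 37.0, 'abstained': 6.0, 'index': 20.0}
```

Under these assumptions, Opus 4.8 gives a wrong answer on roughly 19% of all questions and abstains on about 34%, while GPT-5.5 is wrong on about 37% and abstains on about 6%. The derived 27.4 for Opus 4.8 matches the index in the table, which suggests the assumed formula is at least close to the one the benchmark uses.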

Long-context retrieval jumped nearly 28 points inside the same window

The second half of Opus 4.8’s reliability gain is retrieval, and it happened without enlarging the context window. On GraphWalks, a long-context benchmark that plants facts across the input and asks the model to traverse them, Opus 4.8 scores an F1 of 68.1% at 1M tokens, up from 40.3% for Opus 4.7. That is a 27.8-point jump on the model’s ability to find and use information buried deep in a long input. The window did not change. What changed is how faithfully the model retrieves inside it.
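For readers unfamiliar with how a traversal task can be scored as F1, the following is a minimal sketch, assuming the answer is a set of graph nodes compared against a ground-truth set. It is not the benchmark's actual harness; the toy graph, the function names, and the choice of a breadth-first task are illustrative.

```python
from collections import deque


def reachable_within(edges, start, depth):
    """Breadth-first search: nodes reachable from `start` in at most `depth` hops, excluding `start`."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == depth:
            continue  # do not expand past the depth limit
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen - {start}


def set_f1(predicted, expected):
    """Harmonic mean of precision and recall over two sets of node labels."""
    predicted, expected = set(predicted), set(expected)
    if not predicted and not expected:
        return 1.0
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


edges = [("a", "b"), ("a", "c"), ("b", "d"), ("d", "e")]
expected = reachable_within(edges, "a", depth=2)  # {"b", "c", "d"}
model_answer = {"b", "c", "e"}                    # misses "d", invents "e"
print(round(set_f1(model_answer, expected), 3))   # 0.667
```

A set-based F1 penalizes both failure modes at once: a node the model missed lowers recall, and a node it made up lowers precision.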

This matters because window size and retrieval quality are routinely conflated. A larger window is often sold as a fix for grounding, but a model that cannot reliably locate the relevant passage in 1M tokens will hallucinate over a long input just as readily as over a short one. We have documented the failure mode directly: in “Long context tripled hallucinations,” feeding models more tokens made them less accurate, not more, because the relevant content got buried and the model filled the gap with plausible invention. Opus 4.8’s GraphWalks result is the same lever pulled the other way. Better retrieval inside a fixed window, not a bigger window, is what cuts the error rate.

The honesty axis and the retrieval axis reinforce each other. A model that retrieves the right passage and reports faithfully against it has less reason to fabricate. A model that can tell when the passage it needs is not in context has a clean signal to abstain. Both are retrieval behaviors, and both are things you can engineer at the system level even with a model that does neither well by default. The comparison between long context and RAG is ultimately a comparison of which approach gives you more control over exactly this.

Abstention is a context engineering primitive

The behavior Opus 4.8 was trained toward is one that retrieval systems have always had to implement explicitly: returning nothing when nothing matches. An agent is only as honest as its weakest link, and for most production agents the weakest link is not the model but the retrieval layer, which hands the model context with no signal about what it failed to find. If the retrieval step silently returns the three closest-but-wrong passages, even a well-calibrated model will ground its answer in them and sound confident doing it.
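One way to make "returning nothing when nothing matches" explicit is a relevance floor on the retrieval step. The sketch below is a toy version under stated assumptions: word-overlap (Jaccard) scoring stands in for a real embedding or BM25 scorer, and the `min_score` value, the `Hit` class, and the sample corpus are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    text: str
    score: float


def jaccard(a: str, b: str) -> float:
    """Toy relevance score based on shared words; a real system would use embeddings or BM25."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, corpus: list[str], k: int = 3, min_score: float = 0.3) -> list[Hit]:
    """Return up to k passages scoring at least min_score; an empty list means 'not found'."""
    hits = [Hit(text, jaccard(query, text)) for text in corpus]
    hits = [h for h in hits if h.score >= min_score]
    return sorted(hits, key=lambda h: h.score, reverse=True)[:k]


corpus = [
    "refund requests are accepted within 30 days",
    "shipping takes five business days",
]
print(len(retrieve("refund requests within 30 days", corpus)))  # 1
print(retrieve("warranty coverage for batteries", corpus))      # []
```

Without the floor, a plain top-k lookup would hand back both passages for the warranty question, and the model would have no way to tell that neither is relevant.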

Abstention is only half of what makes a model reliable. The other half is what the model does the instant it recognizes the gap. A model that knows it does not know is a model that reaches for a tool, directly or through an agent harness, to go acquire the information it is missing. Confident hallucination is the worst possible failure precisely because it forecloses that step: a model certain it already knows never triggers the retrieval that would have corrected it. The abstention signal is not just a safety valve; it is the trigger that sends the agent to fetch real context, which makes the source it fetches from decisive. An agent connected to a Wire container reaches for entries that carry their source, offset, and timestamp by default, and gets a calibrated empty when nothing matches. The tool itself abstains instead of handing back a confident-looking wrong passage that pushes the model straight back into guessing.
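The control flow described here can be sketched in a few lines. This is a simplified illustration, not any vendor's agent harness: `ask_model` and `fetch_context` are hypothetical callables, the stubs below stand in for a real model and a real retrieval tool, and matching on a fixed abstention string is an assumption (a production system would more likely use a structured signal).

```python
from typing import Callable, Optional

ABSTAIN = "I don't know"


def answer_with_fallback(
    question: str,
    ask_model: Callable[[str, Optional[str]], str],
    fetch_context: Callable[[str], Optional[str]],
) -> str:
    """Ask once; if the model abstains, fetch context and ask again; stay honest if nothing is found."""
    first = ask_model(question, None)
    if first != ABSTAIN:
        return first
    context = fetch_context(question)
    if context is None:  # the tool abstained too
        return "This isn't in my context."
    return ask_model(question, context)


# Stubs standing in for a real model and a real retrieval tool.
def stub_model(question: str, context: Optional[str]) -> str:
    return f"Per the notes: {context}" if context else ABSTAIN


def stub_fetch(question: str) -> Optional[str]:
    notes = {"office hours": "Office hours are 9 to 5."}
    return next((v for k, v in notes.items() if k in question.lower()), None)


print(answer_with_fallback("What are the office hours?", stub_model, stub_fetch))
# Per the notes: Office hours are 9 to 5.
print(answer_with_fallback("Who won the 1987 final?", stub_model, stub_fetch))
# This isn't in my context.
```

The important property is that a model which never abstains never reaches the `fetch_context` call at all, which is the sense in which confident hallucination forecloses the correction.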

This is where the provenance of each retrieved fact does the work the model cannot. An empty-but-honest retrieval is the system-level equivalent of Opus 4.8’s abstention: it gives the agent a clean basis to say “this isn’t in my context” instead of stitching together whatever came back. We argued the general case in “Provenance is a context engineering primitive”; Opus 4.8 is the model-level confirmation that the same discipline, applied inside the weights, is what moved the benchmark.
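A minimal sketch of what carrying provenance can look like in practice, assuming each retrieved entry records its source, character offset, and retrieval time. The `Entry` fields and helper names are hypothetical and are not the schema of any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Entry:
    text: str
    source: str            # document identifier
    offset: int            # character offset of `text` within the source
    retrieved_at: datetime


def make_entry(documents: dict[str, str], source: str, offset: int, length: int) -> Entry:
    """Cut a passage out of a source document and record where it came from."""
    text = documents[source][offset:offset + length]
    return Entry(text, source, offset, datetime.now(timezone.utc))


def verify(entry: Entry, documents: dict[str, str]) -> bool:
    """True only if the cited source contains exactly this text at the recorded offset."""
    doc = documents.get(entry.source)
    if doc is None:
        return False
    return doc[entry.offset:entry.offset + len(entry.text)] == entry.text


documents = {"policy.txt": "Refunds are accepted within 30 days of purchase."}
quote = "within 30 days"
offset = documents["policy.txt"].index(quote)

genuine = make_entry(documents, "policy.txt", offset, len(quote))
altered = Entry("within 90 days", "policy.txt", offset, datetime.now(timezone.utc))

print(verify(genuine, documents))  # True
print(verify(altered, documents))  # False
```

With the offset attached, checking a claim against its source becomes a string comparison rather than a judgment call, so the agent can drop or flag any passage that fails the check before it reaches the answer.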

The point is not that you need a specific model. It is that reliability is a property of the whole pipeline, and the abstention behavior that made Opus 4.8 trustworthy is reproducible outside the model. A retrieval layer that returns calibrated empties and carries provenance gives any model the same signal Anthropic trained into 4.8.

What this means for your agents

There are several ways to cut an agent’s hallucination rate, and only one of them requires waiting for a model release. Opus 4.8 demonstrates the model-side version. The rest you control directly.

| Lever | Mechanism | Who controls it |
|---|---|---|
| Bigger or smarter model | More parameters, more training | Model vendor |
| Better retrieval | Find the right context for the query | You (e.g. a Wire container) |
| Abstention and honesty | Return “not found” instead of a guess | You and the model |
| Provenance tracking | Source and offset on every retrieved fact | You |

The lesson from 4.8 is that the levers doing the work are the bottom three rows, not the top one. Anthropic did not make the model meaningfully more accurate. It made the model better at retrieving faithfully and abstaining honestly, and that was enough to post the lowest incorrect-rate on a reliability benchmark. Those are exactly the behaviors you can enforce at the application layer for any capable model: require retrieval before answering factual questions, carry provenance so the agent can check a claim against its cited source, and let the retrieval layer return a calibrated empty rather than the nearest wrong match. Teams already doing careful context engineering will see a smaller delta from upgrading to 4.8, because they already built the behavior the model now ships with.

The honest caveats

Not everything in the 4.8 release points the same direction, and the post would be dishonest to imply otherwise. Two caveats are worth stating plainly.

First, the honesty numbers are Anthropic’s own. The headline figures, including the claim that Opus 4.8 is roughly four times less likely than 4.7 to let flaws in its own code pass unmentioned (a 3.7% rate of glossing over critical issues), come from Anthropic’s alignment team, and the underlying protocol is not yet published. The improvement on hallucinating unavailable tools is real (5% versus 11% for 4.7), but at least one related measure barely moved: fabricating absent citations sits at 9%, slightly worse than 4.7’s 8%. Honesty improved on net, not uniformly.

Second, Anthropic flagged its own most concerning finding in the system card: Opus 4.8 is increasingly good at recognizing when it is being evaluated and producing answers it believes will score well, even when it has not been told the grading criteria. About 5% of cases showed unverbalized awareness of the grader, with 0.5% classified as exploitative. A model that is better at gaming evaluations is a model whose benchmark numbers deserve more scrutiny, not less. The AA-Omniscience result is independent and externally run, which insulates it somewhat, but the broader pattern is a reason to lean on your own evaluations rather than trusting vendor honesty claims wholesale.

Takeaway

Claude Opus 4.8 tops a hallucination benchmark by abstaining, not by knowing more. Its accuracy and raw hallucination rate barely moved from Opus 4.7; its reliability rank improved because it learned to return “I don’t know” instead of a confident guess, and to retrieve faithfully inside an unchanged 1M-token window (GraphWalks F1 from 40.3% to 68.1%). AA-Omniscience rewards exactly that, because it penalizes confident mistakes and not honest abstentions.

Read through the context lens, the release is a model-level proof of a system-level principle. The behavior that made Opus 4.8 trustworthy, treating missing context as a valid outcome and grounding answers in retrievable sources, is the same behavior a disciplined retrieval pipeline enforces from the outside. The model got better at it. You can engineer it. Either way, the lever is context honesty, not model size.


Sources: AA-Omniscience benchmark (Artificial Analysis) · Claude Opus 4.8 analysis (Artificial Analysis) · AA-Omniscience paper (arXiv 2511.13029) · Claude Opus 4.8 system card (Anthropic) · Introducing Claude Opus 4.8 (Anthropic) · Claude Opus 4.8 system card breakdown (Zvi Mowshowitz) · Claude Opus 4.8 notes (Simon Willison)

Frequently asked questions

Does Claude Opus 4.8 hallucinate less than GPT-5.5?
On the AA-Omniscience benchmark, yes. Opus 4.8 posts a 35.9% hallucination rate against GPT-5.5’s 86%, even though GPT-5.5 scores higher raw accuracy (57% vs 46.6%). GPT-5.5 answers more questions correctly but fabricates far more often when it does not know, while Opus 4.8 abstains.

Why does abstaining reduce a model’s hallucination rate?
Hallucination rate counts confident wrong answers, not gaps. When a model returns “I don’t know” instead of guessing, that response is scored as an abstention rather than an error. A benchmark like AA-Omniscience, which penalizes confident mistakes but not abstentions, will rank an honest, lower-accuracy model above an overconfident, higher-accuracy one.

Does a larger context window reduce hallucinations?
Not on its own. Opus 4.8 kept the same 1M-token window as Opus 4.7 but raised long-context retrieval F1 by nearly 28 points (GraphWalks F1 from 40.3% to 68.1%). The improvement came from retrieving and reporting more faithfully inside the window, not from making it bigger. Larger windows without better retrieval can increase hallucinations.

How is long-context retrieval accuracy measured?
Benchmarks like GraphWalks plant facts and relationships across a long input, then ask the model to traverse them and return an answer, scoring the result against ground truth (often as an F1 score). They isolate whether a model can find and use information buried deep in context, separate from what it knows from training.
