Context budgets: how to allocate tokens for AI agents
Key takeaway
Claude Fable 5 ships safety classifiers that can decline a request mid-task and a fallback system that retries it on Claude Opus 4.8, which makes mid-conversation model swaps a designed behavior rather than an edge case. The supporting mechanics are a direct statement about context portability: prompt caches are per-model and rebuilding one is priced (then refunded via fallback credit), thinking blocks are dropped at the model boundary, and a fallback content block marks exactly where one model's context ends. The context engineering lesson is to keep agent state outside the message list, where a model swap costs a re-read instead of a corrupted history.
Claude Fable 5 shipped on June 9 with safety classifiers that can decline a request mid-task and an API for retrying the declined request on Claude Opus 4.8. Most coverage filed this under safety. Read the developer docs and something more interesting emerges: to make mid-conversation model swaps work, Anthropic had to specify exactly which parts of an agent’s context survive the move, build a billing mechanism to refund the cost of re-establishing context on the new model, and invent a content block whose only job is to mark where one model’s context ends and another’s begins.
That is the most explicit statement any vendor has made about context portability, and it came with a price tag attached. This post walks through the fallback mechanics, what they reveal about which context is model-bound, and what that means for how agents should hold state.
A Fable 5 refusal is a successful HTTP 200 response with stop_reason: "refusal", and the sanctioned recovery is to rerun the request on Claude Opus 4.8. The response carries a stop_details object naming a category (cyber, bio, frontier_llm, or reasoning_extraction) and an explanation, though both can be null. The category descriptions are candid about false positives: Anthropic notes that benign cybersecurity work can trigger cyber and beneficial life-sciences work can trigger bio, which is precisely why a fallback path matters for legitimate workloads, not just adversarial ones. Anthropic says the classifiers trigger in under 5% of sessions, and a refusal that arrives before any output costs nothing and consumes no rate limits.
The retry has three supported shapes. A server-side fallbacks parameter, in beta, names up to three models and retries inside a single API call. SDK middleware in five languages does the same from the client on any platform. And a manual path exists for everyone else. At launch, Fable 5’s only permitted fallback target is Opus 4.8, published as allowed_fallback_models in the Models API.
Two design details show how seriously Anthropic expects swaps to happen in production. First, sticky routing: once a conversation falls back, the API remembers (for about an hour, as a content hash of the conversation prefix) which model served it, and routes later turns straight to that model rather than paying for a predictable refusal on every turn. Second, the docs warn that one user turn can produce several refusals because agents fan out into sub-agent calls, and the fallbacks parameter does not propagate into model calls made inside tool execution. Every request path needs its own fallback configuration. Mid-task model swapping is not an edge case in this design. It is a budgeted, routed, instrumented behavior.
There is an observability trap here too. A refusal is an HTTP 200, so monitoring built on error rates and 5xx counts never sees one. Anthropic’s guidance is to emit one event per refusal and one per fallback-served response and alert on the gap, which is worth implementing on day one rather than after the first silent degradation.
Prompt caches are per-model, so when a conversation moves from Fable 5 to Opus 4.8, the entire cached conversation prefix must be written into the new model’s cache from scratch, and Anthropic built a refund mechanism because that cost is real. The refusal carries a fallback_credit_token, the retry echoes it, and the retry is billed “as though the conversation had been on the new model all along.” Cache writes cost more than cache reads, and on a long agentic session the difference is not small: a session holding hundreds of thousands of tokens of cached context pays the full write rate to re-establish itself on the fallback model.
Look at what the credit’s redemption rules require. The retry must match the refused request exactly on every field that shapes the prompt: system, messages, tools, tool_choice, thinking, cache_control. The token expires in five minutes and redeems only from the organization that received the refusal. This is a precise, audited definition of “the same context on a different model,” enforced at the billing layer.
The framing worth sitting with: a vendor has now priced the act of moving an agent’s working context from one model to another, found the price high enough to matter, and built infrastructure to refund it. Context portability stopped being an abstract design virtue and became a line item. Anything your agent holds only in the message list is an asset denominated in a single model’s cache, and the exchange rate to any other model is a full rewrite.
When a fallback happens mid-output, the response contains a fallback content block marking the boundary, and Anthropic’s table of what to keep and drop when echoing the turn is effectively a portability spec for agent context. Condensed:
| Block type before the model boundary | Survives the swap? |
|---|---|
text output | Yes |
server_tool_use paired with its result | Yes |
thinking / redacted_thinking | No, dropped |
Client-side tool_use without a result | No, dropped |
The fallback marker itself | Must stay, exactly in place |
The pattern is not arbitrary. Everything that survives is observable output: text the model committed to, tool calls that completed and returned results. Everything that dies is model-internal state: reasoning blocks that only the producing model can interpret, tool intentions that never resolved. Fable 5 sharpens this because raw chain of thought is never returned at all; thinking blocks are summaries or empty placeholders, and at a model boundary even those are discarded.
This is the same lesson the multi-agent literature keeps producing, now enforced by an API contract: handoffs corrupt whatever context was implicit. A model swap is a handoff in which the receiving party happens to be a different model, and the API formalizes that the receiver gets your durable artifacts and none of your predecessor’s working memory.
The practical conclusion is to treat the message list as a cache and keep the agent’s real state somewhere a model swap cannot touch. If a swap can happen on any request, then any state that exists only inside the conversation is state your agent can lose mid-task, or pay to reconstruct. Three patterns put state on the right side of the boundary:
Files via the memory tool. Anthropic’s own answer ships with Fable 5: the memory tool persists state to files outside the context window, and Anthropic’s launch evaluations credit file-based memory with Fable 5’s largest measured gains on long-horizon tasks. State written to a file before a refusal is identical after the fallback, whichever model is reading it.
A store the agent queries instead of carries. Context containers hold working knowledge outside the conversation entirely. An agent that keeps its project state in a Wire container loses nothing at the fallback boundary, because the container answers a wire_search call identically whether the caller is Fable 5 or Opus 4.8; reconnecting costs one query returning a few thousand tokens of matched entries, not a full cache rewrite of the conversation prefix. The same property is what makes context move across tools, and a model swap is just the smallest possible version of a tool migration.
Structured turn summaries. For state that must live in the conversation, prefer explicit structured context over implicit accumulation: a compact, typed summary of decisions and open threads that any model can parse, refreshed as the task progresses. It is the same discipline compaction applies retroactively, done proactively, and it doubles as the part of the prompt that survives a swap with its meaning intact.
The common thread is context offloading: the less of your agent’s identity lives in the message list, the less a model boundary can take from it. Offloaded state also happens to be the part you can inspect, version, and audit, which matters for more than fallback.
The deeper shift is that Fable 5 makes every integration that uses it a multi-model integration, whether the developer planned one or not. A request can be served by Opus 4.8 at any time, by design, with sticky routing keeping the conversation there for an hour. Teams that assumed one model per session now have an API that swaps models for them, and the teams that handle it well will be the ones whose context was never model-bound to begin with.
That is worth internalizing beyond Anthropic’s ecosystem. Refusal-triggered fallback is one reason a session changes models mid-task; cost routing, rate limits, provider outages, and A/B evaluations are others, and none of them come with a fallback credit. The Fable 5 docs are the first place the rules of a mid-task swap have been written down precisely, and the rules say: output survives, reasoning does not, caches are local to a model, and re-establishing context has a price. Design your agent’s context so that price stays near zero.
Sources: Refusals and fallback (Claude API Docs) · Fallback credit (Claude API Docs) · Introducing Claude Fable 5 and Claude Mythos 5 (Claude API Docs) · Introducing Claude Fable 5 and Claude Mythos 5 (Anthropic)
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container