Prompt caching: why reusing context makes AI agents cheaper and faster
Key takeaway
Prompt caching reuses previously computed token representations so AI agents avoid reprocessing the same context on every API call. Research across 500+ agent sessions found it reduces costs by 41-80% and latency by up to 31%, but only when implemented correctly. Naive full-context caching can paradoxically increase latency. The key is structuring prompts so static content forms a stable prefix, with dynamic content appended at the end.
AI agents read far more than they write. Manus, one of the most widely used AI agent platforms, reported an average input-to-output token ratio of around 100:1. Every API call sends the same system instructions, the same tool definitions, and the same accumulated context, with only a small slice of new content at the end.
Without caching, every one of those input tokens gets reprocessed from scratch. With caching, the provider reuses previously computed representations and charges a fraction of the cost. On Anthropic’s API, cached input tokens cost $0.30 per million versus $3.00 uncached. That is a 10x difference on the largest component of most agent bills.
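A back-of-the-envelope check makes the gap concrete. This sketch uses the per-token prices quoted above plus Anthropic's documented 1.25x cache-write premium on the first request; the session sizes are illustrative and the prices should be verified against current pricing:

```python
# Back-of-the-envelope cost for one agent session's stable prefix.
# Prices: $3.00/M uncached, $0.30/M cached read, $3.75/M cache write (1.25x).
UNCACHED = 3.00 / 1_000_000      # $ per input token
CACHED_READ = 0.30 / 1_000_000
CACHE_WRITE = 3.75 / 1_000_000

stable_prefix = 15_000           # system prompt + tool definitions (tokens)
calls = 50                       # API calls in the session

without_cache = stable_prefix * calls * UNCACHED
with_cache = (stable_prefix * CACHE_WRITE                    # first call writes
              + stable_prefix * (calls - 1) * CACHED_READ)   # later calls read

print(f"uncached: ${without_cache:.2f}")   # $2.25
print(f"cached:   ${with_cache:.2f}")      # $0.28
```

Even after paying the write premium once, the cached session costs roughly an eighth of the uncached one on the stable prefix alone.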
Manus called KV-cache hit rate “the single most important metric for a production-stage AI agent.” The research backs this up, but with a caveat: caching done wrong can make things worse, not better.
Every time a language model processes input tokens, it computes key-value (KV) tensors, the internal representations the model uses during attention. This computation is the most expensive part of inference for long inputs.
Prompt caching stores these KV tensors after the first computation. On subsequent requests, if the beginning of the prompt (the prefix) matches exactly, the provider reuses the stored tensors and only computes new ones for the tokens that follow. A cache hit means the model skips straight to processing the new content.
The critical constraint is exact prefix matching. The entire prefix must be identical, token for token. A single changed character invalidates the cache from that point forward, forcing full recomputation of everything after the change.
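The mechanism can be sketched as a toy prefix cache. This is illustrative only: real providers store KV tensors at the serving layer, and usually cache intermediate prefixes too, but the exact-match behavior is the same:

```python
# Toy model of exact-prefix caching: reuse stored state for the longest
# previously seen prefix, then compute only the tokens that follow.
cache: dict[tuple, str] = {}     # prefix tokens -> opaque KV state

def process(tokens: list[str]) -> tuple[int, int]:
    """Return (tokens reused from cache, tokens computed fresh)."""
    hit = 0
    for i in range(len(tokens), 0, -1):
        if tuple(tokens[:i]) in cache:
            hit = i
            break
    cache[tuple(tokens)] = "kv-state"    # store this prefix for next time
    return hit, len(tokens) - hit

prompt = ["system:", "You", "are", "helpful.", "user:", "hi"]
print(process(prompt))                          # first call: (0, 6) - all fresh
print(process(prompt + ["assistant:", "hey"]))  # prefix hit: (6, 2)
print(process(["system*"] + prompt[1:]))        # first token changed: (0, 6)
```

The third call shows the constraint: changing one token at the front means nothing downstream can be reused, even though five of six tokens are identical.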
All major providers now support this natively: Anthropic exposes explicit cache_control breakpoints, OpenAI applies automatic prefix matching, and Google offers context caching with configurable TTLs.

The savings are substantial because agentic workloads are prefix-heavy. A typical agent session might have a 10,000-token system prompt, 5,000 tokens of tool definitions, and growing conversation history. Only the latest turn is new. Without caching, the provider reprocesses all of it every time.
A January 2026 study, “Don’t Break the Cache,” evaluated prompt caching across 500+ agent sessions on a multi-turn research benchmark. The researchers tested three strategies across Anthropic, OpenAI, and Google:
System-prompt-only caching produced the most consistent savings: 41-80% cost reduction across providers. By caching just the system instructions and tool definitions, the stable prefix stayed intact regardless of what the agent did during the session.
Full-context caching, where everything including tool results was cached, showed mixed results. It sometimes reduced costs further, but in certain conditions it paradoxically increased latency. The problem is that tool outputs change every turn, which can cause cache misses that negate the benefit.
Excluding dynamic tool results from the cached prefix, while caching system prompt and conversation history, struck a middle ground. This strategy provided more consistent benefits than naive full-context caching because it kept the cached prefix more stable.
The latency improvements were also significant. Time to first token improved by 13-31% across providers when cache hits occurred. For agents making dozens of API calls per session, that compounds into noticeably faster task completion.
Real-world results confirm the benchmarks. Thomson Reuters Labs reported 60% cost reduction on their document analysis pipeline. A SaaS startup building an AI writing assistant cut monthly spend from $15,000 to $4,500. An unnamed Anthropic customer achieved 85% cost reduction on their RAG-based customer support system.
Headline savings of up to 90%, the figure providers advertise, are achievable, but only for teams that avoid common implementation mistakes.
A timestamp, a user ID, or any request-specific value placed at the beginning of the prompt will invalidate the entire cache on every request. This is the most common mistake. If your system prompt starts with “Current time: 2026-04-02T14:23:00Z,” every single request computes from scratch. Move all dynamic content to the end of the prompt, after the stable prefix.
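The fix is a pure ordering change. A minimal sketch with illustrative message shapes (not tied to any particular SDK):

```python
from datetime import datetime, timezone

SYSTEM = "You are a research agent. Use the provided tools."  # stable

def build_prompt_bad(history: list[dict]) -> list[dict]:
    # Anti-pattern: a timestamp at the top changes on every request,
    # so nothing after it can ever be a cache hit.
    now = datetime.now(timezone.utc).isoformat()
    return [{"role": "system", "content": f"Current time: {now}\n{SYSTEM}"}] + history

def build_prompt_good(history: list[dict]) -> list[dict]:
    # Stable prefix first; request-specific values appended at the end.
    now = datetime.now(timezone.utc).isoformat()
    return ([{"role": "system", "content": SYSTEM}]
            + history
            + [{"role": "user", "content": f"(current time: {now})"}])
```

Both prompts carry the same information; only the good version keeps the system prompt and history byte-identical across requests.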
The OpenAI Codex CLI demonstrates the right pattern: system instructions, tool definitions, sandbox configuration, and environment context are kept identical and consistently ordered between requests. New messages are appended rather than inserted. The prompt is designed so the prefix never changes.
In most LLM APIs, tool definitions are serialized near the front of the prompt, before or just after the system prompt. When tool definitions change, every token after that point in the cache is invalidated.
This is a particular problem with MCP dynamic tool discovery, where available tools may vary based on connected servers or runtime context. If your agent connects to a new MCP server mid-session and the tool list changes, the entire cached prefix resets. Teams running dynamic tool sets should consider fixing the tool definitions for the duration of a session and loading new tools only at session boundaries.
Context compression and prompt caching work against each other. When you summarize or compact earlier conversation turns to save tokens, you change the cached prefix. The result: you save tokens through compression but lose the caching discount on everything that follows.
This is a real trade-off, not a mistake to avoid entirely. For short sessions (under 20 turns), caching usually wins because the prefix stays stable and the per-token savings are large. For long sessions where context drift becomes a problem, compression may be necessary even at the cost of cache invalidation. The right strategy depends on session length and how quickly context accumulates. Some teams solve this by compressing at fixed intervals (every 50 turns) rather than continuously, minimizing the number of cache resets.
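The interval strategy is simple to express. A sketch, where `summarize` stands in for whatever compression the agent uses:

```python
COMPACT_EVERY = 50   # turns between compressions; tune per workload

def maybe_compact(history: list[dict], turn: int, summarize) -> list[dict]:
    """Compress at fixed intervals instead of continuously, so the cache
    resets at most once per interval rather than on every turn."""
    if turn > 0 and turn % COMPACT_EVERY == 0:
        summary = summarize(history)         # one accepted cache reset
        return [{"role": "user", "content": f"Summary so far: {summary}"}]
    return history                           # prefix unchanged -> cache hit
```

Between compactions the history is append-only, so every intermediate call keeps its cache hit; the cost of invalidation is paid once per interval.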
The core rule is simple: static first, dynamic last.
A well-structured agent prompt looks like this:
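A minimal sketch of that layering, ordered from most stable to most volatile (contents illustrative):

```python
# Layers ordered top-down: each changes less often than the one below it.
prompt_layers = [
    ("system instructions",   "changes almost never"),
    ("tool definitions",      "changes per deployment"),
    ("conversation history",  "append-only, grows per turn"),
    ("retrieved documents",   "changes per turn"),
    ("latest user message",   "changes every request"),
]
```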
Each layer changes less frequently than the one below it. The cache covers everything from the top down to the first point of divergence. The more layers you keep stable, the higher your cache hit rate.
On Anthropic’s API, you can set explicit cache_control breakpoints to tell the system exactly where the stable prefix ends. On OpenAI, caching is automatic: the system detects shared prefixes across requests. Google offers configurable TTLs for cached context.
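With Anthropic's API, the request shape looks roughly like this; the model name and prompt text are placeholders, and the cache_control details should be checked against the current docs:

```python
# Request payload following Anthropic's documented cache_control shape.
LONG_SYSTEM_PROMPT = "You are a research agent..."   # stable instructions

payload = {
    "model": "claude-sonnet-4-20250514",   # illustrative model name
    "max_tokens": 1024,
    "system": [{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        # Everything up to and including this block forms the cached prefix:
        "cache_control": {"type": "ephemeral"},
    }],
    "messages": [{"role": "user", "content": "Summarize the latest results."}],
}
```

Passing this payload to `client.messages.create(**payload)` with the `anthropic` SDK writes the prefix to the cache on the first call; subsequent requests with an identical prefix read from it.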
For teams using RAG pipelines, the placement of retrieved documents matters. If retrieval results appear before conversation history, each new retrieval changes the prefix and invalidates the cache. Moving retrieved content after the conversation history, or using a separate retrieval step that appends to the end, preserves the cache on the stable portions.
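The placement rule can be sketched as a prompt builder (message shapes are illustrative):

```python
def build_rag_prompt(system: str, history: list[dict],
                     retrieved: list[str], question: str) -> list[dict]:
    """Append retrieved documents AFTER the conversation history, so a new
    retrieval only invalidates the tail of the cache, not the history."""
    docs = "\n\n".join(retrieved)
    return ([{"role": "system", "content": system}]   # stable prefix
            + history                                  # append-only
            + [{"role": "user",                        # volatile tail
                "content": f"Reference documents:\n{docs}\n\nQuestion: {question}"}])
```

With retrieval in the final message, the system prompt and every prior turn stay cache-hot no matter what the retriever returns.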
Prompt caching is not the only caching strategy available. Semantic caching stores complete query-response pairs and returns cached responses when a new query is semantically similar, skipping inference entirely.
The two solve different problems. Prompt caching saves compute on the input side while still running inference. Semantic caching skips inference but only works when a similar enough query has been seen before. For AI agents, where every task is different but the system context is shared, prompt caching provides consistent savings. Semantic caching adds value in customer-facing applications where similar questions recur.
The strongest production setups use both: prompt caching to reduce per-request cost, and semantic caching to eliminate redundant queries entirely.
Prompt caching rewards the same discipline that makes context engineering effective: deliberate structure, minimal tokens, stable architecture. Teams that already practice selective retrieval and structured context tend to get high cache hit rates without additional effort because their prompts are already well-organized with stable prefixes.
The Manus team discovered this firsthand. After rebuilding their agent framework four times, they converged on context engineering as the key discipline, with KV-cache optimization as a natural consequence. When every token in the context window earns its place, the prefix stays stable, the cache hits, and costs drop.
Tools like Wire lean into this pattern. A single wire_explore call returns a stable map of everything in a container, which the agent can cache and reuse throughout the session. Subsequent retrieval calls are scoped to the narrowest required parameters rather than pulling broad context, keeping each request small and the prefix stable.
Context engineering and prompt caching are not separate optimizations. They are the same optimization viewed from different angles.
Sources: Don’t Break the Cache (arXiv) · Manus: Context Engineering for AI Agents · Anthropic Prompt Caching Docs · Redis: Prompt Caching vs Semantic Caching · Thomson Reuters Labs: 60% Cost Reduction · Anthropic: Effective Context Engineering · Amazon Bedrock Prompt Caching
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container