Context compression: why less context means better AI
Key takeaway
Most teams treat rising AI costs as a model pricing problem. The data tells a different story: context architecture decisions drive 60-70% of total AI spend. Token prices have fallen 280x in two years while enterprise budgets grew 320%, because agentic workflows multiply token volume 10-20x per task. The teams spending the least per task got there by restructuring context delivery, not by switching models.
Token prices have fallen roughly 280x in two years. Enterprise AI budgets have risen 320% over the same period, from an average of $1.2 million in 2024 to $7 million in 2026. The math should not work this way. Cheaper units should mean lower bills.
The math works this way because AI costs are not a pricing problem. They are a context problem. How you structure, deliver, and manage context determines 60-70% of your total AI spend, and most teams have never audited that architecture.
Per-token prices are falling fast, and that is real. Claude Opus 4.6 costs $5 per million input tokens. GPT-5.2 is at $1.75. Haiku 4.5 is at $1. Two years ago, comparable capabilities cost hundreds of dollars per million tokens. The price curve looks like a gift.
The illusion is assuming that total cost tracks per-token price. It does not, because agentic workflows changed the multiplier. A chatbot makes one LLM call per user message. An AI agent makes 10-20 calls per task: planning, tool selection, execution, verification, error recovery, response generation. Each call re-sends the full context: system instructions, tool definitions, conversation history, retrieved documents.
Inference accounts for 85% of the enterprise AI budget in 2026, and the FinOps Foundation reports that 98% of organizations are actively managing AI spend, up from 31% two years ago. The bills got people's attention. But most teams are optimizing the wrong variable.
An unconstrained AI agent solving a software engineering task costs $5-8 in API fees per task. A multi-step research agent can burn through $5-15 in minutes. To understand why, look at where the tokens go.
Manus, before its acquisition by Meta, reported an average input-to-output ratio of 100:1. For every token the model generates, it processes 100 tokens of context. That context is the system prompt, tool definitions, conversation history, retrieved documents, and prior tool results, all re-sent on every single API call.
The cost structure reinforces this. In 2026, the median output-to-input cost ratio across major providers sits at approximately 4:1. Output tokens are more expensive per unit, but input tokens dominate total spend because there are so many more of them. When your agent makes 15 calls per task and each call sends 50,000 tokens of context, you are paying to re-process 750,000 input tokens for what might produce 7,500 tokens of output.
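The arithmetic above can be made concrete. This sketch uses the document's example numbers (15 calls per task, 50,000 input tokens per call) plus hypothetical prices consistent with the 4:1 output-to-input cost ratio; roughly 500 output tokens per call is an assumption to reach the 7,500-token total.

```python
CALLS_PER_TASK = 15
INPUT_TOKENS_PER_CALL = 50_000
OUTPUT_TOKENS_PER_CALL = 500

INPUT_PRICE_PER_MTOK = 5.00    # hypothetical: $5 per million input tokens
OUTPUT_PRICE_PER_MTOK = 20.00  # 4:1 output-to-input price ratio

total_input = CALLS_PER_TASK * INPUT_TOKENS_PER_CALL    # 750,000 tokens
total_output = CALLS_PER_TASK * OUTPUT_TOKENS_PER_CALL  # 7,500 tokens

input_cost = total_input / 1_000_000 * INPUT_PRICE_PER_MTOK    # $3.75
output_cost = total_output / 1_000_000 * OUTPUT_PRICE_PER_MTOK # $0.15

print(f"input ${input_cost:.2f} vs output ${output_cost:.2f}")
```

Even though each output token costs 4x more, input ends up 25x the spend, which is why the rest of this piece focuses on the input side.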
This is the context rot problem expressed in dollars. As agents accumulate history, every subsequent call carries more weight. Token count goes up. Cost per task goes up. And most of those tokens are the same ones the model already processed on the previous call.
Context engineering is as much an economic decision as a technical one. Four architectural decisions determine where most of your AI budget goes.
Structured context carries the same information in fewer tokens than raw text. A well-organized JSON payload with labeled fields and relationships transmits meaning more densely than a wall of unprocessed prose. Research on structured versus raw text for AI shows that models extract information more reliably from structured input, which also means fewer retry cycles and less wasted compute.
The cost impact compounds across every API call. If structuring your context reduces input by even 30%, and your agent makes 15 calls per task, you save 30% across all 15 calls. Pre-processing context at upload time rather than re-processing it on every query is the cost-efficient pattern. Wire containers take this approach: files are processed into structured, queryable formats once, and agents retrieve only the relevant entries on each call rather than ingesting raw documents repeatedly.
Context compression reduces what agents carry forward between steps. Factory.ai’s structured summarization approach maintained task accuracy while cutting memory usage by 26-54%. LangChain’s Deep Agents framework offloads tool results exceeding 20,000 tokens to the filesystem, replacing them with a file path and a 10-line preview.
The arXiv paper on AgentDiet found that agent trajectories accumulate useless, redundant, and expired information that can be removed without harming performance. Another study, CigaR, demonstrated 73% token cost reduction through context-aware optimization that focuses the model on key information rather than processing everything. The pattern is consistent: agents perform as well or better with less context, and every removed token saves money. Our deep dive on context compression covers the techniques in detail.
Prompt caching is the single highest-leverage cost optimization for agents with stable prefixes. Anthropic offers 90% discounts on cached input tokens. Research across 500+ agent sessions found 41-80% cost reduction from system-prompt-only caching, with time-to-first-token improvements of 13-31%.
The key constraint is prefix stability. Caching only works when the beginning of each prompt stays identical. Dynamic timestamps, changing tool definitions, or compression that rewrites earlier turns all invalidate the cache. Structuring prompts with static content first and dynamic content last is the foundational pattern. Our prompt caching deep dive covers implementation in detail.
Model routing sends routine tasks to smaller, cheaper models and reserves frontier models for complex reasoning. 60-70% of LLM calls in a typical agent pipeline are routine tasks (classification, extraction, formatting) that small models handle well. A task routed to a frontier reasoning model may cost 190x more than the same task handled by a fast, smaller model.
Cascading approaches, where tasks start at a cheap model and escalate only when needed, regularly achieve 60-87% cost reduction because the expensive models only process what they must.
| Strategy | Savings | How it works | Trade-off |
|---|---|---|---|
| Model downgrade (e.g. Opus to Haiku) | 5x per token | Lower price per unit | Linear savings, potential quality loss |
| Prompt caching | 41-80% total | Reuse computed prefix | Requires stable prompt structure |
| Context compression | 26-54% total | Reduce tokens carried forward | May lose edge-case details |
| Structured context | 20-40% total | Denser information per token | Requires pre-processing pipeline |
| Model routing | 60-87% total | Match task to model | Needs classifier or complexity scoring |
| Combined context optimization | 70-90% total | All of the above | Multiplicative across all calls |
Switching from Opus to Haiku cuts the per-token price 5x. That is a meaningful reduction on any individual call. But it is linear: you pay less per token while processing the same volume of tokens.
Restructuring context architecture is multiplicative. Compression reduces the tokens. Caching avoids re-processing them. Structuring makes each token carry more information. Routing sends the reduced, structured, cached context to the right model. Each layer compounds on the others.
A team that switches to a cheaper model cuts its unit cost 5x. A team that restructures context and keeps the better model cuts total cost 70-90% while maintaining quality. The context engineering approach treats cost as a design outcome, not a procurement decision.
The teams with the lowest per-task costs share a pattern: they treat context architecture as the primary cost lever.
Manus made KV-cache hit rate their north star metric for production agents. By structuring every prompt so the prefix stayed stable across calls, they maximized caching savings on every request in a session. The 100:1 input-to-output ratio meant caching had an outsized impact on their total bill.
The AgentDiet research team built trajectory reduction into the agent loop itself, automatically identifying and removing expired, redundant, and irrelevant information from the agent’s accumulated history. The result was lower token counts without degraded performance, because the removed content was not contributing to task completion anyway.
The CigaR framework took a different angle: rather than removing context, it optimized how context was presented to maximize information density per token. The 73% cost reduction came not from sending less, but from sending the same information in a more token-efficient format.
These are not competing approaches. They are layers in the same stack: structure the context, compress what accumulates, cache what repeats, route what remains. Each one reduces the token volume that the next one operates on.
If your AI costs are climbing, the diagnostic is straightforward: audit the four levers above. Is your context structured or raw? What accumulates in the agent's history between steps? Do your prompt prefixes stay stable enough to cache? And does each call actually need the model it is routed to?
The 280x drop in token prices is real. But for teams running AI agents in production, the bill is set by architecture, not by pricing. The teams spending the least per task are not the ones who found the cheapest model. They are the ones who send the right context, in the right format, at the right time.
Sources: Deloitte: AI Token Spend Dynamics · Zylos Research: AI Agent Cost Optimization · Stevens: Hidden Economics of AI Agents · Manus: Context Engineering Lessons · Don’t Break the Cache (arXiv) · CigaR: Cost-Efficient Program Repair (arXiv) · AgentDiet: Trajectory Reduction (arXiv) · Factory.ai: Evaluating Compression · FinOps Foundation: GenAI Token Pricing · AnalyticsWeek: Inference Economics
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container