Context budgets: how to allocate tokens for AI agents
Key takeaway
AI agent token consumption scales with knowledge base size only when the full corpus is loaded into context on every query. When any filter, scope, or retrieval step sits between the knowledge base and the model, per-query token cost stays roughly flat regardless of whether the corpus holds 100 entries or 1,000,000. The question of whether tokens scale with knowledge base size is a question about context delivery architecture, not model limits.
A production team budgets token spend for an AI agent querying a 10,000-document knowledge base, then watches the bill double when the corpus grows to 20,000 documents. They ask the obvious question: does token usage scale with knowledge base size?
The answer is yes, no, and it depends, and the difference matters for every line item in your AI infrastructure budget. Token consumption scales with corpus size only when the full knowledge base is loaded into context on every query. When anything sits between the knowledge base and the model, per-query tokens decouple almost entirely from how much you have stored.
This post covers what actually scales with knowledge base size, what stays flat, where the crossover point sits for most production workloads, and how to audit which curve your own architecture sits on.
Token consumption per query scales linearly with knowledge base size if, and only if, the full knowledge base is passed into context on every request. In every other architecture, per-query tokens are a function of how much context is selected and delivered, not how much is stored. The variable that matters is selective context delivery: the presence of any filter, scope, index, or retrieval step between the corpus and the model.
This is a context engineering question, not a model capacity question. Frontier models now routinely ship with 1 million token windows, and some reach 10 million, which makes “load everything” technically possible at scales that were impossible two years ago. The question is whether it is economically or architecturally sensible, and at what knowledge base size the answer flips.
If an application loads the entire corpus into context on every query, per-query input tokens equal corpus token count plus a small overhead for the user prompt and system instructions. At Claude Opus 4.6 pricing of $5 per million input tokens, a 1 million token query costs roughly $5 in input, and 100 such queries per day runs about $15,000 per month in API spend alone.
The linearity is strict. Doubling the corpus doubles per-query input tokens. Ten times the corpus costs ten times more per query. This is the regime where “token consumption is proportional to knowledge base size” is literally true, and it is the regime most engineers have in mind when they ask the question.
It is also the regime that becomes untenable fast. At a 500,000 token corpus and 10,000 queries per day, full-context loading runs roughly $4 to $5 million per year at frontier model pricing. Prompt caching flattens part of that curve, which we cover below, but it does not change the underlying scaling relationship.
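The linear regime above fits in a few lines of arithmetic. A minimal sketch, assuming the $5 per million input token price and 100 queries per day used in this section (the 500-token prompt overhead is an illustrative assumption):

```python
def full_context_monthly_cost(corpus_tokens, queries_per_day,
                              price_per_mtok=5.0, overhead_tokens=500):
    """Monthly input spend when the whole corpus is loaded on every query.

    Strictly linear in corpus size: doubling corpus_tokens doubles the bill.
    """
    per_query = (corpus_tokens + overhead_tokens) / 1_000_000 * price_per_mtok
    return per_query * queries_per_day * 30

# 1M-token corpus, 100 queries/day at $5/Mtok: about $15,000/month,
# matching the figure above.
print(round(full_context_monthly_cost(1_000_000, 100)))
```

The function makes the two compounding dimensions explicit: cost grows with corpus size and query volume simultaneously, which is why this regime gets expensive fast.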
Any system that selects a relevant subset of the knowledge base per query decouples token consumption from corpus size. The filter can be vector retrieval, keyword search, agentic tool calls, scoped memory systems, rule-based routing, or any combination. The common property is that the model never sees the full corpus.
A well-configured retrieval layer typically passes 5 to 20 chunks totaling 2,000 to 10,000 tokens per query. That number stays roughly constant whether the corpus holds 100 documents or 1,000,000. One public case study showed a wiki-RAG system dropping token load from 50,000 to 2,500 for a 100-article knowledge base once retrieval was added — a 20x reduction from a corpus most teams would consider small.
The per-query cost difference at production scale is extreme. Retrieval-augmented queries average around $0.00008 per query versus roughly $0.10 for full-context queries, a difference of about 1,250x. At 10,000 queries per day and a 500,000 token corpus, the delta is the difference between $200,000 and $5 million per year for the same user-facing coverage.
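The decoupling shows up directly in a per-query cost function. A sketch using the same illustrative $5 per million token price and the ~5,000-token retrieval window from the range quoted above:

```python
PRICE_PER_MTOK = 5.0  # illustrative frontier input price, per the section above

def per_query_cost(delivered_tokens, price_per_mtok=PRICE_PER_MTOK):
    """Input cost of one query, driven only by tokens actually delivered."""
    return delivered_tokens / 1_000_000 * price_per_mtok

# Full-context: delivered tokens equal the 500k-token corpus.
full = per_query_cost(500_000)
# Retrieval: a fixed ~5k-token window, whatever the corpus holds.
rag = per_query_cost(5_000)
print(full, rag, full / rag)  # the gap widens as the corpus grows; rag does not
```

The only variable in `per_query_cost` is `delivered_tokens`. Corpus size never appears, which is the whole point of the flat curve.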
This is the architectural lever that matters. RAG is one instance of this pattern, but it is not the only one. Tool-based retrieval via MCP servers achieves the same decoupling: the agent calls a tool that returns a narrow slice, and the model never touches the full data. Context containers that expose scoped tool surfaces do the same thing at the container boundary. All three sit on the flat curve.
Selective context delivery keeps per-query inference costs flat, but it moves the scaling problem elsewhere. Several costs grow with corpus size regardless of retrieval architecture, and they need to be budgeted separately from inference spend.
Embedding generation is per-document: every new entry has to be embedded once. At current pricing, embedding 10 million documents of moderate length runs in the low thousands of dollars, one-time. Re-embedding on model upgrades is a recurring cost that scales with corpus size, not query volume.
Vector database hosting scales with corpus size and index size. Semantic and proposition-based chunking strategies can generate 3 to 5 times more vectors than recursive splitting for the same corpus, which multiplies storage and query latency costs. Retrieval latency also tends to scale sub-linearly with index size, but it does scale.
None of this changes the inference curve. A 100x bigger corpus does not make each inference call 100x more expensive under selective delivery. It does make the infrastructure surrounding that inference layer proportionally more expensive.
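These corpus-scaling costs can be budgeted separately from inference spend. A sketch with illustrative embedding pricing and document length, both assumptions rather than vendor quotes:

```python
def embedding_cost(n_docs, avg_doc_tokens=2_000, price_per_mtok=0.13):
    """One-time embedding spend; recurs in full on embedding-model upgrades,
    scaling with corpus size rather than query volume."""
    return n_docs * avg_doc_tokens / 1_000_000 * price_per_mtok

# 10M moderate-length documents: low thousands of dollars, as in the text.
print(embedding_cost(10_000_000))
```

Note the absent parameter: query volume. That is what distinguishes this line item from inference spend, and why it belongs on a different row of the budget.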
The full-context curve and the filtered curve cross at a specific range. For most production workloads, the crossover sits around 200,000 tokens of corpus and 500 queries per day. Below that threshold, long-context loading with prompt caching is often cheaper than standing up a retrieval pipeline, because the vector database hosting cost alone exceeds the incremental API spend.
Above that threshold, the curves diverge fast. A 500,000 token corpus at 10,000 queries per day runs about $4 to $5 million per year at current frontier model pricing with full-context loading. The same surface via retrieval stays under $200,000 per year. The crossover is not a fuzzy range; it is a sharp economic cliff.
Query volume matters as much as corpus size. A 50,000 token knowledge base queried 10 times per day is cheap to load in full. The same knowledge base queried 100,000 times per day has a meaningfully different cost profile, because full-context tokens compound linearly with both dimensions simultaneously.
This crossover assumes the conventional tradeoff: retrieval requires standing up an embedding pipeline, a vector database, and a chunking strategy, which means there is a fixed infrastructure cost that has to be amortized over enough queries to justify itself. That tradeoff is why the economics favor full-context loading for small corpora and retrieval for large ones, and it is why the crossover point has held roughly constant across multiple benchmark studies.
Container-based architectures change this calculus by absorbing the fixed retrieval infrastructure into the container abstraction itself. Wire containers run analysis at upload time and expose scoped tool surfaces without a separately provisioned vector store, which means a container holding 1,000 tokens of content and a container holding 10 million tokens both sit on the flat per-query curve from the first query. There is no crossover threshold to cross, because there is no separate retrieval pipeline to amortize. The conventional framing of “load everything until you outgrow it, then build retrieval” is an artifact of the infrastructure cost, not of the token economics.
Prompt caching changes the economics for stable corpora. Cached input tokens bill at a fraction of the uncached rate: Anthropic at roughly one-tenth, Gemini at one-quarter the standard rate. For a corpus that does not change between queries, this flattens the effective per-query cost by an order of magnitude while still loading the full context.
Caching does not reduce token volume. It reduces the price per token for the cached prefix. This creates a three-curve cost landscape that a lot of teams model incorrectly:
| Architecture | Per-query tokens | Scales with KB size | Cost floor | Cost ceiling |
|---|---|---|---|---|
| Full-context, uncached | Full corpus | Linear | Low at small KB | Runaway at large KB |
| Full-context, cached | Full corpus | Linear, but cheaper slope | Higher (caching overhead) | Flattens the ceiling |
| Selective delivery (RAG, tools, containers) | Fixed retrieval window | Flat per query | Vector DB hosting | Flat regardless of KB |
| Hybrid (retrieve then long context) | Retrieved subset | Flat per query | Both of the above | Flat regardless of KB |
The three curves cross at different points depending on query volume, cache hit rate, and corpus stability. Caching is most useful when the corpus is fixed and queried often; selective delivery is most useful when the corpus is large or changes frequently; hybrid patterns are most useful when reasoning needs to span a larger subset than a single retrieval window can cleanly return.
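The curves in the table can be compared directly. A sketch with illustrative numbers; the one-tenth cached-read rate follows the Anthropic figure quoted above, and the 95% cache hit fraction is an assumption for a stable corpus:

```python
def monthly_cost(delivered_tokens, queries_per_day, price_per_mtok=5.0,
                 cached_fraction=0.0, cache_discount=0.10):
    """Monthly input spend; the cached prefix bills at a discounted rate."""
    cached = delivered_tokens * cached_fraction
    fresh = delivered_tokens - cached
    per_query = (fresh + cached * cache_discount) / 1_000_000 * price_per_mtok
    return per_query * queries_per_day * 30

corpus, qpd = 500_000, 10_000
uncached = monthly_cost(corpus, qpd)                      # full context
cached = monthly_cost(corpus, qpd, cached_fraction=0.95)  # stable corpus
selective = monthly_cost(5_000, qpd)                      # fixed window
print(uncached, cached, selective)
# Caching cuts the slope roughly 7x here; only selective delivery makes
# the cost independent of corpus size.
```

Caching changes the slope of the linear curve; selective delivery changes which variable the curve depends on. The two levers are complementary, not interchangeable.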
Any architecture that inserts a selection step between corpus and model sits on the flat curve. The specific mechanism matters less than the property. Four patterns appear in production:
Vector retrieval ranks and returns a fixed number of chunks by semantic similarity. Per-query tokens depend on chunk size and top-k, not corpus size. This is the classic retrieval layer.
Keyword or structured filtering uses deterministic rules to narrow the surface before retrieval. Filtering a 10,000-row table to 50 rows matching a customer ID costs the same at any table size once the index exists.
Tool-based retrieval lets the agent call typed tools that return narrow slices. The model sees a schema of tools, not the underlying data. Token cost scales with tool surface area and average response size, not the corpus behind the tools. This is the pattern MCP servers implement.
Scoped context containers expose a per-query tool surface over a persistent data set. The container holds the full corpus but the model only ever receives the result of specific tool calls. Wire containers work this way: per-query tokens depend on what the agent retrieves through the tool surface, not on how much data the container holds, so a container with 10 entries and one with 10,000 entries produce comparable per-query tokens for similar queries.
All four patterns break the linearity between corpus size and per-query inference cost. Differences between them show up in latency, operational overhead, and how cleanly they handle cross-document reasoning, not in token scaling behavior.
Three checks tell you which curve an application sits on: average input tokens per query, the trend of that number as the corpus grows, and the ratio of tokens delivered per query to tokens stored.
Together, these metrics distinguish a token cost problem from an architecture problem. Cost per query rises with model pricing no matter what. Input tokens per query rise with knowledge base size only on the linear curve, and they will keep rising until the architecture changes.
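The audit can be run from API usage logs. A sketch that compares average input tokens per query before and after corpus growth (the log values here are synthetic):

```python
def scaling_ratio(log_before, log_after):
    """Ratio of average input tokens per query after corpus growth to before.
    Near the corpus growth factor: linear curve. Near 1.0: flat curve."""
    avg = lambda xs: sum(xs) / len(xs)
    return avg(log_after) / avg(log_before)

# Linear-curve app: input tokens track a corpus that grew 100k -> 400k.
print(scaling_ratio([100_500, 100_480], [400_500, 400_520]))  # ~4.0
# Flat-curve app: a ~5k retrieval window before and after the same growth.
print(scaling_ratio([5_100, 4_900], [5_050, 4_950]))          # ~1.0
```

A ratio that tracks corpus growth is the signature of the linear curve, and no amount of model-price negotiation will bend it; only an architecture change will.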
Token consumption scales with knowledge base size only in the absence of context engineering. Once any layer selects what the model sees per query, the scaling relationship breaks. Teams that budget AI spend as a function of corpus size are modeling the wrong variable; the variable that actually drives per-query cost is how much context is delivered, not how much is stored.
Storage is cheap. Inference on stored content is not. The architecture decision that matters most for long-term AI economics is the one that decouples them.
Sources: 1M Token Context vs RAG · Flat-Rate Long-Context Pricing · Long-Context vs RAG Decision Framework · Is RAG Still Worth It? · Karpathy’s Wiki Broke at 100 Articles · 2026 RAG Performance Paradox · AI Context Window Comparison 2026 · AI Inference Cost Crisis 2026 · Anthropic Prompt Caching Docs
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container