Does AI token usage scale with knowledge base size?

Jitpal Kocher · 11 min read

Key takeaway

AI agent token consumption scales with knowledge base size only when the full corpus is loaded into context on every query. When any filter, scope, or retrieval step sits between the knowledge base and the model, per-query token cost stays roughly flat regardless of whether the corpus holds 100 entries or 1,000,000. The question of whether tokens scale with knowledge base size is a question about context delivery architecture, not model limits.

A production team budgets token spend for an AI agent querying a 10,000-document knowledge base, then watches the bill double when the corpus grows to 20,000 documents. They ask the obvious question: does token usage scale with knowledge base size?

The answer is yes, no, and it depends; the difference matters for every line item in your AI infrastructure budget. Token consumption scales with corpus size only when the full knowledge base is loaded into context on every query. When anything sits between the knowledge base and the model, per-query tokens decouple almost entirely from how much you have stored.

This post covers what actually scales with knowledge base size, what stays flat, where the crossover point sits for most production workloads, and how to audit which curve your own architecture sits on.

Token usage scales with knowledge base size only when the full corpus is loaded on every query

Token consumption per query scales linearly with knowledge base size if, and only if, the full knowledge base is passed into context on every request. In every other architecture, per-query tokens are a function of how much context is selected and delivered, not how much is stored. The variable that matters is selective context delivery: the presence of any filter, scope, index, or retrieval step between the corpus and the model.

This is a context engineering question, not a model capacity question. Frontier models now routinely ship with 1 million token windows, and some reach 10 million, which makes “load everything” technically possible at scales that were impossible two years ago. The question is whether it is economically or architecturally sensible, and at what knowledge base size the answer flips.

The naive case: full-context loading makes tokens scale linearly with the corpus

If an application loads the entire corpus into context on every query, per-query input tokens equal corpus token count plus a small overhead for the user prompt and system instructions. At Claude Opus 4.6 pricing of $5 per million input tokens, a 1 million token query costs roughly $5 in input, and 100 such queries per day runs about $15,000 per month in API spend alone.

The linearity is strict. Doubling the corpus doubles per-query input tokens. Ten times the corpus costs ten times more per query. This is the regime where “token consumption is proportional to knowledge base size” is literally true, and it is the regime most engineers have in mind when they ask the question.

It is also the regime that becomes untenable fast. At a 500,000 token corpus and 10,000 queries per day, full-context loading runs roughly $4 to $5 million per year at frontier model pricing. Prompt caching flattens part of that curve, which we cover below, but it does not change the underlying scaling relationship.
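A minimal sketch of the linear regime, using the $5 per million input token rate quoted above; treat the rate and the prompt overhead as illustrative, since pricing varies by model and provider.

```python
# Illustrative full-context cost model: per-query input tokens equal corpus size
# plus a small prompt overhead, so spend scales linearly with corpus size and query volume.
PRICE_PER_MILLION_INPUT = 5.00  # USD per million input tokens (illustrative)

def monthly_full_context_cost(corpus_tokens: int, queries_per_day: int,
                              overhead_tokens: int = 1_000) -> float:
    """Monthly input spend when the whole corpus is sent on every query."""
    tokens_per_query = corpus_tokens + overhead_tokens
    cost_per_query = tokens_per_query / 1_000_000 * PRICE_PER_MILLION_INPUT
    return cost_per_query * queries_per_day * 30

print(round(monthly_full_context_cost(1_000_000, 100)))   # ~15,000 USD per month
print(round(monthly_full_context_cost(2_000_000, 100)))   # ~30,000: double the corpus, double the cost
```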

The filtered case: per-query tokens stay flat when anything scopes what the model sees

Any system that selects a relevant subset of the knowledge base per query decouples token consumption from corpus size. The filter can be vector retrieval, keyword search, agentic tool calls, scoped memory systems, rule-based routing, or any combination. The common property is that the model never sees the full corpus.

A well-configured retrieval layer typically passes 5 to 20 chunks totaling 2,000 to 10,000 tokens per query. That number stays roughly constant whether the corpus holds 100 documents or 1,000,000. One public case study showed a wiki-RAG system dropping token load from 50,000 to 2,500 for a 100-article knowledge base once retrieval was added — a 20x reduction from a corpus most teams would consider small.

The per-query cost difference at production scale is extreme. Retrieval-augmented queries average around $0.00008 per query versus roughly $0.10 for full-context queries, a difference of about 1,250x. At 10,000 queries per day and a 500,000 token corpus, the delta is the difference between $200,000 and $5 million per year for the same user-facing coverage.
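To make the shape of the two curves concrete, here is a back-of-the-envelope comparison; the top-k and chunk-size values are the typical ranges quoted above, not measurements from any particular system.

```python
# Per-query input tokens under the two delivery regimes (illustrative values).
def full_context_tokens(corpus_tokens: int) -> int:
    return corpus_tokens                    # the model sees the whole corpus

def retrieved_tokens(top_k: int = 10, chunk_tokens: int = 500) -> int:
    return top_k * chunk_tokens             # the model sees a fixed retrieval window

for corpus in (100_000, 1_000_000, 10_000_000):
    print(corpus, full_context_tokens(corpus), retrieved_tokens())
# Full-context tokens grow with the corpus; the retrieval column stays at 5,000 throughout.
```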

This is the architectural lever that matters. RAG is one instance of this pattern, but it is not the only one. Tool-based retrieval via MCP servers achieves the same decoupling: the agent calls a tool that returns a narrow slice, and the model never touches the full data. Context containers that expose scoped tool surfaces do the same thing at the container boundary. All three sit on the flat curve.

What does scale with knowledge base size: indexing, storage, and retrieval infrastructure

Selective context delivery keeps per-query inference costs flat, but it moves the scaling problem elsewhere. Several costs grow with corpus size regardless of retrieval architecture, and they need to be budgeted separately from inference spend.

Embedding generation is per-document: every new entry has to be embedded once. At current pricing, embedding 10 million documents of moderate length runs in the low thousands of dollars, one-time. Re-embedding on model upgrades is a recurring cost that scales with corpus size, not query volume.
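A rough estimate of that one-time cost, assuming an illustrative embedding rate and average document length (both are placeholders, not quoted prices):

```python
# One-time embedding cost estimate; re-embedding after a model upgrade re-incurs it.
EMBED_PRICE_PER_MILLION = 0.13  # USD per million tokens (placeholder rate)
AVG_DOC_TOKENS = 800            # assumed "moderate length" document

def embedding_cost(num_documents: int) -> float:
    total_tokens = num_documents * AVG_DOC_TOKENS
    return total_tokens / 1_000_000 * EMBED_PRICE_PER_MILLION

print(round(embedding_cost(10_000_000)))  # ~1,040 USD: low thousands for 10M documents
```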

Vector database hosting scales with corpus size and index size. Semantic and proposition-based chunking strategies can generate 3 to 5 times more vectors than recursive splitting for the same corpus, which multiplies storage and query latency costs. Retrieval latency also tends to scale sub-linearly with index size, but it does scale.

None of this changes the inference curve. A 100x bigger corpus does not make each inference call 100x more expensive under selective delivery. It does make the infrastructure surrounding that inference layer proportionally more expensive.

The crossover point: when full-context loading is still cheaper

The full-context curve and the filtered curve cross at a specific range. For most production workloads, the crossover sits around 200,000 tokens of corpus and 500 queries per day. Below that threshold, long-context loading with prompt caching is often cheaper than standing up a retrieval pipeline, because the vector database hosting cost alone exceeds the incremental API spend.

Above that threshold, the curves diverge fast. A 500,000 token corpus at 10,000 queries per day runs about $4 to $5 million per year at current frontier model pricing with full-context loading. The same surface via retrieval stays under $200,000 per year. The crossover is not a fuzzy range; it is a sharp economic cliff.

Query volume matters as much as corpus size. A 50,000 token knowledge base queried 10 times per day is cheap to load in full. The same knowledge base queried 100,000 times per day has a meaningfully different cost profile, because full-context tokens compound linearly with both dimensions simultaneously.

This crossover assumes the conventional tradeoff: retrieval requires standing up an embedding pipeline, a vector database, and a chunking strategy, which means there is a fixed infrastructure cost that has to be amortized over enough queries to justify itself. That tradeoff is why the economics favor full-context loading for small corpora and retrieval for large ones, and it is why the crossover point has held roughly constant across multiple benchmark studies.
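A simple break-even sketch of that tradeoff; the pricing, cache discount, retrieval window, and hosting figures below are assumptions chosen to be in the ballpark of the numbers above, so the exact crossover will differ for your stack.

```python
# Break-even sketch: cached full-context loading vs retrieval plus fixed infrastructure.
# Every rate below is an assumption, not a quoted price.
INPUT_PRICE = 5.00         # USD per million input tokens
CACHE_DISCOUNT = 0.1       # cached prefix billed at ~1/10 of the base rate
RETRIEVAL_TOKENS = 5_000   # fixed per-query retrieval window
VECTOR_DB_MONTHLY = 500.0  # assumed fixed monthly cost of the retrieval stack

def monthly_cached_full_context(corpus_tokens: int, queries_per_day: int) -> float:
    per_query = corpus_tokens / 1e6 * INPUT_PRICE * CACHE_DISCOUNT
    return per_query * queries_per_day * 30

def monthly_retrieval(queries_per_day: int) -> float:
    per_query = RETRIEVAL_TOKENS / 1e6 * INPUT_PRICE
    return per_query * queries_per_day * 30 + VECTOR_DB_MONTHLY

for corpus, qpd in [(50_000, 100), (200_000, 500), (500_000, 10_000)]:
    print(corpus, qpd, round(monthly_cached_full_context(corpus, qpd)), round(monthly_retrieval(qpd)))
# Small, lightly queried corpora favor cached full-context ($75 vs $575 in the first row);
# past the 200k-token, hundreds-of-queries range, retrieval wins by a widening margin.
```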

Container-based architectures change this calculus by absorbing the fixed retrieval infrastructure into the container abstraction itself. Wire containers run analysis at upload time and expose scoped tool surfaces without a separately provisioned vector store, which means a container holding 1,000 tokens of content and a container holding 10 million tokens both sit on the flat per-query curve from the first query. There is no crossover threshold to cross, because there is no separate retrieval pipeline to amortize. The conventional framing of “load everything until you outgrow it, then build retrieval” is an artifact of the infrastructure cost, not of the token economics.

Prompt caching creates a third scaling curve

Prompt caching changes the economics for stable corpora. Cached input tokens bill at a fraction of the uncached rate: Anthropic at roughly one-tenth, Gemini at one-quarter the standard rate. For a corpus that does not change between queries, this flattens the effective per-query cost by an order of magnitude while still loading the full context.
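One way to model the effect, assuming a stable corpus prefix, an illustrative one-tenth cache-read rate, and ignoring cache-write surcharges:

```python
# Effective per-query input cost with prompt caching on a stable corpus prefix.
INPUT_PRICE = 5.00      # USD per million input tokens (illustrative)
CACHE_READ_RATE = 0.1   # cached prefix billed at ~1/10 of the base rate (provider-dependent)

def per_query_cost(corpus_tokens: int, hit_rate: float, dynamic_tokens: int = 2_000) -> float:
    cached = corpus_tokens * hit_rate * CACHE_READ_RATE          # prefix served from cache
    uncached = corpus_tokens * (1 - hit_rate) + dynamic_tokens   # misses plus per-query prompt
    return (cached + uncached) / 1e6 * INPUT_PRICE

print(round(per_query_cost(500_000, hit_rate=0.0), 2))    # ~2.51 USD: no caching
print(round(per_query_cost(500_000, hit_rate=0.95), 2))   # ~0.37 USD: same tokens, cheaper rate
```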

Caching does not reduce token volume. It reduces the price per token for the cached prefix. This creates a three-curve cost landscape that a lot of teams model incorrectly:

| Architecture | Per-query tokens | Scales with KB size | Cost floor | Cost ceiling |
| --- | --- | --- | --- | --- |
| Full-context, uncached | Full corpus | Linear | Low at small KB | Runaway at large KB |
| Full-context, cached | Full corpus | Linear, but cheaper slope | Higher (caching overhead) | Flattens the ceiling |
| Selective delivery (RAG, tools, containers) | Fixed retrieval window | Flat per query | Vector DB hosting | Flat regardless of KB |
| Hybrid (retrieve then long context) | Retrieved subset | Flat per query | Both of the above | Flat regardless of KB |

The three curves cross at different points depending on query volume, cache hit rate, and corpus stability. Caching is most useful when the corpus is fixed and queried often; selective delivery is most useful when the corpus is large or changes frequently; hybrid patterns are most useful when reasoning needs to span a larger subset than a single retrieval window can cleanly return.

Patterns that decouple tokens from knowledge base size

Any architecture that inserts a selection step between corpus and model sits on the flat curve. The specific mechanism matters less than the property. Four patterns appear in production:

Vector retrieval ranks and returns a fixed number of chunks by semantic similarity. Per-query tokens depend on chunk size and top-k, not corpus size. This is the classic retrieval layer.

Keyword or structured filtering uses deterministic rules to narrow the surface before retrieval. Filtering a 10,000-row table to 50 rows matching a customer ID costs the same at any table size once the index exists.

Tool-based retrieval lets the agent call typed tools that return narrow slices. The model sees a schema of tools, not the underlying data. Token cost scales with tool surface area and average response size, not the corpus behind the tools. This is the pattern MCP servers implement.

Scoped context containers expose a per-query tool surface over a persistent data set. The container holds the full corpus but the model only ever receives the result of specific tool calls. Wire containers work this way: per-query tokens depend on what the agent retrieves through the tool surface, not on how much data the container holds, so a container with 10 entries and one with 10,000 entries produce comparable per-query tokens for similar queries.

All four patterns break the linearity between corpus size and per-query inference cost. Differences between them show up in latency, operational overhead, and how cleanly they handle cross-document reasoning, not in token scaling behavior.
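As a concrete illustration of the tool-based pattern, here is a minimal, hypothetical tool; the store, field names, and schema are invented for the example, but the property it demonstrates is the one that matters: the model sees a bounded response, never the corpus behind it.

```python
# Hypothetical tool-based retrieval: the agent calls a typed tool and receives a
# bounded slice; per-query tokens track the response size, not the store behind it.
from dataclasses import dataclass
from typing import TypedDict

@dataclass
class Order:
    id: str
    customer_id: str
    status: str
    total: float

ORDER_STORE: list[Order] = [  # stand-in for a store of any size
    Order("o-1001", "c-42", "shipped", 89.00),
    Order("o-1002", "c-42", "pending", 12.50),
]

class OrderSummary(TypedDict):
    order_id: str
    status: str
    total: float

def lookup_recent_orders(customer_id: str, limit: int = 5) -> list[OrderSummary]:
    """Tool surface exposed to the agent; the model never sees ORDER_STORE itself."""
    matches = [o for o in ORDER_STORE if o.customer_id == customer_id]
    return [{"order_id": o.id, "status": o.status, "total": o.total} for o in matches[:limit]]

print(lookup_recent_orders("c-42"))
```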

How to audit your own architecture’s scaling behavior

Three checks tell you which curve an application sits on; the sketch after this list shows how to compute the first two from request logs:

  1. Plot average input tokens per query against corpus size over the last 90 days. If the line trends up, the system is leaking corpus into context as it grows. If it stays flat, selective delivery is working. If it jumps discontinuously, a code change probably altered retrieval behavior.
  2. Measure the ratio of input tokens to corpus tokens. Under selective delivery, this ratio shrinks as the corpus grows, because the numerator is bounded but the denominator is not. Under full-context loading, the ratio stays near 1.0 regardless of corpus size.
  3. Count the number of LLM calls per user action. Agentic workflows typically trigger 10 to 20 calls per task. Each call re-bills the full context unless caching is on. A system with flat per-call tokens can still have runaway per-task tokens if the call fan-out is high.
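
A minimal sketch of the first two checks, assuming a request log with per-row input token counts and a corpus size snapshot; the field names are placeholders for whatever your logging actually emits.

```python
# Audit sketch for checks 1 and 2. Field names (day, input_tokens, corpus_tokens)
# are placeholders for whatever your request logging emits.
from collections import defaultdict
from statistics import mean

def scaling_report(request_log: list[dict]) -> dict:
    """Per-day average input tokens and input-to-corpus ratio from request log rows."""
    by_day = defaultdict(list)
    for row in request_log:
        by_day[row["day"]].append(row)
    report = {}
    for day, rows in sorted(by_day.items()):
        avg_input = mean(r["input_tokens"] for r in rows)
        corpus = max(r["corpus_tokens"] for r in rows)  # corpus size snapshot for the day
        report[day] = {
            "avg_input_tokens": avg_input,                 # check 1: should stay flat as the corpus grows
            "input_to_corpus_ratio": avg_input / corpus,   # check 2: should shrink, not sit near 1.0
        }
    return report
```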

Together, these three metrics distinguish a token cost problem from an architecture problem. A token cost problem tracks model pricing; an architecture problem tracks knowledge base size and keeps growing until the architecture changes.

The underlying rule

Token consumption scales with knowledge base size only in the absence of context engineering. Once any layer selects what the model sees per query, the scaling relationship breaks. Teams that budget AI spend as a function of corpus size are modeling the wrong variable; the variable that actually drives per-query cost is how much context is delivered, not how much is stored.

Storage is cheap. Inference on stored content is not. The architecture decision that matters most for long-term AI economics is the one that decouples them.


Sources: 1M Token Context vs RAG · Flat-Rate Long-Context Pricing · Long-Context vs RAG Decision Framework · Is RAG Still Worth It? · Karpathy’s Wiki Broke at 100 Articles · 2026 RAG Performance Paradox · AI Context Window Comparison 2026 · AI Inference Cost Crisis 2026 · Anthropic Prompt Caching Docs

Frequently asked questions

Does adding more documents to a knowledge base increase per-query cost?
Only if the retrieval layer passes more of them to the model per query. If the system fetches a fixed number of chunks, typically 5 to 20, per-query tokens stay flat regardless of corpus size. What grows with document count is the indexing, embedding, and storage cost, not the per-query inference cost.
At what knowledge base size does long context become more expensive than retrieval?
For DIY pipelines, the crossover sits around 200,000 tokens of corpus or 500 queries per day. Under that threshold, long context with prompt caching is often cheaper because the vector database hosting cost alone exceeds the API spend. Above it, retrieval wins fast: 500,000 tokens at 10,000 queries per day runs $4 to $5 million per year under full-context loading versus under $200,000 via retrieval. Container-based architectures remove the threshold entirely, because the fixed retrieval infrastructure cost is absorbed into the container abstraction, so a corpus of any size sits on the flat per-query curve from the first query.
Does prompt caching change how token usage scales with knowledge base size?
Yes. Cached input tokens are billed at roughly one-quarter to one-tenth the uncached rate depending on provider, which flattens the scaling curve for stable corpora. Caching does not reduce token volume, but it reduces the effective cost per token so much that loading a fixed corpus on every query becomes economically viable up to a point.
Why does token usage grow even when asking one question at a time?
Most agentic systems re-send the full system prompt, tool definitions, accumulated conversation history, and retrieved documents on every call. Each user turn triggers 10 to 20 LLM calls internally, and each call re-bills the same context. The perceived single question translates into a fan-out of calls whose combined input tokens can exceed the visible output tokens by 100 to 1.
How do hybrid architectures affect token scaling?
Hybrid patterns retrieve a relevant subset first and then reason over it in a long window. Per-query tokens scale with the size of the retrieved subset, not with the full corpus. This decouples the user-facing cost curve from knowledge base growth while preserving the reasoning quality of long context on the shrunken input.

Ready to give your AI agents better context?

Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.

Create Your First Container