Progressive tool loading: how MCP agents stopped paying for tools they never call
Key takeaway
Progressive tool loading defers MCP tool definitions until an agent actually needs them, instead of dumping every connected tool into the system prompt at session start. Anthropic's code-execution-with-MCP pattern reports a reduction from roughly 150,000 input tokens to 2,000 for the same task, a 98.7% drop. As agents connect to more servers in 2026, preloaded tool catalogs are the largest single source of wasted context, and progressive disclosure has become the default in production MCP design.
The headline number from Anthropic’s April 2026 work on MCP is hard to ignore. A reference agent task that needed roughly 150,000 input tokens with all tools preloaded into context dropped to about 2,000 tokens when tool definitions were loaded only when used. That is a 98.7% reduction on the same task, on the same model, with the same tools available.
That gap is the story of MCP in 2026. The protocol succeeded so completely that connecting an agent to more than a handful of servers became the dominant context cost. By April 2026 there were over 10,000 enterprise MCP servers and more than 97 million SDK downloads across providers, and the average agent now lives inside a tool catalog large enough to crowd out the user’s actual question. Progressive tool loading, sometimes called progressive disclosure, is how production teams are responding.
This post is about what progressive tool loading actually is, why preloaded catalogs broke down, and the design choices that separate the implementations that work from the ones that just shuffle the same tokens around.
Progressive tool loading defers the body of an MCP tool definition until the agent needs it, instead of placing every connected tool’s full schema into the system prompt at session start. The agent still discovers what tools exist, but the verbose schemas, parameter descriptions, and example payloads load on demand. Anthropic’s code-execution-with-MCP write-up frames this as the default pattern for new servers, and parallel implementations from Klavis, Speakeasy, and the broader MCP community have converged on the same shape.
Concretely, a session starts with a small index. Tool name, one-line purpose, server of origin. When the model decides to call a tool, the runtime fetches that tool’s full schema, places it in context for the call, and lets it fall out afterward. Multi-step tasks accumulate only the schemas the trajectory actually touched, not the union of every connected server.
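In code, the index can be as small as one record per tool. A minimal sketch, with hypothetical names that are illustrative rather than part of the MCP spec:

```python
from dataclasses import dataclass

# Hypothetical shape for an index entry; the field names are assumptions.
@dataclass
class ToolIndexEntry:
    server: str    # server of origin
    name: str      # the name the model emits when it calls the tool
    purpose: str   # one line, action-shaped

# What the model sees on every turn: tens of tokens per tool.
index = [
    ToolIndexEntry("docs", "search_entries",
                   "Search container entries semantically; return ranked matches with provenance"),
    ToolIndexEntry("docs", "get_entry",
                   "Fetch a single entry by id with metadata and source URL"),
]

def render_index(entries: list[ToolIndexEntry]) -> str:
    # Full schemas stay server-side until a call needs them.
    return "\n".join(f"{e.server}.{e.name}: {e.purpose}" for e in entries)

print(render_index(index))
```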
The technique is a context engineering move, not a protocol change. It works on the current MCP spec because MCP already separates discovery from invocation. What changed in 2026 is the assumption that every client should serialize the discovery results into one big system prompt.
Preloaded catalogs broke down because tool surface scaled faster than context windows did. By Q1 2026 a typical agent in production was connecting to between five and twenty MCP servers, each exposing five to fifty tools, each with a JSON Schema that is rarely under 300 tokens and often over 1,000. The arithmetic is unforgiving: ten servers with twenty tools at 500 tokens each is 100,000 tokens before the user has typed a word.
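The same arithmetic as a four-line sanity check, assuming roughly 20 tokens per index entry (in line with the index sizing discussed below):

```python
# Back-of-envelope catalog cost using the round numbers from the paragraph above.
servers, tools_per_server, tokens_per_schema = 10, 20, 500
preloaded = servers * tools_per_server * tokens_per_schema  # full schemas in context
index_only = servers * tools_per_server * 20                # ~20 tokens per index entry
print(preloaded, index_only)  # 100000 4000 -- before the user has typed a word
```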
Three failure modes show up at that scale.
The first is straightforward cost. Long inputs get charged on every turn, and tool catalogs are present on every turn, so the marginal cost of an unused tool is paid forever. Prompt caching helps but does not eliminate the bill, especially when catalogs change across sessions.
The second is context rot. The longer the input, the worse models attend to mid-context content, and tool catalogs typically sit in exactly that mid-context band. Stanford’s lost-in-the-middle work has been replicated across more than a dozen models since 2023, and the implication for tool design is the same one the agent drift literature describes: relevant tools buried under dozens of irrelevant ones get used less reliably than the same tools presented in isolation.
The third is security blast radius. The April 2026 MCP security audits reported that 43% of public MCP servers had at least one vulnerability and that 5.5% already shipped with poisoned descriptions in the wild. Preloading tool definitions means every poisoned description enters context every session, which is the worst possible substrate for prompt-injection mitigation. A pattern that loads descriptions only on use shrinks that surface significantly. It does not solve tool poisoning, but it stops amplifying it.
Progressive tool loading is usually implemented as a thin runtime layer between the agent and its MCP servers. The runtime keeps two views of the tool surface.
The first view is the index. A short list of (server, tool, one-line description) tuples, kept in context for the entire session, that the model uses to decide what to call. Indexes typically run 10 to 30 tokens per tool, an order of magnitude smaller than full schemas.
The second view is the lazy detail. Full schemas, parameter descriptions, example inputs, and any server-supplied annotations live in the runtime, not in context. When the model emits a call to a tool whose schema it has not seen, the runtime intercepts the call, fetches the schema, validates the call against it, and either executes or returns the schema for a corrected retry. After execution, the schema can either fall out of context or stay for the rest of the session depending on policy.
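A minimal sketch of that intercept loop. The `fetch_schema` and `execute` methods stand in for whatever discovery and invocation calls your MCP client exposes; they are assumptions, not real SDK names:

```python
def validate(args: dict, schema: dict) -> list[str]:
    # Toy check: required parameters present. A real runtime would run a
    # full JSON Schema validator here.
    return [f"missing required parameter: {p}"
            for p in schema.get("required", []) if p not in args]

class ProgressiveRuntime:
    def __init__(self, client, index):
        self.client = client
        self.index = index  # (server, tool, one-line description), always in context
        self.loaded = {}    # schemas fetched so far; eviction is a policy choice

    def handle_call(self, tool: str, args: dict):
        if tool not in self.loaded:
            # First use: fetch the full schema on demand.
            self.loaded[tool] = self.client.fetch_schema(tool)
        schema = self.loaded[tool]
        errors = validate(args, schema)
        if errors:
            # Hand the schema back so the model can retry with a corrected call.
            return {"error": errors, "schema": schema}
        return self.client.execute(tool, args)
```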
Four policies dominate the implementation space.
| Policy | What stays in context | When it fits |
|---|---|---|
| Strict lazy | Only the index. Schemas load and unload per call. | High tool counts, short tasks, cost-sensitive workloads |
| Sticky lazy | Index plus schemas of any tool used so far this session. | Multi-step tasks where the same tool is called repeatedly |
| Bounded sticky | Index plus an LRU of the N most recently used schemas. | The default for general-purpose agents |
| Code-mediated | Index only, with tools invoked from generated code rather than direct calls. | Highest token efficiency, requires sandbox |
Code-mediated invocation is the variant Anthropic benchmarked at the 98.7% reduction. The agent writes code that imports tools from a typed namespace, the sandbox executes that code, and only the index and the code execution result enter context. It is the most efficient because the entire intermediate trajectory (parameter selection, error handling, sub-tool calls) happens outside the model’s context window. It is also the most invasive to deploy.
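The shape of the agent-generated script, with the generated wrappers stubbed out so the sketch is self-contained. In a real deployment these functions would be typed wrappers the sandbox proxies to MCP servers; the names here are illustrative:

```python
# Stubs standing in for generated typed wrappers.
def search_entries(query: str, limit: int = 5) -> list[dict]:
    return [{"id": "e1", "score": 0.92, "summary": "Refunds within 30 days."}]

def get_entry(entry_id: str) -> dict:
    return {"id": entry_id, "summary": "Refunds within 30 days."}

# --- agent-generated script: intermediate steps never touch context ---
matches = search_entries(query="refund policy")
best = max(matches, key=lambda m: m["score"])
entry = get_entry(entry_id=best["id"])
print(entry["summary"])  # only this output returns to the model's context
```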
Bounded sticky is the variant most teams reach for first because it requires no execution sandbox and behaves close to the preloaded version on small surfaces. The tradeoff is less aggressive savings.
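The bounded sticky policy itself is a few lines around an LRU. A sketch, with the capacity default taken from the recommendation at the end of this post:

```python
from collections import OrderedDict

class BoundedStickySchemas:
    """LRU of in-context schemas for the bounded sticky policy (sketch)."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.schemas: OrderedDict[str, dict] = OrderedDict()

    def touch(self, tool: str, fetch) -> dict:
        if tool in self.schemas:
            self.schemas.move_to_end(tool)        # recently used, keep it resident
        else:
            self.schemas[tool] = fetch(tool)      # load on first use
            if len(self.schemas) > self.capacity:
                self.schemas.popitem(last=False)  # evict least recently used
        return self.schemas[tool]
```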
If your server is going to be consumed under a progressive loading runtime, two design decisions matter more than they used to.
Tool descriptions become the index. The one-line description in your tool registration is now the only thing the agent sees by default; it does the work that the full schema used to do. Vague descriptions cost calls because the agent picks the wrong tool, then has to recover. Specific, action-shaped descriptions (“search across container entries semantically and return relevance-ranked matches with provenance”) outperform generic ones (“query the container”) by a wide margin, and the gap widens under progressive loading because the schema is no longer there to disambiguate.
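Side by side at registration time, with `register_tool` standing in for whatever registration call your server framework actually uses:

```python
def register_tool(name: str, description: str) -> None:
    """Stand-in for your MCP server framework's registration call."""
    print(f"{name}: {description}")

# Too generic: under progressive loading there is no schema in context
# to disambiguate what "query" means.
register_tool("query_container", description="query the container")

# Action verb, scope, output shape -- the whole index budget.
register_tool(
    "search_entries",
    description=("search across container entries semantically and return "
                 "relevance-ranked matches with provenance"),
)
```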
Tool surfaces should be narrower per tool, not wider. The one-job-per-tool pattern we benchmarked in April produced a 24% reduction in total calls and a 7% lift in correctness on the same dataset. Under progressive loading those numbers compound, because each wrong-mode retry on an overloaded tool now also pays the cost of fetching that tool’s schema, not just the cost of the call. Mode parameters on tools were always a soft anti-pattern; progressive loading turns them into a measurable one.
Schemas should declare what is optional aggressively. When a schema does load, the model spends attention on every required parameter. Required parameters that are usually defaulted, deprecated fields kept for backward compatibility, and verbose enum lists are all attention drains that progressive loading cannot save you from once the schema is in context. The cleanest schemas in the wild after April 2026 read like API surfaces with strong defaults, not like exhaustive specifications.
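The difference, sketched as two input schemas for the same hypothetical tool (plain dicts in JSON Schema style):

```python
# Attention drain: everything required, a deprecated field still present.
exhaustive = {
    "type": "object",
    "required": ["query", "limit", "offset", "format", "legacy_mode"],
    "properties": {
        "query": {"type": "string"},
        "limit": {"type": "integer"},
        "offset": {"type": "integer"},
        "format": {"enum": ["json", "xml", "csv", "tsv", "yaml"]},
        "legacy_mode": {"type": "boolean"},  # kept for backward compatibility
    },
}

# Strong defaults: one required field, everything else optional and defaulted.
defaulted = {
    "type": "object",
    "required": ["query"],
    "properties": {
        "query": {"type": "string"},
        "limit": {"type": "integer", "default": 10},
    },
}
```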
This is also a moment to revisit the MCP 2026 roadmap. Progressive discovery and composable tool execution sit alongside stateless transport and server discovery as priorities, and the infrastructure those features unlock is what makes runtime-side progressive loading cheap. The pattern will work better on servers that opt into the discovery primitives than on servers that emulate them.
Progressive loading is not free, and the cases where it loses are worth naming clearly.
Single-server, low-tool-count agents pay for the bookkeeping without recovering enough context. A coding assistant connected to one MCP server with eight tools is best served by the preloaded catalog. The break-even is somewhere between fifteen and thirty tools depending on schema size; below that, the index plus runtime adds latency without buying back enough tokens.
Tasks with extreme tool churn can defeat sticky policies. An agent that calls a different tool on every turn forces a fetch on every turn, and the fetch latency starts to matter. Strict lazy is correct here; bounded sticky thrashes.
Latency-sensitive workloads pay an extra round trip on first use of any tool. For interactive agents this is usually invisible, but for sub-second SLAs it can be the difference between hitting and missing budget. The mitigation is server-side support for batched schema fetches, which the 2026 spec work is moving toward.
And progressive loading does not fix tool design. A poorly described tool is still a poorly described tool when its description is loaded lazily. The pattern reduces the cost of badly designed surfaces; it does not redeem them. Teams that ship progressive loading on top of overloaded tools sometimes report disappointing numbers, and the diagnosis is almost always upstream of the runtime.
If you run agents in production against more than a handful of MCP servers, three changes pay back quickly.
Audit the tool-catalog cost on a typical session. Count input tokens before the first user message. If it is over 20,000, you are in the band where progressive loading is worth implementing.
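A rough version of that audit over an exported session transcript. The 4-characters-per-token ratio is a crude heuristic; swap in your model's tokenizer for real numbers:

```python
def catalog_tokens(messages: list[dict]) -> int:
    # Count everything that precedes the first user message.
    pre_user = []
    for m in messages:
        if m["role"] == "user":
            break
        pre_user.append(m["content"])
    return sum(len(c) for c in pre_user) // 4  # heuristic, not a tokenizer

session = [
    {"role": "system", "content": "..."},  # system prompt plus tool catalog
    {"role": "user", "content": "the actual question"},
]
if catalog_tokens(session) > 20_000:
    print("in the band where progressive loading pays off")
```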
Rewrite your one-line tool descriptions assuming they are the only thing the agent will see. Action verb, scope, output shape. That is the entire budget.
Pick a sticky policy before you pick a runtime. Bounded sticky with an LRU of five to ten schemas is the right default for most agents; strict lazy is right when tool churn is high; code-mediated is right when you already have a sandbox. Choosing the policy first keeps the runtime decision from leaking into your agent code.
The pattern is small. The savings are not.
Sources: Code execution with MCP (Anthropic) · MCP’s 2026 Roadmap · State of Context Engineering 2026 (Aurimas Griciūnas) · Agentic Context Engineering (arXiv:2510.04618, ICLR 2026) · CIS MCP Security Guide (Cequence) · Context engineering as the missing layer in agentic AI (SiliconANGLE) · Progressive Disclosure MCP benchmark (Matthew Kruczek) · Lost in the Middle (Liu et al., Stanford)