Progressive tool loading is the new MCP context pattern
Key takeaway
MCP, Claude Skills, and CLI tools are three competing ways to give AI agents capabilities, and the right comparison is context cost, not preference. MCP loads every tool's full schema into the model's system prompt on every turn, which is why GitHub's official MCP server alone has been measured at tens of thousands of tokens. Skills load only a name and short description by default and pull the full instructions in on demand. CLI shifts the cost almost entirely off the agent, since the model already knows how to read `--help`. The pattern most production teams are converging on is code execution with MCP, which Anthropic measured at a 98.7 percent token reduction by lazy-loading tool definitions from a filesystem instead of mounting them upfront.
Three of the most-discussed Hacker News threads of recent months circle the same question from three different angles. David Mohl’s “I still prefer MCP over skills” landed in April with 460 points and 375 comments. Eric Holmes’s “MCP is dead, long live the CLI” hit 447 points in March. And Simon Willison’s October post arguing that Claude Skills might be a bigger deal than MCP is still actively cited in agent design debates eight months later.
All three authors argue partly from context-window cost. Willison is the most explicit, quoting “tens of thousands of tokens” for GitHub’s official MCP server against “a few dozen extra tokens” per skill. Mohl’s defense of MCP leans on the “intelligent tool discovery” that only loads tools when needed. Holmes points out that piping a Terraform plan through an MCP server means either dumping the whole thing into context or building custom filtering. The token cost shows up in every thread.
What each post does is pick one mechanism and argue for it as the answer. In practice, that is the wrong shape for the question. The mechanisms have genuinely different cost curves, and the workload determines which curve you want to sit on. A capability the agent uses on every turn is a different problem from a capability used once a session. A tool the model already knows is a different problem from a niche tool it has to discover. An action that produces a small structured result is a different problem from one that produces a fifty-thousand-token blob. The right answer is rarely “always MCP” or “always skills” or “always CLI”. It is “MCP for this, skills for that, CLI where it fits, and code execution for the heavy paths”. This is a context engineering question, not a developer-experience question, and the four approaches sit at very different points on the cost curve.
The table below summarizes the protocol baseline cost (the worst-case, every-schema-loaded-every-turn number) against what modern harnesses can do to bring it down.
| Mechanism | Where it lives | Protocol baseline per capability | What modern harnesses optimize | Who decides to load |
|---|---|---|---|---|
| MCP tools | Host system prompt | 200 to 500 tokens per tool, every turn the server is mounted | Tool search defers schemas until selected; per-session tool toggling; `tools/list_changed` for dynamic surfaces | Server author, narrowed by harness |
| Claude Skills | Agent’s filesystem | A few dozen tokens for name and description, body loaded on use | Already lazy by primitive design; little room left to optimize | The model, then the runtime |
| CLI tools | Agent’s shell | Zero for well-known CLIs in training data; hundreds to thousands of tokens per `--help` invocation for niche ones | Caching of `--help` outputs across turns; shell history as cheap reference | The model |
Two important notes on this table. First, “code execution with MCP” is not a fourth row. It is MCP designed differently: an MCP server that exposes one tool (a code runner) plus a filesystem of typed definitions the runner can import. The protocol is still MCP, but the surface in the model’s context is a single tool. That design choice sits inside the MCP row, and it is the single biggest lever within MCP. Second, the protocol baseline is the ceiling, not what most users see in practice. Modern harnesses ship aggressive optimizations: Claude Code’s built-in tool search defers MCP schemas until a tool is actually selected, Cursor lets users enable and disable specific tools per session, and the MCP spec itself includes `tools/list_changed` so servers can swap their visible surface in and out. The realistic comparison is between protocol baselines (worst case) and practical floors (after the harness does its job).
The naive MCP pattern is what produces the eye-watering numbers. Mount a server, load every tool’s full schema into the model’s system prompt on every turn, pay 200 to 500 tokens per tool definition. Willison’s measurement of GitHub’s official MCP server, reported in the post linked above, is tens of thousands of tokens for one server. A team mounting four MCP servers (filesystem, Postgres, Slack, GitHub) with twenty to thirty tools each can hit a baseline tool-schema header in the 30,000 to 60,000 token range under that pattern. That header is paid for on every model turn whether the model invokes a tool or not.
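As a quick sanity check, here is the arithmetic behind that range as a runnable sketch. The counts and per-tool token costs are the estimates quoted above, not measurements of any real server.

```typescript
// Back-of-envelope cost of the naive pattern: every schema, every turn.
const servers = 4;                 // filesystem, Postgres, Slack, GitHub
const toolsPerServer = [20, 30];   // low and high estimates
const tokensPerTool = [200, 500];  // typical full-schema size

const low = servers * toolsPerServer[0] * tokensPerTool[0];  // 16,000
const high = servers * toolsPerServer[1] * tokensPerTool[1]; // 60,000
console.log(`Schema header: ${low} to ${high} tokens, paid on every turn`);
```

The 30,000-to-60,000 figure quoted above corresponds to the upper half of this envelope, which is where a stack lands once a few of the mounted servers carry verbose schemas.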
But the practical cost depends on two layers underneath the protocol, and both can drop the number substantially. The first is the harness. Claude Code defers MCP schemas until its tool search selects one, which means the effective cost of a mounted but unused tool is close to zero. Cursor lets users enable and disable specific tools per session, so the schemas the model sees are scoped to the active workflow. The MCP spec itself ships `tools/list_changed`, which lets servers expose different tool subsets at different moments. The “30,000 token header” is a worst case that disappears when any of these are in play.
The second layer is server design. The protocol has three primitives: tools, resources, and prompts. Most servers ship tools-only, because the SDK quickstarts encourage it and the function-calling mental model from earlier LLM APIs maps cleanly onto it. The result is what people mean when they say MCP causes tool sprawl: flat lists of fine-grained tools, all loaded together, none of them resources. A server that exposes one tool, a code runner, plus a filesystem of typed tool definitions, has the same protocol underneath but a radically different cost profile. Anthropic published a benchmark on this exact reframing: a workflow that consumed 150,000 tokens with direct tool calls dropped to 2,000 tokens when the same tools were exposed as TypeScript modules in a filesystem the agent navigated programmatically. That is a 98.7 percent reduction, achieved within MCP, by designing the server differently.
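Concretely, the lazy-loaded surface can look like the sketch below. The directory layout, the `callTool` shim, and the tool name are illustrative assumptions rather than any particular SDK’s API; what matters is that the typed definition lives on disk until imported, and the raw result set never transits the model’s context.

```typescript
// servers/github/listPullRequests.ts — one typed tool definition on disk.
// `callTool` is a hypothetical harness-provided proxy that forwards a
// single MCP tool call over the wire; nothing here costs tokens until
// the agent chooses to import this file.
import { callTool } from "../../runtime"; // assumed harness shim

export interface PullRequest {
  number: number;
  title: string;
  state: "open" | "closed";
}

export function listPullRequests(repo: string): Promise<PullRequest[]> {
  return callTool("github.list_pull_requests", { repo }) as Promise<PullRequest[]>;
}
```

The agent then writes and runs a short script against that module inside the sandbox:

```typescript
// The raw PR list (potentially tens of thousands of tokens) stays in the
// sandbox; only the one-line summary printed below reaches the model.
import { listPullRequests } from "./servers/github/listPullRequests";

const prs = await listPullRequests("acme/widgets");
const open = prs.filter((pr) => pr.state === "open");
console.log(`${open.length} open PRs, newest: #${open[0]?.number} ${open[0]?.title}`);
```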
The honest framing of MCP, then, is that the protocol’s token cost is the product of three things: the protocol baseline, what the harness does on top, and how the server author shaped the surface. Naive use of all three produces the numbers Willison reported. Aggressive use of all three lands close to skills.
Skills are markdown files with a small YAML header. The base entry the model sees is little more than the skill’s name and a one-line description, which Willison estimates at a few dozen tokens per skill. Only when the model decides a skill is relevant to the current request does the runtime load the full markdown body and any bundled scripts.
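For concreteness, a skill is just a folder whose SKILL.md carries that YAML header; the example below is invented for illustration. The two frontmatter fields are all the model sees upfront, and everything below the divider loads only when the skill is invoked.

```markdown
---
name: changelog-writer
description: Drafts a changelog entry from merged pull requests. Use when the user asks for release notes or a changelog.
---

# Changelog writer

1. List merged PRs for the release range: `gh pr list --state merged --json number,title`.
2. Group entries by label (feature, fix, docs).
3. Draft the entry in Keep a Changelog format and show it before writing the file.
```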
This is progressive disclosure at the capability layer rather than the tool layer. You can have a hundred skills installed and pay roughly nothing for the ninety-nine you do not use this turn. The same agent loaded with a hundred MCP tools would pay full schema cost for all hundred, every turn.
The tradeoff is real and worth naming. Skills are richer than tools (they can ship scripts, instructions, examples) but the model has to pick the right one based only on the short description. A long tail of skills with similar descriptions becomes a discovery problem that MCP’s flat tool list does not have, because in MCP the model always sees full schemas. For workflows where the model only needs a given capability ten percent of the time, skills win on context economy by a wide margin. For capabilities used on every turn, the per-load overhead of fetching the full skill body each time narrows the advantage.
The CLI argument is the most aggressive form of lazy loading, but only under two conditions that the original arguments tend to skip. The first is that the model already has the CLI in its training data. For mature, widely used tools (`gh`, `kubectl`, `aws`, `git`, `jq`, `psql`), the model arrives knowing the flag conventions, common subcommands, and output shapes, and the baseline cost really is close to zero. For niche or recently released CLIs, the model has to learn the surface by running `tool --help`, then often `tool subcommand --help`, then often again on whatever sub-subcommand it discovers. That discovery process is itself an iterative tool-use loop, and the help text it pulls in can run hundreds to thousands of tokens per invocation. A niche CLI is not free; it just defers its cost to first use.
The second condition is that the CLI is actually available to the agent. CLIs are binaries that have to be installed, and the agent needs shell access to invoke them. That is a meaningfully larger deployment requirement than mounting an MCP server over HTTP. It implies a host the agent can write to and execute on, a sandbox around that host that prevents the agent from doing arbitrary damage, an auth state already provisioned on the machine, and a way to keep all of that consistent across sessions. None of that is impossible (Claude Code, Cursor agent mode, and the various coding-agent products solve it routinely) but it is a heavier substrate than a remote MCP endpoint. For agents that do not have a persistent shell, CLI is not on the menu at all.
When those conditions hold, Holmes’s argument is that the model’s existing CLI priors cover ninety percent of what an MCP server abstracts, with three concrete advantages MCP cannot replicate easily.
First, debuggability. When an agent does something unexpected with `gh pr view`, a human can run the same command and get the same output. The model and the operator share an interface. With MCP, the same situation involves spelunking JSON-RPC transport logs.
Second, composability, in a form that is convenient rather than unique. CLIs pipe naturally. The agent can run `kubectl get pods -o json | jq '.items[].status'` and let the shell do the filtering work that a naive MCP path would force into the model’s context. This is the same insight that drives code execution with MCP, where the script runs in a sandbox and returns only the summary, and the same insight that drives subagent architectures, where a sub-thread with its own context window does the heavy work and reports back. All three patterns achieve the same outcome: intermediate state stays out of the main agent’s context. CLI gets this for free because shells already exist. Code-execution MCPs and subagents get it via deliberate design. The CLI’s advantage here is ergonomic, not architectural.
Third, authentication is already provisioned, with significant caveats. The CLI’s auth (AWS profiles, kubectl contexts, GitHub tokens) already exists and works the same whether a human or an agent is invoking the binary. That is convenient. It is also overpermissive by default. The agent inherits the human operator’s full credential scope, including any permissions the current workflow does not need, and downstream audit logs attribute every action to the human rather than the agent. MCP’s per-server auth is more setup work, but the protocol’s authorization model lets the host issue narrowed credentials per integration, scope tokens to specific tools, and preserve agent attribution in audit trails. The CLI “just works” for capability and tends to “fail open” for governance: convenient in dev, harder to defend in a SOC 2 audit. This is the same problem we have covered in why agents have too much access.
CLI’s weaknesses are symmetric to its strengths. Without a standardized surface, every tool is its own discovery problem, and the cost of that discovery scales with how well-represented the tool is in the model’s training data. The “you have a shell” model is hard to sandbox properly, and the inherited-credentials problem named above shows up again at the host level: an agent with a shell that can read a human’s auth files in `$HOME` has, by default, everything that human has.
Once harness optimization and server design are factored in, the protocol-level gap between MCP, skills, and CLI narrows considerably for any given workload. A well-designed MCP server using the code-execution pattern can rival skills’ lazy-load profile. A skill packed with a long markdown body and a dozen bundled scripts can be heavier than a single tool. A niche CLI the model has to discover from scratch can be more expensive than either. The dominant source of variance is not which protocol you pick; it is how the server is designed and how aggressively the harness optimizes.
That reframes the HN debate. Each thread is implicitly arguing from an unrepresentative example.

Willison’s piece is a skills-versus-MCP-as-protocol argument, and the token-cost evidence he leans on is GitHub’s official MCP server at “tens of thousands of tokens.” But GitHub MCP is a naive wrapper around the `gh` CLI: a thin protocol layer that inherits the full surface of a 200-command CLI without using any of MCP’s own design levers (resources, code execution, scoped tool exposure). It is one of the worst MCP examples to extrapolate from. The underlying capability is `gh` itself, one of the most-trained-on, best-tested CLIs in existence, so a well-designed MCP server in that domain would mostly be re-inventing what the model already knows. Treating GitHub MCP’s token bill as evidence that “MCP is expensive” generalizes from an unrepresentative worst case. (CLI shows up in Willison’s argument as a thing skills can lean on, since LLMs know `cli-tool --help`, rather than as a direct competitor.)

Holmes’s CLI piece runs into the symmetric problem from the other side: he compares against the same naive MCP baseline, and the CLIs he names (`gh`, `kubectl`, `aws`, `jq`) are all mature, in-training-data tools that have benefited from years of public help text. That is CLI at its best against MCP at its worst, with no examination of what happens when the underlying CLI is niche or when MCP is designed using its own primitives properly.

Mohl’s defense of MCP wins ground back when he factors in harness-level tool gating, but he is comparing harness-optimized MCP against unoptimized skills. All three authors are describing real workloads accurately, and they are correct about the answer for each workload. The mistake is generalizing from one workload to the verdict that one mechanism is the right default everywhere.
A related thread runs through the Recursive Language Models paradigm from Prime Intellect, which lets a model delegate context to Python sub-environments and sub-LLM calls. The point is not that RLM replaces MCP. The point is that the mechanism for capability delivery matters less than whether the substrate lets the model leave intermediate state out of its context and pull in only what it needs. The principle is the same one this piece keeps returning to: “one job per tool” at the design layer, and “the harness is doing real work” at the runtime layer.
The right way to decide is to look at the workload and pick the mechanism whose cost curve fits it. A few rules of thumb hold up well in practice.
- If the capability is used on most turns by an agent running in a modern harness, MCP is fine: the per-turn schema cost either amortizes against per-turn use or gets gated away by tool search.
- If the capability is used occasionally and the host is a Claude-family runtime, prefer skills: they were built lazy for exactly that pattern.
- If the agent has a persistent sandboxed shell and the tool is mature enough to be in the model’s training data, prefer the CLI: the model’s priors do the discovery work for free. If the tool is niche or the agent has no shell, CLI falls off the table.
- If the workload involves large intermediate results, prefer an MCP server designed around code execution: the filtering cost should not pass through the model.
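Those rules compress into something like the toy function below. Every field name is invented, and the ordering is one editorial judgment (large intermediates checked first, since that cost should not pass through the model regardless of frequency); treat it as a mnemonic, not a router.

```typescript
// Toy encoding of the rules of thumb above.
interface Workload {
  largeIntermediateResults: boolean; // multi-thousand-token blobs in flight
  usedOnMostTurns: boolean;          // capability fires nearly every turn
  claudeFamilyHost: boolean;         // runtime supports skills natively
  persistentSandboxedShell: boolean; // agent has a durable shell
  toolInTrainingData: boolean;       // mature CLI the model already knows
}

function chooseMechanism(w: Workload): string {
  if (w.largeIntermediateResults) return "MCP designed around code execution";
  if (w.usedOnMostTurns) return "MCP (amortized, or gated by tool search)";
  if (w.claudeFamilyHost) return "Skills";
  if (w.persistentSandboxedShell && w.toolInTrainingData) return "CLI";
  return "MCP with a narrow, well-designed tool surface"; // assumed default
}
```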
Underneath all of those rules, the same lesson keeps surfacing: server design and harness optimization dominate protocol choice. A team using MCP with code execution and Claude Code’s tool search will pay less than a team using skills naively. A team using skills with focused, short bodies will pay less than a team using flat-list MCP without any harness gating. The protocol is one variable among three, and not the largest.
Wire is an example of design choices that make the protocol cost stable. Each Wire container exposes exactly five MCP tools (`wire_explore`, `wire_navigate`, `wire_search`, `wire_write`, `wire_delete`) regardless of how many files or entries the container holds. The tool-schema cost scales with the number of containers, not with the content they hold, so a container with ten thousand entries imposes the same per-turn tool overhead as a container with ten. Container content lives behind the tools rather than inlined as schemas, which is the same logic Anthropic measured at 98.7 percent savings, applied at the service boundary. The relevant comparison is on agent efficiency, not on protocol preference.
The HN debate will keep running, and the threads will keep advocating for one mechanism over another. The more useful read is to treat each post as a description of the workload its author actually has, not as a universal recommendation. MCP, skills, and CLI all win some workloads. None of them wins every workload. Measure the baseline, account for the harness, pick by cost.
Sources: I still prefer MCP over skills · MCP is dead, long live the CLI · Claude Skills are awesome, maybe a bigger deal than MCP · Effective context engineering for AI agents (Anthropic) · Code execution with MCP (Anthropic) · Recursive Language Models (Prime Intellect)