Routing is not reasoning: what a 26M-parameter tool caller means for MCP context budgets
Key takeaway
Cactus Compute released Needle, a 26-million-parameter model distilled from Gemini that handles tool calling at 6,000 tokens per second prefill and 1,200 tokens per second decode on consumer devices. The team's thesis is that tool calling is fundamentally retrieval and assembly, not reasoning, which is why an architecture with no feed-forward layers in the encoder can do the job at less than 1% of a frontier model's parameter count. For MCP and context engineering, the implication is that the router and the reasoner should be different models, with a small router taking tool selection, and its token cost, out of the working set the big model has to attend to.
A team at Cactus Compute distilled Gemini’s tool-calling ability into a 26-million-parameter model called Needle, open-sourced the weights under MIT, and posted it to Hacker News, where it climbed to 557 points and 157 comments within a day. The model runs at 6,000 tokens per second prefill and 1,200 tokens per second decode on consumer devices. The headline claim is that it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calls, while using less than 1% of a frontier model’s parameter count.
The reason a model that small can do this is not a training trick. It is the thesis the team states outright: tool calling is fundamentally retrieval and assembly, not reasoning. If that framing holds, it changes how Model Context Protocol servers should be structured, how agent context budgets should be allocated, and which part of an agent stack actually needs to be expensive.
Needle is a 26M-parameter encoder-decoder with 12 encoder layers and 8 decoder layers. The encoder uses self-attention with rotary position embeddings. The decoder uses masked self-attention plus cross-attention back to the encoder. The notable architectural choice is what is missing: the feed-forward networks normally interleaved with attention in transformer blocks were removed from the encoder. One HN commenter summarized this as “attention is actually all you need,” and the framing is fair. When the job is to map an input to one of a finite set of named tools and extract structured arguments, the model does not need parameters to memorize world knowledge. It needs parameters to compute matches.
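To make the architectural point concrete, here is a minimal sketch of an attention-only encoder block in PyTorch. The dimensions are illustrative, and rotary position embeddings are omitted for brevity; this is the shape of the idea, not Needle's released code.

```python
import torch
import torch.nn as nn

class AttentionOnlyEncoderBlock(nn.Module):
    """Encoder block with no feed-forward sublayer: attention plus residual.

    Illustrative sizes only; rotary position embeddings are omitted
    (nn.MultiheadAttention does not apply them out of the box).
    """

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Absent by design: the nn.Linear -> activation -> nn.Linear
        # feed-forward sublayer a standard transformer block carries.

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)                                 # pre-norm
        out, _ = self.attn(h, h, h, need_weights=False)  # self-attention
        return x + out                                   # residual only
```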
Training ran in two phases. Pretraining used 200 billion tokens on 16 TPU v6e units for 27 hours. Post-training used 2 billion tokens of synthetic single-shot function-calling data, generated by a larger model. The post-training corpus is open-sourced alongside the weights, which is unusual and worth attention: most function-calling fine-tunes ship the model and keep the dataset proprietary.
The model supports 15 tool categories including timers, messaging, navigation, and calendar entries. The team is explicit that it does not support multi-turn conversation, long-context reasoning, or workflows where state accumulates across calls. It is single-shot, which is the simplest version of tool calling, and where the retrieval framing is cleanest.
A function call decomposes into three sub-tasks. First, match the user’s intent to the right tool name. Second, extract the argument values from the user’s request. Third, format those arguments in the schema the tool expects. None of these steps require generating a chain of thought. They require finding the right tool and assembling the right call.
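A hedged sketch of those three steps on one hypothetical request; the tool name, the argument schema, and the regex extraction are all illustrative, not Needle's mechanism.

```python
import json
import re

request = "set a timer for 10 minutes"

# 1. Match intent to a tool name (a choice over a finite catalog).
tool = "set_timer"

# 2. Extract the argument values from the request.
minutes = int(re.search(r"(\d+)\s*minutes?", request).group(1))

# 3. Format the arguments in the schema the tool expects.
call = json.dumps({"name": tool, "arguments": {"duration_minutes": minutes}})
```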
This is structurally retrieval, not generation. The model is choosing from a fixed set of tools, and the choice can be evaluated by exact match: was the right tool called with the right arguments, yes or no. Compare this to summarizing a document, writing a function, or planning a sequence of actions, where there is no single correct output and the model has to compose a response token by token from a much larger space.
A retrieval task does not need a model with deep world knowledge or strong generation capabilities. It needs a model that can attend to the right parts of the input and map them to the right output structure. That is what Needle’s architecture optimizes for, and that is why the parameter count can drop two orders of magnitude without the headline accuracy collapsing.
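A toy version of the routing step makes the retrieval framing tangible. The lexical-overlap scorer below is a deliberately crude stand-in for a learned matcher, and the three-tool catalog is hypothetical; the structural point is that the output is a choice over a finite set, gradable by exact match.

```python
# Hypothetical catalog: tool name -> one-line description.
TOOLS = {
    "set_timer": "start a countdown timer for a duration",
    "send_message": "send a text message to a contact",
    "add_calendar_event": "create a calendar entry with a title and time",
}

def route(user_message: str) -> str:
    # Score each tool by word overlap with the request; a learned model
    # computes a much better match, but the output type is the same:
    # one name out of a fixed set.
    words = set(user_message.lower().split())
    def score(description: str) -> int:
        return len(words & set(description.split()))
    return max(TOOLS, key=lambda name: score(TOOLS[name]))

assert route("set a timer for ten minutes") == "set_timer"
```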
The implication is that the part of an agent stack that does tool calling has been overspecified for years. Sending every function-call decision through a frontier model spends billions of parameters of capacity on a job that millions of parameters can do, and pays the difference on every call.
The HN thread on Needle surfaced the limits clearly. One commenter prompted the model with “I need to contact my boss I will be late” and got a timer call back. Another asked whether the model could disambiguate “let’s catch up at coffee tomorrow at 10:00” between an add-appointment tool and a reminder tool when the tool surface has hundreds of options. A third pointed out that single-shot tool calling falls apart once workflows become multi-step and state accumulates across calls.
These are not bugs in Needle specifically. They are the boundary where retrieval-and-assembly stops being enough. When the user’s intent is ambiguous, deciding between tools requires modeling the user’s goal, which is reasoning. When state accumulates across calls, deciding what to do next requires modeling the conversation history, which is also reasoning.
The correct conclusion is not “small models can replace large ones for agents.” The correct conclusion is that an agent stack has two structurally different jobs: routing the next action to the right tool, and reasoning about what action is the right one given the goal and the state so far. Today most stacks collapse these jobs into one model. Needle is evidence that the routing job can be peeled off and given to a small specialized model, leaving the reasoning model to do less and do it on a smaller working set.
The framing also clarifies why function-calling benchmarks have been such an unreliable proxy for agent quality. A model that scores well on a single-shot function-call benchmark is being measured on the routing task in isolation, with intent already unambiguous and no prior state. That benchmark grades retrieval, not reasoning. When the same model is dropped into a real agent loop, the part that fails first is not tool selection but goal modeling: choosing the right tool given a vague request, deciding when to stop calling tools, and recognizing that the user’s previous answer changed the plan. Needle’s success on its target benchmark and the HN failure cases are consistent with each other once the two jobs are kept distinct.
The static context cost of an MCP-connected agent is dominated by tool definitions. Anthropic’s progressive tool loading work reports input tokens dropping from roughly 150,000 to 2,000 when tool definitions are deferred until needed. Most of that 150,000 is the full tool catalog in the system prompt, sitting in the context window for every single inference even when the current step needs only one or two tools.
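A sketch of the deferral idea, under stated assumptions: the registry layout, function names, and the wire_search schema below are hypothetical, not Anthropic's implementation. Full definitions live outside the prompt and enter the context only when a tool is actually selected.

```python
import json

# Full definitions live outside the prompt (for example, on the MCP server).
FULL_SCHEMAS = {
    "wire_search": {
        "description": "Search containers for passages matching a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    # ...the rest of the catalog can grow without touching the prompt.
}

def static_context() -> str:
    # The prompt carries only a name and a one-line summary per tool.
    stubs = {name: schema["description"] for name, schema in FULL_SCHEMAS.items()}
    return json.dumps(stubs)

def load_tool(name: str) -> str:
    # The full schema enters the context only for the tool actually chosen.
    return json.dumps({name: FULL_SCHEMAS[name]})
```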
Separating routing from reasoning attacks the same problem from a different angle. If a small router model owns tool selection, the frontier reasoning model never sees the full tool catalog at all. The router’s working set is the catalog plus the current user message. The reasoner’s working set is the conversation history plus the result of the tool call. Neither working set is the union of both, which is what frontier-model-as-router stacks pay for today.
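In code, the separation looks roughly like this. All three callables are hypothetical stand-ins (a Needle-class local model, a tool runtime, a hosted frontier model); what matters is the shape of the two working sets.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    arguments: dict

def small_router(catalog: dict[str, str], message: str) -> ToolCall:
    # Stand-in for a Needle-class local model. Its working set is the
    # tool catalog plus the current message, nothing else.
    return ToolCall(name=next(iter(catalog)), arguments={"input": message})

def execute(call: ToolCall) -> str:
    # Stub tool runtime.
    return f"<result of {call.name}>"

def frontier_reasoner(history: list[str], tool_result: str, message: str) -> str:
    # Stand-in for a hosted frontier model. Its working set is the
    # conversation history plus the tool result; it never sees the catalog.
    return f"reply composed from {tool_result} and {len(history)} prior turns"

def run_turn(catalog: dict[str, str], history: list[str], message: str) -> str:
    call = small_router(catalog, message)
    result = execute(call)
    return frontier_reasoner(history, result, message)
```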
Wire’s MCP surface ships five single-purpose tools (wire_search, wire_navigate, wire_write, wire_delete, wire_analyze), and a 64-question benchmark on that surface showed total tool calls drop 24% versus an overloaded design with mode parameters, with token spend dropping 20% on the same task set. The result compounds with router-style separation, because the routing layer no longer has to disambiguate between tools that overlap in description. One job per tool is the structural property that makes both human-readable schemas and small-router-readable schemas work; the test for a well-designed MCP surface is whether a 26M-parameter model could pick the right tool from the catalog without seeing the full reasoning context.
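The contrast is easy to see in schema form. The overloaded design below is hypothetical and the single-purpose descriptions are paraphrased; only the five wire_* names come from the benchmark.

```python
# Hypothetical overloaded design: one tool, a mode switch, and a payload
# whose meaning changes with the mode.
overloaded = {
    "name": "wire_tool",
    "description": "Search, navigate, write, delete, or analyze containers.",
    "parameters": {
        "mode": {"enum": ["search", "navigate", "write", "delete", "analyze"]},
        "payload": {"type": "string"},  # interpretation depends on mode
    },
}

# Single-purpose design: one verb per tool, nothing for a router to untangle.
single_purpose = [
    {"name": "wire_search", "description": "Search containers for a query."},
    {"name": "wire_write", "description": "Write content to a container."},
    # wire_navigate, wire_delete, wire_analyze follow the same pattern.
]
```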
For context engineering, this is a clean specialization. The router pays a small fixed cost for routing. The reasoner pays a variable cost for the part of the work that actually needs a frontier model. The total cost of an agent run drops not because any individual call is cheaper, but because the calls are sized to the job they are doing.
The Needle result is not a recommendation to replace your agent’s model. Needle is single-shot, English-only by default, and explicitly does not support multi-turn workflows. What it does offer is a hypothesis you can test on your own stack right now.
Audit your MCP tool surface for ambiguity. If two tools have overlapping descriptions, or one tool has a mode parameter that controls fundamentally different behavior, those are the places a small router would fail first. They are also the places a large model is currently spending tokens disambiguating. Both audiences benefit from the same fix.
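One cheap way to run that audit, assuming a catalog of name-to-description pairs; the Jaccard threshold is a guess to tune, and embedding similarity would be a drop-in upgrade.

```python
from itertools import combinations

def overlap(a: str, b: str) -> float:
    # Jaccard similarity over content words: crude but directionally useful.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def audit(catalog: dict[str, str], threshold: float = 0.5) -> list[tuple]:
    # Flag tool pairs whose descriptions a small router would struggle
    # to tell apart.
    flagged = []
    for (n1, d1), (n2, d2) in combinations(catalog.items(), 2):
        s = overlap(d1, d2)
        if s >= threshold:
            flagged.append((n1, n2, round(s, 2)))
    return flagged
```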
Then look at where your agent spends its tokens. If the system prompt carries a tool catalog larger than the user’s current message, you are paying frontier-model rates for a job a smaller model can do. Progressive tool loading, layered router-reasoner architectures, and disciplined tool design are all variations on the same insight: not every part of agent work needs the same amount of capacity. The Needle paper is a public proof that the routing layer specifically can be made much smaller without breaking, when the tool surface it routes against is clean.
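A quick way to measure that imbalance; tiktoken and the cl100k_base encoding are assumptions here, and any tokenizer gives the same signal.

```python
import tiktoken

def catalog_to_message_ratio(catalog_json: str, user_message: str) -> float:
    # A ratio above 1 means the static tool catalog outweighs the live
    # request in every inference, which is the imbalance to fix.
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(catalog_json)) / max(1, len(enc.encode(user_message)))
```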
The thing to watch over the next quarter is whether router-class small models start shipping inside production agent frameworks as a default. The economic argument is strong, the architectural argument is now public, and the open weights remove the integration barrier. The remaining work is on the MCP server side, where ambiguous tool surfaces still force the reasoning model to step in. Cleaning that up is where most agent cost savings live in 2026.
Sources: Needle on GitHub · HN discussion: Show HN: Needle · Anthropic: progressive tool loading and code-execution-with-MCP · Wire MCP tool benchmark