Why MCP failures are a context engineering problem
Key takeaway
Conventional tool-design advice says that adding a retrieval tool to an MCP server should increase decision overhead and tool calls. Our 64-question benchmark on the same container, same model, and same prompt found the opposite: splitting one overloaded tool into three single-purpose tools raised correctness by 7%, cut total tool calls by 24%, and reduced token spend by 20%. The mechanism is that one job per tool removes mode-selection retries and lets agents traverse from a match instead of re-searching.
The standard advice for designing an MCP server is to keep the tool surface small. Fewer tools, fewer decisions, less overhead. Power them with mode or action parameters so one tool can cover several jobs.
Our benchmark data says that advice is wrong when the modes do different enough things.
In April 2026 we restructured the retrieval tools on every Wire container from two overloaded tools to three single-purpose ones and added a typed provenance object to every search result. We then replayed the same 64-question benchmark we have used since March against the same container, with the same model, the same prompt, and the same judge. The only variable was the tool surface.
The three-tool surface produced:

- correctness up 7% (4.47 to 4.78 on a 1-to-5 scale)
- total tool calls down 24% (199 to 152 across 64 questions)
- token spend down 20% (13,800 to 11,014 average tokens per question)
The agent made fewer calls, used fewer tokens, and answered more questions correctly. Adding a tool tightened the trajectory instead of expanding it. The rest of this post unpacks why that happened and what it suggests for anyone designing an MCP server for an AI agent.
More tools on an MCP server should increase decision overhead. That is the default assumption in most tool-design advice: an agent with ten tools deliberates more than an agent with three, and that deliberation costs reasoning tokens and turns. The conservative move is to consolidate several related jobs behind one tool with a mode or action parameter.
The old Wire retrieval surface followed that advice. wire_explore discovered the container schema. wire_search did everything else through a mode parameter: list, get, filter, text search, semantic search. Two tools, one of them overloaded five ways.
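As a concrete sketch of that shape, the old surface amounts to two MCP tool definitions, one of them carrying a five-value mode enum. The tool names and modes come from this post; every other field below is an assumption for illustration, not Wire's actual schema:

```python
# Hypothetical reconstruction of the old two-tool surface.
OLD_TOOLS = [
    {
        "name": "wire_explore",
        "description": "Discover the container schema.",
        "inputSchema": {"type": "object", "properties": {}},
    },
    {
        "name": "wire_search",
        "description": "Retrieve content from the container. Pick a mode.",
        "inputSchema": {
            "type": "object",
            "properties": {
                # One parameter, five jobs: the overload described above.
                "mode": {"enum": ["list", "get", "filter", "text", "semantic"]},
                "query": {"type": "string"},
            },
            "required": ["mode"],
        },
    },
]
```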
The failure mode we observed in practice was not deliberation cost. It was mode-selection error. The agent would call wire_search with the wrong mode, get back a result shaped unexpectedly, and issue another call to try again. One semantic question answered with list-mode output. One list query answered with a relevance-ranked blob. Each wrong-mode call burned a turn and accumulated conversation history that the next turn had to carry.
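A toy model of that cost, with the retry behavior simplified to "guess until the mode fits" (everything here is illustrative, not measured from the benchmark):

```python
def turns_used(mode_guesses, correct_mode):
    """Count turns until the agent lands on the mode that fits.

    Each wrong-mode guess burns a full turn: a tool call, a result the
    agent must inspect, and more history the next turn has to carry.
    """
    turns = 0
    for guess in mode_guesses:
        turns += 1
        if guess == correct_mode:
            return turns
    return turns  # never found it: every guess was a burned turn

# One wrong guess doubles the retrieval cost of the question.
assert turns_used(["text", "semantic"], "semantic") == 2
assert turns_used(["semantic"], "semantic") == 1
```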
When we split the surface, the working hypothesis was narrower than “more tools are better.” It was that mode parameters on retrieval tools conflate distinct jobs, and that separating those jobs would reduce wrong-mode retries enough to offset any deliberation cost from having more tools to choose between.
We ran the same fixture against both surfaces. Here is the full result set.
| Metric | 2 tools (old) | 3 tools (new) |
|---|---|---|
| Correctness (1 to 5) | 4.47 | 4.78 |
| Completeness (1 to 5) | 3.91 | 3.88 |
| Faithfulness (1 to 5) | 5.00 | 5.00 |
| Couldn’t answer | 5 of 64 | 2 of 64 |
| Avg turns per question | 3.35 | 3.03 |
| Avg total tokens per question | 13,800 | 11,014 |
| Total tool calls across 64 questions | 199 | 152 |
The lift was not evenly distributed across question types. Breaking it down by tier is where the mechanism becomes visible.
| Question tier | 2 tools correctness | 3 tools correctness |
|---|---|---|
| T1 factual recall (22 questions) | 4.32 (3 failed) | 5.00 (0 failed) |
| T2 conceptual (22 questions) | 4.73 | 4.77 |
| T3 cross-document (20 questions) | 4.35 | 4.55 |
Factual recall went from 4.32 with three unanswerable questions to a perfect 5.00 with zero. Cross-document synthesis improved from 4.35 to 4.55. Conceptual questions, which were already close to ceiling, moved slightly. The full methodology and run conditions are documented in Wire’s agent efficiency benchmark.
The headline finding is not that the agent answered more accurately. It is where the accuracy came from and what it cost.
The hypothesis going in accounted for one mechanism: removing wrong-mode retries. The data shows two more that we had not modeled.
Factual questions were the tier where mode selection went most wrong under the old surface. A factual question like “what year did X happen in episode 142” has a clear right answer that lives in one chunk of one source. Under the old wire_search, the agent sometimes reached for mode text when mode semantic would have worked better, or vice versa. When the first call returned a poorly matched result, the agent retried with a different mode, burning a turn.
Under the new surface, wire_explore owns keyword text search over classified entities and wire_search is exclusively hybrid semantic search over raw content. There is no retrieval mode to pick wrong. About 77% of wire_explore calls in the new run used its text mode, which the agent discovered on its own from the tool description without any prompt nudging. Factual recall moved from 4.32 to 5.00, and from three unanswered questions to zero, which is exactly what you would expect if a chunk of the old error was wrong-mode retries.
Fewer retries also means shorter conversation histories. Liu et al.’s Lost in the Middle finding (Stanford and Berkeley, 2023) shows that transformer models attend most strongly to the start and end of context and degrade on information placed in the middle. Every wrong-mode retry lengthens the context the next turn has to reason over, pushing earlier results into the attention-sparse middle. Cutting retries saves tokens, but it also keeps the content the agent actually needs inside the part of the context window it pays attention to.
The new tool, wire_navigate, takes a search result and expands around it. It can pull adjacent chunks from the same source, the full source the chunk came from, or related entries reachable through typed relationship edges: things that elaborate on the match, corroborate it, contradict it, or supersede it. That last point matters more than it sounds. The agent is not just pulling “related content,” it is pulling the epistemic relationship between pieces of content, so it can tell whether the second chunk reinforces the first or overrides it before committing to an answer. Cross-document synthesis, the tier most likely to benefit from traversal, improved from 4.35 to 4.55.
The mechanism is mechanical rather than magical. Without wire_navigate, every follow-up to a good match had to be another full semantic search, which meant new embeddings over the query, new retrieval over the corpus, and a new result set to reconcile with the one the agent already had. With wire_navigate, the agent reuses the match and asks “what is next to this” or “what else is like this” without paying for another full retrieval pass. The navigate calls split roughly evenly between siblings mode (adjacent chunks) and source mode (full source file). The agent was not guessing. Every wire_search match now includes hints that surface what is available for this specific chunk: whether siblings exist, how many, and which relationship edge types (elaborates, corroborates, contradicts, supersedes) connect it to other entries. That removes the black box of what is searchable as a vector, what is accessible as plain text, and what connects to what, and lets the agent choose between the topical graph (semantic neighbors) and the epistemic graph (typed relationships) based on the shape of the question.
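The traversal pattern can be sketched in a few lines over an in-memory corpus. Only the three expansion kinds (siblings, source, related) come from this post; the chunk and edge field names are assumptions:

```python
# Minimal sketch of traversal from a match already in hand.
def navigate(match, chunks, edges, mode):
    src, idx = match["source"], match["chunk_index"]
    if mode == "siblings":
        # Adjacent chunks in the same source, no new retrieval pass.
        return [c for c in chunks
                if c["source"] == src and abs(c["chunk_index"] - idx) == 1]
    if mode == "source":
        # Every chunk of the source, in document order.
        return sorted((c for c in chunks if c["source"] == src),
                      key=lambda c: c["chunk_index"])
    if mode == "related":
        # Follow typed relationship edges out of this match.
        return [e for e in edges if e["from"] == match["id"]]
    raise ValueError(f"unknown mode: {mode}")
```

The point of the sketch is the cost profile: each branch is a lookup over data already in hand, not a fresh embedding and retrieval pass over the corpus.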
Every match in the new surface comes back with a typed provenance object: source file, chunk index, total chunks, ingestion time, tags, section header, chunk summary, and a _meta block that states upfront whether navigate will find siblings or relationship edges for this match. Relationship edges carry an edgeType like elaborates, contradicts, or corroborates, and an edgeDirection so the agent can tell whether “X contradicts my chunk” or “my chunk contradicts X.”
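Written out as types, the contract looks roughly like this. The field list mirrors the provenance object described above; the Python types and snake_case names are assumptions, not Wire's wire format:

```python
from dataclasses import dataclass, field

@dataclass
class RelationshipEdge:
    edge_type: str       # "elaborates" | "corroborates" | "contradicts" | "supersedes"
    edge_direction: str  # distinguishes "X contradicts my chunk" from the reverse
    target_id: str

@dataclass
class Provenance:
    source_file: str
    chunk_index: int
    total_chunks: int
    ingested_at: str               # ingestion time, e.g. ISO 8601
    tags: list[str]
    section_header: str
    chunk_summary: str
    meta: dict = field(default_factory=dict)  # the "_meta" block: whether
                                              # navigate will find siblings
                                              # or relationship edges
```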
Faithfulness stayed at 5.00 in both runs, so provenance did not move that score in our fixture. We read that as a property of the harness rather than an absence of signal: the prompt strictly required answers be drawn from retrieved content in both runs. Where typed provenance likely helps is further upstream, in letting the agent frame an answer around a specific chunk at a specific position in a specific source ingested at a specific time. It gives the agent the shape it needs to stay grounded without forcing the server to make editorial judgments about trustworthiness.
Provenance is structural metadata, not a quality verdict. The judgment that matters for an agent is origin, position, and connection, not whether the content is “good.”
The benchmark is a single data point on one container with one model. It is not a universal law. But it points toward a few design rules that fall out of the mechanism, not just the numbers.
When you find yourself adding a mode or action parameter to a tool, treat that as a signal that the tool is doing more than one job. If the modes behave differently enough that an agent could call the wrong one, they are different tools. In the Wire case, list, get, filter, text, and semantic collapsed into three verbs: discover (wire_explore), retrieve (wire_search), and traverse (wire_navigate). Each verb maps to one tool.
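One way to operationalize that signal is a cheap lint over your tool schemas: flag any tool whose mode or action enum has grown past a couple of values. The threshold and parameter names here are assumptions; the smell it detects is the one described above:

```python
def has_overloaded_mode(tool: dict, threshold: int = 2) -> bool:
    """Flag a tool whose mode/action enum covers more jobs than one tool should."""
    props = tool.get("inputSchema", {}).get("properties", {})
    for param in ("mode", "action"):
        enum = props.get(param, {}).get("enum", [])
        if len(enum) > threshold:
            return True
    return False

old_search = {
    "name": "wire_search",
    "inputSchema": {
        "type": "object",
        "properties": {"mode": {"enum": ["list", "get", "filter", "text", "semantic"]}},
    },
}
assert has_overloaded_mode(old_search)  # five modes: a candidate for splitting
```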
If your server exposes retrieval over chunked content, consider whether an agent has any way to expand around a match other than issuing another search. Walking adjacent chunks, pulling the full source, or following relationship edges from a specific match is almost always cheaper than reprocessing the query against the full corpus. If the answer to “get the paragraph before this one” is “do another similarity search and hope,” you are forcing the agent into an expensive pattern where a cheap one would do.
The agent in our benchmark was told nothing about which tool to call when. Its system prompt was a role, multi-turn permission, and “only use what the tools retrieved.” Everything else it learned from the tool descriptions that come back from MCP’s tools/list. The 77% wire_explore text-mode usage and the even split on wire_navigate modes were both discovered behaviors driven by descriptions alone. Your tool descriptions are not documentation for humans. They are the prompt the agent uses to plan. Treat them like that.
Anthropic’s engineering team made the same argument in their 2025 post on writing tools for AI agents: tool names, descriptions, and input schemas dominate agent behavior far more than most builders assume, and small description changes can produce large swings in tool-selection accuracy. Our run is consistent with that finding at the surface level: the only knob we turned was shape and description, and the agent’s trajectory changed substantially.
A match that is just an id, a score, and a content blob is incomplete. The agent has to reason about where the match came from, and if you do not tell it, it will either guess or avoid the claim. Formalize the shape of your search results so clients validate them and agents can ground on them without parsing loose metadata. For Wire this looks like structured context with explicit fields; the specifics will differ per server, but the shape is part of the tool’s contract, not a nice-to-have.
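A minimal way to make that contract enforceable rather than aspirational (the field names follow the provenance list earlier in the post; the validator itself is an illustration, not Wire's implementation):

```python
REQUIRED_FIELDS = {
    "id", "score", "content",                      # the bare match
    "source_file", "chunk_index", "total_chunks",  # where it came from
    "_meta",                                       # what navigate can do with it
}

def validate_match(match: dict) -> dict:
    """Reject a search result that is missing part of its contract."""
    missing = REQUIRED_FIELDS - match.keys()
    if missing:
        raise ValueError(f"search result missing fields: {sorted(missing)}")
    return match
```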
A few things the benchmark does not claim.
It does not claim that splitting any tool is always a win. The old wire_search had five modes that did materially different jobs. A tool with two closely related modes that return the same shape would not gain from the same split. The mechanism is reducing wrong-mode retries, not reducing tool count.
It does not claim the effect generalizes across models. We ran against Gemini 3 Flash. A model with stronger tool-planning behavior might show less of a lift from removing mode selection, because it gets the mode right more often to begin with. A weaker tool-planner might show more.
It does not claim provenance directly raised faithfulness. Faithfulness sat at 5.00 in both runs because the prompt constrained the agent to retrieved content. Provenance likely helps where answers involve contradictions, corroborations, or time-sensitive claims, which our 64-question fixture does not lean on heavily.
It does not claim that completeness improved. Completeness moved from 3.91 to 3.88, which is flat within noise. Correctness and couldn’t-answer rate are the axes on which the two surfaces separated clearly.
The tool surface on an MCP server is part of the context engineering pipeline, not separate from it. Every tool description the agent sees goes into context. Every match shape the agent reasons about goes into context. Every wrong-mode retry puts another round of tool call and result into the conversation history that the next turn has to carry.
When MCP integrations fail, the underlying issue is usually context delivery rather than the protocol itself. The same framing applies to tool surface design. The question is not “how few tools can I expose” but “what context does the agent need to plan the next call, and what shape keeps each call one verb away from the answer.”
One job per tool, traversal when a match is already in hand, and typed provenance on every result together moved a real agent to answer more questions correctly in fewer turns on less context. On our fixture, on this container, under a fixed model and prompt. Replay it against yours and see what happens.
Sources: Wire: More tools, fewer calls, restructuring agentic retrieval · Anthropic: Writing tools for AI agents · Stanford: Lost in the Middle (arXiv:2307.03172) · Real Faults in Model Context Protocol Software (arXiv:2603.05637)
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.