Your MCP server is installed. Your agent isn't using it.

Jitpal Kocher · May 21, 2026 · 10 min read

Key takeaway

When an agent ignores a freshly connected MCP server, the cause is rarely authentication or transport. It is that the tool descriptions in context do not signal a discriminating advantage over the agent's existing options, no anchoring example shows what the tool returns, and the user's first prompt does not pattern-match the tool's domain. Fixing the first-call gap means treating MCP onboarding as a context engineering task, not an integration task. A short starter prompt that names a concrete task, names the tool, and shows one expected output is usually all the activation an agent needs.

You connect a fresh MCP server to Claude, Cursor, or any other agent. The handshake works. The tool list shows up. The agent says “Connected to your-mcp-server.” You give it a task the tool was built for, and it answers from its own knowledge, or web search, or by writing code, and the tool sits there unused. Twenty minutes later, you tab back to the docs and start wondering whether the install actually worked.

This is the first-call problem. It is the gap between “the tool is in context” and “the agent calls it on the first relevant turn,” and it shows up consistently in production MCP deployments. The pattern is not that integration is broken. It is that integration is the easy half. The harder half, the half nobody documents, is making the agent’s first useful tool call inevitable.

This post walks through what tool-selection benchmarks actually show about first-call behavior, the three repeatable reasons agents skip the tool you just installed, the capability-versus-skill distinction that closes the gap, and a starter pattern that makes the first call something the agent does without being asked.

What benchmarks say about first-call tool selection

Tool-selection benchmarks paint a clearer picture than the marketing copy on most MCP servers. On the ToolBench benchmark, leading models reach a Pass Rate@10 of 0.826 and Recall@10 of 0.867 in benign conditions, which sounds good until you read the rest. Distributional robustness work shows that adversarial conditions, even mild prompt variations, drop certified accuracy bounds below 20% after a single round of adaptation. Tool selection is fragile even when the tool is well-described, and it is much more fragile when the description is generic.

The 2026 ecosystem makes the problem worse before it makes it better. There are over 177,000 public MCP tools available. Tool definitions in context routinely consume 40-50% of the model’s available tokens before the agent does any actual work, according to Perplexity’s CTO. Deferred tool loading, the 2026 pattern covered in progressive tool loading, reduces token bloat by roughly 80% by loading definitions on demand. That helps the visibility problem.

It does not help the activation problem. Once a tool is loaded into context, the agent still has to decide whether to use it on the next turn. Benchmark Pass Rate scores measure whether the agent eventually picks the right tool given a task that explicitly maps to it. They do not measure whether the agent picks the tool on the first cold-start turn when the user’s prompt does not name the tool, the domain is ambiguous, and the agent has other options. That is the metric that matters in production, and it is the metric most teams discover empirically when they ship.

Three reasons agents skip the tool you just installed

The first-call gap is not random. It tracks three repeatable failure modes in how the tool reaches the model.

The tool description does not differentiate from existing options. A description like “Search the user’s documents” sounds fine on the page, but it arrives in context alongside the model’s existing capability for web search, the agent’s ability to read files in the working directory, and whatever other tools are loaded. The agent has no signal that “the user’s documents” means something specific and valuable rather than one more generic option like those it has used a thousand times. Descriptions that win the first call name the specific kind of context they return, the task type they outperform on, and at least one piece of vocabulary the user is likely to use in the prompt.

There is no anchoring example or starter task. Models trained on tool-use traces lean heavily on examples to decide when a tool is the right move. A tool description with no example reads like a man page; the agent treats it as available but unproven. A tool description that includes a one-line example of input plus output, in the format the tool actually returns, gives the model the pattern it needs to recognize a future user prompt as a match. The example is not for the user; it is for the model.

The user’s first prompt does not pattern-match the tool’s domain. Most cold-start failures happen because the user’s first sentence is too generic. “Help me with this project” does not tell the agent to look at the freshly installed knowledge container; “explain the architecture pattern in my codex notebook” does. The user can fix this in their prompt, but the system can do better. A starter prompt suggestion, the kind a setup flow can hand to the user the moment a connection completes, can lift first-call rates substantially without requiring users to read documentation.

These three failure modes are independent. A tool can have a perfect description and miss the first call because the example is wrong. It can have a great example and miss because the user prompt is generic. Production-grade activation handles all three.

Capability versus skill: the layer between a tool and a call

A useful frame from the agent-tooling community is that tools give the agent a capability, and skills teach the agent when and how to use that capability. MCP ships the capability layer well: every tool exposes a name, a description, and an input schema. The skill layer is what’s missing in most MCP-server installs.

Skills are short instruction blocks loaded alongside the tool list. They name the kinds of tasks the tool is right for, the heuristics that trigger using it, and the gotchas the agent should remember. “Always check the container before answering from general knowledge for questions about the user’s projects” is a skill. “Use the wire_search tool when the user mentions a specific document name, an entity, or a project artifact” is a skill. The model reads these the same way it reads tool descriptions, but they target activation logic instead of capability description.
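One way to picture a skill is as a small record that pairs a trigger with a tool and renders to plain instruction text. The sketch below assumes that shape; the `Skill` class and `render_skill_block` function are hypothetical names for illustration, not part of MCP or of any agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """One activation rule: when the trigger holds, prefer the named tool."""
    trigger: str
    tool: str


def render_skill_block(skills: list[Skill]) -> str:
    """Render skills as a short instruction block to load next to the tool list."""
    lines = ["When to use the connected tools:"]
    lines += [f"- Use {s.tool} when {s.trigger}." for s in skills]
    return "\n".join(lines)


# Triggers paraphrase the two example skills in the text above.
skills = [
    Skill("the user mentions a specific document name, an entity, or a project artifact", "wire_search"),
    Skill("the question is about the user's projects, before answering from general knowledge", "wire_search"),
]
skill_block = render_skill_block(skills)
print(skill_block)
```

Keeping triggers as data rather than free text makes it easier to review which tools have activation rules and which have none.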

Without skills, every tool description has to do double duty: explain the capability and signal when to use it. Descriptions tend to optimize for the first job (capability), which is the one the protocol surfaces, and underdeliver on the second (activation). That is the structural reason a perfectly installed MCP server can still produce a first-call rate near zero on a fresh agent.

The setups that work in production combine MCP tools with a short skill block. Claude Code’s CLAUDE.md, OpenAI’s AGENTS.md, and the skill-pack patterns that have started shipping in 2026 are all the same idea at different abstraction levels. The substrate post frames this as the static-instruction tier of the substrate; here, the same property is what closes the first-call gap.

What good first-call activation looks like

Good activation is concrete. Three properties make it work, in roughly increasing order of effort.

A discriminating tool description. The description names the kind of data the tool returns (“structured entries about the user’s projects, with provenance and timestamps”), the task types it outperforms general knowledge on (“questions about specific documents, entities, or working artifacts”), and at least one vocabulary cue the user is likely to use (“when the user mentions a document name, project name, or person they collaborate with”). Compare this with “Search the user’s documents,” which is technically true and operationally useless.
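As a rough illustration, the contrast can be written out as a tool definition plus a crude keyword check. The field names follow the MCP tool shape (name, description, inputSchema). The `query` parameter, the cue phrases, and the `description_gaps` helper are assumptions made for this sketch, not a validated rubric.

```python
GENERIC_DESCRIPTION = "Search the user's documents."

DISCRIMINATING_DESCRIPTION = (
    "Returns structured entries about the user's projects, with provenance "
    "and timestamps. Prefer this over general knowledge for questions about "
    "specific documents, entities, or working artifacts. Use it when the user "
    "mentions a document name, project name, or person they collaborate with."
)

# Hypothetical tool definition; the "query" parameter is an assumption.
wire_search_tool = {
    "name": "wire_search",
    "description": DISCRIMINATING_DESCRIPTION,
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

# Rough cue phrases for each of the three properties; a heuristic only.
CUES = {
    "names_returned_data": ("returns", "provenance", "timestamps"),
    "names_task_type": ("questions about", "prefer this over"),
    "names_user_vocabulary": ("when the user mentions",),
}


def description_gaps(description: str) -> list[str]:
    """Return the properties for which no cue phrase appears in the description."""
    text = description.lower()
    return [prop for prop, cues in CUES.items() if not any(c in text for c in cues)]


print(description_gaps(GENERIC_DESCRIPTION))         # all three properties missing
print(description_gaps(DISCRIMINATING_DESCRIPTION))  # no gaps
```

A keyword check like this only catches obviously thin descriptions; whether a description actually wins the first call still has to be tested against a real agent.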

One anchoring example in the tool’s metadata. Input: “What did I write about the auth refactor?” Output: a short, real-looking sample of the kind of structured response the tool returns, with at least one chunk-position or provenance field visible. The model treats this as a pattern, not a literal sample, and uses it to recognize matching prompts later.
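A minimal sketch of attaching such an example, assuming it travels inside the description text rather than in a dedicated metadata field. The sample entry and its field names are invented for illustration and are not Wire's actual response schema.

```python
import json

# Hypothetical sample result; values and field names are illustrative only.
EXAMPLE_INPUT = {"query": "What did I write about the auth refactor?"}
EXAMPLE_OUTPUT = {
    "entries": [
        {
            "text": "Decided to move session checks into middleware.",
            "source_document": "auth-refactor-notes.md",
            "chunk_position": 3,
            "timestamp": "2026-05-12T09:30:00Z",
        }
    ]
}


def with_anchoring_example(description: str, example_input: dict, example_output: dict) -> str:
    """Append one compact input/output example to a tool description."""
    return (
        f"{description}\n"
        f"Example input: {json.dumps(example_input)}\n"
        f"Example output: {json.dumps(example_output)}"
    )


anchored = with_anchoring_example(
    "Returns structured entries about the user's projects.",
    EXAMPLE_INPUT,
    EXAMPLE_OUTPUT,
)
print(anchored)
```

The example output keeps the provenance fields visible, since those are what distinguish the tool's result from a general-knowledge answer.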

A starter prompt the user can paste. The single highest-leverage activation move is to hand the user a one-line task the tool wins on, the moment the connection completes. “Try: ask the agent what you wrote about [a recent project name] yesterday.” This is not a docs link. It is the agent’s first instruction. The MCP failures post covers the broader version of this argument: agent failures are upstream context problems, and the most upstream point of all is the first user prompt.

These three together raise first-call rate from “depends on the user and the model” to “near-deterministic for any modern agent.” None of them require changing the model, the protocol, or the integration code.

A starter pattern that makes the first call inevitable

The pattern below is what works empirically across Claude Code, Cursor, Windsurf, and Codex. Adapt the specifics; keep the shape.

You have access to a context container connected over MCP. The container holds
structured entries about my work: documents I wrote, projects I'm running,
people I collaborate with, and decisions I've recorded over time.

Use the container's tools (wire_search, wire_explore, wire_navigate) when:
- I mention a specific document, project name, or person
- I ask what I wrote, decided, or planned
- I ask about something I clearly should remember but might not state precisely

The container's results carry provenance: source document, chunk position, and
timestamp. Cite the source whenever you answer from container content. If a
question is general knowledge, answer normally without calling the container.

To verify the connection: search the container for [a specific document or
project name you remember writing about].

The block is short on purpose. It does three things at once: it gives the agent a skill (when to call the container), it sets a citation convention (so output looks grounded), and it ends with a concrete first call (“verify the connection: search the container for X”). That final line is the activation move. The agent reads it, executes it, gets a real result, and the rest of the session inherits that pattern.
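A setup flow could fill in the block programmatically. The sketch below uses an abbreviated version of the block above, and `render_starter_block` is a hypothetical helper. It refuses a blank or still-bracketed verification target, because a placeholder in the last line leaves the agent with no concrete first call.

```python
# Abbreviated form of the starter block; adapt the wording to the full version.
STARTER_TEMPLATE = """\
You have access to a context container connected over MCP.

Use the container's tools ({tools}) when:
- I mention a specific document, project name, or person
- I ask what I wrote, decided, or planned

Cite the source whenever you answer from container content.

To verify the connection: search the container for {target}."""


def render_starter_block(tools: list[str], target: str) -> str:
    """Fill the starter block; reject a blank or still-bracketed verification target."""
    target = target.strip()
    if not target or target.startswith("["):
        raise ValueError("name a real document or project for the first call")
    return STARTER_TEMPLATE.format(tools=", ".join(tools), target=target)


starter = render_starter_block(
    ["wire_search", "wire_explore", "wire_navigate"],
    "the auth refactor notes",
)
print(starter)
```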

In Wire, the container ships with the five canonical tools (wire_explore, wire_navigate, wire_search, wire_write, wire_delete) and structured entries with provenance and timestamps on every retrieval, which is what makes the starter block above a one-paste setup rather than a multi-file configuration exercise. The same pattern works against any MCP server whose tools have clear semantic naming and whose results carry enough structure for citations.

The activation gap is fixable but not automatic

Connecting an MCP server is a five-minute task. Closing the first-call gap is a 30-minute task, and most teams skip it because nothing in the integration flow tells them it exists. The integration UI shows “Connected.” The benchmark scores look good. The tool list appears. Then the agent does not use the tool, and the team starts debugging the wrong layer.

The fix is not better tooling, not more tools, and not a smarter model. It is the short skill block, the discriminating description, the anchoring example, and the starter prompt that turns the first call from a hopeful event into a near-deterministic one. Treat MCP onboarding as a context engineering problem, design the activation surface deliberately, and the usage rate stops lagging behind the install rate.

Sources: Perplexity CTO on MCP tool description token cost (via Hanzilla) · MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use (arXiv 2512.24565) · Quantifying Distributional Robustness of Agentic Tool-Selection (arXiv 2510.03992) · LLM Agent & Tool-Use Benchmarks 2026 — BenchLM · How to Build Deferred Tool Loading for AI Agents in 15 Minutes · Agent Skills vs Tools vs MCP: The Complete Guide (2026)

Frequently asked questions

Why does an agent skip a freshly installed MCP server on the first turn?

The tool definition is in context, but it does not signal a clear advantage over options the agent already has, like web search, code execution, or model knowledge. Without an anchoring example or a starter task that pattern-matches the tool's domain, the agent picks the path it has used before. The fix is to make the first useful call obvious in the prompt, not to add more tools.

Is the first-call problem an issue with the model or with the tool description?

Mostly the description and the surrounding prompt. Models with strong tool-use scores still skip tools whose descriptions read like generic API endpoints. A description that names the data the tool returns and the kind of task it wins on outperforms one that lists capabilities at the abstraction level of a REST endpoint, regardless of the underlying model.

How do you tell whether your MCP tool description is doing its job?

Run a short cold-start test. Give the agent a task that the tool clearly wins on, with no instructions about which tool to use, and check whether the agent's first action is to call the tool. If the agent reaches for web search or its own knowledge instead, the description is not establishing a discriminating signal. Iterate on the description, not the model.
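The cold-start test can be sketched as a small harness. It assumes the agent can be wrapped as a callable that takes a prompt and returns the name of its first action. That interface and the stub agent are stand-ins for illustration, not a real client API.

```python
from typing import Callable

# Assumed interface: prompt in, name of the first action out
# (for example "wire_search", "web_search", or "answer_directly").
Agent = Callable[[str], str]


def first_call_rate(agent: Agent, prompts: list[str], expected_tool: str) -> float:
    """Fraction of cold-start prompts whose first action is the expected tool."""
    if not prompts:
        raise ValueError("need at least one prompt")
    hits = sum(1 for p in prompts if agent(p) == expected_tool)
    return hits / len(prompts)


def stub_agent(prompt: str) -> str:
    """Stand-in agent: calls the tool only when the prompt mentions notes."""
    return "wire_search" if "notes" in prompt.lower() else "answer_directly"


prompts = [
    "What did I write in my auth refactor notes?",
    "Help me with this project",
]
rate = first_call_rate(stub_agent, prompts, "wire_search")
print(f"first-call rate: {rate:.0%}")  # prints "first-call rate: 50%"
```

With a real agent behind the callable, rerunning the same prompt set after each description change shows whether the edit moved the first-call rate.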

What's the difference between a tool and a skill for agent activation?

A tool gives the agent a new capability. A skill teaches the agent when and how to use that capability, usually as a short instruction block loaded alongside the tool definition. Tools without skills produce the first-call gap: the agent has the option but no signal about when it is the right option. The pattern of combining MCP tools with a starter-skill block is how production agent setups close the gap.

Does deferred tool loading solve the first-call problem?

No. Deferred tool loading addresses token bloat when you have hundreds of tools, by loading definitions on demand. The first-call problem happens even with a single MCP server loaded eagerly. The two layers are independent: deferred loading decides which tools are visible at all, and activation decides whether the agent calls a visible tool on the first relevant turn.
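To make the visibility layer concrete, here is a minimal sketch of on-demand loading. `DeferredToolRegistry` is a hypothetical class and the tool descriptions are placeholders. It only controls what is in context and does nothing to decide whether the model calls a tool once the definition is loaded.

```python
from dataclasses import dataclass, field


@dataclass
class DeferredToolRegistry:
    """Hypothetical registry: full definitions stay out of context until requested."""
    definitions: dict[str, dict]
    loaded: set[str] = field(default_factory=set)

    def visible_context(self) -> list[dict]:
        """Full definitions for loaded tools, names only for the rest."""
        return [
            self.definitions[name] if name in self.loaded else {"name": name}
            for name in sorted(self.definitions)
        ]

    def load(self, name: str) -> dict:
        """Pull one full definition into context on demand."""
        self.loaded.add(name)
        return self.definitions[name]


registry = DeferredToolRegistry(
    definitions={
        "wire_search": {"name": "wire_search", "description": "Placeholder description."},
        "wire_write": {"name": "wire_write", "description": "Placeholder description."},
    }
)
before = registry.visible_context()  # names only
registry.load("wire_search")
after = registry.visible_context()   # wire_search now carries its description
```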
