Benchmark · April 20, 2026

More tools, fewer calls: restructuring agentic retrieval for Wire containers

We restructured Wire's MCP tool surface from two overloaded tools to three single-purpose ones and added typed epistemic provenance to every result. Across 64 queries on the same container: +7% correctness, 60% fewer failed retrievals, 10% fewer turns, 20% lower token cost per question.

What changed in the tools

Before

Two tools, one overloaded

wire_explore discovered the container schema. wire_search did everything else through a mode parameter: list, get, filter, text search, semantic search.

Agents had to pick the right mode. When they didn't, a call came back wrong-shaped and they had to try again, burning a turn and tokens.

After

Three tools, one job each

  • wire_explore now owns structured access: schema, list, get, filter, keyword text over classified entities.
  • wire_search is exclusively hybrid semantic retrieval over raw content. No modes.
  • wire_navigate is new. From a search result, jump to adjacent chunks, the full source, or related entries.
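The split above is easiest to see as the descriptors an MCP client would receive. The sketch below is illustrative: the tool names come from this post, but the description strings and `inputSchema` shapes are our assumptions, not Wire's actual schema.

```typescript
// Hypothetical sketch of the three-tool surface as a client might see it
// from tools/list. Descriptions and input shapes are illustrative only.
type ToolDescriptor = {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>;
};

const wireTools: ToolDescriptor[] = [
  {
    name: "wire_explore",
    description:
      "Structured access: schema, list, get, filter, and keyword text " +
      "search over classified entities.",
    inputSchema: { mode: "schema | list | get | filter | text" },
  },
  {
    name: "wire_search",
    description: "Hybrid semantic retrieval over raw content. No modes.",
    inputSchema: { query: "string" },
  },
  {
    name: "wire_navigate",
    description:
      "From a search result, jump to adjacent chunks, the full source, " +
      "or related entries.",
    inputSchema: { from: "result id", target: "siblings | source | related" },
  },
];
```

One job per descriptor means the agent's tool-selection step reads like a routing table rather than a mode puzzle.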

Hypothesis. Adding wire_navigate plus typed provenance on every result gives the agent a tighter, more grounded way to expand around a good match instead of firing another search. Splitting the retrieval tools so each does one job reinforces this by giving the agent a clearer map of what to call next.

Test. Same container, same questions, same judge. Hold model, data, and prompt constant; vary only the tool surface.

What we tested

Held constant across both runs
Fixture
64 questions across three tiers: factual recall (22), conceptual (22), cross-document synthesis (20). One container holding 286 podcast-transcript episodes.
Agent loop
Gemini 3 Flash with up to 7 turns per question. Minimal system prompt: role, multi-turn permission, only-use-retrieved-context. No prescriptive playbook about which tool to call when.
Tool descriptions
Came from MCP's tools/list, exactly what a real MCP client would see. No custom prompt engineering to nudge the agent toward any particular tool.
Judge
A separate Gemini 3 Flash pass scored each answer on correctness, completeness, and faithfulness (1–5). Judge saw the question and expected answer, but not which run produced the response.
The one thing that varied
2 tools
  • wire_explore (schema discovery)
  • wire_search (5 modes: list, get, filter, text, semantic)
3 tools, with wire_navigate
  • wire_explore (structured canonical access)
  • wire_search (hybrid semantic over raw content)
  • wire_navigate (traverse from a match)

Results

64 questions, averages across all tiers.

Metric                      2 tools    3 tools (with wire_navigate)
Correctness                 4.47 / 5   4.78 / 5
Completeness                3.91 / 5   3.88 / 5
Faithfulness                5.00 / 5   5.00 / 5
Couldn't answer             5 / 64     2 / 64
Avg turns per question      3.35       3.03
Avg total tokens consumed   13,800     11,014
  • +7% correctness vs the old surface, same 64 questions.
  • −60% failed retrievals: "couldn't answer" dropped from 5 to 2.
  • −20% tokens consumed per question. Fewer calls, tighter turns.
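The headline percentages follow directly from the raw numbers in the results table; a quick recomputation:

```typescript
// Recompute the headline deltas from the raw numbers in the results table.
const pct = (after: number, before: number) =>
  Math.round(((after - before) / before) * 100);

const correctnessDelta = pct(4.78, 4.47);   // correctness, 3 tools vs 2
const failedDelta = pct(2, 5);              // "couldn't answer" count
const tokensDelta = pct(11014, 13800);      // avg tokens per question
```

Rounded to whole percentage points, these give +7%, −60%, and −20% respectively.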

Where wire_navigate pays off

Breaking the run down by question tier shows exactly where the new tool earns its keep.

Correctness by tier         2 tools           3 tools (with wire_navigate)
T1 · factual (22)           4.32 · 3 failed   5.00 · 0 failed
T2 · conceptual (22)        4.73              4.77
T3 · cross-document (20)    4.35              4.55

Factual recall went from 4.32 to a perfect 5.00: every factual question answered correctly, with three failures before and zero after. One job per tool lets the agent find answers it previously missed while wrestling with the old wire_search's mode-parameter puzzle.

Cross-document is where wire_navigate earns its keep. When the answer is scattered across multiple sources, the agent lands on a relevant passage via wire_search, then pulls siblings, the full source, or related entries via wire_navigate to stitch the rest together. More targeted than issuing another full semantic search.
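The search-then-navigate pattern reduces to a decision the agent can make from a hit's `_meta` block. The field names below mirror the result shape shown later in this post (`_meta.wire.navigate`); the decision rule itself is our illustration, not Wire's agent logic.

```typescript
// Given a wire_search hit, decide whether to expand via wire_navigate
// instead of issuing another full semantic search. The rule is
// illustrative; field names follow the result shape in this post.
type SearchHit = {
  id: string;
  score: number;
  _meta: { wire: { navigate: { hasSiblings: number } } };
};

type NextCall =
  | { tool: "wire_navigate"; from: string; target: "siblings" }
  | { tool: "wire_search"; query: string };

function nextCall(hit: SearchHit, query: string): NextCall {
  // A good match with known siblings is cheaper to expand than to re-search.
  if (hit._meta.wire.navigate.hasSiblings > 0) {
    return { tool: "wire_navigate", from: hit.id, target: "siblings" };
  }
  // No structure to traverse: fall back to another semantic search.
  return { tool: "wire_search", query };
}
```

The point is that the expansion is anchored to a known-good match, so each follow-up call carries the context of where it came from.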

What the agent actually called

Total tool invocations across all 64 questions. Same agent, same prompt, different tool surface.

2 tools: 199 total tool calls across 64 questions (3.11 / question)
3 tools, with wire_navigate: 152 total tool calls across 64 questions (2.38 / question)

The headline: the three-tool agent made 47 fewer total calls than the old two-tool agent (152 vs 199). Adding a tool didn't bloat the agent; it tightened it. Each call was more targeted, with fewer redundant retries.

When we looked at what the agent was doing, one thing stood out. Nobody told it when to use which tool. Its system prompt said "you have access to a Wire container through MCP tools" and that was it. Everything else it learned from the tool descriptions themselves.

About 77% of wire_explore calls were mode: "text". The agent discovered on its own that keyword matching across classified entities was cheaper and more precise for lookup-style questions than a full semantic search. Nobody prompted it to do that. It read the tool descriptions and figured it out.

The wire_navigate calls split evenly between siblings mode to grab adjacent chunks and source mode to retrieve the entire source.
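Figures like the 77% text-mode share are simple tallies over the call log. A minimal sketch, assuming a log entry shape of our own invention:

```typescript
// Fraction of a tool's calls that used a given mode, computed from a
// call log. The LoggedCall shape is an assumption for illustration.
type LoggedCall = { tool: string; mode?: string };

function modeShare(log: LoggedCall[], tool: string, mode: string): number {
  const calls = log.filter((c) => c.tool === tool);
  if (calls.length === 0) return 0;
  return calls.filter((c) => c.mode === mode).length / calls.length;
}
```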

Every match is sourced

Alongside the tool split, we formalized what a search result looks like. Every match now comes back with a typed provenance object so agents have a proper epistemic footing before they reason.

{
  id: "4d8a4ad4-66da-4c20-9366-21a378357582",
  score: 0.046,
  content: "...",
  provenance: {
    source: "file:example.txt",
    sourceFileId: "BLjdInPD6UvbhcFZ",
    chunkIndex: 15,
    totalChunks: 33,
    ingestedAt: "2026-04-15T17:42:37Z",
    tags: ["chunk"],
    fileName: "example.txt",
    sectionHeader: "...",
    chunkSummary: "..."
  },
  _meta: {
    wire: {
      navigate: {
        hasSiblings: 32,
        relationshipTypes: {
          elaborates: 4,
          corroborates: 1
        }
      }
    }
  }
}

That object is epistemic provenance. Not an editorial judgment about whether the content is trustworthy. That's the agent's job. The judgment that matters here is structural: what source did this come from, what's its position inside that source, when was it ingested, and what else in the container relates to it and how.

When wire_navigate walks relationships, each returned edge carries edgeType and edgeDirection. The agent can tell the difference between "X contradicts my chunk" and "my chunk contradicts X," which reshapes how it frames the answer.
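Resolving that framing is mechanical once the edge is typed. In the sketch below, `edgeType` and `edgeDirection` are the field names from this post, but the `"inbound"`/`"outbound"` values and the `otherId` field are assumptions for illustration:

```typescript
// Turn a relationship edge into the sentence the agent reasons with.
// edgeType/edgeDirection names come from this post; the direction
// values and otherId field are illustrative assumptions.
type Edge = {
  edgeType: "contradicts" | "elaborates" | "corroborates";
  edgeDirection: "inbound" | "outbound";
  otherId: string;
};

function frameEdge(edge: Edge, myId: string): string {
  return edge.edgeDirection === "inbound"
    ? `${edge.otherId} ${edge.edgeType} ${myId}`  // "X contradicts my chunk"
    : `${myId} ${edge.edgeType} ${edge.otherId}`; // "my chunk contradicts X"
}
```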

That's why faithfulness stayed pinned at 5.00 across every run while correctness climbed. Provenance isn't just decoration on the response. It gives the agent the shape it needs to stay grounded. Every claim can be tied back to a specific position in a specific source, ingested at a specific time, with a specific kind of connection to the question.

Wire doesn't tell your agent what to believe. It tells your agent where things came from.

What this means

Across 64 questions with fixed data and model, the three-tool surface produced higher correctness (4.78 vs 4.47) at lower cost: fewer tool calls per question (2.38 vs 3.11), fewer turns (3.03 vs 3.35), and ~20% lower token spend (11,014 vs 13,800). Completeness was flat (3.88 vs 3.91).

Faithfulness was 5.00 in both runs. We read this as a property of the harness rather than an absence of signal: every answer was generated strictly from retrieved content in both conditions, so there was little surface for unfaithful claims to emerge. Correctness and completeness were the axes on which the two surfaces could separate.

Adding a tool did not degrade efficiency. Under the hypothesis that additional tools increase decision overhead, the expected outcome was higher call counts or longer trajectories. We observed the opposite: with one job per tool and descriptions aligned to tool behavior, the agent reached a terminating retrieval earlier.

The observed effect is attributable to tool surface shape rather than model, data, or prompt changes, all of which were held constant.

How we ran this

Dataset. The same 64-question fixture we've used for retrieval benchmarks since March, targeting a Wire container with transcripts from a major product-focused podcast (Lenny's Podcast, 286 episodes). 22 factual, 22 conceptual, 20 cross-document questions, each with an expected answer.

Two runs. The "before" baseline used the two-tool surface shipped in April 2026 (wire_explore for schema, wire_search with five modes for everything else). The "today" run uses the new three-tool surface: wire_explore with its own mode parameter for structured access, wire_search for hybrid semantic retrieval over raw content, and wire_navigate for traversing from a search result. Container, fixture, and judge model are identical across both runs.

Agent loop. Gemini 3 Flash acts as the retrieval agent with up to 7 turns per question. Its system prompt is minimal: research-assistant role, multi-turn permission, and "only use what the tools retrieved." Tool descriptions come from MCP's tools/list, same as any MCP client would see. No prescriptive playbook, no "call X then Y" instructions.
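The tool descriptions the agent saw came over the wire as the answer to an ordinary MCP `tools/list` request, which is a plain JSON-RPC 2.0 message. The method name is defined by the MCP spec; the request id here is arbitrary:

```typescript
// A standard MCP tools/list request: plain JSON-RPC 2.0.
// "tools/list" is the method name from the MCP spec; the id is arbitrary.
const toolsListRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/list",
  params: {},
};

const wireRequest = JSON.stringify(toolsListRequest);
```

Whatever descriptions the server returns for this request are the entirety of what the agent knows about its tools, which is why the benchmark used them unmodified.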

Evaluation. A separate Gemini 3 Flash call scored each answer for correctness, completeness, and faithfulness on a 1-5 scale. The judge saw the question, the expected answer, and the generated answer, but not which run produced it.

Token accounting. "Avg total tokens consumed" is the sum of input tokens across every Gemini call in the loop for one question, including the system prompt sent each turn, tool schemas, conversation history, and tool results. Lower numbers mean the agent reached the answer with less back-and-forth.
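The accounting above reduces to a per-turn sum over the loop. A sketch under that definition, with a turn shape of our own invention:

```typescript
// "Avg total tokens consumed" for one question: input tokens of every
// Gemini call in the loop (system prompt resent each turn, tool schemas,
// conversation history, tool results), summed over turns. The Turn
// shape is illustrative.
type Turn = {
  systemPromptTokens: number;
  toolSchemaTokens: number;
  historyTokens: number;
  toolResultTokens: number;
};

function questionTokens(turns: Turn[]): number {
  return turns.reduce(
    (sum, t) =>
      sum +
      t.systemPromptTokens +
      t.toolSchemaTokens +
      t.historyTokens +
      t.toolResultTokens,
    0
  );
}
```

Because the system prompt and tool schemas are charged on every turn, shorter trajectories compound: each turn saved also saves the fixed per-turn overhead.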

Frequently asked questions

About the benchmark and Wire.

What is Wire?
Wire is a context-as-a-service platform. You upload files into a container, Wire processes them into AI-optimized context, and any AI tool can access that context through MCP (Model Context Protocol). Think of it as shared, structured memory for your AI tools.
What's the Retrieval and Navigation Model?
Every Wire container used to have two retrieval tools: wire_explore for schema discovery, and wire_search for everything else (list, get, filter, text, semantic) through a single mode parameter. We split that into three tools with clear jobs: wire_explore owns structured canonical access, wire_search is hybrid semantic retrieval over raw content, and wire_navigate traverses from a search result to adjacent chunks, the full source, or related entries. One job per tool, and the agent decides when to call which.
Why did adding a new tool reduce the number of tool calls?
The primary reason is that wire_navigate lets the agent land on a relevant passage once and then traverse from it: pull adjacent chunks, read the full source, or follow relationship edges. Without it, each follow-up had to be a fresh search. Secondarily, splitting the old five-mode wire_search into single-purpose tools removed a class of wrong-mode retries. First call is more likely to be the right one. Net result: fewer calls, fewer reasoning tokens, better answers.
How did you measure this?
Same 64 questions across three tiers (factual, conceptual, cross-document) on the same container. A Gemini agent with a max of 7 retrieval turns on each question, then an independent judge model scored correctness, completeness, and faithfulness on a 1-5 scale. The only thing we changed between the two runs was the retrieval tool surface the agent had access to.
Can I run this benchmark on my own data?
Yes. Create a free Wire account, upload your files into a container, and connect your AI tools. You get 3,000 free credits to start, no credit card required.

Give your agents the new tool surface

Every container gets wire_explore, wire_search, and wire_navigate automatically.

3,000 free credits. No credit card required.

Create Your First Container