What we tested
Each retrieval pipeline was given a question and retrieved context from the same dataset; a non-frontier model (2026) then generated an answer strictly from that retrieved context. An independent evaluator scored each answer on three dimensions:
- **Correctness.** Does the answer contain the right information?
- **Completeness.** Does it cover all key aspects of the expected answer?
- **Faithfulness.** Was the answer generated from the retrieved context, not from the model's training data? This is a grounding check, not an accuracy check: a score of 5 means the model used only what was retrieved, even if what was retrieved was wrong.
Answered vs. declined: Both pipelines always returned context from the retrieval step, but when that context wasn't relevant enough to answer the question, the model correctly declined rather than guessing. "Declined to answer" therefore means the retrieval pipeline returned results, just nothing useful. Declined answers are scored 0 for correctness, 0 for completeness, and 5 for faithfulness. The gap between the two pipelines' "answered" counts shows where one returned relevant context and the other didn't.
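The scoring rule for declines can be made concrete. This is a minimal sketch of the rubric as described above, not the evaluator's actual implementation; the `Score` type and `mean` helper are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Score:
    correctness: int   # 1-5: does the answer contain the right information?
    completeness: int  # 1-5: does it cover all key aspects?
    faithfulness: int  # 1-5: grounded only in retrieved context?

def score_declined() -> Score:
    # A decline means retrieval returned results but nothing useful:
    # no credit for content, full credit for not guessing.
    return Score(correctness=0, completeness=0, faithfulness=5)

def mean(scores: list[Score], field: str) -> float:
    # Declines are averaged in, which is why many declines drag down
    # a pipeline's overall correctness and completeness.
    return sum(getattr(s, field) for s in scores) / len(scores)
```

Because declines count as 0/0/5, a pipeline that declines often can still post near-perfect faithfulness while its correctness average falls.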
The approaches
RAG (Vector Search)
Standard retrieval-augmented generation. Embed the query, return the closest matching chunks. This is what most AI systems do today.
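The retrieval step reduces to nearest-neighbor search over chunk embeddings. A minimal pure-Python sketch (a real pipeline uses an embedding model and a vector index, neither shown here):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec: list[float], chunks: list[tuple[list[float], str]],
             k: int = 3) -> list[str]:
    """Return the text of the k chunks whose embeddings sit closest
    to the query embedding. `chunks` is a list of (embedding, text)
    pairs produced by whatever embedding model the pipeline uses."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[0]), reverse=True)
    return [text for _, text in ranked[:k]]
```

The weakness discussed later follows directly from this design: "closest vector" is not the same as "most useful for answering the question."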
Wire Semantic Search
Wire's retrieval, accessed through a single wire_search tool call. The container's architecture (entity extraction, schema discovery, and intelligent indexing) handles the retrieval.
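From the model's side, the whole pipeline is one tool call. A sketch of what that call might look like; the argument names here are illustrative, not Wire's actual tool schema:

```python
def make_wire_search_call(container_id: str, query: str) -> dict:
    """Build the single wire_search tool call the single-shot pipeline
    issues. Everything else (entity extraction, schema discovery,
    indexing) happens inside the container, not in the caller."""
    return {
        "tool": "wire_search",
        "arguments": {"container": container_id, "query": query},
    }
```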
Results: RAG vs Wire Semantic
Overall scores across all 64 questions.
| | RAG | Wire Semantic |
|---|---|---|
| Correctness | 2.98 / 5 | 3.73 / 5 |
| Completeness | 2.56 / 5 | 3.22 / 5 |
| Faithfulness | 5.00 / 5 | 4.97 / 5 |
| Answered from context | 41 / 64 | 51 / 64 |
| Declined to answer | 23 / 64 | 13 / 64 |
| Avg context tokens | 3,596 | 3,468 |
Both pipelines scored near-perfect faithfulness (5.00 and 4.97). The model wasn't using training data in either case. The difference is entirely in retrieval quality.
Both pipelines returned roughly the same volume of context per query (~3,500 tokens), but Wire's context was more relevant: 25% better correctness from the same token budget. More signal per token.
Results by question type
Why cross-document questions are hard
Cross-document questions require connecting information across multiple sources. A single retrieval call returns results for one query, which often misses the second or third piece of the puzzle. Both pipelines struggled here. RAG answered 8 of 20, Wire answered 10 of 20. Single-pass retrieval, whether RAG or Wire Semantic, has fundamental limits when the answer lives across multiple documents. This is where agentic retrieval changes the picture.
Token efficiency
Wire returns the same amount of context as RAG. The difference is what's in it.
Both pipelines return roughly 3,500 tokens of context per query. But Wire's context produces 25% better answers because those tokens are more relevant.
RAG retrieves the closest vector matches, which often include loosely related passages that dilute the useful information. Wire's container architecture (entity extraction, schema discovery, intelligent indexing) surfaces more precisely targeted context. Same token budget, more signal per token.
For teams monitoring token costs, this matters. You're not paying for more context. You're getting better context.
What happens when an agent uses Wire
Wire containers are designed for AI agents. Here's what happens when one actually uses all the tools.
We gave an AI agent access to the full suite of Wire container tools: wire_explore for schema discovery and wire_search for all retrieval. The agent explored the container, queried data across multiple turns, and iterated based on what it found. The results come from every Wire tool working in concert, not any single tool in isolation.
The agent: A custom retrieval agent with a maximum of 7 turns. It was given the available tools, encouraged to break complex problems into smaller queries, and told to stop when it believed it had enough context to answer. The same agent was used for both Agent + RAG and Agent + Wire; the only difference was which tools it had access to.
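The agent loop described above can be sketched as follows. This is an assumed shape, not the benchmark's actual implementation; `call_model` and `call_tool` are hypothetical callables standing in for the LLM and the tool executor:

```python
MAX_TURNS = 7  # the benchmark's turn cap

def run_agent(question, call_model, call_tool):
    """Minimal sketch of a capped retrieval-agent loop. `call_model`
    returns either a tool request ({"tool": ..., "args": ...}) or a
    final answer ({"answer": ...}) once it has enough context. The
    same loop serves Agent + RAG and Agent + Wire; only the tool set
    offered to the model differs."""
    context = []
    for _ in range(MAX_TURNS):
        step = call_model(question, context)
        if "answer" in step:
            # The agent believes the gathered context is sufficient.
            return step["answer"], len(context)
        context.append(call_tool(step["tool"], step["args"]))
    # Out of turns without an answer: decline rather than guess.
    return None, len(context)
```

Multi-turn iteration is what lets the agent chase the second and third pieces of a cross-document question instead of settling for one retrieval pass.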
| | RAG (single-shot) | Wire Semantic (single-shot) | Agent + RAG | Agent + Wire |
|---|---|---|---|---|
| Correctness | 2.98 | 3.73 | 4.53 | 4.78 |
| Completeness | 2.56 | 3.22 | 3.97 | 4.05 |
| Answered | 41 / 64 | 51 / 64 | 60 / 64 | 62 / 64 |
| Avg tokens per turn | n/a | n/a | 4,369 | 2,918 |
Agent + Wire answered 97% of all questions. Single-pass RAG managed 64%. The agent used an average of 5.5 tool calls per query, exploring the schema, searching structured and unstructured data, and iterating based on what it found.
Cross-document correctness jumped from 1.40 (RAG) to 4.40 (Agent + Wire). Questions that require connecting ideas across multiple sources went from mostly unanswerable to near-perfect.
What this means
Wire containers are the best place to put data that your AI tools need to access. Upload your files once, share them with your team, and every AI tool connected to Wire gets better answers.
Same 13,643 entries. Same 64 questions. In single-shot retrieval, Wire delivered 25% better correctness from roughly the same token budget. When an agent had access to the full container, it answered 62 of 64 questions, including cross-document questions that single-pass retrieval couldn't touch. And each tool call returned 33% fewer tokens than the RAG equivalent, keeping per-turn costs low even as the agent explored more.
Better retrieval quality at every level: single-shot, agentic, and per-token. That's what your data gets you when it lives in Wire.
How we ran this
Dataset. 287 files (mostly podcast transcripts, plus product development, engineering, growth anecdotes, user research, and other real-world materials) processed into 13,643 entries in a single Wire container. 64 questions spanning factual recall, multi-step reasoning, and cross-document synthesis.
Four pipelines. RAG (pure vector search), Wire Semantic (Wire's single-shot retrieval), Agent + RAG (agent with vector search tool), and Agent + Wire (agent with full Wire tool suite). All four retrieved from the same dataset.
Answer generation. For single-shot pipelines, retrieved context was passed to Gemini 3 Flash to generate an answer. The model was instructed to decline rather than guess when the context was insufficient. For agentic pipelines, a custom retrieval agent (max 7 turns, encouraged to decompose complex problems and stop when it had enough context) retrieved and answered in the same loop.
Evaluation. A separate Gemini 3 Flash call scored each answer on correctness, completeness, and faithfulness (1 to 5). The evaluator received the question, the expected answer, and the generated answer, but not which pipeline produced it.
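The blinding described above amounts to a constraint on what the evaluator payload contains. A sketch with illustrative field names (the actual evaluator prompt is not shown in this post):

```python
def build_eval_payload(question: str, expected: str, generated: str) -> dict:
    """Assemble what the evaluator sees: the question, the expected
    answer, and the generated answer. Deliberately no pipeline field,
    so scoring cannot favor one retrieval approach by name."""
    return {
        "question": question,
        "expected_answer": expected,
        "generated_answer": generated,
    }
```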
Same evaluation for every pipeline. All four pipelines went through identical scoring. The only variable was how context was retrieved.
See it with your own data
Upload a few files and let Wire transform them into AI-ready context.
3,000 free credits. No credit card required.
Create Your First Container