What we tested
Each retrieval pipeline was given a question and retrieved context from the same dataset; a non-frontier model (2026) then generated an answer strictly from that retrieved context. An independent evaluator scored each answer on three dimensions:
- **Correctness.** Does the answer contain the right information?
- **Completeness.** Does it cover all key aspects of the expected answer?
- **Faithfulness.** Was the answer generated from the retrieved context, not from the model's training data? This is a grounding check, not an accuracy check: a score of 5 means the model used only what was retrieved, even if what was retrieved was wrong.
Answered vs. declined: Both pipelines always returned context from the retrieval step, but when that context wasn't relevant enough to answer the question, the model correctly declined rather than guessing. "Declined to answer" therefore means the retrieval pipeline returned results, just nothing useful. Declined answers are scored 0 for correctness, 0 for completeness, and 5 for faithfulness. The gap between the two pipelines' "answered" counts shows where one returned relevant context and the other didn't.
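The scoring rule for declines can be made concrete. This is a minimal sketch of the rubric as described above, not the evaluator's actual implementation; the `Score` type and `mean` helper are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Score:
    correctness: int   # 1-5: does the answer contain the right information?
    completeness: int  # 1-5: does it cover all key aspects?
    faithfulness: int  # 1-5: grounded only in retrieved context?

def score_declined() -> Score:
    # A decline means retrieval returned results but nothing useful:
    # no credit for content, full credit for not guessing.
    return Score(correctness=0, completeness=0, faithfulness=5)

def mean(scores: list[Score], field: str) -> float:
    # Declines are averaged in, which is why many declines drag down
    # a pipeline's overall correctness and completeness.
    return sum(getattr(s, field) for s in scores) / len(scores)
```

Because declines count as 0/0/5, a pipeline that declines often can still post near-perfect faithfulness while its correctness average falls.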
The approaches
RAG (Vector Search)
Standard retrieval-augmented generation. Embed the query, return the closest matching chunks. This is what most AI systems do today.
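The retrieval step reduces to nearest-neighbor search over chunk embeddings. A minimal pure-Python sketch (a real pipeline uses an embedding model and a vector index, neither shown here):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec: list[float], chunks: list[tuple[list[float], str]],
             k: int = 3) -> list[str]:
    """Return the text of the k chunks whose embeddings sit closest
    to the query embedding. `chunks` is a list of (embedding, text)
    pairs produced by whatever embedding model the pipeline uses."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[0]), reverse=True)
    return [text for _, text in ranked[:k]]
```

The weakness discussed later follows directly from this design: "closest vector" is not the same as "most useful for answering the question."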
Wire Semantic Search
Wire's retrieval, accessed through a single wire_search tool call. The container's architecture (entity extraction, schema discovery, and intelligent indexing) handles the retrieval.
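From the model's side, the whole pipeline is one tool call. A sketch of what that call might look like; the argument names here are illustrative, not Wire's actual tool schema:

```python
def make_wire_search_call(container_id: str, query: str) -> dict:
    """Build the single wire_search tool call the single-shot pipeline
    issues. Everything else (entity extraction, schema discovery,
    indexing) happens inside the container, not in the caller."""
    return {
        "tool": "wire_search",
        "arguments": {"container": container_id, "query": query},
    }
```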
Results: RAG vs Wire Semantic
Overall scores across all 64 questions.
| | RAG | Wire Semantic |
|---|---|---|
| Correctness | 2.98 / 5 | 3.73 / 5 |
| Completeness | 2.56 / 5 | 3.22 / 5 |
| Faithfulness | 5.00 / 5 | 4.97 / 5 |
| Answered from context | 41 / 64 | 51 / 64 |
| Declined to answer | 23 / 64 | 13 / 64 |
| Avg context tokens | 3,596 | 3,468 |
Both pipelines scored near-perfect faithfulness (5.00 and 4.97). The model wasn't using training data in either case. The difference is entirely in retrieval quality.
Both pipelines returned roughly the same volume of context per query (~3,500 tokens), but Wire's context was more relevant: 25% better correctness from the same token budget. More signal per token.
Results by question type
Why cross-document questions are hard
Cross-document questions require connecting information across multiple sources. A single retrieval call returns results for one query, which often misses the second or third piece of the puzzle. Both pipelines struggled here. RAG answered 8 of 20, Wire answered 10 of 20. Single-pass retrieval, whether RAG or Wire Semantic, has fundamental limits when the answer lives across multiple documents. This is where agentic retrieval changes the picture.
Token efficiency
Wire returns the same amount of context as RAG. The difference is what's in it.
Both pipelines return roughly 3,500 tokens of context per query. But Wire's context produces 25% better answers because those tokens are more relevant.
RAG retrieves the closest vector matches, which often include loosely related passages that dilute the useful information. Wire's container architecture (entity extraction, schema discovery, intelligent indexing) surfaces more precisely targeted context. Same token budget, more signal per token.
For teams monitoring token costs, this matters. You're not paying for more context. You're getting better context.
What happens when an agent uses Wire
Wire containers are designed for AI agents. Here's what happens when one actually uses all the tools.
We gave an AI agent access to the full suite of Wire container tools: wire_explore for schema discovery and wire_search for all retrieval. The agent explored the container, queried data across multiple turns, and iterated based on what it found. The results come from every Wire tool working in concert, not any single tool in isolation.
The agent: A custom retrieval agent with a maximum of 7 turns. It was given the available tools, encouraged to break complex problems into smaller queries, and told to stop when it believed it had enough context to answer. The same agent was used for both Agent + RAG and Agent + Wire; the only difference was which tools it had access to.
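The agent loop described above can be sketched as follows. This is an assumed shape, not the benchmark's actual implementation; `call_model` and `call_tool` are hypothetical callables standing in for the LLM and the tool executor:

```python
MAX_TURNS = 7  # the benchmark's turn cap

def run_agent(question, call_model, call_tool):
    """Minimal sketch of a capped retrieval-agent loop. `call_model`
    returns either a tool request ({"tool": ..., "args": ...}) or a
    final answer ({"answer": ...}) once it has enough context. The
    same loop serves Agent + RAG and Agent + Wire; only the tool set
    offered to the model differs."""
    context = []
    for _ in range(MAX_TURNS):
        step = call_model(question, context)
        if "answer" in step:
            # The agent believes the gathered context is sufficient.
            return step["answer"], len(context)
        context.append(call_tool(step["tool"], step["args"]))
    # Out of turns without an answer: decline rather than guess.
    return None, len(context)
```

Multi-turn iteration is what lets the agent chase the second and third pieces of a cross-document question instead of settling for one retrieval pass.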
| | RAG (single-shot) | Wire Semantic (single-shot) | Agent + RAG | Agent + Wire |
|---|---|---|---|---|
| Correctness | 2.98 | 3.73 | 4.53 | 4.78 |
| Completeness | 2.56 | 3.22 | 3.97 | 4.05 |
| Answered | 41 / 64 | 51 / 64 | 60 / 64 | 62 / 64 |
| Avg tokens per turn | n/a | n/a | 4,369 | 2,918 |
Agent + Wire answered 97% of all questions. Single-pass RAG managed 64%. The agent used an average of 5.5 tool calls per query, exploring the schema, searching structured and unstructured data, and iterating based on what it found.
Cross-document correctness jumped from 1.40 (RAG) to 4.40 (Agent + Wire). Questions that require connecting ideas across multiple sources went from mostly unanswerable to near-perfect.
What this means
Wire containers are the best place to put data that your AI tools need to access. Upload your files once, share them with your team, and every AI tool connected to Wire gets better answers.
Same 13,643 entries. Same 64 questions. In single-shot retrieval, Wire delivered 25% better correctness from roughly the same token budget. When an agent had access to the full container, it answered 62 of 64 questions, including cross-document questions that single-pass retrieval couldn't touch. And each tool call returned 33% fewer tokens than the RAG equivalent, keeping per-turn costs low even as the agent explored more.
Better retrieval quality at every level: single-shot, agentic, and per-token. That's what your data gets you when it lives in Wire.
How we ran this
Dataset. 287 files (mostly podcast transcripts, plus product development, engineering, growth anecdotes, user research, and other real-world materials) processed into 13,643 entries in a single Wire container. 64 questions spanning factual recall, multi-step reasoning, and cross-document synthesis.
Four pipelines. RAG (pure vector search), Wire Semantic (Wire's single-shot retrieval), Agent + RAG (agent with vector search tool), and Agent + Wire (agent with full Wire tool suite). All four retrieved from the same dataset.
Answer generation. For single-shot pipelines, retrieved context was passed to Gemini 3 Flash to generate an answer. The model was instructed to decline rather than guess when the context was insufficient. For agentic pipelines, a custom retrieval agent (max 7 turns, encouraged to decompose complex problems and stop when it had enough context) retrieved and answered in the same loop.
Evaluation. A separate Gemini 3 Flash call scored each answer on correctness, completeness, and faithfulness (1 to 5). The evaluator received the question, the expected answer, and the generated answer, but not which pipeline produced it.
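The blinding described above amounts to a constraint on what the evaluator payload contains. A sketch with illustrative field names (the actual evaluator prompt is not shown in this post):

```python
def build_eval_payload(question: str, expected: str, generated: str) -> dict:
    """Assemble what the evaluator sees: the question, the expected
    answer, and the generated answer. Deliberately no pipeline field,
    so scoring cannot favor one retrieval approach by name."""
    return {
        "question": question,
        "expected_answer": expected,
        "generated_answer": generated,
    }
```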
Same evaluation for every pipeline. All four pipelines went through identical scoring. The only variable was how context was retrieved.
See it with your own data
Upload a few files and let Wire transform them into AI-ready context.
3,000 free credits. No credit card required.
Create Your First Container