How to measure context quality for AI agents
Roughly 70% of retrieved passages don’t directly contain the information needed to answer a query, even when the retrieval system returns results that look relevant. The search “worked” in the sense that it returned documents. But the context it delivered wasn’t useful.
The reason usually has less to do with the database or the index and more to do with how the search understands the query in the first place. Most AI retrieval systems still rely on approaches that match words rather than meaning. That gap between what the user means and what the search matches is where context quality breaks down.
Traditional keyword search, powered by algorithms like BM25 and TF-IDF, works by matching terms in the query against terms in the document. It counts how often a word appears, how rare it is across the corpus, and ranks results accordingly. This approach is fast, predictable, and works well when the user and the document use the same vocabulary.
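To make the term-matching idea concrete, here is a minimal TF-IDF scoring sketch. The toy corpus and the scoring function are illustrative only, not a production BM25 implementation:

```python
import math
from collections import Counter

docs = [
    "customer churn prediction with gradient boosting",
    "attrition forecasting for subscription businesses",
    "reduce api latency with caching",
]

def tfidf_score(query, doc, corpus):
    """Score a document against a query: term frequency weighted by term rarity (IDF)."""
    doc_terms = Counter(doc.split())
    n_docs = len(corpus)
    score = 0.0
    for term in query.split():
        tf = doc_terms[term]                                # how often the term appears here
        df = sum(1 for d in corpus if term in d.split())    # how many docs contain it
        if tf and df:
            score += tf * math.log(n_docs / df)             # rare terms count for more
    return score

# "churn" matches doc 0 but scores zero against doc 1,
# even though both documents cover the same topic.
ranked = sorted(docs, key=lambda d: tfidf_score("customer churn", d, docs), reverse=True)
```

The zero score for the "attrition forecasting" document is exactly the vocabulary-mismatch failure described above: no shared terms, no match.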
The problem is that they often don’t.
Search for “customer churn prediction” and keyword search misses documents about “attrition forecasting” or “retention modeling,” even though they’re about the same thing. Search for “how to reduce API latency” and you might miss a document titled “optimizing request throughput” that contains exactly the answer you need.
For AI agents, this mismatch is worse than it is for humans. When a person gets imperfect search results, they can scan, interpret, and reformulate. When an agent gets imperfect results, those results become the context for its next action. Bad retrieval becomes bad context, and bad context becomes a bad output. This is the mechanism behind many AI hallucinations: the model generates a plausible answer from incomplete or irrelevant context rather than admitting it doesn’t have the information.
Semantic search takes a fundamentally different approach. Instead of matching words, it matches meaning.
The core mechanism is vector embeddings. An embedding model (a neural network trained on large text corpora) converts text into a high-dimensional vector: a list of numbers that represents the text’s meaning in mathematical space. Similar meanings cluster together. “Customer churn prediction” and “attrition forecasting” end up as nearby points, even though they share no words.
When you run a semantic search, the system converts your query into a vector, then finds the document vectors closest to it using a distance metric like cosine similarity. The result is a ranked list of passages ordered by conceptual relevance, not keyword overlap.
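The geometry can be sketched with hand-picked toy vectors. Real embedding models produce hundreds or thousands of dimensions; these 4-dimensional vectors and the query vector are invented purely to illustrate cosine similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": the two churn-related texts point the same way,
# the latency text points somewhere else entirely.
embeddings = {
    "customer churn prediction": [0.9, 0.8, 0.1, 0.0],
    "attrition forecasting":     [0.8, 0.9, 0.2, 0.1],
    "api latency optimization":  [0.1, 0.0, 0.9, 0.8],
}

query_vec = [0.85, 0.85, 0.15, 0.05]  # pretend this embeds "predict which users will leave"
ranked = sorted(embeddings, key=lambda t: cosine_similarity(query_vec, embeddings[t]), reverse=True)
```

Both churn-related texts rank above the latency text despite sharing no words with the query, which is the behavior keyword matching cannot provide.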
The embedding model landscape has matured rapidly. The MTEB benchmark now tracks over 2,000 embedding models across standardized evaluation tasks. Google’s Gemini Embedding leads the English leaderboard, with NVIDIA’s open-weight models close behind. The top five models by raw score are now all either open-weight or very cheap to use, a shift from a year ago when commercial APIs dominated.
Top embedding models can improve retrieval accuracy by 10-30% compared to older or smaller models. That improvement flows directly into better context for the LLM, fewer irrelevant results, and more reliable RAG performance.
We tested this directly. In Wire’s retrieval benchmarks, we compared standard RAG (vector search over embedded chunks) against Wire’s semantic search across 64 questions and 13,643 entries from 287 real-world files.
The results: Wire outperformed standard RAG in both question categories, and the gap was widest on factual recall. Wire scored 4.50 vs RAG's 3.36 and answered 20 of 22 questions compared to RAG's 15. These are the queries where finding the right passage matters most, and where keyword-dependent retrieval is most likely to miss.
Both approaches struggled with cross-document synthesis (connecting information across multiple files), scoring 1.80 and 1.40 respectively on single-pass retrieval. But when an agent could make multiple search calls iteratively, correctness jumped to 4.40. The search quality still mattered: the agent needed good results on each individual query to build up the full picture.
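The iterative pattern can be sketched as a simple loop. The `search` callable and the query-refinement rule here are illustrative placeholders, not Wire's actual API:

```python
def iterative_retrieve(question, search, max_rounds=3):
    """Accumulate context over several search calls, refining the query each round.

    `search(query)` is a stand-in for any retrieval backend; it returns a list
    of passage strings. Each round keeps only new passages and uses the latest
    find to sharpen the next query.
    """
    context, query = [], question
    for _ in range(max_rounds):
        results = [p for p in search(query) if p not in context]
        if not results:     # nothing new surfaced: stop early
            break
        context.extend(results)
        # Naive refinement: re-ask with the most recent passage as extra signal.
        query = f"{question} {results[0]}"
    return context

# Toy backend where each refined query surfaces the next piece of the answer.
steps = {"q1": ["fact A"], "q1 fact A": ["fact B"], "q1 fact B": []}
found = iterative_retrieve("q1", lambda q: steps.get(q, []))
```

Note that each round still depends on a single good retrieval: the loop only builds the full picture if every individual query returns something useful, which matches the benchmark observation above.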
The takeaway: better search produces better context, and better context produces better answers. The model was the same in every test. The only variable was what context the retrieval delivered.
Semantic search has a specific blind spot: exact matches.
Ask a semantic search system for error code ERR_SSL_PROTOCOL_ERROR and it might return results about SSL configuration or TLS troubleshooting, which are conceptually related but miss the specific error code you need. Search for API endpoint /v2/containers/{id}/entries and you’ll get results about container management generally, not the specific endpoint documentation.
This is where keyword search excels. It matches exactly what you type. For product names, error codes, configuration keys, version numbers, and technical identifiers, keyword matching is more reliable than semantic similarity.
The production solution is hybrid search: running both approaches simultaneously and combining the results. Meilisearch benchmarks show hybrid search improves retrieval accuracy by 37% in technical domains overall, 41.2% for developer assistants, and 33.4% for legal and compliance queries. 63% of new enterprise RAG systems now use hybrid search, up from 28% in early 2023.
The pattern is consistent: semantic search handles intent and conceptual queries, keyword search handles precision lookups, and the fusion of both outperforms either alone.
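A common way to fuse the two result lists is reciprocal rank fusion (RRF). The sketch below assumes each backend returns a ranked list of document IDs; the example lists are invented:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked result lists. Each document earns 1/(k + rank) per
    list it appears in; k=60 is the constant from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Keyword search nails the exact error-code page; semantic search surfaces
# conceptually related guides. Fusion promotes the document both agree on.
keyword_results = ["err_ssl_reference", "ssl_faq"]
semantic_results = ["tls_troubleshooting", "err_ssl_reference", "cert_rotation"]
fused = reciprocal_rank_fusion([keyword_results, semantic_results])
```

Because `err_ssl_reference` appears in both lists, it accumulates two reciprocal-rank contributions and rises to the top, which is how fusion rewards agreement between the two retrieval modes.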
Search strategy is a context engineering decision. It determines what context reaches your model, which determines output quality. The same LLM with the same prompt produces different results depending on whether the retrieval surfaced the right passages.
A few practical guidelines:
Audit your retrieval, not just your prompts. Look at what context actually reaches the model for real queries. If relevant information exists in your corpus but isn’t being retrieved, the problem is search, not the model.
Use hybrid search in production. Pure semantic search misses exact matches. Pure keyword search misses conceptual connections. The combination handles both. Most vector databases (Pinecone, Weaviate, Qdrant) now support hybrid modes natively.
Test with agent-generated queries. Agents don’t search like humans. They generate natural-language queries that are often more verbose and conceptual than what a person would type. Your search system needs to handle that query style well.
Treat retrieval as the first filter in your context pipeline. Everything downstream, from context rot to attention dilution to hallucination, gets worse when retrieval delivers low-quality context. Improving search is often the highest-leverage change you can make.
Wire containers use semantic search with entity extraction and schema discovery to go beyond raw vector similarity. The benchmarks show the difference that makes on real-world data. But regardless of tooling, the principle holds: the quality of your search determines the quality of your context, and the quality of your context determines the quality of your AI.
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container