How to measure context quality for AI agents

Jitpal Kocher · 10 min read

LangChain’s State of Agent Engineering report surveyed over 1,300 professionals and found that 89% of teams have observability for their AI agents. Only 52% have evaluations. That gap matters because most teams are watching what their agents do without measuring what their agents receive. They track latency, token usage, and task completion. They rarely measure whether the context itself was any good.

This is a problem because context quality is the single largest lever on agent performance. Research on maximum effective context windows found that some top models fail with as few as 100 tokens of wrong context. Not 100,000 tokens. One hundred. The context window isn’t a bucket you fill. It’s a signal channel where noise has real costs.

When 32% of teams cite quality as the top barrier to putting agents in production, the question shouldn’t be “is our model good enough?” It should be “is our context good enough?” Here’s how to answer that.

Five dimensions of context quality

Context quality isn’t a single number. It’s a composite of five measurable dimensions, each capturing a different way context can fail. Miss any one of them and your agent’s output degrades, even if the other four are perfect.

1. Correctness

Does the retrieved context contain the right information to answer the query?

This is the most intuitive dimension. If someone asks “what’s our refund policy?” and the retrieval pipeline returns the shipping policy instead, the context is incorrect regardless of how well-structured or concise it is.

Correctness is typically scored on a 1-5 scale by comparing the agent’s answer against a known-good reference answer. The score reflects whether the retrieved context contained the information needed to produce a correct response.

In a retrieval benchmark we ran across 64 questions and 13,643 entries from 287 real-world files, correctness was where context quality differences showed up most clearly. Standard RAG scored 2.98/5 on correctness. Wire’s semantic retrieval scored 3.73/5 on the same questions, from the same dataset, using roughly the same token budget (~3,500 tokens per query). Same model, same data, same token count. The only variable was retrieval quality. That 25% gap is entirely a context quality gap.

The gap widened further by question type. On factual recall, RAG scored 3.36 while Wire semantic search scored 4.50. On cross-document synthesis, where the answer requires connecting information from multiple sources, RAG dropped to 1.40 and Wire managed 1.80. Single-pass retrieval hits a ceiling when completeness requires pulling from multiple places.

2. Completeness

Does the context cover all the aspects needed to fully answer the query?

A context passage can be correct but incomplete. If someone asks “compare our Q1 and Q2 revenue,” and retrieval returns only the Q1 figures, the context is correct (what it says is true) but incomplete (it’s missing half the answer).

Completeness is measured by comparing the agent’s response against all the key aspects of a reference answer. Did the response cover every relevant point, or did it miss pieces because the retrieval pipeline didn’t surface them?
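The aspect-coverage idea can be sketched in a few lines. In practice you would ask an LLM judge whether each aspect is covered; the naive substring check below is a stand-in for illustration, and the aspects and answer are invented examples.

```python
# Minimal sketch: completeness as the fraction of reference-answer aspects
# the response covers. A real pipeline would use an LLM judge per aspect;
# the substring check here is only an illustrative stand-in.

def completeness_score(response: str, reference_aspects: list[str]) -> float:
    """Return the fraction of key aspects mentioned in the response."""
    text = response.lower()
    covered = sum(1 for aspect in reference_aspects if aspect.lower() in text)
    return covered / len(reference_aspects) if reference_aspects else 1.0

aspects = ["q1 revenue", "q2 revenue", "quarter-over-quarter change"]
answer = "Q1 revenue was $2.1M. We have no Q2 figures in the provided context."
print(completeness_score(answer, aspects))  # covers 1 of 3 aspects
```

The answer above is correct but scores low on completeness, which is exactly the distinction this dimension exists to capture.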

This is the dimension most affected by retrieval breadth. In our benchmark, completeness scores followed the same pattern as correctness: RAG scored 2.56/5 versus 3.22/5 for Wire semantic search. But the more revealing finding came from the agentic tests. When an AI agent was given the ability to make multiple retrieval calls (up to 7 turns), completeness jumped to 3.97 for Agent + RAG and 4.05 for Agent + Wire. Agents compensate for single-pass incompleteness by iterating: querying, reviewing what’s missing, and querying again.

This suggests a practical measurement insight: if your completeness scores improve significantly when you let an agent iterate, your single-pass retrieval is leaving information on the table.

3. Faithfulness

Did the model use the retrieved context, or did it draw from its training data?

Faithfulness (sometimes called groundedness) is a grounding check, not an accuracy check. A faithfulness score of 5/5 means the model only used what was retrieved, even if what was retrieved was wrong. This dimension matters because unfaithful responses mask context quality problems. If the model silently fills gaps from its training data, you won't notice that your retrieval pipeline missed critical information until the day the training data happens to be wrong.

In our benchmark, both RAG and Wire semantic search scored near-perfect faithfulness (5.00 and 4.97). This is expected when you instruct the model to answer strictly from provided context and accept “I don’t know” as a valid response. The interesting signal is in the “declined to answer” counts: RAG declined on 23 of 64 questions while Wire declined on only 13. When faithfulness is enforced, retrieval quality shows up as coverage: better context means fewer questions the model has to punt on.
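Enforcing faithfulness and tracking declines is straightforward to operationalize. A sketch, where the system prompt wording and the decline-matching logic are assumptions you should adapt to your own setup:

```python
# Sketch: enforce faithfulness via the system prompt, then track how often
# the model punts. The prompt wording and decline check are illustrative.

GROUNDED_SYSTEM_PROMPT = (
    "Answer strictly from the provided context. "
    "If the context does not contain the answer, reply exactly: I don't know."
)

def is_decline(answer: str) -> bool:
    return answer.strip().lower().rstrip(".") == "i don't know"

def decline_rate(answers: list[str]) -> float:
    """Fraction of questions the model punted on; a retrieval-coverage proxy."""
    return sum(is_decline(a) for a in answers) / len(answers)

# Decline counts from the benchmark: RAG 23/64, Wire 13/64.
print(round(23 / 64, 2), round(13 / 64, 2))  # 0.36 0.2
```

With faithfulness pinned near 5/5 for both systems, the decline rate becomes the discriminating metric.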

4. Relevance

What is the signal-to-noise ratio in the retrieved context?

Relevance measures how much of the retrieved context actually contributes to answering the query versus how much is filler, near-duplicates, or tangentially related passages.

This dimension is harder to score directly, but its effects are well-documented. Research on context rot shows that model accuracy drops from 95% to 60-70% as input length grows, even on simple retrieval tasks. The mechanism is attention dilution: irrelevant context competes for the model’s finite attention budget, pushing useful information out of focus. The lost-in-the-middle effect compounds this, as models attend disproportionately to the beginning and end of their context, letting relevant information in the middle get overlooked.

You can measure relevance indirectly through two proxies. First, compare performance at different retrieval depths (top-3 vs. top-10 vs. top-20 chunks). If scores drop as you retrieve more, you’re adding noise faster than signal. Second, measure token efficiency: the ratio of quality score to tokens consumed. In our benchmark, RAG used 3,596 tokens per query for a correctness of 2.98, while Wire semantic search used 3,468 tokens for 3.73. That’s 25% better correctness from 4% fewer tokens, a direct measure of higher relevance per token.
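The token-efficiency proxy is simple arithmetic. Using the benchmark figures quoted above:

```python
# Relevance proxy: quality points per 1,000 context tokens.
# Scores and token counts below are the benchmark figures from the text.

def token_efficiency(score: float, tokens: int) -> float:
    """Quality score per 1,000 tokens of retrieved context."""
    return score / tokens * 1000

rag = token_efficiency(2.98, 3596)
wire = token_efficiency(3.73, 3468)
print(round(rag, 3), round(wire, 3))  # 0.829 1.076
print(f"{wire / rag - 1:.0%} more quality per token")  # 30% more quality per token
```

Per token, the gap is even wider than the raw 25% correctness gap, because the better retriever also used slightly fewer tokens.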

5. Freshness

Is the context current enough for the query?

Freshness is a temporal dimension that most evaluation frameworks ignore. If someone asks about a customer’s current subscription tier and the retrieval pipeline returns a record from six months ago, the context may be correct relative to that snapshot but wrong relative to the present.

Freshness matters most for operational queries (current status, recent changes, live data) and least for reference queries (how does X work, what does this term mean). The practical challenge is that freshness isn’t a property of the retrieval algorithm. It’s a property of the context pipeline: how frequently data is ingested, how stale records are handled, and whether the system tracks when information was last updated.

You can measure freshness by including time-sensitive questions in your evaluation set and tracking whether the system returns current data or stale snapshots. The metric is the percentage of time-sensitive queries that return data within an acceptable recency window.
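As a sketch, the freshness metric reduces to comparing record timestamps against a recency window. The `last_updated` timestamps are assumed to come from your ingestion pipeline's metadata:

```python
# Sketch: freshness as the share of retrieved records newer than an
# acceptable recency window. Timestamps are assumed pipeline metadata.

from datetime import datetime, timedelta

def freshness_rate(last_updated: list[datetime],
                   now: datetime,
                   max_age: timedelta) -> float:
    """Fraction of retrieved records within the recency window."""
    fresh = sum(1 for ts in last_updated if now - ts <= max_age)
    return fresh / len(last_updated)

now = datetime(2025, 6, 1)
retrieved = [datetime(2025, 5, 28), datetime(2024, 12, 1), datetime(2025, 5, 1)]
print(freshness_rate(retrieved, now, timedelta(days=90)))  # 2 of 3 records fresh
```

Aggregated over only the time-sensitive questions in your evaluation set, this gives the percentage metric described above.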

How to measure: three approaches

LLM-as-judge

The most scalable approach. A separate model (typically a frontier model) evaluates the agent’s output against a reference answer across the dimensions above. LangChain’s report found that 53.3% of teams use LLM-as-judge for evaluation. The trade-off is consistency: LLM judges can be biased toward verbose answers or penalize correct but concise responses. Calibrate with human review on a sample.
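A judge prompt for the dimensions above might look like the following. The rubric wording and the JSON output shape are assumptions; wire the rendered prompt into whichever model SDK you use.

```python
# Sketch of an LLM-as-judge prompt builder. The rubric text and output
# format are illustrative; the dimensions mirror the ones in this article.

JUDGE_PROMPT = """You are grading an AI agent's answer against a reference.
Score each dimension from 1 (fails) to 5 (excellent):
- correctness: does the answer match the reference facts?
- completeness: does it cover every key aspect of the reference?
- faithfulness: is every claim grounded in the provided context?
Return JSON: {{"correctness": n, "completeness": n, "faithfulness": n}}

Question: {question}
Reference answer: {reference}
Retrieved context: {context}
Agent answer: {answer}"""

def build_judge_prompt(question: str, reference: str,
                       context: str, answer: str) -> str:
    return JUDGE_PROMPT.format(question=question, reference=reference,
                               context=context, answer=answer)
```

Sending this to a frontier model and parsing the JSON gives you per-dimension scores you can aggregate across the question set.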

Human evaluation

The gold standard but the least scalable. 59.8% of teams still use human review for high-stakes evaluations. Best used for calibrating automated metrics, catching edge cases, and evaluating dimensions like relevance that are harder to automate. A practical pattern is to human-evaluate 10-20% of your test set and use those scores to validate your LLM-as-judge alignment.

Retrieval metrics

Traditional information retrieval metrics like precision@k and recall@k measure retrieval quality directly, before the model generates a response. Precision@k tells you what fraction of retrieved chunks were relevant. Recall@k tells you what fraction of all relevant chunks were retrieved. These are useful for tuning your retrieval pipeline independently of the generation model, but they don’t capture downstream effects on answer quality.
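These two metrics are standard IR definitions and need no external libraries; a minimal implementation over retrieved chunk IDs, with invented example IDs:

```python
# Precision@k and recall@k over retrieved chunk IDs, given a labeled set
# of relevant chunks per query. Standard IR definitions.

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks found in the top k."""
    return sum(1 for c in retrieved[:k] if c in relevant) / len(relevant)

retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}
print(precision_at_k(retrieved, relevant, 5))  # 2 relevant in top 5 -> 0.4
print(recall_at_k(retrieved, relevant, 5))     # 2 of 3 relevant found
```

Sweeping k (3, 10, 20) with these functions is also how you run the retrieval-depth relevance check described earlier.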

The most practical approach combines all three: automated retrieval metrics for fast iteration on the pipeline, LLM-as-judge for broad coverage, and human evaluation for calibration.

Building a context quality baseline

You don’t need a massive evaluation framework to start measuring. Here’s a minimal approach.

Create a question set. Start with 30-50 questions at known difficulty levels: easy (single fact, single source), medium (multi-step reasoning, single source), and hard (cross-document synthesis). Write reference answers for each one. In our benchmark, we used 64 questions across three tiers and found that the difficulty distribution revealed more about retrieval quality than any single aggregate score.
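A question set needs very little structure. One possible shape, with invented example questions, that also carries the freshness flag used later:

```python
# Sketch: a minimal evaluation question with a difficulty tier, mirroring
# the easy/medium/hard split described above. Example content is invented.

from dataclasses import dataclass

@dataclass
class EvalQuestion:
    question: str
    reference_answer: str
    tier: str                    # "easy" | "medium" | "hard"
    time_sensitive: bool = False # include in freshness scoring?

question_set = [
    EvalQuestion("What is the standard refund window?",
                 "30 days from delivery.", tier="easy"),
    EvalQuestion("Compare Q1 and Q2 revenue.",
                 "Q1 was $2.1M, Q2 was $2.4M, up 14%.", tier="hard"),
]

by_tier: dict[str, list[EvalQuestion]] = {}
for q in question_set:
    by_tier.setdefault(q.tier, []).append(q)
print({tier: len(qs) for tier, qs in by_tier.items()})
```

Reporting scores per tier, not just in aggregate, is what surfaces the cross-document synthesis gap the benchmark found.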

Score on the five dimensions. For each question, run your retrieval pipeline, generate an answer, and score correctness, completeness, faithfulness, relevance (via token efficiency), and freshness (for time-sensitive questions). Use an LLM-as-judge with a scoring rubric. Even a 1-5 scale with clear anchors for each level produces useful signal.
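"Clear anchors" can be as simple as a dict the judge prompt embeds verbatim. The anchor wording below is illustrative; adapt it to your domain:

```python
# Illustrative 1-5 anchors for the correctness dimension. A judge prompt
# can include the rendered text directly.

CORRECTNESS_ANCHORS = {
    5: "Fully matches the reference answer; no factual errors.",
    4: "Matches on all major points; one minor omission or imprecision.",
    3: "Partially correct; a significant gap or error.",
    2: "Mostly wrong or off-topic, with a fragment of correct content.",
    1: "Incorrect or unrelated to the reference answer.",
}

def rubric_text(anchors: dict[int, str]) -> str:
    """Render anchors as lines a judge prompt can include verbatim."""
    return "\n".join(f"{s}: {d}" for s, d in sorted(anchors.items()))

print(rubric_text(CORRECTNESS_ANCHORS))
```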

Measure before and after. The value of context quality metrics isn’t the absolute score, it’s the delta. Score your baseline, make a pipeline change (different chunking strategy, better embedding model, semantic search), and score again. Our benchmark found that switching from standard RAG to Wire’s retrieval (which combines entity extraction, schema discovery, and intelligent indexing behind a single semantic search call) improved correctness by 25% without changing the model or the token budget. Without measurement, that improvement would have been invisible.

Track degradation. Context quality isn’t static. As your data grows, ages, or shifts, retrieval quality can degrade silently. Rerun your evaluation set monthly and watch for downward trends, especially in completeness and freshness. This is context rot at the pipeline level: your system works today but slowly gets worse as the world changes and your context doesn’t.
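A simple way to operationalize this, assuming you store per-run dimension averages: flag any dimension whose score has fallen monotonically over the last few runs.

```python
# Sketch: flag dimensions whose scores fell monotonically over the last
# `runs` evaluation runs. History entries are per-run dimension averages.

def declining_dimensions(history: list[dict], runs: int = 3) -> list[str]:
    """Dimensions that declined on every one of the last `runs` runs."""
    if len(history) < runs:
        return []
    recent = history[-runs:]
    flagged = []
    for dim in recent[0]:
        scores = [run[dim] for run in recent]
        if all(b < a for a, b in zip(scores, scores[1:])):
            flagged.append(dim)
    return flagged

history = [
    {"completeness": 3.2, "freshness": 4.5},
    {"completeness": 3.0, "freshness": 4.6},
    {"completeness": 2.8, "freshness": 4.4},
]
print(declining_dimensions(history))  # ['completeness']
```

A monotone decline across several monthly runs is a stronger alarm than any single low score, which can be noise.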

What the benchmarks keep showing

Across our retrieval benchmarks and the broader research, a consistent pattern emerges: the ceiling on AI agent performance is set by context quality, not model capability. The same model produces wildly different results depending on what context it receives. Improving context, whether through better structure, smarter retrieval, or iterative agent workflows, consistently outperforms model upgrades.

But you can’t improve what you don’t measure. Teams that treat context as a black box and only evaluate model outputs are debugging the wrong layer. Define your quality dimensions, build a question set, score your baseline, and track changes over time. The gap between “it seems to work” and “we know it works, and here’s why” is a measurement problem. And it’s solvable.

Ready to give your AI agents better context?

Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.

Create Your First Container