When agent memory needs sleep, and when it doesn't
Key takeaway
Good context for AI agents meets five design criteria: relevance, sufficiency, isolation, economy, and provenance. The framework comes from a March 2026 paper that treats context as the agent's operating system and places context engineering as the second rung of a four-discipline maturity model. The criteria are cumulative: economy and provenance only pay off once relevance, sufficiency, and isolation already hold. Most agent failures trace to a missing criterion rather than a weak model, which is why larger context windows rarely fix them.
Good context for AI agents meets five design criteria: relevance, sufficiency, isolation, economy, and provenance. The framework comes from “Context Engineering: From Prompts to Corporate Multi-Agent Architecture,” a March 2026 paper by Vera V. Vishnyakova. The paper argues that prompt wording stops being the bottleneck once an AI system becomes an autonomous agent rather than a stateless chatbot. What matters then is the entire informational environment the agent reasons inside, and these five criteria are how you judge whether that environment is well built.
The criteria are useful because they turn a vague goal (“give the agent good context”) into five separable questions you can design against and check one at a time. They also explain a pattern most teams hit in production: an agent fails, the model gets blamed, a bigger model or a longer window gets swapped in, and nothing improves, because the actual defect was a missing criterion the model can do nothing about.
Context engineering has graduated from a loose practice into a discipline with its own quality bar. The paper places it as the second rung of a four-discipline maturity model: prompt engineering, then context engineering, then intent engineering, then specification engineering, each building on the one below. Its central claim is that “context as the agent’s operating system” is the right mental model: the context window is not a prompt but the runtime environment in which every decision is made, and whoever controls that environment controls the agent’s behavior.
That reframing matters because it changes what counts as a fix. If context is just a prompt, you improve it by writing better instructions. If context is an operating system, you improve it by engineering what gets loaded, when, in what structure, and with what provenance, which is exactly what moving beyond prompt engineering means in practice. The five criteria are the spec sheet for that environment.
Each criterion answers a distinct question, and a context window can pass one while failing another. The table below summarizes what each one governs and the failure mode you see when it is missing.
| Criterion | The question it answers | Failure mode when missing |
|---|---|---|
| Relevance | Is this information appropriate to the decision at hand? | The model attends to off-topic content and drifts |
| Sufficiency | Is there enough information to complete the task? | The model guesses or hallucinates to fill gaps |
| Isolation | Is conflicting or extraneous data kept out? | Cross-talk and contradictions corrupt reasoning |
| Economy | Is the information structured efficiently? | Token bloat, slower inference, higher cost |
| Provenance | Can each fact be traced to a source? | The agent can’t weigh or verify what it was given |
Read together, the criteria are an ordered set, not a flat checklist. Relevance, sufficiency, and isolation determine whether the context is correct at all. Economy and provenance determine whether a correct context is also efficient and trustworthy. Optimizing economy before relevance holds is how teams end up with a beautifully compressed window full of the wrong things.
Relevance is the first criterion because irrelevant context is not neutral; it is harmful. Transformer attention is finite and gets diluted across whatever you load, so every off-topic token competes with the tokens that actually matter. This is the mechanism behind context rot: accuracy degrades not when the window fills but when the share of relevant content drops, often well before any size limit.
Relevance is also the criterion most damaged by the instinct to “add more just in case.” A larger context window tempts teams to include marginally related documents, which lowers the relevance ratio and makes the agent worse. Designing for relevance means retrieving narrowly and deliberately for the current step, not preloading everything the agent might conceivably need.
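One way to make the relevance ratio concrete is to compute it. The following is a minimal sketch, assuming each candidate chunk already carries a relevance score from some retriever. The `Chunk` type, the `min_score` threshold, the `top_k` cap, and the whitespace-based token count are all illustrative assumptions, not anything the paper specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    score: float  # hypothetical relevance score in [0, 1] from a retriever


def token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def relevance_ratio(chunks: list[Chunk], min_score: float) -> float:
    """Share of tokens that come from chunks scoring at least min_score."""
    total = sum(token_count(c.text) for c in chunks)
    if total == 0:
        return 0.0
    relevant = sum(token_count(c.text) for c in chunks if c.score >= min_score)
    return relevant / total


def select_for_step(chunks: list[Chunk], min_score: float, top_k: int) -> list[Chunk]:
    """Keep at most top_k chunks that clear the threshold, best first."""
    kept = [c for c in chunks if c.score >= min_score]
    return sorted(kept, key=lambda c: c.score, reverse=True)[:top_k]


# Illustrative data: two on-topic chunks and two "just in case" chunks.
candidates = [
    Chunk("refund policy allows returns within 30 days", 0.92),
    Chunk("company picnic is scheduled for June", 0.08),
    Chunk("refunds are issued to the original payment method", 0.81),
    Chunk("office parking rules changed last year", 0.12),
]

before = relevance_ratio(candidates, min_score=0.5)
window = select_for_step(candidates, min_score=0.5, top_k=2)
after = relevance_ratio(window, min_score=0.5)
```

In this toy data, preloading everything leaves a little over half of the window on topic, while narrow selection leaves only on-topic chunks. A real system would swap in its own tokenizer and scoring.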
Sufficiency is the mirror image of relevance: the context must contain everything the task genuinely requires, or the model fills the gap by guessing. Under-provisioned context is one of the most common roots of hallucination, because a model asked a question it lacks the grounding to answer will still answer. The fix is not a bigger model but the missing fact placed in front of it.
Sufficiency and relevance pull in opposite directions, and resolving that tension is the core of context design. Too little and the agent invents; too much and relevance collapses. The target is the smallest set of context that is still complete for the task, which is why sufficiency can only be judged per task, not as a global setting.
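The "smallest set that is still complete" target can be sketched as a coverage problem. This example assumes the task can declare the facts it needs and that each chunk is labeled with the facts it supplies; both are simplifying assumptions, and all names are hypothetical. It uses greedy set cover, which is a common heuristic and not guaranteed to find the true minimum.

```python
def missing_facts(required: set[str], chunks: dict[str, set[str]]) -> set[str]:
    """Facts the task needs that no available chunk supplies."""
    covered = set().union(*chunks.values())
    return required - covered


def smallest_cover(required: set[str], chunks: dict[str, set[str]]) -> list[str]:
    """Greedy set cover: repeatedly take the chunk that fills the most gaps."""
    remaining = set(required)
    chosen: list[str] = []
    while remaining:
        best = max(chunks, key=lambda name: len(chunks[name] & remaining), default=None)
        gained = chunks[best] & remaining if best is not None else set()
        if not gained:
            # Nothing available closes the gap: the context is insufficient.
            raise ValueError(f"insufficient context, missing: {sorted(remaining)}")
        chosen.append(best)
        remaining -= gained
    return chosen


# Illustrative task and chunk labels.
required = {"order_id", "refund_window", "payment_method"}
chunks = {
    "order_record": {"order_id", "payment_method"},
    "refund_policy": {"refund_window"},
    "marketing_faq": {"refund_window", "store_hours"},
    "hr_handbook": {"vacation_days"},
}

gaps = missing_facts(required, chunks)
chosen = smallest_cover(required, chunks)
```

Here `chosen` contains only the order record and the refund policy; the other two chunks add nothing the task needs. Raising on an uncovered fact is a deliberate choice in this sketch, since a loud failure at assembly time is easier to debug than a confident guess at inference time.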
Isolation is the criterion that separates a clean working context from a polluted one, and it is the criterion that breaks first in multi-agent systems. When agents share a context channel, one agent’s intermediate output becomes another’s input, contradictions accumulate, and no single agent has the authority to reconcile them. Sub-agent context isolation, where each agent operates on a scoped slice rather than a shared pool, is the structural answer.
An agent connected to a Wire container sees one container’s structured entries rather than the union of every source it has ever touched, so the isolation criterion holds by construction rather than by prompt discipline. That is the difference between hoping an agent ignores irrelevant state and guaranteeing the state was never in its window to begin with.
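Isolation by construction can be illustrated with a scoped, read-only view. The sketch below uses a hypothetical in-memory `ContextStore`; it is not Wire's API or any real library's, only a small model of the idea that an agent receives one scope's entries and nothing else.

```python
from collections.abc import Mapping
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class Entry:
    scope: str
    key: str
    value: str


class ContextStore:
    """Hypothetical store in which every entry belongs to exactly one scope."""

    def __init__(self) -> None:
        self._entries: list[Entry] = []

    def add(self, scope: str, key: str, value: str) -> None:
        self._entries.append(Entry(scope, key, value))

    def view(self, scope: str) -> Mapping[str, str]:
        """Read-only slice holding only the entries of one scope."""
        scoped = {e.key: e.value for e in self._entries if e.scope == scope}
        return MappingProxyType(scoped)


store = ContextStore()
store.add("billing", "invoice_status", "paid")
store.add("research", "draft_summary", "customer may be overdue")
store.add("billing", "currency", "EUR")

# The billing agent is handed only its own slice.
billing_view = store.view("billing")
```

The research agent's speculative `draft_summary` never reaches the billing agent, so there is nothing for it to ignore. The view is also read-only, which keeps one agent from writing intermediate output into another agent's context.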
Isolation also covers temporal conflicts: a stale fact and its updated replacement sitting in the same window is an isolation failure, and it produces the kind of contradiction that looks like a reasoning bug but is really a context bug.
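A temporal conflict can be removed before the window is assembled. This is a minimal sketch, assuming each fact has a stable key and an as-of date; the `Fact` type and the sample values are illustrative.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    as_of: date


def drop_stale(facts: list[Fact]) -> list[Fact]:
    """Keep only the most recent fact for each key."""
    latest: dict[str, Fact] = {}
    for fact in facts:
        current = latest.get(fact.key)
        if current is None or fact.as_of > current.as_of:
            latest[fact.key] = fact
    return list(latest.values())


facts = [
    Fact("plan_price", "20 USD", date(2025, 1, 1)),
    Fact("support_hours", "09:00-17:00", date(2025, 3, 1)),
    Fact("plan_price", "25 USD", date(2026, 2, 1)),  # supersedes the first entry
]

current_facts = drop_stale(facts)
```

After this pass the window holds one price, not two, so the agent never has to arbitrate between a fact and its replacement.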
Economy is about information density, not deletion: the same facts expressed in fewer tokens. It earns its place as a criterion because token bloat has compounding costs: slower inference, higher bills, and a lower relevance ratio, all at once. The lever is structure, which is why structured context consistently outperforms raw text dumps for the same underlying information.
Economy is best enforced at ingestion rather than query time. Processing documents into compact, pre-structured entries once, up front, means the agent reads dense context on every later query instead of re-parsing raw files each time. This is the same logic behind allocating a token budget per context source and behind context compression: spend tokens where they buy the most decision value, and stop spending them on format and redundancy.
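A per-source token budget might look like the sketch below. It assumes entries were already structured at ingestion and arrive in priority order; the source names, budgets, and whitespace token count are illustrative assumptions, and a real system would use the model's own tokenizer.

```python
def token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def fit_to_budget(
    entries_by_source: dict[str, list[str]],
    budgets: dict[str, int],
) -> dict[str, list[str]]:
    """Take entries from each source, in order, until its budget is spent."""
    kept: dict[str, list[str]] = {}
    for source, entries in entries_by_source.items():
        remaining = budgets.get(source, 0)  # unbudgeted sources get nothing
        kept[source] = []
        for entry in entries:
            cost = token_count(entry)
            if cost > remaining:
                break
            kept[source].append(entry)
            remaining -= cost
    return kept


# Same fact, two densities: raw prose versus a pre-structured entry.
raw = (
    "As of the most recent update that we have on file, the customer's "
    "plan is the Pro plan and it renews on a monthly basis."
)
structured = "plan: Pro; renewal: monthly"

entries_by_source = {
    "policy": [
        "refund window: 30 days",
        "refund method: original payment",
        "exclusions: gift cards, final sale items",
    ],
    "chat_history": [
        "user asked about refund for order 1182",
        "user confirmed email address",
    ],
}
budgets = {"policy": 8, "chat_history": 7}

window = fit_to_budget(entries_by_source, budgets)
```

The structured entry carries the same fact as the raw sentence in a fraction of the tokens, which is what lets more distinct facts fit inside a fixed budget.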
Provenance is the criterion that makes the other four auditable, because an agent that cannot trace where a fact came from cannot weigh it, verify it, or decide whether it is still true. The paper lists provenance as a first-class quality criterion, not an afterthought, and that placement matters: provenance is structural metadata the agent uses at decision time, not a compliance log bolted on afterward. We have argued the same point at length in “Provenance is a context engineering primitive.”
In practice, epistemic provenance means tagging each piece of context with its source, position, recency, and relationships so the agent can reason about reliability the way a careful human would. Without it, every fact in the window has equal apparent authority, which is exactly how a single bad input becomes treated as ground truth.
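The four tags named above map naturally onto a small record type. This is a sketch under assumptions: the field names, the file names, and the rendered format are hypothetical choices, not a schema from the paper.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourcedFact:
    claim: str
    source: str                       # where the fact came from
    position: str                     # location inside that source
    retrieved_on: date                # recency
    supersedes: tuple[str, ...] = ()  # relationships to older sources


def render(fact: SourcedFact) -> str:
    """Put the provenance next to the claim so the agent sees both together."""
    return (
        f"{fact.claim} "
        f"[source: {fact.source}, {fact.position}; "
        f"as of {fact.retrieved_on.isoformat()}]"
    )


def freshest_first(facts: list[SourcedFact]) -> list[SourcedFact]:
    """Order facts so the most recently retrieved come first."""
    return sorted(facts, key=lambda f: f.retrieved_on, reverse=True)


policy = SourcedFact(
    claim="Refund window is 30 days",
    source="refund-policy.md",  # hypothetical file name
    position="section 2",
    retrieved_on=date(2026, 3, 1),
    supersedes=("refund-policy-2024.md",),
)
forum = SourcedFact(
    claim="Refund window is 14 days",
    source="community-forum-post",  # hypothetical, lower-authority source
    position="reply 7",
    retrieved_on=date(2024, 6, 10),
)

line = render(policy)
ordered = freshest_first([forum, policy])
```

With the tags attached, the two conflicting refund claims no longer look equally authoritative: one is recent and explicitly supersedes an older policy, the other is an old forum reply.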
The five criteria gain their power from being applied in order, the same way the paper’s maturity model stacks disciplines rather than listing them. Relevance and sufficiency define whether the context is correct, isolation protects that correctness from contamination, and only then do economy and provenance make a correct context efficient and trustworthy. Skip ahead and you optimize the wrong layer: a perfectly economical context window full of irrelevant, unsourced data is worse than a verbose one that gets the basics right.
This ordering is also a debugging tool. When an agent misbehaves, walk the criteria from the top: was the right information present, was enough of it present, was conflicting data kept out, was it dense, was it sourced. The first criterion that fails is usually the real defect, and it is almost never the one teams reach for first, which is model size.
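The walk from the top can be written as an ordered check that stops at the first failure. In this sketch the individual checks are placeholders supplied by the caller; how each criterion is actually tested is an assumption left to the team.

```python
from typing import Callable, Optional

# Order matters: correctness criteria first, then efficiency and trust.
CRITERIA_ORDER = ("relevance", "sufficiency", "isolation", "economy", "provenance")


def first_failing(checks: dict[str, Callable[[], bool]]) -> Optional[str]:
    """Return the first criterion, in order, whose check fails, or None."""
    for name in CRITERIA_ORDER:
        if not checks[name]():
            return name
    return None


# Illustrative run: two criteria fail, but only the earlier one is reported.
checks = {
    "relevance": lambda: True,
    "sufficiency": lambda: False,  # e.g. a required fact was never retrieved
    "isolation": lambda: True,
    "economy": lambda: True,
    "provenance": lambda: False,   # also failing, but downstream
}

defect = first_failing(checks)
```

Stopping at the first failure reflects the ordering argument: fixing provenance in this example would not help until the missing fact is supplied.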
Treat the criteria as design constraints at ingestion, not as runtime patches. Most of the work happens before the agent ever runs a query: deciding what to retrieve for relevance, ensuring task-complete coverage for sufficiency, scoping per-agent context for isolation, structuring entries for economy, and tagging sources for provenance. Doing this work at query time, document by document on every call, is what makes agents slow, expensive, and brittle.
The design criteria also pair directly with measurement. Once you have built context to satisfy the five, you can score it against operational metrics, and the companion practice of measuring context quality covers correctness, completeness, faithfulness, relevance, and freshness on the context you ship. Design for the five criteria, measure against the metrics, and you close the loop that most teams leave open: they observe their agents without ever evaluating what those agents were given to work with.
Sources: Context Engineering: From Prompts to Corporate Multi-Agent Architecture (arXiv 2603.09619) · Deloitte, State of Generative AI in the Enterprise 2026