7 context engineering techniques for production
Key takeaway
There are three moments where a system can turn raw data into usable AI context: at ingestion when data arrives, in a background pass between requests (sometimes called dreaming or consolidation), and at query time when an agent asks. The discipline of context engineering is matching each kind of work to the right moment: item-derivable work belongs at ingestion, corpus-wide work belongs in the background, and only question-dependent selection belongs at query time. Most systems collapse everything to query time and pay for it on every request.
There are three moments where a system can do the work of turning raw data into context an agent can actually use: when the data arrives (ingestion), in a background pass between requests (sometimes called consolidation, or evocatively, dreaming), and when the agent asks (query time). Most teams use only two of them, and badly. They treat ingestion as “just save the file” and lean on the query path to chunk, rank, reconcile, and assemble everything on demand. The cost shows up on every request, forever.
The discipline of context engineering is not picking one moment. It is matching each kind of work to the moment that fits it. Some work is derivable from a single document and should be done once at ingestion. Some work requires a view of the whole corpus and belongs in a background pass. A small amount genuinely depends on the live question and has to happen at query time. Get the assignment wrong and you either pay repeatedly for work you could have done once, or you push corpus-wide work onto a path that cannot see the corpus.
Each moment has a kind of work it is uniquely suited to, defined by how much context that work needs and how often it runs. The table below is the whole argument in one view.
| Moment | When it runs | What belongs here | Cost of misplacing the work |
|---|---|---|---|
| Ingestion | Once, when a document arrives | Parsing, chunking, embedding, per-item structure, pairwise links | If deferred, repeated on every query |
| Background (“dreaming”) | Periodically, off the request path | Cross-document entity resolution, canonicalization, aging out stale entries | If forced to query time, too slow; if forced to ingestion, lacks the full corpus |
| Query | On every request, live | Selecting and ranking which existing entries fit this question | If overloaded, every request stalls and costs more |
The unifying rule reads down the “What belongs here” column. Work that depends only on the item goes to ingestion. Work that depends on the whole corpus goes to the background. Work that depends on the specific question goes to query time. Almost every retrieval problem teams describe as “slow” or “inconsistent” is really a misassignment: corpus work or item work that ended up on the query path because nobody decided otherwise.
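The rule is small enough to write down as a lookup. The sketch below is only an illustration: the `Moment` and `DependsOn` enums, the `assign_moment` helper, and the task inventory are hypothetical names, not part of any library or product.

```python
from enum import Enum


class Moment(Enum):
    INGESTION = "ingestion"
    BACKGROUND = "background"
    QUERY = "query"


class DependsOn(Enum):
    ITEM = "item"          # a single document is enough
    CORPUS = "corpus"      # needs a view across many documents
    QUESTION = "question"  # cannot be known until the agent asks


# The unifying rule as a lookup table.
RULE = {
    DependsOn.ITEM: Moment.INGESTION,
    DependsOn.CORPUS: Moment.BACKGROUND,
    DependsOn.QUESTION: Moment.QUERY,
}


def assign_moment(depends_on: DependsOn) -> Moment:
    """Return the moment where a piece of work belongs."""
    return RULE[depends_on]


# Hypothetical task inventory, tagged by what each task depends on.
TASKS = {
    "chunking": DependsOn.ITEM,
    "embedding": DependsOn.ITEM,
    "entity_resolution": DependsOn.CORPUS,
    "reranking": DependsOn.QUESTION,
}

plan = {task: assign_moment(dep) for task, dep in TASKS.items()}
for task, moment in plan.items():
    print(f"{task}: {moment.value}")
```

The hard part in practice is the tagging, not the lookup: deciding honestly whether a task needs only the item, the whole corpus, or the live question.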
Most systems default to doing the real work at query time, and it is the costliest place to put it. The asymmetry is simple: a document is written once but read many times. If structuring happens on read, you pay for it on every read. A knowledge base queried a thousand times pays for parsing and interpretation a thousand times, all for a single upload. There is also a latency tax, because query-time work sits on the critical path of a live request while the agent waits. Anthropic’s guidance on effective context engineering frames context as a scarce resource that must be curated deliberately, and deliberate curation is exactly what you cannot do under the time pressure of a live query.
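The write-once, read-many asymmetry can be made concrete with a little arithmetic. The cost units below are invented for illustration (they are not measurements), and `total_cost` is a hypothetical helper.

```python
def total_cost(structuring_cost: float, lookup_cost: float,
               reads: int, structure_on_read: bool) -> float:
    """Total cost of serving `reads` queries against one document."""
    if structure_on_read:
        # Structuring is repeated on every read.
        return reads * (structuring_cost + lookup_cost)
    # Structuring is paid once at ingestion; reads pay only for the lookup.
    return structuring_cost + reads * lookup_cost


# Hypothetical cost units, chosen only to show the shape of the curve.
STRUCTURING_COST = 50.0
LOOKUP_COST = 1.0
READS = 1000

on_read = total_cost(STRUCTURING_COST, LOOKUP_COST, READS, structure_on_read=True)
on_write = total_cost(STRUCTURING_COST, LOOKUP_COST, READS, structure_on_read=False)

print(on_read)   # 51000.0
print(on_write)  # 1050.0
```

Under these assumed numbers the gap widens with every additional read, which is the amortization argument in miniature.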
The token cost is measurable. mem0’s State of AI Agent Memory 2026 reports well-structured retrieval answering queries in roughly 6,800 tokens against about 26,000 for loading full context, nearly a 4x reduction. That saving exists only because the structure being retrieved against was built ahead of time. A system that stores raw text and assembles meaning on the fly has no such structure to retrieve against, so it cannot be that precise. This is the same dynamic behind the question of whether AI token usage scales with knowledge base size: without work done ahead of the query, growing the corpus grows the per-query burden instead of holding it flat.
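The reported figures work out as follows. The two token counts are the approximate numbers quoted above; the query volume is a hypothetical value added only to show how the saving accumulates.

```python
# Approximate per-query token counts quoted from the mem0 report.
STRUCTURED_TOKENS = 6_800
FULL_CONTEXT_TOKENS = 26_000

# Hypothetical query volume, for illustration only.
QUERIES = 1_000

ratio = FULL_CONTEXT_TOKENS / STRUCTURED_TOKENS
saved_per_query = FULL_CONTEXT_TOKENS - STRUCTURED_TOKENS
saved_total = saved_per_query * QUERIES

print(f"reduction: {ratio:.1f}x")                 # reduction: 3.8x
print(f"saved per query: {saved_per_query:,}")    # saved per query: 19,200
print(f"saved over {QUERIES:,} queries: {saved_total:,}")
```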
Ingestion is the right home for everything derivable from a document on its own, and it should carry the bulk of the load. When a file arrives, the system can parse it, split it on meaningful boundaries, extract the entities and relationships visible within it, compute embeddings, and write the result as durable structured context. All of that depends only on the document in front of it, so there is no reason to wait. Doing it once at the door means every later query reads ready-made structure instead of re-deriving it, and the cost is paid a single time regardless of how often the document is read.
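A minimal sketch of that ingestion path, under loud assumptions: the entity extractor is a toy regex, `toy_embedding` is a hash-based placeholder with no semantic meaning (a real system would call an embedding model), and the in-memory `store` stands in for a durable database. All names are hypothetical.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str
    entities: list[str]
    embedding: list[float]


def split_on_paragraphs(text: str) -> list[str]:
    """Split on blank lines, a simple stand-in for 'meaningful boundaries'."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def extract_entities(text: str) -> list[str]:
    """Toy extractor: runs of two or more capitalized words."""
    return sorted(set(re.findall(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+", text)))


def toy_embedding(text: str, dims: int = 8) -> list[float]:
    """Placeholder vector derived from a hash; deterministic, not semantic."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dims]]


def ingest(doc_id: str, raw: str, store: dict[str, list[Chunk]]) -> list[Chunk]:
    """Do all item-derivable work once and write the structured result."""
    chunks = [
        Chunk(doc_id, p, extract_entities(p), toy_embedding(p))
        for p in split_on_paragraphs(raw)
    ]
    store[doc_id] = chunks
    return chunks


store: dict[str, list[Chunk]] = {}
doc = "Jane Smith joined the billing team.\n\nThe team owns the invoice service."
chunks = ingest("doc-1", doc, store)
for c in chunks:
    print(c.text, c.entities)
```

Nothing in `ingest` looks at any other document or at a question, which is what makes it safe to run exactly once when the file arrives.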
This is why the instinct to optimize the query path is usually misplaced. It is the same lesson as “RAG is not enough when retrieval fails”: retrieval can only choose among the representations ingestion produced. Query-time reranking and filtering cannot create structure that was never built. If the relationships, boundaries, and embeddings were not computed at ingestion, no amount of clever query-time logic recovers them. The systems that retrieve well are the ones that did unglamorous work when the data came in, not the ones with the cleverest query path.
The background pass is for the work that genuinely needs the whole corpus, which neither ingestion nor query time can provide. Some structuring cannot be done from a single document because it depends on relationships across many. Resolving that “J. Smith” in one file and “Jane Smith” in another are the same entity, canonicalizing a vocabulary that drifted across hundreds of documents, re-clustering as the corpus grows, aging out entries that newer ones supersede: all of these require looking at the store as a whole. They cannot run at ingestion, because the document arriving today has not seen the document arriving tomorrow. They should not run at query time, because they are far too expensive to do live. So they run periodically, off the request path. Some teams frame this as the agent “sleeping” and reflecting on what it has accumulated.
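To illustrate why this work needs the whole store, here is a toy cross-document entity resolution pass. The matching heuristic (same first initial and same last name) is an assumption made for the example and would wrongly merge, say, “John Smith” with “Jane Smith”; real systems use far more evidence. All function names are hypothetical.

```python
from collections import defaultdict


def mention_key(mention: str) -> tuple[str, str]:
    """Toy blocking key: (first initial, last name), both lowercased."""
    parts = mention.replace(".", "").split()
    return (parts[0][0].lower(), parts[-1].lower())


def resolve_entities(mentions_by_doc: dict[str, list[str]]) -> dict[str, str]:
    """Map every mention in the corpus to one canonical form.

    This needs all documents at once, so it cannot run when a single
    document arrives.
    """
    groups: dict[tuple[str, str], set[str]] = defaultdict(set)
    for mentions in mentions_by_doc.values():
        for mention in mentions:
            groups[mention_key(mention)].add(mention)

    canonical: dict[str, str] = {}
    for variants in groups.values():
        # Prefer the longest variant as the canonical name.
        best = max(variants, key=lambda v: (len(v), v))
        for variant in variants:
            canonical[variant] = best
    return canonical


corpus_mentions = {
    "doc-a": ["J. Smith"],
    "doc-b": ["Jane Smith", "John Doe"],
}
canonical = resolve_entities(corpus_mentions)
print(canonical["J. Smith"])  # Jane Smith
```

The function takes mentions from every document as input, which is precisely the view that neither a single ingestion call nor a live query has.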
The framing is useful but it invites overreach, and this is the part to be careful about. A background pass is the right tool for corpus-wide structure. It is the wrong tool for cleaning up a sloppy write path. If an agent writes the same fact twenty different ways into a flat store and then runs a nightly consolidation to merge the duplicates, the consolidation is undoing damage that should never have been done: deduplication and typed relationships belong at ingestion, not in a recurring cleanup. We argued this at length in “When agent memory needs sleep”: heavy reliance on a dreaming pass is usually a symptom of an underbuilt ingestion path, not a feature. The background pass earns its place for the genuinely cross-corpus work, and shrinks to almost nothing once ingestion does its job.
The practical test for where any piece of processing belongs is to ask what it depends on. If it depends only on the document, it goes to ingestion. If it depends on the whole corpus and benefits from periodic reconsideration, it goes to the background. If it depends on the live question, it stays at query time. That last category is smaller than most teams assume. Reranking, recency filtering, and final assembly are legitimately query-time jobs because they cannot be known until the question exists. Choosing among already-structured entries is question-dependent. Building the structure those entries are made of is not.
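What remains at query time can be very small. The sketch below assumes entries were already embedded and timestamped at ingestion, and performs only a recency filter and a ranking. The `Entry` type, the two-dimensional vectors, and the dates are hypothetical illustration data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Entry:
    text: str
    embedding: list[float]  # computed earlier, at ingestion
    updated: date


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def select(query_embedding: list[float], entries: list[Entry],
           not_before: date, top_k: int = 2) -> list[Entry]:
    """Question-dependent work only: filter by recency, rank, take top_k."""
    fresh = [e for e in entries if e.updated >= not_before]
    ranked = sorted(fresh, key=lambda e: dot(query_embedding, e.embedding),
                    reverse=True)
    return ranked[:top_k]


entries = [
    Entry("refund policy", [1.0, 0.0], date(2025, 6, 1)),
    Entry("old refund policy", [1.0, 0.0], date(2023, 1, 1)),
    Entry("shipping times", [0.0, 1.0], date(2025, 5, 1)),
    Entry("refund exceptions", [0.8, 0.2], date(2025, 7, 1)),
]

picked = select([1.0, 0.0], entries, not_before=date(2025, 1, 1))
print([e.text for e in picked])  # ['refund policy', 'refund exceptions']
```

The function never parses, chunks, or embeds a document. It only chooses among entries that already exist, which is why it can stay cheap on the critical path.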
A few situations flip the default, and they are worth naming so the rule does not become dogma. When data changes faster than you can re-ingest it, like live prices or a fast-moving ticket queue, fetching from the source of truth at query time beats serving stale structure. When a corpus is tiny and read rarely, the amortization argument disappears and processing on demand is simply simpler. And when the useful interpretation of a document is entirely query-dependent, some reasoning has to happen live by definition. The mistake is not using query-time processing at all. It is using it for the item work and corpus work that the other two moments were built for, and paying that bill on every request.
In a system that assigns work correctly, the query is small and fast because the heavy lifting already happened upstream. This is the bet behind how Wire keeps agent queries efficient. Most of the work happens at ingestion, where uploaded content is parsed, structured, linked, and embedded once. Only a lighter background pass handles the cross-document resolution that needs the whole container in view. A later agent request is then a cheap hybrid lookup rather than a re-derivation. The emphasis matters: the goal is to push as much as possible to ingestion and keep the background pass narrow, not to lean on a dreaming step to rescue a thin write path.
So when retrieval feels slow, expensive, or inconsistent, resist the urge to optimize the query. The query is usually slow because it is doing work that belonged to one of the earlier moments. Move item work to ingestion, move corpus work to the background, and let the query be what it should have been all along: a cheap selection over context that the two earlier moments, used deliberately, already made usable.
Sources: State of AI Agent Memory 2026 (mem0) · Effective context engineering for AI agents (Anthropic) · Memory for Autonomous LLM Agents (arXiv:2603.07670)