When to process AI context: ingestion, dreaming, query

Jitpal Kocher · June 4, 2026 · 8 min read

Key takeaway

There are three moments where a system can turn raw data into usable AI context: at ingestion when data arrives, in a background pass between requests (sometimes called dreaming or consolidation), and at query time when an agent asks. The discipline of context engineering is matching each kind of work to the right moment: item-derivable work belongs at ingestion, corpus-wide work belongs in the background, and only question-dependent selection belongs at query time. Most systems collapse everything to query time and pay for it on every request.

There are three moments where a system can do the work of turning raw data into context an agent can actually use: when the data arrives (ingestion), in a background pass between requests (sometimes called consolidation, or evocatively, dreaming), and when the agent asks (query time). Most teams use only two of them, and badly. They treat ingestion as “just save the file” and lean on the query path to chunk, rank, reconcile, and assemble everything on demand. The cost shows up on every request, forever.

The discipline of context engineering is not picking one moment. It is matching each kind of work to the moment that fits it. Some work is derivable from a single document and should be done once at ingestion. Some work requires a view of the whole corpus and belongs in a background pass. A small amount genuinely depends on the live question and has to happen at query time. Get the assignment wrong and you either pay repeatedly for work you could have done once, or you push corpus-wide work onto a path that cannot see the corpus.

The three moments, and what belongs in each

Each moment has a kind of work it is uniquely suited to, defined by how much context that work needs and how often it runs. The table below is the whole argument in one view.

Moment	When it runs	What belongs here	Cost of misplacing the work
Ingestion	Once, when a document arrives	Parsing, chunking, embedding, per-item structure, pairwise links	If deferred, repeated on every query
Background (“dreaming”)	Periodically, off the request path	Cross-document entity resolution, canonicalization, aging out stale entries	If forced to query time, too slow; if forced to ingestion, lacks the full corpus
Query	On every request, live	Selecting and ranking which existing entries fit this question	If overloaded, every request stalls and costs more

The unifying rule reads down the “What belongs here” column. Work that depends only on the item goes to ingestion. Work that depends on the whole corpus goes to the background. Work that depends on the specific question goes to query time. Almost every retrieval problem teams describe as “slow” or “inconsistent” is really a misassignment: corpus work or item work that ended up on the query path because nobody decided otherwise.
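The rule can be written down as a lookup from what a piece of work depends on to the moment that fits it. This is only a minimal sketch: the `Moment` enum, the dependency labels, and `assign_moment` are hypothetical names introduced for illustration, and the example tasks are taken from the table above.

```python
from enum import Enum


class Moment(Enum):
    INGESTION = "ingestion"
    BACKGROUND = "background"
    QUERY = "query"


# Hypothetical labels for what a piece of work needs to see in order to run.
DEPENDS_ON_TO_MOMENT = {
    "item": Moment.INGESTION,     # derivable from a single document
    "corpus": Moment.BACKGROUND,  # needs a view of the whole store
    "question": Moment.QUERY,     # unknowable until the agent asks
}


def assign_moment(depends_on: str) -> Moment:
    """Return the moment that fits a piece of work, given what it depends on."""
    try:
        return DEPENDS_ON_TO_MOMENT[depends_on]
    except KeyError:
        raise ValueError(f"unknown dependency: {depends_on!r}") from None


# Example tasks from the table, labelled by what they depend on.
WORK = {
    "chunking": "item",
    "embedding": "item",
    "cross-document entity resolution": "corpus",
    "aging out stale entries": "corpus",
    "ranking entries for this question": "question",
}

for task, dependency in WORK.items():
    print(f"{task:35s} -> {assign_moment(dependency).value}")
```

Making the dependency explicit is the point of the sketch: work lands on the query path only when someone labels it as question-dependent, not by default.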

Query-time processing is the expensive default

Most systems default to doing the real work at query time, and it is the costliest place to put it. The asymmetry is simple: a document is written once but read many times. If structuring happens on read, you pay for it on every read. A knowledge base queried a thousand times pays parsing and interpretation a thousand times against one upload. There is also a latency tax, because query-time work sits on the critical path of a live request while the agent waits. Anthropic’s guidance on effective context engineering frames context as a scarce resource that must be curated deliberately, and deliberate curation is exactly what you cannot do under the time pressure of a live query.
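The write-once, read-many asymmetry is easy to put into numbers. The sketch below assumes a single unit cost per structuring pass, which is a simplification; `total_structuring_cost` and both constants are illustrative names, not measurements.

```python
def total_structuring_cost(cost_per_pass: float, reads: int, at_query_time: bool) -> float:
    """Total structuring cost for one document over its lifetime.

    At query time the document is structured on every read; at ingestion it is
    structured exactly once, no matter how often it is read.
    """
    passes = reads if at_query_time else 1
    return cost_per_pass * passes


COST_PER_PASS = 1.0  # assumed unit cost of parsing and interpreting one document
READS = 1000         # the thousand queries from the example above

on_read = total_structuring_cost(COST_PER_PASS, READS, at_query_time=True)
on_write = total_structuring_cost(COST_PER_PASS, READS, at_query_time=False)
print(on_read, on_write, on_read / on_write)  # 1000.0 1.0 1000.0
```

The ratio equals the read count, so the gap widens with every additional query against the same document.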

The token cost is measurable. mem0’s State of AI Agent Memory 2026 reports well-structured retrieval answering queries in roughly 6,800 tokens against about 26,000 for loading full context, a reduction of about 3.8x. That saving exists only because the structure being retrieved against was built ahead of time. A system that stores raw text and assembles meaning on the fly has no such structure to retrieve against, so it cannot be that precise. This is the same dynamic behind the question of whether AI token usage scales with knowledge base size: without work done ahead of the query, growing the corpus grows the per-query burden instead of holding it flat.
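For reference, the arithmetic behind those figures, using the two approximate token counts quoted above. The 1,000-query total is an extrapolation added here for scale, not a number from the report.

```python
STRUCTURED_TOKENS = 6_800     # approximate figure quoted for structured retrieval
FULL_CONTEXT_TOKENS = 26_000  # approximate figure quoted for loading full context

ratio = FULL_CONTEXT_TOKENS / STRUCTURED_TOKENS
saved_per_query = FULL_CONTEXT_TOKENS - STRUCTURED_TOKENS

print(f"reduction: {ratio:.2f}x")                                # reduction: 3.82x
print(f"saved per query: {saved_per_query:,}")                   # saved per query: 19,200
print(f"saved over 1,000 queries: {saved_per_query * 1000:,}")   # saved over 1,000 queries: 19,200,000
```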

What only ingestion can do

Ingestion is the right home for everything derivable from a document on its own, and it should carry the bulk of the load. When a file arrives, the system can parse it, split it on meaningful boundaries, extract the entities and relationships visible within it, compute embeddings, and write the result as durable structured context. All of that depends only on the document in front of it, so there is no reason to wait. Doing it once at the door means every later query reads ready-made structure instead of re-deriving it, and the cost is paid a single time regardless of how often the document is read.
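A minimal sketch of that ingestion step, using only the Python standard library. Everything here is a stand-in: blank-line chunking approximates "meaningful boundaries", the entity extractor is a toy regex, and `embed` is a hashed bag-of-words vector rather than a real embedding model. The names `Entry`, `chunk`, `extract_entities`, `embed`, and `ingest` are hypothetical.

```python
import hashlib
import math
import re
from dataclasses import dataclass


@dataclass
class Entry:
    doc_id: str
    text: str
    entities: list[str]
    embedding: list[float]


def chunk(text: str) -> list[str]:
    """Split on blank lines, a stand-in for 'meaningful boundaries'."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def extract_entities(chunk_text: str) -> list[str]:
    """Toy extractor: runs of two or more capitalized words."""
    return re.findall(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b", chunk_text)


def embed(chunk_text: str, dims: int = 8) -> list[float]:
    """Toy hashed bag-of-words vector, unit-normalized. Not a real embedding model."""
    vec = [0.0] * dims
    for token in re.findall(r"\w+", chunk_text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def ingest(doc_id: str, raw: str, store: list[Entry]) -> None:
    """Do all item-derivable work once and write durable structured entries."""
    for piece in chunk(raw):
        store.append(Entry(doc_id, piece, extract_entities(piece), embed(piece)))


store: list[Entry] = []
ingest("memo-1", "Jane Smith approved the budget.\n\nThe launch moves to March.", store)
print(len(store), store[0].entities)  # 2 ['Jane Smith']
```

Nothing in `ingest` looks at any other document, which is what qualifies every step in it as ingestion work.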

This is why the instinct to optimize the query path is usually misplaced, the same lesson as where RAG breaks down in production: retrieval can only choose among the representations ingestion produced. Query-time reranking and filtering cannot create structure that was never built. If the relationships, boundaries, and embeddings were not computed at ingestion, no amount of clever query-time logic recovers them. The systems that retrieve well are the ones that did unglamorous work when the data came in, not the ones with the cleverest query path. The same failure shows up in meeting tools: AI notetakers ship the wrong artifact because they stop at a transcript instead of structuring the meeting for the work that comes after it.

The middle ground: a background pass, or “dreaming”

The background pass is for the work that genuinely needs the whole corpus, which neither ingestion nor query time can provide. Some structuring cannot be done from a single document because it depends on relationships across many. Resolving that “J. Smith” in one file and “Jane Smith” in another are the same entity, canonicalizing a vocabulary that drifted across hundreds of documents, re-clustering as the corpus grows, aging out entries that newer ones supersede: all of these require looking at the store as a whole. They cannot run at ingestion, because the document arriving today has not seen the document arriving tomorrow. They should not run at query time, because they are far too expensive to do live. So they run periodically, off the request path. Some teams frame this as the agent “sleeping” and reflecting on what it has accumulated.
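The “J. Smith” and “Jane Smith” case can be sketched as a pass over every mention in the store. The blocking key used here (first initial plus last name) is an assumption chosen for brevity; it would also merge “John Smith” with “Jane Smith”, so a real pass would need stronger disambiguation. `resolution_key` and `resolve_entities` are hypothetical names.

```python
from collections import defaultdict


def resolution_key(name: str) -> tuple[str, str]:
    """Hypothetical blocking key: (first initial, last name), lowercased."""
    parts = name.replace(".", "").split()
    return (parts[0][0].lower(), parts[-1].lower())


def resolve_entities(mentions: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (doc_id, name) mentions from the whole store under a canonical name.

    The longest surface form in each group is taken as canonical, a deliberately
    naive rule that would need real disambiguation in practice.
    """
    groups: dict[tuple[str, str], list[tuple[str, str]]] = defaultdict(list)
    for doc_id, name in mentions:
        groups[resolution_key(name)].append((doc_id, name))

    resolved: dict[str, list[str]] = {}
    for members in groups.values():
        canonical = max((name for _, name in members), key=len)
        resolved[canonical] = sorted({doc_id for doc_id, _ in members})
    return resolved


mentions = [
    ("file-a", "J. Smith"),
    ("file-b", "Jane Smith"),
    ("file-c", "Ravi Patel"),
]
print(resolve_entities(mentions))
# {'Jane Smith': ['file-a', 'file-b'], 'Ravi Patel': ['file-c']}
```

The function takes the full list of mentions as input, which is exactly why it cannot run at ingestion: a single arriving document supplies only its own rows.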

The framing is useful but it invites overreach, and this is the part to be careful about. A background pass is the right tool for corpus-wide structure. It is the wrong tool for cleaning up a sloppy write path. If an agent writes the same fact twenty different ways into a flat store and then runs a nightly consolidation to merge the duplicates, the consolidation is undoing damage that should never have been done: deduplication and typed relationships belong at ingestion, not in a recurring cleanup. We argued this at length in when agent memory needs sleep: heavy reliance on a dreaming pass is usually a symptom of an underbuilt ingestion path, not a feature. The background pass earns its place for the genuinely cross-corpus work, and shrinks to almost nothing once ingestion does its job.

Matching work to the moment

The practical test for where any piece of processing belongs is to ask what it depends on. If it depends only on the document, it goes to ingestion. If it depends on the whole corpus and benefits from periodic reconsideration, it goes to the background. If it depends on the live question, it stays at query time. That last category is smaller than most teams assume. Reranking, recency filtering, and final assembly are legitimately query-time jobs because they cannot be known until the question exists. Choosing among already-structured entries is question-dependent. Building the structure those entries are made of is not.
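What legitimately remains at query time can be quite small. The sketch below assumes entries already carry embeddings and timestamps written at ingestion, and that the query embedding is supplied by the caller; `Entry`, `select`, the two-dimensional vectors, and the 90-day window are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Entry:
    text: str
    embedding: list[float]  # computed at ingestion, not here
    updated_at: datetime


def select(query_embedding: list[float], entries: list[Entry],
           now: datetime, max_age: timedelta, top_k: int = 2) -> list[Entry]:
    """Recency-filter, rank by dot product, and return the top_k entries."""
    fresh = [e for e in entries if now - e.updated_at <= max_age]
    ranked = sorted(
        fresh,
        key=lambda e: sum(q * v for q, v in zip(query_embedding, e.embedding)),
        reverse=True,
    )
    return ranked[:top_k]


now = datetime(2026, 6, 4)
entries = [
    Entry("budget approved", [1.0, 0.0], now - timedelta(days=2)),
    Entry("launch moved to March", [0.0, 1.0], now - timedelta(days=1)),
    Entry("old budget draft", [1.0, 0.0], now - timedelta(days=400)),
]
picked = select([0.9, 0.1], entries, now, max_age=timedelta(days=90))
print([e.text for e in picked])  # ['budget approved', 'launch moved to March']
```

The function only filters and orders entries that already exist. It never parses, chunks, or embeds a document.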

A few situations flip the default, and they are worth naming so the rule does not become dogma. When data changes faster than you can re-ingest it, like live prices or a fast-moving ticket queue, fetching from the source of truth at query time beats serving stale structure. When a corpus is tiny and read rarely, the amortization argument disappears and processing on demand is simply simpler. And when the useful interpretation of a document is entirely query-dependent, some reasoning has to happen live by definition. The mistake is not using query-time processing at all. It is using it for the item work and corpus work that the other two moments were built for, and paying that bill on every request.
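The first two exceptions can be expressed as a simple check. This is a sketch under loud assumptions: the `Source` fields and both thresholds are invented for illustration and would need tuning against a real workload. The third exception, a document whose useful interpretation is entirely query-dependent, is a judgment call that numbers like these do not capture.

```python
from dataclasses import dataclass


@dataclass
class Source:
    change_interval_s: float       # how often the underlying data changes
    reingest_interval_s: float     # how quickly it can be re-ingested
    doc_count: int
    expected_reads_per_doc: float


def should_process_at_query_time(src: Source,
                                 tiny_corpus: int = 20,
                                 rare_reads: float = 2.0) -> bool:
    """True when volatility or tiny scale flips the default. Thresholds are illustrative."""
    too_volatile = src.change_interval_s < src.reingest_interval_s
    too_small = (src.doc_count <= tiny_corpus
                 and src.expected_reads_per_doc <= rare_reads)
    return too_volatile or too_small


live_prices = Source(change_interval_s=1, reingest_interval_s=300,
                     doc_count=10_000, expected_reads_per_doc=500)
knowledge_base = Source(change_interval_s=30 * 86_400, reingest_interval_s=60,
                        doc_count=5_000, expected_reads_per_doc=1_000)
scratch_notes = Source(change_interval_s=7 * 86_400, reingest_interval_s=60,
                       doc_count=5, expected_reads_per_doc=1)

print(should_process_at_query_time(live_prices))     # True
print(should_process_at_query_time(knowledge_base))  # False
print(should_process_at_query_time(scratch_notes))   # True
```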

What this looks like in practice

In a system that assigns work correctly, the query is small and fast because the heavy lifting already happened upstream. This is the bet behind how Wire keeps agent queries efficient: most of the work happens at ingestion, where uploaded content is parsed, structured, linked, and embedded once, with only a lighter background pass for the cross-document resolution that needs the whole container in view, so a later agent request is a cheap hybrid lookup rather than a re-derivation. The emphasis matters: the goal is to push as much as possible to ingestion and keep the background pass narrow, not to lean on a dreaming step to rescue a thin write path.

So when retrieval feels slow, expensive, or inconsistent, resist the urge to optimize the query. The query is usually slow because it is doing work that belonged to one of the earlier moments. Move item work to ingestion, move corpus work to the background, and let the query be what it should have been all along: a cheap selection over context that the two earlier moments, used deliberately, have already made usable.

Sources: State of AI Agent Memory 2026 (mem0) · Effective context engineering for AI agents (Anthropic) · Memory for Autonomous LLM Agents (arXiv:2603.07670)

Frequently asked questions

What are the stages of processing context for an AI agent?

There are three: ingestion-time processing when data first arrives, a background or offline pass that runs between requests, and query-time processing when the agent asks a question. Each stage suits a different kind of work, and the cost and consistency of retrieval depend on doing each kind in the right place.

What is dreaming or sleep-time compute for AI memory?

Dreaming refers to a background pass where an agent reprocesses its accumulated memory between sessions to consolidate, deduplicate, or reorganize it. It runs off the live request path, so it can afford a corpus-wide view that query-time processing cannot. It is best reserved for work that genuinely needs the whole store rather than as a substitute for structuring data at ingestion.

Should context be processed at ingestion or in a background pass?

Prefer ingestion for anything derivable from a document on its own, because that work is paid once per document and keeps queries cheap. Reserve the background pass for work that requires comparing across the entire corpus, like resolving entities that appear in many documents. Leaning heavily on the background pass usually signals an underbuilt ingestion path.

Why is query-time processing more expensive than ingestion?

Query-time processing repeats work on every request, while ingestion pays it once per document. A document read a thousand times incurs structuring a thousand times at query time versus once at ingestion, and the query also stalls while the work runs on the live path.
