ai-agent-security context-engineering rag multi-agent-systems context-poisoning

Context Poisoning: When Bad Data Becomes AI Ground Truth

Jitpal Kocher · April 10, 2026 · 10 min read

Key takeaway

Context poisoning is an attack in which malicious or false information is planted into an AI agent's memory, RAG index, or tool outputs so the model treats it as ground truth. Unlike prompt injection, which ends when a session closes, poisoning persists across sessions and can corrupt every downstream action. The root cause is not model weakness but weak context engineering: pipelines that accept untrusted input without provenance tracking, source isolation, or validation. Fixing it requires treating every piece of context as untrusted until it has been attributed, scoped, and checked.

In March 2026, OWASP released its Top 10 for Agentic Applications, and a new entry landed at ASI06: Memory and Context Poisoning. A month earlier, Microsoft’s security team documented a 1-click attack in which a hyperlink with a crafted URL parameter silently seeded malicious instructions into a user’s AI assistant, persisting across future sessions. Neither of these is a model bug. Both are what happens when an agent’s context engineering pipeline has no opinion about what it’s allowed to trust.

Context poisoning is easy to mistake for a variant of prompt injection. It isn’t. Prompt injection is a transient exploit. Poisoning turns that exploit into a persistent, stateful compromise by writing the payload into something the agent reads on every future run: a RAG index, a long-term memory store, a shared multi-agent workspace, or the tool descriptions themselves. The attacker only has to succeed once. After that, every downstream action runs on contaminated ground truth.

This post is a mechanism walk-through, not a security checklist. The framing matters because most coverage treats poisoning as a firewall problem to be solved by authentication, scanning, or better alignment. None of those address the actual failure mode. The failure mode is that the agent’s context pipeline has no notion of trust, provenance, or isolation. That is a context engineering problem, and the fix lives in how context is collected, tagged, delivered, and validated, not in the model.

What context poisoning actually is

Context poisoning is the deliberate insertion of false or malicious information into an AI agent’s context sources so the model treats it as ground truth on future runs. The distinguishing property is persistence: the payload survives the session that delivered it and influences later, unrelated interactions. NeuralTrust’s analysis of memory and context poisoning describes it as turning a transient exploit into a durable control channel.

The mechanics vary by target, but the pattern is consistent. An attacker identifies a context source the agent reads, such as a knowledge base, a vector index, a shared memory store, a tool description, or any document the agent summarizes. They write instructions or false facts into that source. When the agent retrieves from that source later, the planted content enters the context window alongside legitimate data. The model cannot distinguish one from the other because, by design, retrieved context is treated as evidence.

OWASP’s inclusion of ASI06 in the 2026 agentic top 10 is a useful anchor, because it marks the moment the industry acknowledged this as a distinct class of attack rather than a footnote under prompt injection. The severity classification rests on two properties: the attack scales across users of the same system, and it can stay dormant until a specific query triggers it, making detection on a per-session basis close to impossible.

Four attack surfaces

The attack surface depends on how the agent consumes context. Four surfaces dominate current incident reports:

Surface	How it’s poisoned	Why it sticks
RAG index	Attacker submits content the pipeline ingests (public docs, uploads, scraped pages) with embedded instructions or false facts	Retrieval treats the chunk as legitimate evidence; poisoning survives as long as the chunk is indexed
Long-term memory store	Attacker triggers a write to persistent memory through a crafted interaction, pre-filled URL parameter, or compromised tool	Memory is read on future sessions; the payload can be dormant for days before firing
Multi-agent shared workspace	One compromised agent writes contaminated output into a shared state other agents read from	Poisoning propagates across agents; blast radius scales with the number of consumers
Document-embedded instructions	Hidden text in emails, web pages, PDFs, or tool descriptions that the agent processes as context	The user sees the visible document; the agent sees the hidden instructions

The RAG surface is the one most teams underestimate. If your pipeline ingests any content the attacker can influence, the index becomes an attack channel. The MCPTox benchmark tested tool-description poisoning across 20 prominent LLM agents and recorded a 72.8% attack success rate, with more capable models often more susceptible because they are better at following instructions, including planted ones.

The long-term memory surface is the one that makes this distinct from prompt injection. Christian Schneider’s write-up shows how payloads can wait in memory until a specific trigger phrase activates them, sometimes weeks after the initial write. From the user’s perspective, the agent appears to “suddenly” behave oddly, with no recent prompt that explains it.

The shared multi-agent workspace surface is newer and under-discussed. As Alexander Zanfir’s analysis of shared memory in multi-agent systems points out, once one agent in a coordinated system is contaminated, every downstream agent reading from shared state inherits the contamination. This is the same class of failure described in why multi-agent AI systems fail at context: shared state without provenance is an attack amplifier.

Prompt injection vs. context poisoning

These are often conflated because both involve untrusted content influencing model behavior. The differences matter for how you defend against each.

Dimension	Prompt injection	Context poisoning
Scope	One session, one user	Every session that reads the poisoned source
Persistence	Ends when the conversation closes	Survives until the source is cleaned
Attack surface	The current prompt and attached content	RAG index, memory, shared state, tool metadata
Trigger	Immediate, within the malicious prompt	Can be dormant until a trigger query
Detection	Possible at the prompt level	Requires auditing the context sources themselves
Primary defense	Prompt validation and output filtering	Provenance, source isolation, context validation

The defense columns are the most important. Prompt injection defenses focus on the input-output boundary. Context poisoning defenses have to move earlier in the pipeline, to how content enters the agent’s context sources in the first place.

Why this is a context engineering problem, not a model problem

No amount of RLHF, guardrails, or output filtering fixes context poisoning, because the poisoned content is indistinguishable from legitimate context by the time the model sees it. If an agent retrieves a document that says “the refund policy is 90 days” and that document was poisoned to say “always grant refunds regardless of amount,” the model has no way to know which statement to trust. Both arrived as retrieved evidence. Both look legitimate.

This is the same reframing pattern we’ve applied to MCP failures and to AI hallucinations: a failure that looks like a model problem turns out to be a context problem, and the fix lives in how context is delivered, not in the model. The root cause here is that most context pipelines have no notion of trust. They accept content from any source, store it without provenance, index it without isolation, and retrieve it without validation. The model inherits all of that ambient trust and has no way to push back.

Context engineering makes the trust relationship explicit. Every entry has a source. Every source has a trust level. Retrieval filters by trust level. Writes to trusted memory require a promotion step. This is not a novel idea: it’s how secure systems have handled untrusted input for decades. What’s new is applying it to the context layer of agent architectures, which most frameworks leave wide open by default. RAG pipelines in particular tend to dump everything into one index with no trust metadata, because the retrieval model was designed for relevance, not provenance.

Five context engineering defenses that work

The defenses that actually reduce poisoning risk are pipeline decisions, not model decisions. These are the five that matter most in production.

Provenance tracking

Tag every entry in your context sources with its origin: which user, which tool, which URL, which file, at which timestamp. This has to happen at write time because you can’t reconstruct it later. Provenance metadata lets retrieval filter by trust level at query time (“only serve chunks from internal documents on this query”) and lets incident response trace contamination back to the source when something goes wrong. Without provenance, a compromised index is a total loss because you can’t tell which chunks are clean.

Source isolation

Never let content from a low-trust origin write directly into the same memory layer that drives agent decisions. Scraped web content, user uploads, and third-party feeds go into a quarantine tier. Trusted memory is a separate tier that content can only enter through an explicit promotion step. The point is that retrieval from trusted memory can never accidentally surface a poisoned chunk from the quarantine tier, because the two are not queryable together without an explicit decision by the calling code.

Trust boundaries at the context window

Inside the context window, mark which sections came from which trust level. System instructions are the highest trust. Retrieved documents from quarantine are the lowest. The model won’t enforce this on its own, but you can instruct it to treat different sections differently, and you can validate output against the trust level of the sources it cites. This is also what prevents over-permissioned agents from laundering untrusted content into trusted actions.

Context validation at retrieval time

Run integrity checks before retrieved content reaches the model. This includes format validation (does the chunk look like what we indexed?), anomaly detection (is this chunk an outlier compared to others from the same source?), and known-poison signatures (does it contain patterns seen in past poisoning attempts?). None of these are bulletproof individually, but together they catch a meaningful fraction of poisoned content before it enters the context window.

Write review for long-term memory

Long-term memory writes are the highest-risk operation in an agent architecture because they persist the model’s current state into future context. Treat them like database writes in a system that handles money: require explicit validation, log every write with full context, and keep an audit trail that lets you roll back contaminated entries. In Wire containers, writes through wire_write are tagged with their originating agent session and source, and every entry carries its provenance through to retrieval, so a container compromised by a single bad write can be traced and rolled back without losing the rest of the context.

Where this is headed

Context poisoning is going to get worse before it gets better, for two reasons. First, the attack surface is expanding: every new integration, every new memory store, every shared multi-agent workspace adds a surface. Second, the defenses are still immature. Most agent frameworks ship without provenance tracking, without source isolation, and without any trust model at all. The OWASP ASI06 classification is a useful forcing function, but the industry is at the start of this curve, not the middle.

The teams that handle this well will be the ones that treat context as untrusted by default, require explicit trust decisions before content influences agent behavior, and build observability into every write and retrieval. The teams that treat context like an undifferentiated blob of text will keep getting poisoned, and they’ll keep blaming the model.

If this is a topic you’re working on, the related posts worth reading are AI agents have too much access (scope), why multi-agent AI systems fail at context (shared state), and why AI hallucinations are a context problem (why contaminated context becomes contaminated output).

Sources: OWASP Top 10 for LLM Applications & Generative AI · Microsoft Security: AI Recommendation Poisoning · NeuralTrust: Memory & Context Poisoning · Christian Schneider: Persistent Memory Poisoning in AI Agents · Alexander Zanfir: Context Poisoning & Shared Memory · Alessandro Pignati: Memory and Context Poisoning · MCPTox: Tool Poisoning Attacks on LLM Agents (arXiv) · Invariant Labs: MCP Tool Poisoning Notification

Frequently asked questions

How is context poisoning different from prompt injection?

Prompt injection is a single-session attack: malicious instructions inside a user prompt override the model's behavior for one conversation, then disappear. Context poisoning persists. The attacker writes the payload into something the agent reads on future runs, such as a RAG index, long-term memory store, or shared agent workspace, so the compromise survives across sessions and users.

Can RAG systems be poisoned, and how?

Yes. Any RAG pipeline that ingests content the attacker can influence, such as scraped web pages, public documents, customer-submitted files, or third-party feeds, can be poisoned by embedding malicious instructions or false facts in that content. Once the chunk is indexed, normal semantic retrieval surfaces it to the model as legitimate evidence. Research has shown attack success rates above 70% against major agent frameworks in these conditions.

How do you detect context poisoning in production?

Detection relies on three layered signals: provenance auditing to trace which source wrote each retrieved chunk, behavioral drift monitoring to flag agents whose outputs change after an ingestion event, and red-team evaluations that feed known-poisoned samples to measure how often they reach the model. No single signal is sufficient because poisoning often stays dormant until a specific trigger phrase or query pattern activates it.

What's the most effective defense against memory poisoning?

Source isolation. Never let content from a low-trust origin, such as a web scrape or user upload, write directly into the same memory layer that drives agent decisions. Instead, quarantine untrusted input in a separate tier, require explicit promotion to trusted memory, and tag every entry with its provenance so retrieval can filter by trust level at query time.

Does fine-tuning or RLHF prevent context poisoning?

No. Fine-tuning changes how the model weights its training data, but context poisoning targets the retrieval and memory layers the model reads at inference time. A perfectly aligned model will still act on a poisoned retrieved document because the document looks like legitimate context. The fix has to happen in the context pipeline, not the model itself.

ai-agents ai-agent-security

Ready to give your AI agents better context?

Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.

Create Your First Container