# Agent drift: why long-running AI agents lose the plot

## Key takeaway
Anthropic launched Memory for Managed Agents on April 23, 2026 in public beta, with cross-session learning, per-write audit trails, and customer results like Rakuten's 97% error rate reduction. The launch makes persistent agent memory a standard part of the agent stack and reframes the harder questions for context engineers around scope, freshness, conflict, and trust rather than storage.
On April 23, 2026, Anthropic launched Memory for Managed Agents in public beta. The feature stores what agents learn across sessions as files, with per-write audit logs, programmatic access, and the ability for one agent to share what it learned with other agents in the same workspace. Customers cited in the launch include Rakuten (97% error rate reduction, 27% lower cost, 34% lower latency on agent workloads), Netflix (cross-session continuity), and Wisedocs (recurring-issue detection in document verification pipelines).
The feature itself is straightforward. The bigger story is what its arrival changes for teams designing agents in 2026. Persistent memory has moved from a research topic and a patchwork of bespoke vector stores into a standard part of a major agent platform. For practitioners doing context engineering, that shift moves the interesting design problems from storage choice to scope, freshness, conflict, and trust. This post walks through what shipped, why the timing matters, and the four problems persistent memory makes urgent.
Memory for Managed Agents is a public-beta feature available to all customers using Claude Managed Agents. It stores entries as files on a managed filesystem, with full export, edit, and migration available through the Claude Console and the Managed Agents API. Each write becomes a session event with a timestamp, source attribution, and a rollback option. Memories can be scoped per agent, per workspace, or shared across agents in the same workspace, with permissions managed by the customer’s admin tooling.
Cross-session learning is the headline capability. A Managed Agent can read entries written during a previous session, including entries written by a different agent in the same workspace, and use them as input to the current task. Anthropic positions this as a way to let agents avoid repeating past mistakes and to share institutional knowledge across an agent fleet, rather than as a long-conversation memory feature.
Pricing for the feature was not disclosed in the launch, though Managed Agents continue to bill at standard Claude API rates. The beta is gated behind the existing Managed Agents access; teams already using the product had access on day one.
Persistent agent memory already had a working ecosystem before this launch. Several dedicated memory platforms ship production-ready primitives, with different tradeoffs around storage format, retrieval semantics, and the agent-facing interface, including Wire and other dedicated memory infrastructure. Teams that did not adopt one of those have been building their own, picking a vector store and writing their own schema, write pipeline, and retrieval logic, often without good benchmarks to validate the design. The 2026 wave of memory benchmarks (LOCOMO, LongMemEval, MemoryAgentBench at ICLR 2026) confirmed that memory architecture matters at the same order of magnitude as retrieval architecture. A poor memory design can cost 90% of the latency budget and most of the accuracy gain, as our tool-based memory analysis covered.
Anthropic shipping Memory for Managed Agents adds a major model provider to that list and normalizes the assumption that an agent has memory of some kind. Teams building on Managed Agents now get a built-in option alongside the third-party choices, which means the design question they face shifts from “build, adopt, or wait?” to “what should be in memory, who can read it, and how do we know when it’s wrong?” Teams not on Managed Agents face the same shift through competitive pressure: their agents will be compared against agents that remember, regardless of which memory layer is providing it.
The customer numbers reinforce why the question matters. Rakuten’s 97% error rate reduction came from agents that stopped repeating documented mistakes, which is a memory-quality result, not a memory-storage result. Netflix’s continuity gain came from scoping memory per session series, not from any specific storage choice. The implementations behind those numbers vary; the design principles do not.
Persistent memory exposes four design problems that scoped, single-session context never had to handle. None of them are storage problems.
Scope. Once memory survives the session, the question of who sees what stops being implicit. Memory that lives at the workspace level can leak between projects. Memory tied to a specific user can be the wrong default for a team-shared agent. The agent design has to declare a scope at write time and enforce it at read time, with no reasonable default that fits every case.
Freshness. Stored facts decay. A customer’s preferred shipping address from January is wrong in April if they moved. A documented workaround from last quarter is wrong if the underlying bug got fixed. Persistent memory without an explicit freshness model becomes a slow source of context poisoning, where a confidently wrong stored memory crowds out current information at retrieval time.
Conflict. Two writes can disagree. The agent recorded one preference in March and a different preference in April. Both entries persist with similar embeddings and near-identical similarity scores. Without an explicit conflict resolution rule, retrieval surfaces whichever entry the index ranks higher, which is roughly random from the agent’s perspective. The Memanto paper (April 2026) and Mem0’s v1.0 API both make conflict resolution an explicit memory operation for exactly this reason.
Trust. When an agent reads a memory entry, it needs to know where the entry came from to reason about whether it should act on it. Was the entry written by the same agent in a prior session? By a different agent in the workspace? By a tool the agent invoked? Without provenance attached to each entry, the agent treats every stored fact as ground truth, which is the failure mode that lets poisoned memory entries propagate across sessions.
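The four axes can be made concrete as fields on a single entry record, with retrieval enforcing scope, discarding stale entries, and keeping provenance attached rather than treating the store as ground truth. This is an illustrative sketch, not any platform's actual schema; the field and scope names are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class MemoryEntry:
    text: str
    scope: str           # e.g. "agent:billing", "workspace:acme", "user:42"
    written_at: datetime # freshness axis
    written_by: str      # trust axis: which agent or tool produced the entry

def retrieve(entries: list[MemoryEntry], scope: str,
             now: datetime, max_age: timedelta) -> list[MemoryEntry]:
    """Return in-scope, fresh entries, newest first, provenance attached."""
    visible = [
        e for e in entries
        if e.scope == scope and now - e.written_at <= max_age
    ]
    # Newest first, so a later write on the same topic outranks an older
    # one by recency rather than competing on similarity score alone.
    return sorted(visible, key=lambda e: e.written_at, reverse=True)
```

A hard `max_age` cutoff is only one freshness policy; the design point is that scope and age are checked by the retrieval layer, while `written_by` travels with the entry so the agent can judge trust itself.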
These four problems compound when ignored. A workspace with unscoped memory, no freshness model, no conflict resolution, and no provenance is the architecture for agent drift at scale. Each session adds entries; each retrieval surfaces increasingly stale, conflicting, anonymous memories; each agent action gets noisier. The Anthropic launch ships strong defaults on the audit-trail axis (every write is logged, with rollback). The other three axes are still on the customer.
The published Managed Agents customer numbers carry real information about where memory pays off, separate from the marketing framing.
| Customer | Reported result | What this likely measures |
|---|---|---|
| Rakuten | 97% error rate reduction, 27% cost reduction, 34% latency reduction | Repeat-mistake elimination on a recurring agent workload |
| Netflix | Cross-session context continuity | User-scoped memory in a long-running agent assistant |
| Wisedocs | Recurring issue detection across document pipelines | Workspace-scoped memory shared across pipeline agents |
The pattern across the three is that memory paid off most where the agent was previously failing in a way memory directly addresses. Rakuten’s error rate dropped because agents stopped making the same mistake twice. Netflix’s continuity improved because user context survived between sessions. Wisedocs surfaced patterns that no single document review could see. None of these are general-purpose memory benchmarks; they are workload-specific gains from a workload-specific design.
For teams evaluating whether to enable Managed Agents memory, the relevant test is not “do agents perform better with memory in general.” The relevant test is “does this specific agent fail in a way memory would fix.” If the answer is yes, the published numbers suggest the gains are large. If the answer is no, persistent memory is more likely to slowly accumulate noise than to deliver visible value.
Scoping decisions made at the wrong layer are the most common source of memory failure in production agent systems. Three scoping levels show up consistently in working designs: per-agent (private to one agent’s task surface), per-workspace (shared across agents in the same team or organization), and per-user (tied to an individual end user across sessions). Each has a different leak surface and a different write policy.
Per-agent scope is the safest default. Memories written by an agent are readable only by that agent. The downside is that institutional knowledge does not propagate; if two agents are doing similar work, both have to learn the same lessons separately. The Wisedocs implementation suggests workspace scope is the right answer when agents are working on shared data, since cross-pipeline patterns become visible only when memory is pooled.
Per-workspace scope is the most leak-prone. Memories written for a customer-support workload can surface when a sales agent runs in the same workspace, especially if the embedding similarity happens to match. Memanto’s typed-category approach handles this with explicit category routing, where retrieval restricts itself to relevant memory types. The same pattern is expressible with workspace tags, prefix routing, or per-source segmentation. The mechanism matters less than enforcing the scope at the write boundary, not at the read boundary.
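Enforcing scope at the write boundary can be as simple as refusing writes whose declared category is not registered for the writing workload, so nothing leak-prone ever enters the shared store. The workload and category names below are hypothetical.

```python
# Allowed memory categories per workload: a write outside its registered
# categories is rejected at the boundary instead of filtered at read time.
WRITE_POLICY: dict[str, set[str]] = {
    "support-agent": {"customer-issue", "workaround"},
    "sales-agent": {"lead", "pricing-note"},
}

def write_memory(store: dict[str, list[str]], workload: str,
                 category: str, text: str) -> None:
    allowed = WRITE_POLICY.get(workload, set())
    if category not in allowed:
        raise PermissionError(f"{workload!r} may not write category {category!r}")
    store.setdefault(category, []).append(text)

def read_memory(store: dict[str, list[str]], categories: set[str]) -> list[str]:
    """Retrieval restricts itself to the requested categories."""
    return [t for c in sorted(categories) for t in store.get(c, [])]
```

Rejecting at write time keeps the guarantee structural: a sales agent cannot surface support memories because those memories were never writable into a category it reads.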
Per-user scope is the one most likely to pull in compliance considerations. Memory tied to an end user is durable user data, with all the deletion, consent, and audit obligations that implies. The Anthropic audit-trail design helps here: per-write logs and rollback are the primitives a deletion request relies on. Teams shipping per-user memory should design the deletion path before they design the write path.
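A per-user deletion path built on per-write logs might look like the sketch below: deletion walks the audit log as well as the retrieval index, so nothing written for the user survives in either place. The structure is entirely illustrative.

```python
def delete_user(events: list[dict], index: dict[str, list[dict]],
                user_id: str) -> int:
    """Purge every entry written for user_id from both the audit log and
    the retrieval index; returns how many log events were removed."""
    before = len(events)
    # Purge the log first: it is the source of truth for what was written.
    events[:] = [e for e in events if e["user_id"] != user_id]
    # Then rebuild each index bucket without the user's entries.
    for key in index:
        index[key] = [e for e in index[key] if e["user_id"] != user_id]
    return before - len(events)
```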
Persistent memory does not invent new failure modes. It extends the existing ones. Stale memory is stale context. Memory poisoning is context poisoning at write time. Memory bloat is context rot at retrieval time. Each of the design failures that already shows up in single-session context windows shows up in persistent memory with a longer half-life and a wider blast radius.
The same primitives that keep context safe keep memory safe: scoped access, traceable writes, the ability to inspect what’s there, an explicit policy for forgetting. Wire treats every entry inside a context container as inspectable, with per-write provenance, which is the same instinct behind Anthropic’s audit-first design choice. The shared design pattern is that memory is just context that survives, and survival raises the cost of every quality problem the underlying context already had.
For teams already running an agent stack, the practical implication is that any memory work should be evaluated against the four-problem checklist (scope, freshness, conflict, trust) before storage choices are made. The benchmarks show that storage format affects the latency-versus-accuracy curve, but the customer outcomes show that scope and provenance are what move workload-level results. Get those right and the storage choice becomes a tuning decision rather than a defining one.
Three open questions are likely to shape the next twelve months of memory work for agent platforms.
Memory garbage collection patterns. Platforms have taken visibly different positions here. Mem0 ships TTL primitives; Memanto handles staleness through retrieval-time scoring; Anthropic has not yet detailed a built-in policy. Wire takes a different stance entirely and does not expire entries. Instead, every result carries provenance, including timestamps and known conflicts, so the agent can decide for itself how much weight a specific entry deserves for the current task. Whether the right answer is automatic expiration, retrieval-time scoring, or agent-side judgment is an open question, and the platform that lands sane defaults will move the field forward more than the next storage-format optimization.
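The TTL-versus-scoring difference is easy to see in code. A hard TTL drops an entry outright; retrieval-time scoring keeps it but decays its rank with age, leaving the final call to the caller. Both functions below are sketches, and the half-life constant is an arbitrary illustration, not any platform's documented default.

```python
import math
from datetime import datetime, timedelta

def ttl_filter(entries: list[dict], now: datetime, ttl: timedelta) -> list[dict]:
    """Hard expiry: entries older than the TTL simply vanish."""
    return [e for e in entries if now - e["written_at"] <= ttl]

def freshness_score(entry: dict, now: datetime, half_life: timedelta) -> float:
    """Exponential decay: an entry's retrieval weight halves every half_life,
    but the entry itself is never deleted."""
    age = (now - entry["written_at"]) / half_life
    return entry["similarity"] * math.pow(0.5, age)
```

Under the scoring approach a sixty-day-old entry with similarity 0.9 and a thirty-day half-life still surfaces at weight 0.225; whether that is a feature (agent-side judgment) or a bug (slow context rot) is exactly the open question.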
Conflict resolution semantics. Update and discard tools exist; explicit policies for how to resolve write-time conflicts mostly do not. Silent background resolution (last-write-wins, implicit TTL, automatic dedup) is the lazy default, and it carries a specific failure mode: the agent’s working context changes in ways the agent cannot see or trace, which produces unexpected behavior on the next call and compounds context poisoning issues over time. A poisoned entry that quietly displaces a correct one cannot be detected from inside the reasoning loop. Wire takes the position that conflicts should surface to the calling agent rather than be resolved silently, on the same logic as the GC question: the agent has more context about the current task than the memory layer does, the right resolution is task-dependent, and any silent change is a change the agent cannot reason about. Multi-version memory with explicit, agent-visible reconciliation is the likely next step across the field.
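Surfacing a conflict instead of resolving it silently means retrieval returns every live version of a contested fact, flagged as conflicting, and the agent chooses. A minimal sketch, assuming entries keyed by a `topic` field (an assumed convention, not a published API):

```python
def retrieve_with_conflicts(entries: list[dict], topic: str) -> dict:
    """Group entries for a topic; if more than one survives, return all of
    them flagged as a conflict rather than picking a winner."""
    matches = [e for e in entries if e["topic"] == topic]
    return {
        "topic": topic,
        "conflict": len(matches) > 1,
        # Newest first so the agent sees recency, but nothing is dropped.
        "versions": sorted(matches, key=lambda e: e["written_at"], reverse=True),
    }
```

Compare this with last-write-wins: the older version would be gone before the agent ever saw it, and a poisoned newer entry would displace a correct one invisibly.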
Cross-platform portability. Memory written for a Managed Agent today is portable through export, but there is no shared format for moving memory between agent platforms. As more platforms add memory, the absence of a common interchange format becomes the lock-in surface. Wire was built around this exact problem and exposes container memory through MCP, so the same memory surface is reachable from any compliant agent client without an export-import step. The broader field still has the open question to answer: the same evolution that produced MCP for tool-calling will likely have to happen for memory, and the platform that drives the standard will set the terms.
The April 23 launch is the inflection point that moves these from research questions to platform questions. The teams that take the four-problem checklist seriously now will spend less time recovering from memory accidents in 2027.
Sources: Anthropic Managed Agents engineering post · Anthropic adds memory to Claude Managed Agents (SD Times) · Anthropic launches Memory in Claude Agents for enterprise (Testing Catalog) · State of AI Agent Memory 2026 (Mem0) · MemoryAgentBench (ICLR 2026) · Memanto: Typed Semantic Memory with Information-Theoretic Retrieval (arXiv) · Memory in the Age of AI Agents (survey)
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container