Agent drift: why long-running AI agents lose the plot
Key takeaway
Agentic context engineering treats an agent's context as an evolving playbook, not a fixed prompt. ACE, an ICLR 2026 framework from Stanford, SambaNova, and Microsoft, beats tuned prompts by 10.6% on agent benchmarks by splitting context updates across a Generator, Reflector, and Curator. The result avoids two common failures of system-prompt tuning: brevity bias and context collapse.
A research framework called ACE, presented at ICLR 2026 by teams from Stanford, SambaNova, and Microsoft, reports that an agent whose context evolves at runtime beats one running a tuned prompt by an average of 10.6% on agent benchmarks and 8.6% on finance tasks. The same paper shows the approach matching IBM’s CUGA, currently the top-ranked production agent on AppWorld, while running on a smaller open-source model. On the harder test-challenge split, the evolving-context approach surpassed CUGA outright.
The headline gain is interesting, but the more useful result is buried in the failure analysis. ACE’s authors identify two specific reasons why the usual approaches to keeping an agent’s context useful (prompt tuning and iterative system-prompt rewriting) quietly degrade. Both failure modes show up in production whether or not a team has read the paper. Understanding them changes how you build context engineering into an agent loop, regardless of whether you adopt the specific framework.
Brevity bias is the tendency of language models to drop domain-specific detail when asked to compress or summarize. The paper documents this directly: ask a model to rewrite its own operating instructions, and the rewrite is consistently shorter, cleaner, and missing the specific edge cases that made the original useful. The compression looks like an improvement on first read and degrades agent behavior in practice.
Context collapse is the longer-term version of the same problem. When a system prompt is rewritten iteratively, by the model itself or by a tuning loop, each rewrite is lossy. Detail erodes a little on each pass. After enough rewrites, the playbook is a vague shadow of the version that worked. Both failures share a root cause: most “self-improving” systems collapse reflection and curation into a single model call, and that call is implicitly an act of summarization.
This matters because the standard advice for maintaining agent quality ("periodically clean up the system prompt and add lessons learned") runs into brevity bias on every cleanup and context collapse on every iteration. The agents that survive longest are the ones whose context grows in structured, append-mostly ways rather than being rewritten. The shape is familiar to anyone working on agent memory: this is context rot at the system-prompt level. The same signal-to-noise degradation that hits long conversation histories also hits the operating context that drives the agent’s behavior.
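To make the compounding concrete, here is a toy calculation rather than anything measured in the paper. It assumes, purely for illustration, that every rewrite independently keeps a fixed fraction of the playbook's detail; the function name `surviving_fraction` and the 95% figure are hypothetical.

```python
def surviving_fraction(retention_per_rewrite: float, rewrites: int) -> float:
    """Fraction of original detail left after repeated lossy rewrites.

    Assumption for illustration only: each rewrite keeps the same fixed
    fraction of whatever detail survived the previous one.
    """
    return retention_per_rewrite ** rewrites


# A rewrite that looks harmless (keeps 95% of the detail) compounds quickly.
for n in (1, 5, 10, 20):
    print(n, round(surviving_fraction(0.95, n), 2))
# After 20 rewrites roughly 0.36 of the original detail is left.

# An append-only update never rewrites existing entries, so in this toy
# model its retention per update is 1.0 and nothing erodes.
print(surviving_fraction(1.0, 20))
```

Real rewrites are not this uniform, but the shape of the curve is the point: a small per-pass loss becomes a large cumulative one.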
ACE separates the work of evolving context into three distinct roles, each with a narrow job. The split matters more than the specific implementation: collapsing any two of these roles into one call is what produces brevity bias.
| Role | What it does | What it does not do |
|---|---|---|
| Generator | Produces reasoning trajectories for new queries, exposing effective strategies and recurring failures | Decides what to keep or merge into the stored playbook |
| Reflector | Extracts concrete insights from successes and errors after the fact | Rewrites the existing context or chooses how to integrate new insights |
| Curator | Converts extracted insights into structured delta updates with helpful/harmful counters; performs deterministic merging, deduplication, and pruning | Does free-form summarization or generates new prose for the playbook |
The Curator stage is where the architecture earns its results. Updates are applied as deltas, not rewrites. Each entry in the playbook carries explicit counters for how often a strategy helped or hurt, plus deterministic dedup logic that prevents the same lesson from accumulating in multiple forms. Detail accumulates without being summarized away. The model is never asked to produce a shorter version of the playbook from scratch, which is the operation that triggers context collapse.
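A minimal sketch of what a delta-style merge could look like, assuming a playbook keyed by a normalized form of each lesson. The names `Entry`, `Delta`, `normalize`, and `apply_deltas` are hypothetical, and the whitespace-and-case dedup rule is a stand-in; this is not the released ACE implementation.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    helpful: int = 0
    harmful: int = 0


@dataclass(frozen=True)
class Delta:
    text: str     # the lesson extracted during reflection
    helped: bool  # whether applying it helped or hurt in this trajectory


def normalize(text: str) -> str:
    """Hypothetical dedup key: lowercase with whitespace collapsed."""
    return " ".join(text.lower().split())


def apply_deltas(playbook: dict[str, Entry], deltas: list[Delta]) -> dict[str, Entry]:
    """Merge deltas in place. Existing entry text is never rewritten."""
    for delta in deltas:
        key = normalize(delta.text)
        # First sighting creates the entry; later sightings only bump counters.
        entry = playbook.setdefault(key, Entry(text=delta.text))
        if delta.helped:
            entry.helpful += 1
        else:
            entry.harmful += 1
    return playbook


playbook: dict[str, Entry] = {}
apply_deltas(playbook, [
    Delta("Retry on HTTP 429 with backoff", helped=True),
    Delta("retry on  http 429 with backoff", helped=True),  # same lesson, different form
    Delta("Assume all dates are UTC", helped=False),
])
for entry in playbook.values():
    print(entry)
```

The duplicate lesson lands on the existing entry and raises its counter instead of adding a second copy, and no step asks a model to produce a shorter playbook.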
The reported gains track this design. ACE preserves structure and counters, so the playbook scales with long-context models rather than being squeezed into a smaller prompt. Compared to GEPA, the prior state of the art for prompt optimization on these benchmarks, ACE reports 82.3% lower offline adaptation latency and 75.1% fewer rollouts to converge. The compute saved is the compute that would otherwise be spent rewriting and re-evaluating shortened prompts.
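Percentage reductions are easy to misread, so it can help to convert them into "times fewer" factors. The helper below is plain arithmetic on the two GEPA figures quoted above; the function name `reduction_to_factor` is ours.

```python
def reduction_to_factor(percent_reduction: float) -> float:
    """Convert 'X% lower' into the equivalent 'N times less' factor."""
    return 1.0 / (1.0 - percent_reduction / 100.0)


latency_factor = reduction_to_factor(82.3)  # 1 / 0.177
rollout_factor = reduction_to_factor(75.1)  # 1 / 0.249

print(f"latency: about {latency_factor:.1f}x lower")   # about 5.6x
print(f"rollouts: about {rollout_factor:.1f}x fewer")  # about 4.0x
```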
The headline benchmark results from the ACE paper are worth quoting precisely, because they make a specific claim about where evolving context pays off.
| Metric | Result | Comparison |
|---|---|---|
| Agent benchmarks | +10.6% average | vs strong tuning baselines |
| Finance benchmarks | +8.6% average | vs strong tuning baselines |
| AppWorld overall | Matched IBM CUGA | ACE on DeepSeek-V3 vs CUGA, the top production agent on AppWorld |
| AppWorld test-challenge | Surpassed CUGA | Smaller open-source model beats production agent on the harder split |
| Adaptation latency | -86.9% | vs strong baselines |
| Rollout cost | -83.6% | vs strong baselines |
| GEPA latency comparison | -82.3% | ACE vs GEPA on AppWorld |
| GEPA rollout count | -75.1% | ACE vs GEPA on AppWorld |
| Supervision required | None | Uses natural execution feedback (success, errors, trajectory outcomes) |
The AppWorld result is the one to internalize. CUGA is a production agent shipped by IBM, built specifically for the benchmark and tuned at scale. ACE running on a smaller, open-source model matches it on average and beats it on the harder split, using nothing more than execution feedback as a learning signal. The implication is not that DeepSeek-V3 is suddenly competitive with IBM’s pipeline; it is that the way the context evolves matters more for agent quality than the size or polish of the underlying model in the regimes most production teams operate in.
The absence of a supervision requirement is the second result worth keeping. Most prompt-optimization work assumes a labeled benchmark to evaluate against. ACE does not. Its only feedback signal is whether the agent’s actions succeeded or failed in the environment. For teams whose agents run in production environments without clean ground-truth labels, that constraint matches the actual operating conditions.
The mechanism translates to a few specific design moves that apply whether or not you adopt ACE itself.
Most teams that try to make an agent learn from its own work do this in one step: ask the model to “review the session and update its instructions.” That is exactly the operation that produces brevity bias. The reviewing model has incentives to write cleanly and concisely, and concision drops the specific facts that made the original instructions useful. Splitting the work into a separate reflection step (extract concrete observations) and a separate curation step (deterministically merge them into stored context) eliminates the failure mode.
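The split can be sketched as two functions with different contracts: one that is allowed to call a model and one that is not. Everything here is illustrative. `reflect`, `curate`, and `fake_model` are hypothetical names, and the model is a stub so the example runs without any API.

```python
from typing import Callable

ModelCall = Callable[[str], str]  # prompt in, raw text out


def reflect(trajectory: str, model: ModelCall) -> list[str]:
    """Model-backed step: extract observations only, never rewrite instructions."""
    prompt = (
        "List concrete observations from this trajectory, one per line. "
        "Do not rewrite or summarize any existing instructions.\n\n" + trajectory
    )
    reply = model(prompt)
    return [line.lstrip("- ").strip() for line in reply.splitlines() if line.strip()]


def curate(playbook: list[str], observations: list[str]) -> list[str]:
    """Deterministic step: no model call, existing entries are kept verbatim."""
    seen = {entry.casefold() for entry in playbook}
    merged = list(playbook)
    for obs in observations:
        if obs.casefold() not in seen:
            merged.append(obs)
            seen.add(obs.casefold())
    return merged


def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call, including a near-duplicate observation.
    return (
        "- Search API paginates at 50 results\n"
        "- search api paginates at 50 results\n"
        "- Login token expires after 15 minutes"
    )


playbook = ["Always confirm before deleting files"]
updated = curate(playbook, reflect("<session trajectory>", fake_model))
print(updated)
```

Because `curate` never sees a model, there is no point in the loop where the original entry can be shortened or paraphrased.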
Storage matters. A playbook stored as a single string gets rewritten on every update, which is the operation that compounds context collapse. A playbook stored as a list of entries, each with metadata and counters, can be appended to and pruned without rewriting the whole structure. Detail accumulates without being smoothed away. This is the same principle behind structured context for raw input data, applied to the operating context that drives agent behavior.
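One way to picture the difference is a playbook kept as a list of records and serialized one entry per line. The schema below (`entry_id`, `tags`, the two counters) and the helper names are assumptions for illustration, not a format taken from the paper.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PlaybookEntry:
    entry_id: int
    text: str
    tags: list[str] = field(default_factory=list)
    helpful: int = 0
    harmful: int = 0


def add_entry(entries: list[PlaybookEntry], text: str, tags: list[str]) -> PlaybookEntry:
    """Append a new entry; nothing already stored is touched."""
    next_id = max((e.entry_id for e in entries), default=0) + 1
    entry = PlaybookEntry(next_id, text, list(tags))
    entries.append(entry)
    return entry


def to_jsonl(entries: list[PlaybookEntry]) -> str:
    """One JSON object per line, so an append is a pure suffix."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in entries)


def render(entries: list[PlaybookEntry]) -> str:
    """Build the prompt section from entries at read time."""
    return "\n".join(f"[{e.entry_id}] {e.text}" for e in entries)


entries: list[PlaybookEntry] = []
add_entry(entries, "Invoices over 10,000 need a second approver", ["finance"])
add_entry(entries, "The calendar tool expects ISO 8601 timestamps", ["tools"])
before = to_jsonl(entries)

add_entry(entries, "Refunds older than 90 days must go through support", ["finance"])
after = to_jsonl(entries)

print(after.startswith(before))  # earlier entries are byte-identical
print(render(entries))
```

The prompt text is derived from the entries each time it is needed, so the stored form never has to pass through a rewrite.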
ACE’s helpful/harmful counters look unsophisticated next to natural-language explanations, and that is the point. Counters are deterministic, comparable across sessions, and resistant to the kind of drift that hits free-text rationales. When the Curator decides which entries to prune, it is reading numbers, not paragraphs. Free-text reasoning has a place during reflection; it should not be the persistent representation of what the agent has learned.
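A pruning rule that reads only counters might look like the sketch below. The thresholds `min_uses` and `max_harm_ratio` are made-up values for illustration; the paper's actual pruning criteria may differ.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    helpful: int
    harmful: int


def should_prune(entry: Entry, min_uses: int = 5, max_harm_ratio: float = 0.5) -> bool:
    """Decide from counters alone; no free-text rationale is consulted."""
    uses = entry.helpful + entry.harmful
    if uses < min_uses:
        return False  # not enough evidence yet, keep the entry
    return entry.harmful / uses > max_harm_ratio


def prune(entries: list[Entry]) -> list[Entry]:
    return [e for e in entries if not should_prune(e)]


entries = [
    Entry("Retry on HTTP 429 with backoff", helpful=8, harmful=1),
    Entry("Skip validation when the queue is long", helpful=1, harmful=6),
    Entry("Prefer batch endpoints for exports", helpful=0, harmful=2),
]
kept = prune(entries)
print([e.text for e in kept])
```

The third entry survives despite a poor record because two uses fall below the evidence threshold, which is the kind of rule that is easy to state and audit with numbers and hard to enforce with prose.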
Most agent stacks treat tool errors as exceptions to handle and successful completions as logging events. ACE treats both as inputs to context evolution. The same trajectory that completed successfully should produce one or more new playbook entries; the same trajectory that errored should produce entries about the failure mode and the recovery. Designing the agent loop so this feedback is captured cleanly is a different problem from optimizing the model itself, and most current agent platforms do not make it easy.
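As a rough sketch of capturing that feedback, the wrapper below records every tool call as a structured outcome, whether it succeeded or raised. `Outcome`, `run_tool`, and the `divide` tool are hypothetical; a real agent loop would hand the resulting log to its reflection step.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Outcome:
    tool: str
    args: dict[str, Any]
    ok: bool
    detail: str  # repr of the result, or the error type and message


def run_tool(name: str, fn: Callable[..., Any], log: list[Outcome], /, **args: Any) -> Outcome:
    """Run a tool and record success and failure in the same shape."""
    try:
        result = fn(**args)
        outcome = Outcome(name, args, True, repr(result))
    except Exception as exc:  # failures are learning signal, not just noise
        outcome = Outcome(name, args, False, f"{type(exc).__name__}: {exc}")
    log.append(outcome)
    return outcome


def divide(a: float, b: float) -> float:
    return a / b


log: list[Outcome] = []
run_tool("divide", divide, log, a=6, b=3)
run_tool("divide", divide, log, a=1, b=0)

for outcome in log:
    print(outcome.ok, outcome.detail)
```

Both calls end up in the same log with the same fields, so the step that extracts lessons does not need separate code paths for successes and errors.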
A playbook that compresses every cycle stays small and loses detail. A playbook with deterministic dedup and pruning grows with the agent’s experience, and that growth is the source of the quality gain. Teams designing context for long-running agents should plan for the playbook to be larger than the original system prompt, not smaller. The right metric is detail preserved per unit of context, not total context size.
Persistent agent memory shipped from Anthropic and several other providers in early 2026, as covered in our analysis of Managed Agents memory. The design questions that launch surfaced (scope, freshness, conflict, trust) are the same design questions ACE answers from a different angle. ACE’s Curator is the same role as the memory-write policy on a memory platform. ACE’s helpful/harmful counters are the same kind of provenance signal that lets a retrieval layer downweight stale entries. The architectures converge on the same principle: detail accumulates safely when reflection, storage, and retrieval are separate operations with their own rules.
The harder problem ACE does not solve directly is cross-agent context portability. The playbook is a per-agent artifact in the released implementation. Teams running agent fleets or multi-agent systems still face the open question of how learned context propagates across agents that share a workload. The Wire approach treats every container entry as inspectable and provenance-tagged, so the same lessons surface anywhere a compliant agent client connects through MCP; the broader field is still working out what the right interchange format looks like.
If you are building a long-running agent in 2026, three things from ACE are worth taking even if you never run the framework. First, do not rewrite your agent’s context; append structured deltas to it. Second, separate the step that extracts lessons from the step that merges them, so brevity bias does not eat your edge cases. Third, use execution outcomes as the feedback signal, not labeled benchmarks, because that is what your agent actually has access to in production. The gain ACE measures is real, and the design moves behind it work outside the specific framework. They are what agentic context engineering looks like in practice.
Sources: Agentic Context Engineering (arXiv 2510.04618) · ACE on OpenReview (ICLR 2026) · ACE GitHub implementation · SambaNova: ACE Open-Sourced · Hugging Face paper page · Microsoft Research publication · Softmax: The Biggest Lesson from ACE