Tool-based agent memory: why 2026 benchmarks favor it
Key takeaway
Context engineering is moving out of harness internals and into shared substrates that any agent tool can consume. OpenAI's codex-plugin-cc, AGENTS.md joining the Linux Foundation's Agentic AI Foundation, and a wave of Claude Code-to-Codex migration tools all point in one direction: the durable layer is files and MCP servers, not prompt scaffolding inside one CLI. The substrate has three tiers: static instructions in AGENTS.md, live tools over MCP, and navigable retrieval results carrying provenance. Only the third tier is still mostly missing in production systems.
OpenAI shipped codex-plugin-cc, an official package that lets you call Codex from inside Claude Code without leaving the workflow. The plugin shares the same Codex install, the same authentication state, and the same repository checkout. A few months earlier, AGENTS.md, OpenAI’s instruction-file format, joined the Linux Foundation under the new Agentic AI Foundation alongside Anthropic’s Model Context Protocol and Block’s goose. A wave of community migration tools has filled the gap the official plugin leaves open: claude2codex, cc2codex, and ai-config-sync-manager translate skills, agents, MCP servers, and permissions between harness vocabularies.
Treat these as separate features and they look like routine ecosystem plumbing. Treat them as one signal and the picture is sharper. The industry has stopped pretending the harness is the moat. Context engineering, the discipline of getting the right information to a model at the right time, is moving out of CLI internals and into shared substrates that any harness can consume. The context you wrote inside one tool’s prompt scaffolding does not survive a tool switch. The substrate the harness reads from does.
Three artifacts shipped between August 2025 and May 2026 make the substrate move concrete. Each one externalizes a layer of context that used to be locked inside a single coding agent. Read together, they describe the perimeter of what is now portable across harnesses.
| Artifact | What it is | What it makes portable |
|---|---|---|
| AGENTS.md | OpenAI-originated instruction file, now stewarded by the Linux Foundation Agentic AI Foundation | Project-level instructions, conventions, build commands |
| codex-plugin-cc | Official OpenAI plugin that runs Codex from inside Claude Code, sharing auth and repo state | Compute and authentication, not memory |
| Migration tools (claude2codex, cc2codex, ai-config-sync-manager) | Community CLIs that translate skills, agents, MCP servers, permissions across harnesses | The configuration layer below the system prompt |
AGENTS.md was released by OpenAI in August 2025 as a markdown-shaped, README-style file that coding agents read before doing work. By the time the Linux Foundation announced the Agentic AI Foundation on December 9, 2025, AGENTS.md was being read natively by Codex, Cursor, GitHub Copilot, Windsurf, Amp, Devin, Gemini CLI, Aider, Zed, Warp, VS Code, and roughly twenty other tools, and adopted by more than 60,000 open source projects. Claude Code reads its own CLAUDE.md by default, and a one-line setting in ~/.codex/config.toml lets Codex fall back to CLAUDE.md when AGENTS.md is missing. The point is not which file wins. The point is that the file lives outside the harness.
codex-plugin-cc takes the move one layer deeper. It does not import Claude Code’s prompt structure or memory. It hands off to the same Codex subprocess Claude Code’s user already authenticated, on the same checkout, and routes the result back. Claude Code becomes a viewer. Codex becomes a callable. Both share the working tree.
The community migration tools handle what the official plugin will not. ai-config-sync-manager translates Codex’s TOML-frontmatter agents to Claude Code’s YAML, maps Codex’s sandbox_mode, web_search, and prefix_rule permissions to Claude tool permissions, and keeps MCP server configurations in sync bidirectionally including bearer-token environment variables. claude2codex does a one-shot migration of plugins, MCP servers, memory files, and harness configs. The tools differ in scope. They agree on what is portable: anything that lives in a file or a server, and very little of what lives in the harness’s runtime prompt.
The portable layer is the substrate; the harness is the renderer. Anthropic’s engineering team defined context engineering in 2025 as the strategies for curating and maintaining the optimal set of tokens during LLM inference. Most teams read that and think prompt structure, system message ordering, memory layout inside one CLI. Read it again with the shipping evidence in front of you and a different frame appears: the optimal token set is curated upstream of any one harness, in artifacts that the harness reads.
Three things follow.
First, the harness is interchangeable on purpose. A team that runs Claude Code today and Codex next quarter, with Cursor for design work and Copilot inside the IDE, is not migrating context. It is pointing different harnesses at the same substrate. That substrate is AGENTS.md plus MCP servers plus skills folders plus permission files. The harness picks them up. If a feature only works inside one harness’s prompt, it is not part of the substrate, and it dies on tool switch.
Second, harness-internal context engineering hits a ceiling. The best-in-class harness can compress, compact, summarize, and re-rank context inside its window. None of that survives a handoff to the next harness, the next session, or the next agent. The team’s accumulated work, the entries that capture how the codebase actually behaves, and the provenance attached to each retrieval all have to live in something durable. “Why every agent handoff corrupts your context” describes the same problem one layer down: when context is held in flight, every transition leaks.
Third, provenance only makes sense at the substrate layer. If the agent will be a different agent next week, the only durable place to attach source identity, position, and typed edges is the retrieval result itself. “Provenance is a context engineering primitive” argues this directly: typed metadata is part of the tool contract, consumed at inference time, regardless of which harness is rendering the conversation.
Not every harness-neutral surface is equally load-bearing. AGENTS.md, MCP servers, and provenance-bearing retrieval results form a gradient from static to live to navigable, and conflating them obscures what each one buys.
Static instructions are the floor. AGENTS.md is a markdown file that ships in the repository and stays the same between calls. It carries the kind of context a README cannot carry without confusing a human reader: build steps a model can actually run, test commands with expected outputs, code conventions stated as instructions rather than examples, security rules expressed as policies. Codex’s discovery process walks from project root down to the current working directory, layering closer files over earlier guidance, and most of the major coding harnesses now read this hierarchy the same way. The win is consistency. Any harness reads the same file and gets the same baseline.
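The root-to-working-directory walk can be sketched in a few lines. This is an illustration of the layering idea, not Codex's actual implementation: the function names are hypothetical, the CLAUDE.md fallback is modeled as "first matching name per directory wins", and "layering" is modeled as plain concatenation with the closest file last.

```python
from pathlib import Path


def discover_instruction_files(root: Path, cwd: Path,
                               names=("AGENTS.md", "CLAUDE.md")) -> list[Path]:
    """Return instruction files from root down to cwd, most general first.

    In each directory the first name in `names` that exists is used, so
    CLAUDE.md only counts when AGENTS.md is missing (an assumed policy).
    """
    root, cwd = root.resolve(), cwd.resolve()
    rel = cwd.relative_to(root)  # raises ValueError if cwd is outside root
    found = []
    current = root
    for part in ("", *rel.parts):
        if part:
            current = current / part
        for name in names:
            candidate = current / name
            if candidate.is_file():
                found.append(candidate)
                break
    return found


def layered_instructions(files: list[Path]) -> str:
    """Concatenate files so guidance closer to cwd appears last."""
    return "\n\n".join(p.read_text(encoding="utf-8") for p in files)
```

Calling `discover_instruction_files(repo_root, repo_root / "pkg" / "api")` returns at most one file per directory along the path. Because the result depends only on the file tree, any harness that applies the same walk sees the same baseline.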
Live tools are the middle tier. MCP servers are runnable surfaces. The harness calls a tool, the server returns a result, the agent acts on it. The roadmap covered in “What MCP’s 2026 roadmap means for context delivery” is making this layer load-balanced, governed, and auditable. The substrate moves from “what should this agent know” to “what can this agent do, and what state does it return.”
Navigable context is the third tier and the one most retrieval APIs still skip. The tool result itself has to carry enough structure that the agent can plan its next call without re-asking the server: source identity, chunk position, total chunks, ingestion timestamp, and typed relationship edges to neighbors. This is the layer that turns substrate from a destination into a graph, and it is what stops multi-agent systems from collapsing on context handoff. The shipping evidence so far has populated the first two tiers. AGENTS.md fixes the static layer. The MCP roadmap is shoring up the live layer. The third tier is where the next round of work is, and it is the layer where free-text quality notes, trust scores, and episodic memory blobs do not belong, because they collapse a judgment that depends on the agent’s task back into the server.
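As a rough sketch of what a navigable result could look like, the structure below carries the fields listed above and lets the agent choose its next fetch from the result alone. The class names, field names, and the `"next"` relation label are assumptions for illustration, not a schema from any particular retrieval API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Edge:
    relation: str   # typed relationship, e.g. "next", "previous", "cites"
    target_id: str  # id of the neighboring entry


@dataclass(frozen=True)
class RetrievalResult:
    entry_id: str
    content: str
    source: str            # source identity, e.g. a document path or URL
    chunk_index: int       # zero-based position within the source
    total_chunks: int
    ingested_at: datetime
    edges: tuple[Edge, ...] = ()


def plan_next_call(result: RetrievalResult) -> Optional[str]:
    """Pick the entry to fetch next using only what the result carries."""
    if result.chunk_index + 1 >= result.total_chunks:
        return None  # already at the last chunk of this source
    for edge in result.edges:
        if edge.relation == "next":
            return edge.target_id
    return None
```

Note that the result holds no quality note or trust score. It reports where the chunk sits and what it connects to, and leaves the judgment about whether to follow an edge to the agent that knows the task.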
The substrate is what compounds; the harness is what swaps. Three practical consequences for teams building agents.
Move durable context out of the harness. If your agent’s behavior depends on a long system prompt, pinned files, or a memory layer that only one harness exposes, plan for the day a teammate prefers a different harness. The substrate-shaped move is to put the same context into AGENTS.md (for static instructions), an MCP server (for live retrieval), or a structured retrieval result (for grounded answers). The reward is portability without rewrite, and a smaller blast radius when the next harness ships an incompatible memory feature.
Pick MCP transports that survive a fleet. The 2026 roadmap is moving Streamable HTTP toward stateless operation. Build against that shape now: stateless transport, session state held server-side, sessions resumable across instances. A harness that reconnects to a server it has talked to before should pick up where it left off. A harness that hands off to a different agent should be able to point that agent at the same server.
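The shape is easier to see in miniature. The sketch below is plain Python, not the MCP SDK: `SessionStore` stands in for whatever shared database a real deployment would use, and `ServerInstance` is a hypothetical request handler that keeps no per-session state of its own, so any instance can serve any request.

```python
from dataclasses import dataclass, field


@dataclass
class SessionStore:
    """Stand-in for a shared store; a dict keeps the sketch runnable."""
    _data: dict = field(default_factory=dict)

    def load(self, session_id: str) -> list:
        return list(self._data.get(session_id, []))

    def save(self, session_id: str, history: list) -> None:
        self._data[session_id] = list(history)


class ServerInstance:
    """One replica in the fleet. All session state lives in the store."""

    def __init__(self, name: str, store: SessionStore):
        self.name = name
        self.store = store

    def handle(self, session_id: str, query: str) -> dict:
        history = self.store.load(session_id)
        history.append(query)
        self.store.save(session_id, history)
        return {"served_by": self.name, "turn": len(history), "history": history}
```

If instance `a` handles the first request of a session and instance `b` handles the second, `b` still reports turn 2 with the full history. That is the property that lets a different harness or agent resume against the same server.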
Make provenance non-negotiable in tool returns. If a retrieval call returns just an id, a score, and a content blob, the agent has to reconstruct the rest from the text alone, and a different harness will reconstruct it differently. Source identity, position, time, and typed edges should live in the result. They are the structure that turns substrate into something an agent can navigate without harness-specific logic, and they are the smallest unit of work that future-proofs your retrieval layer against the harness churn the rest of the stack is about to go through.
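One lightweight way to enforce this is a contract check on every tool return before it reaches the agent. The sketch below assumes results arrive as dictionaries. The required key names are hypothetical and should be replaced with whatever your retrieval layer actually emits.

```python
# Hypothetical field names for the provenance a result is expected to carry.
REQUIRED_PROVENANCE = ("source", "chunk_index", "total_chunks", "ingested_at", "edges")


def missing_provenance(result: dict) -> list[str]:
    """Return the provenance keys that are absent or null in a tool result."""
    return [key for key in REQUIRED_PROVENANCE
            if key not in result or result[key] is None]
```

A bare `{"id": ..., "score": ..., "content": ...}` result fails on all five keys, while an empty `edges` list passes, since having no neighbors is itself a valid statement about the entry. Running the check in the server's test suite rather than in harness code keeps the guarantee on the substrate side.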
In Wire’s deployment, each context container runs as a remote MCP server on a per-organization subdomain, so connecting Claude Code, Codex, Cursor, or any other MCP-aware harness to the same container yields the same entries with the same provenance attached. The harness becomes whatever the developer has open that day; the substrate is what holds the context.
When OpenAI ships codex-plugin-cc and contributes AGENTS.md to the Linux Foundation, the implicit message is that the substrate is the value. Anthropic’s contribution of MCP to the same foundation says the same thing in a different vocabulary. The harness layer will keep churning. New CLIs will arrive, old ones will refactor their memory features, and migration scripts will keep being written. The substrate layer is what teams should be investing in, because it is the part that survives every harness change, every agent rewrite, every model upgrade. Context engineering reads less like “tune your prompts” and more like “design your substrate” once you accept which layer compounds.
If you want to see what a substrate-shaped context layer looks like in practice, Wire gives you a remote MCP server, structured entries, and provenance on every retrieval as soon as you create a container.
Sources: openai/codex-plugin-cc · Codex AGENTS.md guide · AGENTS.md · Linux Foundation: Formation of the Agentic AI Foundation · OpenAI: co-founds the Agentic AI Foundation · Sync Codex and Claude Code configs (community thread) · claude2codex migration tool · Anthropic: Effective context engineering for AI agents