What 466 AGENTS.md files teach about context engineering
Key takeaway
Meta context engineering (MCE) learns the context-engineering process itself instead of hand-crafting prompts and examples. In ICML 2026 results from a Peking University team, MCE delivered an 89.1% average improvement over a base agent across five domains, versus 70.7% for ACE-style additive curation, while training 13.6 times faster. The practical takeaway is that the highest-leverage work shifts from writing context to designing the substrate and feedback loop an agent uses to write its own.
Meta context engineering (MCE) is a 2026 method that learns the context-engineering process itself rather than hand-crafting prompts and examples, and it beats the previous best by a wide margin. In results from a Peking University team presented at ICML 2026, MCE delivered an 89.1% average improvement over a base agent across five domains, compared with 70.7% for ACE, the additive-curation framework that held the prior record. It did this while training 13.6 times faster and using 4.8 times fewer rollouts.
The headline gain is striking, but the more useful idea is what it implies about where effort should go. For two years the advice has been to get good at context engineering: write better instructions, curate better examples, structure the inputs an agent sees. MCE is evidence that the procedure for doing this is itself learnable, and that a learned procedure outperforms a hand-designed one. The work humans do by hand is becoming the thing the system optimizes.
Each generation of this work has moved the human one level further from the tokens. Prompt engineering put a person in charge of writing the instruction string directly. Agentic context engineering, the ACE framework we covered in our analysis of self-evolving contexts, handed that job to the system: ACE maintains an itemized playbook of lessons drawn from execution feedback, using a Generator that produces reasoning, a Reflector that extracts insights, and a Curator that merges them as structured deltas. ACE beat tuned prompts by 10.6% on agent benchmarks precisely because it stopped rewriting context as prose and started accumulating it as structured entries, which avoids the brevity bias and context collapse that erode hand-maintained prompts.
MCE moves up one more level. ACE still relies on a fixed procedure: the three-role loop and the itemized playbook format are designed in advance and applied to every task. The MCE authors argue that this fixed workflow is the new ceiling. Prompt rewriting (as in GEPA), additive curation (as in ACE), and hand-built agent harnesses all share the same limitation: a human decided ahead of time what good context looks like and how it should be assembled, and the system can only fill in that template. MCE treats the context-engineering skill itself as a learnable object, so the procedure can change per task rather than being locked.
MCE is a bi-level framework, which is the part of the name worth unpacking. There are two agents working at different levels of abstraction.
The meta-level agent evolves skills. A skill is not a single prompt; it is a bundle that can include methodology, executable code, context templates, and operators that transform context. The meta-level agent reads task specifications and performance history, then generates improved skills through agentic crossover, recombining what worked across prior attempts. This is the level where “how to do context engineering” is learned.
The base-level agent runs those skills to produce the actual context for a task. Critically, the output is context as files and code with no imposed structure. Where ACE forces every lesson into an itemized list, MCE lets the learned skill decide the representation, including writing code that assembles context at runtime. The representation is searched, not assumed.
| Approach | What is learned | What is fixed by hand | Context representation |
|---|---|---|---|
| GEPA | An optimized prompt string | The rewrite-and-evaluate procedure | A single prompt |
| ACE | The content of an itemized playbook | The three-role loop and the list format | A structured list of entries |
| MCE | The context-engineering skill itself | Almost nothing above the skill | Whatever the skill produces, including files and code |
The progression is consistent: each step removes one hand-designed assumption. MCE removes the assumption that there is one right way to represent and assemble context.
The reported gains are large in both the offline and online settings, and the gap widens where it matters most. Offline, where context is engineered before deployment, MCE reached an 89.1% average relative improvement over the base agent versus 70.7% for ACE. Online, where the agent adapts during deployment, MCE reached 74.1% versus 41.1% for ACE.
| Metric | MCE | ACE | Notes |
|---|---|---|---|
| Offline relative gain over base | 89.1% | 70.7% | Averaged across five domains |
| Online relative gain over base | 74.1% | 41.1% | Adapting during deployment |
| Training speed | 13.6x faster | baseline | vs ACE |
| Rollouts to converge | 4.8x fewer | baseline | vs ACE |
| Context length | 1.5K to 86K tokens | fixed format | Scales to the task |
| Domains | 5 | same | Finance, chemistry, medicine, law, AI safety |
Two details deserve attention. The first is the online gap of roughly 33 points, much wider than the offline gap of 18. Online is the regime most production agents actually live in, adapting as inputs and conditions drift, and it is exactly the regime where a fixed procedure ages worst. A learned procedure that can change its own representation holds up better as the task moves.
The second is the dynamic context length. MCE does not commit to a token budget. It expands and contracts between 1.5K and 86K tokens depending on what a task needs, which is a different stance from the usual practice of picking a context budget and forcing every task into it. The right amount of context turns out to be a property of the task, learned alongside everything else.
Hand-engineering encodes a fixed theory of what good context looks like, and that theory becomes the ceiling. When a person writes a prompt or a curator maintains an itemized list, they have committed to one representation, and every task is forced through it. A finance reasoning task and a chemistry protocol task get the same shape of context even though the structure that helps each is different. MCE lets the representation be part of what is searched, so different tasks can converge on structurally different context, including code where code is the clearer encoding.
This is the substrate-versus-harness distinction we have made before: the durable leverage is not in a cleverer agent loop but in the substrate context lives in, and in the feedback signal that tells the optimizer what worked. A learned context optimizer is only as good as the store it reads from and the evidence it gets back. MCE writes context as files and code and learns from execution feedback, which means the substrate it reads from and writes back to matters as much as the optimizer: a flat prompt string gives a learned curator nothing to diff against from one cycle to the next. An agent connected to a Wire container writes back into individually addressable, provenance-tagged entries instead, so each revision lands as a discrete, inspectable change the next cycle can reason about rather than a blind overwrite of one long string (how Wire scopes what an agent reads and writes).
The same point holds at the storage layer. Even when an agent generates its own context, that context benefits from being stored with structure rather than as raw text, because structure is what makes the next revision targeted instead of total. This is the case for structured context over raw text applied to machine-written context, and it is why feedback signals like helpful and harmful counters, a form of epistemic provenance, do more work than free-text rationales: they are comparable across cycles and resistant to drift.
If you are hand-tuning prompts and examples for a long-running agent, the frontier is no longer a better prompt. It is the loop and the substrate around it. A few specific moves follow from the MCE result, whether or not you adopt the framework.
The instinct is to keep improving the prompt or the example set. MCE suggests the higher-leverage target is the procedure that produces them. Even a lightweight version of this, a step that proposes and tests changes to how context is assembled, beats endlessly polishing a single static artifact.
A learned curator needs to see what changed between cycles. Context stored as one long string forces every update to be a rewrite, which is the operation that compounds context collapse. Context stored as addressable entries lets updates be discrete and inspectable, which is what makes iterative improvement stable.
MCE, like ACE before it, learns from whether actions succeeded or failed in the environment rather than from labeled data. Most agent stacks treat tool errors as exceptions to swallow and successes as log lines. Capturing both cleanly, as inputs to the next context cycle, is a different engineering problem from improving the model, and it is the one that pays off here.
If the optimal amount of context is a learned property of the task, a fixed token budget is leaving accuracy on the table for hard tasks and wasting tokens on easy ones. Design for context length that scales with task difficulty rather than a single number applied everywhere.
MCE’s evolved skills and artifacts are per-task and per-agent in the released setup, which leaves the same gap ACE left. When a fleet of agents shares a workload, there is no clean way for the context-engineering skill one agent learns to propagate to the others. The field is converging on the principle that detail accumulates safely when assembly, storage, and feedback are separate operations with their own rules, but the interchange format for moving learned context between agents is still unsettled. That portability problem is where a lot of the next year of context engineering work is likely to go.
Three things from MCE are worth keeping even if you never run it. First, the highest-leverage work on an agent is shifting from writing context to designing the procedure and substrate that let the agent write its own. Second, learned context beats hand-engineered context partly because it can change its representation per task instead of forcing everything through one fixed shape, and the advantage grows in the online regime where production agents live. Third, none of this works without a store the optimizer can diff against and a clean execution-feedback signal, which are engineering problems separate from the model and mostly unsolved in current stacks. The result is real, and it points at what context engineering becomes when the engineering itself is learned.
Sources: Meta Context Engineering via Agentic Skill Evolution (arXiv 2601.21557) · MCE GitHub implementation · Agentic Context Engineering (arXiv 2510.04618) · A Survey of Context Engineering for Large Language Models (arXiv 2507.13334)
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container