Meta context engineering beats hand-tuned context

Jitpal Kocher · 9 min read

Key takeaway

Meta context engineering (MCE) learns the context-engineering process itself instead of hand-crafting prompts and examples. In ICML 2026 results from a Peking University team, MCE delivered an 89.1% average improvement over a base agent across five domains, versus 70.7% for ACE-style additive curation, while training 13.6 times faster. The practical takeaway is that the highest-leverage work shifts from writing context to designing the substrate and feedback loop an agent uses to write its own.

Meta context engineering (MCE) is a 2026 method that learns the context-engineering process itself rather than hand-crafting prompts and examples, and it beats the previous best by a wide margin. In results from a Peking University team presented at ICML 2026, MCE delivered an 89.1% average improvement over a base agent across five domains, compared with 70.7% for ACE, the additive-curation framework that held the prior record. It did this while training 13.6 times faster and using 4.8 times fewer rollouts.

The headline gain is striking, but the more useful idea is what it implies about where effort should go. For two years the advice has been to get good at context engineering: write better instructions, curate better examples, structure the inputs an agent sees. MCE is evidence that the procedure for doing this is itself learnable, and that a learned procedure outperforms a hand-designed one. The work humans do by hand is becoming the thing the system optimizes.

The lineage: prompts, then playbooks, then the skill itself

Each generation of this work has moved the human one level further from the tokens. Prompt engineering put a person in charge of writing the instruction string directly. Agentic context engineering, the ACE framework we covered in our analysis of self-evolving contexts, handed that job to the system. ACE maintains an itemized playbook of lessons drawn from execution feedback, using a Generator that produces reasoning, a Reflector that extracts insights, and a Curator that merges them as structured deltas. It beat strong baselines by 10.6% on agent benchmarks precisely because it stopped rewriting context as prose and started accumulating it as structured entries, which avoids the brevity bias and context collapse that erode hand-maintained prompts.
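To make "merges them as structured deltas" concrete, here is a minimal sketch of an itemized playbook that is updated by applying a delta instead of being rewritten. The names `Entry`, `Delta`, and `merge_delta` are hypothetical illustrations, not ACE's actual API, and the helpful and harmful counters are a simplification of how feedback might be attached to entries.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Entry:
    """One itemized lesson in the playbook (hypothetical shape)."""
    entry_id: str
    text: str
    helpful: int = 0
    harmful: int = 0


@dataclass
class Delta:
    """A structured update: new entries plus counter bumps for existing ones."""
    add: list[Entry] = field(default_factory=list)
    mark_helpful: list[str] = field(default_factory=list)
    mark_harmful: list[str] = field(default_factory=list)


def merge_delta(playbook: dict[str, Entry], delta: Delta) -> dict[str, Entry]:
    """Return a new playbook with the delta applied; untouched entries survive verbatim."""
    merged = dict(playbook)
    for entry in delta.add:
        # Additive curation: never overwrite a lesson that already exists.
        merged.setdefault(entry.entry_id, entry)
    for entry_id in delta.mark_helpful:
        if entry_id in merged:
            old = merged[entry_id]
            merged[entry_id] = replace(old, helpful=old.helpful + 1)
    for entry_id in delta.mark_harmful:
        if entry_id in merged:
            old = merged[entry_id]
            merged[entry_id] = replace(old, harmful=old.harmful + 1)
    return merged
```

Because the merge only appends entries and bumps counters, no existing lesson text is ever paraphrased or shortened, which is the property the prose above credits for avoiding context collapse.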

MCE moves up one more level. ACE still relies on a fixed procedure: the three-role loop and the itemized playbook format are designed in advance and applied to every task. The MCE authors argue that this fixed workflow is the new ceiling. Prompt rewriting (as in GEPA), additive curation (as in ACE), and hand-built agent harnesses all share the same limitation: a human decided ahead of time what good context looks like and how it should be assembled, and the system can only fill in that template. MCE treats the context-engineering skill itself as a learnable object, so the procedure can change per task rather than being locked.

What “meta” means here: skills that build context

MCE is a bi-level framework, which is the part of the name worth unpacking. There are two agents working at different levels of abstraction.

The meta-level agent evolves skills. A skill is not a single prompt; it is a bundle that can include methodology, executable code, context templates, and operators that transform context. The meta-level agent reads task specifications and performance history, then generates improved skills through agentic crossover, recombining what worked across prior attempts. This is the level where “how to do context engineering” is learned.

The base-level agent runs those skills to produce the actual context for a task. Critically, the output is context as files and code with no imposed structure. Where ACE forces every lesson into an itemized list, MCE lets the learned skill decide the representation, including writing code that assembles context at runtime. The representation is searched, not assumed.
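The two levels can be sketched as a toy loop, under heavy assumptions. Here a "skill" is reduced to a plain Python function that turns a task into context, the scorer stands in for execution feedback, and "crossover" simply concatenates what two skills produce. All of these names are hypothetical; in the framework described above the meta-level agent is an LLM agent that edits methodology, files, and code, which this sketch does not attempt to model.

```python
from typing import Callable

Skill = Callable[[str], str]          # hypothetical: task text -> context text
Scorer = Callable[[str, str], float]  # hypothetical: (task, context) -> score


def base_level(skill: Skill, task: str) -> str:
    """Base level: run a skill to produce the context for one task."""
    return skill(task)


def crossover(a: Skill, b: Skill) -> Skill:
    """Toy stand-in for agentic crossover: combine what two skills produce."""
    return lambda task: a(task) + "\n" + b(task)


def meta_level(skills: list[Skill], tasks: list[str], score: Scorer,
               rounds: int = 2) -> Skill:
    """Meta level: rank skills by feedback, recombine the top two, return the best.

    Assumes at least two starting skills and at least one task.
    """
    def fitness(skill: Skill) -> float:
        return sum(score(t, base_level(skill, t)) for t in tasks) / len(tasks)

    population = list(skills)
    for _ in range(rounds):
        population.sort(key=fitness, reverse=True)
        population = population[:2] + [crossover(population[0], population[1])]
    return max(population, key=fitness)
```

The point of the split is that `meta_level` never writes context itself. It only chooses and recombines the procedures that do, which is what separates learning the skill from learning the content.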

| Approach | What is learned | What is fixed by hand | Context representation |
| --- | --- | --- | --- |
| GEPA | An optimized prompt string | The rewrite-and-evaluate procedure | A single prompt |
| ACE | The content of an itemized playbook | The three-role loop and the list format | A structured list of entries |
| MCE | The context-engineering skill itself | Almost nothing above the skill | Whatever the skill produces, including files and code |

The progression is consistent: each step removes one hand-designed assumption. MCE removes the assumption that there is one right way to represent and assemble context.

The numbers worth keeping

The reported gains are large in both the offline and online settings, and the gap widens where it matters most. Offline, where context is engineered before deployment, MCE reached an 89.1% average relative improvement over the base agent versus 70.7% for ACE. Online, where the agent adapts during deployment, MCE reached 74.1% versus 41.1% for ACE.

| Metric | MCE | ACE | Notes |
| --- | --- | --- | --- |
| Offline relative gain over base | 89.1% | 70.7% | Averaged across five domains |
| Online relative gain over base | 74.1% | 41.1% | Adapting during deployment |
| Training speed | 13.6x faster | baseline | vs ACE |
| Rollouts to converge | 4.8x fewer | baseline | vs ACE |
| Context length | 1.5K to 86K tokens | fixed format | Scales to the task |
| Domains | 5 | same | Finance, chemistry, medicine, law, AI safety |

Two details deserve attention. The first is the online gap of 33.0 percentage points (74.1% versus 41.1%), much wider than the offline gap of 18.4 points (89.1% versus 70.7%). Online is the regime most production agents actually live in, adapting as inputs and conditions drift, and it is exactly the regime where a fixed procedure ages worst. A learned procedure that can change its own representation holds up better as the task moves.
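The gaps quoted here are plain differences between the reported relative gains, and the arithmetic is easy to check. The ratio at the end is a derived figure for illustration only, not something stated in the source.

```python
# Reported average relative improvement over the base agent, in percent.
results = {
    "offline": {"MCE": 89.1, "ACE": 70.7},
    "online": {"MCE": 74.1, "ACE": 41.1},
}

# Gap in percentage points of relative improvement.
gaps = {name: round(r["MCE"] - r["ACE"], 1) for name, r in results.items()}

# Derived for illustration: how many times larger MCE's gain is than ACE's.
ratios = {name: round(r["MCE"] / r["ACE"], 2) for name, r in results.items()}

print(gaps)    # {'offline': 18.4, 'online': 33.0}
print(ratios)  # {'offline': 1.26, 'online': 1.8}
```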

The second is the dynamic context length. MCE does not commit to a token budget. It expands and contracts between 1.5K and 86K tokens depending on what a task needs, which is a different stance from the usual practice of picking a context budget and forcing every task into it. The right amount of context turns out to be a property of the task, learned alongside everything else.

Why learned context beats hand-engineering

Hand-engineering encodes a fixed theory of what good context looks like, and that theory becomes the ceiling. When a person writes a prompt or a curator maintains an itemized list, they have committed to one representation, and every task is forced through it. A finance reasoning task and a chemistry protocol task get the same shape of context even though the structure that helps each is different. MCE lets the representation be part of what is searched, so different tasks can converge on structurally different context, including code where code is the clearer encoding.

This is the substrate-versus-harness distinction we have made before: the durable leverage is not in a cleverer agent loop but in the substrate where context lives and in the feedback signal that tells the optimizer what worked. A learned context optimizer is only as good as the store it reads from and the evidence it gets back. MCE writes context as files and code and learns from execution feedback, so the substrate matters as much as the optimizer: a flat prompt string gives a learned curator nothing to diff against from one cycle to the next. An agent connected to a Wire container instead writes back into individually addressable, provenance-tagged entries, so each revision lands as a discrete, inspectable change the next cycle can reason about rather than a blind overwrite of one long string.

The same point holds at the storage layer. Even when an agent generates its own context, that context benefits from being stored with structure rather than as raw text, because structure is what makes the next revision targeted instead of total. This is the case for structured context over raw text applied to machine-written context, and it is why feedback signals like helpful and harmful counters, a form of epistemic provenance, do more work than free-text rationales: they are comparable across cycles and resistant to drift.

What this means for how you build

If you are hand-tuning prompts and examples for a long-running agent, the frontier is no longer a better prompt. It is the loop and the substrate around it. A few specific moves follow from the MCE result, whether or not you adopt the framework.

Optimize the procedure, not the artifact

The instinct is to keep improving the prompt or the example set. MCE suggests the higher-leverage target is the procedure that produces them. Even a lightweight version of this, a step that proposes and tests changes to how context is assembled, beats endlessly polishing a single static artifact.
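A lightweight version of "propose and test changes to how context is assembled" might look like the greedy search below. This is a sketch under stated assumptions: `AssemblyConfig` and its two knobs are hypothetical, and `evaluate` is whatever scoring function you already have (for example, task success rate on a held-out set). It is far simpler than MCE's skill evolution and is only meant to show the shape of the loop.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class AssemblyConfig:
    """Hypothetical knobs for how context is assembled, not what it says."""
    num_examples: int = 2
    include_glossary: bool = False


def propose(config: AssemblyConfig, rng: random.Random) -> AssemblyConfig:
    """Propose one small change to the assembly procedure."""
    if rng.random() < 0.5:
        step = rng.choice([-1, 1])
        return replace(config, num_examples=max(0, config.num_examples + step))
    return replace(config, include_glossary=not config.include_glossary)


def optimize(evaluate: Callable[[AssemblyConfig], float], start: AssemblyConfig,
             steps: int = 50, seed: int = 0) -> AssemblyConfig:
    """Greedy search: keep a proposed procedure only if it scores strictly higher."""
    rng = random.Random(seed)
    best, best_score = start, evaluate(start)
    for _ in range(steps):
        candidate = propose(best, rng)
        candidate_score = evaluate(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```

The thing being edited is the configuration of the assembler, not any single prompt, so an improvement found here carries over to every context the assembler builds afterwards.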

Give the optimizer a substrate it can diff

A learned curator needs to see what changed between cycles. Context stored as one long string forces every update to be a rewrite, which is the operation that compounds context collapse. Context stored as addressable entries lets updates be discrete and inspectable, which is what makes iterative improvement stable.
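As a minimal illustration of what "a substrate it can diff" buys you, the sketch below compares two snapshots of a store keyed by entry id. The store shape (a plain dict of id to text) and the function name `diff_entries` are assumptions made for the example, not any particular product's format.

```python
def diff_entries(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Compare two snapshots of an entry store keyed by entry id."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(k for k in shared if before[k] != after[k]),
    }


before = {
    "units": "Convert units before comparing values.",
    "dates": "Dates are ISO 8601.",
}
after = {
    "units": "Convert units to SI before comparing values.",
    "dates": "Dates are ISO 8601.",
    "rate-limit": "Back off when the API returns HTTP 429.",
}

print(diff_entries(before, after))
# {'added': ['rate-limit'], 'removed': [], 'changed': ['units']}
```

With a single prompt string there are no stable ids to key on, so the same comparison degrades to a text diff over the whole blob and the optimizer cannot attribute an outcome to a specific entry.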

Treat execution feedback as the training signal

MCE, like ACE before it, learns from whether actions succeeded or failed in the environment rather than from labeled data. Most agent stacks treat tool errors as exceptions to swallow and successes as log lines. Capturing both cleanly, as inputs to the next context cycle, is a different engineering problem from improving the model, and it is the one that pays off here.
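One way to capture both outcomes as inputs to the next cycle is a thin wrapper around tool calls that records success and failure against the context entries that were in play. This is a sketch, assuming a hypothetical `FeedbackLog` class and a deliberately crude credit rule (every entry used in a call gets the credit or the blame); a real system would want finer attribution.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FeedbackLog:
    """Hypothetical per-entry counters fed by tool outcomes."""
    helpful: Counter = field(default_factory=Counter)
    harmful: Counter = field(default_factory=Counter)
    errors: list[str] = field(default_factory=list)

    def run(self, tool: Callable[[], object], used_entries: list[str]) -> Optional[object]:
        """Run a tool call and credit or blame the context entries that were in play."""
        try:
            result = tool()
        except Exception as exc:
            # Record the failure as signal instead of swallowing it.
            self.harmful.update(used_entries)
            self.errors.append(f"{type(exc).__name__}: {exc}")
            return None
        self.helpful.update(used_entries)
        return result
```

The counters are the part that survives across cycles: they are comparable from one run to the next in a way that free-text error messages are not.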

Budget for variable context, not fixed context

If the optimal amount of context is a learned property of the task, a fixed token budget is leaving accuracy on the table for hard tasks and wasting tokens on easy ones. Design for context length that scales with task difficulty rather than a single number applied everywhere.
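As a starting point before anything is learned, a budget that scales with an estimated difficulty could be as simple as the heuristic below. To be clear about the assumptions: MCE learns context length rather than computing it from a formula, the `difficulty` score in [0, 1] is a hypothetical input you would have to supply, and the bounds are just the 1.5K to 86K range reported above reused as defaults.

```python
import math

MIN_TOKENS = 1_500   # lower end of the range reported for MCE
MAX_TOKENS = 86_000  # upper end of the range reported for MCE


def context_budget(difficulty: float) -> int:
    """Map a difficulty estimate in [0, 1] to a token budget on a log scale."""
    d = min(1.0, max(0.0, difficulty))  # clamp out-of-range estimates
    log_min, log_max = math.log(MIN_TOKENS), math.log(MAX_TOKENS)
    return round(math.exp(log_min + d * (log_max - log_min)))
```

A log scale is used so that the budget grows by a constant factor per step of difficulty, which keeps easy tasks near the floor instead of spending half the range on them.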

The open problem: portability across agents

MCE’s evolved skills and artifacts are per-task and per-agent in the released setup, which leaves the same gap ACE left. When a fleet of agents shares a workload, there is no clean way for the context-engineering skill one agent learns to propagate to the others. The field is converging on the principle that detail accumulates safely when assembly, storage, and feedback are separate operations with their own rules, but the interchange format for moving learned context between agents is still unsettled. That portability problem is where a lot of the next year of context engineering work is likely to go.

Takeaways

Three things from MCE are worth keeping even if you never run it. First, the highest-leverage work on an agent is shifting from writing context to designing the procedure and substrate that let the agent write its own. Second, learned context beats hand-engineered context partly because it can change its representation per task instead of forcing everything through one fixed shape, and the advantage grows in the online regime where production agents live. Third, none of this works without a store the optimizer can diff against and a clean execution-feedback signal, which are engineering problems separate from the model and mostly unsolved in current stacks. The result is real, and it points at what context engineering becomes when the engineering itself is learned.


Sources: Meta Context Engineering via Agentic Skill Evolution (arXiv 2601.21557) · MCE GitHub implementation · Agentic Context Engineering (arXiv 2510.04618) · A Survey of Context Engineering for Large Language Models (arXiv 2507.13334)

Frequently asked questions

How is meta context engineering different from agentic context engineering (ACE)?
ACE learns the content of an agent's context: it maintains an itemized playbook of lessons using a fixed Generator, Reflector, and Curator loop. Meta context engineering learns the procedure itself, evolving the skills that decide how context is represented and assembled, so the structure is not fixed in advance. In the paper's results MCE beats ACE by 18.4 points offline and 33 points online.

Does meta context engineering replace prompt engineering?
It moves the human another level away from the tokens. Instead of writing prompts or maintaining a playbook by hand, you design the feedback loop and the store the agent reads from and writes back to. Prompt engineering still matters for narrow, stable tasks where the cost of a learning loop is not justified.

When should you let an agent engineer its own context instead of hand-tuning it?
Learned context optimization pays off when an agent runs over many sessions on the same domain, when execution feedback like task success or tool errors is available, and when the task surface is wide enough that one hand-written representation leaves accuracy on the table. For a single narrow task with a stable input shape, a fixed prompt is cheaper and good enough.

What does learned context optimization need from a context store?
It needs a substrate it can diff against. A flat prompt string gives a learned curator nothing to compare across cycles, so revisions become blind overwrites. Individually addressable, provenance-tagged entries let each revision land as a discrete, inspectable change the next cycle can reason about.

Is learned context optimization worth the compute cost?
The 2026 results suggest the cost concern is now backwards. MCE reached higher accuracy than ACE while training 13.6 times faster and using 4.8 times fewer rollouts, because it stops re-running an expensive fixed procedure and instead recombines skills that worked in prior attempts.
