Constraint decay: structural rules break AI coding agents

Jitpal Kocher · 9 min read

Key takeaway

Constraint decay is the systematic drop in AI coding agent accuracy when structural constraints (architecture patterns, database choice, ORM rules) are added to a functional spec. A 2026 EURECOM study across 80 tasks, 8 frameworks, and 10 model-agent configurations found capable agents lose 30 percentage points on average, with a database constraint alone costing up to 19 points. Data-layer defects drive roughly 45% of logic failures. The paper frames the fix as a context engineering one: retrieval-augmented framework documentation rather than larger models.

You give an AI coding agent a clean specification: build a REST API with these 19 endpoints. The agent runs in a clean window, picks its own structure, and ships something that passes 86% of the assertions. Then you add one sentence: “use Clean Architecture, PostgreSQL, and SQLAlchemy.” The same agent, same model, same spec, drops to 46%.

That gap has a name. A 2026 paper from EURECOM and the University of Basilicata calls it constraint decay: the systematic decline of LLM agent performance as structural constraints accumulate. Across 80 backend generation tasks, 8 web frameworks, and 10 model-agent configurations, capable agents lose 30 percentage points of assertion pass rate on average from baseline to fully specified tasks, a relative loss of about 40% of the unconstrained baseline (L0) score.

This is a context engineering result, not a model capability one. The agents know how to build the API. They lose accuracy when the context demands they obey conventions the model has to recall from pre-training rather than have at hand. Below: what the paper measured, where the drop concentrates, and why retrieval-augmented framework knowledge is the obvious next step.

How big the constraint decay drop is

Capable AI coding agents lose 30 percentage points of assertion pass rate on average when structural constraints are added to a functional spec. The EURECOM team measured this across the 8 capable configurations in their study, defined as those scoring above 50% on the unconstrained baseline. The drop from L0 (web framework only) to L3 (architecture + database + ORM) is universal. Every configuration loses ground.

The methodology pins down causation cleanly. The team fixed a single OpenAPI spec, the RealWorld Conduit API with 19 CRUD operations, then layered constraints across four levels: L0 (framework only), L1 (one extra constraint), L2 (two), L3 (architecture + database + ORM). The variants span 8 frameworks for a total of 80 tasks, each checked against 291 behavioral assertions. Agents run inside Docker, and their code is graded by end-to-end HTTP tests plus static verifiers that check whether the architecture, database, and ORM constraints were actually satisfied. The total evaluation burned roughly 5 billion tokens.

The headline numbers for the eight capable configurations, five of which lose 30 or more points:

| Agent + model | L0 A% | L3 A% | ΔA% |
| --- | --- | --- | --- |
| OpenHands + MiniMax-M2.5 | 95.6 | 78.6 | -17.0 |
| OpenHands + GPT-5-mini | 65.8 | 52.2 | -13.6 |
| Mini-SWE + GPT-5-mini | 51.7 | 23.7 | -28.0 |
| Mini-SWE + MiniMax-M2.5 | 88.6 | 58.3 | -30.3 |
| Mini-SWE + GPT-5.2 | 78.2 | 48.0 | -30.2 |
| Mini-SWE + Kimi-K2.5 | 85.4 | 53.7 | -31.7 |
| Mini-SWE + Qwen3-Coder-Next | 86.4 | 46.1 | -40.2 |
| OpenHands + Qwen3-Coder-Next | 73.0 | 27.6 | -45.5 |
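The averages quoted in this section can be sanity-checked from the eight rows. The sketch below is plain arithmetic over the reported L0 and L3 scores; the function and variable names are illustrative, and the paper may aggregate differently (per task rather than per configuration, for example), so the output is a rough check and not the paper's own method.

```python
# (L0 A%, L3 A%) per capable configuration, copied from the reported results.
RESULTS = {
    "OpenHands + MiniMax-M2.5": (95.6, 78.6),
    "OpenHands + GPT-5-mini": (65.8, 52.2),
    "Mini-SWE + GPT-5-mini": (51.7, 23.7),
    "Mini-SWE + MiniMax-M2.5": (88.6, 58.3),
    "Mini-SWE + GPT-5.2": (78.2, 48.0),
    "Mini-SWE + Kimi-K2.5": (85.4, 53.7),
    "Mini-SWE + Qwen3-Coder-Next": (86.4, 46.1),
    "OpenHands + Qwen3-Coder-Next": (73.0, 27.6),
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def summarize(results):
    """Return the mean absolute drop, mean relative loss, and count of 30+ point drops."""
    drops = [l0 - l3 for l0, l3 in results.values()]
    relative = [(l0 - l3) / l0 for l0, l3 in results.values()]
    return {
        "mean_drop_pp": mean(drops),
        "mean_relative_loss": mean(relative),
        "configs_losing_30_plus": sum(d >= 30 for d in drops),
    }


if __name__ == "__main__":
    summary = summarize(RESULTS)
    print(f"mean drop: {summary['mean_drop_pp']:.1f} pp")
    print(f"mean relative loss: {summary['mean_relative_loss']:.1%}")
    print(f"configs losing 30+ pp: {summary['configs_losing_30_plus']}")
```

Averaged per configuration, the drop comes out just under 30 points and the relative loss just under 40%, consistent with the rounded figures in the text.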

The gap between assertion pass rate and pass@1 is the result that does not fit on a chart. OpenHands + MiniMax-M2.5 reached an A% of 78.6 on the hardest L3 tasks but only 8.3% pass@1: more than nine runs in ten failed at least one assertion in the 291-assertion suite. The paper makes a methodological point that matters for anyone evaluating coding agents: pass@1 is noisy because one failed assertion zeroes a whole run, so A% captures partial progress more stably. But the gap between the two tells you that even when an agent looks competent on average, the probability of a clean run under production constraints is in the single digits.
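To see how the two metrics can diverge this far, here is a minimal sketch of both computed over the same runs. It assumes A% is pooled over all assertions and that pass@1 counts a run only when every assertion passes. The run data is synthetic, chosen so the aggregates land near the reported 78.6% and 8.3%; it is not the paper's data.

```python
def assertion_pass_rate(runs):
    """Fraction of all assertions passed, pooled over every run."""
    total = sum(len(run) for run in runs)
    passed = sum(sum(run) for run in runs)
    return passed / total


def pass_at_1(runs):
    """Fraction of runs in which every single assertion passed."""
    return sum(all(run) for run in runs) / len(runs)


def make_run(n_assertions, n_failed):
    """Build one synthetic run as a list of booleans (True = assertion passed)."""
    return [False] * n_failed + [True] * (n_assertions - n_failed)


# Synthetic example: 12 runs of a 291-assertion suite.
# One run is clean; the other eleven each fail 68 assertions.
RUNS = [make_run(291, 0)] + [make_run(291, 68) for _ in range(11)]

if __name__ == "__main__":
    print(f"A%:     {assertion_pass_rate(RUNS):.1%}")
    print(f"pass@1: {pass_at_1(RUNS):.1%}")
```

In this toy setup most assertions pass in every run, so the pooled rate stays high, while only one run in twelve is clean from start to finish.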

Database constraints drive most of the damage

Most of the 30-point drop comes from database constraints, not architecture or ORM rules. The paper uses a matched-pair design to isolate each constraint’s marginal effect, comparing task pairs that differ by exactly one constraint while holding everything else fixed.

| Constraint | Marginal effect on A% |
| --- | --- |
| Specify PostgreSQL | -19.3 pp |
| Specify SQLite | -14.3 pp |
| Enforce Clean Architecture | -9.1 pp |
| Enforce SQLAlchemy ORM | -1.5 pp |
| Enforce Sequelize ORM | -0.6 pp |
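A matched-pair estimate of this kind can be sketched in a few lines. The version below assumes each task is identified by a framework plus a set of constraint labels, and averages the score change over every pair that differs by exactly one added constraint. The scores and label names are hypothetical placeholders, and the paper's actual estimator may differ in its details.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical A% scores keyed by (framework, set of constraints).
SCORES = {
    ("flask", frozenset()): 80.0,
    ("flask", frozenset({"postgresql"})): 62.0,
    ("flask", frozenset({"clean_architecture"})): 71.0,
    ("flask", frozenset({"postgresql", "clean_architecture"})): 50.0,
    ("express", frozenset()): 84.0,
    ("express", frozenset({"postgresql"})): 64.0,
}


def marginal_effects(scores):
    """Average A% change over task pairs that differ by exactly one constraint."""
    deltas = defaultdict(list)
    for (fw_a, cons_a), (fw_b, cons_b) in combinations(scores, 2):
        if fw_a != fw_b:
            continue  # hold the framework fixed within a pair
        small, large = sorted((cons_a, cons_b), key=len)
        extra = large - small
        # Keep only pairs where 'large' is 'small' plus one constraint.
        if small <= large and len(extra) == 1:
            (constraint,) = extra
            deltas[constraint].append(scores[(fw_a, large)] - scores[(fw_a, small)])
    return {name: sum(vals) / len(vals) for name, vals in deltas.items()}


if __name__ == "__main__":
    for name, effect in sorted(marginal_effects(SCORES).items()):
        print(f"{name}: {effect:+.1f} pp")
```

Because every comparison holds the framework and all other constraints fixed, the average isolates what a single constraint costs, which is the property the matched-pair design relies on.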

The ORM number is misleading at first glance, and the paper flags why. The marginal effect of an ORM is measured against a baseline that already includes a database, so it captures the cost of forcing an ORM in place of raw SQL, not the cost of data-layer interaction itself. When the team did failure analysis on the 222 failed runs across two models, data-layer defects (incorrect query composition plus ORM runtime errors) made up roughly 45% of all logic errors. The database is the hard part regardless of whether you ask for an ORM on top.

Logic errors dominated failures at around 71% across both analyzed models, with the rest split between server startup failures (12 to 21%), incomplete implementations, and a handful of agents getting stuck in loops. The agents were not failing to start. They were starting servers that ran malformed queries, mis-cased authentication headers, and relied on ORM patterns the model half-remembered from training data. Qwen3-Coder-Next in particular showed 22.6% of its logic errors as authentication misconfiguration, mostly incorrect token-prefix parsing.

Framework choice changes the difficulty by 33 points

The choice of web framework alone swings average assertion pass rate by 33 points. Averaged across all constraint levels and three models:

| Framework | Avg A% |
| --- | --- |
| Express | 51.4% |
| Koa | 50.7% |
| Flask | 49.3% |
| aiohttp | 38.4% |
| Fastify | 31.7% |
| Django | 25.4% |
| FastAPI | 24.2% |
| Hono | 18.5% |
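The 33-point spread and the gaps between individual frameworks follow directly from these averages. A small sketch of the arithmetic, with illustrative helper names:

```python
# Average A% per framework, copied from the reported results.
AVG_A = {
    "Express": 51.4,
    "Koa": 50.7,
    "Flask": 49.3,
    "aiohttp": 38.4,
    "Fastify": 31.7,
    "Django": 25.4,
    "FastAPI": 24.2,
    "Hono": 18.5,
}


def spread(scores):
    """Gap between the best and worst framework, in percentage points."""
    return max(scores.values()) - min(scores.values())


def gap(scores, better, worse):
    """Gap between two named frameworks, in percentage points."""
    return scores[better] - scores[worse]


if __name__ == "__main__":
    print(f"best vs worst: {spread(AVG_A):.1f} pp")
    print(f"Flask vs FastAPI: {gap(AVG_A, 'Flask', 'FastAPI'):.1f} pp")
    print(f"Flask vs Django: {gap(AVG_A, 'Flask', 'Django'):.1f} pp")
```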

Express, Koa, and Flask form a clear top tier. They share a minimal, explicit API surface where the agent has to write out routing, dependency injection, and validation directly. FastAPI penalizes the agent for the things people normally love about it: type-hint-driven validation, dependency-injection conventions, and auto-discovery require the agent to recall framework idioms from pre-training rather than spell them out. Django pays a similar tax for convention over configuration. Hono trails because the test environment runs on Node.js, where Hono needs a compatibility adapter that is underrepresented in training data.

The pattern is the same one Anthropic discusses in their context engineering guidance: agents fail not when the task is hard but when the relevant context is implicit. What Anthropic’s context engineering guides leave out is that “implicit” includes framework conventions baked into developer muscle memory. To a Django developer, INSTALLED_APPS is obvious. To an agent generating code from scratch, it is one of a thousand framework defaults that have to be guessed.

Constraint decay is a context engineering problem

The fix the EURECOM team proposes in their conclusion is explicitly a context engineering one: “retrieval-augmented framework documentation, constraint-oriented planning, or targeted pre-training on convention-heavy codebases.” Two of the three are about getting the right context into the agent’s window at the right moment. Only the third is about changing the model.

This matches what every long-running study of agent failure has found. Agent drift is mostly a context problem. Multi-agent context failures dominate single-agent ones. Hallucinations are correlated with context quality far more strongly than with model size. Constraint decay slots into the same pattern: even GPT-5.2, a frontier model with adequate capability, loses 30 points when it has to recall convention-heavy framework idioms from pre-training instead of having them in context.

The form of the fix matters. Loading every framework doc into the prompt up front blows the window and triggers context rot. Compaction or summarization of framework docs strips the precise syntax the agent needs. The behavior that actually works, and the one the paper points to, is just-in-time retrieval: the agent holds a reference to a framework’s conventions and pulls only the page it needs at the moment it is writing the relevant code. An agent connected to a Wire container does this by holding a container reference and calling wire_search or wire_navigate for the specific convention the current step needs, instead of carrying the whole framework spec as raw context. That keeps the working window in the low thousands of tokens even when the container holds an entire framework reference.
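The shape of just-in-time retrieval can be shown with a toy example. The sketch below is not Wire's API or the paper's implementation: the CONVENTIONS store, the search function, and the keyword-overlap scoring are all hypothetical stand-ins for whatever retrieval backend an agent actually uses. The point is only that the prompt for each step carries one relevant note, not the whole reference.

```python
import re

# Hypothetical convention notes keyed by (framework, topic).
CONVENTIONS = {
    ("django", "app registration"):
        "Add each app's dotted path to INSTALLED_APPS in settings.py.",
    ("fastapi", "dependency injection"):
        "Declare dependencies as path-operation parameters with Depends(...).",
    ("sqlalchemy", "session handling"):
        "Create sessions from a sessionmaker bound to an engine.",
}


def tokenize(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def search(query, store=CONVENTIONS, limit=1):
    """Return up to `limit` notes ranked by naive keyword overlap with the query."""
    query_tokens = tokenize(query)
    scored = []
    for (framework, topic), note in store.items():
        overlap = len(query_tokens & tokenize(f"{framework} {topic} {note}"))
        if overlap:
            scored.append((overlap, framework, topic, note))
    scored.sort(key=lambda item: -item[0])
    return [(fw, topic, note) for _, fw, topic, note in scored[:limit]]


def build_step_context(step, store=CONVENTIONS):
    """Assemble a small prompt fragment for one step from the retrieved notes only."""
    hits = search(step, store)
    notes = "\n".join(f"[{fw}: {topic}] {note}" for fw, topic, note in hits)
    return f"Current step: {step}\nRelevant convention:\n{notes}"


if __name__ == "__main__":
    print(build_step_context("register the new Django app"))
```

A real system would replace the keyword overlap with a proper index or a tool call, but the working context would stay the same size regardless of how large the underlying reference grows.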

The contrast is not Wire versus no Wire. It is framework conventions as a queryable, structured resource versus framework conventions as something the model has to half-remember from pre-training. Constraint decay measures the cost of the second approach.

What this means for production agents

For end-users, the EURECOM team’s framing is blunt: agents are reliable for rapid prototyping but unreliable for production-grade backend development. The 8.3% pass@1 that the strongest configuration managed at L3 is the number to take seriously. Even when an agent’s average score looks healthy, the probability of a clean run that satisfies every constraint is in the single digits.

Three things follow.

First, prototype with unconstrained generation and harden manually. L0 is where current agents shine. Treat the agent’s first pass as a scaffold, not a deployable artifact, and add structural compliance through review or a separate pass. The team’s own feature-implementation tasks, where agents had to extend an existing constrained codebase, showed similarly low scores: only GPT-5.2 cleared 50% pass@1, and even then on a single run. The greenfield result is not an artifact of asking agents to start from scratch.

Second, match the framework to the agent, not the team. A team might prefer FastAPI for its ergonomics, but an agent shipping FastAPI code is operating in a framework that averages roughly 25 to 27 points lower than Flask, Express, or Koa across constraint levels. For agent-led work, those three give you a 25-plus-point head start. This is uncomfortable advice if the team has standardized on a convention-heavy framework, but the numbers do not budge.

Third, treat framework conventions as retrievable context, not pre-training. Whether through MCP, a vector store, or a structured container of framework docs, the agent should be able to look up the exact convention it needs rather than guess. The other context engineering techniques that work in production all share this shape: putting the right context one tool call away rather than hoping it surfaces from the model’s weights.

The deeper point is that the same dynamic that produces hallucinations and agent drift is producing constraint decay. The model has the capability. The context is missing the structure. Until coding agents have a reliable way to query framework conventions on demand, every additional constraint will keep paying the 30-point tax this paper just put a number on.


Sources: Constraint Decay: The Fragility of LLM Agents in Backend Code Generation (arXiv 2605.06445) · HN discussion · Open-source evaluation pipeline · OpenHands · Mini-SWE-Agent

Frequently asked questions

Why does AI coding agent accuracy drop when structural constraints are added?
Convention-heavy structural requirements force the agent to recall framework idioms from pre-training instead of having them in active context. EURECOM's 2026 study measured a 30-percentage-point average drop from baseline to fully specified tasks across capable agents, with database constraints causing roughly 19 points and architecture rules another 9.
Which web frameworks do AI coding agents handle best?
Lightweight, explicit frameworks beat convention-heavy ones by roughly 24 to 33 points. Across the EURECOM benchmark, Express (51.4% avg assertion pass), Koa (50.7%), and Flask (49.3%) led, while Django (25.4%), FastAPI (24.2%), and Hono (18.5%) trailed. Agents struggle when frameworks rely on implicit defaults like auto-discovery, dependency injection, or type-driven validation.
Why are database and ORM constraints the most damaging?
Data-layer defects account for roughly 45% of agent logic failures, split between incorrect query composition and ORM runtime errors. The ORM constraint itself shows a small marginal effect because the cost appears under any database constraint, whether the ORM is specified or not. Hand-rolled SQL fails the same way ORM code does.
Does using a frontier model fix constraint decay?
No. The EURECOM study tested GPT-5.2, GPT-5-mini, MiniMax-M2.5, Kimi-K2.5, and Qwen3-Coder-Next, and every capable configuration lost between 13.6 and 45.5 points from baseline to full constraints. Frontier models score higher in absolute terms but show the same decay trend. The lever is reducing how much the agent has to recall and increasing how much it can retrieve at the moment of need.
Are AI coding agents safe to use for production backend code?
Not without scaffolding. The strongest L3 configuration (full architecture + database + ORM constraints) reached a 78.6% assertion pass rate but only 8.3% pass@1, meaning fewer than one run in ten satisfied the full test suite. Agents are reliable for prototyping; production backends need either a review pass, retrieval-augmented framework knowledge, or a framework that costs the agent fewer assumptions.
