File-native agents: when reading files backfires

Jitpal Kocher · 10 min read

Key takeaway

File-native agents that read and grep files for context improve accuracy by 2.7% on frontier models like Claude and GPT but lose 7.7% on open-source models, according to a 9,649-experiment study published in February 2026. The reason is that file-native retrieval makes the model do its own navigation, and weaker models spend their budget getting lost instead of answering. The practical rule is to let capable models retrieve their own context and to inject pre-selected context directly for smaller ones.

Give an AI agent a filesystem and tell it to find what it needs, and it will grep, read, and follow references the way a developer would. This file-native pattern is everywhere now: AGENTS.md and CLAUDE.md files, repository context the agent reads on demand, schema files it navigates instead of receiving up front. The intuition is that more access produces better answers. A 9,649-experiment study published in February 2026 found that intuition is right for frontier models and wrong for everything else.

The study, run by Damon McMillan across 11 models and schemas ranging from 10 to 10,000 tables, compared two ways of getting structured context into an agent: schema injection, where the relevant data is placed directly in the prompt, and file-native access, where the agent retrieves it itself through grep and read operations. The headline most coverage took from the paper was about format: YAML, JSON, and Markdown perform about the same. The more useful result was about architecture, and it splits hard along model capability.

File-native retrieval helps frontier models and hurts open-source ones

File-native retrieval improved frontier-model accuracy by 2.7% (p=0.029) and reduced open-source accuracy by 7.7% on aggregate (p<0.001). Those are the two numbers that matter from the McMillan study, and they point in opposite directions. The same architectural choice that makes a capable model better makes a weaker one worse, which means there is no single right answer to “should my agent read files or be handed context?”

The two architectures differ in who does the retrieval work. Schema injection front-loads it: you, or a retrieval layer, decide what the model sees and place it in the prompt. File-native access defers it to the model: you give it tools and a filesystem, and it navigates to what it needs. For a frontier model, deferral is a feature, because the model can plan a search, read selectively, and avoid drowning in irrelevant context. For a smaller model, deferral is a trap, because every grep it issues and every file it opens is a decision it can get wrong.
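The difference in who does the retrieval work can be sketched in a few lines. This is a minimal illustration, not the study's harness: the function names (`inject_schema`, `grep_schema`, `read_schema_file`) and the one-`.sql`-file-per-table layout are assumptions made for the example.

```python
from pathlib import Path
import re


def inject_schema(question: str, selected_tables: dict) -> str:
    """Schema injection: the caller has already chosen which tables the model sees."""
    schema_text = "\n\n".join(
        f"-- {name}\n{ddl}" for name, ddl in selected_tables.items()
    )
    return f"Schema:\n{schema_text}\n\nQuestion: {question}"


def grep_schema(root: Path, pattern: str) -> list:
    """File-native tool: return (relative path, line number, line) for every match.

    The model chooses the pattern, so a poor pattern returns poor context.
    """
    regex = re.compile(pattern)
    hits = []
    for path in sorted(root.rglob("*.sql")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if regex.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits


def read_schema_file(root: Path, relative_path: str) -> str:
    """File-native tool: return the full text of a file the model decided to open."""
    return (root / relative_path).read_text()
```

With injection, the only decision left to the model is how to answer. With the two tools, the model also decides what to search for, which file to open, and when to stop.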

What “file-native” means in practice

File-native context is any setup where the agent retrieves its own context from files rather than receiving it pre-selected in the prompt. In the coding-agent world this is the AGENTS.md pattern: a project drops instructions and pointers into a file, and the agent reads, greps the repository, and follows references to assemble what it needs. In the data-agent world it is exactly what McMillan tested: a database schema sits on disk as files, and the agent runs read and grep operations to find the tables relevant to a query.

The study used database schemas rather than coding repositories, so the AGENTS.md connection is an architectural analogy, not a direct measurement. But the bet is identical in both cases. You are wagering that the model can navigate a body of context more effectively than a retrieval system can pre-select it. That wager is a function of the model, which is why the same design choice produces a 2.7% gain in one tier and a 7.7% loss in another. This is the same problem we described in why AI coding assistants can’t see your codebase: access to files is not the same as the ability to use them well.

Why the split happens: navigation is a capability tax

The reason file-native retrieval backfires on weaker models is that navigation consumes the same capability budget the model needs for the actual task. Reading the right file requires knowing which file is right, issuing a useful grep requires predicting what the answer looks like, and stopping at the right moment requires recognizing when enough has been gathered. Each of those is a small reasoning step, and a smaller model spends a larger share of its limited budget on them.

A frontier model treats navigation as cheap. It has internalized enough structure from training that finding users.email across a partitioned schema, or locating the relevant config in a repository, costs it almost nothing in reasoning. A weaker model treats the same navigation as expensive, and the cost shows up as wrong turns, redundant reads, and answers assembled from the wrong context. This is the same fluency effect that drives the grep tax we documented for unfamiliar formats, applied one level up: not fluency in a syntax, but fluency in the act of retrieval itself.

This is also why “give the agent more tools and let it figure it out” is not a universal upgrade. As covered in tool calling is retrieval, every tool call is a retrieval decision the model has to get right, and decisions are not free for models that are short on capability to begin with.

The model gap dwarfs the format and architecture choices

The single largest effect in the study was not format or architecture; it was the model. McMillan measured a 21 percentage point accuracy gap between the frontier tier and the open-source tier, larger than any format or retrieval-architecture effect in the data. Format choice, by contrast, showed no significant aggregate difference (chi-squared 2.45, p=0.484), with YAML at 75.4%, Markdown at 74.9%, and JSON and TOON at 72.3%.
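The format p-value can be sanity-checked from the statistic alone. The sketch below assumes the test compared the four formats listed (YAML, Markdown, JSON, TOON), which would give three degrees of freedom; the degrees of freedom are not stated here, so treat that as an assumption. For three degrees of freedom the upper-tail probability has a closed form, so only the standard library is needed.

```python
import math


def chi2_upper_tail_df3(x: float) -> float:
    """P(X >= x) for a chi-squared variable with 3 degrees of freedom.

    Closed form for df = 3: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Assumption: df = 3 (four formats compared).
p_value = chi2_upper_tail_df3(2.45)
print(f"p = {p_value:.3f}")
```

Under that assumption the computed value lands at about 0.48, consistent with the reported p=0.484 and far from any conventional significance threshold.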

The practical reading is a hierarchy of decisions. Picking the model tier is the first-order choice and determines most of the outcome. Picking the retrieval architecture, file-native versus injection, is the second-order choice and only matters once the tier is fixed. Picking the structured context format, YAML versus JSON versus Markdown, is the third-order choice and, on frontier models, barely matters at all. Teams routinely invert this order, spending days debating file format while treating the model as interchangeable. The data says the model is where the accuracy lives.

| Decision | Effect size in the study | When it matters |
| --- | --- | --- |
| Model tier | 21 percentage points | Always. Dominates everything else |
| Retrieval architecture | +2.7% frontier, -7.7% open-source | Once the tier is chosen. Sign flips by tier |
| Context format | No significant aggregate effect (p=0.484) | Marginal on frontier models; larger on weak ones |

When file-native design is the right call

For capable models, file-native retrieval has one decisive advantage: it scales past the context window. The study found file-native agents navigating up to 10,000 tables while keeping navigation accuracy high, a scale that simple schema injection cannot reach because the full schema would not fit in the prompt. The condition was that the schema be partitioned by domain, so the model could narrow its search to a relevant slice rather than scanning everything.

This is the legitimate case for file-native design, and the argument for it is stronger than the failure numbers suggest once a few conditions hold. The baseline reason is capacity: when the body of context is larger than the context window, you have no choice but to let the model retrieve, and on a frontier model that retrieval stays accurate if the structure supports narrowing. But capacity alone is not what makes file-native the right call. Two properties of the context decide that.

The first is that the context is stable or agent-managed rather than churning underneath the model. A codebase is the canonical example: the files change constantly, but the agent is the one changing them, so it holds a working model of where things live and what to read next. Static reference material, slowly-versioned schemas, and documentation the agent maintains itself all share this property. When context shifts faster than the agent can track, file-native retrieval degrades, because the map the model built no longer matches what is on disk.

The second is that what the agent needs to find is predictable from its input. File-native retrieval is the model guessing where to look, and that guess is cheap only when the query points at the answer. “Fix the failing test in the auth module” tells a coding agent exactly which files to open. A vague task against an unfamiliar schema does not, and the model spends its budget exploring instead of answering. Add the structural condition the study isolated, partition by domain so the search space narrows instead of forcing a linear scan, and file-native access stops being a gamble on raw model capability. The model is navigating a space it understands toward a target it can predict, and that is a context engineering decision, not a model decision.
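The narrowing effect of domain partitioning can be sketched with a toy router. Everything here is hypothetical: the keyword-overlap routing, the `partitions` layout (domain, then table name, then DDL text), and the function names are illustrative assumptions, not the mechanism used in the study.

```python
from typing import Optional


def route_to_domain(question: str, domain_keywords: dict) -> Optional[str]:
    """Pick the domain whose keyword set overlaps most with the question's words."""
    words = set(question.lower().split())
    best, best_overlap = None, 0
    for domain, keywords in domain_keywords.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best, best_overlap = domain, overlap
    return best  # None when no domain matches at all


def search_tables(partitions: dict, term: str, domain: Optional[str] = None) -> tuple:
    """Return (matching tables, number of tables scanned).

    With a known domain only that slice is scanned; otherwise every table is.
    """
    scope = {domain: partitions[domain]} if domain in partitions else partitions
    matches, scanned = [], 0
    for dom, tables in scope.items():
        for table, ddl in tables.items():
            scanned += 1
            if term.lower() in ddl.lower():
                matches.append(f"{dom}.{table}")
    return matches, scanned
```

When the question points at a domain, the search touches only that partition; when it does not, the fallback is the full scan that the partitioning was meant to avoid.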

What to do: match retrieval architecture to model tier

The actionable rule from the study is to stop treating file-native access as a default and start treating it as a model-dependent choice. Capable models should retrieve their own context; weaker models should be handed pre-selected context. Three guidelines follow from the data.

Let frontier models retrieve, with partitioned structure

If you are running Claude, GPT, or Gemini against a large body of context, give the model retrieval tools and partition the underlying structure by domain. This is where file-native access pays its 2.7% and scales to thousands of tables. The work is in the partitioning, not in the prompt.

Inject pre-selected context for smaller and open-source models

If you are running a smaller or open-source model, do the retrieval for it. Run the search in a retrieval layer, select the relevant slice, and place that slice directly in the prompt. The study’s 7.7% penalty is the cost of making these models navigate, and you avoid it by not asking them to. This connects to the budgeting argument in context budgets: weaker models punish wasted navigation tokens harder than strong ones do.
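A retrieval layer of this kind can be very small. The sketch below is one possible shape, assuming a plain keyword-overlap score and a fixed cap on the number of tables; a real system would more likely use embeddings or a proper search index, and the names `select_tables` and `build_prompt` are invented for the example.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase word-like tokens, keeping underscores so column names stay whole."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def select_tables(question: str, tables: dict, max_tables: int = 3) -> list:
    """Rank tables by token overlap with the question and keep the top few."""
    question_tokens = tokenize(question)
    scored = []
    for name, ddl in tables.items():
        score = len(question_tokens & tokenize(f"{name} {ddl}"))
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:max_tables]]


def build_prompt(question: str, tables: dict, max_tables: int = 3) -> str:
    """Place only the selected slice of the schema in the prompt."""
    chosen = select_tables(question, tables, max_tables)
    schema = "\n".join(tables[name] for name in chosen)
    return f"Relevant schema:\n{schema}\n\nQuestion: {question}"
```

The model never sees the tables that scored zero, so it has no opportunity to open the wrong one.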

Move navigation out of the model when you can

A third option sits between the two: keep the scaling benefit of external context but take the navigation burden off the model entirely by exposing structured query tools instead of raw files. An agent connected to a Wire context container retrieves through navigate and search tools rather than grepping files. That moves the navigation work, which the study found costs open-source models 7.7%, out of the model and into the tool. The model asks for what it wants; the container decides which entries to return. That keeps file-native scale without betting on the model’s ability to grep.
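To make the idea concrete, here is a toy container that exposes `navigate` and `search` methods over in-memory entries. It is a hypothetical stand-in written for illustration only: the class, its fields, and its scoring are assumptions and do not describe Wire's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    path: str     # hypothetical hierarchical key, e.g. "billing/invoices"
    summary: str
    body: str


class ContextContainer:
    """Toy container: the tool, not the model, decides which entries come back."""

    def __init__(self, entries: list):
        self._entries = {entry.path: entry for entry in entries}

    def navigate(self, prefix: str = "") -> list:
        """List entry paths under a prefix without returning any bodies."""
        return sorted(path for path in self._entries if path.startswith(prefix))

    def search(self, query: str, limit: int = 3) -> list:
        """Return up to `limit` entries ranked by how often the query terms occur."""
        terms = query.lower().split()
        scored = []
        for entry in self._entries.values():
            text = f"{entry.path} {entry.summary} {entry.body}".lower()
            score = sum(text.count(term) for term in terms)
            if score > 0:
                scored.append((score, entry.path))
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [self._entries[path] for _, path in scored[:limit]]
```

The model's job shrinks to phrasing a query. Ranking, truncation, and the decision about what to return all happen inside the tool, where they do not depend on the model's navigation skill.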

Closing thought

The file-native pattern feels like a strict upgrade because access feels like capability. The McMillan data is a clean reminder that access is only useful to a model that can act on it. A frontier model handed a filesystem behaves like a senior engineer dropped into a new codebase: it finds what it needs. A weaker model handed the same filesystem behaves like an intern with no map: it reads the wrong files and reports the wrong answer with confidence. Before you decide how your agent gets its context, decide which of those two you are building on, because that choice determines whether reading files helps or hurts.


Sources: Structured Context Engineering for File-Native Agentic Systems (arXiv 2602.05447) · Simon Willison on file-native agentic systems · TOON vs JSON: why smaller doesn’t mean cheaper

Frequently asked questions

Should I use file-native retrieval or schema injection for AI agents?
Use file-native retrieval, where the agent greps and reads files itself, for frontier models like Claude, GPT, and Gemini, which gained 2.7% accuracy from it in the McMillan study. Use schema injection, where you place the relevant context directly in the prompt, for smaller or open-source models, which lost 7.7% on aggregate when forced to navigate files themselves.
Why do open-source models perform worse with file-based context retrieval?
File-native retrieval requires the model to plan searches, issue greps, read results, and decide what to read next, which is a capability tax. Smaller models spend their token budget navigating instead of answering, and the McMillan study measured a 7.7% aggregate accuracy drop for open-source models given that burden.
Does AGENTS.md improve coding agent accuracy?
It depends on the model. AGENTS.md is the file-native pattern applied to coding agents, asking the model to read and navigate repository context on its own. The study tested database schemas rather than AGENTS.md itself, but the same logic applies by analogy: capable frontier models benefit, while weaker models do better when you hand them a curated slice of context rather than a large file to navigate.
How many tables can a file-native agent navigate?
The study showed file-native agents scaling to 10,000 tables while maintaining navigation accuracy, but only when the schema was partitioned by domain so the model could narrow its search. Without partitioning, navigation accuracy degrades as the search space grows.
