Context engineering: the end of prompt engineering
Key takeaway
In benchmarks of TOON vs JSON, TOON's compact encoding looks like a clear win, but a 9,649-experiment study published in February 2026 found TOON cost large language models 38% more tokens on real tasks and added a grep tax that grew to 7.4x at the largest schema sizes. The reason is not the format's design but the tokenizer's merge tables and the model's fluency, both of which are downstream of training data. For most production AI agents on frontier models, JSON remains the safer default until new formats earn their place in training distributions.
TOON (Token-Oriented Object Notation) shipped in November 2025 with a confident pitch: a drop-in replacement for JSON, 30-60% fewer tokens, same lossless data model. The format trended on Hacker News, racked up GitHub stars, and got picked up by every “save money on LLM API bills” blog post on the internet. Then, in February 2026, a researcher named Damon McMillan ran 9,649 controlled experiments across 11 models, four formats, and schemas ranging from 10 to 10,000 tables. TOON used 38% more tokens than the alternatives across the schemas tested, and format choice had no statistically significant effect on aggregate accuracy (chi-squared 2.45, p=0.484).
Both things are true. TOON’s encoding really is more compact than JSON in characters. But character count is not token count, and token count for one prompt is not token cost across a real agent task. The gap between those three quantities is where most format-comparison content goes wrong, and where context engineers lose money in production.
The deeper lesson is the more useful one. A format’s compactness is downstream of the spec. Its real cost in a language model is downstream of the model’s training distribution. If you do not know what the tokenizer’s merge tables look like, and you do not know what the model has actually seen at scale, you are not picking a format. You are rolling dice.
TOON’s reference benchmarks measure encoding cost, not task cost. The toonformat.dev site compares serialized sizes for fixed datasets under OpenAI’s tiktoken (cl100k variant) and reports 30-60% token reductions, with one frequently-cited result hitting 76.4% accuracy on GPT-5 Nano while using 39.9% fewer tokens than JSON. Those numbers are real for what they measure: how many tokens a specific tokenizer needs to encode a given dataset.
That is not what production agents pay for. In the file-native agentic setting McMillan’s paper tests, models navigate multi-file schemas, retrieve relevant rows, and produce structured answers. Total token spend includes the encoded data, the model’s intermediate reasoning, retries when it gets confused, and additional retrieval steps when it cannot find what it needs in one pass. Across 9,649 of those runs, TOON came out 38% above the alternatives in total spend, and a “grep tax” (extra tokens spent navigating the format) grew to roughly 7.4x at the largest schema size tested.
Encoding cost and task cost diverge because the model is part of the cost function. A format that compresses well on disk can still be expensive in a model that has not been trained to read it fluently. That is the gap the McMillan study measures and the TOON benchmarks do not.
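A minimal way to keep the two quantities separate in your own measurements, as a sketch assuming you log the usage object your chat API returns for each agent step (the field names below follow OpenAI’s chat-completions convention and differ by provider):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def encoding_cost(serialized: str) -> int:
    # Tokens needed just to place the serialized data in a prompt.
    return len(enc.encode(serialized))

def task_cost(step_usages: list[dict]) -> int:
    # Tokens the agent actually billed across the whole task: every prompt and
    # completion, including retries and follow-up retrieval steps.
    return sum(u["prompt_tokens"] + u["completion_tokens"] for u in step_usages)
```

The toonformat.dev benchmarks report something like `encoding_cost`; the McMillan study measures something like `task_cost`. The two diverge exactly when the model spends extra reasoning or retries on the format.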
Tokenizers are not character-aware. They map common subsequences in their training corpus to single tokens through byte-pair encoding (BPE). JSON’s ergonomics in modern LLMs are a side effect of its dominance in that corpus. Sequences like `":`, `},`, `": "`, `": [`, and dozens more occur so often in scraped web data, GitHub repos, and API responses that BPE training merges each of them into a single token. A field declaration like `"id": 12345,` can encode in five or six tokens because the punctuation rides along in those merges.
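You can inspect the merges directly, for example with tiktoken’s cl100k_base table (the split is tokenizer-specific, so run this against whichever encoding your model actually uses):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

snippet = '"id": 12345,'
token_ids = enc.encode(snippet)
pieces = [enc.decode([tid]) for tid in token_ids]

# Shows how many tokens the field costs and exactly which substrings the
# tokenizer merged; with JSON-heavy training data the punctuation tends to
# collapse into very few pieces.
print(len(token_ids), pieces)
```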
TOON’s syntax does not have this advantage yet. Its row delimiters, header conventions, and indentation patterns are too new to appear in the merge tables of any production model. Characters that ought to be cheap, like a tab or a separator, become their own tokens because no merge ever absorbed them. The result is a format that looks denser on disk and reads sparser in the model’s vocabulary.
This is the same effect that makes Korean or Thai text cost more tokens than English at lower character counts. The tokenizer was trained on a distribution that under-represented those scripts. Any structured format invented after the tokenizer was last trained inherits the same problem until enough text in that format reaches the next training cycle.
For TOON specifically, the picture is mixed. On uniform tabular data with a tokenizer that happens to merge its row separators reasonably well, encoding stays tight, and the published benchmarks show real savings. On non-uniform data, deeply nested structures, or schemas that drift from TOON’s CSV sweet spot, the savings collapse and often invert. This is not unique to TOON. Any new structured format faces the same penalty. The underlying problem is measuring “token efficiency” without naming a tokenizer, a model, and a workload at the same time.
McMillan’s most interesting result is not the encoding overhead. It is the scaling of what he calls the grep tax. At small schemas, all four formats look similar in total task cost. As schemas grow toward 10,000 tables, TOON’s relative cost climbs to roughly 7.4x the next-best format. Encoding alone does not explain that. Encoding cost grows linearly with size. A 7.4x penalty is the model spending tokens on reasoning, retries, and re-retrieval.
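One way to watch for this in your own evals, as a rough proxy rather than McMillan’s exact metric, is to track total task spend relative to the tokens the encoded context itself required:

```python
def grep_tax(total_task_tokens: int, encoded_context_tokens: int) -> float:
    # Rough proxy, an assumption rather than the paper's exact definition:
    # tokens the task burned per token of encoded context. A format the model
    # reads fluently keeps this ratio roughly flat as schemas grow; an
    # unfamiliar format inflates it, because the extra spend is reasoning,
    # retries, and re-retrieval rather than data.
    return total_task_tokens / encoded_context_tokens
```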
Models can grep JSON in their heads. Years of training on JSON-shaped tool specs, API responses, and structured outputs mean they have an internalized parser. They know where keys live, how nested objects unfurl, and what a missing comma means. Asked to find `users[42].email` in a JSON blob, a frontier model navigates to it without much intermediate work.
TOON does not have that fluency yet. The model has to read row headers, count indentation levels, and reconcile the shape of the data on the fly. In short tasks the cost is invisible. At schema sizes where the model has to scan, filter, and join across thousands of rows, it pays for that lack of fluency in additional reasoning tokens, and sometimes gives up and asks for the data in a different shape.
This is a familiar pattern. Embeddings trained on web English work badly on legal text until you fine-tune them. Code models trained on Python work badly on Rust until enough Rust appears in training data. Format fluency is the same kind of distributional property. The cost shows up at scale, not in small examples.
The McMillan paper found a related signal in how model tiers respond to retrieval. Frontier-tier models (Claude, GPT, Gemini) gained 2.7% accuracy from file-based context retrieval, while open-source models lost 7.7% in aggregate. The fluency gap between model tiers shows up there too. Smaller open-source models have narrower training distributions and are more sensitive to format and structure choices, which is exactly where you would expect format-of-context to matter most.
A real comparison of context formats needs three coordinates, not one. Reducing it to “fewer tokens per record” is the equivalent of comparing programming languages by line count.
| Coordinate | Question | What it determines |
|---|---|---|
| Format compactness | How many characters per record? | Storage and network cost, not LLM cost |
| Tokenizer fit | How does the model’s BPE merge table handle this format? | Actual token count for fixed input |
| Model fluency | How much of this format was in training data? | Task-total tokens, including reasoning and retries |
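The first two coordinates are cheap to measure before you commit; the third needs an end-to-end eval on your actual workload. A sketch of the cheap part, where `toon_encoder` is whichever TOON serializer you already use (passed in rather than assumed, since the library choice is yours):

```python
import json
from typing import Callable

import tiktoken

def compare_formats(
    records: list[dict],
    toon_encoder: Callable[[list[dict]], str],  # supply your TOON library's encode function
    encodings: tuple[str, ...] = ("cl100k_base", "o200k_base"),
) -> None:
    # Coordinate 1: characters per serialization. Coordinate 2: tokens under each
    # tokenizer you care about. Coordinate 3, model fluency, cannot be read off
    # the text at all; it only shows up in a task-level eval.
    candidates = {"json": json.dumps(records), "toon": toon_encoder(records)}
    for fmt, text in candidates.items():
        for name in encodings:
            enc = tiktoken.get_encoding(name)
            print(f"{fmt:5s} {name:12s} chars={len(text):7d} tokens={len(enc.encode(text)):7d}")
```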
TOON optimizes the first coordinate. It is plausibly optimal there. It cannot optimize the other two yet, because it is too new to be in any major tokenizer’s merge tables or any major model’s training distribution. The same will be true of every new structured format until it earns its place in the corpus. ONTO, the columnar notation introduced in arXiv 2604.17512, will face the same problem. So will whatever replaces TOON next year. Format adoption in LLMs runs at the speed of training cycles, not the speed of GitHub stars.
This connects directly to the broader argument we made in structured context vs raw text: structure helps, but only when the model can read the structure you chose. The tokenizer and the training distribution are both part of context engineering, even though most engineers treat them as fixed inputs.
The honest version of the format question is: under what conditions does choosing format X over JSON pay off, given a specific model and workload? Three rules of thumb come out of the McMillan data.
Frontier models have the deepest fluency in JSON, and JSON has no grep tax at scale. The McMillan results found no aggregate accuracy difference between formats on frontier models, which means JSON’s familiarity gives you a free option: predictable token cost, predictable parsing behavior, no surprises in eval. This is the safe default for production agents.
TOON’s design wins where its assumptions hold. Uniform arrays of objects with the same fields, small enough that the model does not have to navigate, served to a tokenizer that happens to merge its separators efficiently. CSV-style log lines, tabular catalog data, and other shapes where rows really are homogeneous can show real token savings. Test on the actual model and tokenizer before committing.
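A quick way to check whether your data actually sits in that sweet spot is to run a uniform and a ragged variant of your records through the comparison sketch above and see whether the savings hold. The record shapes here are illustrative stand-ins, not the benchmark datasets:

```python
# Uniform rows (TOON's sweet spot) versus nested, ragged records of similar size.
uniform = [{"id": i, "name": f"user{i}", "active": True} for i in range(1000)]
nested = [
    {"id": i, "profile": {"name": f"user{i}", "tags": ["a", "b"][: i % 3]}}
    for i in range(1000)
]
# compare_formats(uniform, toon_encoder=...) and compare_formats(nested, toon_encoder=...)
# will often tell different stories; trust the numbers from your own tokenizer.
```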
The same study found open-source models penalized 7.7% on file-based retrieval. If you are running smaller models, format choice has a bigger effect, and the safe choice is the format with the most training-data presence. That is JSON, with Markdown a close second for narrative context. The same logic applies to context-budgeting decisions covered in context budgets: small models punish wasted tokens harder than large ones.
This rubric is also why Wire’s context containers return tool-call results as JSON by default. The empirical data favors it for the frontier-tier models that run most production agents, and the grep tax does not bite as containers grow into the thousands of entries. One default, no per-model branching, no surprise token bills when a customer adds a model the team did not benchmark.
None of this is a takedown. TOON is a thoughtful spec with a real benchmark suite and an honest pitch about its sweet spot. For uniform tabular arrays small enough to fit in context without navigation, it does what it advertises. The tokenizer effect will improve as more TOON appears on the open web; at some point in the next training cycle or two, GPT and Claude will start absorbing TOON sequences into their merge tables, and the encoding gap will narrow. The fluency gap will close more slowly, because it depends not just on updated merge tables but on enough examples for the model to internalize the format’s grammar at scale.
The mistake is not using TOON. The mistake is treating “format with fewer characters” and “format that costs the model less” as the same question. They are not, and pretending they are will land your token bill somewhere your benchmark did not predict.
Format choice is one of the smaller decisions in context engineering, but the frame applies to all the larger ones. Chunk size, retrieval shape, tool-spec verbosity, prompt structure, even the order of fields in a JSON object: every choice that touches a language model is downstream of what that model was trained on. Knowing the model, including its tokenizer, is part of the work. Treating models as interchangeable boxes of the same size class is how good benchmarks become bad production systems. If you want a longer companion read on how token costs and context choices interact, why your AI costs are a context problem covers the budgeting side of the same trade-off.
Sources: Structured Context Engineering for File-Native Agentic Systems (arXiv 2602.05447) · Token-Oriented Object Notation vs JSON benchmark (arXiv 2603.03306) · ONTO columnar notation (arXiv 2604.17512) · TOON spec and reference implementation · TOON benchmarks at toonformat.dev · InfoQ: TOON aims to cut LLM costs · Hacker News discussion of TOON
Wire transforms your documents into structured, AI-optimized context containers. Upload files, get MCP tools instantly.
Create Your First Container