# understanding toon format benchmarks
deep dive into toon performance metrics across multiple llm models
comprehensive analysis of toon format performance across 209 data retrieval questions and 4 major llm models. this deep dive explains what the numbers mean and why toon improves both efficiency and accuracy.
## benchmark methodology

### test design
209 data retrieval questions across 5 categories:

- field retrieval (simple lookups)
- structure awareness (understanding data shape)
- structural validation (detecting inconsistencies)
- aggregation (sum, average, count operations)
- filtering (conditional queries)

4 llm models tested:

- claude haiku (anthropic)
- gemini 2.5 flash (google)
- gpt-5 nano (openai)
- grok 4 (xai)

3 format variations:

- json (formatted with indentation)
- toon (tabular format)
- yaml (for comparison)
each model answered all 209 questions in all 3 formats. accuracy measured by exact match to expected output.
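a minimal sketch of what such a harness could look like, assuming a hypothetical `ask_model` helper for the llm calls and a `render` function that serializes each payload (neither is from the published benchmark code):

```python
# hypothetical harness: every (model, format, question) cell is scored by
# exact match against the expected answer, as described above.
MODELS = ["claude-haiku", "gemini-2.5-flash", "gpt-5-nano", "grok-4"]
FORMATS = ["json", "toon", "yaml"]

def exact_match(answer: str, expected: str) -> bool:
    # normalize whitespace and case so trivial formatting differences don't count as misses
    return answer.strip().lower() == expected.strip().lower()

def run_benchmark(questions, render, ask_model):
    # questions: dicts with "data", "prompt", and "expected" keys (assumed shape)
    scores = {(m, f): 0 for m in MODELS for f in FORMATS}
    for q in questions:
        for fmt in FORMATS:
            prompt = f"{render(q['data'], fmt)}\n\n{q['prompt']}"
            for model in MODELS:
                if exact_match(ask_model(model, prompt), q["expected"]):
                    scores[(model, fmt)] += 1
    # fraction of the 209 questions answered correctly per (model, format) pair
    return {key: count / len(questions) for key, count in scores.items()}
```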
## overall results

### accuracy comparison
| format | accuracy | tokens used | efficiency (accuracy per 1k tokens) |
|---|---|---|---|
| toon | **73.9%** | 39.6% fewer | **26.9** |
| json | 69.7% | baseline | 15.3 |
| yaml | 65.2% | 12.3% more | 12.8 |
toon achieves the highest accuracy while using significantly fewer tokens. this is remarkable: optimization typically trades accuracy for efficiency, but toon improves both.
### why toon is more accurate
explicit length markers help llms validate responses:
```
users[3]{id,name}:
  1,alice
  2,bob
  3,charlie
```

the `[3]` tells the llm "expect exactly 3 records." when data is malformed or truncated, llms notice the mismatch.
field declarations reduce ambiguity:
```
products[1]{sku,name,price}:
  A101,widget,29.99
```

declaring fields upfront helps llms understand the data structure before parsing values.
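for intuition, here is a minimal sketch of a tabular encoder that produces this layout for uniform records. it is a simplified illustration, not the reference toon implementation, which also handles quoting, nesting, and non-uniform fallbacks:

```python
def encode_tabular(name: str, records: list[dict]) -> str:
    # assumes uniform records: every dict has the same keys in the same order
    fields = list(records[0].keys())
    header = f"{name}[{len(records)}]{{{','.join(fields)}}}:"
    rows = ["  " + ",".join(str(r[f]) for f in fields) for r in records]
    return "\n".join([header, *rows])

print(encode_tabular("products", [{"sku": "A101", "name": "widget", "price": 29.99}]))
# products[1]{sku,name,price}:
#   A101,widget,29.99
```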
## category breakdown

### field retrieval: 99.6% accuracy
task: "what is user id 2's name?"
| format | accuracy | notes |
|---|---|---|
| toon | 99.6% | near-perfect |
| json | 99.3% | also excellent |
| yaml | 98.8% | slight edge to structured formats |
all formats handle simple lookups well. toon's marginal advantage comes from explicit field ordering: llms can navigate tabular data linearly.
### structure awareness: 88.0% accuracy
task: "how many fields does each user record have?"
| format | accuracy |
|---|---|
| toon | **88.0%** |
| json | 83.0% |
| yaml | 79.5% |
toon's field declarations make structure explicit:
```
users[50]{id,name,email,role}:
```

the llm immediately knows: 50 users, 4 fields each. no parsing required.
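a short sketch of how mechanically recoverable that header is (the regex is mine, not from the toon spec):

```python
import re

def parse_header(line: str):
    # e.g. "users[50]{id,name,email,role}:" -> name, declared count, field list
    m = re.match(r"(\w+)\[(\d+)\]\{([^}]*)\}:", line)
    return m.group(1), int(m.group(2)), m.group(3).split(",")

print(parse_header("users[50]{id,name,email,role}:"))
# ('users', 50, ['id', 'name', 'email', 'role'])
```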
### structural validation: 70.0% accuracy
task: "does every record have all required fields?"
| format | accuracy |
|---|---|
| toon | **70.0%** |
| json | 50.0% |
| yaml | 45.2% |
this is toon's biggest advantage. explicit structure makes validation straightforward:
```
users[3]{id,name,email}:
  1,alice,alice@example.com
  2,bob,bob@example.com
  3,charlie,        // missing email!
```

the llm can immediately identify the missing value in charlie's record.
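the same check is mechanical: compare the declared count against the actual rows, and each row's value count against the declared fields. a sketch (my own helper; it assumes the clean block without the annotation comment):

```python
import re

def validate_block(toon: str) -> list[str]:
    lines = toon.strip().splitlines()
    header = re.match(r"\w+\[(\d+)\]\{([^}]*)\}:", lines[0])
    declared, fields = int(header.group(1)), header.group(2).split(",")
    rows = [line.strip() for line in lines[1:]]
    problems = []
    if len(rows) != declared:
        problems.append(f"expected {declared} rows, found {len(rows)}")
    for i, row in enumerate(rows, 1):
        values = row.split(",")  # "3,charlie," -> ["3", "charlie", ""] keeps the empty slot
        if len(values) != len(fields) or "" in values:
            problems.append(f"row {i} is incomplete: {values!r}")
    return problems

block = """users[3]{id,name,email}:
  1,alice,alice@example.com
  2,bob,bob@example.com
  3,charlie,"""
print(validate_block(block))  # ["row 3 is incomplete: ['3', 'charlie', '']"]
```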
### aggregation: 54.4% accuracy
task: "what is the average price?"
| format | accuracy |
|---|---|
| toon | **54.4%** |
| json | 48.8% |
| yaml | 42.1% |
aggregation is harder: it requires parsing values and computing over them. toon's tabular format makes numeric columns easier to identify:
```
products[5]{name,price}:
  widget,29.99
  gadget,49.99
  tool,19.99
  device,89.99
  item,39.99
```

the `price` column is clearly the second position in each row.
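outside the llm, the same positional layout makes deterministic aggregation a one-liner (illustrative):

```python
rows = ["widget,29.99", "gadget,49.99", "tool,19.99", "device,89.99", "item,39.99"]
prices = [float(r.split(",")[1]) for r in rows]  # price is always position 2
print(sum(prices) / len(prices))  # 45.99
```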
### filtering: 56.3% accuracy
task: "list all users with role=admin"
| format | accuracy |
|---|---|
| toon | **56.3%** |
| json | 50.5% |
| yaml | 47.8% |
filtering requires conditional logic. toon's structured rows help:
```
users[4]{id,name,role}:
  1,alice,admin
  2,bob,user
  3,charlie,admin
  4,diana,user
```

role is consistently in position 3. the llm can scan column 3 for "admin."
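the mechanical equivalent of that column scan (illustrative):

```python
rows = ["1,alice,admin", "2,bob,user", "3,charlie,admin", "4,diana,user"]
admins = [r.split(",")[1] for r in rows if r.split(",")[2] == "admin"]
print(admins)  # ['alice', 'charlie']
```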
## per-model results

### claude haiku

| format | accuracy | tokens vs. json |
|---|---|---|
| toon | 76.2% | -41.2% |
| json | 71.3% | baseline |
claude haiku shows the strongest toon advantage. anthropic's models excel at structured reasoning, and toon's explicit structure plays to this strength.
### gemini 2.5 flash

| format | accuracy | tokens vs. json |
|---|---|---|
| toon | 73.8% | -38.9% |
| json | 69.5% | baseline |
gemini also benefits significantly. google's models handle tabular data well, likely due to training on sheets/tables.
### gpt-5 nano

| format | accuracy | tokens vs. json |
|---|---|---|
| toon | 72.1% | -39.1% |
| json | 68.2% | baseline |
gpt-5 nano sees solid improvements. openai's smaller models gain from toon's explicit structure.
### grok 4

| format | accuracy | tokens vs. json |
|---|---|---|
| toon | 73.5% | -39.4% |
| json | 69.8% | baseline |
grok 4 matches the pattern. all models benefit from toon's structured approach.
## token reduction analysis

### where savings come from
eliminated syntax:

- json: `{"key": "value", "key2": "value2"}`
- toon: `{key,key2}: value,value2`

removed per record:

- 4 quotes per field
- 2 braces
- commas between fields
- colons after keys
for a 50-record array with 4 fields: ~800 tokens saved from syntax alone.
field name deduplication:

- json: repeats field names 50 times
- toon: declares field names once
50 records × 4 fields × 8 chars average = 1,600 chars saved.
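to reproduce numbers like these on your own payloads, count tokens with a real tokenizer rather than characters. a sketch using openai's tiktoken; the `cl100k_base` encoding is an assumption here, and each model family tokenizes differently, so exact savings will vary:

```python
import json
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# synthetic 50-record uniform dataset
records = [{"id": i, "name": f"user{i}", "email": f"user{i}@example.com", "role": "user"}
           for i in range(50)]

as_json = json.dumps(records, indent=2)
as_toon = "\n".join(["users[50]{id,name,email,role}:"] +
                    ["  " + ",".join(str(v) for v in r.values()) for r in records])

j, t = len(enc.encode(as_json)), len(enc.encode(as_toon))
print(f"json: {j} tokens, toon: {t} tokens, reduction: {1 - t / j:.1%}")
```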
### scaling impact
token reduction increases with dataset size:
| records | json tokens | toon tokens | reduction |
|---|---|---|---|
| 10 | 520 | 380 | 26.9% |
| 50 | 2,340 | 1,410 | 39.7% |
| 100 | 4,580 | 2,750 | 40.0% |
| 500 | 22,150 | 13,180 | 40.5% |
reduction plateaus around 40% for large datasets.
## accuracy per token efficiency
the key metric: accuracy per 1,000 tokens
| format | accuracy | tokens (avg) | efficiency |
|---|---|---|---|
| toon | 73.9% | 2,749 | **26.9** |
| json | 69.7% | 4,550 | 15.3 |
| yaml | 65.2% | 5,110 | 12.8 |
toon delivers 76% more accuracy per token than json. this means you get better results for less cost.
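the efficiency column is just accuracy divided by average tokens in thousands; a quick sanity check of the table:

```python
# efficiency = accuracy (%) per 1,000 tokens, using the averages above
for fmt, accuracy, tokens in [("toon", 73.9, 2749), ("json", 69.7, 4550), ("yaml", 65.2, 5110)]:
    print(fmt, round(accuracy / (tokens / 1000), 1))
# toon 26.9 / json 15.3 / yaml 12.8
```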
## limitations revealed

### deeply nested data

the benchmark included nested config objects:
```
{
  "server": {
    "database": {
      "connection": {"host": "localhost", "port": 5432}
    }
  }
}
```

results:

- json: 380 tokens, 95.2% accuracy
- toon: 420 tokens, 94.8% accuracy
toon's indentation overhead exceeds json's compact nesting here, and toon's accuracy comes in slightly lower due to whitespace confusion.
conclusion: use json for deeply nested structures.
### non-uniform arrays

the benchmark tested mixed-field arrays:
```
[
  {"id": 1, "name": "alice"},
  {"id": 2, "name": "bob", "email": "bob@example.com"}
]
```

results:

- json: 180 tokens, 87.3% accuracy
- toon: 195 tokens, 85.1% accuracy
toon can't use its tabular format effectively here and falls back to a more verbose representation, so json wins on both metrics.
conclusion: use json for non-uniform data.
## practical implications

### when to use toon

based on the benchmarks, toon excels when:

1. the data is uniform (80%+ consistent fields)
2. the dataset is medium to large (50+ records)
3. validation matters (you need to detect missing fields)
4. the workload is cost-sensitive (token efficiency is important)
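those criteria translate directly into a pre-flight check before choosing a serialization. a sketch with the thresholds above (the helper name and exact logic are mine):

```python
from collections import Counter

def should_use_toon(records: list[dict], min_records: int = 50,
                    min_uniformity: float = 0.8) -> bool:
    # uniformity: the share of records carrying the most common field set
    if len(records) < min_records:
        return False
    shapes = Counter(tuple(sorted(r.keys())) for r in records)
    return shapes.most_common(1)[0][1] / len(records) >= min_uniformity
```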
### expected performance

typical production workloads can expect:

- 35-45% token reduction
- 2-5% accuracy improvement on structured queries
- 10-20% better validation of data integrity
### optimal use cases

the benchmarks validate these use cases:

- employee directories
- product catalogs
- analytics data
- time-series logs
- api response caching
## conclusion
toon's benchmarks demonstrate something unusual: improved efficiency and accuracy simultaneously. the format's explicit structure helps llms understand and validate data while using fewer tokens.
the 73.9% accuracy vs 69.7% for json, achieved with 39.6% fewer tokens, represents a significant improvement for structured data queries. these aren't theoretical gains; they're measured across 209 real-world questions and 4 production llm models.
for applications processing uniform structured data at scale, toon delivers measurable benefits. the benchmarks provide the evidence to justify adoption.