# understanding toon format benchmarks
deep dive into toon performance metrics across multiple llm models
comprehensive analysis of toon format performance across 209 data retrieval questions and 4 major llm models. this deep dive explains what the numbers mean and why toon improves both efficiency and accuracy.
## benchmark methodology

### test design
209 data retrieval questions across 5 categories:

- field retrieval (simple lookups)
- structure awareness (understanding data shape)
- structural validation (detecting inconsistencies)
- aggregation (sum, average, count operations)
- filtering (conditional queries)

4 llm models tested:

- claude haiku (anthropic)
- gemini 2.5 flash (google)
- gpt-5 nano (openai)
- grok 4 (xai)

3 format variations:

- json (formatted with indentation)
- toon (tabular format)
- yaml (for comparison)
each model answered all 209 questions in all 3 formats. accuracy measured by exact match to expected output.
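a minimal sketch of what such a harness could look like, assuming a hypothetical `ask_model` helper for the llm calls and a `render` function that serializes each payload (neither is from the published benchmark code):

```python
# hypothetical harness: every (model, format, question) cell is scored by
# exact match against the expected answer, as described above.
MODELS = ["claude-haiku", "gemini-2.5-flash", "gpt-5-nano", "grok-4"]
FORMATS = ["json", "toon", "yaml"]

def exact_match(answer: str, expected: str) -> bool:
    # normalize whitespace and case so trivial formatting differences don't count as misses
    return answer.strip().lower() == expected.strip().lower()

def run_benchmark(questions, render, ask_model):
    # questions: dicts with "data", "prompt", and "expected" keys (assumed shape)
    scores = {(m, f): 0 for m in MODELS for f in FORMATS}
    for q in questions:
        for fmt in FORMATS:
            prompt = f"{render(q['data'], fmt)}\n\n{q['prompt']}"
            for model in MODELS:
                if exact_match(ask_model(model, prompt), q["expected"]):
                    scores[(model, fmt)] += 1
    # fraction of the 209 questions answered correctly per (model, format) pair
    return {key: count / len(questions) for key, count in scores.items()}
```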
## overall results

### accuracy comparison
| format | accuracy | tokens used | efficiency (accuracy per 1k tokens) |
|---|---|---|---|
| toon | **73.9%** | 39.6% fewer | **26.9** |
| json | 69.7% | baseline | 15.3 |
| yaml | 65.2% | 12.3% more | 12.8 |
toon achieves the highest accuracy while using significantly fewer tokens. this is remarkable: optimization typically trades accuracy for efficiency, but toon improves both.
### why toon is more accurate
explicit length markers help llms validate responses:
```
users[3]{id,name}:
  1,alice
  2,bob
  3,charlie
```

the `[3]` tells the llm "expect exactly 3 records." when data is malformed or truncated, llms notice the mismatch.
field declarations reduce ambiguity:
```
products[1]{sku,name,price}:
  A101,widget,29.99
```

declaring fields upfront helps llms understand the data structure before parsing values.
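for intuition, here is a minimal sketch of a tabular encoder that produces this layout for uniform records. it is a simplified illustration, not the reference toon implementation, which also handles quoting, nesting, and non-uniform fallbacks:

```python
def encode_tabular(name: str, records: list[dict]) -> str:
    # assumes uniform records: every dict has the same keys in the same order
    fields = list(records[0].keys())
    header = f"{name}[{len(records)}]{{{','.join(fields)}}}:"
    rows = ["  " + ",".join(str(r[f]) for f in fields) for r in records]
    return "\n".join([header, *rows])

print(encode_tabular("products", [{"sku": "A101", "name": "widget", "price": 29.99}]))
# products[1]{sku,name,price}:
#   A101,widget,29.99
```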
## category breakdown

### field retrieval: 99.6% accuracy
task: "what is user id 2's name?"
| format | accuracy | notes |
|---|---|---|
| toon | 99.6% | near-perfect |
| json | 99.3% | also excellent |
| yaml | 98.8% | slight edge to structured formats |
all formats handle simple lookups well. toon's marginal advantage comes from explicit field ordering: llms can navigate tabular data linearly.
### structure awareness: 88.0% accuracy
task: "how many fields does each user record have?"
| format | accuracy |
|---|---|
| toon | **88.0%** |
| json | 83.0% |
| yaml | 79.5% |
toon's field declarations make structure explicit:
```
users[50]{id,name,email,role}:
```

the llm immediately knows: 50 users, 4 fields each. no parsing required.
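a short sketch of how mechanically recoverable that header is (the regex is mine, not from the toon spec):

```python
import re

def parse_header(line: str):
    # e.g. "users[50]{id,name,email,role}:" -> name, declared count, field list
    m = re.match(r"(\w+)\[(\d+)\]\{([^}]*)\}:", line)
    return m.group(1), int(m.group(2)), m.group(3).split(",")

print(parse_header("users[50]{id,name,email,role}:"))
# ('users', 50, ['id', 'name', 'email', 'role'])
```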
### structural validation: 70.0% accuracy
task: "does every record have all required fields?"
| format | accuracy |
|---|---|
| toon | **70.0%** |
| json | 50.0% |
| yaml | 45.2% |
this is toon's biggest advantage. explicit structure makes validation straightforward:
```
users[3]{id,name,email}:
  1,alice,alice@example.com
  2,bob,bob@example.com
  3,charlie,        // missing email!
```

the llm can immediately identify the missing value in charlie's record.
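the same check is mechanical: compare the declared count against the actual rows, and each row's value count against the declared fields. a sketch (my own helper; it assumes the clean block without the annotation comment):

```python
import re

def validate_block(toon: str) -> list[str]:
    lines = toon.strip().splitlines()
    header = re.match(r"\w+\[(\d+)\]\{([^}]*)\}:", lines[0])
    declared, fields = int(header.group(1)), header.group(2).split(",")
    rows = [line.strip() for line in lines[1:]]
    problems = []
    if len(rows) != declared:
        problems.append(f"expected {declared} rows, found {len(rows)}")
    for i, row in enumerate(rows, 1):
        values = row.split(",")  # "3,charlie," -> ["3", "charlie", ""] keeps the empty slot
        if len(values) != len(fields) or "" in values:
            problems.append(f"row {i} is incomplete: {values!r}")
    return problems

block = """users[3]{id,name,email}:
  1,alice,alice@example.com
  2,bob,bob@example.com
  3,charlie,"""
print(validate_block(block))  # ["row 3 is incomplete: ['3', 'charlie', '']"]
```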
### aggregation: 54.4% accuracy
task: "what is the average price?"
| format | accuracy |
|---|---|
| toon | **54.4%** |
| json | 48.8% |
| yaml | 42.1% |
aggregation is harder: it requires parsing values and computing over them. toon's tabular format makes numeric columns easier to identify:
```
products[5]{name,price}:
  widget,29.99
  gadget,49.99
  tool,19.99
  device,89.99
  item,39.99
```

the `price` column is clearly the second position in each row.
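outside the llm, the same positional layout makes deterministic aggregation a one-liner (illustrative):

```python
rows = ["widget,29.99", "gadget,49.99", "tool,19.99", "device,89.99", "item,39.99"]
prices = [float(r.split(",")[1]) for r in rows]  # price is always position 2
print(sum(prices) / len(prices))  # 45.99
```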
### filtering: 56.3% accuracy
task: "list all users with role=admin"
| format | accuracy |
|---|---|
| toon | **56.3%** |
| json | 50.5% |
| yaml | 47.8% |
filtering requires conditional logic. toon's structured rows help:
```
users[4]{id,name,role}:
  1,alice,admin
  2,bob,user
  3,charlie,admin
  4,diana,user
```

role is consistently in position 3. the llm can scan column 3 for "admin."
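the mechanical equivalent of that column scan (illustrative):

```python
rows = ["1,alice,admin", "2,bob,user", "3,charlie,admin", "4,diana,user"]
admins = [r.split(",")[1] for r in rows if r.split(",")[2] == "admin"]
print(admins)  # ['alice', 'charlie']
```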
## per-model results

### claude haiku

| format | accuracy | tokens vs. json |
|---|---|---|
| toon | 76.2% | -41.2% |
| json | 71.3% | baseline |
claude haiku shows the strongest toon advantage. anthropic's models excel at structured reasoning, and toon's explicit structure plays to this strength.
### gemini 2.5 flash

| format | accuracy | tokens vs. json |
|---|---|---|
| toon | 73.8% | -38.9% |
| json | 69.5% | baseline |
gemini also benefits significantly. google's models handle tabular data well, likely due to training on sheets/tables.
### gpt-5 nano

| format | accuracy | tokens vs. json |
|---|---|---|
| toon | 72.1% | -39.1% |
| json | 68.2% | baseline |
gpt-5 nano sees solid improvements. openai's smaller models gain from toon's explicit structure.
### grok 4

| format | accuracy | tokens vs. json |
|---|---|---|
| toon | 73.5% | -39.4% |
| json | 69.8% | baseline |
grok 4 matches the pattern. all models benefit from toon's structured approach.
## token reduction analysis

### where savings come from
eliminated syntax:

- json: `{"key": "value", "key2": "value2"}`
- toon: `{key,key2}: value,value2`

removed per record:

- 4 quotes per field
- 2 braces
- commas between fields
- colons after keys
for a 50-record array with 4 fields: ~800 tokens saved from syntax alone.
field name deduplication:

- json: repeats field names 50 times
- toon: declares field names once
50 records × 4 fields × 8 chars average = 1,600 chars saved.
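to reproduce numbers like these on your own payloads, count tokens with a real tokenizer rather than characters. a sketch using openai's tiktoken; the `cl100k_base` encoding is an assumption here, and each model family tokenizes differently, so exact savings will vary:

```python
import json
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# synthetic 50-record uniform dataset
records = [{"id": i, "name": f"user{i}", "email": f"user{i}@example.com", "role": "user"}
           for i in range(50)]

as_json = json.dumps(records, indent=2)
as_toon = "\n".join(["users[50]{id,name,email,role}:"] +
                    ["  " + ",".join(str(v) for v in r.values()) for r in records])

j, t = len(enc.encode(as_json)), len(enc.encode(as_toon))
print(f"json: {j} tokens, toon: {t} tokens, reduction: {1 - t / j:.1%}")
```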
### scaling impact
token reduction increases with dataset size:
| records | json tokens | toon tokens | reduction |
|---|---|---|---|
| 10 | 520 | 380 | 26.9% |
| 50 | 2,340 | 1,410 | 39.7% |
| 100 | 4,580 | 2,750 | 40.0% |
| 500 | 22,150 | 13,180 | 40.5% |
reduction plateaus around 40% for large datasets.
## accuracy per token efficiency
the key metric: accuracy per 1,000 tokens
| format | accuracy | tokens (avg) | efficiency |
|---|---|---|---|
| toon | 73.9% | 2,749 | **26.9** |
| json | 69.7% | 4,550 | 15.3 |
| yaml | 65.2% | 5,110 | 12.8 |
toon delivers 76% more accuracy per token than json. this means you get better results for less cost.
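the efficiency column is just accuracy divided by average tokens in thousands; a quick sanity check of the table:

```python
# efficiency = accuracy (%) per 1,000 tokens, using the averages above
for fmt, accuracy, tokens in [("toon", 73.9, 2749), ("json", 69.7, 4550), ("yaml", 65.2, 5110)]:
    print(fmt, round(accuracy / (tokens / 1000), 1))
# toon 26.9 / json 15.3 / yaml 12.8
```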
## limitations revealed

### deeply nested data

the benchmark included nested config objects:
```
{
  "server": {
    "database": {
      "connection": {"host": "localhost", "port": 5432}
    }
  }
}
```

results:

- json: 380 tokens, 95.2% accuracy
- toon: 420 tokens, 94.8% accuracy
toon's indentation overhead exceeds json's compact nesting here, and toon's accuracy comes in slightly lower due to whitespace confusion.
conclusion: use json for deeply nested structures.
### non-uniform arrays

the benchmark tested mixed-field arrays:
```
[
  {"id": 1, "name": "alice"},
  {"id": 2, "name": "bob", "email": "bob@example.com"}
]
```

results:

- json: 180 tokens, 87.3% accuracy
- toon: 195 tokens, 85.1% accuracy
toon can't use its tabular format effectively here and falls back to a more verbose representation, so json wins on both metrics.
conclusion: use json for non-uniform data.
## practical implications

### when to use toon

based on the benchmarks, toon excels when:

1. the data is uniform (80%+ consistent fields)
2. the dataset is medium to large (50+ records)
3. validation matters (you need to detect missing fields)
4. the workload is cost-sensitive (token efficiency is important)
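those criteria translate directly into a pre-flight check before choosing a serialization. a sketch with the thresholds above (the helper name and exact logic are mine):

```python
from collections import Counter

def should_use_toon(records: list[dict], min_records: int = 50,
                    min_uniformity: float = 0.8) -> bool:
    # uniformity: the share of records carrying the most common field set
    if len(records) < min_records:
        return False
    shapes = Counter(tuple(sorted(r.keys())) for r in records)
    return shapes.most_common(1)[0][1] / len(records) >= min_uniformity
```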
### expected performance

typical production workloads can expect:

- 35-45% token reduction
- 2-5% accuracy improvement on structured queries
- 10-20% better validation of data integrity
### optimal use cases

the benchmarks validate these use cases:

- employee directories
- product catalogs
- analytics data
- time-series logs
- api response caching
## conclusion
toon's benchmarks demonstrate something unusual: improved efficiency and accuracy simultaneously. the format's explicit structure helps llms understand and validate data while using fewer tokens.
the 73.9% accuracy vs 69.7% for json, achieved with 39.6% fewer tokens, represents a significant improvement for structured data queries. these aren't theoretical gains; they're measured across 209 real-world questions and 4 production llm models.
for applications processing uniform structured data at scale, toon delivers measurable benefits. the benchmarks provide the evidence to justify adoption.