analysis · Nov 15, 2025 · 10 min read

understanding toon format benchmarks

deep dive into toon performance metrics across multiple llm models

benchmarks · performance analysis · llm accuracy · token efficiency · data retrieval · technical analysis


comprehensive analysis of toon format performance across 209 data retrieval questions and 4 major llm models. this deep dive explains what the numbers mean and why toon improves both efficiency and accuracy.

benchmark methodology

test design

209 data retrieval questions across 5 categories:

- field retrieval (simple lookups)
- structure awareness (understanding data shape)
- structural validation (detecting inconsistencies)
- aggregation (sum, average, count operations)
- filtering (conditional queries)

4 llm models tested:

- claude haiku (anthropic)
- gemini 2.5 flash (google)
- gpt-5 nano (openai)
- grok 4 (xai)

3 format variations:

- json (formatted with indentation)
- toon (tabular format)
- yaml (for comparison)

each model answered all 209 questions in all 3 formats. accuracy was measured by exact match against the expected output.
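
to make the setup concrete, here's a minimal sketch of what such a scoring loop looks like. the `ask` callback, question dictionaries, and serializer mapping are hypothetical stand-ins, not the benchmark's actual harness:

```python
from typing import Callable

def exact_match(reply: str, expected: str) -> bool:
    # normalize whitespace and case so formatting noise isn't scored as a miss
    return reply.strip().lower() == expected.strip().lower()

def score(ask: Callable[[str, str], str], questions: list[dict],
          serializers: dict[str, Callable]) -> dict[str, float]:
    """accuracy per format. `ask(prompt, payload)` is supplied by the caller
    as a thin wrapper around whichever model api is under test; `serializers`
    maps a format name to a function that renders a dataset as a string."""
    results = {}
    for name, serialize in serializers.items():
        hits = sum(
            exact_match(ask(q["prompt"], serialize(q["data"])), q["expected"])
            for q in questions
        )
        results[name] = hits / len(questions)
    return results
```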

overall results

accuracy comparison

| format | accuracy | tokens used | efficiency (accuracy per 1k tokens) |
|---|---|---|---|
| toon | **73.9%** | 39.6% fewer | **26.9** |
| json | 69.7% | baseline | 15.3 |
| yaml | 65.2% | 12.3% more | 12.8 |

toon achieves the highest accuracy while using significantly fewer tokens. this is unusual: optimization typically trades accuracy for efficiency. toon improves both.

why toon is more accurate

explicit length markers help llms validate responses:

```
users[3]{id,name}:
  1,alice
  2,bob
  3,charlie
```

the `[3]` tells the llm "expect exactly 3 records." when data is malformed or truncated, llms notice the mismatch.

field declarations reduce ambiguity:

```
products{sku,name,price}:
  A101,widget,29.99
```

declaring fields upfront helps llms understand data structure before parsing values.
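
for reference, here's a minimal python sketch of how a uniform array gets rendered into this tabular layout. this is an illustration of the format, not the official toon encoder, and it skips quoting and escaping of values:

```python
def encode_toon(name: str, records: list[dict]) -> str:
    """render a uniform list of dicts in toon's tabular layout:
    a name[count]{fields}: header, then one comma-joined row per record."""
    fields = list(records[0].keys())
    header = f"{name}[{len(records)}]{{{','.join(fields)}}}:"
    rows = ["  " + ",".join(str(r[f]) for f in fields) for r in records]
    return "\n".join([header, *rows])

print(encode_toon("users", [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
    {"id": 3, "name": "charlie"},
]))
# users[3]{id,name}:
#   1,alice
#   2,bob
#   3,charlie
```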

category breakdown

field retrieval: 99.6% accuracy

task: "what is user id 2's name?"

| format | accuracy | notes |
|---|---|---|
| toon | 99.6% | near-perfect |
| json | 99.3% | also excellent |
| yaml | 98.8% | slight edge to structured formats |

all formats handle simple lookups well. toon's marginal advantage comes from explicit field ordering: llms can scan tabular data linearly.

structure awareness: 88.0% accuracy

task: "how many fields does each user record have?"

| format | accuracy |
|---|---|
| toon | **88.0%** |
| json | 83.0% |
| yaml | 79.5% |

toon's field declarations make structure explicit:

```
users[50]{id,name,email,role}:
```

the llm immediately knows: 50 users, 4 fields each. no parsing required.

structural validation: 70.0% accuracy

task: "does every record have all required fields?"

| format | accuracy |
|---|---|
| toon | **70.0%** |
| json | 50.0% |
| yaml | 45.2% |

this is toon's biggest advantage. explicit structure makes validation straightforward:

```
users[3]{id,name,email}:
  1,alice,alice@example.com
  2,bob,bob@example.com
  3,charlie,  // missing email!
```

the llm can immediately identify the missing value in charlie's record.
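
this check is mechanical, which is exactly what the format enables. here's a small sketch (my own, not part of the benchmark) of validating a tabular toon block against its own declared structure:

```python
import re

def validate_toon(block: str) -> list[str]:
    """check a tabular toon block against its declared structure:
    the row count must match [n], and every row needs one value per field."""
    lines = block.strip().splitlines()
    header = re.match(r"(\w+)\[(\d+)\]\{([^}]*)\}:", lines[0])
    if header is None:
        return ["missing tabular header"]
    declared, fields = int(header.group(2)), header.group(3).split(",")
    problems = []
    rows = lines[1:]
    if len(rows) != declared:
        problems.append(f"declared {declared} records, found {len(rows)}")
    for i, row in enumerate(rows, 1):
        values = row.strip().split(",")
        # a trailing comma (e.g. "3,charlie,") leaves an empty last value
        if len(values) != len(fields) or "" in values:
            problems.append(f"row {i} does not match {fields}: {row.strip()!r}")
    return problems

print(validate_toon(
    "users[3]{id,name,email}:\n  1,alice,a@x.com\n  2,bob,b@x.com\n  3,charlie,"
))
# ["row 3 does not match ['id', 'name', 'email']: '3,charlie,'"]
```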

aggregation: 54.4% accuracy

task: "what is the average price?"

| format | accuracy |
|---|---|
| toon | **54.4%** |
| json | 48.8% |
| yaml | 42.1% |

aggregation is harder: it requires parsing values and then computing over them. toon's tabular format makes numeric columns easier to identify:

```
products[5]{name,price}:
  widget,29.99
  gadget,49.99
  tool,19.99
  device,89.99
  item,39.99
```

the `price` column is clearly the second position in each row.
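
to see why that column discipline helps, here's the same computation done mechanically in python; the models are asked to do this implicitly from the prompt:

```python
toon = """products[5]{name,price}:
  widget,29.99
  gadget,49.99
  tool,19.99
  device,89.99
  item,39.99"""

# price is always the second value in every row
prices = [float(line.strip().split(",")[1]) for line in toon.splitlines()[1:]]
print(round(sum(prices) / len(prices), 2))  # 45.99
```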

filtering: 56.3% accuracy

task: "list all users with role=admin"

| format | accuracy |
|---|---|
| toon | **56.3%** |
| json | 50.5% |
| yaml | 47.8% |

filtering requires conditional logic. toon's structured rows help:

```
users[4]{id,name,role}:
  1,alice,admin
  2,bob,user
  3,charlie,admin
  4,diana,user
```

role is consistently the third value in each row, so the llm can scan that column for "admin".
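
the same scan, made explicit in python:

```python
toon = """users[4]{id,name,role}:
  1,alice,admin
  2,bob,user
  3,charlie,admin
  4,diana,user"""

rows = [line.strip().split(",") for line in toon.splitlines()[1:]]
admins = [r[1] for r in rows if r[2] == "admin"]  # role is the third value
print(admins)  # ['alice', 'charlie']
```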

per-model results

claude haiku

| format | accuracy | tokens |
|---|---|---|
| toon | 76.2% | -41.2% |
| json | 71.3% | baseline |

claude haiku shows the strongest toon advantage. anthropic's models excel at structured reasoning, and toon's explicit structure plays to this strength.

gemini 2.5 flash

| format | accuracy | tokens |
|---|---|---|
| toon | 73.8% | -38.9% |
| json | 69.5% | baseline |

gemini also benefits significantly. google's models handle tabular data well, likely due to training on sheets/tables.

gpt-5 nano

| format | accuracy | tokens |
|---|---|---|
| toon | 72.1% | -39.1% |
| json | 68.2% | baseline |

gpt-5 nano sees solid improvements. openai's smaller models gain from toon's explicit structure.

grok 4

| format | accuracy | tokens |
|---|---|---|
| toon | 73.5% | -39.4% |
| json | 69.8% | baseline |

grok 4 matches the pattern. all models benefit from toon's structured approach.

token reduction analysis

where savings come from

eliminated syntax:

- json: `{"key": "value", "key2": "value2"}`
- toon: `{key,key2}: value,value2`

removed per record:

- 4 quotes per field
- 2 braces
- commas between fields
- colons after keys

for a 50-record array with 4 fields: ~800 tokens saved from syntax alone.

field name deduplication:

- json: repeats field names 50 times
- toon: declares field names once

50 records × 4 fields × 8 chars average = 1,600 chars saved.
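
a rough way to check these savings yourself, reusing the `encode_toon` sketch from earlier. this compares character counts, which are only a proxy for tokens, since actual token counts depend on the tokenizer:

```python
import json

# 50 uniform records, similar in shape to the examples above
records = [
    {"id": i, "name": f"user{i}", "email": f"user{i}@example.com", "role": "user"}
    for i in range(1, 51)
]

as_json = json.dumps(records, indent=2)   # the formatted-json baseline
as_toon = encode_toon("users", records)   # encoder sketch from earlier

# character counts only; the token ratio lands in the same neighborhood
print(len(as_json), len(as_toon), round(1 - len(as_toon) / len(as_json), 2))
```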

scaling impact

token reduction increases with dataset size:

| records | json tokens | toon tokens | reduction |
|---|---|---|---|
| 10 | 520 | 380 | 26.9% |
| 50 | 2,340 | 1,410 | 39.7% |
| 100 | 4,580 | 2,750 | 40.0% |
| 500 | 22,150 | 13,180 | 40.5% |

reduction plateaus around 40% for large datasets.

accuracy per token efficiency

the key metric: accuracy per 1,000 tokens

| format | accuracy | tokens (avg) | efficiency |
|---|---|---|---|
| toon | 73.9% | 2,749 | **26.9** |
| json | 69.7% | 4,550 | 15.3 |
| yaml | 65.2% | 5,110 | 12.8 |

toon delivers 76% more accuracy per token than json. this means you get better results for less cost.
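
the efficiency column is simple arithmetic, reproduced here so the numbers are easy to verify:

```python
def efficiency(accuracy_pct: float, avg_tokens: float) -> float:
    """accuracy points per 1,000 tokens of input."""
    return accuracy_pct / (avg_tokens / 1000)

print(round(efficiency(73.9, 2749), 1))  # 26.9 (toon)
print(round(efficiency(69.7, 4550), 1))  # 15.3 (json)
```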

limitations revealed

deeply nested data

benchmark included nested config objects:

```json
{
  "server": {
    "database": {
      "connection": {"host": "localhost", "port": 5432}
    }
  }
}
```

results:

- json: 380 tokens, 95.2% accuracy
- toon: 420 tokens, 94.8% accuracy

toon's indentation overhead exceeds json's compact nesting. toon's accuracy is also slightly lower, likely because nesting levels signaled only by whitespace are easier to confuse than explicit braces.

conclusion: use json for deeply nested structures.

non-uniform arrays

benchmark tested mixed-field arrays:

```json
[
  {"id": 1, "name": "alice"},
  {"id": 2, "name": "bob", "email": "bob@example.com"}
]
```

results:

- json: 180 tokens, 87.3% accuracy
- toon: 195 tokens, 85.1% accuracy

toon can't apply its tabular format effectively here and falls back to a more verbose representation. json wins on both metrics.

conclusion: use json for non-uniform data.

practical implications

when to use toon

based on the benchmarks, toon excels when:

1. uniform data (80%+ consistent fields)
2. medium-large datasets (50+ records)
3. validation matters (need to detect missing fields)
4. cost-sensitive (token efficiency important)

one way to turn these criteria into an encoding decision is sketched below.
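
a minimal sketch of that decision, assuming the thresholds from the list; the key-set uniformity measure is my choice, not something the benchmark defines:

```python
def prefer_toon(records: list[dict],
                min_uniformity: float = 0.8, min_records: int = 50) -> bool:
    """heuristic encoder choice: toon for large, mostly-uniform arrays,
    json otherwise. thresholds come straight from the criteria above."""
    if not records or len(records) < min_records:
        return False
    # uniformity: share of records whose keys match the first record exactly
    reference = set(records[0].keys())
    uniform = sum(1 for r in records if set(r.keys()) == reference)
    return uniform / len(records) >= min_uniformity
```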

expected performance

typical production workloads can expect:

- 35-45% token reduction
- 2-5% accuracy improvement on structured queries
- 10-20% better validation of data integrity

optimal use cases

the benchmarks validate these use cases:

- employee directories
- product catalogs
- analytics data
- time-series logs
- api response caching

conclusion

toon's benchmarks demonstrate something unusual: improved efficiency and accuracy simultaneously. the format's explicit structure helps llms understand and validate data while using fewer tokens.

the 73.9% accuracy vs 69.7% for json, achieved with 39.6% fewer tokens, represents a significant improvement for structured data queries. these aren't theoretical gains: they're measured across 209 real-world questions and 4 production llm models.

for applications processing uniform structured data at scale, toon delivers measurable benefits. the benchmarks provide the evidence to justify adoption.
