docs(benchmarks): add generation benchmarks by vetertann · Pull Request #239 · toon-format/toon

vetertann · 2025-12-05T18:04:31Z

Linked Issue

Closes #207

Description

This PR adds a Generation Benchmarks section to the documentation. It compares TOON with JSON and JSON Structured Output (JSO) across 21 LLMs, focusing on token efficiency, accuracy, and repair behavior.

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)
Performance improvement
Test coverage improvement

Changes Made

Added a "## 2. Generation benchmarks" section to docs/guide/benchmarks.md.
Documented the methodology.
Added performance tables comparing 1-shot accuracy, final accuracy (after repair loops), and token budgets across 21 models.
Included a qualitative analysis of accuracy, repair loops, and token-efficiency scaling.
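
For readers unfamiliar with the metrics named above, here is a minimal sketch of how 1-shot accuracy, final accuracy after repair loops, and total token spend could be derived from per-task attempt logs. It is not the harness used for this PR: the `TaskRun` record, its fields, and the `summarize` function are hypothetical names chosen for illustration, and treating the "token budget" as the sum of tokens over all attempts is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """Hypothetical record for one generation task (not the real harness schema)."""
    attempts_valid: list[bool]     # validity of each attempt in order; index 0 is the first shot
    tokens_per_attempt: list[int]  # tokens spent on each attempt, including repair attempts


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    """Aggregate 1-shot accuracy, final accuracy, and total tokens over all tasks."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    # 1-shot accuracy: fraction of tasks whose first attempt was already valid.
    one_shot = sum(r.attempts_valid[0] for r in runs) / n
    # Final accuracy: fraction of tasks with at least one valid attempt,
    # i.e. valid on the first shot or after some repair loop.
    final = sum(any(r.attempts_valid) for r in runs) / n
    # Assumed token budget: everything spent, repair attempts included.
    total_tokens = sum(sum(r.tokens_per_attempt) for r in runs)
    return {
        "one_shot_accuracy": one_shot,
        "final_accuracy": final,
        "total_tokens": total_tokens,
    }
```

Under these assumptions, a format with lower 1-shot accuracy can still reach the same final accuracy, but its repair attempts show up as extra tokens in the total.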

SPEC Compliance

This PR implements/fixes spec compliance
Spec section(s) affected: N/A (Documentation only)
Spec version: N/A

Testing

All existing tests pass
Added new tests for changes
Tests cover edge cases and spec compliance

Pre-submission Checklist

My code follows the project's coding standards
I have run code formatting/linting tools (Markdown linting)
I have added tests that prove my fix/feature works
New and existing tests pass locally
I have updated documentation if needed
I have reviewed the TOON specification for relevant sections

Breaking Changes

No breaking changes
Breaking changes (describe migration path below)

Additional Context

Benchmarks were run via the Nebius API.

johannschopplich · 2025-12-05T19:36:46Z

Hi there!
Thanks for the benchmark results. But the benchmark docs are auto-generated from the internal benchmarks. Also, I don't intend to add benchmarks without the code to reproduce them.

Can you please enhance my benchmarks package with your code and share the tool results? As a hint, the final generation result that gets embedded in benchmarks.md is generated by scripts/accuracy-benchmark.ts.

vetertann · 2025-12-05T19:47:21Z

Oh, ok... I did this PR just because in your comment to the issue #207 you wrote:
"Once you're happy with the setup and have stable results, I'd definitely be interested in:

A write‑up or summary table we can link to, and
If you're up for it, a PR that adds a short "generation benchmarks" section under docs/guide/benchmarks (even if the harness itself stays in your Python repo and we just describe the methodology and link out)."

johannschopplich · 2025-12-05T19:49:39Z

I see, sorry, missed that. Could you add the generation benchmarks (tho in Python, no problem) to this repo as well? For the sake of reproducibility? Thanks.

vetertann · 2025-12-05T20:00:11Z

Sure, I’ll open a PR adding it under benchmarks/generation

johannschopplich · 2026-05-17T16:38:15Z

This was generated by AI during triage.

The doc section is well-written and the results are compelling. Holding the merge for one reason. We'd like benchmarks documented here to be reproducible from this repo, and the harness behind these numbers lives only in your external Python repo (see the close note on #240).

Once a TS port of the harness lands, this section ships alongside it. Marking as draft so it doesn't pile up review-ready signals in the meantime – please flip back to ready when the port is in.

Thanks for the rigorous methodology on the harness side.

docs(benchmarks): add generation benchmarks summary

ff4678c

johannschopplich mentioned this pull request May 17, 2026

feat(benchmarks): add generation benchmark #240

Closed

21 tasks

johannschopplich mentioned this pull request May 17, 2026

TOON benchmark for generation tasks #207

Closed

johannschopplich marked this pull request as draft May 17, 2026 16:39

vetertann wants to merge 1 commit into toon-format:main from vetertann:docs/generation-benchmarks

2 participants
