---
title: ML Workbench
emoji: 🧠
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
---
Tokenizer evidence for LLM models and per-language cost trade-offs. The workbench tries to answer one question:
If the same meaning is written in English, Arabic, Hindi, Japanese, or Chinese, how much extra cost and context do different tokenizer families impose?
It serves strict multilingual benchmark evidence, scenario modelling, and a token inspector.
Most model comparisons incorrectly treat tokenization as invisible plumbing.
The same model can consume very different token counts depending on:
- the language
- the tokenizer family
- the deployable model sitting on top of that tokenizer
Which can affect:
- API cost
- usable context window
- how multilingual traffic scales in production
- Benchmark compares tokenizer families across languages
- Catalog maps tokenizer families to deployable free OpenRouter models
- Scenario Lab turns tokenizer evidence into cost and context trade-offs
- Audit explains formulas, data sources, and provenance rules
- paste text and see how a tokenizer actually splits it
- inspect token IDs, fragmentation, and language efficiency
- compare two free OpenRouter models side by side
- inspect answer quality, reasoning traces, and token usage
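The splitting the inspector visualizes can be sketched with a toy greedy longest-match tokenizer. The vocab, IDs, and `tokenize` helper below are invented for illustration; they are not any tokenizer family the app actually ships:

```python
# Toy greedy longest-match tokenizer. Vocab and IDs are made up.
TOY_VOCAB = {
    "token": 1, "izer": 2, "iz": 3, "er": 4,
    "t": 5, "o": 6, "k": 7, "e": 8, "n": 9, "i": 10, "z": 11, "r": 12, "s": 13,
}

def tokenize(word: str) -> list[str]:
    """Greedily take the longest vocab piece at each position."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest candidate first
            if word[i:j] in TOY_VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocab piece covers {word[i]!r}")
    return pieces

pieces = tokenize("tokenizers")
ids = [TOY_VOCAB[p] for p in pieces]
# "tokenizers" splits into ['token', 'izer', 's'] -> IDs [1, 2, 13]
```

A word that splits into more than one piece is "fragmented"; real tokenizers do the same thing with vocabularies of tens of thousands of pieces, which is why fragmentation varies so much across languages.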
The app uses two benchmark lanes on purpose.
- default lane
  - local committed FLORES snapshot
  - aligned multilingual text
  - safe to use for deploy-grade cost and context analysis
- opt-in lane
  - live natural-language streaming samples
  - more realistic but less controlled
  - useful for exploratory tokenizer behavior, not headline scenario estimates
Only Strict Evidence feeds Scenario Lab cost and context projections.
- Relative Token Cost (vs English): how many more tokens a language needs than aligned English for the same meaning
- Text packed into each token: UTF-8 bytes per token; higher means a tokenizer packs more raw text into each token
- Word split rate: how often a tokenizer breaks words into continuation pieces
- Tokens per word / character: a practical proxy for tokenizer fragmentation pressure
- Unique tokens used: how broad the observed token coverage is on the selected benchmark rows
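Most of these metrics reduce to simple ratios. A minimal sketch, assuming a tokenizer has already produced `tokens` for `text` (the function name and dict keys are invented here; the word split rate is omitted because it depends on each tokenizer family's continuation-marker convention):

```python
def token_metrics(text: str, tokens: list[str], english_token_count: int) -> dict:
    """Metrics for one tokenized sample. `tokens` is whatever pieces the
    tokenizer produced for `text`; `english_token_count` is the token count
    of the aligned English text with the same meaning."""
    words = text.split()
    return {
        # Relative Token Cost vs English: > 1.0 means a token tax
        "rtc_vs_english": len(tokens) / english_token_count,
        # UTF-8 bytes per token: higher = more raw text packed per token
        "bytes_per_token": len(text.encode("utf-8")) / len(tokens),
        # Fragmentation-pressure proxies
        "tokens_per_word": len(tokens) / len(words),
        "tokens_per_char": len(tokens) / len(text),
        # Breadth of observed token coverage on this sample
        "unique_tokens": len(set(tokens)),
    }

m = token_metrics("नमस्ते दुनिया", ["नम", "स्ते", " दु", "निया"], 2)
# rtc_vs_english = 2.0: this sample needs twice the aligned English tokens
```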
- Start in Benchmark to compare tokenizer families across your target languages
- Move to Catalog to see which free deployable models sit on top of those families
- Open Scenario Lab to test what those token differences mean for monthly cost and context loss
- Use Audit to verify formulas, source types, and provenance assumptions
- strict lane uses a local committed FLORES-derived multilingual snapshot in `data/strict_parallel/`
- streaming lane uses live exploratory text samples
- model availability comes from the local OpenRouter registry and attached free-model mappings
- only free models are used for hosted runtime comparisons
- Artificial Analysis speed data is local snapshot metadata only; it is supporting context, not the main token-tax evidence
- the hosted app exposes a curated exact-only tokenizer set
- heavier or proxy tokenizer families are intentionally excluded to keep hosted benchmarking stable
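That curation policy amounts to a registry filter. A sketch under assumed field names (`provenance`, `heavy`) and invented family names; the real registry schema lives in `workbench/tokenizer_registry.py`:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenizerFamily:
    name: str
    provenance: str  # "exact" means a real vocab; "proxy" an approximation
    heavy: bool      # large download or memory footprint

def hosted_families(registry: list[TokenizerFamily]) -> list[TokenizerFamily]:
    """Keep only exact, lightweight families for the hosted benchmark set."""
    return [f for f in registry if f.provenance == "exact" and not f.heavy]

registry = [
    TokenizerFamily("bpe-small", "exact", heavy=False),
    TokenizerFamily("approx-sp", "proxy", heavy=False),
    TokenizerFamily("bpe-huge", "exact", heavy=True),
]
# Only "bpe-small" survives the curation filter.
```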
The app is intentionally split into a few clear layers so the data flow stays explainable.
`app.py` builds the top-level Gradio shell. It mounts four major product surfaces:
- Token Tax Workbench
- Tokenizer Inspector
- Model Comparison
- Why Tokenizers Matter
- `workbench/corpora.py` provides the strict snapshot and the streaming corpus fetch path
- `workbench/tokenizer_registry.py` is the source of truth for tokenizer-family metadata
- `workbench/tokenizer.py` loads and caches tokenizer implementations
- `workbench/token_tax.py` computes benchmark rows, raw rows, scenario projections, and appendices
- `workbench/charts.py` turns benchmark and scenario rows into Plotly figures
- `workbench/token_tax_ui.py` wires those handlers into Gradio tabs, tables, plots, CSV export, and explanatory copy
- `workbench/model_registry.py` maps tokenizer families to free deployable models, pricing, and benchmark-only speed metadata
- Scenario Lab only uses models with acceptable tokenizer provenance for the current policy settings
- Scenario Lab consumes strict benchmark rows, not streaming exploration rows
- the scenario flow translates tokenizer inflation into:
- projected monthly input tokens
- projected monthly cost
- estimated context loss
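The scenario arithmetic is straightforward. A minimal sketch, assuming a per-language RTC multiplier and per-million input pricing; the function, parameter names, and example figures are illustrative, not the exact formulas in `workbench/token_tax.py`:

```python
def project_scenario(
    monthly_english_tokens: float,  # baseline monthly input tokens if traffic were English
    rtc: float,                     # Relative Token Cost vs English for the target language
    price_per_million_usd: float,   # input-token price for the chosen model
    context_window: int,            # model context window in tokens
) -> dict:
    projected = monthly_english_tokens * rtc
    return {
        "monthly_input_tokens": projected,
        "monthly_cost_usd": projected / 1e6 * price_per_million_usd,
        # At RTC 2.5 the same meaning fills 2.5x the tokens, so only
        # 1/2.5 of the English-equivalent content fits in the window.
        "effective_context_tokens": int(context_window / rtc),
    }

s = project_scenario(50e6, 2.5, 0.20, 128_000)
# 125M monthly input tokens, $25.00/month, effective window 51,200 tokens
```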
- primary host: Render
- hosting is stateless
- benchmark persistence comes from repo-tracked snapshots, not runtime disk state
- the Docker image warms the default tokenizer set during build so the first request is faster and less memory-spiky on low-resource hosting
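One common way to implement that warm-at-build pattern is a cached loader plus a `warm()` entry point invoked from a Dockerfile `RUN` step. This is a sketch: the default family names are hypothetical and a stub stands in for the real, expensive tokenizer load:

```python
from functools import lru_cache

DEFAULT_FAMILIES = ["bpe-small", "spm-base"]  # hypothetical default set

def _load_tokenizer(family: str) -> dict:
    # Stand-in for an expensive load (vocab files, merge tables, etc.).
    return {"family": family, "vocab_size": 32_000}

@lru_cache(maxsize=None)
def get_tokenizer(family: str) -> dict:
    """Load each family once; later calls return the cached instance."""
    return _load_tokenizer(family)

def warm() -> None:
    """Run at image build time so first requests skip the load cost."""
    for family in DEFAULT_FAMILIES:
        get_tokenizer(family)

warm()
# After warming, runtime lookups for default families are pure cache hits.
```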
- `app.py`: Gradio shell and comparison tab
- `workbench/explainer.py`: plain-language explainer tab
- `workbench/token_tax.py`: benchmark/scenario computation
- `workbench/token_tax_ui.py`: workbench UI composition
- `workbench/charts.py`: chart builders
- `workbench/tokenizer_registry.py`: tokenizer family metadata
- `workbench/model_registry.py`: free model mappings and pricing metadata
- `workbench/corpora.py`: corpus sources and snapshot loading
- `data/`: committed benchmark and telemetry snapshots
- `review_harness.py`: screenshot review harness
- `scripts/`: review, snapshot, and utility scripts
- `tests/`: TDD/regression suite
This repo is being set up with portfolio-style quality gates:
- `ruff` for fast linting
- `mypy` on the core Python modules
- `pytest` for regression coverage
- screenshot review harness for visual QA
- smoke import for `build_ui()`
Representative local commands:
```shell
make lint
make typecheck
make test
uv run python -c "from app import build_ui; ui = build_ui(); print(type(ui).__name__)"
```

Requires `uv`.
```shell
make install
make run
```

Open the local Gradio URL printed in the terminal.
If you want hosted-style model comparison locally:
```shell
OPENROUTER_API_KEY=sk-or-... make run
```

Render is the primary hosted target.

```shell
make deploy
```

Render auto-deploys from `main`.
- Streaming Exploration is exploratory and should not be treated as aligned multilingual RTC evidence
- speed metadata is benchmark-only supporting context
- tokenizer-to-model mappings are intentionally conservative
- hosted runtime comparison is restricted to free models to keep costs controlled
Supporting material now lives under docs/:
- guides
- research/source notes
- operational deployment notes
- archived workflow/project material