Add worstcase benchmark #23

hendrikvanantwerpen · 2024-10-03T16:53:57Z

Adds a benchmark that brings out the worst in tiktoken.

I tried to understand how the splitting regexes actually work, but that was kind of annoying.
Just selecting ranges from the whole Unicode set excluding whitespace seems to do the trick though.
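
For illustration, here is a minimal sketch of that idea, not the benchmark code from this PR. It samples individual code points rather than ranges, and the function name `worstcase_text` and its parameters are hypothetical. It assumes that any non-whitespace Unicode scalar value is acceptable input, including unassigned code points.

```python
import random


def worstcase_text(n_chars: int, seed: int = 0) -> str:
    """Return n_chars pseudo-random non-whitespace Unicode characters.

    Hypothetical helper: a sketch of "pick from all of Unicode, skip
    whitespace", not the generator used by the actual benchmark.
    """
    rng = random.Random(seed)  # seeded so runs are reproducible
    chars = []
    while len(chars) < n_chars:
        cp = rng.randrange(0x110000)  # any code point up to U+10FFFF
        # Surrogates are not scalar values and cannot be encoded as UTF-8.
        if 0xD800 <= cp <= 0xDFFF:
            continue
        ch = chr(cp)
        # Skip whitespace so the text offers no obvious split points.
        if ch.isspace():
            continue
        chars.append(ch)
    return "".join(chars)


if __name__ == "__main__":
    sample = worstcase_text(16, seed=42)
    print(len(sample), len(sample.encode("utf-8")))
```

Fixing the seed keeps the input stable between runs, which makes timings easier to compare.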

If we want to ensure it really is the worst case, I can try again to understand the regexes, or we could simply reject inputs that the regex can split (assuming most cannot be split).
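
The rejection approach could look roughly like the sketch below. The pattern is a deliberately simplified, hypothetical stand-in and NOT one of tiktoken's real splitting regexes (those use Unicode property classes that Python's built-in `re` module does not support). The helper names are made up for this example as well.

```python
import re

# HYPOTHETICAL stand-in pattern: runs of word characters, runs of
# punctuation/symbols, or runs of whitespace. Not tiktoken's real regex.
SPLIT_PATTERN = re.compile(r"\w+|[^\w\s]+|\s+")


def is_unsplittable(text: str) -> bool:
    """Return True if the pattern leaves the text as at most one piece."""
    pieces = SPLIT_PATTERN.findall(text)
    return len(pieces) <= 1


def keep_unsplittable(candidates):
    """Reject every candidate input that the pattern can split."""
    return [text for text in candidates if is_unsplittable(text)]


if __name__ == "__main__":
    print(keep_unsplittable(["abc", "abc def", "abc!!", "!!!"]))
```

With the real regex substituted in, only inputs that survive this filter would force the tokenizer to process the whole text as a single piece.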

I've also added the graph for this benchmark to the README.

hendrikvanantwerpen added 2 commits October 3, 2024 18:47

Add correctness test for o200k

79f0a02

Add worst-case benchmark and add results to README

af3f23c

hendrikvanantwerpen self-assigned this Oct 3, 2024

hendrikvanantwerpen requested a review from aneubeck October 3, 2024 16:53

aneubeck approved these changes Oct 4, 2024

hendrikvanantwerpen merged commit 1d02b2e into main Oct 4, 2024
3 checks passed

hendrikvanantwerpen deleted the add-worstcase-benchmark branch October 4, 2024 12:21
