Add figures to README
hendrikvanantwerpen committed Oct 1, 2024
1 parent 81a119f commit eaf4f7f
Showing 7 changed files with 30 additions and 4 deletions.
10 changes: 10 additions & 0 deletions crates/bpe/.gitignore
@@ -0,0 +1,10 @@
# Ignore benchmark results except the figures referenced in the README.
# Negated ignore patterns do not work for files inside a directory that is itself ignored.
# Therefore ignore using `**` and then negate the nested directories (but not the files inside).
/benches/result/**
!/benches/result/*/
!/benches/result/*/*/
# Negate the actual figures we want to keep.
!/benches/result/reports/counting-o200k/lines.svg
!/benches/result/reports/encoding-o200k/lines.svg
!/benches/result/reports/appending-o200k/lines.svg
18 changes: 18 additions & 0 deletions crates/bpe/README.md
@@ -198,3 +198,21 @@ As can be seen, our Backtracking implementation beats the TikToken Rust implementation
And even the fully dynamic programming solution is faster with a more consistent runtime.
The tuned heap implementation is still quite competitive with TikToken (especially for smaller inputs).
If the requirement of correct BPE output can be relaxed, then the Greedy approach or the minimal encoding approach is the clear winner.

### Counting results

Results for counting o200k tokens in random 10000-byte slices. The setup time of the interval encoder is comparable to that of the backtracking encoder. After setup, counting slices of the original data takes approximately constant time.

![Counting o200k tokens for random 10000 byte slices](./benches/result/reports/counting-o200k/lines.svg)
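
As a rough illustration of this counting workflow, the sketch below sets up the encoder once over the full input and then counts many sub-ranges. The `IntervalEncoding` type, its `new` constructor, its `count` method, and the module paths are assumed names for illustration only, not confirmed by this commit; obtaining the o200k `BytePairEncoding` instance is left to the caller.

```rust
use bpe::byte_pair_encoding::BytePairEncoding;
use bpe::interval_encoding::IntervalEncoding; // assumed module path

/// Count tokens for many slices of one document (hypothetical API names).
fn count_slices(bpe: &BytePairEncoding, text: &[u8]) {
    // One-time setup over the full input; cost comparable to a backtracking encode.
    let enc = IntervalEncoding::new(bpe, text);
    // After setup, counting a sub-range takes approximately constant time.
    for start in (0..text.len().saturating_sub(100)).step_by(1000) {
        let count = enc.count(start..start + 100);
        println!("bytes {}..{}: {} tokens", start, start + 100, count);
    }
}
```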

### Encoding results

Results for encoding o200k tokens for 1000 random bytes. The backtracking encoder consistently outperforms tiktoken by a constant factor.

![Encoding o200k tokens for 10000 random bytes](./benches/result/reports/encoding-o200k/lines.svg)
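
A single encode call might look like the following minimal sketch. The `encode_via_backtracking` method name, the `Vec<u32>` token type, and the module path are assumptions for illustration, not part of this commit.

```rust
use bpe::byte_pair_encoding::BytePairEncoding;

/// Encode a byte slice with the backtracking algorithm (assumed method name).
/// The result matches what the original left-to-right BPE merge procedure produces.
fn encode(bpe: &BytePairEncoding, input: &[u8]) -> Vec<u32> {
    bpe.encode_via_backtracking(input)
}
```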

### Incremental encoding results

Results for incrementally encoding o200k tokens by appending 10000 random bytes. The appending encoder is slower by a constant factor, but overall its performance curve is similar to that of the backtracking encoder encoding all the data at once.

![Incrementally encoding o200k tokens by appending 10000 random bytes](./benches/result/reports/appending-o200k/lines.svg)
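
A minimal sketch of this incremental usage, modeled on the `AppendableEncoder::new` and `extend` calls that appear in the benchmark change below; only the module paths are assumed.

```rust
use bpe::appendable_encoder::AppendableEncoder; // assumed module path
use bpe::byte_pair_encoding::BytePairEncoding;

/// Feed input chunk by chunk; per the constant-factor claim above, each
/// `extend` costs only a constant factor more than encoding in one pass.
fn encode_incrementally(bpe: &BytePairEncoding, chunks: &[&[u8]]) {
    let mut enc = AppendableEncoder::new(bpe);
    for chunk in chunks {
        enc.extend(chunk.iter().copied());
    }
}
```
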
4 changes: 1 addition & 3 deletions crates/bpe/benches/performance.rs
@@ -140,9 +140,7 @@ fn appending_benchmark(c: &mut Criterion) {
AppendableEncoder::new(bpe),
)
},
-            |(start, mut enc)| {
-                enc.extend(input[start..start + bytes].into_iter().copied())
-            },
+            |(start, mut enc)| enc.extend(input[start..start + bytes].into_iter().copied()),
criterion::BatchSize::SmallInput,
)
});
[Three SVG figures added under crates/bpe/benches/result/reports/ — appending-o200k/lines.svg, counting-o200k/lines.svg, and encoding-o200k/lines.svg — which the diff view cannot display.]
2 changes: 1 addition & 1 deletion crates/bpe/criterion.toml
@@ -1,2 +1,2 @@
# save report in this directory, even if a custom target directory is set
-criterion_home = "./target/criterion"
+criterion_home = "./benches/result"
