
Commit eaf4f7f

Add figures to README
1 parent 81a119f commit eaf4f7f

File tree: 7 files changed, +30 -4 lines changed


crates/bpe/.gitignore

Lines changed: 10 additions & 0 deletions
```diff
@@ -0,0 +1,10 @@
+# Ignore benchmark results except the figures referenced in the README.
+# Negated ignore patterns do not work for files inside a directory that is itself ignored.
+# Therefore ignore using `**` and then negate the nested directories (but not the files inside).
+/benches/result/**
+!/benches/result/*/
+!/benches/result/*/*/
+# Negate the actual figures we want to keep.
+!/benches/result/reports/counting-o200k/lines.svg
+!/benches/result/reports/encoding-o200k/lines.svg
+!/benches/result/reports/appending-o200k/lines.svg
```

crates/bpe/README.md

Lines changed: 18 additions & 0 deletions
```diff
@@ -198,3 +198,21 @@ As can be seen, our Backtracking implementation beats the TikToken Rust implementation
 And even the fully dynamic programming solution is faster with a more consistent runtime.
 The tuned heap implementation is still quite competitive to TikToken (especially for smaller inputs).
 If the requirement of correct BPE output can be relaxed, then the Greedy approach or the minimal encoding approach are the clear winners.
+
+### Counting results
+
+Results for counting o200k tokens for random 10000 byte slices. The setup time of the interval encoder is comparable to backtracking. After setup, counting slices of the original data is approximately constant time.
+
+![Counting o200k tokens for random 10000 byte slices](./benches/result/reports/counting-o200k/lines.svg)
+
+### Encoding results
+
+Results for encoding o200k tokens for 1000 random bytes. The backtracking encoder consistently outperforms tiktoken by a constant factor.
+
+![Encoding o200k tokens for 10000 random bytes](./benches/result/reports/encoding-o200k/lines.svg)
+
+### Incremental encoding results
+
+Results for incrementally encoding o200k tokens by appending 10000 random bytes. The appending encoder is slower by a constant factor, but overall has a similar performance curve to the backtracking encoder encoding all the data at once.
+
+![Incrementally encoding o200k tokens by appending 10000 random bytes](./benches/result/reports/appending-o200k/lines.svg)
```
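
The incremental encoder these results describe can be exercised directly. Below is a minimal sketch built from the `AppendableEncoder::new` and `extend` calls that appear in the benchmark diff further down; the module paths and the `token_count` accessor are assumptions about the crate's API, not confirmed by this commit.

```rust
// Minimal sketch, assuming `AppendableEncoder::new` and `extend` as seen in
// the benchmark below; module paths and `token_count` are assumed API.
use bpe::appendable_encoder::AppendableEncoder;
use bpe::byte_pair_encoding::BytePairEncoding;

fn count_tokens_incrementally(bpe: &BytePairEncoding, input: &[u8]) -> usize {
    let mut enc = AppendableEncoder::new(bpe);
    // The encoder keeps a valid BPE token sequence after every appended byte,
    // which is why appending is only a constant factor slower than encoding
    // the whole input at once.
    enc.extend(input.iter().copied());
    enc.token_count() // assumed accessor for the current number of tokens
}
```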

crates/bpe/benches/performance.rs

Lines changed: 1 addition & 3 deletions
```diff
@@ -140,9 +140,7 @@ fn appending_benchmark(c: &mut Criterion) {
                     AppendableEncoder::new(bpe),
                 )
             },
-            |(start, mut enc)| {
-                enc.extend(input[start..start + bytes].into_iter().copied())
-            },
+            |(start, mut enc)| enc.extend(input[start..start + bytes].into_iter().copied()),
             criterion::BatchSize::SmallInput,
         )
     });
```
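
For context, here is a self-contained sketch of the `iter_batched` pattern this hunk condenses: the setup closure builds a fresh encoder per batch and is not timed, while only the routine closure is measured. The function signature and benchmark name are illustrative assumptions; `AppendableEncoder` and `extend` are taken from the hunk above.

```rust
use bpe::appendable_encoder::AppendableEncoder;
use bpe::byte_pair_encoding::BytePairEncoding;
use criterion::{BatchSize, Criterion};

// Sketch of the benchmark pattern; `bpe` and `input` stand in for the real
// fixtures set up elsewhere in performance.rs.
fn appending_sketch(c: &mut Criterion, bpe: &BytePairEncoding, input: &[u8], bytes: usize) {
    c.bench_function("appending-o200k-sketch", |b| {
        b.iter_batched(
            // Setup (not timed): a start offset and a fresh appendable encoder.
            || (0usize, AppendableEncoder::new(bpe)),
            // Routine (timed): append `bytes` bytes to the encoder.
            |(start, mut enc)| enc.extend(input[start..start + bytes].iter().copied()),
            // SmallInput suits setups whose values are cheap to construct.
            BatchSize::SmallInput,
        )
    });
}
```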

crates/bpe/benches/result/reports/appending-o200k/lines.svg


crates/bpe/benches/result/reports/counting-o200k/lines.svg


crates/bpe/benches/result/reports/encoding-o200k/lines.svg


crates/bpe/criterion.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,2 +1,2 @@
 # save report in this directory, even if a custom target directory is set
-criterion_home = "./target/criterion"
+criterion_home = "./benches/result"
```
