perf(EntryMode::extract_from_bytes): add happy path check#2461
perf(EntryMode::extract_from_bytes): add happy path check#2461Sebastian Thiel (Byron) merged 1 commit intoGitoxideLabs:mainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b50a7e923
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "Codex (@codex) review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "Codex (@codex) address that feedback".
c889fcd to
4931a1e
Compare
Since the position of the space in the entrymode is often (always?) 6, we can add an explicit check for this case and skip some of the operations performed in the loop, making the benchmark a little faster.
afe3753 to
575e957
Compare
Sebastian Thiel (Byron)
left a comment
There was a problem hiding this comment.
Thanks so much, this is a massive 'relative' improvement visible particularly in the iterator case. And that will absolutely benefit the tree-lookup.
Gnuplot not found, using plotters backend
TreeRef() time: [60.083 ns 60.294 ns 60.504 ns]
change: [−3.3324% −2.9038% −2.4809%] (p = 0.00 < 0.05)
Performance has improved.
TreeRefIter() time: [19.623 ns 19.917 ns 20.246 ns]
change: [−42.220% −41.582% −40.877%] (p = 0.00 < 0.05)
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
1 (1.00%) low severe
1 (1.00%) low mild
1 (1.00%) high mild
10 (10.00%) high severe
The above was tested against main with pre-allocation improvements already merged.
In any case, I think it's well-worth the added complexity - this code is performance critical.
6abbe82
into
GitoxideLabs:main
Since the position of the space in the entrymode is often 6, we can add an explicit check for this case and skip some of the operations performed in the loop, making the benchmark a little faster.
Benches (
cargo bench --bench decode-objects -- TreeRef) when compared tomain:Current benchmark tree
Improvement is more marginal (but still present) with a less artificial tree:
Current HEAD^{tree}
Obviously the usefulness of this change relies on two things: the case of index 6 being the space is indeed the happy path (and from what I can find on the internet, that does seem to be the default case), and whether this micro-optimization is worth the increased code complexity.
Additionally, we can skip some subtraction & logic stuff if the octal value is computed immediately and used, which saves a few cycles.
Notes no-longer relevant commit (only improved perf on benchmark, but likely not on usual workloads)
The 2nd improvement (which is independent of the first) is the use of
iter().position()instead ofByteSlice::find_byteindecode::fast_entry. It yielded the following improvements (compared to only the happy-path fix) for meThis large a speedup was actually a little unexpected,
but as indicated in the commit message, I guess there's some "we were blocking the compiler from optimizing/vectorizing for us" that is now removed.. Looking at the compiler output in compiler explorer actually does not support this theory. I'm not entirely sure whatTREElooks like, but perhaps this is just a false positive: the names used in the benchmark are too small to benefit from thememchrimplementation thatfind_byteuses, so using a basic simple loop (which is what theiter().position()compiles to) is faster.