
perf(EntryMode::extract_from_bytes): add happy path check #2461

Merged
Sebastian Thiel (Byron) merged 1 commit into GitoxideLabs:main from datdenkikniet:mini-optimize
Mar 22, 2026

Conversation

@datdenkikniet
Contributor

@datdenkikniet datdenkikniet commented Mar 7, 2026

Since the position of the space in the entry mode is often 6, we can add an explicit check for this case and skip some of the operations performed in the loop, making the benchmark a little faster.
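To illustrate the idea (this is a hypothetical sketch with made-up names, not the actual `extract_from_bytes` code): a git tree entry starts with an ASCII-octal mode followed by a space, e.g. `100644 name\0<oid>`. Blob modes are six digits long, so the space usually sits at index 6, and checking that index first can skip the general scan:

```rust
/// Hypothetical sketch: return the index of the space that ends the mode.
/// Happy path first: six-digit modes ("100644", "100755", "120000", "160000")
/// put the space at index 6. Trees use the five-digit "40000", so we fall
/// back to a plain scan when the fast check fails.
fn mode_end(bytes: &[u8]) -> Option<usize> {
    if bytes.len() > 6 && bytes[6] == b' ' && bytes[..6].iter().all(u8::is_ascii_digit) {
        return Some(6);
    }
    // Slow path: scan for the space byte.
    bytes.iter().position(|&b| b == b' ')
}
```

The fast path does one bounds check, one byte comparison, and a short digit validation, rather than running the digit-accumulating loop byte by byte.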

Benches (cargo bench --bench decode-objects -- TreeRef) when compared to main:

Current benchmark tree
TreeRef()               time:   [91.447 ns 91.539 ns 91.631 ns]
                        change: [−4.4580% −4.3152% −4.1764%] (p = 0.00 < 0.05)
                        Performance has improved.

TreeRefIter()           time:   [34.566 ns 34.611 ns 34.661 ns]
                        change: [−15.910% −15.735% −15.567%] (p = 0.00 < 0.05)
                        Performance has improved.

Improvement is more marginal (but still present) with a less artificial tree:

Current HEAD^{tree}
TreeRef()               time:   [1.0033 µs 1.0041 µs 1.0050 µs]
                        change: [−9.9484% −9.7732% −9.5428%] (p = 0.00 < 0.05)
                        Performance has improved.

TreeRefIter()           time:   [899.42 ns 899.95 ns 900.56 ns]
                        change: [−4.7541% −4.5380% −4.3125%] (p = 0.00 < 0.05)
                        Performance has improved.

Obviously, the usefulness of this change hinges on two things: whether index 6 being the space really is the happy path (from what I can find on the internet, that does seem to be the default case), and whether this micro-optimization is worth the increased code complexity.

Additionally, we can skip some subtraction and bitwise operations if the octal value is computed immediately as the bytes are scanned, which saves a few cycles.
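A minimal sketch of that second idea (again with illustrative names, not the gitoxide source): instead of first locating the space and then converting the digit slice, accumulate the octal value in the same pass over the bytes.

```rust
/// Hypothetical sketch: parse the octal mode and return (mode, index of the
/// terminating space) in a single pass, folding the digit-to-value
/// conversion into the scan instead of doing it as a separate step.
fn parse_mode(bytes: &[u8]) -> Option<(u16, usize)> {
    let mut mode = 0u16;
    for (i, &b) in bytes.iter().enumerate() {
        match b {
            // Accumulate: shift by one octal digit and add the new one.
            b'0'..=b'7' => mode = (mode << 3) + (b - b'0') as u16,
            // A space after at least one digit ends the mode.
            b' ' if i > 0 => return Some((mode, i)),
            _ => return None,
        }
    }
    None
}
```

All git tree modes fit in a `u16` (the largest, `0o160000`, is below `u16::MAX`), so no wider accumulator is needed.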

Note: the following describes a no-longer-relevant commit (it only improved performance on the benchmark, but likely not on usual workloads).

The 2nd improvement (which is independent of the first) is the use of iter().position() instead of ByteSlice::find_byte in decode::fast_entry. It yielded the following improvements for me (compared to only the happy-path fix):

TreeRef()               time:   [84.198 ns 84.299 ns 84.405 ns]
                        change: [−13.030% −12.865% −12.676%] (p = 0.00 < 0.05)
                        Performance has improved.

TreeRefIter()           time:   [26.710 ns 26.887 ns 27.067 ns]
                        change: [−35.780% −35.469% −35.121%] (p = 0.00 < 0.05)
                        Performance has improved.

This large a speedup was a little unexpected. As indicated in the commit message, my initial guess was that we had been blocking the compiler from optimizing/vectorizing for us, but looking at the output in Compiler Explorer does not actually support this theory. I'm not entirely sure what TREE looks like, but perhaps this is just a false positive: the names used in the benchmark are too small to benefit from the memchr implementation that find_byte uses, so a basic loop (which is what iter().position() compiles to) is faster.
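The trade-off described above can be shown in isolation (a sketch, not the decode::fast_entry code; bstr's ByteSlice::find_byte delegates to the memchr crate, which has per-call setup cost that only pays off on longer haystacks):

```rust
/// Hypothetical stand-in for the iter().position() variant: a plain linear
/// scan with no SIMD setup, which the compiler lowers to a simple byte loop.
/// For the short name fields in a tree entry this can beat memchr-backed
/// searches, even though memchr wins on long haystacks.
fn simple_find(haystack: &[u8], needle: u8) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}
```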


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b50a7e923


@datdenkikniet force-pushed the mini-optimize branch 3 times, most recently from c889fcd to 4931a1e on March 7, 2026 at 10:40
Since the position of the space in the entrymode is often
(always?) 6, we can add an explicit check for this case
and skip some of the operations performed in the loop,
making the benchmark a little faster.
Member

@Byron Sebastian Thiel (Byron) left a comment


Thanks so much, this is a massive 'relative' improvement visible particularly in the iterator case. And that will absolutely benefit the tree-lookup.

Gnuplot not found, using plotters backend
TreeRef()               time:   [60.083 ns 60.294 ns 60.504 ns]
                        change: [−3.3324% −2.9038% −2.4809%] (p = 0.00 < 0.05)
                        Performance has improved.

TreeRefIter()           time:   [19.623 ns 19.917 ns 20.246 ns]
                        change: [−42.220% −41.582% −40.877%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  10 (10.00%) high severe

The above was tested against main with pre-allocation improvements already merged.

In any case, I think it's well worth the added complexity - this code is performance critical.

@Byron Sebastian Thiel (Byron) merged commit 6abbe82 into GitoxideLabs:main Mar 22, 2026
55 of 58 checks passed
