librz/util/vector: minor RzVector/RzPVector performance optimizations#6467
Open
notxvilka wants to merge 5 commits into
Open
librz/util/vector: minor RzVector/RzPVector performance optimizations#6467notxvilka wants to merge 5 commits into
notxvilka wants to merge 5 commits into
Conversation
vector_quick_sort allocated its two element-sized scratch buffers (t and pivot) with malloc/free on every recursive call. For a vector of n elements the sort makes O(n) recursive calls, i.e. O(n) malloc/free pairs purely for scratch space, and each call could also fail half-way through the sort. Split the function into a small entry point that allocates the two buffers once and a recursive worker that receives them as scratch. The buffers are reused across the whole recursion (each partition step finishes using them before recursing, and the recursion is sequential, so sharing one pair is safe). Small elements -- the common case, including every RzPVector-backed sort -- use stack buffers and allocate nothing at all; only elements larger than 256 bytes fall back to a single heap allocation for the whole sort. The element movement and rand()-based pivot selection are unchanged, so the result is identical for any input (verified byte-for-byte against the previous implementation for ascending and descending orders over many random arrays).
The partition loop tested the element against the pivot with two separate
calls to the comparator:
if ((cmp(VEC_INDEX(a, i), pivot, user) < 0 && !reverse) ||
(cmp(VEC_INDEX(a, i), pivot, user) > 0 && reverse)) {
Because cmp is an opaque function pointer the compiler cannot common up the
two calls, so depending on the result and the reverse flag the comparator was
invoked up to twice per element. Compute the result once into a local and test
that:
int c = cmp(VEC_INDEX(a, i), pivot, user);
if ((c < 0 && !reverse) || (c > 0 && reverse)) {
This halves comparator calls in the worst case and is a clear win whenever the
comparator is non-trivial (the common case for struct elements). Measured on a
shared host: ~12-14% faster for int sorting and ~30% faster with a moderately
expensive comparator. The ordering is unchanged (verified byte-for-byte).
The index of the located slot was computed as
size_t index = (el - (void **)vec->v.a) * sizeof(void **) / vec->v.elem_size;
For an RzPVector the element size is always sizeof(void *), so the
`* sizeof(void **) / vec->v.elem_size` factor is identically 1 and the pointer
difference `el - (void **)vec->v.a` already yields the index directly. Drop the
redundant scaling, which removes a multiply and a divide and makes the intent
clear. Behaviour is unchanged.
The existing sort tests only sort 4-5 small elements and there was no test for
rz_pvector_remove_data. Add coverage for the code paths exercised by the sort
changes and the remove_data cleanup:
- test_vector_sort_large sort 2000 heavily-duplicated ut32 values
ascending and descending, verifying the result
is ordered and a permutation of the input (vs a
reference qsort). Drives the recursion deeply
and the shared scratch buffers.
- test_vector_sort_large_elem sort 400 elements of 304 bytes each, taking the
heap-allocated scratch fallback, and check the
full payload (not just the key) stays consistent
through all the element moves.
- test_pvector_remove_data remove interior, first and last elements by
value while preserving order, and confirm
removing an absent value is a no-op.
All pass on both the previous and the optimized implementation (the sort and
remove_data changes are behaviour-preserving).
bench_vector.c benchmarked only remove_at and swap. Add sort benchmarks so the
suite covers the functions touched by the sort optimizations and can be run
against the old and new librz for before/after numbers:
- rz_vector_sort over 4k ut64 with a cheap comparator
- rz_vector_sort over 4k ut64 with a deliberately expensive comparator
(shows the effect of evaluating the comparator once per element)
- rz_pvector_sort over 4k pointers (reference; pvector sort is unchanged)
Each iteration refills the buffer from an unsorted master copy via a single
memcpy before sorting; that overhead is identical across builds so the measured
delta reflects the sort.
Codecov Report❌ Patch coverage is
Additional details and impacted files
... and 14 files with indirect coverage changes Continue to review full report in Codecov by Harness.
🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Your checklist for this pull request
RZ_APIfunction and struct this PR changes.RZ_API).Detailed description
Only self-contained and easy to review changes with measurable benefits.
This PR speeds up
rz_vector_sortand removes a per-call allocation and a redundantdouble comparator-evaluation from the quicksort hot loop. It also simplifies the index
math in
rz_pvector_remove_data. All changes are behaviour-preserving — the sortproduces byte-identical output to the previous implementation for the same input.
rz_vector_sortis ~1.3–1.4× faster for small/medium elements, driven mainly bycalling the comparator once per element instead of up to twice. Large (> 256-byte)
elements are dominated by per-element
memcpyand are essentially unchanged.rz_vector_sort, int, cheap comparatorrz_vector_sort, int, expensive comparatorrz_vector_sort, 304-byte elements (heap scratch)rz_pvector_sortTest plan
Closing issues
Partially addresses #6096