Memoize name_depth to speed up resolution sorting by 3x #654
Open
Conversation
Force-pushed from 8f44afe to 9963652
st0012 commented on Mar 10, 2026
```diff
 // When the depth is the same, sort by URI and offset to maintain determinism
-definitions.sort_by(|(_, (name_a, uri_a, offset_a)), (_, (name_b, uri_b, offset_b))| {
+definitions.sort_unstable_by(|(_, (name_a, uri_a, offset_a)), (_, (name_b, uri_b, offset_b))| {
     (Self::name_depth(name_a, names), uri_a, offset_a).cmp(&(Self::name_depth(name_b, names), uri_b, offset_b))
```
This change is not required for the rest of the PR, but since no two entries can share the same (depth, uri, offset) key, we can use an unstable sort for a bit of a speedup.
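The safety argument can be checked in a toy example. This is an illustrative sketch (the tuples below are made up, not from the PR): when the full comparison key is unique per element, a stable and an unstable sort must produce identical output, so the cheaper `sort_unstable_by` is safe.

```rust
fn main() {
    // (depth, uri, offset) triples; no two elements share the same full key.
    let mut defs = vec![(2u32, "b.rb", 10u32), (1, "a.rb", 5), (2, "a.rb", 3), (1, "a.rb", 2)];
    let stable = {
        let mut v = defs.clone();
        v.sort_by(|a, b| a.cmp(b));
        v
    };

    // Unstable sort: no allocation, typically faster, and identical
    // output here because equal keys never occur.
    defs.sort_unstable_by(|a, b| a.cmp(b));
    assert_eq!(defs, stable);
    println!("{:?}", defs); // prints [(1, "a.rb", 2), (1, "a.rb", 5), (2, "a.rb", 3), (2, "b.rb", 10)]
}
```

Stability only matters when two elements compare equal; with a unique key that case never arises, so the two sorts are observationally equivalent.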
st0012 commented on Mar 10, 2026
```diff
 // Sort constant references based on their name complexity so that simpler names are always first
-const_refs.sort_by(|(_, (name_a, uri_a, offset_a)), (_, (name_b, uri_b, offset_b))| {
+const_refs.sort_unstable_by(|(_, (name_a, uri_a, offset_a)), (_, (name_b, uri_b, offset_b))| {
     (Self::name_depth(name_a, names), uri_a, offset_a).cmp(&(Self::name_depth(name_b, names), uri_b, offset_b))
```
st0012 added a commit that referenced this pull request on Mar 11, 2026
## Summary

- Add `[profile.profiling]` to `rust/Cargo.toml`: inherits release optimizations (LTO, opt-level 3, single codegen unit) with debug symbols enabled for readable flamegraphs
- Add `.claude/skills/profiling/SKILL.md`: a project-specific skill for profiling the Rubydex indexer

## What the skill covers

- **samply** for interactive CPU flamegraphs (Firefox Profiler in browser)
- **macOS `sample`** for text-based call trees (non-interactive/agent use)
- Phase isolation with `--stop-after` and `--stats`
- How to read profiles (self-time, concentration vs. spread, allocation pressure)
- Memory profiling with `utils/mem-use`
- Before/after comparison workflow with delta tables
- Troubleshooting (permissions, missing debug symbols, run variance)

## Eval results

Ran the skill in a fresh session. It successfully guided profiling end-to-end and identified the dominant bottleneck:

```
Profiling Results Summary
┌────────────┬─────────┬───────┐
│ Phase      │ Time    │ %     │
├────────────┼─────────┼───────┤
│ Listing    │ 0.67s   │ 1.3%  │
│ Indexing   │ 10.0s   │ 19.0% │
│ Resolution │ 41.3s   │ 78.4% │
│ Querying   │ 0.70s   │ 1.3%  │
├────────────┼─────────┼───────┤
│ Total      │ 52.7s   │       │
│ Memory     │ 4756 MB │       │
└────────────┴─────────┴───────┘

The Bottleneck: name_depth in sorting (100% of sampled resolution time)
┌───────────────────────────────────┬──────────────┬───────┐
│ Function                          │ Self Samples │ %     │
├───────────────────────────────────┼──────────────┼───────┤
│ name_depth (nesting closure)      │ 12,536       │ 59.6% │
│ name_depth (parent_scope closure) │ 5,984        │ 28.5% │
│ quicksort internals               │ 1,447        │ 6.9%  │
│ memcmp                            │ 884          │ 4.2%  │
└───────────────────────────────────┴──────────────┴───────┘
```

88% of all sampled time is spent in `name_depth`, called from `sort_by` in `prepare_units`. The function recursively walks `parent_scope` and `nesting` chains with zero memoization, and the sort invokes it afresh on every one of its O(n log n) comparisons.

This finding led directly to #654 (3x resolution speedup via memoized depth computation).
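The `[profile.profiling]` entry described in the summary above might look like the following. This is a minimal guess based on the stated requirements (release optimizations plus debug symbols); the actual settings in the PR may differ.

```toml
# Custom Cargo profile for profiling: release-level optimization,
# but with debug info so flamegraphs show readable symbol names.
[profile.profiling]
inherits = "release"
debug = true
```

With `inherits = "release"`, any LTO, `opt-level`, and `codegen-units` settings from the release profile carry over, and `debug = true` re-enables the symbol information that release builds normally strip.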
Pre-compute name depths for all names in a single pass before sorting, eliminating redundant recursive walks during O(n log n) comparisons.

Previously, `name_depth` was called from the sort comparator on every comparison, each time recursively walking `parent_scope` and `nesting` chains to the root. With ~880K names and deep hierarchies (up to 130 levels), this was the dominant bottleneck: 88% of sampled resolution time was spent in `name_depth` closures.

The fix computes depths once into an `IdentityHashMap<NameId, u32>` cache, then uses direct lookups in the sort comparators. It also switches to `sort_unstable_by`, since the full `(depth, uri, offset)` key provides deterministic ordering without needing stability.
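The single-pass memoization described above can be sketched as follows. This is a hypothetical illustration, not the PR's code: a plain `HashMap` and a toy parent table stand in for the `IdentityHashMap<NameId, u32>` and the real name graph, and the walk stops early at any ancestor whose depth is already cached.

```rust
use std::collections::HashMap;

fn main() {
    // parents[i] = Some(j) means name i is nested directly under name j.
    let parents: Vec<Option<usize>> = vec![None, Some(0), Some(1), Some(0), Some(2)];

    // Single pass: compute every depth exactly once.
    let mut depths: HashMap<usize, u32> = HashMap::new();
    for id in 0..parents.len() {
        // Walk up until we hit the root or an already-cached ancestor,
        // remembering the uncached nodes we passed through.
        let (mut chain, mut cur) = (Vec::new(), id);
        while !depths.contains_key(&cur) {
            chain.push(cur);
            match parents[cur] {
                Some(p) => cur = p,
                None => break,
            }
        }
        // Unwind: assign depths from the cached ancestor (or root) down.
        let mut d = depths.get(&cur).copied().unwrap_or(0);
        for &n in chain.iter().rev() {
            if parents[n].is_some() {
                d += 1;
            }
            depths.insert(n, d);
        }
    }

    // The comparator is now an O(1) lookup instead of a recursive walk;
    // including the id keeps the key unique, so unstable sort is safe.
    let mut names: Vec<usize> = (0..parents.len()).collect();
    names.sort_unstable_by_key(|&id| (depths[&id], id));
    println!("{:?}", names); // prints [0, 1, 3, 2, 4]
}
```

Because each node's depth is written to the cache the first time it is visited, the total work is O(n) over all names rather than a fresh root walk per comparison.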
Force-pushed from 9963652 to 37fbd24
Summary

- Pre-compute `name_depth` for all names into an `IdentityHashMap<NameId, u32>` cache before sorting, eliminating redundant recursive walks during O(n log n) comparisons
- Switch `sort_by` to `sort_unstable_by`, since the full `(depth, uri, offset)` key provides deterministic ordering without needing stability

Problem
Profiling with `samply` revealed that 88% of sampled resolution time was spent inside `name_depth` closures during `prepare_units` sorting. The function recursively walks `parent_scope` and `nesting` chains to compute depth, and was called from the sort comparator on every comparison, recomputing the same depths millions of times with no memoization.

Note: the depth sort is a correctness requirement, not just an optimization. Removing it entirely causes 13 test failures; the resolution loop's `made_progress` check gives up when children are processed before parents.

Fix
Compute depths once for all names in a single memoized pass, then use O(1) lookups in the sort comparators.
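A minimal sketch of that comparator shape, with hypothetical name ids and a plain `HashMap` standing in for the real cache (the entries and values below are illustrative, not from the PR):

```rust
use std::collections::HashMap;

fn main() {
    // Pre-computed once, before sorting: name_id -> depth.
    let depths: HashMap<u32, u32> = HashMap::from([(10, 1), (11, 3), (12, 2)]);

    // Each entry: (name_id, uri, offset).
    let mut refs = vec![(11u32, "b.rb", 4u32), (10, "a.rb", 1), (12, "a.rb", 7)];

    // O(1) cache lookup per comparison instead of a recursive walk;
    // the full (depth, uri, offset) key makes unstable sorting safe.
    refs.sort_unstable_by(|(na, ua, oa), (nb, ub, ob)| {
        (depths[na], ua, oa).cmp(&(depths[nb], ub, ob))
    });
    println!("{:?}", refs); // prints [(10, "a.rb", 1), (12, "a.rb", 7), (11, "b.rb", 4)]
}
```

The comparator's structure is unchanged from the diffs above; only the depth computation moves out of the hot loop into the lookup table.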
Benchmark
Resolution went from 50.2s → 16.7s (3x speedup). Total indexing time cut in half. Output is identical (same counts, same orphan rate), confirming correctness.