Memoize name_depth to speed up resolution sorting by 3x #654

Open
st0012 wants to merge 1 commit into main from optimize-name-depth-sorting

Conversation


@st0012 st0012 commented Mar 10, 2026

Summary

  • Pre-compute name_depth for all names into an IdentityHashMap<NameId, u32> cache before sorting, eliminating redundant recursive walks during O(n log n) comparisons
  • Switch sort_by to sort_unstable_by since the full (depth, uri, offset) key provides deterministic ordering without needing stability

Problem

Profiling with samply revealed that 88% of sampled resolution time was spent inside name_depth closures during prepare_units sorting. The function recursively walks parent_scope and nesting chains to compute depth, and was called from the sort comparator on every comparison — recomputing the same depths millions of times with no memoization.
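The shape of the problem can be sketched in a few lines. `NameId`, `Name`, and a plain `HashMap` below are hypothetical stand-ins for the crate's real types, which are not shown in this PR:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the crate's real arena types.
type NameId = u32;

struct Name {
    parent_scope: Option<NameId>,
}

// Pre-fix shape: every call re-walks the parent chain all the way to the
// root, and the sort comparator invokes this twice per comparison.
fn name_depth(id: NameId, names: &HashMap<NameId, Name>) -> u32 {
    match names[&id].parent_scope {
        Some(parent) => 1 + name_depth(parent, names),
        None => 0,
    }
}
```

With ~880K names and O(n log n) comparisons, each of which triggers two of these full walks, the redundant work dominates the profile.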

Note: the depth sort is a correctness requirement, not just an optimization. Removing it entirely causes 13 test failures — the resolution loop's made_progress check gives up when children are processed before parents.

Fix

Compute depths once for all names in a single memoized pass, then use O(1) lookups in the sort comparators.
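A minimal sketch of that pass, using a plain `HashMap` where the PR uses an `IdentityHashMap<NameId, u32>` (the `Name` type here is a hypothetical stand-in for the real one):

```rust
use std::collections::HashMap;

type NameId = u32;

struct Name {
    parent_scope: Option<NameId>,
}

// Resolve one depth, caching every intermediate result so each parent
// chain is walked at most once across the entire pass.
fn depth_of(id: NameId, names: &HashMap<NameId, Name>, cache: &mut HashMap<NameId, u32>) -> u32 {
    if let Some(&d) = cache.get(&id) {
        return d;
    }
    let d = match names[&id].parent_scope {
        Some(parent) => 1 + depth_of(parent, names, cache),
        None => 0,
    };
    cache.insert(id, d);
    d
}

// Single pass before sorting; afterwards the sort comparators only do
// O(1) cache lookups instead of recursive walks.
fn depth_cache(names: &HashMap<NameId, Name>) -> HashMap<NameId, u32> {
    let mut cache = HashMap::new();
    for &id in names.keys() {
        depth_of(id, names, &mut cache);
    }
    cache
}
```

This turns the total depth work from O(comparisons × chain length) into O(number of names), since every memoized entry is computed exactly once.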

Benchmark

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| Resolution | 50.2s (79.4%) | 16.7s (57.7%) | -66.7% |
| Total | 63.2s | 29.0s | -54.1% |
| Listing | 0.76s | 0.80s | ~same |
| Indexing | 11.5s | 10.8s | ~same |
| Querying | 0.72s | 0.66s | ~same |
| Memory (RSS) | 4863 MB | 4833 MB | ~same |
| Declarations | 879,648 | 879,648 | identical |
| Definitions | 1,043,725 | 1,043,725 | identical |

Resolution went from 50.2s → 16.7s (3x speedup). Total indexing time cut in half. Output is identical (same counts, same orphan rate), confirming correctness.

@st0012 st0012 force-pushed the optimize-name-depth-sorting branch 3 times, most recently from 8f44afe to 9963652 on March 10, 2026 22:14
@st0012 st0012 self-assigned this Mar 10, 2026
```diff
  // When the depth is the same, sort by URI and offset to maintain determinism
- definitions.sort_by(|(_, (name_a, uri_a, offset_a)), (_, (name_b, uri_b, offset_b))| {
+ definitions.sort_unstable_by(|(_, (name_a, uri_a, offset_a)), (_, (name_b, uri_b, offset_b))| {
      (Self::name_depth(name_a, names), uri_a, offset_a).cmp(&(Self::name_depth(name_b, names), uri_b, offset_b))
```
Member Author


This change is not required for the rest of the PR, but since no two entries can share the same (depth, uri, offset) key, we can use an unstable sort for a bit of extra speedup.
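The reasoning can be checked with a small self-contained example; plain tuples stand in for the real entries here:

```rust
/// Returns the entries sorted by the full (depth, uri, offset) key.
/// Because that key is unique per entry, no two elements ever compare
/// equal, so an unstable sort cannot reorder "ties" and its output is
/// identical to a stable sort's.
fn sort_entries(mut items: Vec<(u32, &'static str, u32)>) -> Vec<(u32, &'static str, u32)> {
    items.sort_unstable_by(|a, b| a.cmp(b));
    items
}
```

`sort_unstable_by` avoids the allocation and extra moves a stable merge sort needs, which is where the small speedup comes from.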

// Sort constant references based on their name complexity so that simpler names are always first
const_refs.sort_by(|(_, (name_a, uri_a, offset_a)), (_, (name_b, uri_b, offset_b))| {
(Self::name_depth(name_a, names), uri_a, offset_a).cmp(&(Self::name_depth(name_b, names), uri_b, offset_b))
const_refs.sort_unstable_by(|(_, (name_a, uri_a, offset_a)), (_, (name_b, uri_b, offset_b))| {
Member Author


Same as the above

@st0012 st0012 marked this pull request as ready for review March 10, 2026 23:00
@st0012 st0012 requested a review from a team as a code owner March 10, 2026 23:00
st0012 added a commit that referenced this pull request Mar 11, 2026
## Summary

- Add `[profile.profiling]` to `rust/Cargo.toml` — inherits release
optimizations (LTO, opt-level 3, single codegen unit) with debug symbols
enabled for readable flamegraphs
- Add `.claude/skills/profiling/SKILL.md` — project-specific skill for
profiling the Rubydex indexer

## What the skill covers

- **samply** for interactive CPU flamegraphs (Firefox Profiler in
browser)
- **macOS `sample`** for text-based call trees (non-interactive/agent
use)
- Phase isolation with `--stop-after` and `--stats`
- How to read profiles (self-time, concentration vs. spread, allocation
pressure)
- Memory profiling with `utils/mem-use`
- Before/after comparison workflow with delta tables
- Troubleshooting (permissions, missing debug symbols, run variance)

## Eval results

Ran the skill in a fresh session. It successfully guided profiling
end-to-end and identified the dominant bottleneck:

```
Profiling Results

Summary
┌────────────┬─────────┬───────┐
│   Phase    │  Time   │   %   │
├────────────┼─────────┼───────┤
│ Listing    │ 0.67s   │ 1.3%  │
├────────────┼─────────┼───────┤
│ Indexing   │ 10.0s   │ 19.0% │
├────────────┼─────────┼───────┤
│ Resolution │ 41.3s   │ 78.4% │
├────────────┼─────────┼───────┤
│ Querying   │ 0.70s   │ 1.3%  │
├────────────┼─────────┼───────┤
│ Total      │ 52.7s   │       │
├────────────┼─────────┼───────┤
│ Memory     │ 4756 MB │       │
└────────────┴─────────┴───────┘

The Bottleneck: name_depth in sorting (100% of sampled resolution time)

┌───────────────────────────────────┬──────────────┬───────┐
│             Function              │ Self Samples │   %   │
├───────────────────────────────────┼──────────────┼───────┤
│ name_depth (nesting closure)      │ 12,536       │ 59.6% │
├───────────────────────────────────┼──────────────┼───────┤
│ name_depth (parent_scope closure) │ 5,984        │ 28.5% │
├───────────────────────────────────┼──────────────┼───────┤
│ quicksort internals               │ 1,447        │ 6.9%  │
├───────────────────────────────────┼──────────────┼───────┤
│ memcmp                            │ 884          │ 4.2%  │
└───────────────────────────────────┴──────────────┴───────┘

88% of all sampled time is spent in name_depth, called from sort_by
in prepare_units. The function recursively walks parent_scope and
nesting chains with zero memoization — and the sort invokes it on
every one of its O(n log n) comparisons.
```

This finding led directly to #654 (3x resolution speedup via memoized
depth computation).

Pre-compute name depths for all names in a single pass before sorting,
eliminating redundant recursive walks during O(n log n) comparisons.

Previously, name_depth was called from the sort comparator for every
comparison, each time recursively walking parent_scope and nesting chains
to the root. With ~880K names and deep hierarchies (up to 130 levels),
this was the dominant bottleneck: 88% of sampled resolution time was
spent in name_depth closures.

The fix computes depths once into an IdentityHashMap<NameId, u32> cache,
then uses direct lookups in the sort comparators. Also switches to
sort_unstable_by since the full (depth, uri, offset) key provides
deterministic ordering without needing stability.
@st0012 st0012 force-pushed the optimize-name-depth-sorting branch from 9963652 to 37fbd24 on March 11, 2026 22:39