Commit 67dbca7

Add profiling skill for CPU flamegraphs and performance analysis
Project-specific skill that guides profiling the Rubydex indexer using samply (interactive Firefox Profiler) and macOS sample (text-based). Covers build setup, phase isolation, flamegraph analysis, memory profiling, and before/after comparison workflows.
1 parent d7ffc08 commit 67dbca7

1 file changed

Lines changed: 272 additions & 0 deletions

.claude/skills/profiling/SKILL.md

@@ -0,0 +1,272 @@
---
name: profiling
description: >
  Profile Rubydex indexer performance: CPU flamegraphs, memory usage, phase-level timing.
  Use this skill whenever the user mentions profiling, performance, flamegraphs, benchmarking,
  "why is X slow", bottlenecks, hot paths, memory usage, or wants to understand where time
  is spent during indexing/resolution. Also trigger when comparing performance before/after
  a change.
---

# Profiling Rubydex

This skill helps you profile the Rubydex indexer to find CPU and memory bottlenecks.
The indexer has a multi-phase pipeline (listing → indexing → resolution → querying), and
resolution typically dominates (~80% of total time on large codebases). Profiling helps you
see *inside* each phase to find what's actually expensive.

## Profiling tool: samply

Use **samply**, a sampling profiler that opens results in Firefox Profiler (in-browser).
It captures call stacks at high frequency and produces interactive flamegraphs with filtering,
timeline views, and per-function cost breakdowns.

Install if needed:

```bash
cargo install samply
```
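
To make "install if needed" literal in a script, the install can be guarded by a PATH check. This is a minimal sketch; the function name `ensure_samply` is hypothetical and not part of the repository.

```bash
# Install samply only when it is not already available on PATH.
ensure_samply() {
  if command -v samply >/dev/null 2>&1; then
    echo "samply already installed"
  else
    cargo install samply
  fi
}
```

Call `ensure_samply` once at the top of a profiling script, before any `samply record` invocation.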
29+
30+
## Build profile
31+
32+
Profiling needs optimized code *with* debug symbols so you get real function names in the
33+
flamegraph instead of mangled addresses. The workspace Cargo.toml has a custom profile for this:
34+
35+
```toml
36+
# rust/Cargo.toml
37+
[profile.profiling]
38+
inherits = "release"
39+
debug = true # Full debug symbols for readable flamegraphs
40+
strip = false # Keep symbols in the binary
41+
```
42+
43+
If this profile doesn't exist yet, **add it** to `rust/Cargo.toml` before profiling. The
44+
release profile uses `lto = true`, `opt-level = 3`, `codegen-units = 1` — the profiling
45+
profile inherits all of that and just adds debug info.
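
The "add it if missing" step can be automated. The sketch below appends the same profile section shown above only when the manifest does not already contain it; the function name `ensure_profiling_profile` is hypothetical, and it assumes the section header appears at the start of a line.

```bash
# Append the [profile.profiling] section to a Cargo.toml unless it is already there.
# Usage: ensure_profiling_profile rust/Cargo.toml
ensure_profiling_profile() {
  manifest="$1"
  if grep -q '^\[profile\.profiling\]' "$manifest"; then
    echo "profiling profile already present in $manifest"
    return 0
  fi
  cat >> "$manifest" <<'EOF'

[profile.profiling]
inherits = "release"
debug = true   # Full debug symbols for readable flamegraphs
strip = false  # Keep symbols in the binary
EOF
  echo "added profiling profile to $manifest"
}
```

Running it twice is safe: the second call detects the existing section and leaves the file untouched.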

Build with:

```bash
cargo build --profile profiling
```

The binary lands at `rust/target/profiling/rubydex_cli` (not `target/release/`).

The first build is slow (LTO + single codegen unit must recompile everything). Subsequent
builds after small changes are faster since Cargo caches intermediate artifacts in
`target/profiling/`. Don't delete that directory between runs.

## Running a profile

### Full pipeline

```bash
samply record rust/target/profiling/rubydex_cli <TARGET_PATH> --stats
```

The `--stats` flag prints the timing breakdown and memory stats to stderr after completion,
so you get both the samply profile AND the summary stats in one run.

Useful samply flags:
- `--no-open`: don't auto-open the browser (useful for scripted runs)
- `--save-only`: save the profile to disk without starting the local server; load later
  with `samply load <profile.json>`
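
For scripted or agent-driven runs, `--save-only` can be wrapped in a small helper. This is a sketch under assumptions: it relies on samply's `-o`/`--output` option to choose the file name, and both the function name `record_profile` and the default path under `/tmp` are hypothetical.

```bash
# Record a profile to disk without opening the browser or starting the local server.
# Usage: record_profile <TARGET_PATH> [output-file]
record_profile() {
  target="$1"
  out="${2:-/tmp/rubydex-profile.json}"
  samply record --save-only -o "$out" \
    rust/target/profiling/rubydex_cli "$target" --stats
}

# Later, inspect the saved profile interactively:
#   samply load /tmp/rubydex-profile.json
```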

### Isolating a phase

Use `--stop-after` to profile only up to a specific stage. This is useful when you want
a cleaner flamegraph focused on one phase without the noise of later stages:

```bash
# Profile only listing + indexing (skip resolution)
samply record rust/target/profiling/rubydex_cli <TARGET_PATH> --stats --stop-after indexing

# Profile through resolution (skip querying)
samply record rust/target/profiling/rubydex_cli <TARGET_PATH> --stats --stop-after resolution
```

Valid `--stop-after` values: `listing`, `indexing`, `resolution`.
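
To compare phases side by side, one profile can be recorded per stage. The loop below is a sketch: the function name `profile_phases` and the output file names are hypothetical, and it assumes samply's `-o`/`--output` option alongside the `--save-only` flag described earlier.

```bash
# Record one saved profile per --stop-after stage, so each flamegraph ends at that phase.
# Usage: profile_phases <TARGET_PATH>
profile_phases() {
  target="$1"
  for stage in listing indexing resolution; do
    samply record --save-only -o "/tmp/rubydex-$stage.json" \
      rust/target/profiling/rubydex_cli "$target" --stats --stop-after "$stage"
  done
}
```

Each run repeats the earlier stages, so the total cost is higher than a single full-pipeline profile.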

### Common target paths

The user should have a `DEFAULT_BENCH_WORKSPACE` configured pointing to a target codebase.

For synthetic corpora, use `utils/bench` with size arguments (tiny/small/medium/large/huge),
which auto-generates corpora at `../rubydex_corpora/<size>/`.
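
When a script needs a concrete `<TARGET_PATH>`, it can fall back to the configured workspace. This sketch assumes `DEFAULT_BENCH_WORKSPACE` is exposed as an environment variable, which the text above does not confirm; the function name `resolve_target` is hypothetical.

```bash
# Pick the directory to profile: explicit argument first, then DEFAULT_BENCH_WORKSPACE.
resolve_target() {
  target="${1:-${DEFAULT_BENCH_WORKSPACE:-}}"
  if [ -z "$target" ]; then
    echo "no target: pass a path or set DEFAULT_BENCH_WORKSPACE" >&2
    return 1
  fi
  if [ ! -d "$target" ]; then
    echo "not a directory: $target" >&2
    return 1
  fi
  echo "$target"
}
```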

## Reading the results

When samply finishes, it automatically opens Firefox Profiler in the browser. Key things
to guide the user through:

### Firefox Profiler tips

1. **Call Tree tab**: shows cumulative time per function, sorted by total cost. Start here
   to find the most expensive call paths.

2. **Flame Graph tab**: visual representation where width = time. Look for wide bars; those
   are the hot functions. Click to zoom in.

3. **Timeline**: shows activity over time. Useful for spotting if one phase is unexpectedly
   long or if there are idle gaps.

4. **Filtering**: type a function name in the filter box to isolate it. Useful for focusing
   on resolution internals like `resolve_all`, `resolve_name`, `linearize_ancestors`.

5. **Transform > Focus on subtree**: right-click a function to see only its callees. Useful
   for drilling into `resolution` to see what its time is actually spent on.

6. **Transform > Collapse recursion**: fold recursive calls into a single frame to see
   aggregate cost. (**Merge function** instead removes a function from the tree and
   attributes its time to its callers.)

### Text-based profiling with `sample` (macOS)

When you can't interact with a browser (e.g., running from a script or agent), use macOS's
built-in `sample` command for a text-based call tree:

```bash
# Start the indexer in the background, then sample it (default duration: 10 seconds)
rust/target/profiling/rubydex_cli <TARGET_PATH> --stats &
PID=$!
sample "$PID" -f /tmp/rubydex-sample.txt
wait "$PID"
```

To skip the early phases and control how long sampling lasts, delay the start and pass a
duration:

```bash
rust/target/profiling/rubydex_cli <TARGET_PATH> --stats &
PID=$!
sleep 2  # let it get past listing/indexing into resolution
sample "$PID" 30 -f /tmp/rubydex-sample.txt  # sample for 30 seconds
wait "$PID"
```

The output is a text call tree with sample counts; sort by "self" samples to find hot functions.
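
The self-sample ranking can be pulled out of the report with a short filter. This sketch assumes the report contains a section introduced by a line starting with `Sort by top of stack`, that the section ends at the first blank line, and that each entry ends with its sample count; check these assumptions against an actual report before relying on it. The function name `top_of_stack` is hypothetical.

```bash
# Print the busiest top-of-stack entries from a `sample` report, highest count first.
# Usage: top_of_stack /tmp/rubydex-sample.txt [limit]
top_of_stack() {
  report="$1"
  limit="${2:-15}"
  awk '
    /^Sort by top of stack/ { in_section = 1; next }   # section header (assumed format)
    in_section && NF == 0   { exit }                    # blank line ends the section
    in_section              { count = $NF; sub(/^[ \t]+/, ""); print count "\t" $0 }
  ' "$report" | sort -rn | head -n "$limit"
}
```

Each output line is the sample count, a tab, then the original entry, so the result can be scanned or piped into further filters such as `grep malloc`.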

### How to read the profile

Don't assume which functions are hot; let the data tell you. Hot paths change as the
codebase evolves.

1. **Sort by self-time** (time spent in the function itself, not its callees). This reveals
   the actual hot spots. High total-time but low self-time means the function is just a
   caller, so drill into its children.

2. **Look for concentration vs. spread.** A single function dominating self-time suggests
   an algorithmic fix (memoization, better data structure). Time spread across many functions
   suggests the workload itself is large, and optimization means reducing or restructuring
   the work rather than tuning one function.

3. **Check for allocation pressure.** If `alloc` / `malloc` / `realloc` show up prominently
   in self-time, the bottleneck is memory allocation, not computation.

## Memory profiling

For memory, the `--stats` flag already reports Maximum RSS at the end. For deeper memory
analysis:

### Quick check with utils/mem-use

```bash
utils/mem-use rust/target/profiling/rubydex_cli <TARGET_PATH> --stats
```

This wraps the command with `/usr/bin/time -l` and reports:
- Maximum Resident Set Size (RSS)
- Peak Memory Footprint
- Execution Time
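
If only the raw `/usr/bin/time -l` output is at hand, the RSS line can be converted to MB with a one-line filter. This sketch assumes the macOS output format, where the value is in bytes and comes first on the `maximum resident set size` line; the function name `max_rss_mb` is hypothetical and is not how `utils/mem-use` is implemented.

```bash
# Read `/usr/bin/time -l` output on stdin and print the maximum RSS in MB.
max_rss_mb() {
  awk '/maximum resident set size/ { printf "%.1f\n", $1 / 1048576 }'
}

# Example (time writes its report to stderr, hence 2>&1):
#   /usr/bin/time -l rust/target/profiling/rubydex_cli "$TARGET" --stats 2>&1 | max_rss_mb
```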

### Allocation profiling with DHAT

For finding *where* memory is allocated (not just the total), use DHAT. The build below uses
`-Z build-std`, which requires a nightly toolchain, and targets Apple Silicon; adjust
`--target` on other machines:

```bash
cargo +nightly build --profile profiling -Z build-std --target aarch64-apple-darwin
```

This is more involved (the build alone does not produce a DHAT report), so only suggest it
if the user specifically wants allocation-level detail.

## Before/after comparison workflow

When the user has made a change and wants to measure impact:

1. **Get baseline**: run on the current main/branch before changes:
   ```bash
   samply record rust/target/profiling/rubydex_cli <TARGET_PATH> --stats 2>&1 | tee /tmp/rubydex-baseline.txt
   ```
   To keep the baseline profile, upload it from Firefox Profiler and save the resulting
   permalink; the local URL stops working once samply exits.

2. **Apply changes** and rebuild:
   ```bash
   cargo build --profile profiling
   ```

3. **Get new measurement**:
   ```bash
   samply record rust/target/profiling/rubydex_cli <TARGET_PATH> --stats 2>&1 | tee /tmp/rubydex-after.txt
   ```

4. **Compare**: parse both output files and show a side-by-side delta of:
   - Total time and per-phase breakdown (listing, indexing, resolution, querying)
   - Memory (RSS)
   - Declaration/definition counts (sanity check that output is equivalent)

Present the comparison as a formatted table showing absolute values and % change.
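
The percentage column of that table is simple arithmetic once the two values are extracted. The helpers below are a sketch: `pct_change` and `compare_row` are hypothetical names, and extracting the numbers from the `--stats` output is left out because its exact format is not shown here.

```bash
# Percentage change from a baseline value to a new value.
# Negative is an improvement for time and memory. The baseline must be non-zero.
pct_change() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%+.1f%%\n", (after - before) / before * 100 }'
}

# One Markdown table row: metric name, baseline, new value, delta.
compare_row() {
  printf '| %s | %s | %s | %s |\n' "$1" "$2" "$3" "$(pct_change "$2" "$3")"
}
```

For example, with illustrative numbers, `compare_row Resolution 50 40` prints `| Resolution | 50 | 40 | -20.0% |`.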

### Quick benchmark (no flamegraph)

When the user just wants timing/memory numbers without the full profiler overhead:

```bash
# Release build (builds faster than the profiling profile since it emits no debug symbols)
cargo build --release
utils/bench                    # uses DEFAULT_BENCH_WORKSPACE
utils/bench medium             # synthetic corpus
utils/bench /path/to/project   # specific directory
```

## Timing phases (--stats output)

The `--stats` flag on rubydex_cli prints a timing breakdown using the internal timer system.
The phases are:

| Phase | What it measures |
|-------|------------------|
| Initialization | Setup and configuration |
| Listing | File discovery (walking directories, filtering .rb files) |
| Indexing | Parsing Ruby files and extracting definitions/references |
| Resolution | Computing fully qualified names, resolving constants, linearizing ancestors |
| Integrity check | Validating graph consistency (optional) |
| Querying | Building query indices |
| Cleanup | Time not accounted for by other phases |

It also prints:
- Maximum RSS in bytes and MB
- Declaration/definition counts and breakdown by kind
- Orphan rate (definitions not linked to declarations)

## Troubleshooting

### samply permission errors on macOS

On macOS, samply needs permission to inspect the process it profiles, which may require
elevated privileges. If you get permission errors:

```bash
sudo samply record rust/target/profiling/rubydex_cli <TARGET_PATH> --stats
```

Or grant Terminal/iTerm the "Developer Tools" permission in System Settings > Privacy & Security.

### Empty or unhelpful flamegraphs

If the flamegraph shows mostly `[unknown]` frames:
- Make sure you built with `--profile profiling` (not `--release`)
- Verify debug symbols: `dsymutil -s rust/target/profiling/rubydex_cli | head -20` should
  show symbol entries.
- On macOS, ensure `strip = false` in the profiling profile

### Comparing runs with variance

Indexer performance can vary ±5% between runs due to OS scheduling, file system caching, etc.
For reliable comparisons, run 3 times and take the median, or at minimum run twice and check
consistency before drawing conclusions.
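
Taking the median of three runs needs no extra tooling. This is a minimal sketch; the function name `median` is hypothetical and it only handles an odd number of values.

```bash
# Median of an odd number of measurements, e.g. total seconds from three runs.
# Usage: median 5.2 4.9 5.0
median() {
  printf '%s\n' "$@" | LC_ALL=C sort -n | awk '{ v[NR] = $1 } END { print v[(NR + 1) / 2] }'
}
```

With illustrative timings, `median 5.2 4.9 5.0` prints `5.0`, which is then the value to feed into the before/after comparison.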
