Skip to content

fix(pampa): detect parse errors without the per-lex tree-sitter logger (bd-b7eb7)#247

Merged
cscheid merged 2 commits into
mainfrom
perf/tree-sitter-error-logging
May 31, 2026
Merged

fix(pampa): detect parse errors without the per-lex tree-sitter logger (bd-b7eb7)#247
cscheid merged 2 commits into
mainfrom
perf/tree-sitter-error-logging

Conversation

@cscheid
Copy link
Copy Markdown
Member

@cscheid cscheid commented May 31, 2026

Summary

The qmd reader (crates/pampa/src/readers/qmd.rs::read) attached a tree-sitter set_logger callback on every parse. When any logger is set, tree-sitter formats a debug string with snprintf for every lex action — and on macOS snprintf locks the global locale (localeconv_l). Under multithreaded project renders (rayon, ~16 workers) that locale lock serializes all parser threads.

How it was found

Profiled a large healthy project (401 docs / 8.2 MB, no errors) — the case we most want fast. A fresh multithreaded render spent ~75% of samples in __ulock_wait2/__ulock_wake/os_unfair_lock, call chain:

ts_lexer__advance → snprintf → __vfprintf → localeconv_l → os_unfair_lock

16 threads bought only ~1.2× wall-clock; sys time ~37 s. It is not the allocator — swapping mimalloc both as #[global_allocator] and via tree_sitter::set_allocator changed nothing. The locale-lock culprit was found with macOS sample (which symbolicates system libraries, unlike samply's sidecar).

Fix

Parse the happy path with no logger, and detect whether tree-sitter entered error-correction mode via its own public signal — the parse_with_options progress callback's ParseState::has_error — combined with the final tree's Node::has_error (which backstops an error in the last few ops before parse end). The logger is attached only for the diagnostic re-parse, which runs only when the document has an error. Encapsulated in MarkdownTree::had_parse_error. The previous, ineffective TreeSitterLogObserverFast (its cheap callback didn't help — the snprintf is inside tree-sitter, before the callback) is removed.

Why has_error is reliable here: tree-sitter-qmd declares no conflicts:, so the parser is deterministic LR and GLR multiple-stack speculation only occurs during error recovery — has_error therefore means a genuine error, not a speculative branch that will be pruned. This invariant is now documented in grammar.js and parser.rs.

Results (large healthy project, release-perf)

metric before after
16-thread render 8.04 s 3.71 s (2.2×)
16-thread sys 36.83 s 1.37 s (27×)
pass-1 @ 16 threads 2935 ms 426 ms (scales ~6.4× on 16 cores)
single-thread render 9.71 s 6.0 s (1.6×)

Correctness

  • Error-corpus snapshot tests (json / text / ariadne / locations) pass.
  • The real 565-file qmd-plans corpus reports the same 140 error files — exact parity with the previous log-based detection.
  • Full cargo xtask verify (Rust workspace + WASM/hub build + hub tests) passes.

Notes

  • Also documents the profiling lesson (use macOS sample for system-lib frames; a Rust #[global_allocator] doesn't cover a vendored C library's libc malloc) in claude-notes/instructions/performance-profiling.md.
  • Full investigation: claude-notes/research/2026-05-31-q2-render-perf-qmd-plans.md (lands separately with the website experiment).

🤖 Generated with Claude Code

cscheid and others added 2 commits May 31, 2026 09:31
…r (bd-b7eb7)

The qmd reader attached a tree-sitter `set_logger` callback on every parse. When
any logger is set, tree-sitter formats a debug string with `snprintf` for every
lex action; on macOS `snprintf` locks the global locale (`localeconv_l`). Under
multithreaded project renders (rayon, ~16 workers) that locale lock serializes
all parser threads.

Profiled on a large *healthy* project (401 docs / 8.2 MB, no errors): a fresh
multithreaded render spent ~75% of samples in __ulock_wait2/__ulock_wake/
os_unfair_lock — chain ts_lexer__advance -> snprintf -> __vfprintf ->
localeconv_l. 16 threads bought only ~1.2x wall-clock; sys ~37s. (It is NOT the
allocator: mimalloc, both as #[global_allocator] and via ts_set_allocator,
changed nothing. Identified with macOS `sample`, which symbolicates system libs
unlike samply's sidecar.)

Fix: parse the happy path with no logger, and detect whether tree-sitter entered
error-correction mode via its own public signal — the `parse_with_options`
progress callback's `ParseState::has_error` — combined with the final tree's
`Node::has_error` (the latter backstops an error in the last few ops before the
parse ends). The logger is attached only for the diagnostic re-parse, which runs
only when the document has an error. `MarkdownTree::had_parse_error` encapsulates
this; `TreeSitterLogObserverFast` (the previous, ineffective optimization) is
removed.

Why has_error is reliable here: tree-sitter-qmd declares no `conflicts:`, so the
parser is deterministic LR and GLR multiple-stack speculation only happens during
error recovery — has_error therefore means a genuine error, not a speculative
branch. Documented this invariant in grammar.js and parser.rs so it isn't
regressed.

Measured (large healthy project, release-perf): 16-thread render 8.04 -> 3.71s
(2.2x); sys 36.83 -> 1.37s (27x); pass-1 2935 -> 426ms (now scales ~6.4x on 16
cores); single-thread 1.6x faster. No diagnostic regression: error-corpus
snapshot tests pass and the real 565-file corpus reports the same 140 error
files. Full `cargo xtask verify` (incl. WASM/hub) passes.

Research notes: claude-notes/research/2026-05-31-q2-render-perf-qmd-plans.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…b7eb7)

samply's syms sidecar doesn't symbolicate system libraries, so hot/lock-holding
frames inside libsystem show up as bare addresses. Document using macOS `sample`
to resolve them, with the bd-b7eb7 worked example (guessed malloc, was wrong;
mimalloc swap as a decisive negative result; the real cause was snprintf ->
localeconv_l, the locale lock). Also note that #[global_allocator] doesn't cover
a vendored C library's direct libc malloc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cscheid
Copy link
Copy Markdown
Member Author

cscheid commented May 31, 2026

(it should be noted that these wall-clock numbers come from a really beefy M5 Max; typical users will see slower perf. We're working on it!)

@cscheid cscheid merged commit 51a19ad into main May 31, 2026
9 checks passed
@cscheid cscheid deleted the perf/tree-sitter-error-logging branch May 31, 2026 14:52
cscheid added a commit that referenced this pull request Jun 1, 2026
…rcw)

`native_visitor` declared `whitespace_re` as a function-local
`Lazy<Regex>`, so a fresh `Lazy` was built and `\s+` recompiled on every
call that reached the inline whitespace check. A samply profile of a
full `q2 render` of the 565-file qmd-plans website attributed ~14% of
self-time to the regex NFA/DFA compiler under `native_visitor`
(20,806 compilations over the corpus).

Hoist it to a module-level `static WHITESPACE_RE`, compiled once.

Impact (median of 5, release builds):
- end-to-end `q2 render`: 4720 ms -> 3540 ms (~-24%; render parses each
  file twice, Pass 1 profile + Pass 2 render, so the per-parse win doubles)
- isolated parse path: -11.5% (1 thread), -9.9% (16 threads)
- profile: regex-compile self-time 14.3% -> 0.02%, total CPU samples -17%

Adds a regression test (`test_whitespace_re_compile_once`) that asserts
the regex compiles <= 1x per process via a `WHITESPACE_RE_COMPILE_COUNT`
counter, and a `parse-corpus` perf-harness driver used to measure the
serial/parallel impact. The driver also surfaced that the parse path
scales 7.1x on 16 threads with zero lock contention post-#247 — see
bd-3gj56 (parallelize Pass 2).

Plan + full numbers: claude-notes/plans/2026-06-01-render-perf-profiling.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant