
fix(trie): blinded-sibling collapse when multiple subtries empty in parallel#23021

Open
pepyakin wants to merge 4 commits into main from pep/arena-pst/parallel-bug-repro

Conversation

@pepyakin
Member

@pepyakin pepyakin commented Mar 13, 2026

The Bug

Trie structure at the time of update_leaves

root (branch)
 └─ 0xd (branch, 3 children)
      ├─ 0xd7 → Subtrie (1 leaf, revealed)
      ├─ 0xd8 → Subtrie (1 leaf, BLINDED — no update touches it)
      └─ 0xdd → Subtrie (1 leaf, revealed)

What happens

  1. update_leaves walks the sorted updates sequentially and encounters subtrie 0xd7 (all-removal update).

  2. Pre-check check_subtrie_collapse_needs_proof runs for 0xd7:

    • All updates are removals? ✓ (num_removals == subtrie_updates.len())
    • Would empty the subtrie? ✓ (num_removals >= num_leaves)
    • Parent has exactly 2 children? ✗ — it has 3, so count_bits() != 2 holds and the check returns None (no proof needed)
  3. 0xd7 is taken for parallel processing.

  4. Same thing for 0xdd — identical check, identical result. Taken.

  5. Both subtries are processed in parallel. Both become EmptyRoot.

  6. Post-restore unwrap phase iterates taken paths:

    • Seek to 0xd7 → empty → maybe_unwrap_subtrie → removes child from parent. Parent 0xd now has 2 children: {0xd8 (blinded), 0xdd (empty subtrie)}.
    • maybe_collapse_or_remove_branch runs → count=2, no-op.
    • Seek to 0xdd → empty → maybe_unwrap_subtrie → removes child from parent. Parent 0xd now has 1 child: {0xd8 (blinded)}.
    • maybe_collapse_or_remove_branch runs → count=1, sole remaining child is blinded → panic.
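The unwrap sequence above can be simulated with a toy bitmask model. This is a sketch only: `collapses_to_blinded`, `state_mask`, and `blinded_mask` are illustrative names, not reth-trie-sparse types.

```rust
/// Remove each emptied child from the parent's state mask in order (as the
/// post-restore unwrap does) and report whether the branch ever shrinks to a
/// lone blinded child — the exact state that panics in the old code.
fn collapses_to_blinded(mut state_mask: u16, blinded_mask: u16, emptied: &[u8]) -> bool {
    for &nibble in emptied {
        // maybe_unwrap_subtrie removes the emptied child from the parent.
        state_mask &= !(1u16 << nibble);
        // maybe_collapse_or_remove_branch: one survivor, and it is blinded.
        if state_mask.count_ones() == 1 && state_mask & blinded_mask != 0 {
            return true;
        }
    }
    false
}

fn main() {
    // The PR's shape: children at 0x7, 0x8 (blinded), 0xd under parent 0xd.
    let mask = (1u16 << 0x7) | (1 << 0x8) | (1 << 0xd);
    // Emptying both 0xd7 and 0xdd leaves only the blinded 0xd8 behind.
    assert!(collapses_to_blinded(mask, 1 << 0x8, &[0x7, 0xd]));
    // Emptying just one of them leaves 2 children — no collapse, no panic.
    assert!(!collapses_to_blinded(mask, 1 << 0x8, &[0x7]));
    println!("lone blinded child reached only when both siblings are emptied");
}
```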

Why the pre-check missed it

check_subtrie_collapse_needs_proof evaluated each subtrie independently against the pre-update parent shape. It only requested a proof when the parent had exactly 2 children (the subtrie being emptied + one blinded sibling). With 3+ children, it assumed removing one subtrie was safe — without considering that other siblings in the same batch would also be removed.

The Fix

Remove the count_bits() == 2 gate and sibling_child() call (which asserts exactly 2 children). Instead, scan all parent children for any blinded sibling. If a subtrie would be fully emptied and the parent has any blinded child anywhere, conservatively request a proof upfront.

Diff

The change is entirely within check_subtrie_collapse_needs_proof. The early-exit conditions (not all removals, wouldn't empty the subtrie) are unchanged. Only the parent-shape check changes:

Before:

if parent_branch.state_mask.count_bits() != 2 {
    return None;
}
if !parent_branch.sibling_child(child_nibble).is_blinded() {
    return None;
}
let sibling_nibble = parent_branch
    .state_mask
    .iter()
    .find(|&n| n != child_nibble)
    .expect("branch has two children");

After:

let blinded_sibling_nibble = parent_branch
    .child_iter()
    .find(|(nibble, child)| *nibble != child_nibble && child.is_blinded())
    .map(|(nibble, _)| nibble)?;
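To make the new scan concrete, here is a self-contained sketch. The `Child` enum, `ParentBranch` struct, and this `child_iter` are simplified stand-ins for the real trie types, not the actual API:

```rust
#[derive(Clone, Copy, PartialEq)]
enum Child {
    Revealed,
    Blinded,
}

struct ParentBranch {
    children: [Option<Child>; 16], // one slot per nibble
}

impl ParentBranch {
    /// Iterate over (nibble, child) for occupied slots, mirroring child_iter().
    fn child_iter(&self) -> impl Iterator<Item = (u8, Child)> + '_ {
        self.children
            .iter()
            .enumerate()
            .filter_map(|(n, c)| c.as_ref().map(|&c| (n as u8, c)))
    }

    /// New check: any blinded sibling anywhere triggers a proof request,
    /// regardless of how many children the parent has.
    fn blinded_sibling(&self, child_nibble: u8) -> Option<u8> {
        self.child_iter()
            .find(|(nibble, child)| *nibble != child_nibble && *child == Child::Blinded)
            .map(|(nibble, _)| nibble)
    }
}

fn main() {
    // The PR's trie shape: parent 0xd with children 0x7, 0x8 (blinded), 0xd.
    let mut children = [None; 16];
    children[0x7] = Some(Child::Revealed);
    children[0x8] = Some(Child::Blinded);
    children[0xd] = Some(Child::Revealed);
    let parent = ParentBranch { children };

    // Old gate: 3 children -> count != 2 -> no proof. New scan: finds 0x8.
    assert_eq!(parent.blinded_sibling(0x7), Some(0x8));
    // Emptying 0xdd sees the same blinded sibling, so it too requests a proof.
    assert_eq!(parent.blinded_sibling(0xd), Some(0x8));
    println!("blinded sibling found at nibble 0x8");
}
```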

Conservatism

When check_subtrie_collapse_needs_proof returns Some, the subtrie's updates are put back into the updates map, the subtrie is skipped entirely, and a proof is requested. On the next reveal-update cycle, the proof reveals the blinded sibling, and the subtrie's updates are retried — this time the parent has no blinded children so the check passes and the deletions go through normally.
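The skip-and-retry behavior can be sketched as a toy driver loop. The `Subtrie` struct and `update_pass` function below are hypothetical, not the actual reveal-update machinery:

```rust
struct Subtrie {
    parent_has_blinded_sibling: bool, // stand-in for the pre-check's answer
    updates: Vec<&'static str>,       // all-removal updates for this subtrie
}

/// One pass of the update phase over all-removal subtries: apply deletions for
/// subtries whose parent has no blinded sibling, re-queue the rest behind a
/// proof request.
fn update_pass(queue: Vec<Subtrie>) -> (usize, Vec<Subtrie>) {
    let mut applied = 0;
    let mut requeued = Vec::new();
    for subtrie in queue {
        if subtrie.parent_has_blinded_sibling {
            requeued.push(subtrie); // proof requested; retry next cycle
        } else {
            applied += subtrie.updates.len(); // deletions go through
        }
    }
    (applied, requeued)
}

fn main() {
    // First cycle: the parent still has a blinded sibling, so the batch is skipped.
    let queue = vec![Subtrie {
        parent_has_blinded_sibling: true,
        updates: vec!["remove leaf"],
    }];
    let (applied, mut requeued) = update_pass(queue);
    assert_eq!(applied, 0);
    assert_eq!(requeued.len(), 1);

    // The requested proof reveals the sibling before the next cycle.
    for s in &mut requeued {
        s.parent_has_blinded_sibling = false;
    }

    // Second cycle: the check passes and the deletions are applied.
    let (applied, requeued) = update_pass(requeued);
    assert_eq!(applied, 1);
    assert!(requeued.is_empty());
}
```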

The false-positive scenario:

Parent branch at nibble 0xd, 4 children:
  0xd3 → Subtrie (revealed, NOT being updated)
  0xd7 → Subtrie (revealed, being emptied by all-deletion updates)
  0xd8 → BLINDED (no updates)
  0xdd → Subtrie (revealed, NOT being updated)
  • Old code: parent has 4 children → count_bits() != 2 → returns None → subtrie 0xd7 is taken and emptied → post-restore unwrap removes it → parent has 3 children {0xd3, 0xd8, 0xdd} → count >= 2 → no collapse → safe, no problem.

  • New code: parent has a blinded child 0xd8 → returns Some(proof for 0xd8) → subtrie 0xd7's updates are skipped and re-queued → proof for 0xd8 is requested → extra reveal-update round-trip to apply the same deletions that would have worked fine the first time.

Cost: one extra iteration of the reveal-update loop. The deletions still succeed, the hash is still correct — just one unnecessary round-trip through update_leaves → proof request → reveal_nodes → update_leaves.

How often: requires ALL of these simultaneously:

  1. A subtrie where every single update is a deletion
  2. Those deletions empty the subtrie completely (num_removals >= num_leaves)
  3. The parent has a blinded child
  4. The parent has other revealed children that survive (so the collapse-to-blinded never actually happens)

This is very narrow. In typical usage, subtries at depth 2 under the same parent rarely have all-deletion batches that fully empty them, and if the parent has blinded children it usually means the trie is partially revealed — which is the exact scenario where an extra proof round-trip is near-free anyway.

A tighter fix would need to predict which other sibling subtries will also be emptied in the same batch, but check_subtrie_collapse_needs_proof runs per-subtrie during a sequential walk — it doesn't have lookahead into future subtries. You'd need a two-pass approach or parent-level grouping, which is more code for a marginal improvement on a rare edge case.
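For illustration only, a hypothetical two-pass variant might look like the sketch below; `needs_proof_two_pass` and its inputs are invented for the example and are not part of the PR:

```rust
use std::collections::HashMap;

/// Hypothetical two-pass check: first count how many children each parent
/// loses across the whole batch, then request a proof only when the sole
/// surviving child would be blinded.
fn needs_proof_two_pass(
    batch: &[(&'static str, u8)],                 // (parent path, nibble) of emptied subtries
    child_count: &HashMap<&'static str, usize>,   // children per parent before the batch
    blinded_count: &HashMap<&'static str, usize>, // blinded children per parent
) -> Vec<(&'static str, u8)> {
    // Pass 1: how many children does the batch remove from each parent?
    let mut removed: HashMap<&str, usize> = HashMap::new();
    for &(parent, _) in batch {
        *removed.entry(parent).or_default() += 1;
    }
    // Pass 2: flag only the subtries whose parent would collapse onto a
    // single blinded survivor.
    batch
        .iter()
        .copied()
        .filter(|&(parent, _)| {
            let survivors = child_count[parent] - removed[parent];
            survivors == 1 && blinded_count.get(parent).copied().unwrap_or(0) == 1
        })
        .collect()
}

fn main() {
    // This PR's scenario: parent 0xd has 3 children, one blinded; the batch
    // empties 0xd7 and 0xdd, so the sole survivor is the blinded child.
    let counts = HashMap::from([("0xd", 3usize)]);
    let blinded = HashMap::from([("0xd", 1usize)]);
    let proofs = needs_proof_two_pass(&[("0xd", 0x7), ("0xd", 0xd)], &counts, &blinded);
    assert_eq!(proofs, vec![("0xd", 0x7), ("0xd", 0xd)]);

    // The false-positive scenario above: 4 children, only 0xd7 emptied ->
    // 3 survivors -> no proof needed, unlike the conservative single-pass fix.
    let counts = HashMap::from([("0xd", 4usize)]);
    assert!(needs_proof_two_pass(&[("0xd", 0x7)], &counts, &blinded).is_empty());
    println!("two-pass check flags only the real collapse");
}
```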

@github-actions
Contributor

github-actions bot commented Mar 13, 2026

⚠️ Changelog not found.

A changelog entry is required before merging. We've generated a suggested changelog based on your changes:

Preview
---
reth-trie-sparse: patch
---

Fixed blinded-sibling collapse panic when multiple sibling subtries are emptied in the same parallel batch. Previously, `check_subtrie_collapse_needs_proof` only handled the case where the parent branch had exactly two children and the single sibling was blinded; it now finds any blinded sibling among an arbitrary number of children, preventing the branch from collapsing to a lone blinded child during post-restore unwrap.

Add changelog to commit this to your branch.

@pepyakin pepyakin force-pushed the pep/arena-pst/parallel-bug-repro branch from f4a1d4d to 76211ae on March 13, 2026 13:35
@pepyakin pepyakin changed the title from "test: reproduce blinded-sibling collapse with forced parallel path" to "fix(trie): blinded-sibling collapse when multiple subtries empty in parallel" on Mar 13, 2026
@pepyakin pepyakin added the cyclops Trigger Cyclops PR audit label Mar 13, 2026
@pepyakin pepyakin marked this pull request as ready for review March 13, 2026 14:37
@decofe
Member

decofe commented Mar 13, 2026

👁️ Cyclops Security Review

dd13387

🧭 Auditing · mode=normal · workers 0/3 done (3 left) · verify pending 1

| Worker | Engine | Progress | Status |
| --- | --- | --- | --- |
| pr-23021-w1 | gemini-3.1-pro-preview | ✅ thread-1 🚨 thread-2 🔍 thread-3 | Running |
| pr-23021-w2 | amp/deep | 🔍 thread-1 · · | Running |
| pr-23021-w3 | gpt-5.4 | 🔍 thread-1 · · | Running |

Findings

| # | Finding | Severity | Verification | Threads |
| --- | --- | --- | --- | --- |
| 1 | Incomplete fix: Blinded-sibling collapse panic can still be triggered via Touched updates | High | ⏳ Pending | audit |
⚙️ Controls
  • 🚀 Keep only 1 remaining iteration per worker after the current work finishes.
  • 👀 Keep only 2 remaining iterations per worker after the current work finishes.
  • ❤️ Let only worker 1 continue; other workers skip queued iterations.
  • 😄 Let only worker 2 continue; other workers skip queued iterations.
  • 🎉 End faster by skipping queued iterations and moving toward consolidation.
  • 😕 Stop active workers/verifiers now and start consolidation immediately.

📜 9 events

🔍 pr-23021-w1 iter 1/3 [audit-ripple.md]
🔍 pr-23021-w2 iter 1/3 [audit-focused.md]
🔍 pr-23021-w3 iter 1/3 [audit-deep-focus.md]
pr-23021-w1 iter 1 — clear | Thread
🔍 pr-23021-w1 iter 2/3 [audit-historical.md]
🚨 pr-23021-w1 iter 2 — finding | Thread
🚨 Finding: Incomplete fix: Blinded-sibling collapse panic can still be triggered via Touched updates (High) | Thread
🔍 pr-23021-w1 iter 3/3 [audit-focused.md]
🔬 Verifying: Incomplete fix: Blinded-sibling collapse panic can still be triggered via Touched updates | Thread

@pepyakin
Member Author

derek bench

@decofe
Member

decofe commented Mar 13, 2026

cc @pepyakin

✅ Benchmark complete! View job

Benchmark Results

⚠️ Feature is 10 commits behind main. Consider rebasing for accurate results.

| Metric | main | pep/arena-pst/parallel-bug-repro | Change |
| --- | --- | --- | --- |
| Mean | 24.84ms | 24.81ms | -0.10% ⚪ (±0.32%) |
| StdDev | 15.90ms | 15.96ms | |
| P50 | 21.76ms | 21.50ms | -1.22% ✅ (±1.16%) |
| P90 | 35.75ms | 36.66ms | +2.54% ⚪ (±2.76%) |
| P99 | 97.57ms | 98.78ms | +1.24% ⚪ (±1.79%) |
| Mgas/s | 1289.46 | 1292.45 | +0.23% ⚪ (±0.35%) |
| Wall Clock | 25.65s | 25.67s | +0.08% ⚪ (±0.37%) |

500 blocks

Wait Time Breakdown

Persistence Wait

| Metric | main | pep/arena-pst/parallel-bug-repro |
| --- | --- | --- |
| Mean | 358.44ms | 356.58ms |
| P50 | 344.74ms | 342.79ms |
| P95 | 542.34ms | 542.96ms |

Trie Cache Update Wait

| Metric | main | pep/arena-pst/parallel-bug-repro |
| --- | --- | --- |
| Mean | 5.28ms | 5.31ms |
| P50 | 5.47ms | 5.49ms |
| P95 | 7.12ms | 7.22ms |

Execution Cache Update Wait

| Metric | main | pep/arena-pst/parallel-bug-repro |
| --- | --- | --- |
| Mean | 0.00ms | 0.00ms |
| P50 | 0.00ms | 0.00ms |
| P95 | 0.00ms | 0.00ms |

Charts

Latency, Throughput & Diff

Wait Time Breakdown

Gas vs Latency

Grafana Dashboard

View real-time metrics
