fix(zebrad): retry explicit missing inventory without poisoning peers#10633
fix(zebrad): retry explicit missing inventory without poisoning peers#10633gustavovalverde wants to merge 2 commits into
Conversation
High-concurrency block sync was treating transport failures and timeouts as proof that a peer lacked every requested block. That marked unrelated inventory as missing, poisoned the peer routing registry, and could leave sync restarting instead of recovering when a predecessor block was reported missing. The fix keeps the missing-inventory registry tied to explicit notfound signals, preserves typed missing-inventory information on SharedPeerError, and routes peer-reported missing blocks through a bounded in-place retry. When a missing predecessor appears, Zebra cancels newer in-flight downloads and retries that block ahead of later hashes so verifier capacity is not spent on descendants waiting behind a gap. This also converts wrapped CommitSemanticallyVerifiedError duplicate commits back into VerifyBlockError::Commit, replaces stringly NotFound checks with typed classifications, adds retry and cancellation metrics, and uses a short restart delay when a failed sync run still made progress. Validated with targeted sync and consensus tests, cargo check for the affected crates, diff hygiene checks, and a local Z3 stack deployment on mainnet and testnet using the built image for this exact diff.
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
There was a problem hiding this comment.
Pull request overview
This PR improves Zebra’s block sync reliability under high concurrency by ensuring “missing inventory” is only recorded from explicit notfound signals (not generic transport failures), and by adding a bounded, in-place retry path for missing predecessor blocks so sync can recover without poisoning peer routing.
Changes:
- Restrict missing-inventory registry updates to explicit peer
notfoundresponses; preserve typed peer error classification throughSharedPeerError. - Add bounded missing-predecessor retry that cancels newer in-flight downloads, retries the missing hash first, then requeues interrupted/pending hashes behind it.
- Improve typed duplicate-commit propagation by unwrapping
CommitSemanticallyVerifiedErrorback intoVerifyBlockError::Commit, and add targeted unit tests + changelog entry.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
zebrad/src/components/sync.rs |
Adds typed missing-inventory classification, bounded predecessor retry + restart behavior adjustments and retry backoffs. |
zebrad/src/components/sync/downloads.rs |
Adds cancellation API that returns interrupted hashes; switches cancel-handle map to IndexMap. |
zebrad/src/components/sync/tests/vectors.rs |
Adds tests for single-hash obtain-tips, restart classification, missing-inventory retry ordering/limits, and in-flight cancellation behavior. |
zebra-network/src/peer/error.rs |
Refactors SharedPeerError to preserve a machine-readable kind and explicit missing inventory payload. |
zebra-network/src/peer/client.rs |
Stops treating transport failures as missing inventory; forwards only explicit missing inventory signals. |
zebra-network/src/peer/client/tests/vectors.rs |
Adds tests ensuring transport errors don’t mark inventory missing and explicit notfound does. |
zebra-network/src/protocol/internal/request.rs |
Removes unused inventory helper accessors from Request. |
zebra-state/src/error.rs |
Exposes helpers on CommitSemanticallyVerifiedError for typed duplicate detection and access to inner commit error. |
zebra-consensus/src/block.rs |
Downcasts state commit failures from CommitSemanticallyVerifiedError and returns typed VerifyBlockError::Commit. |
zebra-consensus/src/block/tests.rs |
Adds a test ensuring wrapped duplicate commit errors remain typed as VerifyBlockError::Commit. |
CHANGELOG.md |
Documents the sync fix and retry/backoff behavior under “Fixed”. |
oxarbitrage
left a comment
There was a problem hiding this comment.
Is the retry path necessary to observe the fix working in practice, or is it optional for this PR?
It is necessary for the practical fix, yes. |
Motivation
High-concurrency block sync could fail to recover cleanly after missing predecessor blocks.
MissingInventoryCollector::send()marked all inventory in a failed request as missing for any request error, including timeouts, dropped connections, and overload. That collapsed "request failed" into "peer does not have this block."Once the registry had those false misses, later downloads could fail with
PeerError::NotFoundRegistrywrapped inSharedPeerErrorandBlockDownloadVerifyError::DownloadFailed. The sync loop then had too little typed information: peernotfound, registrynotfound, duplicate commit errors, and generic transport errors were partly distinguished by debug-string checks. In local Z3 testing with high concurrency, this showed up as sync repeatedly failing or restarting instead of prioritizing the missing predecessor and continuing.The conclusion came from Z3 mainnet and testnet observations plus code path tracing. Logs and metrics showed sync progress and restart behavior; code inspection showed all request errors updated the missing-inventory registry; the sync loop handled
notfoundwithout a bounded predecessor retry. Existing issue #2908 also captures the need for typedVerifyBlockError::Commithandling so sync decisions do not depend on fragile error wrapping.The failure mode is not specific to tip operation; IBD nodes hit the same cycle. With
checkpoint_verify_concurrency_limitnear its default of 1000, the syncer's lookahead saturates with hundreds of in-flight blocks. WhenBLOCK_VERIFY_TIMEOUTtrips on a stuck predecessor, the registry poisoning takes out the whole batch. The next sync attempt then logsexhausted prospective tip set.Related to #10630
Solution
This keeps the fix focused on the broken contract: missing inventory is now a typed, explicit signal, not a side effect of any transport failure.
notfoundresponses. Generic transport failures no longer mark all requested inventory as missing.PeerErrorKindand explicit missing inventory inSharedPeerError, so sync can classify peernotfoundand local registrynotfoundwithout debug-string matching.CommitSemanticallyVerifiedErrorduplicate commits back intoVerifyBlockError::Commit, so duplicate classification remains typed through the semantic verifier.Tests
3355579and3355580, and testnet accepting blocks4031310through4031316.notfound, or missing-inventory retry loop.Specifications & References
Follow-up Work
No immediate follow-up is required for this PR. Broader sync architecture work can separately evaluate peer scoring or stricter interrupted-hash ordering, but neither is required to fix the registry poisoning and bounded missing-inventory recovery path.
AI Disclosure
PR Checklist
type(scope): description