Skip to content

fix(zebrad): retry explicit missing inventory without poisoning peers#10633

Open
gustavovalverde wants to merge 2 commits into
mainfrom
fix/sync-missing-inventory-retry
Open

fix(zebrad): retry explicit missing inventory without poisoning peers#10633
gustavovalverde wants to merge 2 commits into
mainfrom
fix/sync-missing-inventory-retry

Conversation

@gustavovalverde
Copy link
Copy Markdown
Member

@gustavovalverde gustavovalverde commented May 26, 2026

Motivation

High-concurrency block sync could fail to recover cleanly after missing predecessor blocks. MissingInventoryCollector::send() marked all inventory in a failed request as missing for any request error, including timeouts, dropped connections, and overload. That collapsed "request failed" into "peer does not have this block."

Once the registry had those false misses, later downloads could fail with PeerError::NotFoundRegistry wrapped in SharedPeerError and BlockDownloadVerifyError::DownloadFailed. The sync loop then had too little typed information: peer notfound, registry notfound, duplicate commit errors, and generic transport errors were partly distinguished by debug-string checks. In local Z3 testing with high concurrency, this showed up as sync repeatedly failing or restarting instead of prioritizing the missing predecessor and continuing.

The conclusion came from Z3 mainnet and testnet observations plus code path tracing. Logs and metrics showed sync progress and restart behavior; code inspection showed all request errors updated the missing-inventory registry; the sync loop handled notfound without a bounded predecessor retry. Existing issue #2908 also captures the need for typed VerifyBlockError::Commit handling so sync decisions do not depend on fragile error wrapping.

The failure mode is not specific to tip operation; IBD nodes hit the same cycle. With checkpoint_verify_concurrency_limit near its default of 1000, the syncer's lookahead saturates with hundreds of in-flight blocks. When BLOCK_VERIFY_TIMEOUT trips on a stuck predecessor, the registry poisoning takes out the whole batch. The next sync attempt then logs exhausted prospective tip set.

Related to #10630

Solution

This keeps the fix focused on the broken contract: missing inventory is now a typed, explicit signal, not a side effect of any transport failure.

  • Keep missing-inventory registry updates to explicit notfound responses. Generic transport failures no longer mark all requested inventory as missing.
  • Preserve typed PeerErrorKind and explicit missing inventory in SharedPeerError, so sync can classify peer notfound and local registry notfound without debug-string matching.
  • Add a bounded in-place retry for peer-reported missing blocks. When a predecessor is missing, Zebra cancels newer in-flight downloads, retries the missing hash first, then requeues interrupted work behind it.
  • Restart sync when the local registry says all ready peers are missing a block, or when the per-hash retry limit is exhausted.
  • Use zero delay only after successful progress, and a short 5-second delay when a failed sync run still made progress. This keeps responsive catch-up without a tight loop.
  • Convert wrapped CommitSemanticallyVerifiedError duplicate commits back into VerifyBlockError::Commit, so duplicate classification remains typed through the semantic verifier.
  • Remove dead request inventory helpers and add targeted tests around missing-inventory retry, restart classification, single-hash tip behavior, and duplicate commit wrapping.

Tests

  • Local Z3 stack deployment on mainnet and testnet:
    • Post-deploy logs showed mainnet accepting blocks 3355579 and 3355580, and testnet accepting blocks 4031310 through 4031316.
    • Recent logs showed no panic, fatal error, OOM, notfound, or missing-inventory retry loop.

Specifications & References

Follow-up Work

No immediate follow-up is required for this PR. Broader sync architecture work can separately evaluate peer scoring or stricter interrupted-hash ordering, but neither is required to fix the registry poisoning and bounded missing-inventory recovery path.

AI Disclosure

  • No AI tools were used in this PR
  • AI tools were used: OpenAI coding assistant for implementation support, review cleanup, PR drafting, and validation orchestration.

PR Checklist

  • The PR title follows conventional commits format: type(scope): description
  • The PR follows the contribution guidelines.
  • This change was discussed in an issue or with the team beforehand.
  • The solution is tested.
  • The documentation and changelogs are up to date.

High-concurrency block sync was treating transport failures and timeouts as proof that a peer lacked every requested block. That marked unrelated inventory as missing, poisoned the peer routing registry, and could leave sync restarting instead of recovering when a predecessor block was reported missing.

The fix keeps the missing-inventory registry tied to explicit notfound signals, preserves typed missing-inventory information on SharedPeerError, and routes peer-reported missing blocks through a bounded in-place retry. When a missing predecessor appears, Zebra cancels newer in-flight downloads and retries that block ahead of later hashes so verifier capacity is not spent on descendants waiting behind a gap.

This also converts wrapped CommitSemanticallyVerifiedError duplicate commits back into VerifyBlockError::Commit, replaces stringly NotFound checks with typed classifications, adds retry and cancellation metrics, and uses a short restart delay when a failed sync run still made progress.

Validated with targeted sync and consensus tests, cargo check for the affected crates, diff hygiene checks, and a local Z3 stack deployment on mainnet and testnet using the built image for this exact diff.
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Zebra’s block sync reliability under high concurrency by ensuring “missing inventory” is only recorded from explicit notfound signals (not generic transport failures), and by adding a bounded, in-place retry path for missing predecessor blocks so sync can recover without poisoning peer routing.

Changes:

  • Restrict missing-inventory registry updates to explicit peer notfound responses; preserve typed peer error classification through SharedPeerError.
  • Add bounded missing-predecessor retry that cancels newer in-flight downloads, retries the missing hash first, then requeues interrupted/pending hashes behind it.
  • Improve typed duplicate-commit propagation by unwrapping CommitSemanticallyVerifiedError back into VerifyBlockError::Commit, and add targeted unit tests + changelog entry.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
zebrad/src/components/sync.rs Adds typed missing-inventory classification, bounded predecessor retry + restart behavior adjustments and retry backoffs.
zebrad/src/components/sync/downloads.rs Adds cancellation API that returns interrupted hashes; switches cancel-handle map to IndexMap.
zebrad/src/components/sync/tests/vectors.rs Adds tests for single-hash obtain-tips, restart classification, missing-inventory retry ordering/limits, and in-flight cancellation behavior.
zebra-network/src/peer/error.rs Refactors SharedPeerError to preserve a machine-readable kind and explicit missing inventory payload.
zebra-network/src/peer/client.rs Stops treating transport failures as missing inventory; forwards only explicit missing inventory signals.
zebra-network/src/peer/client/tests/vectors.rs Adds tests ensuring transport errors don’t mark inventory missing and explicit notfound does.
zebra-network/src/protocol/internal/request.rs Removes unused inventory helper accessors from Request.
zebra-state/src/error.rs Exposes helpers on CommitSemanticallyVerifiedError for typed duplicate detection and access to inner commit error.
zebra-consensus/src/block.rs Downcasts state commit failures from CommitSemanticallyVerifiedError and returns typed VerifyBlockError::Commit.
zebra-consensus/src/block/tests.rs Adds a test ensuring wrapped duplicate commit errors remain typed as VerifyBlockError::Commit.
CHANGELOG.md Documents the sync fix and retry/backoff behavior under “Fixed”.

Comment thread zebra-network/src/peer/error.rs
@gustavovalverde gustavovalverde changed the title fix(sync): retry explicit missing inventory without poisoning peers fix(zebrad): retry explicit missing inventory without poisoning peers May 26, 2026
@gustavovalverde gustavovalverde self-assigned this May 26, 2026
Copy link
Copy Markdown
Contributor

@oxarbitrage oxarbitrage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the retry path necessary to observe the fix working in practice, or is it optional for this PR?

@gustavovalverde
Copy link
Copy Markdown
Member Author

Is the retry path necessary to observe the fix working in practice, or is it optional for this PR?

It is necessary for the practical fix, yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants