refactor: use beacon chain deadlines for QBFT instance cleanup #719
base: unstable
Conversation
Adds a test to verify that QBFT Committee instances can reach late rounds (9+) when configured with max_round=12. The test creates a Committee instance, forces round changes by keeping operators offline, then advances through multiple slots while verifying the instance survives to reach round 10. It currently fails: the instance is cleaned up after 2 slots, so it reaches round 9 but cannot complete it (round 9 needs 120s but gets only 8s), as sketched below.
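A minimal sketch of that arithmetic, using the timing constants discussed in the review below (2s quick rounds, 120s for round 9, 12s slots); the values are illustrative:

```rust
// Why round 9 gets only 8 of the 120 seconds it needs before cleanup.
fn main() {
    let quick_timeout_s = 2u64; // rounds 1-8 each time out after 2s (QUICK_TIMEOUT)
    let rounds_1_to_8_s = 8 * quick_timeout_s; // 16s elapse before round 9 starts
    let cleanup_after_s = 2u64 * 12; // instance removed after 2 slots of 12s = 24s
    let round_9_budget_s = cleanup_after_s - rounds_1_to_8_s;
    assert_eq!(round_9_budget_s, 8); // only 8s of the 120s round 9 needs
}
```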
Claude finished @diegomrsantos's task — View job

Code Review Complete

I've analyzed this PR, which adds a test demonstrating a QBFT instance cleanup issue. The test is well-intentioned but has several areas for improvement.

Summary

This PR adds a single test (`test_committee_can_reach_late_rounds`) showing that Committee instances are removed before they can reach late rounds.

The Core Issue: the slot-based cleanup (`QBFT_RETAIN_SLOTS = 1`) removes the instance after 2 slots (~24s), while reaching round 10 requires roughly 256s (~21.33 slots).
Code Quality Assessment

Positive Aspects ✅
Issues Identified
| Category | Rating | Notes |
|---|---|---|
| Code Quality | ⚠️ | Test logic sound, but execution state unclear |
| Security | ✅ Good | No concerns |
| Performance | ✅ Good | Efficient test design |
| Testing | ⚠️ | Will break CI, needs `#[ignore]` or restructuring |
| Documentation | ⚠️ | Math errors, unclear expectations |
Recommendations
Before Merging (Required)
- Fix the test execution state: add `#[ignore]` or restructure as a negative test
- Correct the timing calculations in comments (see inline reviews)
- Create a GitHub issue tracking the cleanup bug and reference it in comments
- Update the PR description to clarify that only 1 file is actually being changed
Consider (Suggested)
- Add intermediate assertions verifying instance survival and round progression
- Reduce the loop bound from 50 to a realistic value (22-25)
- Add a companion test verifying current cleanup behavior explicitly
- Consider rebasing onto the latest `stable` to reduce diff noise
Follow-up Work (Separate PR)
- Fix the root cause: adjust `QBFT_RETAIN_SLOTS` or implement role-based/event-based cleanup
- Remove `#[ignore]` from this test once the fix is implemented
- Add tests for other duty types (Proposer, SyncCommittee, Aggregator)
Inline Comments
I've left 4 detailed inline comments on specific code sections. Please review each for technical details on:
- Test logic issues (loop bounds)
- Mathematical errors (timing calculations)
- Test design improvements (boundary testing)
- Documentation clarity (failure expectations)
Conclusion: This is valuable work that identifies a real production issue. With the recommended changes (particularly adding #[ignore] and fixing comment math), this test will serve as excellent documentation of the problem and a clear success criterion for the fix.
anchor/qbft_manager/src/tests.rs
Outdated
```rust
// Advance through multiple slots while QBFT progresses
// This triggers cleanup logic which should NOT remove the active instance
for slot in 1..=50 {
```
Test Logic Issue: Unrealistic loop bound
The test loops to slot 50, but breaks at slot 22. This creates confusion about the test's intent. Consider these improvements:
- Use a more reasonable upper bound (e.g., `1..=25`) that better reflects when you expect the test to complete
- Add an assertion after the loop to verify we actually broke at slot 22 and didn't fall through
- Consider making this data-driven: calculate the expected slot based on round timeout constants rather than hard-coding slot 22
```rust
const EXPECTED_ROUND_10_SLOT: u64 = 22; // Document the calculation
for slot in 1..=EXPECTED_ROUND_10_SLOT + 3 { // Small buffer
    clock.set_slot(slot);
    tokio::time::sleep(slot_duration).await;
    if slot == EXPECTED_ROUND_10_SLOT {
        context.set_operators_online(&[2, 3, 4]);
        break;
    }
}
// Verify we didn't fall through
assert!(clock.now().unwrap().as_u64() == EXPECTED_ROUND_10_SLOT);
```

anchor/qbft_manager/src/tests.rs

```rust
for slot in 1..=50 {
    clock.set_slot(slot);
    tokio::time::sleep(slot_duration).await;
```
Mathematical Accuracy: Comment contains calculation error
The comment states "Rounds 1-8: 16s"; the arithmetic is right but misleading as written:
- Rounds 1-8 each take 2 seconds (`QUICK_TIMEOUT`)
- Total for rounds 1-8: `8 * 2s = 16s` ✓
- But the comment should clarify this is cumulative, not per-round
Correct breakdown to round 10:
- Rounds 1-8: `8 × 2s = 16s`
- Round 9: `120s`
- Round 10: `120s`
- Total: `256s ≈ 21.33 slots`
At 12s per slot:
- `256s ÷ 12s/slot ≈ 21.33 slots`
- So at slot 22 (264s), round 10's timeout has already elapsed (at 256s) and round 11 is 8 seconds old, not "around round 10"
Suggest updating comment to:
```rust
// At slot 22 (264 seconds = 22 * 12s):
// - Rounds 1-8: 8 * 2s = 16s
// - Round 9: 120s (completes at 136s / ~slot 11.33)
// - Round 10: 120s (times out at 256s / ~slot 21.33)
// - So by slot 22 (264s), round 10 has elapsed and round 11 is 8s old
if slot == 22 {
```

This makes the test's timing expectations explicit and verifiable.
```rust
// Keep 3 operators offline initially to prevent consensus and force round changes.
// With only 1 operator online out of 4, we cannot reach quorum (need 3).
// This will cause the instance to go through multiple round changes.
context.set_operators_offline(&[2, 3, 4]);
```
Test Design: Consider testing the actual cleanup boundary
The test keeps 3 out of 4 operators offline to force round changes, which is good. However, it doesn't verify the specific failure mode mentioned in the PR description. Consider adding:
- Log inspection or metrics to verify the instance actually progresses through rounds 9-10
- Intermediate assertions to check the instance is still alive at critical points (e.g., after slot 2 when cleanup occurs)
- Test the boundary condition: What happens at exactly slot 2 when cleanup runs?
Example enhancement:
```rust
// Keep 3 operators offline to force round changes
context.set_operators_offline(&[2, 3, 4]);

// Advance to slot 2 where cleanup happens (cutoff = slot 2 - 1 = slot 1)
// At this point, the instance starting at slot 0 should be removed
clock.set_slot(1);
tokio::time::sleep(slot_duration).await;
clock.set_slot(2);
tokio::time::sleep(slot_duration).await;

// TODO: Add assertion here to verify instance is still alive
// This is where the bug manifests - instance gets cleaned up too early
```

This would make the test more explicitly demonstrate the issue described in the PR.
```rust
// The test simulates network conditions where consensus cannot be reached early by keeping
// all but one operator offline, forcing round changes. We advance the slot to trigger
// cleanup and verify the instance survives to reach round 10.
async fn test_committee_can_reach_late_rounds() {
```
Documentation: Test expectations need clarification
The comment says "Currently fails" but doesn't specify:
- What failure mode to expect: does it panic, time out, or return an error?
- Whether this is expected behavior: is this test marked with `#[should_panic]` or `#[ignore]`?
- When it should pass: what changes need to be made for this test to pass?
Recommendations:
- If the test is expected to fail, use `#[ignore]` or `#[should_panic]`:

```rust
#[tokio::test(start_paused = true)]
#[ignore = "Fails due to premature cleanup - see issue #XXX"]
async fn test_committee_can_reach_late_rounds() {
```

- Or, restructure as a negative test that explicitly verifies the current behavior:
```rust
#[tokio::test(start_paused = true)]
async fn test_committee_cleanup_prevents_late_rounds() {
    // Explicitly test that instances ARE cleaned up at slot 2
    // This documents current behavior before fix is implemented
```

- Add a GitHub issue reference so the test can be tracked to a fix
This prevents CI from failing and clearly communicates test intent to future developers.
Force-pushed from 4cd49f3 to 1d995b4
Force-pushed from 1d995b4 to 2cc3d0d
Replace slot-based cleanup with duty-specific beacon chain inclusion deadlines. This allows QBFT instances to progress through all configured rounds without premature removal.

Key changes:
- Separate instance identity from manager metadata using a ManagedInstance wrapper
- Calculate duty-specific deadlines per EIP-7045 (attestations valid until the end of epoch E+1)
- Add a slots_per_epoch configuration parameter
- Implement a dual-trigger cleaner (completion notification + deadline timeout); see the sketch after this message

Fixes instances being cleaned up after 2 slots; they now properly respect beacon chain inclusion windows (32-63 slots for attestations).
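A minimal sketch of what such a dual-trigger cleaner could look like, assuming a tokio runtime; the names (`completion_rx`, `slot_interval`, `InstanceId`) are illustrative, not the PR's actual API:

```rust
use std::collections::HashMap;

use tokio::sync::mpsc;

// Illustrative aliases; the real code uses its own identifier and Slot types.
type InstanceId = u64;
type Slot = u64;

/// Dual-trigger cleaner sketch: removes an instance immediately when it
/// reports completion, and sweeps expired instances once per slot tick.
async fn cleaner(
    mut completion_rx: mpsc::UnboundedReceiver<InstanceId>,
    mut slot_interval: tokio::time::Interval,
    mut instances: HashMap<InstanceId, Slot>, // instance id -> inclusion deadline
    mut current_slot: Slot,
) {
    loop {
        tokio::select! {
            // Trigger 1: completion notification; drop the instance right away.
            Some(id) = completion_rx.recv() => {
                instances.remove(&id);
            }
            // Trigger 2: slot tick; remove instances past their deadline.
            _ = slot_interval.tick() => {
                current_slot += 1;
                instances.retain(|_, deadline| *deadline >= current_slot);
            }
        }
    }
}
```

In the actual PR the deadline comes from `calculate_deadline` and the instance map is a `DashMap` (see the review below); this sketch only illustrates the two cleanup triggers.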
Force-pushed from 2cc3d0d to 4767f9d
Claude finished @diegomrsantos's task — View job

Code Review Complete

I've thoroughly reviewed this PR, which adds both a test demonstrating a QBFT cleanup issue AND a comprehensive fix. This is excellent work that addresses a production-critical bug.

Summary

This PR implements a duty-aware cleanup system for QBFT instances, replacing the overly aggressive slot-based cleanup that was preventing Committee instances from reaching late rounds (9-12). The solution is architecturally sound and follows best practices.

Key Achievement: Instances configured with `max_round = 12` can now progress through all configured rounds instead of being removed after 2 slots.

Changes Overview

Commit 1: Test (82eaa98)
Adds `test_committee_can_reach_late_rounds`, demonstrating the premature cleanup.

Commit 2: Fix (4767f9d)
Replaces slot-based cleanup with duty-specific beacon chain inclusion deadlines (see the commit message above).
Code Quality Assessment

Strengths ✅
Issues Identified
| Category | Rating | Notes |
|---|---|---|
| Code Quality | ✅ Excellent | Clean, maintainable, follows best practices |
| Security | ✅ Good | No concerns, improves liveness |
| Performance | ✅ Good | No regressions, potentially better |
| Testing | ✅ Good | Comprehensive, could add more edge cases |
| Documentation | ⚠️ | Minor math error, otherwise excellent |
| Architecture | ✅ Excellent | Well-designed, scalable solution |
Recommendations
Before Merging (Optional Improvements)
- Fix test comment math at lines 942-943 (see inline comment)
- Reduce the loop bound from 50 to 25 for clarity (see inline comment)
- Add a doc comment to `ManagedInstance` explaining its purpose (see inline comment)
- Enhance the `calculate_deadline` comment with EIP-7045 calculation detail (see inline comment)
Follow-up Work (Separate PRs)
- Add tests for other duty types (Proposer, SyncCommittee)
- Add test for deadline expiry behavior
- Add test verifying immediate cleanup on completion
- Consider adding metrics for instance cleanup (completion vs deadline)
Conclusion
This is excellent work that solves a real production issue with a well-architected solution. The code is clean, follows Rust best practices, and adheres to the project's guidelines. The few minor issues identified are documentation improvements rather than functional problems.
Recommendation: ✅ Approve with minor documentation improvements
The PR is ready to merge as-is. The suggested documentation improvements would make the code even better but are not blockers.
Review Summary: I've left 4 inline comments with specific suggestions for documentation improvements. All issues are minor and don't block merging. The implementation is sound and production-ready.
anchor/qbft_manager/src/tests.rs
Outdated
```rust
tokio::time::sleep(slot_duration).await;

// At slot 22 (256 seconds = 16s + 240s), we should be around round 10
// Rounds 1-8: 16s, Rounds 9-10: 240s = 256s total
```
Math Error in Comment
The comment's calculation is incorrect. Here's the accurate breakdown:
Correct Timing:
- Rounds 1-8: `8 × 2s = 16s` (complete at ~slot 1.33)
- Round 9: `120s` (completes at 136s / ~slot 11.33)
- Round 10: `120s` (runs from 136s and times out at 256s / ~slot 21.33)
- At slot 22 (264s): round 10's timeout has already elapsed; round 11 is 8 seconds old
Suggested fix:
```rust
// At slot 22 (264 seconds = 22 * 12s):
// - Rounds 1-8: 8 * 2s = 16s (complete at ~slot 1.33)
// - Round 9: 120s (completes at 136s / ~slot 11.33)
// - Round 10: 120s (times out at 256s / ~slot 21.33)
// - So by slot 22 we are 8 seconds past round 10's timeout, into round 11
if slot == 22 {
```

This makes the timing expectations explicit and verifiable.
```rust
// Map from an identifier to a sender for the instance
type Map<I, D> = DashMap<I, UnboundedSender<QbftMessage<D>>>;

// Manager's bookkeeping for an instance
pub struct ManagedInstance<D: QbftData> {
```
Consider Adding Documentation
The ManagedInstance struct is a key part of the new architecture but lacks documentation. Consider adding:
```rust
/// Manager's bookkeeping for a QBFT instance.
///
/// Tracks the communication channel and beacon chain inclusion deadline
/// for each active instance. The `deadline` field determines when the
/// instance should be cleaned up if not completed earlier. Instances are
/// removed either when they complete (via completion notification) or when
/// their deadline expires (via the periodic cleaner sweep).
pub struct ManagedInstance<D: QbftData> {
    sender: UnboundedSender<QbftMessage<D>>,
    deadline: types::Slot,
}
```

This helps future maintainers understand the purpose and lifecycle management.
```rust
/// Calculate the beacon chain inclusion deadline for a duty
fn calculate_deadline(role: Role, slot: types::Slot, slots_per_epoch: u64) -> types::Slot {
    match role {
        Role::Committee | Role::Aggregator => {
```
Enhance Comment for Clarity
The calculation `(epoch.as_u64() + 2) * slots_per_epoch - 1` implements EIP-7045 correctly, but the "why" could be clearer:
Suggested enhancement:
```rust
Role::Committee | Role::Aggregator => {
    // Attestations can be included until end of next epoch (epoch E+1)
    // Per EIP-7045: attestation from epoch E valid until end of epoch E+1
    //
    // Calculation explanation:
    // - Epoch E+1 starts at: (E+1) * slots_per_epoch
    // - Epoch E+1 ends at: (E+2) * slots_per_epoch - 1 (last slot of E+1)
    let epoch = slot.epoch(slots_per_epoch);
    types::Slot::new((epoch.as_u64() + 2) * slots_per_epoch - 1)
}
```

This makes it immediately clear why we use E+2 in the calculation.
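As a quick sanity check of that formula (hypothetical numbers, not taken from the PR):

```rust
// A Committee duty at slot 100 with slots_per_epoch = 32 lies in epoch 3
// (100 / 32 = 3), so its attestation is includable through the end of
// epoch 4: deadline = (3 + 2) * 32 - 1 = 159, the last slot of epoch 4.
let (slots_per_epoch, duty_slot) = (32u64, 100u64);
let epoch = duty_slot / slots_per_epoch; // 3
assert_eq!((epoch + 2) * slots_per_epoch - 1, 159);
```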
Improve test readability by applying the Setup/Execute/Assert structure:

- Replace magic numbers with named constants (SINGLE_INSTANCE, TWO_INSTANCES, etc.)
- Add mandatory section comments (// SETUP, // EXECUTE, // ASSERT) to all new tests
- Split the oversized test_role_based_deadline_calculations into 6 focused tests (one per role)
- Add descriptive assertion messages explaining what must be true
- Name all literals in new tests (OLD_CLEANUP_SLOT, BEACON_DEADLINE_SLOT, etc.)

All 23 tests pass (up from 18 due to the role deadline test split).
Add the Setup/Execute/Assert pattern as a mandatory requirement for all tests:

CLAUDE.md:
- Add a "Writing Tests" section requiring tester-subagent usage before writing any test code
- Ensures consistent test structure across the codebase

tester-subagent.md:
- Add section 4 "Test Structure Requirements" with a generic example
- Define 4 mandatory requirements: section comments, named constants, assertion messages, one behavior per test
- List anti-patterns to reject (no section comments, magic numbers, silent assertions, mixed code)
- Update the description to emphasize "MUST BE USED before writing ANY test code"

This ensures all new tests follow readable, maintainable patterns.
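A minimal sketch of the required shape (the constant name `SINGLE_INSTANCE` and the test body are purely illustrative, not from the PR):

```rust
#[tokio::test(start_paused = true)]
async fn test_example_shows_required_structure() {
    // SETUP: name every literal instead of using magic numbers.
    const SINGLE_INSTANCE: usize = 1;
    let mut instances: Vec<u64> = Vec::new();

    // EXECUTE: exercise exactly one behavior.
    instances.push(42);

    // ASSERT: state what must be true and why.
    assert_eq!(
        instances.len(),
        SINGLE_INSTANCE,
        "exactly one instance should be tracked after a single push"
    );
}
```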
Add a detailed explanation for the Committee/Aggregator deadline calculation:

- Document the calculation formula: (E+2) * slots_per_epoch - 1
- Explain that this represents the last slot for on-chain inclusion
- Reference the EIP-7045 specification

Enhance the ManagedInstance documentation:

- Convert to a doc comment for better API documentation
- Clarify that it tracks both the channel and the beacon chain deadline
- Explain its role in the cleanup task
Issue Addressed
Fixes instance cleanup issue where QBFT instances were cleaned up too early based on slot-based timeouts (QBFT_RETAIN_SLOTS = 1), preventing instances from reaching later rounds and completing consensus.
Proposed Changes
Core Changes
- `ManagedInstance` struct tracking both channel and deadline

Test Coverage
- `test_cleanup_removes_only_expired_instances`: verifies instances survive past the old 2-slot timeout
- `test_instance_completion_notification`: tests immediate cleanup after successful completion
- `test_committee_can_reach_late_rounds`: verifies instances can reach round 10+ with max_round=12
- `test_cleanup_across_epoch_boundary`: tests deadline calculation across epoch transitions
- `test_multiple_instances_completing_rapidly`: verifies burst completion handling

Code Quality
- Mandatory section comments (`// SETUP`, `// EXECUTE`, `// ASSERT`) in all new tests

Test Results
All 23 tests pass (up from 18 due to test refactoring that split one oversized test into 6 focused tests).
Additional Info
This aligns instance cleanup with actual beacon chain requirements rather than arbitrary slot-based timeouts, allowing instances to complete consensus within their protocol-defined windows.
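To illustrate the 32-63 slot inclusion window mentioned in the commit message, a small sketch of the arithmetic (assuming mainnet's 32 slots per epoch):

```rust
/// Last slot of epoch E+1, per EIP-7045 (sketch of the Committee/Aggregator case).
fn deadline(duty_slot: u64, slots_per_epoch: u64) -> u64 {
    (duty_slot / slots_per_epoch + 2) * slots_per_epoch - 1
}

fn main() {
    let spe = 32;
    // A duty at the first slot of epoch 0 gets the widest window: 63 slots remain.
    assert_eq!(deadline(0, spe), 63);
    // A duty at the last slot of epoch 0 (slot 31) shares that deadline,
    // leaving the narrowest window: 32 slots.
    assert_eq!(deadline(31, spe), 63);
}
```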