
Conversation


@diegomrsantos diegomrsantos commented Oct 25, 2025

Issue Addressed

Fixes an instance cleanup issue where QBFT instances were removed too early by slot-based timeouts (QBFT_RETAIN_SLOTS = 1), preventing them from reaching later rounds and completing consensus.

Proposed Changes

Core Changes

  • Refactored instance cleanup to use beacon chain inclusion deadlines instead of slot-based timeouts
  • Each role now has a deadline based on EIP-7045 and consensus spec requirements:
    • Committee/Aggregator: End of epoch E+1 (attestation inclusion window)
    • Proposer/SyncCommittee: Same slot (immediate inclusion)
    • VoluntaryExit/ValidatorRegistration: One epoch window
  • Instances are cleaned when:
    1. They complete successfully (via completion notification channel)
    2. Their beacon chain deadline expires (checked each slot)
  • Added ManagedInstance struct tracking both channel and deadline
  • Implemented dual cleanup mechanism: completion-based (immediate) and deadline-based (deferred)
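The per-role deadline rules above can be sketched as follows. This is an illustrative stand-in, not the crate's code: the Role enum, plain u64 slots, and the function shape are assumptions for the sketch (the real crate uses its own Role and types::Slot types).

```rust
// Illustrative sketch of the per-role deadline rules described above.
// Role, plain u64 slots, and the function shape are stand-ins for the
// crate's actual types.
#[derive(Clone, Copy)]
enum Role {
    Committee,
    Aggregator,
    Proposer,
    SyncCommittee,
    VoluntaryExit,
    ValidatorRegistration,
}

/// Last slot at which the duty's result can still be included on chain.
fn calculate_deadline(role: Role, slot: u64, slots_per_epoch: u64) -> u64 {
    let epoch = slot / slots_per_epoch;
    match role {
        // EIP-7045: an attestation from epoch E is valid until the end of
        // epoch E+1, whose last slot is (E+2) * slots_per_epoch - 1.
        Role::Committee | Role::Aggregator => (epoch + 2) * slots_per_epoch - 1,
        // Block proposals and sync-committee messages only count in their own slot.
        Role::Proposer | Role::SyncCommittee => slot,
        // Roughly one epoch of leeway.
        Role::VoluntaryExit | Role::ValidatorRegistration => slot + slots_per_epoch,
    }
}

fn main() {
    // A committee duty at slot 5 (epoch 0, 32-slot epochs) may be included
    // until the last slot of epoch 1: 2 * 32 - 1 = 63.
    assert_eq!(calculate_deadline(Role::Committee, 5, 32), 63);
    assert_eq!(calculate_deadline(Role::Proposer, 5, 32), 5);
    assert_eq!(calculate_deadline(Role::VoluntaryExit, 5, 32), 37);
}
```

Depending on where in the epoch a committee duty falls, this leaves between 32 slots (duty in the epoch's last slot) and 63 slots (duty in its first slot) before the deadline.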

Test Coverage

  • Added 5 comprehensive tests verifying deadline-based cleanup:
    • test_cleanup_removes_only_expired_instances - Verifies instances survive past old 2-slot timeout
    • test_instance_completion_notification - Tests immediate cleanup after successful completion
    • test_committee_can_reach_late_rounds - Verifies instances can reach round 10+ with max_round=12
    • test_cleanup_across_epoch_boundary - Tests deadline calculation across epoch transitions
    • test_multiple_instances_completing_rapidly - Verifies burst completion handling
  • Added 6 focused tests for role-specific deadline calculations (Committee, Aggregator, Proposer, SyncCommittee, VoluntaryExit, ValidatorRegistration)

Code Quality

  • Refactored all tests to follow Setup/Execute/Assert pattern with:
    • Clear section comments (// SETUP, // EXECUTE, // ASSERT)
    • Named constants replacing all magic numbers
    • Descriptive assertion messages
  • Added mandatory test structure guidelines to CLAUDE.md and tester-subagent.md
  • Enhanced documentation in lib.rs with detailed deadline calculation explanations

Test Results

All 23 tests pass (up from 18 due to test refactoring that split one oversized test into 6 focused tests).

Additional Info

This aligns instance cleanup with actual beacon chain requirements rather than arbitrary slot-based timeouts, allowing instances to complete consensus within their protocol-defined windows.

Adds test to verify QBFT Committee instances can reach late rounds
(9+) as configured with max_round=12.

The test creates a Committee instance, forces round changes by
keeping operators offline, then advances through multiple slots
while verifying the instance survives to reach round 10.

Currently fails - instance is cleaned up after 2 slots, reaching
round 9 but unable to complete it (needs 120s, gets 8s).

claude-code-actions-sigp bot commented Oct 25, 2025

Claude finished @diegomrsantos's task


Code Review Complete

I've analyzed this PR, which adds a test demonstrating a QBFT instance cleanup issue. The test is well-intentioned but has several areas for improvement.


Summary

This PR adds a single test (test_committee_can_reach_late_rounds) that successfully demonstrates a critical bug: QBFT Committee instances are cleaned up too aggressively, preventing them from reaching late rounds (9-12) despite being configured with max_round=12.

The Core Issue:

  • QBFT_RETAIN_SLOTS = 1 (line 43 in lib.rs) causes instances starting at slot 0 to be cleaned up at slot 2
  • This happens only 24 seconds into execution, during round 9's 120-second timeout
  • Committee duties need ~496 seconds (~41 slots) to complete all 12 rounds
  • Current cleanup removes them after just 2 slots (24 seconds)

Code Quality Assessment

Positive Aspects ✅

  1. Clear Test Intent: The test effectively demonstrates the cleanup issue
  2. Good Use of Test Infrastructure: Leverages existing TestContext and helper methods appropriately
  3. Proper Async Setup: Uses #[tokio::test(start_paused = true)] for deterministic time control
  4. Well-Commented: Extensive comments explain the test's purpose and timing calculations

Issues Identified ⚠️

1. Test Execution State (Critical)

The PR description states the test "currently fails," but the test is not marked as #[ignore] or #[should_panic]. This means:

  • CI will fail when this PR is merged
  • Future developers won't know if failure is expected
  • No clear path to knowing when the issue is fixed

Recommendation: Add #[ignore] with an issue reference, or restructure as a negative test that verifies current behavior.

2. Mathematical Errors in Comments

The comment at line 940 contains timing calculation errors:

  • States "Rounds 1-8: 16s, Rounds 9-10: 240s = 256s total"
  • Actually: rounds 1-8 are 2s each (16s total), round 9 is 120s (ending at 136s), round 10 is 120s (ending at 256s)
  • At slot 22 (264 seconds), round 10's timeout has already expired 8 seconds earlier, so the instance is just past round 10, not "around round 10"

Impact: Confuses future maintainers about exact timing expectations.

3. Unrealistic Loop Bound

The test loops to slot 50 but always breaks at slot 22, creating confusion about intent. A more realistic bound (e.g., 1..=25) would better communicate expectations.

4. Missing Explicit Verification

The test doesn't verify that:

  • The instance is actually progressing through rounds 9-10
  • The instance survives past slot 2 (the critical cleanup point)
  • The specific failure mode matches the described behavior

Security Assessment 🔒

No security concerns identified. This is purely a test addition demonstrating a reliability issue.


Performance Assessment ⚡

Test Performance: Good use of start_paused = true makes this test run in simulated time, not real-time. The test should complete quickly despite simulating 264+ seconds.

Production Impact: The underlying issue this test demonstrates is production-critical:

  • Under poor network conditions requiring round changes, Committee instances will timeout prematurely
  • This affects consensus reliability for Committee duties
  • Needs urgent attention as max_round=12 is impossible to reach with current cleanup settings

Testing Assessment 🧪

Test Coverage

  • ✅ Tests the specific scenario (late rounds with Committee duties)
  • ✅ Uses realistic timing (12s slots, proper round timeouts)
  • ❌ Doesn't test boundary conditions (exactly slot 2 when cleanup happens)
  • ❌ Doesn't verify intermediate state (which round the instance is actually in)
  • ❌ Not marked as failing/ignored, will break CI

Test Quality Recommendations

  1. Add intermediate assertions to verify instance progress:
// After slot 2, verify instance wasn't cleaned up
// After slot 11, verify we're in round 10
  2. Consider splitting into two tests:

    • Negative test: "test_cleanup_removes_old_instances" (passes today)
    • Positive test: "test_committee_can_reach_late_rounds" (ignored until fix)
  3. Add metrics or logging inspection to verify actual round progression


Documentation Assessment 📚

Strengths

  • Test has comprehensive inline comments explaining purpose
  • PR description clearly articulates the issue
  • Timeline breakdown helps understand the problem

Needs Improvement

  • Should reference a GitHub issue for tracking the fix
  • Comment calculations need correction (see inline comment)
  • Should document expected failure mode more explicitly

Broader Context: The 86 Files Changed

This PR shows 86 files changed because it's based on the stable branch but includes 32 other commits from unstable. The actual change for THIS feature is only 1 file: anchor/qbft_manager/src/tests.rs (+49 lines).

The other 85 files are unrelated changes that happened to be merged into unstable between when this branch diverged and now. This is expected for a long-lived branch but can make reviewing confusing.

Recommendation: Consider rebasing onto the latest stable branch to reduce noise, or clarify in the PR description that only the test file is the new change.


Critical Production Issue Identified 🚨

While reviewing this test, I identified that the root cause is a production-critical bug in anchor/qbft_manager/src/lib.rs:43:

const QBFT_RETAIN_SLOTS: u64 = 1;

This value is too aggressive for Committee duties:

  • Committee/Aggregator roles need ~41 slots to complete max_round=12
  • Current setting cleans up after just 2 slots
  • Proposer/SyncCommittee need ~11 slots for max_round=6
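The "~41 slots" figure above can be sanity-checked with a small back-of-the-envelope calculation, assuming the timeouts quoted in this review (rounds 1-8 at 2s each, every later round at 120s, 12s slots); these constants are taken from the review text, not read from the crate.

```rust
/// Whole slots needed for a QBFT instance to exhaust `max_round` rounds,
/// assuming the timeouts quoted in this review (not the crate's constants):
/// rounds 1-8 take 2s each, every later round takes 120s, slots are 12s.
fn slots_to_exhaust(max_round: u64) -> u64 {
    const QUICK_TIMEOUT_SECS: u64 = 2;
    const SLOW_TIMEOUT_SECS: u64 = 120;
    const QUICK_ROUNDS: u64 = 8;
    const SECS_PER_SLOT: u64 = 12;

    let quick = max_round.min(QUICK_ROUNDS) * QUICK_TIMEOUT_SECS;
    let slow = max_round.saturating_sub(QUICK_ROUNDS) * SLOW_TIMEOUT_SECS;
    // Round up to whole slots.
    (quick + slow + SECS_PER_SLOT - 1) / SECS_PER_SLOT
}

fn main() {
    // max_round = 12: 8*2s + 4*120s = 496s, i.e. ~41.3 slots (42 whole slots),
    // far beyond the 2 slots the old cleanup allowed.
    assert_eq!(slots_to_exhaust(12), 42);
}
```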

Recommended Fix (in a separate PR):

  1. Immediate: Increase to QBFT_RETAIN_SLOTS = 50 to accommodate worst-case
  2. Better: Implement role-based retention policies
  3. Best: Event-based cleanup (like go-ssv) - remove on new duty start, not time-based

Overall Assessment

| Category      | Rating        | Notes                                           |
| ------------- | ------------- | ----------------------------------------------- |
| Code Quality  | ⚠️ Needs Work | Test logic sound, but execution state unclear   |
| Security      | ✅ Good       | No concerns                                     |
| Performance   | ✅ Good       | Efficient test design                           |
| Testing       | ⚠️ Needs Work | Will break CI, needs #[ignore] or restructuring |
| Documentation | ⚠️ Needs Work | Math errors, unclear expectations               |

Recommendations

Before Merging (Required)

  1. Fix the test execution state: Add #[ignore] or restructure as negative test
  2. Correct the timing calculations in comments (see inline reviews)
  3. Create a GitHub issue tracking the cleanup bug and reference it in comments
  4. Update PR description to clarify only 1 file is actually being changed

Consider (Suggested)

  1. Add intermediate assertions verifying instance survival and round progression
  2. Reduce loop bound from 50 to realistic value (22-25)
  3. Add a companion test verifying current cleanup behavior explicitly
  4. Consider rebasing onto latest stable to reduce diff noise

Follow-up Work (Separate PR)

  1. Fix the root cause: Adjust QBFT_RETAIN_SLOTS or implement role-based/event-based cleanup
  2. Remove #[ignore] from this test once fix is implemented
  3. Add tests for other duty types (Proposer, SyncCommittee, Aggregator)

Inline Comments

I've left 4 detailed inline comments on specific code sections. Please review each for technical details on:

  • Test logic issues (loop bounds)
  • Mathematical errors (timing calculations)
  • Test design improvements (boundary testing)
  • Documentation clarity (failure expectations)

Conclusion: This is valuable work that identifies a real production issue. With the recommended changes (particularly adding #[ignore] and fixing comment math), this test will serve as excellent documentation of the problem and a clear success criterion for the fix.


// Advance through multiple slots while QBFT progresses
// This triggers cleanup logic which should NOT remove the active instance
for slot in 1..=50 {


Test Logic Issue: Unrealistic loop bound

The test loops to slot 50, but breaks at slot 22. This creates confusion about the test's intent. Consider these improvements:

  1. Use a more reasonable upper bound (e.g., 1..=25) that better reflects when you expect the test to complete
  2. Add assertion after the loop to verify we actually broke at slot 22 and didn't fall through
  3. Consider making this data-driven: Calculate the expected slot based on round timeout constants rather than hard-coding slot 22
const EXPECTED_ROUND_10_SLOT: u64 = 22; // Document the calculation
for slot in 1..=EXPECTED_ROUND_10_SLOT + 3 {  // Small buffer
    clock.set_slot(slot);
    tokio::time::sleep(slot_duration).await;
    
    if slot == EXPECTED_ROUND_10_SLOT {
        context.set_operators_online(&[2, 3, 4]);
        break;
    }
}
// Verify we didn't fall through
assert!(clock.now().unwrap().as_u64() == EXPECTED_ROUND_10_SLOT);

for slot in 1..=50 {
clock.set_slot(slot);
tokio::time::sleep(slot_duration).await;


Mathematical Accuracy: Comment contains calculation error

The comment states "Rounds 1-8: 16s", which is correct only as a cumulative figure:

  • Rounds 1-8 each take 2 seconds (QUICK_TIMEOUT)
  • Total for rounds 1-8: 8 × 2s = 16s
  • The comment should clarify this is cumulative, not per-round

Correct breakdown to round 10:

  • Rounds 1-8: 8 × 2s = 16s
  • Round 9: 120s (completes at 136s)
  • Round 10: 120s (completes at 256s)
  • Total to complete round 10: 256s ≈ 21.33 slots

At 12s per slot:

  • 256s ÷ 12s/slot ≈ 21.33 slots
  • So at slot 22 (264s), round 10's timeout has already expired 8 seconds earlier; the instance is just past round 10, not merely "around round 10"

Suggest updating comment to:

// At slot 22 (264 seconds = 22 * 12s):
// - Rounds 1-8: 8 * 2s = 16s
// - Round 9: 120s (completes at 136s / slot 11.33)
// - Round 10: 120s (completes at 256s / slot 21.33)
// - So at slot 22, round 10 has just timed out
if slot == 22 {

This makes the test's timing expectations explicit and verifiable.

// Keep 3 operators offline initially to prevent consensus and force round changes.
// With only 1 operator online out of 4, we cannot reach quorum (need 3).
// This will cause the instance to go through multiple round changes.
context.set_operators_offline(&[2, 3, 4]);


Test Design: Consider testing the actual cleanup boundary

The test keeps 3 out of 4 operators offline to force round changes, which is good. However, it doesn't verify the specific failure mode mentioned in the PR description. Consider adding:

  1. Log inspection or metrics to verify the instance actually progresses through rounds 9-10
  2. Intermediate assertions to check the instance is still alive at critical points (e.g., after slot 2 when cleanup occurs)
  3. Test the boundary condition: What happens at exactly slot 2 when cleanup runs?

Example enhancement:

// Keep 3 operators offline to force round changes
context.set_operators_offline(&[2, 3, 4]);

// Advance to slot 2 where cleanup happens (cutoff = slot 2 - 1 = slot 1)
// At this point, the instance starting at slot 0 should be removed
clock.set_slot(1);
tokio::time::sleep(slot_duration).await;

clock.set_slot(2);
tokio::time::sleep(slot_duration).await;

// TODO: Add assertion here to verify instance is still alive
// This is where the bug manifests - instance gets cleaned up too early

This would make the test more explicitly demonstrate the issue described in the PR.

// The test simulates network conditions where consensus cannot be reached early by keeping
// all but one operator offline, forcing round changes. We advance the slot to trigger
// cleanup and verify the instance survives to reach round 10.
async fn test_committee_can_reach_late_rounds() {


Documentation: Test expectations need clarification

The comment says "Currently fails" but doesn't specify:

  1. What failure mode to expect - Does it panic? Timeout? Return an error?
  2. Whether this is expected behavior - Is this test marked with #[should_panic] or #[ignore]?
  3. When it should pass - What changes need to be made for this test to pass?

Recommendations:

  1. If the test is expected to fail, use #[ignore] or #[should_panic]:
#[tokio::test(start_paused = true)]
#[ignore = "Fails due to premature cleanup - see issue #XXX"]
async fn test_committee_can_reach_late_rounds() {
  2. Or, restructure as a negative test that explicitly verifies the current behavior:
#[tokio::test(start_paused = true)]
async fn test_committee_cleanup_prevents_late_rounds() {
    // Explicitly test that instances ARE cleaned up at slot 2
    // This documents current behavior before fix is implemented
  3. Add a GitHub issue reference so the test can be tracked to a fix

This prevents CI from failing and clearly communicates test intent to future developers.

@diegomrsantos diegomrsantos changed the base branch from stable to unstable October 25, 2025 16:35
@diegomrsantos diegomrsantos marked this pull request as draft October 28, 2025 20:48
Replace slot-based cleanup with duty-specific beacon chain inclusion
deadlines. This allows QBFT instances to progress through all configured
rounds without premature removal.

Key changes:
- Separate instance identity from manager metadata using ManagedInstance wrapper
- Calculate duty-specific deadlines per EIP-7045 (attestations valid until end of epoch E+1)
- Add slots_per_epoch configuration parameter
- Implement dual-trigger cleaner (completion notification + deadline timeout)

Fixes instances being cleaned after 2 slots, now properly respecting
beacon chain inclusion windows (32-63 slots for attestations).
@diegomrsantos diegomrsantos added the claude-recheck triggers claude review workflow to re-run label Oct 29, 2025
@diegomrsantos diegomrsantos self-assigned this Oct 29, 2025
@diegomrsantos diegomrsantos added QBFT and removed claude-recheck triggers claude review workflow to re-run labels Oct 29, 2025
@diegomrsantos diegomrsantos marked this pull request as ready for review October 29, 2025 19:23

claude-code-actions-sigp bot commented Oct 29, 2025

Claude finished @diegomrsantos's task


Code Review Complete

I've thoroughly reviewed this PR, which adds both a test demonstrating a QBFT cleanup issue AND implements a comprehensive fix. This is excellent work that addresses a production-critical bug.


Summary

This PR successfully implements a duty-aware cleanup system for QBFT instances, replacing the overly aggressive slot-based cleanup that was preventing Committee instances from reaching late rounds (9-12). The solution is architecturally sound and follows best practices.

Key Achievement: Instances configured with max_round=12 can now properly complete all 12 rounds, as the cleanup respects beacon chain inclusion deadlines (32-63 slots for attestations per EIP-7045) instead of removing instances after just 2 slots.


Changes Overview

Commit 1: Test (82eaa98)

Adds test_committee_can_reach_late_rounds() demonstrating the cleanup issue.

Commit 2: Fix (4767f9d)

  • Replaces QBFT_RETAIN_SLOTS constant with duty-specific deadline calculation
  • Adds ManagedInstance wrapper to track deadline per instance
  • Implements calculate_deadline() function per EIP-7045
  • Adds slots_per_epoch parameter to QbftManager
  • Implements dual-trigger cleaner (completion notification + deadline timeout)
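The dual-trigger cleaner can be illustrated with a simplified, synchronous sketch. This is a stand-in, not the crate's implementation: the real manager uses DashMap, an unbounded channel, and tokio::select!, while this sketch uses std types and hypothetical numeric instance IDs.

```rust
use std::collections::HashMap;
use std::sync::mpsc;

// Simplified stand-in for the crate's ManagedInstance; in the real code
// this also carries the message sender for the instance.
struct ManagedInstance {
    deadline_slot: u64,
}

struct Manager {
    instances: HashMap<u64, ManagedInstance>,
    completions: mpsc::Receiver<u64>, // instance IDs that finished consensus
}

impl Manager {
    /// Trigger 1: remove instances immediately when they report completion.
    fn drain_completions(&mut self) {
        while let Ok(id) = self.completions.try_recv() {
            self.instances.remove(&id);
        }
    }

    /// Trigger 2: once per slot, sweep out instances past their deadline.
    fn sweep_expired(&mut self, current_slot: u64) {
        self.instances
            .retain(|_, inst| inst.deadline_slot >= current_slot);
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let mut mgr = Manager { instances: HashMap::new(), completions: rx };
    mgr.instances.insert(1, ManagedInstance { deadline_slot: 63 });
    mgr.instances.insert(2, ManagedInstance { deadline_slot: 10 });

    tx.send(1).unwrap(); // instance 1 completes consensus
    mgr.drain_completions();
    assert!(!mgr.instances.contains_key(&1));

    mgr.sweep_expired(11); // instance 2's deadline (slot 10) has passed
    assert!(mgr.instances.is_empty());
}
```

The point of the two triggers: completion-based removal keeps memory bounded without waiting for a deadline, while the per-slot sweep guarantees stuck instances are eventually reaped.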

Code Quality Assessment

Strengths ✅

  1. Excellent Architecture: The separation of instance identity from manager metadata via ManagedInstance is clean and maintainable
  2. EIP-7045 Compliance: Deadline calculation correctly implements attestation validity rules
  3. Dual-Trigger Cleanup: Smart design - cleanup on completion OR deadline expiry
  4. Comprehensive Testing: Test effectively demonstrates both the problem and validates the fix
  5. Well-Documented: Comments clearly explain intent and timing calculations
  6. Proper Error Handling: Uses Result types appropriately throughout

Issues Identified ⚠️

1. Test Comment Math Error (Minor - Documentation)

Location: anchor/qbft_manager/src/tests.rs:942-943

The comment's arithmetic (256 seconds to complete round 10) is fine, but the conclusion is off: at slot 22 (264 seconds), round 10's timeout has already expired, so the instance is just past round 10, not "around round 10". See inline comment for detailed correction.

Impact: Low - doesn't affect test correctness, but could confuse future maintainers.

2. Unrealistic Loop Bound (Minor - Code Quality)

Location: anchor/qbft_manager/src/tests.rs:938

Loop bound is 50 but test always breaks at 22, creating confusion about intent. See inline comment for suggestion.

Impact: Low - test works correctly but readability suffers.

3. Missing Intermediate Assertions (Minor - Test Coverage)

Test doesn't explicitly verify instance survives cleanup at slot 2 (the critical failure point in old code) or that instance actually progresses through rounds 9-10.

Impact: Low - current test validates end-to-end behavior, but doesn't document the specific failure mode being fixed.

4. Missing Documentation (Minor - Maintainability)

ManagedInstance struct and calculate_deadline() function could benefit from enhanced documentation. See inline comments for suggestions.

Impact: Low - code is clear, but doc comments would help future maintainers.


Security Assessment 🔒

No security concerns identified. Changes are focused on improving reliability without introducing attack vectors.

Positive Security Impact:

  • Fixes a DoS-like condition where legitimate instances could be prematurely cleaned up
  • Instances now survive as long as beacon chain allows inclusion, improving liveness

Performance Assessment ⚡

Improvements ✅

  1. Reduced Lock Contention: Using DashMap with per-instance metadata is efficient
  2. Dual-Trigger Cleanup: Immediate cleanup on completion prevents memory leaks
  3. Test Performance: Good use of start_paused = true for deterministic timing

No Performance Regressions ✅

  • Deadline calculation is O(1) per instance creation
  • Cleanup loop runs once per slot (same as before)
  • Memory footprint actually improved (instances cleaned sooner on completion)

Testing Assessment 🧪

Test Coverage ✅

  • Test validates the critical scenario (Committee reaching late rounds)
  • Test uses realistic timing (12s slots, proper round timeouts)
  • Test properly simulates network conditions (keeping operators offline)
  • verify_consensus() confirms successful completion

What's Missing ⚠️

  1. No negative test verifying old behavior would have failed
  2. No boundary testing for the cleanup at slot 2 specifically
  3. No tests for other roles (Proposer, SyncCommittee, Aggregator deadlines)
  4. No test for deadline expiry (what happens when deadline passes without completion)

Recommendations

Consider adding follow-up tests:

#[tokio::test]
async fn test_proposer_deadline_shorter_than_committee() { }

#[tokio::test]  
async fn test_instance_cleanup_on_deadline_expiry() { }

#[tokio::test]
async fn test_immediate_cleanup_on_completion() { }

Documentation Assessment 📚

Strengths ✅

  • Test has comprehensive inline comments
  • PR description clearly articulates the problem and solution
  • Commit messages follow conventional commits format
  • Code comments explain non-obvious logic

Areas for Improvement ⚠️

  1. Math error in test comment (see inline comment on tests.rs:943)
  2. calculate_deadline() could use more detail about EIP-7045 reasoning (see inline comment on lib.rs:45)
  3. Missing doc comments on ManagedInstance struct (see inline comment on lib.rs:119)

Idiomatic Rust & Best Practices

Excellent Rust Usage ✅

  1. Type Safety: ManagedInstance provides type-safe deadline tracking
  2. Error Handling: Proper use of Result and ? operator throughout
  3. Async Design: Clean use of tokio::select! for dual-trigger pattern
  4. Zero Unsafe: No unsafe code needed
  5. Memory Safety: Proper use of Arc, DashMap, and channels

Follows Repo Guidelines ✅

  • Modular design with clear boundaries ✅
  • Error types with context ✅
  • Async/await with Tokio ✅
  • Clear, descriptive naming ✅
  • Comments explain "why" not "what" ✅

Specific Technical Review

calculate_deadline() Function

Location: anchor/qbft_manager/src/lib.rs:43-60

Analysis: ✅ Excellent implementation

  • Correctly implements EIP-7045 for attestations
  • Proposer/SyncCommittee must be same-slot (correct)
  • Voluntary exits get one epoch (reasonable)
  • No overflow handling needed (epoch arithmetic safe in practice)

ManagedInstance Struct

Location: anchor/qbft_manager/src/lib.rs:119-122

Analysis: ✅ Clean separation of concerns

  • Separates identity (map key) from metadata (deadline + sender)
  • Allows different deadlines for different instance types
  • Type parameter preserves type safety

Cleaner Implementation

Location: anchor/qbft_manager/src/lib.rs:328-366

Analysis: ✅ Excellent dual-trigger design

  • Immediate cleanup on completion prevents memory leaks
  • Slot-based sweep removes expired instances
  • Handles both normal and timeout cases
  • Properly uses select! for concurrent event handling

No issues found - this is production-ready code.


Breaking Changes

None. This is a pure enhancement that:

  • Maintains backward compatibility (same public API)
  • Only changes internal cleanup logic
  • Tests demonstrate existing functionality still works

Overall Assessment

| Category      | Rating       | Notes                                       |
| ------------- | ------------ | ------------------------------------------- |
| Code Quality  | ✅ Excellent | Clean, maintainable, follows best practices |
| Security      | ✅ Good      | No concerns, improves liveness              |
| Performance   | ✅ Good      | No regressions, potentially better          |
| Testing       | ✅ Good      | Comprehensive, could add more edge cases    |
| Documentation | ⚠️ Good      | Minor math error, otherwise excellent       |
| Architecture  | ✅ Excellent | Well-designed, scalable solution            |

Recommendations

Before Merging (Optional Improvements)

  1. Fix test comment math at line 942-943 (see inline comment)
  2. Reduce loop bound from 50 to 25 for clarity (see inline comment)
  3. Add doc comment to ManagedInstance explaining its purpose (see inline comment)
  4. Enhance calculate_deadline comment with EIP-7045 calculation detail (see inline comment)

Follow-up Work (Separate PRs)

  1. Add tests for other duty types (Proposer, SyncCommittee)
  2. Add test for deadline expiry behavior
  3. Add test verifying immediate cleanup on completion
  4. Consider adding metrics for instance cleanup (completion vs deadline)

Conclusion

This is excellent work that solves a real production issue with a well-architected solution. The code is clean, follows Rust best practices, and adheres to the project's guidelines. The few minor issues identified are documentation improvements rather than functional problems.

Recommendation: ✅ Approve with minor documentation improvements

The PR is ready to merge as-is. The suggested documentation improvements would make the code even better but are not blockers.


Review Summary: I've left 4 inline comments with specific suggestions for documentation improvements. All issues are minor and don't block merging. The implementation is sound and production-ready.

tokio::time::sleep(slot_duration).await;

// At slot 22 (256 seconds = 16s + 240s), we should be around round 10
// Rounds 1-8: 16s, Rounds 9-10: 240s = 256s total


Math Error in Comment

The comment's calculation is incorrect. Here's the accurate breakdown:

Correct Timing:

  • Rounds 1-8: 8 × 2s = 16s (completes at ~slot 1.33)
  • Round 9: 120s (completes at 136s / ~slot 11.33)
  • Round 10: 120s (completes at 256s / ~slot 21.33)
  • At slot 22 (264s): round 10's timeout expired 8 seconds earlier, so the instance is just past round 10

Suggested fix:

// At slot 22 (264 seconds = 22 * 12s):
// - Rounds 1-8: 8 * 2s = 16s (completes at slot 1.33)
// - Round 9: 120s (completes at 136s / slot 11.33)
// - Round 10: 120s (completes at 256s / slot 21.33), just timed out
if slot == 22 {

This makes the timing expectations explicit and verifiable.

// Map from an identifier to a sender for the instance
type Map<I, D> = DashMap<I, UnboundedSender<QbftMessage<D>>>;
// Manager's bookkeeping for an instance
pub struct ManagedInstance<D: QbftData> {


Consider Adding Documentation

The ManagedInstance struct is a key part of the new architecture but lacks documentation. Consider adding:

/// Manager's bookkeeping for a QBFT instance.
/// 
/// Tracks the communication channel and beacon chain inclusion deadline
/// for each active instance. The `deadline` field determines when the
/// instance should be cleaned up if not completed earlier. Instances are
/// removed either when they complete (via completion notification) or when
/// their deadline expires (via the periodic cleaner sweep).
pub struct ManagedInstance<D: QbftData> {
    sender: UnboundedSender<QbftMessage<D>>,
    deadline: types::Slot,
}

This helps future maintainers understand the purpose and lifecycle management.

/// Calculate the beacon chain inclusion deadline for a duty
fn calculate_deadline(role: Role, slot: types::Slot, slots_per_epoch: u64) -> types::Slot {
match role {
Role::Committee | Role::Aggregator => {


Enhance Comment for Clarity

The calculation (epoch.as_u64() + 2) * slots_per_epoch - 1 implements EIP-7045 correctly, but the "why" could be clearer:

Suggested enhancement:

Role::Committee | Role::Aggregator => {
    // Attestations can be included until end of next epoch (epoch E+1)
    // Per EIP-7045: attestation from epoch E valid until end of epoch E+1
    // 
    // Calculation explanation:
    // - Epoch E+1 starts at: (E+1) * slots_per_epoch
    // - Epoch E+1 ends at: (E+2) * slots_per_epoch - 1 (last slot of E+1)
    let epoch = slot.epoch(slots_per_epoch);
    types::Slot::new((epoch.as_u64() + 2) * slots_per_epoch - 1)
}

This makes it immediately clear why we use E+2 in the calculation.

@dknopik dknopik added the v1.2.0 label Nov 7, 2025
@diegomrsantos diegomrsantos changed the title test: add test for Committee instances reaching late rounds refactor: use beacon chain deadlines for QBFT instance cleanup Nov 12, 2025
diegomrsantos and others added 2 commits November 12, 2025 14:09
Improve test readability by applying Setup/Execute/Assert structure:

- Replace magic numbers with named constants (SINGLE_INSTANCE, TWO_INSTANCES, etc.)
- Add mandatory section comments (// SETUP, // EXECUTE, // ASSERT) to all new tests
- Split oversized test_role_based_deadline_calculations into 6 focused tests (one per role)
- Add descriptive assertion messages explaining what must be true
- Named all literals in new tests (OLD_CLEANUP_SLOT, BEACON_DEADLINE_SLOT, etc.)

All 23 tests pass (up from 18 due to role deadline test split).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add Setup/Execute/Assert pattern as mandatory requirement for all tests:

CLAUDE.md:
- Add "Writing Tests" section requiring tester-subagent usage before writing any test code
- Ensures consistent test structure across codebase

tester-subagent.md:
- Add section 4 "Test Structure Requirements" with generic example
- Define 4 mandatory requirements: section comments, named constants, assertion messages, one behavior per test
- List anti-patterns to reject (no section comments, magic numbers, silent assertions, mixed code)
- Update description to emphasize "MUST BE USED before writing ANY test code"

This ensures all new tests follow readable, maintainable patterns.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add detailed explanation for Committee/Aggregator deadline calculation:
- Document the calculation formula: (E+2) * slots_per_epoch - 1
- Explain that this represents the last slot for on-chain inclusion
- Reference EIP-7045 specification

Enhance ManagedInstance documentation:
- Convert to doc comment for better API documentation
- Clarify that it tracks both channel and beacon chain deadline
- Explain its role in the cleanup task

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>