QBFT instances cleaned up too aggressively, preventing late rounds #720

@diegomrsantos

Context

Committee and Aggregator duties are configured with max_round = 12 in QBFT, allowing instances to progress through 12 rounds of consensus before timing out. The round timeouts are structured as follows:

  • Rounds 1-8: 2 seconds each (16 seconds total)
  • Rounds 9-12: 120 seconds each (480 seconds total)

An instance needs approximately 496 seconds (~41 slots) to complete all 12 configured rounds.
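The 496-second figure can be reproduced with a short sketch. The constant and function names below are hypothetical and do not come from the Anchor codebase; the timeout values and the 12-second slot duration are the ones this issue assumes.

```rust
use std::time::Duration;

// Values taken from the issue text; the names are hypothetical.
const QUICK_ROUNDS: u64 = 8;
const QUICK_TIMEOUT: Duration = Duration::from_secs(2);
const SLOW_TIMEOUT: Duration = Duration::from_secs(120);
const SLOT_DURATION: Duration = Duration::from_secs(12);

/// Timeout of a single round (1-indexed).
fn round_timeout(round: u64) -> Duration {
    if round <= QUICK_ROUNDS {
        QUICK_TIMEOUT
    } else {
        SLOW_TIMEOUT
    }
}

/// Total time needed for rounds 1..=max_round to each run to their timeout.
fn time_to_complete(max_round: u64) -> Duration {
    (1..=max_round).map(round_timeout).sum()
}

/// Number of whole slots needed to cover `total`, rounded up.
fn slots_needed(total: Duration) -> u64 {
    total.as_secs().div_ceil(SLOT_DURATION.as_secs())
}
```

Under these assumptions, `time_to_complete(12)` is 496 seconds and `slots_needed` rounds that up to 42 whole slots, consistent with the ~41 slots quoted above.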

Current Behavior

Anchor's QBFT cleanup runs every slot with QBFT_RETAIN_SLOTS = 1, removing instances after only 2 slots (24 seconds).

Code: anchor/qbft_manager/src/lib.rs:277-295

async fn cleaner(self: Arc<Self>, slot_clock: impl SlotClock) {
    while !self.processor.permitless.is_closed() {
        sleep(
            slot_clock
                .duration_to_next_slot()
                .unwrap_or(slot_clock.slot_duration()),
        )
        .await;
        let Some(slot) = slot_clock.now() else {
            continue;
        };
        let cutoff = slot.saturating_sub(QBFT_RETAIN_SLOTS);
        self.beacon_vote_instances
            .retain(|k, _| *k.instance_height >= cutoff.as_usize());
        self.validator_consensus_data_instances
            .retain(|k, _| *k.instance_height >= cutoff.as_usize());
    }
}

Timeline for instance at slot 100:

  • Slot 100: Instance created
  • Slot 101: Cleaner keeps instance (cutoff = 100, instance 100 >= 100)
  • Slot 102: Cleaner removes instance (cutoff = 101, instance 100 < 101)

Result: Instance lifetime is 2 slots (24 seconds). The instance reaches round 9 but is killed 8 seconds into its 120-second timeout, preventing completion of round 9 or reaching rounds 10-12.
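The timeline can be checked with a minimal simulation of the retain predicate. This is a sketch only: the `BTreeMap` keyed by slot is a hypothetical stand-in for the real instance maps, and `clean` and `removal_slot` are illustrative helpers, not Anchor functions.

```rust
use std::collections::BTreeMap;

// Mirrors the constant named in the issue.
const QBFT_RETAIN_SLOTS: u64 = 1;

/// Applies the same cutoff rule as the cleaner for a given current slot.
/// The map key stands in for `instance_height` (hypothetical simplification).
fn clean(instances: &mut BTreeMap<u64, &'static str>, current_slot: u64) {
    let cutoff = current_slot.saturating_sub(QBFT_RETAIN_SLOTS);
    instances.retain(|height, _| *height >= cutoff);
}

/// First slot at which the cleaner drops an instance created at `created`,
/// assuming the cleaner runs once at the start of every following slot.
fn removal_slot(created: u64) -> u64 {
    let mut instances = BTreeMap::from([(created, "instance")]);
    let mut slot = created;
    loop {
        slot += 1;
        clean(&mut instances, slot);
        if instances.is_empty() {
            return slot;
        }
    }
}
```

For an instance created at slot 100, `removal_slot(100)` returns 102, matching the two-slot lifetime described above.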

Comparison with Go-SSV

Go-SSV uses event-based cleanup: instances are removed only when a new duty instance starts, not on a fixed time schedule.

Code: controller.go:StartNewInstance()

func (c *Controller) StartNewInstance(...) {
    // ... create and start new instance ...
    c.forceStopAllInstanceExceptCurrent()  // Cleanup only when new duty starts
}

Since attestation duties occur once per epoch (32 slots), instances live for ~32 slots (384 seconds). With the timeouts above, round 11 ends at 376 seconds, so this is long enough to complete round 11 and enter round 12 of the configured 12 maximum rounds.
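To check which round is active after a given instance lifetime, the round timeouts from Context can be walked cumulatively. The sketch below is illustrative only: `round_at` is a hypothetical helper, not part of Anchor or Go-SSV, and it assumes the instance starts at time zero and every round runs to its full timeout. It does not cap the result at max_round.

```rust
/// Round (1-indexed) that is active `elapsed_secs` after the instance starts,
/// assuming rounds 1-8 time out after 2 s and later rounds after 120 s.
fn round_at(elapsed_secs: u64) -> u64 {
    let mut round = 1;
    let mut round_end = 2; // round 1 ends 2 s after start
    while elapsed_secs >= round_end {
        round += 1;
        // Add the timeout of the round that has just become active.
        round_end += if round <= 8 { 2 } else { 120 };
    }
    round
}
```

Under those assumptions, `round_at(24)` gives 9 (the Anchor lifetime of 2 slots) and `round_at(384)` gives 12 (a 32-slot lifetime), since round 11 ends at 376 seconds.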

Impact

Committee and Aggregator instances cannot utilize their full fault tolerance configuration. Despite being configured with max_round = 12, instances reach round 9 but are cleaned up before completing it, preventing the protocol from reaching rounds 10-12 during adverse network conditions.

Reproduction

See PR #719 which adds a test demonstrating this behavior: test_committee_can_reach_late_rounds() fails because the instance is cleaned up at slot 2 while trying to reach round 10.
