QBFT instances cleaned up too aggressively, preventing late rounds #720

## Context

Committee and Aggregator duties are configured with `max_round = 12` in QBFT, allowing instances to progress through 12 rounds of consensus before timing out. Given the round timeout structure:
- Rounds 1-8: 2 seconds each (16 seconds total)
- Rounds 9-12: 120 seconds each (480 seconds total)

An instance needs approximately 496 seconds (~41 slots) to complete all 12 configured rounds.
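The arithmetic behind that figure can be checked with a minimal sketch. The constant and function names below are hypothetical and not taken from Anchor; the 12-second slot length is inferred from "2 slots (24 seconds)" later in this issue.

```rust
// Hypothetical constants mirroring the schedule listed above.
const QUICK_ROUNDS: u64 = 8; // rounds 1-8 use the short timeout
const QUICK_TIMEOUT_SECS: u64 = 2;
const SLOW_TIMEOUT_SECS: u64 = 120; // rounds 9 and later
const SLOT_SECS: u64 = 12; // assumed slot duration

/// Timeout of a single 1-indexed round.
fn round_timeout_secs(round: u64) -> u64 {
    if round <= QUICK_ROUNDS {
        QUICK_TIMEOUT_SECS
    } else {
        SLOW_TIMEOUT_SECS
    }
}

/// Total time needed for rounds 1..=max_round to all time out.
fn total_secs(max_round: u64) -> u64 {
    (1..=max_round).map(round_timeout_secs).sum()
}

fn main() {
    let total = total_secs(12);
    let slots = total as f64 / SLOT_SECS as f64;
    println!("all 12 rounds: {total} s, about {slots:.1} slots");
}
```
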

## Current Behavior

Anchor's QBFT cleanup runs **every slot** with `QBFT_RETAIN_SLOTS = 1`, removing instances after only 2 slots (24 seconds).

**Code:** `anchor/qbft_manager/src/lib.rs:277-295`

```rust
async fn cleaner(self: Arc<Self>, slot_clock: impl SlotClock) {
    while !self.processor.permitless.is_closed() {
        sleep(
            slot_clock
                .duration_to_next_slot()
                .unwrap_or(slot_clock.slot_duration()),
        )
        .await;
        let Some(slot) = slot_clock.now() else {
            continue;
        };
        let cutoff = slot.saturating_sub(QBFT_RETAIN_SLOTS);
        self.beacon_vote_instances
            .retain(|k, _| *k.instance_height >= cutoff.as_usize());
        self.validator_consensus_data_instances
            .retain(|k, _| *k.instance_height >= cutoff.as_usize());
    }
}
```

**Timeline for instance at slot 100:**
- Slot 100: Instance created
- Slot 101: Cleaner keeps instance (cutoff = 100, instance 100 >= 100)
- Slot 102: Cleaner removes instance (cutoff = 101, instance 100 < 101)

**Result:** Instance lifetime is 2 slots (24 seconds). The instance reaches round 9 but is killed 8 seconds into its 120-second timeout, preventing completion of round 9 or reaching rounds 10-12.
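The timeline can be reproduced with a small stand-alone sketch of the retain rule. Only `QBFT_RETAIN_SLOTS = 1` and the `>= cutoff` comparison come from the quoted code; the map type, the `clean` and `removal_slot` helpers, and the plain `u64` heights are simplifications assumed for illustration.

```rust
use std::collections::BTreeMap;

// Value quoted in this issue; everything else here is a hypothetical stand-in.
const QBFT_RETAIN_SLOTS: u64 = 1;

/// Apply the cleaner's retain rule once, as if the clock had just reached `current_slot`.
fn clean(instances: &mut BTreeMap<u64, &'static str>, current_slot: u64) {
    let cutoff = current_slot.saturating_sub(QBFT_RETAIN_SLOTS);
    instances.retain(|height, _| *height >= cutoff);
}

/// First slot at which an instance created at slot `created` is removed.
fn removal_slot(created: u64) -> u64 {
    let mut instances = BTreeMap::new();
    instances.insert(created, "instance");
    let mut slot = created;
    loop {
        slot += 1; // the cleaner wakes up at each slot boundary
        clean(&mut instances, slot);
        if !instances.contains_key(&created) {
            return slot;
        }
    }
}

fn main() {
    let created = 100;
    let removed = removal_slot(created);
    let lifetime = removed - created;
    println!("created at slot {created}, removed at slot {removed}");
    println!("lifetime: {lifetime} slots = {} s", lifetime * 12);
}
```
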

## Comparison with Go-SSV

Go-SSV uses **event-based cleanup**: instances are removed only when a new duty instance starts, not on a fixed time schedule.

**Code:** [`controller.go:StartNewInstance()`](https://github.com/ssvlabs/ssv/blob/main/protocol/v2/qbft/controller/controller.go#L149-L180)

```go
func (c *Controller) StartNewInstance(...) {
    // ... create and start new instance ...
    c.forceStopAllInstanceExceptCurrent()  // Cleanup only when new duty starts
}
```

Since attestation duties occur once per epoch (32 slots), instances live for ~32 slots (384 seconds). Under the round timeouts listed in Context, that is enough to complete round 11 (at 376 seconds) and enter round 12, the configured maximum.
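To compare the two lifetimes, the following sketch maps elapsed time to the active round. It assumes the instance starts at the beginning of its slot and that every round runs to its full timeout; `round_at` is a hypothetical helper, not part of Anchor or Go-SSV.

```rust
/// Timeout of a 1-indexed round, per the schedule in the Context section.
fn round_timeout_secs(round: u64) -> u64 {
    if round <= 8 {
        2
    } else {
        120
    }
}

/// Round an instance is in after `elapsed` seconds, with the seconds already
/// spent in that round. Returns None once all `max_round` rounds have timed out.
fn round_at(elapsed: u64, max_round: u64) -> Option<(u64, u64)> {
    let mut start = 0;
    for round in 1..=max_round {
        let end = start + round_timeout_secs(round);
        if elapsed < end {
            return Some((round, elapsed - start));
        }
        start = end;
    }
    None
}

fn main() {
    // Lifetimes taken from this issue: 2 slots for Anchor, ~32 slots for Go-SSV.
    for (label, secs) in [("Anchor, 24 s", 24), ("Go-SSV, 384 s", 384)] {
        match round_at(secs, 12) {
            Some((round, into)) => println!("{label}: round {round}, {into} s in"),
            None => println!("{label}: all rounds exhausted"),
        }
    }
}
```

Under these assumptions the 24-second lifetime ends 8 seconds into round 9, while the 384-second lifetime ends 8 seconds into round 12.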

## Impact

Committee and Aggregator instances cannot use the full round budget they are configured with. Despite `max_round = 12`, an instance is cleaned up 8 seconds into round 9, so rounds 10-12 are never reached under adverse network conditions.

## Reproduction

See PR #719, which adds a test demonstrating this behavior: `test_committee_can_reach_late_rounds()` fails because the instance is cleaned up at slot 2 while trying to reach round 10.

