Description
Context
Committee and Aggregator duties are configured with max_round = 12 in QBFT, allowing instances to progress through 12 rounds of consensus before timing out. Given the round timeout structure:
- Rounds 1-8: 2 seconds each (16 seconds total)
- Rounds 9-12: 120 seconds each (480 seconds total)
An instance needs 496 seconds (about 41.3 slots at 12 seconds per slot) to run all 12 configured rounds to their timeouts.
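The arithmetic above can be reproduced with a small sketch. The function names below are hypothetical and are not taken from Anchor; the 12-second slot duration is assumed from the "2 slots (24 seconds)" figure used later in this issue.

```rust
/// Timeout for a 1-based round, per the schedule listed in this issue:
/// rounds 1-8 take 2 s each, later rounds take 120 s each.
fn round_timeout_secs(round: u64) -> u64 {
    if round <= 8 { 2 } else { 120 }
}

/// Total seconds needed to run rounds 1..=max_round to their timeouts.
fn total_timeout_secs(max_round: u64) -> u64 {
    (1..=max_round).map(round_timeout_secs).sum()
}

/// Whole slots needed to cover `secs`, rounding up.
fn slots_needed(secs: u64, slot_secs: u64) -> u64 {
    (secs + slot_secs - 1) / slot_secs
}
```

With these definitions, `total_timeout_secs(12)` evaluates to 496 and `slots_needed(496, 12)` to 42, since 496 seconds is slightly more than 41 full slots.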
Current Behavior
Anchor's QBFT cleanup runs every slot with QBFT_RETAIN_SLOTS = 1, removing instances after only 2 slots (24 seconds).
Code: anchor/qbft_manager/src/lib.rs:277-295
    async fn cleaner(self: Arc<Self>, slot_clock: impl SlotClock) {
        while !self.processor.permitless.is_closed() {
            sleep(
                slot_clock
                    .duration_to_next_slot()
                    .unwrap_or(slot_clock.slot_duration()),
            )
            .await;
            let Some(slot) = slot_clock.now() else {
                continue;
            };
            let cutoff = slot.saturating_sub(QBFT_RETAIN_SLOTS);
            self.beacon_vote_instances
                .retain(|k, _| *k.instance_height >= cutoff.as_usize());
            self.validator_consensus_data_instances
                .retain(|k, _| *k.instance_height >= cutoff.as_usize());
        }
    }

Timeline for instance at slot 100:
- Slot 100: Instance created
- Slot 101: Cleaner keeps instance (cutoff = 100, instance 100 >= 100)
- Slot 102: Cleaner removes instance (cutoff = 101, instance 100 < 101)
Result: Instance lifetime is 2 slots (24 seconds). The instance reaches round 9 but is killed 8 seconds into its 120-second timeout, preventing completion of round 9 or reaching rounds 10-12.
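The timeline can be checked with a standalone simulation of the cutoff rule. This is only a sketch: the `HashMap` keyed by slot number, the `clean` and `removal_slot` helpers, and the local `QBFT_RETAIN_SLOTS` constant are hypothetical stand-ins for the real instance maps and the constant in `qbft_manager`.

```rust
use std::collections::HashMap;

/// Stand-in for the QBFT_RETAIN_SLOTS constant quoted above (assumed value: 1).
const QBFT_RETAIN_SLOTS: u64 = 1;

/// Applies one cleaner pass at `current_slot`, using the same cutoff rule
/// as the quoted `cleaner` loop: keep entries whose height is >= cutoff.
fn clean(instances: &mut HashMap<u64, &'static str>, current_slot: u64) {
    let cutoff = current_slot.saturating_sub(QBFT_RETAIN_SLOTS);
    instances.retain(|height, _| *height >= cutoff);
}

/// Returns the first slot at which an instance created at `created_slot`
/// is removed, assuming the cleaner runs once at every later slot boundary.
fn removal_slot(created_slot: u64) -> u64 {
    let mut instances = HashMap::new();
    instances.insert(created_slot, "committee");
    let mut slot = created_slot;
    loop {
        slot += 1;
        clean(&mut instances, slot);
        if instances.is_empty() {
            return slot;
        }
    }
}
```

For an instance created at slot 100, `removal_slot(100)` returns 102, which matches the 2-slot (24-second) lifetime described above.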
Comparison with Go-SSV
Go-SSV uses event-based cleanup - instances are only removed when starting a new duty instance, not on a fixed time schedule.
Code: controller.go:StartNewInstance()
    func (c *Controller) StartNewInstance(...) {
        // ... create and start new instance ...
        c.forceStopAllInstanceExceptCurrent() // Cleanup only when new duty starts
    }

Since attestation duties occur once per epoch (32 slots), instances live for ~32 slots (384 seconds), sufficient to reach round 11 of the configured 12 maximum rounds.
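To compare the two lifetimes, the sketch below maps elapsed time to the active round using the timeout schedule from the Context section. It is an idealized model, not code from either client: `round_at` is a hypothetical helper, and it assumes the instance starts exactly when its lifetime begins and that every round runs to its full timeout.

```rust
/// Returns the 1-based round that is active `elapsed_secs` after an
/// instance starts, or `None` once all configured rounds have timed out.
/// Assumes rounds 1-8 last 2 s each and later rounds last 120 s each.
fn round_at(elapsed_secs: u64, max_round: u64) -> Option<u64> {
    let mut round_end = 0;
    for round in 1..=max_round {
        round_end += if round <= 8 { 2 } else { 120 };
        if elapsed_secs < round_end {
            return Some(round);
        }
    }
    None
}
```

Under these assumptions, `round_at(24, 12)` is `Some(9)`, matching Anchor's removal partway through round 9, while `round_at(384, 12)` is `Some(12)`, because round 11 ends at 376 seconds. Real instances start some time after the slot boundary, so the exact round at cleanup can differ from this idealized figure.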
Impact
Committee and Aggregator instances cannot utilize their full fault tolerance configuration. Despite being configured with max_round = 12, instances reach round 9 but are cleaned up before completing it, preventing the protocol from reaching rounds 10-12 during adverse network conditions.
Reproduction
See PR #719 which adds a test demonstrating this behavior: test_committee_can_reach_late_rounds() fails because the instance is cleaned up at slot 2 while trying to reach round 10.