BN: New syncing algorithm #7578
base: unstable
Would it help to change (parts of) holesky/sepolia/hoodi over to this branch?
Back in the goerli/prater days, I found this very helpful for testing: merging to unstable (even with a subsequent revert) was sketchy, but not having it deployed anywhere was also not very fruitful.
The status-im/infra-nimbus repo controls the branch that is used, and it is automatically rebuilt daily. One can also pick the branch for a subset of nodes (in ~25% increments), and there is a command to resync those nodes.
My scratchpad from the goerli/holesky times, with instructions on how to connect to those servers, view the logs, restart them, and monitor their metrics:
FLEET:
Hostnames: https://metrics.status.im/d/pgeNfj2Wz23/nimbus-fleet-testnets?orgId=1&var-instance=geth-03.ih-eu-mda1.nimbus.holesky&var-container=beacon-node-holesky-testing&from=now-24h&to=now&refresh=15m
look at the instance/container dropdowns
the pattern should be fairly clear
then, to SSH to them, add .status.im
get SSH access from jakub: send him your SSH public key, and connect using -i <private key> to etan@unstable-large-01.aws-eu-central-1a.nimbus.prater.statusim.net
> geth-01.ih-eu-mda1.nimbus.holesky.statusim.net (was renamed to status.im)
geth-01.ih-eu-mda1.nimbus.holesky.status.im
https://github.com/status-im/infra-nimbus/blob/0814b659654bb77f50aac7d456767b1794145a63/ansible/group_vars/all.yml#L23
sudo systemctl --no-block start build-beacon-node-holesky-unstable && journalctl -fu build-beacon-node-holesky-unstable
restart fleet
for a in {erigon,neth,geth}-{01..10}.ih-eu-mda1.nimbus.holesky.statusim.net; do ssh -o StrictHostKeyChecking=no $a 'sudo systemctl --no-block start build-beacon-node-holesky-unstable'; done
tail -f /data/beacon-node-prater-unstable/logs/service.log
I've opened an issue for testing of this branch. Please comment in it when you think the branch is ready for that.
Remove any changes to callbacks.
Fix missingSidecars flag calculation in Gossip event handler.
Add SyncDag pruning on finalized epoch change.
Add maintenance loop to keep block buffers properly cleaned up.
This is a high-level description of the new syncing algorithm.
First of all, let's define some terms.
- `peerStatusCheckpoint` - the peer's latest finalized `Checkpoint` reported via a `status` request.
- `peerStatusHead` - the peer's latest head `BlockId` reported via a `status` request.
- `lastSeenCheckpoint` - the latest finalized `Checkpoint` reported by our current set of peers, i.e. `max(peerStatusCheckpoint.epoch)`.
- `lastSeenHead` - the latest head `BlockId` reported by our current set of peers, i.e. `max(peerStatusHead.slot)`.
- `finalizedDistance` = `lastSeenCheckpoint.epoch - dag.headState.finalizedCheckpoint.epoch`.
- `wallSyncDistance` = `beaconClock.now().slotOrZero - dag.head.slot`.
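For illustration only, here is a minimal Nim sketch of the two distance metrics, assuming plain `uint64` stand-ins for the real `Epoch`/`Slot` types (not the actual nimbus-eth2 code):

```nim
# Simplified type aliases; the real code uses the distinct Epoch/Slot types.
type
  Epoch = uint64
  Slot = uint64

func finalizedDistance(lastSeenCheckpointEpoch, localFinalizedEpoch: Epoch): uint64 =
  ## How many epochs the best finalized checkpoint seen from peers is
  ## ahead of our own finalized checkpoint (saturating at zero).
  if lastSeenCheckpointEpoch > localFinalizedEpoch:
    lastSeenCheckpointEpoch - localFinalizedEpoch
  else:
    0'u64

func wallSyncDistance(wallSlot, headSlot: Slot): uint64 =
  ## How many slots the wall clock is ahead of our current head (saturating at zero).
  if wallSlot > headSlot:
    wallSlot - headSlot
  else:
    0'u64
```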
Every peer we get from PeerPool will start its loop (a simplified sketch of this loop follows the list):

1. Update `status` information if it is too "old", where "old" depends on the current situation:
   1.1. Update `status` information every `10 * SECONDS_PER_SLOT` seconds when forward syncing is active.
   1.2. Update `status` information every `SECONDS_PER_SLOT` seconds when `peerStatusHead.slot.epoch - peerStatusCheckpoint.epoch >= 3` (which means that there is some period of non-finality).
   1.3. In all other cases, update `status` information every `5 * SECONDS_PER_SLOT` seconds.
2. Send `by root` requests, where the roots are received from the `sync_dag` module. If `finalizedDistance() < 4` epochs, it will:
   2.1. Request blocks by root in the range `[PeerStatusCheckpoint.epoch.start_slot, PeerStatusHead.slot]`.
   2.2. Request sidecars by root in the range `[getForwardSidecarSlot(), PeerStatusHead.slot]`.
3. If `finalizedDistance() > 1` epochs, it will:
   3.1. Request blocks by range in the range `[dag.finalizedHead.slot, lastSeenCheckpoint.epoch.start_slot]`.
   3.2. Request sidecars by range in the range `[dag.finalizedHead.slot, lastSeenCheckpoint.epoch.start_slot]`.
4. If `wallSyncDistance() < 1` (the backfill process should not affect syncing status, so backfill is paused if the node loses its synced status), it will:
   4.1. Request blocks by range in the range `[dag.backfill.slot, getFrontfillSlot()]`.
   4.2. Request sidecars by range in the range `[dag.backfill.slot, getBackfillSidecarSlot()]`.
5. Pause before the next loop iteration:
   5.1. If the peer provided us with some information - no pause.
   5.2. If an endless loop is detected (for some unknown reason the peer did not provide any information) - a 1-second pause.
   5.3. If we have finished syncing - N seconds, up to the next slot.
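The sketch below condenses the list into plain Nim. It is not the SyncOverseer implementation: the `SECONDS_PER_SLOT` value, the `SyncAction` names and the boolean/seconds parameters are assumptions for illustration; the real loop is async and works against the actual peer and DAG machinery.

```nim
const SECONDS_PER_SLOT = 12          # mainnet value; an assumption for this sketch

type
  SyncAction = enum
    RequestByRoot                    # item 2: blocks/sidecars by root, near the head
    RequestByRange                   # item 3: blocks/sidecars by range, towards lastSeenCheckpoint
    BackfillByRange                  # item 4: backfill, only while the node is synced

func statusUpdateInterval(forwardSyncActive: bool,
                          epochsWithoutFinality: uint64): int =
  ## Items 1.1-1.3: how often a peer's `status` should be refreshed (seconds).
  if forwardSyncActive:
    10 * SECONDS_PER_SLOT            # 1.1
  elif epochsWithoutFinality >= 3:
    SECONDS_PER_SLOT                 # 1.2: period of non-finality
  else:
    5 * SECONDS_PER_SLOT             # 1.3

func pickActions(finalizedDistance, wallSyncDistance: uint64): set[SyncAction] =
  ## Items 2-4, with the conditions copied from the list above; note that the
  ## first two conditions overlap for finalizedDistance in 2..3.
  if finalizedDistance < 4: result.incl RequestByRoot
  if finalizedDistance > 1: result.incl RequestByRange
  if wallSyncDistance < 1: result.incl BackfillByRange

func pauseSeconds(peerProvidedData, endlessLoopDetected: bool,
                  secondsToNextSlot: int): int =
  ## Items 5.1-5.3: pause before the next iteration of the per-peer loop.
  if peerProvidedData: 0             # 5.1: no pause
  elif endlessLoopDetected: 1        # 5.2: 1-second pause
  else: secondsToNextSlot            # 5.3: wait until the next slot
```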
Also, the new `SyncOverseer` catches a number of `EventBus` events, so it can maintain the `sync_dag` structures. `SyncManager` and `RequestManager` are deprecated and removed from the codebase.

The core problem of `SyncManager` is that it could work with `BlobSidecar`s but not with `DataColumnSidecar`s: because not all columns are available immediately, it is impossible to download blocks and columns in one step, as was done in `SyncManager`.

The same problem exists in `RequestManager`: right now, when `RequestManager` has a missing parent, it just randomly selects 2 peers (without any filtering) and tries to download blocks and sidecars from those peers. In the `BlobSidecar` era this works in most cases; in the `DataColumnSidecar` era the probability of success is much lower...
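To illustrate why the one-step approach stops working, here is a purely illustrative Nim sketch (not the SyncOverseer code; all types, field names and helpers are simplified assumptions) of decoupling the two phases: download the block first, then keep collecting the still-missing columns from peers that actually custody them, instead of asking random peers for everything at once:

```nim
type
  ColumnIndex = uint64
  Peer = object
    custodyColumns: seq[ColumnIndex]  # subset of columns this peer custodies

func columnsStillMissing(needed, received: seq[ColumnIndex]): seq[ColumnIndex] =
  ## After the block itself has been downloaded and buffered, the columns that
  ## have not arrived yet must be requested again, possibly from other peers.
  for c in needed:
    if c notin received:
      result.add c

func peersForColumns(peers: seq[Peer],
                     missing: seq[ColumnIndex]): seq[int] =
  ## Pick only peers that custody at least one of the missing columns,
  ## rather than selecting peers at random without any filtering.
  for i, p in peers:
    for c in missing:
      if c in p.custodyColumns:
        result.add i
        break
```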