Skip to content

Conversation

@petarjuki7
Copy link
Member

Issue Addressed

Adresses issue #519

Proposed Changes

We keep track of all OperatorAdded events in a new variable stored in the database (max_operator_id_seen).
Then, when we get a new OperatorAdded event, we can check does it match the expected OperatorId we should have seen next.

Also, during historical sync we check if we saw any OperatorAdded event, if not, there is likely a sync error.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds tracking and validation for OperatorAdded events to detect synchronization issues by monitoring sequence gaps and missing events during historical sync.

  • Introduces max_operator_id_seen field to track the highest operator ID encountered
  • Validates that OperatorAdded events arrive in sequential order during live sync
  • Warns if no OperatorAdded events are found during historical synchronization

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
anchor/eth/src/event_processor.rs Adds validation logic for OperatorAdded event sequence and missing events detection
anchor/database/src/table_schema.sql Adds max_operator_id_seen column to metadata table
anchor/database/src/state.rs Implements getter for max_operator_id_seen from database and state
anchor/database/src/sql_operations.rs Defines SQL queries for retrieving and updating max_operator_id_seen
anchor/database/src/lib.rs Adds max_operator_id_seen field to SingleState and bump operation

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link
Member

@dknopik dknopik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Here are my thoughts.

If we decide on not persisting the max operator id seen, I think the implementation might be straight forward enough to warrant inclusion in v1.0.0. wdyt?

domain_type INTEGER NOT NULL,
block_number INTEGER NOT NULL DEFAULT 0 CHECK (block_number >= 0)
block_number INTEGER NOT NULL DEFAULT 0 CHECK (block_number >= 0),
max_operator_id_seen INTEGER NOT NULL DEFAULT 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For simplicity, I am not sure if we should persist this. It might be sufficient to keep this as an Option<OperatorId> during sync, initialized on the first seen operatorAdded, and checked subsequently during this anchor instances lifetime

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about persistence also... I think the problem may be with the historical sync because, if I understood the code correctly, after we do it for the first time we persist the block data, and on the next startup of the anchor instance we only sync from the last_processed_block, but our operator_seen count would start from 0 which could cause inconsistencies.

Copy link
Member

@dknopik dknopik Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. This is why I suggest using an last_operator_add_seen: Option<OperatorId>, initialized to None. To avoid the inconsistency you mentioned, we would not do the check on the first operatorAdded we see, but use it to initialize the last_operator_add_seen. Then we can do the check when receiving more operatorAdded.

The downside is that we do not catch missing operatorsAdded between the last run and the first operatorAdded we get in the current run. I am on the fence whether that is alright.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, I understand.
The downside might be fine, because it if it's truly not catching the events, we will find out soon enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we would not do the check on the first operatorAdded we see, but use it to initialize the last_operator_add_see

That's the crux of the issue, isn't it? A pruned client could have deleted the state containing some initial operators

@diegomrsantos
Copy link
Member

The current approach might not be enough. Does the operator id start at zero or one and then continue increasing by 1? If so, we need to guarantee that we have seen all of them up to the current value in the contract.

@petarjuki7
Copy link
Member Author

we need to guarantee that we have seen all of them up to the current value in the contract.

I think that was the downside discussed about this approach when we don't persist the max_operator_id_seen variable. We kind of "trust" the first OperatorAdded event we see will initialize the variable with the highest OperatorId.

I think the purpose of this PR is more to warn users that something with syncing is wrong so they configure their EL client correctly.

But I could implement a check that goes through our saved OperatorIds in the database and checks if they are monotonically increasing by 1 up until the max_seen?

@diegomrsantos
Copy link
Member

What I described is a suggestion about how to check if Anchor has synced correctly starting with an empty DB. Additionally, we also need to persist the highest operator id we have seen to be able to do the check when we start syncing after being offline.

@diegomrsantos
Copy link
Member

I think we do have to persist the highest_operator_id_seen. It will be created initially with the first value used by the contract.

But I could implement a check that goes through our saved OperatorIds in the database and checks if they are monotonically increasing by 1

This needs to be done during sync, starting with the current value in the DB.

@diegomrsantos
Copy link
Member

And before we start syncing, we could read the last operator id from the ssv contract and check that see all ids up to this value. But not sure it's necessary

@petarjuki7
Copy link
Member Author

Yeah, I understand the idea, can do it.
Just tagging @dknopik to confirm, if I remember correctly we decided on the non-persistence path yesterday.

@dknopik
Copy link
Member

dknopik commented Sep 24, 2025

@diegomrsantos In yesterdays meeting we discussed that we can also not persist the highest_operator_id_seen and initialize it with the first operator ID seen per sync. This would be sufficient to catch any missing operatorAdded for the lifetime of that instance.

In the meeting, I asked if there is agreement to do it this way for now, as it is a tradeoff for not having to modify the database that closely to the release. While it is of course okay to change one's mind, I believe the simpler option is better for a short term change

@dknopik
Copy link
Member

dknopik commented Sep 24, 2025

@petarjuki7 please rebase onto release-v1.0.0

@petarjuki7 petarjuki7 changed the base branch from unstable to release-v1.0.0 September 24, 2025 11:58
@petarjuki7 petarjuki7 requested a review from dknopik September 24, 2025 12:44
but got operator {operator_id}."
);
}
self.db.bump_max_operator_id_seen();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a bit cleaner to just have set_max_operator_id_seen

@dknopik dknopik changed the base branch from release-v1.0.0 to unstable October 20, 2025 13:20
Comment on lines 194 to 197
error!(
"Missing OperatorAdded events: database has only seen up to id {max_seen}, \
but got operator {operator_id}."
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should discuss whether we want to keep this a log-only soft error or return an error here to signal that the sync has failed. I think personally, I prefer the latter

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem I encountered while testing is that by returning an Error(...) type, it gets handled in the process_logs() function, and if its a historical sync it only gets logged as a trace. So it's practically invisible without running the client with RUST_LOG = trace or similar, its not really a "loud warning" as stated in the issue description.
Maybe keep the error!() log and also return an Error() ?

@petarjuki7 petarjuki7 force-pushed the warn_missing_logs branch 2 times, most recently from 4b318bb to 285b6d5 Compare November 7, 2025 13:26
@dknopik dknopik added the v1.2.0 label Nov 7, 2025
@petarjuki7 petarjuki7 added the ready-for-review This PR is ready to be reviewed label Nov 10, 2025
@petarjuki7 petarjuki7 requested a review from dknopik November 10, 2025 04:13
Copy link
Member

@dknopik dknopik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

@dknopik dknopik added waiting-on-author and removed ready-for-review This PR is ready to be reviewed labels Nov 12, 2025
@petarjuki7 petarjuki7 added ready-for-review This PR is ready to be reviewed and removed waiting-on-author labels Nov 12, 2025
@diegomrsantos
Copy link
Member

@claude review this PR. Is it worth moving forward with this approach, given https://eips.ethereum.org/EIPS/eip-4444 and the fact that we are currently investigating a solution similar to Ethereum's checkpoint sync to stop relying on all logs existing in execution clients?

@claude-code-actions-sigp
Copy link

claude-code-actions-sigp bot commented Nov 12, 2025

Claude finished @diegomrsantos's task —— View job


Review Progress

  • Read and analyze changed files
  • Review code quality and correctness
  • Assess strategic value given EIP-4444 and checkpoint sync
  • Provide comprehensive feedback

Code Review

I've reviewed the PR implementation. Here's my assessment:

Implementation Quality

Positives:

  1. Clean migration path: The schema migration from v1 to v2 is properly implemented with the MIGRATION_V1_TO_V2 script (anchor/database/src/schema.rs:16-19)
  2. Good test coverage: The migration test in anchor/database/src/tests/metadata_tests.rs:172-209 validates both the upgrade process and NULL default behavior
  3. Proper NULL handling: Migrated databases correctly have max_operator_id_seen = NULL to distinguish from fresh databases
  4. Sequential validation: The check at anchor/eth/src/event_processor.rs:194-201 properly validates operator IDs are sequential

Issues Found:

  1. DEFAULT 0 vs NULL inconsistency (anchor/database/src/table_schema.sql:8):

    • The schema uses DEFAULT 0 but the migration test expects NULL for migrated databases
    • @dknopik's latest comment correctly identifies this: fresh databases should start at 0, but the migration adds the column without a default
    • Fix needed: Change line 8 to just max_operator_id_seen INTEGER (no default), then update the migration to explicitly set it to 0 for fresh inserts
  2. Logic error in validation (anchor/eth/src/event_processor.rs:195):

    • The check max_seen != operatorId - 1 is correct for detecting gaps
    • However, this assumes operator IDs start at 1 and increment by 1
    • Question: Does the contract guarantee operator IDs start at 1? If they start at 0, the first check will fail
  3. Missing initialization on first operator (anchor/eth/src/event_processor.rs:193-201):

    • When max_seen is None (migrated DB), the code skips validation and just sets the current ID
    • This is intentional per the discussion, but means we lose gap detection for the first sync after migration
    • This is the accepted trade-off mentioned in comments
  4. Warning placement (anchor/eth/src/sync.rs:514-522):

    • The warning only triggers after historical sync completes
    • Good placement, but consider: what if operators were added before the contract deployment block being synced from?
    • Might want to clarify this is "no operators found since block X"

Strategic Assessment: EIP-4444 and Checkpoint Sync

You asked whether this approach is worth pursuing given EIP-4444 and checkpoint sync plans. Here's my analysis:

Short Answer: This is a valuable transitional solution, but has limited long-term value.

Detailed Reasoning:

  1. EIP-4444 Impact:

    • EIP-4444 allows execution clients to prune logs older than ~1 year
    • This PR's approach will fail once logs are pruned, because:
      • max_operator_id_seen needs full historical sync to detect gaps
      • If logs are missing, the monotonic check becomes unreliable
      • The warning at sync.rs:521 will trigger constantly
  2. Checkpoint Sync Alternative:

    • A checkpoint sync approach would provide:
      • Initial state snapshot (including all operator IDs)
      • Ability to verify completeness against snapshot
      • No dependency on historical logs
    • This makes the max_operator_id_seen approach obsolete
  3. Current Value:

    • Before EIP-4444 is widely adopted: This provides user value by detecting misconfigured EL clients
    • Detection window: Useful for ~1 year after deployment (until pruning kicks in)
    • User experience: Helps users diagnose sync issues NOW, when they need it

Recommendation:

Proceed with this PR with the understanding that:

  • It's a tactical fix for the current problem (issue Loudly warn if logs are not being served #519)
  • It has a limited lifespan (~1-2 years until EIP-4444 adoption)
  • It should be documented as temporary with a note about EIP-4444 limitations
  • Consider adding a migration path or deprecation plan for when checkpoint sync is ready

Alternative Consideration:

If checkpoint sync is coming soon (3-6 months), you might:

  1. Skip this PR and focus efforts on checkpoint sync instead
  2. Add a simpler check (e.g., "warn if <10 operators found") without persistence
  3. Document the limitation in release notes and recommend users verify EL configuration

Trade-offs:

Aspect This PR Wait for Checkpoint Sync
User value today ✅ Immediate detection ❌ No help for current issues
Long-term maintenance ⚠️ Will need removal later ✅ Permanent solution
Development effort ⚠️ ~1 day work, now ⚠️ ~weeks of work, later
EIP-4444 compatibility ❌ Breaks when logs pruned ✅ Works with pruned logs

Required Changes

Before merging, address:

  1. Fix DEFAULT vs NULL in table_schema.sql:8 per dknopik's comment
  2. Clarify operator ID starting value - verify the contract starts at 1, not 0
  3. Add documentation about EIP-4444 limitations in code comments
  4. Consider adding an issue reference in comments for future removal when checkpoint sync arrives

Verdict

The implementation is sound for its intended purpose, but its strategic value is time-limited. If you need a solution NOW and checkpoint sync is >6 months away, merge it. If checkpoint sync is imminent, consider simpler alternatives.


@petarjuki7
Copy link
Member Author

Clarify operator ID starting value - verify the contract starts at 1, not 0

It does, but I agree it's a good thing to point out / double check

Good placement, but consider: what if operators were added before the contract deployment block being synced from?

By definition they shouldn't be.

For other points I agree but EIP-4444 doesn't seem to be arriving soon, and there will be a number of users receiving this error (unfortunately) so I think it makes sense to include it into 1.2.0.

@jking-aus
Copy link
Member

hey just chiming in to say four fours is realistically not happening in the next two forks so we dont need to worry about that. We are however having a consistent problem where users aren't aware their sync is working because of missing receipts. So we need something like this PR -- good work Petar.

@petarjuki7
Copy link
Member Author

Forgot to mention here, I tested the happy case, works fine.
I simulated an error by just adding if operatorId == 15 { early return; }, and it showed errors correctly (picture attached)
image

@dknopik dknopik removed the v1.2.0 label Dec 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-review This PR is ready to be reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants