
FMA: OP-Supervisor #233

Open · wants to merge 4 commits into main
Conversation

@axelKingsley (Contributor) commented on Mar 26, 2025:

Failure Mode Analysis for op-supervisor, a new consensus-critical piece of code for the op-stack.

@tynes added this to the Interop RC Beta milestone on Mar 27, 2025
@tynes requested a review from Copilot on Mar 27, 2025
@Copilot (Copilot AI) left a comment:
Pull Request Overview

This PR introduces a new markdown document that details the failure modes and recovery path analysis for the OP-Supervisor component. It outlines interop protocol behavior and provides in-depth discussions of multiple failure modes (FM1a–FM6), including their descriptions, risk assessments, and mitigation strategies.

@tynes linked an issue, Interop op-supervisor FMA, on Mar 27, 2025 that may be closed by this pull request
@protolambda (Contributor) left a comment:

Looks good. Commented on a few of the items, but no blockers.

- *Even when* the Supervisor behaves totally correctly, this case may occur if some cross-unsafe data is used to build a block which later becomes invalid due to a reorg on the referenced chain. In this situation, the same outcome is felt by the network.
- Mitigations
  - The Sequencer could detect a cross-unsafe head stall and issue a reorg on the unsafe chain in order to avoid invalid L1 inclusions (see the sketch below). Depending on the heuristic used, this could create regular unsafe reorgs with a low threshold, or larger, less common ones. This also saves operators from wasted L1 fees when a batch would contain unwanted data.
  - When promoting local-unsafe to cross-unsafe, the Supervisor can additionally detect whether the data it is stalled on is already cross-safe. If it is, it can proactively notify the Sequencer that the current chain cannot become valid, creating an earlier reorg point.
Review comment (Contributor):

This is worth prioritizing more.
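To make the stall-detection heuristic quoted above concrete, here is a minimal Go sketch. The `Heads`, `Sequencer`, and `StallMonitor` names are invented for illustration and are not the op-stack's actual interfaces; the threshold trade-off in the comments mirrors the one described in the mitigation.

```go
// A minimal sketch of the stall-detection heuristic, assuming invented
// types: Heads, Sequencer, and StallMonitor are illustrative, not the
// op-stack's actual interfaces.
package stallmon

import "time"

// Heads is a hypothetical snapshot of the chain heads the Supervisor tracks.
type Heads struct {
	LocalUnsafe uint64 // tip of the unsafe chain
	CrossUnsafe uint64 // highest block whose cross-chain deps are validated
}

// Sequencer is a hypothetical control surface for the block builder.
type Sequencer interface {
	// ReorgUnsafeTo rewinds the unsafe chain to the given block number.
	ReorgUnsafeTo(blockNum uint64)
}

// StallMonitor fires an unsafe reorg when cross-unsafe has not advanced
// for longer than Threshold while local-unsafe kept growing past it.
type StallMonitor struct {
	Threshold   time.Duration
	lastCross   uint64
	lastAdvance time.Time
}

// NewStallMonitor seeds the timer so a fresh monitor does not fire instantly.
func NewStallMonitor(threshold time.Duration, now time.Time) *StallMonitor {
	return &StallMonitor{Threshold: threshold, lastAdvance: now}
}

// Check is meant to be called on a regular tick.
func (m *StallMonitor) Check(now time.Time, h Heads, seq Sequencer) {
	if h.CrossUnsafe > m.lastCross {
		m.lastCross, m.lastAdvance = h.CrossUnsafe, now
		return
	}
	// Cross-unsafe is stalled; act only if unsafe blocks are piling up on top.
	if h.LocalUnsafe > h.CrossUnsafe && now.Sub(m.lastAdvance) > m.Threshold {
		// A low Threshold gives frequent, shallow unsafe reorgs; a high one
		// gives rarer but deeper ones (the trade-off noted in the excerpt).
		seq.ReorgUnsafeTo(h.CrossUnsafe)
		m.lastAdvance = now // avoid re-triggering on every tick
	}
}
```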

- Mitigations
  - The Sequencer could detect a cross-unsafe head stall and issue a reorg on the unsafe chain in order to avoid invalid L1 inclusions. Depending on the heuristic used, this could create regular unsafe reorgs with a low threshold, or larger, less common ones. This also saves operators from wasted L1 fees when a batch would contain unwanted data.
  - When promoting local-unsafe to cross-unsafe, the Supervisor can additionally detect whether the data it is stalled on is already cross-safe. If it is, it can proactively notify the Sequencer that the current chain cannot become valid, creating an earlier reorg point.
  - The Batcher can decline to post beyond the current cross-unsafe head (see the sketch below). This avoids publishing bad data, so the sequencer may reorg it out, saving a replacement-based reorg. If it went on long enough, the Batcher would prevent any new data from being posted to L1, effectively creating a safe-head stall until the sequencer resolved the issue. This *could* be a preferred scenario for some chains.
Review comment (Contributor):

Safety over liveness; I think this is valid to do by default.

That said, we should invest more in the syncing of the unsafe chain, so that low-latency batch submission is not as important for stability. And the longer we defer batch submission, the larger the gap between L1 fees and actual batch costs may grow.
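As a rough sketch of the "only post up to cross-unsafe" behavior discussed here, assuming a hypothetical `ChainView` interface (not the real batcher API): the submission range is simply clamped at the cross-unsafe head, and posting stalls when that head stops advancing.

```go
// A sketch of the "post only up to cross-unsafe" rule, assuming a
// hypothetical ChainView interface rather than the real batcher API.
package batcher

// ChainView is a hypothetical read-only view of the relevant heads.
type ChainView interface {
	LocalSafe() uint64   // highest block already posted to and derived from L1
	CrossUnsafe() uint64 // highest block with fully validated cross-chain deps
}

// nextBatchRange returns the inclusive block range [from, to] the batcher
// may submit, or ok=false when there is nothing past the safe head that is
// provably worth paying L1 fees for. If cross-unsafe stalls, posting stalls
// with it, intentionally producing the safe-head stall described above.
func nextBatchRange(cv ChainView) (from, to uint64, ok bool) {
	from = cv.LocalSafe() + 1
	to = cv.CrossUnsafe() // never publish data that may still be invalidated
	if to < from {
		return 0, 0, false
	}
	return from, to, true
}
```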

- Mitigation
  - The Supervisor supports a feature in which *multiple* Nodes of a given chain can be connected as Managed Nodes. If one Node goes down, syncing can continue with a backup (sketched below).
  - This feature is mostly ready in the happy path, but there are known gaps in managing secondary Managed Nodes during block replacements and resets.
  - This feature *also* needs much more robust testing in lifelike environments. Previous development cycles were spent in the devnet trying to enable this feature, which was slow and risky. To get this feature working well, we need to leverage our new network-wide testing abilities.
Review comment (Contributor):

This is definitely an area that lacks testing, because other work was prioritized over it.

Fortunately, @pcw109550 is starting on some sync testing, and I'm pushing for devnet-sdk to provide the sync-testing context (multiple nodes, multi-supervisor, etc.) that it needs.
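For illustration only, here is a sketch of the multi-Managed-Node failover described in the excerpt above. `ManagedNode`, `NodeSet`, and `SyncStep` are invented names, not the Supervisor's real API; the comment marks where the known replacement/reset gap would live.

```go
// A sketch of Managed Node failover, with invented names (ManagedNode,
// NodeSet, SyncStep); the Supervisor's real API differs.
package supervisor

import "errors"

// ManagedNode is a hypothetical handle to one op-node run in managed mode.
type ManagedNode interface {
	SyncStep() error // advance derivation/sync by one step
}

// NodeSet holds replicas of the same chain and remembers which one is active.
type NodeSet struct {
	nodes  []ManagedNode
	active int
}

var ErrAllNodesDown = errors.New("all managed nodes failed")

// Step drives the active node and falls back to the others on error.
// This is the "mostly ready" happy path; coordinating secondaries through
// block replacements and resets is the known gap called out above.
func (s *NodeSet) Step() error {
	for i := 0; i < len(s.nodes); i++ {
		idx := (s.active + i) % len(s.nodes)
		if err := s.nodes[idx].SyncStep(); err == nil {
			s.active = idx
			return nil
		}
	}
	return ErrAllNodesDown
}
```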

Comment on lines +214 to +215
- Mitigation
  - Standard Mode could allow *multiple* Supervisor endpoints to be specified; the Node could confirm that all endpoints agree, preventing a single dishonest party from deceiving it (see the sketch below).
Review comment (Contributor):
Not viable now, but later with interop we can potentially include proofs for trustless sync, since all L2 data is really just some computation over L1 data, which the op-node can verify if it has access to an L1 header chain.
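A sketch of the multiple-endpoint mitigation from the excerpt, under assumed interfaces (`SupervisorClient` and `CheckMessage` are hypothetical stand-ins): validity requires unanimous agreement across endpoints, so a single dishonest Supervisor cannot deceive the Node.

```go
// A sketch of the multi-endpoint check, under assumed interfaces:
// SupervisorClient and CheckMessage are hypothetical stand-ins for the
// Node's RPC client against one Supervisor endpoint.
package node

import (
	"context"
	"errors"
)

// SupervisorClient validates one message hash against one endpoint.
type SupervisorClient interface {
	CheckMessage(ctx context.Context, msgHash [32]byte) (valid bool, err error)
}

var ErrEndpointsDisagree = errors.New("supervisor endpoints disagree")

// checkAll requires unanimous agreement across all configured endpoints,
// so a single dishonest Supervisor cannot deceive the Node.
func checkAll(ctx context.Context, clients []SupervisorClient, msgHash [32]byte) (bool, error) {
	if len(clients) == 0 {
		return false, errors.New("no supervisor endpoints configured")
	}
	first, err := clients[0].CheckMessage(ctx, msgHash)
	if err != nil {
		return false, err // fail closed on any unreachable endpoint
	}
	for _, c := range clients[1:] {
		v, err := c.CheckMessage(ctx, msgHash)
		if err != nil {
			return false, err
		}
		if v != first {
			return false, ErrEndpointsDisagree // surface for operator alerting
		}
	}
	return first, nil
}
```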

@axelKingsley marked this pull request as ready for review on Apr 1, 2025
- High-volume indexing for performance.
- Chaos-Monkey-style testing for correctness.
- Smarter behaviors for existing components:
  - Batcher to only submit up to cross-unsafe.
Review comment (Contributor):

Can you articulate the value proposition of only batch-submitting up to cross-unsafe? If I understand correctly, cross-unsafe means that you have observed the existence of the initiating messages in a block, but they have not been finalized themselves yet. We would reduce the risk of batch-submitting something that would be reorg'd out.

- There is a feature of the Supervisor by which databases can be replicated over HTTP. If a Supervisor needs to be quickly synced, it can bootstrap its databases from an existing Supervisor (see the sketch below).
- Supervisors themselves should be made redundant, so that if one goes down, another may serve in its place. Supervisor *replacement* is not well tested; more likely, you'd switch to that backup Supervisor *and* its Managed Nodes.
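Here is a sketch of the HTTP database-bootstrap flow described above. The endpoint path and on-disk layout are invented for illustration; only the stream-to-disk shape of the operation is the point.

```go
// A sketch of bootstrapping a Supervisor database from a peer over HTTP.
// The endpoint path and file layout are invented for illustration.
package bootstrap

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// fetchDB streams one chain's log database from an existing Supervisor
// to a local file, after which the new Supervisor can resume normal sync.
func fetchDB(peerURL string, chainID uint64, destPath string) error {
	resp, err := http.Get(fmt.Sprintf("%s/dbs/%d/log.db", peerURL, chainID)) // hypothetical path
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("peer returned %s", resp.Status)
	}
	f, err := os.Create(destPath)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}
```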

## FM2a: Supervisor Incorrectly Determines Message Validity in Response to Sequencer - Unsafe Block Only
Review comment (Contributor):

Action item: communicate clearly to applications that depend on unsafe blocks the new risks that come with interop. Across is an example of such an application.

- The Batcher can decline to post beyond the current cross-unsafe head. This avoids publishing bad data, so the sequencer may reorg it out, saving a replacement-based reorg. If it went on long enough, the Batcher would prevent any new data from being posted to L1, effectively creating a safe-head stall until the sequencer resolved the issue. This *could* be a preferred scenario for some chains.
- We need to develop and use production-realistic networks in large-scale testing to exercise failure cases and gain confidence that the system behaves and recovers as expected.

## FM2b: Supervisor Doesn’t Catch Block Invalidation of Safe Data
Review comment (Contributor):

We could come up with a plan to roll back the chain, but this would break exchange integrations. This is critical severity and the most risky part of interop.

- Any Node in Managed Mode would be unable to advance its chain state at all besides local-unsafe.
- Any Node in Standard Mode would be unable to advance cross-unsafe and cross-safe, but could still advance local-unsafe and local-safe. And because validity can't be checked, the Node may prefer not to trust safe data either.
- Risk Assessment
  - Low Impact, High Likelihood.
Review comment:

End-user impact would be similar to an unsafe head stall: the longer the Supervisor is unavailable, the more money is lost through things like changing on-chain prices making trades no longer profitable, or delays preventing users from taking time-sensitive actions. Responding swiftly here is important and should be treated the same as how we respond to unsafe head stalls today.

- When the Supervisor attempts to promote the local-safe data to cross-safe, it discovers the invalid message and issues an invalidation and replacement.
- The replacement block is applied in place of the block which contained the invalid message, and the chain has now reorg'd out all blocks from the invalid message to the safe head (effectively resetting the chain back to the stalled cross-unsafe head).
- Risk Assessment
  - Medium Impact, Medium Likelihood.
Review comment:

If a third-party bridge or centralized exchange trusts unsafe interop messages, this is a risk. Dangerous scenario:

  1. Chain A includes an initiating message for "move 1000 ETH to Chain B"
  2. Chain B sees the unsafe initiating message
  3. The executing message to mint 1000 ETH on Chain B is executed
  4. The attacker moves 1000 ETH out to L1 via a third-party bridge or CEX which just trusts the unsafe head
  5. An unsafe chain reorg happens on Chain A, and the initiating message is reorged out (this could happen maliciously, or even via just an L1 reorg)
  6. The attacker has stolen 1000 ETH

Mitigation: ensure that any applications which enable moving substantial value out of the Superchain understand this risk, and ideally require making sure the chain is safe.

- Any validators who *do not* rely on the failing Supervisor will see the correct chain, but there are currently no alternative implementations to use.
- An output root posted from this incorrect state would be open to being fault-proven.
- Risk Assessment
  - Even Higher Impact, Low Likelihood.
@K-Ho commented on Apr 2, 2025:

Clarifying from the FMA review: the reason this is low likelihood, while 2a is medium likelihood, is that 2a could happen even if there are no bugs (e.g. L1 has a major reorg), whereas 2b should never happen unless there is a bug in the Supervisor.

This is the most critical impact of any of these. It can allow a user to mint uncapped amounts of ETH and quickly exit that ETH from the Superchain via CEXes, third-party bridges, etc. These bridges/CEXes look at the safe head, and in this scenario their checks would pass and enable these withdrawals.
To figure out:

  1. How much value exited out of the Superchain would this be, over how much time?
  2. How would we even respond if this happened? Somehow halt all chains? Pause L2->L1 withdrawals? How would we notify all third-party bridges/CEXes? Also, assuming this is a huge amount of money, what do we need to do to properly mitigate this risk (alternative Supervisor implementation, audit competition, etc.)?

- Risk Assessment
  - Even Higher Impact, Low Likelihood.
- Recovery
  - Within the 12h Safe Head window, if the sequencer is repaired and rewound, it could correctly interpret the batch data to replace a block, and would then have the job of rebuilding the chain from that point. All users would need to upgrade to a version without this derivation bug and resync the chain from the failed position. Operators who *did not* upgrade and resync would be left on a dead branch that is no longer being updated.
Review comment (Contributor):
Things that can help with this issue:

  • multiple clients
  • 2-step cross-chain messages plus a pause button
  • monitoring and a runbook for triggering a reorg
  • only build blocks based on safe cross-chain messages
  • use sequencer policy to only include ether transfers that are safe (see the sketch below)
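A sketch of that last bullet, with invented types (`ExecutingMessage`, `SafetyOracle`): a sequencer policy check that admits a transaction's executing messages only once their initiating messages are cross-safe on the source chain.

```go
// A sketch of a sequencer policy filter, with invented types
// (ExecutingMessage, SafetyOracle): a transaction's executing messages
// are admitted only once their initiating messages are cross-safe.
package policy

// ExecutingMessage is a hypothetical view of one cross-chain message a
// transaction wants to execute.
type ExecutingMessage struct {
	InitiatingChain uint64 // chain ID of the initiating message
	InitiatingBlock uint64 // block number containing the initiating message
}

// SafetyOracle is a hypothetical lookup backed by the Supervisor.
type SafetyOracle interface {
	CrossSafeHead(chainID uint64) uint64
}

// admit returns true only if every referenced initiating message is at or
// below the cross-safe head of its source chain; otherwise the sequencer
// holds the transaction out of the block.
func admit(msgs []ExecutingMessage, oracle SafetyOracle) bool {
	for _, m := range msgs {
		if m.InitiatingBlock > oracle.CrossSafeHead(m.InitiatingChain) {
			return false
		}
	}
	return true
}
```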
