diff --git a/security/fma-interop-proofs.md b/security/fma-interop-proofs.md new file mode 100644 index 00000000..2327fa07 --- /dev/null +++ b/security/fma-interop-proofs.md @@ -0,0 +1,231 @@ +# [Interop Proofs]: Failure Modes and Recovery Path Analysis + + + + +- [Introduction](#introduction) +- [Failure Modes and Recovery Paths](#failure-modes-and-recovery-paths) + - [[Name of Failure Mode 1]](#name-of-failure-mode-1) + - [[Name of Failure Mode 2]](#name-of-failure-mode-2) +- [Audit Requirements](#audit-requirements) +- [Action Items](#action-items) +- [Appendix](#appendix) + - [Appendix A: This is a Placeholder Title](#appendix-a-this-is-a-placeholder-title) + + + +_Italics are used to indicate things that need to be replaced._ + +| | | +|--------------------|--------------------------| +| Author | Adrian Sutton | +| Created at | 2025-05-08 | +| Initial Reviewers | Mofi Taiwo, Paul Dowman | +| Need Approval From | _Security Reviewer Name_ | +| Status | Draft | + +> [!NOTE] +> ๐Ÿ“ข Remember: +> +> - The single approver in the โ€œNeed Approval Fromโ€ must be from the Security team. +> - Maintain the โ€œStatusโ€ property accordingly. An FMA document can have the following statuses: + > + +- **Draft ๐Ÿ“:** Doc is created but not yet ready for review. + +> - **In Review ๐Ÿ”Ž:** Security is reviewing, and Engineering is iterating on the design. A checklist of action items + will be created during this phase. + > + +- **Implementing Actions ๐Ÿ›ซ:** Security has signed off on the content of the document, including the resulting action + items. Engineering is responsible for implementing the action items, and updating the checklist. + +> - **Final ๐Ÿ‘:** Security will transition the status of the document to Final once all action items are completed. + +> [!TIP] +> Guidelines for writing a good analysis, and what the reviewer will look for: +> +> - Show your work: Include steps and tools for each conclusion. +> - Completeness of risks considered. +> - Include both implementation and operational failure modes +> - Provide references to support the reviewer. +> - The size of the document will likely be proportional to the project's complexity. +> - The ultimate goal of this document is to identify action items to improve the security of the project. The FMA + review process can be accelerated by proactively identifying action items during the writing process. + +## Introduction + +This document covers the changes made to the fault proof system to support interop. These are details in the +[interop fault proofs spec](https://specs.optimism.io/interop/fault-proof.html) + +### Shared DisputeGameFactory + +The `DisputeGameFactory` is now shared by all chains in a dependency set. Proposals are made using super roots which +include the output root for all chains in the dependency set. A single fault dispute game is used to decide the validity +of the state of all chains. + +### SuperFaultDisputeGame and SuperPermissionedDisputeGame + +Two new game implementations are provided: + +1. `SuperFaultDisputeGame` which replaces `FaultDisputeGame` +2. `SuperPermissionedDisputeGame` which replaces `PermissionedDisputeGame` + +Both of these contracts are largely the same as the pre-interop versions they replace. The key changes include: + +* Removing code to support challenging the L2 block number (now handled in the fault proof program) +* Preventing games being created with the root claim `keccak("invalid")` which is used as a marker for an always invalid + state +* Modifying the L2BlockNumber local key to always populate the PreimageOracle with the proposal timestamp for the game, + regardless of the claim's position in the dispute tree + +### Multistage Fault Proof Program + +The fault proof program (`op-program` and `kona`, though only `op-program` is currently in production) has been updated +to use separate steps to derive the chain. This ensures that the fault proof VM can execute each step in a reasonable +amount of time. In particular it ensures that a single step only needs to execute the full block from a single chain, +keeping resource usage roughly equivalent to pre-interop. There are a fixed 128 steps per timesstamp transition. The +first two steps reproduce the next block for a single chain from the L1 batch data without verifying executing messages. +Following steps are a no-op, to simplify the addition of more chains in the future. The final step verifies executing +messages across all chains and replaces any blocks found to be invalid with deposit-only blocks. + +### Invalid Proposal Timestamp Handling + +The L2 block number challenge is removed from the contracts and replaced by the fault proof program transitioning to +the invalid state (`keccak("invalid")`) if data is unavailable on L1 to reach the proposal timestamp. Previously, +reaching the L1 head was considered the end of derivation and simple trace extension was applied. This left ambiguity +as to whether the derivation had reached the proposal block (indicating it is valid) or had reach the L1 head (invalid). + +### op-challenger Updates + +`op-challenger` has been updated to handle the new game types. `op-supervisor` is now used as its primary source of +truth. The `TraceProvider` used for the top half of the game is a new implementation that follows the multi-step state +transition used by interop. + +### op-dispute-mon Updates + +`op-dispute-mon` has been updated to support the name game types. `op-supervisor` is now used as its primary source of +truth for proposals, using super roots instead of output roots. Monitoring is otherwise unchanged. + +## Failure Modes and Recovery Paths + +### FM1: Unavailability of Preimage Data + +- **Description:** + - The final consolidation step validates executing messages. To do this it must have access to the receipts from the + block as derived from L1 batch data. + - When that block is invalid, it will have been re-orged out of the canonical chain by honest nodes and may have been + pruned from its database. + - As the fault proof program derives the original optimistic block in a separate step it does not have the receipts. +- **Risk Assessment:** + - Medium impact, low likelihood +- **Mitigations:** + - op-geth does not currently prune data from non-canonical blocks so would always have the required data available. + - Interop proofs action tests run using a source node that only contains canonical blocks. + - We periodically execute op-program using inputs from op-mainnet and op-sepolia. This periodic cannon + runner ([vm-runner]) runs on oplabs infrastructure using source nodes available to op-challenger. The vm-runner + samples game inputs for the latest L2 safe head every 2 hours and uses cannon to execute the op-program using the + sampled inputs. Note that this sampling does not include every game created. +- **Detection:** + - Existing monitoring for forecast or actual invalid game resolution +- **Recovery Path(s):** + - Follow the [Fault Proof Recovery Runbook] + - Deploy a fixed op-challenger. Does not require governance approval or a hard fork. + - If sufficient context is not available from the preimage hint emitted by the fault proof program, deploy an updated + prestate with additional information in the hint. Requires governance approval but can be done without a hard fork. + +### FM2: Problem in Game Design + +- **Description:** + - A problem in the design of the dispute game, in particular in the multistep super root state transition, could lead + to games resolving incorrectly. + - This may include valid proposals being invalidated and invalid proposals being confirmed. +- **Risk Assessment:** + - Medium impact, low likelihood +- **Mitigations:** + - [Suite of action tests] verifying the sequence of expected states to transition between super roots in a wide + variety of cases. +- **Detection:** + - Existing monitoring for forecast and actual incorrect game results. +- **Recovery Path(s):** + - Follow the [Fault Proof Recovery Runbook] + - A fixed design would need to be devised, implemented, approved by governance and deployed. + +### FM3: Mismatch Between op-challenger and op-program + +- **Description:** + - To ensure it always wins games, op-challenger must calculate claim values that match the claim op-program considers + correct. + - A bug in op-challenger or op-program could cause these claims to be mismatched. + - This could lead to games resolving incorrectly, including valid proposals being invalidated and invalid proposals + being confirmed. +- **Risk Assessment:** + - Medium impact, low likelihood +- **Mitigations:** + - [Suite of action tests] verifying the honest trace used by op-challenger matches the expected claims by op-program. + - Bottom half of the game continues to post cannon states as in pre-interop games. +- **Detection:** + - Existing monitoring for forecast and actual incorrect game results. + - We periodically execute op-program in [vm-runner] using inputs from op-mainnet and op-sepolia. The expected output + is calculated using the same op-challenger trace provider as is used to calculate the correct claims when playing + games. +- **Recovery Path(s):** + - Follow the [Fault Proof Recovery Runbook] + - If the bug is in op-challenger, deploy a fixed version. Does not require governance approval or a hard fork. + - If the bug is in op-program, deploy a fix via a new prestate. Requires governance approval but can be done without a + hard fork. + +### FM4: Increased Maximum Preimage Size + +- **Description:** + - The consolidation step introduces the first requirement for op-program to read receipts from L2 chains. + - Since L2 chains have higher gas limits it is possible to create larger receipts in a single block which may need to + be added into the PreimageOracle. + - The larger preimages may incur higher costs for the honest actor to populate the `PreimageOracle`. + - The preimage can still be posted via the existing large preimage proposal process which op-challenger automatically + utilises. +- **Risk Assessment:** + - Low impact, low likelihood +- **Mitigations:** + - Preimages only need to be populated into the `PreimageOracle` when the max depth is reached and `step()` is called. + This requires posting at least 300 ETH in bonds. +- **Detection:** + - [Existing monitoring for large preimage proposals](https://github.com/ethereum-optimism/k8s/blob/0eb3b759ecfe52ed36ece0531a559f11e699419f/grafana-cloud/terraform-rules/challenger.yaml#L94-L102) +- **Recovery Path(s):** + - Automatically handled by op-challenger + - Potentially increase bond sizes via new dispute game implementation deployment (requires governance approval). + +### Generic items we need to take into account: + +See [generic hardfork failure modes](./fma-generic-hardfork.md) +and [generic smart contract failure modes](./fma-generic-contracts.md). +Incorporate any applicable failure modes with FMA-specific mitigations and detections directly into this document. + +- [x] Check this box to confirm that these items have been considered and updated if necessary. + +## Action Items + +Below is what needs to be done before launch to reduce the chances of the above failure modes occurring, and to ensure +they can be detected and recovered from: + +- [ ] Resolve all comments on this document and incorporate them into the document itself (Assignee: document author) + + +## Audit Requirements + +An audit has been performed for op-program and is scheduled for the overall interop system, including fault proofs. + +## Appendix + +### Appendix A: Additional Interop FMAs + +Other interop FMAs have some overlap with the proofs system: + +* [Supervisor FMA](./fma-supervisor.md) includes failure modes around unavailability or incorrect results from op-supervisor +* [Portal FMA](./fma-interop-portal.md) includes failure modes related to the contract migration + +[Suite of action tests]: https://github.com/ethereum-optimism/optimism/blob/fa86f81da6bed8489508907ed0956134d029c09f/op-e2e/actions/interop/proofs_test.go + +[Fault Proof Recovery Runbook]: https://www.notion.so/oplabs/Fault-Proofs-Recovery-Runbook-8dad0f1e6d4644c281b0e946c89f345f?pvs=4 + +[vm-runner]: https://github.com/ethereum-optimism/optimism/tree/develop/op-challenger/runner