# Blob Structure

## Context

Blobs can be shared between multiple rollups, as noted in the [Overall Design document](./Overall%20Design.md). The proposed blob structure is guided by the following design goals:

- blob boundaries are not meaningful, but each publication should be contained within a transaction.
- allow proposers to decide per-publication which rollups to include, based on proposal rights and available transactions.
- support any compression algorithm, enabling shared compression for better ratios when multiple rollups use the same algorithm.
- remain minimally opinionated, so rollups can update the structure as desired.
## Existing standards

### Nethermind's proposal

Nethermind's [proposal](https://hackmd.io/@linoscope/blob-sharing-for-based-rollups) is simple but suboptimal for a minimal rollup because:

- segmentation information is published on L1, increasing L1 requirements.
- rollups sharing a compression algorithm need to decompress the publication before splitting it.
### Spire's proposal

Spire has created [this proposal](https://paragraph.xyz/@spire/shared-blob-compression). However, a Merkle tree structure may not be useful because:

- the main advantage of a Merkle tree is efficiently proving individual leaves. In our case each rollup needs to process the full list of transactions in the publication, so at the very least each rollup's transactions should be treated as one leaf.
- if two rollups share a compression algorithm, both transaction lists need to be decompressed together (if they want to take advantage of shared compression), so in that case they should also be a single leaf, and an additional mechanism is needed to describe the split within the compressed blob.

A flat structure (which is one of the options they mention) seems both simpler and more efficient.

Their proposal also requires registering with the blob aggregation service so it knows which DA layer and compression mechanism to use. Moreover, the rollups are responsible for ensuring the blobs they retrieve from the aggregation service match the ones that were passed in (using a signature).

The minimal rollup scope only considers Ethereum as the DA layer, which avoids introducing an additional off-chain service. Since the supported compression algorithms are unlikely to change often, rollups can register them on chain so individual L1 proposers can choose their preferred algorithm.
### Dankrad's proposal

Dankrad has created [this proposal](https://ethresear.ch/t/suggested-format-for-shard-blob-header/9996), which is mostly suitable for a minimal rollup with the following modifications:

- identify each supported compression algorithm with an `application_id`, as if it were a _format_ identifier.
- multiple rollups can find and decompress the same compressed data, so a way to split the decompressed data is required. The suggested approach is to use the same segmenting scheme inside the compressed segment.
- his post allows for up to 1535 applications per blob, which has some cascading effects:
  - the header is expected to be sorted by application ID, enabling each application to binary search for the pointer to the start of its compressed data in the blob.
  - if the search fails, whether because of unsorted data or for any other reason, the data is considered unusable. Applications must not reach different conclusions about what data is stored for a given application ID.
  - even though publications can span several blobs, only a handful of rollups are expected to publish together, perhaps grouped under the same _compression id_ (i.e. `application_id`). Instead of a binary search, a linear search through the whole header can be used to ensure ids are sorted.
## The proposal

With these modifications to Dankrad's proposal, here is a complete description of the suggested design. The description covers how an L2 rollup node must update its state; the corresponding protocol for proposers and provers should follow naturally.

### Preparation

Whoever configures the rollup should publish (perhaps in an on-chain registry) the full list of unique 3-byte data identifiers that the rollup nodes should process. This should include an identifier for:

- its own rollup transactions
- any other rollup transactions it should process (e.g. for embedded rollups)
- any other data feeds it should process (e.g. for synchrony or coordination)
- any compression (or other data processing) algorithms it supports
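For illustration only, a node's watch list might look like the Python sketch below. The identifier values are hypothetical; only the 3-byte width and the reserved all-zero padding value come from this proposal.

```python
# Hypothetical identifier values for the data types this node processes.
WATCHED_IDS = {
    bytes.fromhex("000001"): "own rollup transactions",
    bytes.fromhex("000002"): "embedded rollup transactions",
    bytes.fromhex("000003"): "coordination data feed",
    bytes.fromhex("0000f1"): "gzip-compressed shared segment",
}

# Zero is an invalid identifier (it marks padding), so reject it up front.
assert all(len(i) == 3 and i != bytes(3) for i in WATCHED_IDS)
```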
### Publication Hashes

The rollup node should find the relevant publication hashes in the `PublicationFeed` contract. These will include anything published by its inbox plus any other data feeds it is watching. There may be multiple relevant publications that share the same blobs, so the rollup node will need to track which blobs it is processing and for which reasons. In the simplest case, it is just looking for its own rollup transactions inside publications made by its own inbox.
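A rough sketch of that bookkeeping is below. The `PublicationFeed` interface is not specified in this document, so the field names here are placeholders.

```python
from collections import defaultdict

def index_publications(publications, watched_inboxes, watched_feeds):
    """Track which publication hashes matter to this node, and why."""
    reasons = defaultdict(set)
    for pub in publications:
        if pub["publisher"] in watched_inboxes:
            reasons[pub["hash"]].add("own inbox")
        if pub["publisher"] in watched_feeds:
            reasons[pub["hash"]].add("watched data feed")
    return reasons
```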
### Publication Headers

For each relevant hash, the publication is the concatenation of all the corresponding blobs. The rollup node should retrieve and validate the first 31-byte chunk and interpret it as a header with the following structure:

- the version (set to zero) (1 byte)
- header length in chunks, excluding this one (1 byte)
  - most of the time this field is expected to be `0`, since the first chunk already accommodates 5 segments, which is likely enough for most publications.
- multiplier (1 byte)
  - the log base 2 of a scaling factor for data offsets
  - i.e. if this value is $x$ and an offset is specified as $y$, the relevant data starts at chunk $2^x \cdot y$
- up to 5 data type records, each consisting of
  - a data identifier (3 bytes)
  - the offset to the corresponding segment, after accounting for the multiplier (2 bytes)
- pad the remaining space with zeroes. Note that zero is an invalid data identifier, so there should be no confusion about what counts as padding.

If the header is larger than one chunk, retrieve and validate the remaining chunks. They should all be structured with the same data type records (up to 6 per chunk) and padded with zeros.
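The sketch below decodes this header layout in Python. Details the document leaves open, such as the byte order of the 2-byte offset, are assumptions.

```python
from dataclasses import dataclass
from typing import Iterator, List

CHUNK_SIZE = 31  # usable bytes per chunk, as used throughout this document


@dataclass
class HeaderRecord:
    data_id: bytes  # 3-byte data identifier; an all-zero id marks padding
    offset: int     # segment offset in chunks, before applying the multiplier


@dataclass
class PublicationHeader:
    version: int
    extra_chunks: int  # header length in chunks, excluding the first one
    multiplier: int    # offsets are scaled by 2**multiplier chunks
    records: List[HeaderRecord]


def _records_in(chunk: bytes, start: int, count: int) -> Iterator[HeaderRecord]:
    """Read up to `count` 5-byte records (3-byte id + 2-byte offset) from a chunk."""
    for i in range(count):
        record = chunk[start + 5 * i : start + 5 * i + 5]
        data_id, offset = record[:3], int.from_bytes(record[3:], "big")  # big-endian assumed
        if data_id != b"\x00\x00\x00":  # zero identifiers are padding
            yield HeaderRecord(data_id, offset)


def parse_header(publication: bytes) -> PublicationHeader:
    first = publication[:CHUNK_SIZE]
    version, extra_chunks, multiplier = first[0], first[1], first[2]
    if version != 0:
        raise ValueError("unsupported header version")
    records = list(_records_in(first, 3, 5))  # the first chunk holds up to 5 records
    for c in range(1, extra_chunks + 1):      # continuation chunks hold up to 6 records each
        chunk = publication[c * CHUNK_SIZE : (c + 1) * CHUNK_SIZE]
        records.extend(_records_in(chunk, 0, 6))
    return PublicationHeader(version, extra_chunks, multiplier, records)


def segment_start_chunk(header: PublicationHeader, record: HeaderRecord) -> int:
    """The segment starts at chunk 2**multiplier * offset."""
    return record.offset << header.multiplier
```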
Although the proposal doesn't rely on binary search, it is still useful to have a deterministic structure. Therefore, in addition to validating the header structure, the rollup nodes should ensure:

- all data identifiers are listed in ascending order
- the actual data segments are in the same order (i.e. the segment offsets are strictly increasing)
- all segment offsets point within the publication
  - note that the size of a segment is implicitly defined by where the next segment starts.

> **Review discussion (identifier ordering):** If we do not use binary search, will the order of data identifiers matter?
>
> I don't think the order matters, but my intuition is that data formats should be as deterministic and structured as possible by default. Things like JSON are intentionally flexible because they're partially intended to be human readable. This is a very mild preference because I don't see a particular reason to prefer one over the other.
>
> I would support the ascending-order approach; I would expect linear search to be simpler for small lists, and here we expect around 5 segments.
>
> Does a structured data format also help to identify invalid chunks? I also feel checking the order is not needed as long as each application knows how to read its own data.

> **Review discussion (segment sizes):** I prefer not to use the next segment's offset to determine the current segment's size. This way, app A can use segment [100, 200] while app B uses segment [100, 300], allowing for greater flexibility.
>
> I think we're conceptualising "segments" differently, and perhaps that's relevant to this comment, to whether it matters that the segments are ordered, and to whether we need a length field. I imagine each segment contains a particular type of information (where the type is identified by the data identifier). In your example, something about the data needs to change at position 200 for app A to decide to stop processing it (e.g. perhaps 100-200 represents A's transactions, while B embeds rollup A, so it needs to know about A's transactions but also cares about its own transactions in 200-300). In that case, I'd still call 100-200 one segment and 200-300 another segment, and B would have registered that it wants to process both types. Can you think of a scenario where rollups process partially overlapping information, but it does not make sense to conceptually divide the overlapping part into its own data segment?

Assuming the publication header is valid, the rollup node will know which chunks to retrieve and validate to find its relevant segments.
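A sketch of those checks, reusing the parsing helpers above. Whether duplicate identifiers are allowed, and whether a segment may start inside the header chunks, is not specified here, so the sketch only enforces the rules listed.

```python
def validate_header(header: PublicationHeader, publication_chunks: int):
    """Check the deterministic-structure rules above and derive segment boundaries
    (in chunks). A segment's size is implicit: it runs to the start of the next
    segment, and the last segment runs to the end of the publication."""
    ids = [r.data_id for r in header.records]
    starts = [segment_start_chunk(header, r) for r in header.records]

    if ids != sorted(ids):
        raise ValueError("data identifiers must be listed in ascending order")
    if any(later <= earlier for earlier, later in zip(starts, starts[1:])):
        raise ValueError("segment offsets must be strictly increasing")
    if any(start >= publication_chunks for start in starts):
        raise ValueError("segment offsets must point within the publication")

    ends = starts[1:] + [publication_chunks]
    return list(zip(ids, starts, ends))
```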
### Compressed segments

The data identifier will indicate which kind of compression (e.g. gzip, lz4, etc.) was used. After decompression, the segment should use the same structure so that it can be further subdivided. For simplicity:

- the segment header (which has the same format as the publication header) should be outside the compression. This means the expected contents can be identified (and potentially ignored) before retrieving the whole segment and decompressing it.
- nested compressed segments are discouraged, to avoid nodes recursing indefinitely only to find a poorly formed publication. However, the format identifiers are generic enough to represent any kind of data, including the number of compression rounds.
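Under those rules, a gzip-identified segment could be handled roughly as follows, reusing the earlier helpers. Treating the inner offsets as chunk positions within the decompressed payload is an assumption of this sketch.

```python
import gzip

def split_gzip_segment(segment: bytes):
    """Split a compressed segment: the inner header (same format as the publication
    header) sits outside the compression, so the expected contents can be inspected
    before decompressing the payload."""
    inner = parse_header(segment)
    header_bytes = (1 + inner.extra_chunks) * CHUNK_SIZE
    payload = gzip.decompress(segment[header_bytes:])
    payload_chunks = -(-len(payload) // CHUNK_SIZE)  # ceiling; a trailing partial chunk still counts
    for data_id, start, end in validate_header(inner, payload_chunks):
        yield data_id, payload[start * CHUNK_SIZE : end * CHUNK_SIZE]
```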
### Rollup segments

After finding and decompressing all relevant segments, the rollup node should process them. The data structure should be defined by the particular use case, with the following recommendations:

- avoid derivable information (such as block hashes or state roots)
  - instead, the segment should include the minimum information required to reconstruct the state, which would be something like raw L2 transactions interspersed with block timestamp delimiters.
- pad the rest of the segment with zeros
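As a purely illustrative example of such a structure (none of the markers below are part of this proposal, since each rollup defines its own encoding), a rollup-specific segment might be decoded like this:

```python
def decode_rollup_segment(segment: bytes):
    """Decode a segment laid out as raw L2 transactions interspersed with block
    timestamp delimiters. Every tag and length prefix here is hypothetical."""
    blocks, pos = [], 0
    while pos < len(segment) and segment[pos] != 0x00:    # zero padding ends the data
        tag = segment[pos]
        if tag == 0x01:                                    # hypothetical "new block" delimiter
            timestamp = int.from_bytes(segment[pos + 1 : pos + 9], "big")
            blocks.append({"timestamp": timestamp, "transactions": []})
            pos += 9
        elif tag == 0x02 and blocks:                       # hypothetical length-prefixed raw transaction
            length = int.from_bytes(segment[pos + 1 : pos + 4], "big")
            blocks[-1]["transactions"].append(segment[pos + 4 : pos + 4 + length])
            pos += 4 + length
        else:
            raise ValueError("malformed segment")
    return blocks
```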
> **Review discussion (multiple segments per rollup):** If a rollup consists of multiple segments, should it concatenate these segments (excluding their headers) into a single byte array and validate them as a whole, or should each segment be validated individually, with the results (potentially structured) then concatenated together?
>
> I'm not sure what we're validating here, but I don't think you would concatenate them in either case. Any data that should be processed together should be part of the same segment. If a rollup processes multiple segments, it will be because they're broken down into semantically different things.

> **Review discussion (headers across blobs):** If there is more than one blob in a publication, will only the first blob contain the header (if the header is small enough to fit into a blob)?
>
> Yes, it makes sense to think of the whole structure as corresponding to one file. The header is the start of the file, which spans as many blobs as necessary.
>
> In practice the header should fit at the beginning of the first blob, since we're not expecting a significant number of `application_id`s. In theory a header can span more than one blob, but that doesn't seem realistic.