Describe proposed blob structure #64

Open · wants to merge 19 commits into `main`
`documentation/Blob Structure.md` (104 additions, 0 deletions)
# Blob Structure

## Context

Blobs can be shared between multiple rollups, as noted in the [Overall Design document](./Overall%20Design.md). The proposed blob structure is guided by the following design goals:

- blob boundaries are not meaningful, but each publication should be contained within a transaction.
- allow proposers to decide per-publication which rollups to include, based on proposal rights and available transactions.
- support any compression algorithm, enabling shared compression for better ratios when multiple rollups use the same algorithm.
- remain minimally opinionated, so rollups can update the structure as desired.


## Existing standards

### Nethermind's proposal

Nethermind's [proposal](https://hackmd.io/@linoscope/blob-sharing-for-based-rollups) is simple but suboptimal for a minimal rollup case because:

- segmentation information is published on L1, increasing L1 requirements.
- rollups sharing a compression algorithm would need to decompress the publication before splitting it.

### Spire's proposal

Spire has created [this proposal](https://paragraph.xyz/@spire/shared-blob-compression). However, a Merkle tree structure may not be useful because:

- the main advantage of a Merkle tree is to efficiently prove some leaves. In our case we will need each rollup to process the full list of transactions in the publication, so at the very least we should treat each rollup's transactions as one leaf.
- if two rollups share a compression algorithm, both transaction lists will need to be decompressed together (if they want to take advantage of shared compression), so in that case they should also be a single leaf, and we will need an additional mechanism to describe the split within the compressed blob.

It seems a flat structure (which is one of the options they mention) is both simpler and more efficient.

Their proposal also requires registering with the blob aggregation service so it knows which DA layer and compression mechanism to use. Moreover, the rollups are responsible for ensuring the blobs they retrieve from the aggregation service match the ones that were submitted (verified with a signature).

The minimal rollup scope only considers Ethereum as the DA layer, which avoids introducing an additional off-chain service. Since the supported compression algorithms are unlikely to change often, rollups can register them on chain so individual L1 proposers can choose their preferred algorithm.

### Dankrad's proposal

Dankrad has created [this proposal](https://ethresear.ch/t/suggested-format-for-shard-blob-header/9996), which is mostly suitable for a minimal rollup with the following modifications:

- identify each supported compression algorithm with an `application_id`, as if it were a _format_ identifier
- multiple rollups can then find and decompress the same compressed data, so a way to split the decompressed data is required. The suggested approach is to reuse the same segmenting scheme inside the compressed segment
- his post allows for up to 1535 applications per blob, which has some cascading effects:
  - the header is expected to be sorted by application ID, enabling each application to binary search for the pointer to the start of its compressed data in the blob
  - if the search fails, whether because of unsorted data or for any other reason, the data is considered unusable. Applications must not reach different conclusions about what data is stored for a given application ID
  - even though publications can span several blobs, only a handful of rollups are expected to publish together, perhaps grouped under the same _compression id_ (i.e. `application_id`). Instead of a binary search, a linear search through the whole header can be used to ensure ids are sorted.


## The proposal

With these modifications to Dankrad's proposal, here is a complete description of the suggested design. The description focuses on how an L2 rollup node must update its state; the corresponding protocol for proposers and provers should follow naturally.

### Preparation

Whoever configures the rollup should publish (perhaps in an on-chain registry) the full list of unique 3-byte data identifiers that the rollup nodes should process. This should include an identifier for:

- its own rollup transactions
- any other rollup transactions it should process (eg. for embedded rollups)
- any other data feeds it should process (eg. for synchrony or coordination)
- any compression (or other data processing) algorithms it supports
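For illustration only, such a registration might boil down to a small set of identifiers that a node watches. The names and values below are hypothetical; real identifiers would be assigned via the registry.

```python
# Hypothetical 3-byte data identifiers for a single rollup node (illustrative values).
OWN_TRANSACTIONS    = 0x000001  # this rollup's own transactions
EMBEDDED_ROLLUP_TXS = 0x000002  # an embedded rollup's transactions
COORDINATION_FEED   = 0x000003  # a shared synchrony/coordination feed
GZIP_SEGMENT        = 0x0000FF  # gzip-compressed data that may contain the above

WATCHED_IDS = {OWN_TRANSACTIONS, EMBEDDED_ROLLUP_TXS, COORDINATION_FEED, GZIP_SEGMENT}
```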

### Publication Hashes

The rollup node should find relevant publication hashes in the `PublicationFeed` contract. This will include anything published by its inbox plus any other data feeds that it is watching. There may be multiple relevant publications that share the same blobs, and the rollup node will need to track which blobs it is processing for which reasons. In the simplest case it is just looking for its own rollup transactions inside publications made by its own inbox.

### Publication Headers

For each relevant hash, the publication is the concatenation of all the corresponding blobs. The rollup node should retrieve and validate the first 31-byte chunk and interpret it as a header with the following structure:

- the version (1 byte), set to zero
- the header length in chunks, excluding this one (1 byte)
  - most of the time this field is expected to be `0`, since the first chunk already accommodates 5 segment records, which is likely enough for most publications
- the multiplier (1 byte)
  - the log base 2 of a scaling factor for data offsets
  - i.e. if this value is $x$ and an offset is specified as $y$, the relevant data starts at chunk $2^x \cdot y$
- up to 5 data type records, each consisting of
  - a data identifier (3 bytes)
  - the offset to the corresponding segment, to be scaled by the multiplier (2 bytes)
- zero padding for the remaining space. Note that zero is an invalid data identifier, so there should be no confusion about what counts as padding.

If the header is larger than one chunk, retrieve and validate the remaining chunks. They should all be structured with the same data type records (up to 6 in a chunk) and padded with zeros.
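To make the layout concrete, here is a minimal Python sketch of how a node might parse such a header. It only illustrates the chunk layout described above; the helper names and error handling are placeholders, not part of the proposal.

```python
from dataclasses import dataclass

CHUNK_SIZE = 31   # usable bytes per blob field element
RECORD_SIZE = 5   # 3-byte data identifier + 2-byte offset

@dataclass
class SegmentRecord:
    data_id: int       # 3-byte data identifier from the registry
    chunk_offset: int  # segment start, in chunks, after applying the multiplier

def parse_publication_header(chunks: list[bytes]) -> list[SegmentRecord]:
    """Parse the publication header from a list of 31-byte chunks."""
    first = chunks[0]
    if first[0] != 0:
        raise ValueError("unsupported header version")
    extra_chunks = first[1]      # header length in chunks, excluding the first
    multiplier = 1 << first[2]   # first[2] is the log-2 exponent of the offset scaling factor

    records: list[SegmentRecord] = []

    def read_records(chunk: bytes, start: int) -> None:
        for pos in range(start, len(chunk) - RECORD_SIZE + 1, RECORD_SIZE):
            data_id = int.from_bytes(chunk[pos:pos + 3], "big")
            if data_id == 0:     # zero identifier: the rest of the chunk is padding
                return
            offset = int.from_bytes(chunk[pos + 3:pos + 5], "big")
            records.append(SegmentRecord(data_id, offset * multiplier))

    read_records(first, 3)                 # up to 5 records in the first chunk
    for i in range(1, 1 + extra_chunks):   # up to 6 records per extension chunk
        read_records(chunks[i], 0)
    return records
```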
**Collaborator:** If there is more than one blob in a publication, will only the first blob contain the header (assuming the header is small enough to fit into one blob)?

**Collaborator (Author):** Yes, I think it makes sense to think of the whole structure as corresponding to one file. The header is the start of the file, which spans as many blobs as necessary.

**Member:** Yes, so the header only occupies the beginning of the first blob, since we're not expecting a significant number of `application_id`s. In theory a header could even span more than one blob, but I don't think that's realistic.


Although the proposal doesn't rely on binary search, it is still useful to have a deterministic structure. Therefore, in addition to validating the header structure, the rollup nodes should ensure:

- all data identifiers are listed in ascending order
**Collaborator:** If we do not use binary search, will the order of data identifiers matter?

**Collaborator (Author):** I don't think the order matters, but my intuition is that data formats should be as deterministic/structured as possible by default. I think things like json are intentionally flexible because they're partially intended to be human readable.

If we insist on ascending order then:

- two publications can be checked for equality directly (by comparing their bytes or their hashes)
- we can reintroduce binary search if the sizes ever get large enough

This is a very mild preference because I don't see a particular reason to prefer one over the other.

**Member:** I would support the ascending order approach; I would expect linear search to be simpler for small lists. Here we expect around 5 segments (or `application_id`s).

> I don't think the order matters, but my intuition is that data formats should be as deterministic/structured as possible by default.

Does a structured data format also help to identify invalid chunks? I also feel checking the order is not needed as long as the application knows how to read its own data.

- the actual data segments are also in the same order (i.e. the segment offsets are strictly increasing)
**Collaborator:** I prefer not to use the next segment's offset to determine the current segment's size. This way, app A can use segment [100, 200], while app B can use segment [100, 300], allowing for greater flexibility.

**Collaborator (Author):** I think we're conceptualising "segments" differently, and perhaps that's relevant to this comment, to whether it matters that the segments are ordered, and to whether we need a length field.

I imagine each segment contains a particular type of information (where the type is identified by the data format id). In your example, something about the data needs to change at position 200 for app A to decide to stop processing it (eg. perhaps 100-200 represents A's transactions, while B embeds rollup A so it needs to know about A's transactions, but also needs to care about its own transactions in 200-300). In that case, I'd still call 100-200 one segment and 200-300 another segment, and B would have registered that it wants to process both types.

Can you think of a scenario where rollups process partially overlapping information, but it does not make sense to conceptually divide the overlapping part into its own data segment?

- all segment offsets point within the publication
- note that the size of a segment is implicitly defined by where the next segment starts.

Assuming the publication header is valid, the rollup node will know which chunks to retrieve and validate to find its relevant segments.

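Continuing the hypothetical Python sketch above, the checks and the implicit segment sizes could look like this; treating "ascending" as strictly ascending (no duplicate identifiers) is an assumption.

```python
def validate_and_slice(records: list[SegmentRecord],
                       publication: bytes) -> dict[int, bytes]:
    """Validate the header records and slice the publication into segments."""
    total_chunks = len(publication) // CHUNK_SIZE
    segments: dict[int, bytes] = {}
    for i, rec in enumerate(records):
        if i > 0:
            prev = records[i - 1]
            if rec.data_id <= prev.data_id:       # assumes no duplicate identifiers
                raise ValueError("data identifiers must be in ascending order")
            if rec.chunk_offset <= prev.chunk_offset:
                raise ValueError("segment offsets must be strictly increasing")
        if rec.chunk_offset >= total_chunks:
            raise ValueError("segment offset points outside the publication")
        # a segment implicitly ends where the next one starts (or at the end)
        end = records[i + 1].chunk_offset if i + 1 < len(records) else total_chunks
        segments[rec.data_id] = publication[rec.chunk_offset * CHUNK_SIZE:
                                            end * CHUNK_SIZE]
    return segments
```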

### Compressed segments

The data identifier will indicate which kind of compression (eg. gzip, lz4, etc) was used. After decompression, the segment should use the same structure so that it can be further subdivided. For simplicity:

- the segment header (which has the same format as the publication header) should be outside the compression. This means the expected contents can be identified (and potentially ignored) before retrieving the whole segment and decompressing it.
- nested compressed segments are discouraged, to avoid nodes recursing indefinitely only to find a poorly-formed publication. However, format identifiers are generic enough to represent any kind of data, including the number of compression rounds.

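As an illustration, a node that has registered the hypothetical `GZIP_SEGMENT` identifier from the earlier sketch might handle such a segment as follows. The assumption that the inner header's offsets refer to positions in the decompressed body (rather than the compressed bytes) is mine, not part of the proposal.

```python
import gzip

def expand_compressed_segment(segment: bytes, wanted_ids: set[int]) -> dict[int, bytes]:
    """Inspect a gzip-compressed segment's (uncompressed) header, then decompress."""
    chunks = [segment[i:i + CHUNK_SIZE] for i in range(0, len(segment), CHUNK_SIZE)]
    records = parse_publication_header(chunks)   # same format as the publication header
    if not any(rec.data_id in wanted_ids for rec in records):
        return {}                                # nothing relevant: skip decompression
    header_chunks = 1 + chunks[0][1]             # first header chunk plus extensions
    body = gzip.decompress(segment[header_chunks * CHUNK_SIZE:])
    inner = validate_and_slice(records, body)    # reuse the same segmenting scheme
    return {data_id: data for data_id, data in inner.items() if data_id in wanted_ids}
```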
### Rollup segments
**Collaborator:** If a rollup consists of multiple segments, should it concatenate these segments (excluding their headers) into a single byte array and validate them as a whole, or should each segment be validated individually, with the results (potentially structured) then concatenated together?

**Collaborator (Author):** I'm not sure what we're validating here, but I think you wouldn't concatenate them in either case. Any data that should be processed together should be part of the same segment. If a rollup processes multiple segments, it will be because they're broken down into semantically different things like:

- the rollup transactions
- another rollup's transactions
- coordination information with other rollups
- potentially configuration information or updates that do not require releasing new node software

So the idea would be:

- use the headers to identify and locate the relevant segments
- if there are compressed segments, use the headers to ensure they are relevant and then decompress / locate the relevant segments
- you now have different segments that need to be processed independently


After finding and decompressing all relevant segments, the rollup node should process them. The data structure should be defined by the particular use case, with the following recommendations:

- avoid derivable information (such as block hashes or state roots)
  - instead, the segment should include the minimum information required to reconstruct the state, which would be something like raw L2 transactions interspersed with block timestamp delimiters
- pad the rest of the segment with zeros
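Purely as an illustration of "minimum information required to reconstruct the state", a segment along these lines could interleave block timestamp delimiters with length-prefixed raw transactions. The marker bytes and field widths below are invented for the example; the actual encoding is up to each rollup.

```python
BLOCK_DELIMITER = 0x01  # hypothetical marker: an 8-byte block timestamp follows
TX_MARKER = 0x02        # hypothetical marker: 3-byte length + raw L2 transaction follows

def decode_rollup_segment(segment: bytes) -> list[tuple[int, list[bytes]]]:
    """Return (block timestamp, [raw transactions]) pairs from a rollup segment."""
    blocks: list[tuple[int, list[bytes]]] = []
    i = 0
    while i < len(segment) and segment[i] != 0:   # zero padding ends the segment
        if segment[i] == BLOCK_DELIMITER:
            timestamp = int.from_bytes(segment[i + 1:i + 9], "big")
            blocks.append((timestamp, []))
            i += 9
        elif segment[i] == TX_MARKER and blocks:
            tx_len = int.from_bytes(segment[i + 1:i + 4], "big")
            blocks[-1][1].append(segment[i + 4:i + 4 + tx_len])
            i += 4 + tx_len
        else:
            raise ValueError("malformed rollup segment")
    return blocks
```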