Describe proposed blob structure #64

Open · wants to merge 19 commits into main

Conversation

nikeshnazareth (Collaborator)

No description provided.


- the version (set to zero) (1 byte)
- header length in chunks, excluding this one (1 byte)
  - I expect this to be zero most of the time

Collaborator:

Why is the header length zero most of the time?

nikeshnazareth (Collaborator, author), Mar 11, 2025:

Because I suspect the 5 segments already described by the first chunk will be sufficient. We only need one segment per content type (e.g. rollup, or compression mechanism). I just added this explanation to the doc.
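
To make the arithmetic concrete, here is a sketch of how the header-length field follows from the segment count, assuming the capacities described in this thread (5 records in the first chunk, up to 6 in each later one):

```python
def header_length_in_chunks(num_segments: int) -> int:
    """Value of the 1-byte header-length field (chunks excluding the first).

    Assumes the capacities discussed here: the first chunk describes up
    to 5 segments, and each additional chunk up to 6.
    """
    if num_segments <= 5:
        return 0  # the expected common case
    return -(-(num_segments - 5) // 6)  # ceiling division for the overflow

assert header_length_in_chunks(5) == 0
assert header_length_in_chunks(6) == 1
assert header_length_in_chunks(17) == 2
```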

- offset to corresponding segment after accounting for the multiplier (2 bytes)
- Pad the remaining space with zeroes. Note that zero is an invalid data identifier, so there should be no confusion about what counts as padding.

If the header is larger than one chunk, retrieve and validate the remaining chunks. They should all be structured with the same data type records (up to 6 in a chunk) and padded with zeros.
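
As a rough sketch of that validation: the 2-byte offset comes from the structure above, while the 31-byte chunk and 3-byte data identifier are assumptions chosen so that 5 records fit in the first chunk and 6 in each later one:

```python
CHUNK_SIZE = 31   # assumption: usable bytes per header chunk
ID_SIZE = 3       # assumption: width of a data identifier
RECORD_SIZE = ID_SIZE + 2      # identifier + 2-byte offset (from the doc)
RECORDS_PER_CHUNK = 6          # up to 6 records per chunk after the first

def parse_extra_header_chunk(chunk: bytes) -> list[tuple[int, int]]:
    """Parse a header chunk after the first: up to 6 (identifier, offset)
    records, zero-padded. A zero identifier marks the start of padding."""
    if len(chunk) != CHUNK_SIZE:
        raise ValueError("bad chunk size")
    records = []
    for i in range(RECORDS_PER_CHUNK):
        record = chunk[i * RECORD_SIZE:(i + 1) * RECORD_SIZE]
        identifier = int.from_bytes(record[:ID_SIZE], "big")
        if identifier == 0:
            # zero is an invalid data identifier, so this is padding;
            # everything from here to the end of the chunk must be zero
            if any(chunk[i * RECORD_SIZE:]):
                raise ValueError("nonzero bytes in padding")
            return records
        records.append((identifier, int.from_bytes(record[ID_SIZE:], "big")))
    if any(chunk[RECORDS_PER_CHUNK * RECORD_SIZE:]):  # final stray byte
        raise ValueError("nonzero bytes in padding")
    return records
```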

Collaborator:

If there is more than one blob in a publication, will only the first blob contain the header (assuming the header is small enough to fit into one blob)?

nikeshnazareth (Collaborator, author):

Yes, I think it makes sense to think of the whole structure as corresponding to one file. The header is the start of the file, which spans as many blobs as necessary.

Member:

Yes, so the header occupies only the beginning of the first blob, since we're not expecting a significant number of application_ids. In theory, a header could even span more than one blob, but I don't think that's realistic.


Although we are not using binary search, I still think it is useful to have a deterministic structure. Therefore, in addition to validating the header structure, the rollup nodes should ensure:

- all data identifiers are listed in ascending order

Collaborator:

If we do not use binary search, will the order of data identifiers matter?

nikeshnazareth (Collaborator, author):

I don't think the order matters, but my intuition is that data formats should be as deterministic/structured as possible by default. I think things like JSON are intentionally flexible because they're partially intended to be human-readable.

If we insist on ascending order then:

- two publications can be checked for equality directly (by comparing their bytes or their hashes)
- we can reintroduce binary search if the sizes ever get large enough

This is a very mild preference because I don't see a particular reason to prefer one over the other.
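
For illustration, both points (and the ordering check itself) are only a few lines; identifier values here are plain integers:

```python
import bisect
import hashlib

def validate_ascending(identifiers: list[int]) -> None:
    # strictly ascending order also rules out duplicate identifiers
    if any(b <= a for a, b in zip(identifiers, identifiers[1:])):
        raise ValueError("data identifiers must be strictly ascending")

def publications_equal(pub_a: bytes, pub_b: bytes) -> bool:
    # a canonical ordering makes equality a plain byte (or hash) comparison
    return hashlib.sha256(pub_a).digest() == hashlib.sha256(pub_b).digest()

def find_segment(identifiers: list[int], target: int) -> int:
    # binary search can be reintroduced later with no format change
    i = bisect.bisect_left(identifiers, target)
    if i == len(identifiers) or identifiers[i] != target:
        raise KeyError(target)
    return i
```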

Member:

I would support the ascending-order approach; I would expect linear search to be simpler for small lists. Here we expect around 5 segments (or application_ids).

> I don't think the order matters, but my intuition is that data formats should be as deterministic/structured as possible by default.

Does a structured data format also help to identify invalid chunks? I also feel checking the order is not needed as long as each application knows how to read its own data.


The data identifier will indicate which kind of compression (eg. gzip, lz4, etc) was used. After decompression, the segment should use the same structure so that it can be further subdivided. For simplicity:

- the segment header should be outside the compression. This means the expected contents can be identified (and potentially ignored) before retrieving the whole segment and decompressing it.

Collaborator:

There is no definition of "segment header"...

nikeshnazareth (Collaborator, author):

I included a comment emphasising that it's the same format as the publication header.

- we should disallow nested compressed segments. I suspect they are unlikely to achieve much (you can always define the data identifier as multiple rounds of a compression algorithm) and we don't want the node to have to recurse indefinitely only to find the publication is poorly formed.
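
A minimal sketch of both rules, assuming a hypothetical DECOMPRESSORS table keyed by data identifier and a segment header already parsed into (identifier, offset) records:

```python
import gzip

# Assumption: some data identifiers are reserved for compression formats.
# The mapping 1 -> gzip is illustrative only; lz4 etc. would be added here.
DECOMPRESSORS = {1: gzip.decompress}

def expand_compressed_segment(identifier, records, body, wanted):
    """`records` comes from the segment header, which sits *outside* the
    compression, so relevance can be checked before decompressing `body`."""
    if not any(ident in wanted for ident, _ in records):
        return []  # nothing relevant: skip retrieval and decompression
    data = DECOMPRESSORS[identifier](body)
    ends = [off for _, off in records[1:]] + [len(data)]
    out = []
    for (ident, start), end in zip(records, ends):
        if ident in DECOMPRESSORS:
            # nested compressed segments are disallowed: fail immediately
            # rather than recursing into a potentially malformed publication
            raise ValueError("nested compressed segment")
        if ident in wanted:
            out.append((ident, data[start:end]))
    return out
```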

### Rollup segments

Collaborator:

If a rollup consists of multiple segments, should it concatenate these segments (excluding their headers) into a single byte array and validate them as a whole, or should each segment be validated individually, with the results (potentially structured) then concatenated together?

nikeshnazareth (Collaborator, author):

I'm not sure what we're validating here, but I think you wouldn't concatenate them in either case. Any data that should be processed together should be part of the same segment. If a rollup processes multiple segments, it will be because they're broken down into semantically different things like:

- the rollup transactions
- another rollup's transactions
- coordination information with other rollups
- potentially configuration information or updates that do not require releasing new node software

So the idea would be:

- use the headers to identify and locate the relevant segments
- if there are compressed segments, use the headers to ensure they are relevant and then decompress / locate the relevant segments
- you now have different segments that need to be processed independently (sketched below)
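
Roughly, in code (handler names and identifier values are hypothetical, mirroring the example segment types above):

```python
# Hypothetical handlers for the example segment types; each data identifier
# a node registers for maps to its own independent handler.
def process_own_transactions(data: bytes) -> None: ...
def process_embedded_rollup(data: bytes) -> None: ...
def process_coordination_info(data: bytes) -> None: ...
def process_config_update(data: bytes) -> None: ...

HANDLERS = {
    10: process_own_transactions,   # identifier values illustrative
    11: process_embedded_rollup,
    12: process_coordination_info,
    13: process_config_update,
}

def process_segments(segments):
    """`segments`: (identifier, payload) pairs recovered via the headers
    (after at most one level of decompression). Segments are never
    concatenated; each one is processed independently."""
    for identifier, payload in segments:
        handler = HANDLERS.get(identifier)
        if handler is not None:
            handler(payload)
```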

- the actual data segments are also in the same order (i.e. the segment offsets are strictly increasing)

Collaborator:

I prefer not to use the next segment’s offset to determine the current segment’s size. This way, app A can use segment [100, 200], while app B can use segment [100, 300], allowing for greater flexibility.

nikeshnazareth (Collaborator, author):

I think we're conceptualising "segments" differently, and perhaps that's relevant to this comment, whether it matters that the segments are ordered, and whether we need a length field.

I imagine each segment contains a particular type of information (where the type is identified by the data format id). In your example, something about the data needs to change at position 200 for app A to decide to stop processing it (e.g. perhaps 100-200 represents A's transactions, while B embeds rollup A so it needs to know about A's transactions, but also needs to care about its own transactions in 200-300). In that case, I'd still call 100-200 one segment, and 200-300 another segment, and B would have registered that it wants to process both types.

Can you think of a scenario where rollups process partially overlapping information, but it does not make sense to conceptually divide the overlapping part as its own data segment?

ggonzalez94 added the documentation label on Mar 10, 2025

ernestognw (Member) left a comment:

Looking good. I reworded some sections, mainly aiming for clarity and to make the writing less personal.



- note that the size of a segment is implicitly defined by where the next segment starts.

Assuming the publication header is valid, the rollup node will know which chunks to retrieve and validate to find its relevant segments.
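
A sketch of that slicing rule, combined with the strictly-increasing-offsets check from earlier in the thread:

```python
def slice_segments(records, data: bytes):
    """`records`: (identifier, offset) pairs from the header. Each segment
    ends where the next one starts; the last runs to the end of the data."""
    offsets = [offset for _, offset in records]
    if offsets != sorted(set(offsets)):
        raise ValueError("segment offsets must be strictly increasing")
    ends = offsets[1:] + [len(data)]
    return [(ident, data[start:end])
            for (ident, start), end in zip(records, ends)]

# e.g. two segments covering bytes [0, 120) and [120, 300)
segs = slice_segments([(3, 0), (7, 120)], bytes(300))
assert [len(payload) for _, payload in segs] == [120, 180]
```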


ernestognw (Member):

LGTM

nikeshnazareth and others added 4 commits March 13, 2025 18:44
Co-authored-by: Ernesto García <[email protected]>