
Add API for sharded serialization to a digest #198

Merged
merged 5 commits into sigstore:main from api-manifest on Jun 6, 2024

Conversation

mihaimaruseac
Collaborator

Summary

This is what used to be `serialize_v1`.

Additionally, in this change we rename `serializing` to `serialization` to be grammatically correct. We expose `shard_size` and a new `digest_size` method from all hashing engines. We also make imports more consistent.
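For context, a minimal sketch of the hashing-engine surface described above; the class name and default shard size are illustrative assumptions, not the library's actual API, only the `shard_size` and `digest_size` accessors reflect what this PR exposes.

```python
# Sketch only: class name and defaults are hypothetical.
import hashlib


class ShardedFileHasherSketch:
    """Hashes file contents shard by shard with a content hasher."""

    def __init__(self, shard_size: int = 1_000_000):
        self._shard_size = shard_size
        self._content_hasher = hashlib.sha256()

    @property
    def shard_size(self) -> int:
        # Exposed so serialization code can compute shard boundaries.
        return self._shard_size

    @property
    def digest_size(self) -> int:
        # Length of the produced digest in bytes, mirroring hashlib.
        return self._content_hasher.digest_size
```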

Release Note

NONE

Documentation

NONE

@mihaimaruseac
Collaborator Author

Lint will be fixed after #199.

@mihaimaruseac force-pushed the api-manifest branch 3 times, most recently from 236f0c2 to f4a4b32 on June 5, 2024 22:27
This is what used to be `serialize_v1`.

Additionally, in this change we rename `serializing` to `serialization`
to be grammatically correct. We expose `shard_size` and a new
`digest_size` method from all hashing engines. We also make imports be
more consistent.

Signed-off-by: Mihai Maruseac <[email protected]>
digest_len = self._merge_hasher.digest_size
digests_buffer = bytearray(len(tasks) * digest_len)

with concurrent.futures.ThreadPoolExecutor(
Collaborator


Do we want to use a thread or a process pool executor? If the tasks are CPU bound, then threads won't help, if I understand correctly, because of the GIL.

Collaborator Author


We cannot use `ProcessPoolExecutor` in a library; it's supposed to have visibility into `__main__`.

Also, here we are actually better off using threads, since the computation is I/O bound, not CPU bound. Each file read releases the GIL.

I'll run a benchmark with both, of course.
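For illustration, a minimal sketch of the thread-pool pattern quoted above; the task layout and the `_hash_shard` helper are assumptions for the sketch, not the PR's actual helpers.

```python
# Sketch only: task shape and helper names are hypothetical.
import concurrent.futures
import hashlib


def _hash_shard(path: str, start: int, end: int) -> bytes:
    """Reads one shard of a file and returns its digest; the read releases the GIL."""
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    return hashlib.sha256(data).digest()


def hash_shards(tasks: list[tuple[str, int, int]]) -> bytearray:
    """Hashes all shards in parallel, writing each digest at a fixed offset."""
    digest_len = hashlib.sha256().digest_size
    digests_buffer = bytearray(len(tasks) * digest_len)

    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {pool.submit(_hash_shard, *task): i for i, task in enumerate(tasks)}
        for future in concurrent.futures.as_completed(futures):
            i = futures[future]
            digests_buffer[i * digest_len : (i + 1) * digest_len] = future.result()

    return digests_buffer
```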

# Note: no "." at end here, it will be added by `join` on return.
encoded_range = b""

return b".".join([encoded_type, encoded_name, encoded_range])
Collaborator

@laurentsimon Jun 5, 2024


This will output `type.name.xxx-yyy`. I think we need an additional `.` at the end of the string, otherwise it's malleable: `type.name.xxx-yyy` + `content` == `type.name.xxx-y` + `yycontent`. I think we're doing plain concatenation of header and content. Let me know if that's not the case.

Collaborator Author


I'm adding a dot at the end in line 65 (`{start}-{end}.`).

Collaborator


It might be more readable to add it to the return line... but that does not work with the empty `encoded_range` :/
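A hypothetical sketch of the framing being discussed (field names and layout are assumptions, not the code under review); only the trailing-dot behavior mirrors the thread above.

```python
# Hypothetical framing sketch; the trailing "." keeps header + content unambiguous.
def _build_header(entry_type: bytes, name: bytes, start: int | None, end: int | None) -> bytes:
    if start is None:
        # Note: no "." at end here, it will be added by `join` on return,
        # so the header still ends with a terminating dot.
        encoded_range = b""
    else:
        # The trailing "." makes plain concatenation non-malleable:
        # b"type.name.1-20." + b"x" cannot equal b"type.name.1-2." + b"0x".
        encoded_range = f"{start}-{end}.".encode()
    return b".".join([entry_type, name, encoded_range])
```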

model_signing/serialization/dfs.py (outdated, resolved)
@@ -219,4 +225,4 @@ def compute(self) -> hashing.Digest:
     def digest_name(self) -> str:
         if self._digest_name_override is not None:
             return self._digest_name_override
-        return f"file-{self._content_hasher.digest_name}-{self._shard_size}"
+        return f"file-{self._content_hasher.digest_name}-{self.shard_size}"
Collaborator


Note: here we have to be careful that digest names do not contain the character `-`. Either we add a check somewhere, or we use a different character, like `$` or `|`. Maybe you're thinking of another PR to customize this value anyway, since `name-shard_size` could lead to confusion since the sharding "algo" is not specified.

Collaborator Author


Hmm, we could standardize on using `.` everywhere between the fields.

So far, we control all the components, so there's no risk of confusion. I'm pushing testing / more customization for this until after we have most of the API in place (at least after we have the full manifest format), as we should be able to quickly fix issues.
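To illustrate the separator concern with purely hypothetical values: with `-` as the field separator, two distinct (digest name, shard size) pairs can render to the same string.

```python
# Hypothetical values showing how "-" as a field separator can be ambiguous.
name_a = f"file-{'sha256-128'}-{1000}"    # digest name "sha256-128", shard size 1000
name_b = f"file-{'sha256'}-{'128-1000'}"  # digest name "sha256", trailing "128-1000"
assert name_a == name_b == "file-sha256-128-1000"
```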

@mihaimaruseac mihaimaruseac merged commit 24d68a4 into sigstore:main Jun 6, 2024
18 checks passed
@mihaimaruseac mihaimaruseac deleted the api-manifest branch June 6, 2024 18:01