
Add API for sharded serialization to a digest #198

Merged
merged 5 commits into sigstore:main from api-manifest on Jun 6, 2024

Conversation

mihaimaruseac
Collaborator

Summary

This is what used to be `serialize_v1`.

Additionally, in this change we rename `serializing` to `serialization` to be grammatically correct. We expose `shard_size` and a new `digest_size` method from all hashing engines. We also make imports more consistent.
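For context, a minimal sketch of the hashing-engine surface described above; the class name and default shard size are illustrative assumptions, not the library's actual API, only the `shard_size` and `digest_size` accessors reflect what this PR exposes.

```python
# Sketch only: class name and defaults are hypothetical.
import hashlib


class ShardedFileHasherSketch:
    """Hashes file contents shard by shard with a content hasher."""

    def __init__(self, shard_size: int = 1_000_000):
        self._shard_size = shard_size
        self._content_hasher = hashlib.sha256()

    @property
    def shard_size(self) -> int:
        # Exposed so serialization code can compute shard boundaries.
        return self._shard_size

    @property
    def digest_size(self) -> int:
        # Length of the produced digest in bytes, mirroring hashlib.
        return self._content_hasher.digest_size
```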

Release Note

NONE

Documentation

NONE

@mihaimaruseac
Collaborator Author

Lint will be fixed after #199.

@mihaimaruseac force-pushed the api-manifest branch 3 times, most recently from 236f0c2 to f4a4b32 on June 5, 2024 22:27
This is what used to be `serialize_v1`.

Additionally, in this change we rename `serializing` to `serialization`
to be grammatically correct. We expose `shard_size` and a new
`digest_size` method from all hashing engines. We also make imports be
more consistent.

Signed-off-by: Mihai Maruseac <[email protected]>
digest_len = self._merge_hasher.digest_size
digests_buffer = bytearray(len(tasks) * digest_len)

with concurrent.futures.ThreadPoolExecutor(
Collaborator


Do we want to use a thread or a process pool executor? If the tasks are CPU bound, then threads won't help, if I understand correctly, because of the GIL.

Collaborator Author


We cannot use `ProcessPoolExecutor` in a library; it's supposed to have visibility into `__main__`.

Also, here we are actually better off using threads, since the computation is I/O bound, not CPU bound. Each file read releases the GIL.

I'll run a benchmark with both, of course.
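For illustration, a minimal sketch of the thread-pool pattern quoted above; the task layout and the `_hash_shard` helper are assumptions for the sketch, not the PR's actual helpers.

```python
# Sketch only: task shape and helper names are hypothetical.
import concurrent.futures
import hashlib


def _hash_shard(path: str, start: int, end: int) -> bytes:
    """Reads one shard of a file and returns its digest; the read releases the GIL."""
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    return hashlib.sha256(data).digest()


def hash_shards(tasks: list[tuple[str, int, int]]) -> bytearray:
    """Hashes all shards in parallel, writing each digest at a fixed offset."""
    digest_len = hashlib.sha256().digest_size
    digests_buffer = bytearray(len(tasks) * digest_len)

    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {pool.submit(_hash_shard, *task): i for i, task in enumerate(tasks)}
        for future in concurrent.futures.as_completed(futures):
            i = futures[future]
            digests_buffer[i * digest_len : (i + 1) * digest_len] = future.result()

    return digests_buffer
```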

# Note: no "." at end here, it will be added by `join` on return.
encoded_range = b""

return b".".join([encoded_type, encoded_name, encoded_range])
Collaborator

@laurentsimon Jun 5, 2024


This will output `type.name.xxx-yyy`. I think we need an additional `.` at the end of the string, otherwise it's malleable: `type.name.xxx-yyy` + `content` == `type.name.xxx-y` + `yycontent`. I think we're doing plain concatenation of header and content. Let me know if that's not the case.

Collaborator Author


I'm adding a dot at the end in line 65 (`{start}-{end}.`).

Collaborator


It might be more readable to add it to the return line... but that does not work with the empty `encoded_range` :/
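A hypothetical sketch of the framing being discussed (field names and layout are assumptions, not the code under review); only the trailing-dot behavior mirrors the thread above.

```python
# Hypothetical framing sketch; the trailing "." keeps header + content unambiguous.
def _build_header(entry_type: bytes, name: bytes, start: int | None, end: int | None) -> bytes:
    if start is None:
        # Note: no "." at end here, it will be added by `join` on return,
        # so the header still ends with a terminating dot.
        encoded_range = b""
    else:
        # The trailing "." makes plain concatenation non-malleable:
        # b"type.name.1-20." + b"x" cannot equal b"type.name.1-2." + b"0x".
        encoded_range = f"{start}-{end}.".encode()
    return b".".join([entry_type, name, encoded_range])
```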

model_signing/serialization/dfs.py (outdated, resolved)
@@ -219,4 +225,4 @@ def compute(self) -> hashing.Digest:
     def digest_name(self) -> str:
         if self._digest_name_override is not None:
             return self._digest_name_override
-        return f"file-{self._content_hasher.digest_name}-{self._shard_size}"
+        return f"file-{self._content_hasher.digest_name}-{self.shard_size}"
Collaborator


Note: here we have to be careful that digest names do not contain the character `-`. Either we add a check somewhere, or we use a different character, like `$` or `|`. Maybe you're thinking of another PR to customize this value anyway, since `name-shard_size` could lead to confusion since the sharding "algo" is not specified.

Collaborator Author


Hmm, we could standardize on using `.` everywhere between the fields.

So far, we control all the components, so there's no risk of confusion. I'm pushing testing / more customization for this until after we have most of the API in place (at least after we have the full manifest format), as we should be able to quickly fix issues.
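To illustrate the separator concern with purely hypothetical values: with `-` as the field separator, two distinct (digest name, shard size) pairs can render to the same string.

```python
# Hypothetical values showing how "-" as a field separator can be ambiguous.
name_a = f"file-{'sha256-128'}-{1000}"    # digest name "sha256-128", shard size 1000
name_b = f"file-{'sha256'}-{'128-1000'}"  # digest name "sha256", trailing "128-1000"
assert name_a == name_b == "file-sha256-128-1000"
```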

@mihaimaruseac mihaimaruseac merged commit 24d68a4 into sigstore:main Jun 6, 2024
18 checks passed
@mihaimaruseac mihaimaruseac deleted the api-manifest branch June 6, 2024 18:01