-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add API for sharded serialization to a digest #198
Conversation
ec380f1
to
ad549a8
Compare
Lint would be fixed after #199 |
236f0c2
to
f4a4b32
Compare
This is what used to be `serialize_v1`. Additionally, in this change we rename `serializing` to `serialization` to be gramatically correct. We expose `shard_size` and a new `digest_size` method from all hashing engines. We also make imports be more consistent. Signed-off-by: Mihai Maruseac <[email protected]>
Signed-off-by: Mihai Maruseac <[email protected]>
Signed-off-by: Mihai Maruseac <[email protected]>
Signed-off-by: Mihai Maruseac <[email protected]>
e96ce24
to
76c2836
Compare
digest_len = self._merge_hasher.digest_size | ||
digests_buffer = bytearray(len(tasks) * digest_len) | ||
|
||
with concurrent.futures.ThreadPoolExecutor( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to use Thread or process pool executor? If the tasks are CPU bound, then thread won't help iiuc, because of GIL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot use ProcessPoolExecutor
in a library. It's supposed to have visibility into __main__
.
Also, here we are actually better with using threads, since the computation is IO bound, not CPU. Each file read releases the GIL.
I'll run a benchmark with both, of course.
# Note: no "." at end here, it will be added by `join` on return. | ||
encoded_range = b"" | ||
|
||
return b".".join([encoded_type, encoded_name, encoded_range]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will output type.name.xxx-yyy
. I think we need an additional .
at the end of the string, otherwise it's malleable type.name.xxx-yyy + content == type.name.xxx-y + yycontent
. I think we're doing plain concatenation of header and content. Lmk if that's not the case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm adding a dot at the end in line 65 ({start}-{end}.
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be more readable to add to the return
line... but does not work with the empty encoded_range :/
@@ -219,4 +225,4 @@ def compute(self) -> hashing.Digest: | |||
def digest_name(self) -> str: | |||
if self._digest_name_override is not None: | |||
return self._digest_name_override | |||
return f"file-{self._content_hasher.digest_name}-{self._shard_size}" | |||
return f"file-{self._content_hasher.digest_name}-{self.shard_size}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: here we have to be careful that digest name do not contain the character -
. Either we add a check somewhere, or we use a different character, like $
, |
. Maybe you're thinking of another PR to customize this value anyway, since name-shard_size
could lead to confusion anyway since the sharding "algo" is not specified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, we could standardize on using .
everywhere between the fields.
So far, we control all the components so there's no risk of confusion. I'm pushing testing /more customization for this after we have most of the API in place (at least after we have the full manifest format), as we should be able to quickly fix issues.
Signed-off-by: Mihai Maruseac <[email protected]>
# Note: no "." at end here, it will be added by `join` on return. | ||
encoded_range = b"" | ||
|
||
return b".".join([encoded_type, encoded_name, encoded_range]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be more readable to add to the return
line... but does not work with the empty encoded_range :/
Summary
This is what used to be
serialize_v1
.Additionally, in this change we rename
serializing
toserialization
to be gramatically correct. We exposeshard_size
and a newdigest_size
method from all hashing engines. We also make imports be more consistent.Release Note
NONE
Documentation
NONE