relationship with zarr #742
Comments
Another problem: zarr-python tests that everything is pickleable but numcodecs doesn't, and as a result numcodecs codecs are not pickleable: #744
I think the main purpose for numcodecs should be to provide built wheels for zarr-python. With that we would pin the numcodecs version in a zarr-python release and zarr-python would be responsible to adapt the metadata for all codecs. We could also consider going one step further and fully integrating numcodecs into zarr-python.
You probably don't want to complicate the build and install process for zarr-python, though
We might be able to alleviate that with conditional building and caching in the CI.
I think we need to make a formal decision on the relationship between numcodecs and zarr. This is kind of an existential issue for this repo and zarr-python, so it would be good to have a robust discussion here.
today's problems
here are some of the problems we have today:
typesize
declared with constructor forBlosc
#713)zarr-python
, which is really bad.
Prior to zarr v3, numcodecs de facto defined the specs for the codecs zarr used. Numcodecs defines a dict serialization for codecs that matches the requirement of the zarr v2 spec: codecs take the form of a dict with an "id" field, which is a string identifying the codec, and any number of additional fields that parametrize the codec.
This is problematic because zarr-python uses this dict serialization to write codecs to JSON, but zarr-python does not check that the dict form of a codec is spec-compliant (because there is no spec). So zarr-python will silently propagate broken zarr metadata if numcodecs changes (this happened in zarr-developers/zarr-python#713 and zarr-developers/zarr-python#519).
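To make the missing check concrete, here is a minimal sketch of what validating a v2-style codec dict could look like on the zarr-python side. The table `EXPECTED_V2_KEYS` and the function `check_v2_codec_config` are hypothetical names invented for illustration; neither exists in zarr-python or numcodecs, and the expected key set for gzip is taken only from the gzip example in this issue.

```python
# Hypothetical table: the keys a zarr-aware application expects for each
# codec id. In this sketch it is maintained by the application, not numcodecs.
EXPECTED_V2_KEYS = {"gzip": {"id", "level"}}


def check_v2_codec_config(config: dict) -> dict:
    """Return config unchanged if it has the expected v2 shape, else raise."""
    codec_id = config.get("id")
    if not isinstance(codec_id, str):
        raise ValueError("codec config must have a string 'id' field")
    expected = EXPECTED_V2_KEYS.get(codec_id)
    # Unknown codecs pass through; known codecs must match the pinned key set.
    if expected is not None and set(config) != expected:
        raise ValueError(
            f"codec {codec_id!r}: expected keys {sorted(expected)}, "
            f"got {sorted(config)}"
        )
    return config
```

With a check like this, a new constructor argument leaking into the serialized dict would raise an error instead of being written to metadata.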
zarr v3 revealed another problem with this relationship: zarr v3 defined a new JSON serialization scheme for codecs, which is incompatible with the zarr v2 scheme (i.e., the scheme used by numcodecs). But the same logical codec (e.g., gzip) can be used in zarr v2 or zarr v3 metadata. So, there must be an abstraction somewhere that can represent the fact that gzip compression is expressed as {"id": "gzip", "level": 1} in zarr v2, but {"name": "gzip", "configuration": {"level": 1}} in zarr v3. The current solution is to dynamically create zarr v3-compatible codec classes from existing zarr v2-compatible codecs. This has been serviceable so far, but long term we have to do something better than this. I don't see a future for dynamically creating a new gzip codec just to support a different dict serialization, and I don't see a future for numcodecs depending on zarr-python (which depends on numcodecs).
some ideas
Broadly speaking I think we should remove the explicit and implicit zarr dependency from numcodecs. This would result in making numcodecs independent of zarr v2 or v3. Instead, numcodecs codecs should define all the information necessary for a zarr-aware application like zarr-python to wrap numcodecs classes in zarr-specific classes.
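As a rough illustration of that split, the sketch below treats a codec as just a name plus a parameter dict and leaves both zarr serializations to the zarr-aware side. The function names (`to_v2`, `to_v3`, `v2_to_v3`, `v3_to_v2`) are hypothetical, and the sketch assumes the codec name and parameters carry over unchanged between the two formats, which may not hold for every codec.

```python
def to_v2(name: str, params: dict) -> dict:
    """Serialize in the zarr v2 style: an 'id' field plus flat parameters."""
    return {"id": name, **params}


def to_v3(name: str, params: dict) -> dict:
    """Serialize in the zarr v3 style: 'name' plus a nested 'configuration'."""
    return {"name": name, "configuration": dict(params)}


def v2_to_v3(config: dict) -> dict:
    """Convert a v2-style codec dict to the v3 style (assumes a 1:1 mapping)."""
    params = {k: v for k, v in config.items() if k != "id"}
    return to_v3(config["id"], params)


def v3_to_v2(config: dict) -> dict:
    """Convert a v3-style codec dict to the v2 style (assumes a 1:1 mapping)."""
    return to_v2(config["name"], config.get("configuration", {}))


print(v2_to_v3({"id": "gzip", "level": 1}))
```

In this arrangement the codec library would not need to know about either format; only the wrapper layer does.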
I see two ways this could manifest:
Maybe other people disagree with these ideas. If so, we should discuss here, but we should also consider higher-bandwidth communication like a community meeting to sort this out. I really feel like the current zarr / numcodecs relationship is untenable long term. It would be great to quickly define a new strategy that can work for as many people as possible.
cc @zarr-developers/python-core-devs @LDeakin @jbms