
relationship with zarr #742


Open
d-v-b opened this issue Apr 22, 2025 · 4 comments

Comments

@d-v-b
Contributor

d-v-b commented Apr 22, 2025

I think we need to make a formal decision on the relationship between numcodecs and zarr. This is kind of an existential issue for this repo and zarr-python, so it would be good to have a robust discussion here.

today's problems

here are some of the problems we have today:

  • changes to numcodecs can silently introduce compatibility problems for zarr implementations (see (feat): typesize declared with constructor for Blosc #713)
  • having separate zarr v2 and zarr v3 codecs is confusing to users.
  • this project has a circular dependency with zarr-python, which is really bad.

Prior to zarr v3, numcodecs de facto defined the specs for the codecs zarr used. Numcodecs defines a dict serialization for codecs that matches the requirement of the zarr v2 spec: codecs take the form of a dict with an "id" field, which is a string identifying the codec, and any number of additional fields that parametrize the codec.

This is problematic because zarr-python uses this dict serialization to write codecs to JSON, but zarr-python does not check that the dict form of a codec is spec-compliant (because there is no spec). So zarr-python will silently propagate broken zarr metadata if numcodecs changes (this happened in zarr-developers/zarr-python#713 and zarr-developers/zarr-python#519).
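To make the missing check concrete, here is a rough sketch of what validating a v2 codec dict could look like. Everything in it is hypothetical: `EXPECTED_V2_FIELDS` and `check_v2_codec_dict` are names made up for illustration, not zarr-python or numcodecs APIs, and the field sets are assumptions about what other implementations expect.

```python
# Hypothetical table of the fields a zarr implementation expects for each
# codec id. The field sets are assumptions made for this sketch.
EXPECTED_V2_FIELDS = {
    "gzip": {"id", "level"},
    "blosc": {"id", "cname", "clevel", "shuffle", "blocksize"},
}


def check_v2_codec_dict(config: dict) -> None:
    """Raise ValueError if a v2 codec dict drifts from the expected shape."""
    codec_id = config.get("id")
    if not isinstance(codec_id, str):
        raise ValueError("codec dict must have a string 'id' field")
    expected = EXPECTED_V2_FIELDS.get(codec_id)
    if expected is None:
        return  # unknown codec: nothing to compare against
    actual = set(config)
    if actual != expected:
        raise ValueError(
            f"codec {codec_id!r}: unexpected fields {sorted(actual - expected)}, "
            f"missing fields {sorted(expected - actual)}"
        )
```

With a check like this, a new field showing up in a codec's dict form would fail loudly at write time instead of ending up in stored metadata.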

zarr v3 revealed another problem with this relationship: zarr v3 defined a new JSON serialization scheme for codecs, which is incompatible with the zarr v2 scheme (i.e., the scheme used by numcodecs). But the same logical codec (e.g., gzip) can be used in zarr v2 or zarr v3 metadata. So, there must be an abstraction somewhere that can represent the fact that gzip compression is expressed as {"id": "gzip", "level": 1} in zarr v2, but {"name": "gzip", "configuration": {"level": 1}} in zarr v3. The current solution is to dynamically create zarr v3-compatible codec classes from existing zarr v2-compatible codecs. This has been serviceable so far, but long term we have to do something better than this. I don't see a future for dynamically creating a new gzip codec just to support a different dict serialization, and I don't see a future for numcodecs depending on zarr-python (which depends on numcodecs).
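For illustration, the two dict forms of the gzip example can be mapped onto each other with a pair of small functions. This is only a sketch: `v2_to_v3` and `v3_to_v2` are hypothetical names, and a pure reshaping like this assumes the parameters are identical in both formats, which may not hold for every codec.

```python
def v2_to_v3(config: dict) -> dict:
    """Reshape a v2-style codec dict into the v3 name/configuration form."""
    params = dict(config)  # copy so the caller's dict is not mutated
    name = params.pop("id")
    return {"name": name, "configuration": params}


def v3_to_v2(meta: dict) -> dict:
    """Reshape a v3-style codec dict into the flat v2 form."""
    return {"id": meta["name"], **meta.get("configuration", {})}


gzip_v2 = {"id": "gzip", "level": 1}
gzip_v3 = v2_to_v3(gzip_v2)
# gzip_v3 == {"name": "gzip", "configuration": {"level": 1}}
```

The point is that the codec itself does not change, only the serialization around it, so this mapping arguably belongs in the zarr-aware layer rather than in a dynamically generated codec class.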

some ideas

Broadly speaking I think we should remove the explicit and implicit zarr dependency from numcodecs. This would make numcodecs independent of both zarr v2 and zarr v3. Instead, numcodecs codecs should define all the information necessary for a zarr-aware application like zarr-python to wrap numcodecs classes in zarr-specific classes.

I see two ways this could manifest:

  • For codecs with extensive support across multiple zarr implementations, like gzip or blosc, zarr-python should define its own JSON serialization that is consistent with what other implementations expect (for v2 and v3). If numcodecs pushes changes that break that expectation (like Caterva inside Zarr zarr-python#713 or zarr slower than npy, hdf5 etc? zarr-python#519), then zarr-python would report failures with that numcodecs version and we could address the problem quickly. In the status quo, these breaking changes pass through zarr-python silently, which is not ideal.
  • For codecs that are defined in numcodecs, but not used across multiple zarr implementations, zarr-python could automatically wrap these codecs with a zarr-python-specific codec class.
  • We define a codec protocol that the codecs in numcodecs implement, but which can be implemented by other codecs. zarr-python should have routines for generating zarr v2 and v3-compatible metadata from any codec that implements this protocol. Ultimately this can result in relaxing the numcodecs requirement for zarr-python.
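As a rough sketch of the protocol idea, something like the following could work. All names here (`CodecLike`, `codec_id`, `get_params`, `to_v2_metadata`, `to_v3_metadata`, `GzipCodec`) are hypothetical and chosen for illustration; none of them are existing numcodecs or zarr-python APIs, and the example codec uses only the standard library.

```python
import gzip
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CodecLike(Protocol):
    """Hypothetical protocol: the codec knows nothing about zarr metadata."""

    codec_id: str

    def get_params(self) -> dict[str, Any]: ...
    def encode(self, buf: bytes) -> bytes: ...
    def decode(self, buf: bytes) -> bytes: ...


class GzipCodec:
    """Example codec satisfying the protocol without importing zarr or numcodecs."""

    codec_id = "gzip"

    def __init__(self, level: int = 1) -> None:
        self.level = level

    def get_params(self) -> dict[str, Any]:
        return {"level": self.level}

    def encode(self, buf: bytes) -> bytes:
        return gzip.compress(buf, compresslevel=self.level)

    def decode(self, buf: bytes) -> bytes:
        return gzip.decompress(buf)


# These routines would live in the zarr-aware layer, not in the codec library.
def to_v2_metadata(codec: CodecLike) -> dict[str, Any]:
    return {"id": codec.codec_id, **codec.get_params()}


def to_v3_metadata(codec: CodecLike) -> dict[str, Any]:
    return {"name": codec.codec_id, "configuration": codec.get_params()}
```

Under this design a third-party codec only has to expose an identifier and its parameters; whether that gets written as v2 or v3 metadata is decided entirely on the zarr side.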

Maybe other people disagree with these ideas. If so, we should discuss here, but we should also consider higher-bandwidth communication like a community meeting to sort this out. I really feel like the current zarr / numcodecs relationship is untenable long term. It would be great to quickly define a new strategy that can work for as many people as possible.

cc @zarr-developers/python-core-devs @LDeakin @jbms

@TomNicholas
Member

Another problem: zarr-python tests that everything is pickleable but numcodecs doesn't, and as a result numcodecs codecs are not pickleable: #744

@normanrz
Member

I think the main purpose of numcodecs should be to provide built wheels for zarr-python. With that, we would pin the numcodecs version in each zarr-python release, and zarr-python would be responsible for adapting the metadata for all codecs. We could also consider going one step further and fully integrating numcodecs into zarr-python.

@martindurant
Member

> We could also consider going one step further and fully integrating numcodecs into zarr-python.

You probably don't want to complicate the build and install process for zarr-python, though

@normanrz
Member

> > We could also consider going one step further and fully integrating numcodecs into zarr-python.
>
> You probably don't want to complicate the build and install process for zarr-python, though

We might be able to alleviate that with conditional building and caching in the CI.
