relationship with zarr #742

d-v-b · 2025-04-22T14:21:57Z

I think we need to make a formal decision on the relationship between numcodecs and zarr. This is kind of an existential issue for this repo and zarr-python, so it would be good to have a robust discussion here.

today's problems

here are some of the problems we have today:

changes to numcodecs can silently introduce compatibility problems for zarr implementations (see "(feat): typesize declared with constructor for Blosc", #713)
having separate zarr v2 and zarr v3 codecs is confusing to users.
this project has a circular dependency with zarr-python, which is really bad.

Prior to zarr v3, numcodecs de facto defined the specs for the codecs zarr used. Numcodecs defines a dict serialization for codecs that matches the requirement of the zarr v2 spec: codecs take the form of a dict with an "id" field, which is a string identifying the codec, and any number of additional fields that parametrize the codec.

This is problematic because zarr-python uses this dict serialization to write codecs to JSON, but zarr-python does not check that the dict form of a codec is spec-compliant (because there is no spec). So zarr-python will silently propagate broken zarr metadata if numcodecs changes (this happened in zarr-developers/zarr-python#713 and zarr-developers/zarr-python#519).
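To make the missing check concrete, here is a minimal sketch of the kind of validation a zarr implementation could run before writing codec metadata. Everything in it is an assumption for illustration: `EXPECTED_V2_FIELDS` and `check_v2_codec_config` are hypothetical names, not part of zarr-python or numcodecs, and the field table is not a spec.

```python
# Hypothetical table: the fields an implementation promises for each codec id.
# The single entry here is illustrative only.
EXPECTED_V2_FIELDS = {
    "gzip": {"id", "level"},
}


def check_v2_codec_config(config: dict) -> None:
    """Raise ValueError if a v2-style codec dict has missing or extra fields."""
    codec_id = config.get("id")
    if codec_id not in EXPECTED_V2_FIELDS:
        raise ValueError(f"unknown codec id: {codec_id!r}")
    expected = EXPECTED_V2_FIELDS[codec_id]
    actual = set(config)
    if actual != expected:
        missing = sorted(expected - actual)
        extra = sorted(actual - expected)
        raise ValueError(
            f"codec {codec_id!r} config drifted: missing={missing}, extra={extra}"
        )
```

With a check like this, a new field added upstream would surface as a loud error at write time instead of ending up in stored metadata unnoticed.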

zarr v3 revealed another problem with this relationship: zarr v3 defined a new JSON serialization scheme for codecs, which is incompatible with the zarr v2 scheme (i.e., the scheme used by numcodecs). But the same logical codec (e.g., gzip) can be used in zarr v2 or zarr v3 metadata. So, there must be an abstraction somewhere that can represent the fact that gzip compression is expressed as {"id": "gzip", "level": 1} in zarr v2, but {"name": "gzip", "configuration": {"level": 1}} in zarr v3. The current solution is to dynamically create zarr v3-compatible codec classes from existing zarr v2-compatible codecs. This has been serviceable so far, but long term we have to do something better than this. I don't see a future for dynamically creating a new gzip codec just to support a different dict serialization, and I don't see a future for numcodecs depending on zarr-python (which depends on numcodecs).
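The two serializations of the gzip example can be related by a pair of small functions. This is only a sketch: `v2_to_v3` and `v3_to_v2` are hypothetical helpers, and they assume the codec identifier and parameter names are identical in both schemes, which holds for the gzip example above but may not hold for every codec.

```python
def v2_to_v3(config: dict) -> dict:
    """Turn {"id": ..., **params} into {"name": ..., "configuration": params}."""
    params = {key: value for key, value in config.items() if key != "id"}
    return {"name": config["id"], "configuration": params}


def v3_to_v2(config: dict) -> dict:
    """Turn {"name": ..., "configuration": params} into {"id": ..., **params}."""
    return {"id": config["name"], **config.get("configuration", {})}


v2 = {"id": "gzip", "level": 1}
v3 = v2_to_v3(v2)  # {"name": "gzip", "configuration": {"level": 1}}
```

The point of the sketch is that the difference is purely one of dict layout, so it does not obviously require a second codec class.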

some ideas

Broadly speaking I think we should remove the explicit and implicit zarr dependency from numcodecs. This would result in making numcodecs independent of zarr v2 or v3. Instead, numcodecs codecs should define all the information necessary for a zarr-aware application like zarr-python to wrap numcodecs classes in zarr-specific classes.

I see two ways this could manifest:

For codecs with extensive support across multiple zarr implementations, like gzip or blosc, zarr-python should define its own JSON serialization that is consistent with what other implementations expect (for v2 and v3). If numcodecs pushes changes that break that expectation (like "Caterva inside Zarr" zarr-python#713 or "zarr slower than npy, hdf5 etc?" zarr-python#519), then zarr-python would report failures with that numcodecs version and we could address the problem quickly. In the status quo, these breaking changes pass through zarr-python silently, which is not ideal.
For codecs that are defined in numcodecs but not used across multiple zarr implementations, zarr-python could automatically wrap these codecs with a zarr-python-specific codec class.
We define a codec protocol that the codecs in numcodecs implement, but which can be implemented by other codecs. zarr-python should have routines for generating zarr v2 and v3-compatible metadata from any codec that implements this protocol. Ultimately this can result in relaxing the numcodecs requirement for zarr-python.
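As a rough illustration of the protocol idea, the sketch below defines a structural codec interface and two metadata routines that work on anything implementing it. All names (`Codec`, `codec_id`, `get_params`, `to_v2_metadata`, `to_v3_metadata`, and the toy `Zlib` class) are hypothetical and chosen for this example; none of them is an existing numcodecs or zarr-python API.

```python
import zlib
from typing import Any, Protocol


class Codec(Protocol):
    """Hypothetical protocol: an identifier, parameters, and encode/decode."""

    codec_id: str

    def get_params(self) -> dict[str, Any]: ...
    def encode(self, buf: bytes) -> bytes: ...
    def decode(self, buf: bytes) -> bytes: ...


class Zlib:
    """Toy codec that satisfies the protocol without inheriting from anything."""

    codec_id = "zlib"

    def __init__(self, level: int = 1) -> None:
        self.level = level

    def get_params(self) -> dict[str, Any]:
        return {"level": self.level}

    def encode(self, buf: bytes) -> bytes:
        return zlib.compress(buf, self.level)

    def decode(self, buf: bytes) -> bytes:
        return zlib.decompress(buf)


def to_v2_metadata(codec: Codec) -> dict[str, Any]:
    # v2 layout: flat dict with an "id" field plus the parameters
    return {"id": codec.codec_id, **codec.get_params()}


def to_v3_metadata(codec: Codec) -> dict[str, Any]:
    # v3 layout: "name" plus a nested "configuration" dict
    return {"name": codec.codec_id, "configuration": codec.get_params()}
```

In this arrangement the codec knows nothing about zarr metadata versions; the zarr-aware side owns both serializations, so a codec from any package could be used as long as it exposes the same members.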

Maybe other people disagree with these ideas. If so, we should discuss here, but we should also consider higher-bandwidth communication like a community meeting to sort this out. I really feel like the current zarr / numcodecs relationship is untenable long term. It would be great to quickly define a new strategy that can work for as many people as possible.

cc @zarr-developers/python-core-devs @LDeakin @jbms

TomNicholas · 2025-04-23T19:07:38Z

Another problem: zarr-python tests that everything is pickleable but numcodecs doesn't, and as a result numcodecs codecs are not pickleable: #744
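A pickle round-trip check of the kind mentioned here can be very small. The helper below is a sketch with a hypothetical name (`survives_pickle`); the `describe` argument is whatever function summarizes the object's state for comparison.

```python
import pickle


def survives_pickle(obj, describe) -> bool:
    """Return True if obj round-trips through pickle with the same description."""
    try:
        restored = pickle.loads(pickle.dumps(obj))
    except Exception:
        # Anything that cannot be pickled or unpickled counts as a failure.
        return False
    return describe(restored) == describe(obj)
```

For a numcodecs codec, `describe` could be `lambda codec: codec.get_config()`, so the test compares the configuration before and after the round trip.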

normanrz · 2025-04-24T15:55:17Z

I think the main purpose of numcodecs should be to provide built wheels for zarr-python. With that, we would pin the numcodecs version in each zarr-python release, and zarr-python would be responsible for adapting the metadata for all codecs. We could also consider going one step further and fully integrating numcodecs into zarr-python.

martindurant · 2025-04-24T15:57:03Z

> We could also consider going one step further and fully integrating numcodecs into zarr-python.

You probably don't want to complicate the build and install process for zarr-python, though

normanrz · 2025-04-24T15:59:40Z

> > We could also consider going one step further and fully integrating numcodecs into zarr-python.
>
> You probably don't want to complicate the build and install process for zarr-python, though

We might be able to alleviate that with conditional building and caching in the CI.

normanrz mentioned this issue Apr 23, 2025: add numcodecs.zarr3.to_zarr3 method #741 (draft)

normanrz mentioned this issue Apr 24, 2025: adds codecs that numcodecs defines zarr-developers/zarr-extensions#2 (open)