Description
Describe the issue linked to the documentation
tl;dr: A page is missing from the Zarr docs which distinguishes between the different layers of specs, APIs, ABCs, formats etc.
It took me a year of working closely with zarr (via VirtualiZarr) to fully understand that "Zarr" is a multi-layered thing, which you can opt in to or out of at many different levels.
What is zarr?
My current understanding is that "Zarr" encompasses all of these things:
1. A canonical on-disk file format, for both file and object storage, sometimes known as "native zarr",
2. A specification for how to serialize and de-serialize array data and metadata as byte streams to an arbitrary key-value store,
3. A python ABC for `Store` subclasses, which are key-value stores with a standardized API implementing the spec, but otherwise can do whatever they like behind the scenes, including not writing using the "native zarr" format,
4. A set of canonical python `Store` implementations, which generally do write using the "native zarr" format,
5. A python API for interacting with those `Store` subclasses, allowing python client code to treat many storage systems as interchangeable,
6. A set of informal extensions, metadata standards, and a nascent framework for formalizing the extensions.
I feel this nuance and hierarchy is not clearly documented anywhere.
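To make the layering concrete, here is a toy sketch using only the standard library. Everything in it (`ToyStore`, `ToyMemoryStore`, `ToyDirectoryStore`, `write_toy_array`, `read_toy_array`) is a hypothetical name invented for illustration, not zarr-python's actual interface, and the metadata document is heavily abbreviated compared with what the spec requires. The point is only that the same client code works against a store that produces a "native zarr"-like directory layout and one that never touches disk.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class ToyStore(ABC):
    """Hypothetical stand-in for layer (3): a minimal key-value interface."""

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def set(self, key: str, value: bytes) -> None: ...


class ToyMemoryStore(ToyStore):
    """Keeps everything in a dict, so no "native zarr" layout ever exists on disk."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str) -> bytes:
        return self._data[key]

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value


class ToyDirectoryStore(ToyStore):
    """Layer (4)-style store: each key becomes a file path, which yields layout (1)."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def get(self, key: str) -> bytes:
        return (self._root / key).read_bytes()

    def set(self, key: str, value: bytes) -> None:
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(value)


def write_toy_array(store: ToyStore, chunk: bytes) -> None:
    """Layer (5)-style client code: it only talks to the ToyStore interface."""
    # Layer (2): the spec decides which keys exist and what bytes they hold.
    # Real metadata has more required fields; this is abbreviated on purpose.
    metadata = {"zarr_format": 3, "node_type": "array", "shape": [4]}
    store.set("zarr.json", json.dumps(metadata).encode())
    store.set("c/0", chunk)


def read_toy_array(store: ToyStore):
    """Return the (metadata, chunk bytes) pair written by write_toy_array."""
    return json.loads(store.get("zarr.json")), store.get("c/0")


# The same client functions work with either store.
for store in (ToyMemoryStore(), ToyDirectoryStore(Path(tempfile.mkdtemp()))):
    write_toy_array(store, bytes([1, 2, 3, 4]))
    metadata, chunk = read_toy_array(store)
    assert metadata["shape"] == [4] and chunk == bytes([1, 2, 3, 4])
```

Only `ToyDirectoryStore` leaves behind files named `zarr.json` and `c/0`; the key names and byte contents (layer 2) are identical in both cases.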
"Zarr" projects and how they fit in
It matters because there are now many projects which opt in to some of these layers but not others. For example:
- The "zarr specification" is (2) and only (2); it doesn't actually touch on any of the other layers.
- Much of the data in the wild today follows (1), even though AFAIK that layout isn't actually formally described anywhere official?! It obeys (2), but we're just lucky that the mapping from file/object storage to a KV store is so obvious that it's still easy to write readers implemented in any language without a formal description of (1).
- Zarr-python
  - Zarr-python's `zarr.abc` provides (3),
  - Zarr-python's `zarr.storage` provides (4), which writes to local and object storage using (1) but without explicitly noting that,
  - Zarr-python's `zarr.api` provides (5), and can only interact with implementations of (3), such as (4),
- Zarr implementations in other languages (such as `zarr-js`) generally use (1) as their format, following (2), and take vague inspiration from (4) and (5).
  - Tensorstore is included in that category, as it uses (1) on disk.
- VirtualiZarr's new `ManifestStore` class is a concrete implementation of (3), but it eschews (1), with the aim of allowing access to an extensible set of non-zarr data formats on disk via (5).
- Icechunk
  - Icechunk's python API implements (3) so that it can be used with (5),
  - Icechunk's spec is an alternative format to (1), but still follows (2),
  - Icechunk's rust client obeys (2) but otherwise nothing else IIUC,
  - A non-python library binding to Icechunk's rust client (as `zarrs` has done) would follow (2) and the Icechunk spec, potentially with API inspiration from (3) and (5) but otherwise nothing else formal.
- Xarray is a key user of (5), but also quietly does a few things that fall under (6) (e.g. the `coordinates` attribute it adds),
- GeoZarr and other extension efforts are only about (6),
- OME-Zarr is possibly also just (6)?
- Consolidated metadata is an example of (6), but one that's supported by (3), (4), and (5) (but not necessarily by other `Store` implementations).
- Kerchunk has its own replacement for (1), but to read it you need to use the `FsspecStore` implementation that's part of (4).
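As a rough illustration of what "implements (3) but eschews (1)" can look like, here is a toy read-only store that answers chunk keys from byte ranges inside a non-zarr file. The class name `ToyManifestStore`, the made-up `legacy.bin` file, and the manifest structure of (path, offset, length) are all assumptions for this sketch; it is not VirtualiZarr's actual `ManifestStore` implementation, and the metadata is abbreviated.

```python
import json
import tempfile
from pathlib import Path


class ToyManifestStore:
    """Hypothetical read-only store: keys resolve to byte ranges in a foreign file."""

    def __init__(self, metadata: dict, manifest: dict) -> None:
        self._metadata = metadata
        self._manifest = manifest  # chunk key -> (path, offset, length)

    def get(self, key: str) -> bytes:
        if key == "zarr.json":
            # Metadata is synthesized in memory; no zarr.json file exists on disk.
            return json.dumps(self._metadata).encode()
        path, offset, length = self._manifest[key]
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)


# A made-up "legacy" file: an 8-byte header followed by two 4-byte chunks.
legacy_file = Path(tempfile.mkdtemp()) / "legacy.bin"
legacy_file.write_bytes(b"HEADER00" + b"AAAA" + b"BBBB")

store = ToyManifestStore(
    metadata={"zarr_format": 3, "node_type": "array", "shape": [8]},
    manifest={"c/0": (legacy_file, 8, 4), "c/1": (legacy_file, 12, 4)},
)

print(store.get("c/1"))  # b'BBBB'
```

Client code written against the key-value interface sees spec-shaped keys and bytes, while the data on disk stays in its original format.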
Current docs
The main Zarr homepage only says "Zarr is a community project to develop specifications and software for storage of large N-dimensional typed arrays", which is true but insufficient.
This is a problem because it leads to confusion as to what zarr "is" - for example many people understandably but mistakenly think that (1) is Zarr, and that (2) is the specification for this file format. It also makes it harder for potential contributors (e.g. @nenb) to place their ideas within the framework of the zarr project.
I think this separation of layers is awesome, I just wish I could have understood it a year earlier via reading the docs, instead of having to have it explained to me one-to-one by the likes of @d-v-b, @jhamman, and @rabernat.
cc also @maxrjones @paraseba
Suggested fix for documentation
We should have a new page on the main Zarr docs explaining these layers, and a page on the zarr-python docs explaining how it fits into this framework. Other projects such as Icechunk and VirtualiZarr can then more easily explain their relationship to Zarr.