Add components and flexibility pages #131
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
--- | ||
layout: single | ||
author_profile: false | ||
title: Zarr Components | ||
sidebar: | ||
title: "Content" | ||
nav: sidebar | ||
--- | ||
|
||
Zarr consists of several components, both abstract and concrete. | ||
These span both the physical storage layer and the conceptual structural layer. | ||
Zarr-related projects all follow the Zarr specification (and hence data model), but otherwise may choose to implement other layers however they wish. | ||
|
||
## Abstract components | ||
|
||
These abstract components together describe what type of data can be stored in zarr, and how to store it, without assuming you are working in a particular programming language, or with a particular storage system. | ||
|
||
**Specification**: All zarr-related projects obey the [Zarr Specification](https://zarr-specs.readthedocs.io/), which formally describes how to serialize and de-serialize array data and metadata as byte streams via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface). | ||
A system of [Codecs](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#chunk-encoding) is used to describe the encoding and serialization steps. | ||
Review comment: I would try to distinguish how metadata documents are stored from how chunk data is stored. For example, it's significant that the compressor / filters (v2) and codecs (v3) define the encoding of chunk data, not metadata documents.
Reply: My wording was intended to make that distinction already, because Joe said the same thing in an earlier comment. Clearly I need to distinguish them better, though.
Reply: I think the prose only needs a minor adjustment, since in the previous section you distinguish array data and metadata. It might be sufficient to just disambiguate what exactly is encoded and serialized by the codecs (i.e., the chunks of an array). |
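The distinction raised in this thread can be illustrated with a minimal sketch. It assumes a plain `dict` as a stand-in for the abstract key-value store and uses `zlib` as a stand-in codec; `encode_chunk` and `decode_chunk` are hypothetical helpers, and the metadata document is heavily abbreviated rather than spec-conformant. Only the chunk passes through the codec steps, while the metadata document is written as JSON (serializing it to bytes here is a choice of this sketch, not a requirement).

```python
import json
import zlib
from array import array

# Hypothetical stand-in for the abstract key-value store: string keys, bytes values.
store: dict[str, bytes] = {}


def encode_chunk(values: list[int]) -> bytes:
    """Encode one chunk: array -> raw bytes -> compressed bytes."""
    raw = array("i", values).tobytes()  # "array to bytes" step
    return zlib.compress(raw)  # "bytes to bytes" step


def decode_chunk(blob: bytes) -> list[int]:
    """Reverse the encoding steps in the opposite order."""
    out = array("i")
    out.frombytes(zlib.decompress(blob))
    return out.tolist()


# Metadata document: plain JSON, no codecs involved (abbreviated for illustration).
store["zarr.json"] = json.dumps(
    {"zarr_format": 3, "node_type": "array", "shape": [4]}
).encode()

# Chunk data: the only thing the codecs touch.
store["c/0"] = encode_chunk([1, 2, 3, 4])
```

Reading back is the mirror image: `json.loads(store["zarr.json"])` recovers the metadata directly, whereas `decode_chunk(store["c/0"])` has to undo the codec steps.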
||
|
||
**Data Model**: The specification's description of the [Stored Representation](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#stored-representation) implies a particular data model, based on the [HDF Abstract Data Model](https://support.hdfgroup.org/documentation/hdf5/latest/_h5_d_m__u_g.html). | ||
Review comment: I feel like it makes more sense to lead with the data model. The spec, i.e. the protocol, defines operations (create group, create array, write chunks to an array, etc.) that only make sense in light of that particular data model.
Reply: I think I disagree that these are one and the same (see #131 (comment)), but otherwise agree with your suggestion here.
Reply: What's the difference between the contents of the zarr v2 / v3 specs and the zarr v2 / v3 protocols?
Reply: See my long comment below: #131 (comment) |
||
It consists of a hierarchical tree of groups and arrays, with optional arbitrary metadata at every node. This model is completely domain-agnostic. | ||
TomNicholas marked this conversation as resolved.
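As a rough illustration of the data model described in this hunk, the tree of groups and arrays can be sketched with two dataclasses. The names `GroupNode`, `ArrayNode`, and `walk` are hypothetical and do not come from any Zarr implementation; the sketch only shows that groups contain children while every node may carry arbitrary attributes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, Tuple, Union


@dataclass
class ArrayNode:
    """Leaf node: an n-dimensional array described by its shape and data type."""
    shape: Tuple[int, ...]
    dtype: str
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class GroupNode:
    """Interior node: holds named children, which are groups or arrays."""
    children: Dict[str, Union["GroupNode", ArrayNode]] = field(default_factory=dict)
    attributes: Dict[str, Any] = field(default_factory=dict)


def walk(node: Union[GroupNode, ArrayNode], path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, node) pairs depth-first, starting at the root."""
    yield path or "/", node
    if isinstance(node, GroupNode):
        for name, child in node.children.items():
            yield from walk(child, f"{path}/{name}")


root = GroupNode(
    attributes={"title": "demo"},
    children={
        "temperature": ArrayNode((10, 10), "float32", {"units": "K"}),
        "sub": GroupNode(),
    },
)
paths = [p for p, _ in walk(root)]
```

Nothing in the sketch refers to a scientific domain: the attributes are free-form dictionaries, which is what makes the model domain-agnostic.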
|
||
|
||
**Format**: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format". | ||
Review comment: Is it okay for me to enshrine the name "Native Zarr Format" here?
Reply: What does "native" mean here?
Reply: Following #131 (comment), the word "native" is perhaps redundant if we have a clear understanding of what "format" refers to. |
||
Most, but not all, zarr implementations will serialize to this format. | ||
Comment on lines +25 to +26
Review comment: I feel like this needs an explicit section in the specification, even if it's pretty trivial.
Reply: Turns out it does (at least for filesystems - there's nothing for object storage). See #131 (comment) for more context. |
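The "keys mapped unaltered to paths" idea discussed above can be sketched as follows, assuming a local POSIX-style filesystem. `write_store_to_directory` and `read_store_from_directory` are hypothetical helper names, and the example keys and values are placeholders rather than valid Zarr content.

```python
import tempfile
from pathlib import Path


def write_store_to_directory(store: dict[str, bytes], root: Path) -> None:
    """Write each key-value pair to a file whose relative path is the key itself."""
    for key, value in store.items():
        target = root / key  # the key is used unaltered as a relative path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(value)


def read_store_from_directory(root: Path) -> dict[str, bytes]:
    """Recover the key-value pairs by treating relative file paths as keys."""
    return {
        p.relative_to(root).as_posix(): p.read_bytes()
        for p in root.rglob("*")
        if p.is_file()
    }


# Placeholder contents; real stores would hold metadata documents and encoded chunks.
example = {"zarr.json": b"{}", "c/0/0": b"\x00\x01"}

with tempfile.TemporaryDirectory() as tmp:
    write_store_to_directory(example, Path(tmp))
    round_tripped = read_store_from_directory(Path(tmp))
```

Because the mapping is the identity, any tool that can list and read files under the root directory can reconstruct the store without extra bookkeeping. An object-storage variant would use the key as an object name under a prefix instead of a file path.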
||
|
||
**Extensions**: Zarr provides a core set of generally-useful features, but extensions to this core are encouraged. These might take the form of domain-specific [metadata conventions](https://zarr.dev/conventions/), new codecs, or additions to the data model via [extension points](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points). These can be enforced by implementations or client libraries however they like, but generally should be opt-in. | ||
|
||
## Concrete components | ||
|
||
The abstract components can be implemented in any language. | ||
The canonical reference implementation is [Zarr-Python](https://github.com/zarr-developers/zarr-python), but there are many [other implementations](https://zarr.dev/implementations/). | ||
Zarr-Python contains reference examples of useful constructs that can be re-implemented in other languages. | ||
|
||
**Zarr-Python Abstract Base Classes**: Zarr-python's [`zarr.abc`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains abstract base classes enforcing a particular Python realization of the specification's Key-Value Store interface, based on a `MutableMapping`-like API. | ||
This component is concrete in the sense that it is implemented in a specific programming language, and enforces particular syntax for getting and setting values in a key-value store. | ||
Review comment: It feels weird to have "abstract" base classes in the "concrete" section, but I think jumping back and forth between talking about zarr-python and language-agnostic concepts would be more confusing. |
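To make the "abstract but concrete" point tangible, here is a minimal sketch built on the standard library's `collections.abc.MutableMapping`. It is not the `zarr.abc` API; `DictBackedStore` is a hypothetical class that only shows how an abstract base class fixes the syntax for getting and setting values while leaving the storage backend open.

```python
from collections.abc import MutableMapping
from typing import Dict, Iterator


class DictBackedStore(MutableMapping):
    """Hypothetical in-memory store: the ABC dictates the interface, this class the backend."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def __getitem__(self, key: str) -> bytes:
        return self._data[key]

    def __setitem__(self, key: str, value: bytes) -> None:
        if not isinstance(value, bytes):
            raise TypeError("store values must be bytes")
        self._data[key] = value

    def __delitem__(self, key: str) -> None:
        del self._data[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)


store = DictBackedStore()
store["zarr.json"] = b"{}"
store["c/0"] = b"\x00\x01"
missing = store.get("c/1")  # get() is a mixin method supplied by MutableMapping
```

A subclass that omits any of the five abstract methods cannot be instantiated, which is the sense in which the base class enforces a particular syntax. A filesystem or object-storage backend would implement the same five methods against a different medium.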
||
|
||
**Zarr-Python Store Implementations**: Zarr-python's [`zarr.storage`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains concrete implementations of the `Store` ABC for interacting with particular storage systems, such as a local filesystem or object storage. These write data in the Native Zarr Format. | ||
It's expected that most users of zarr will just use one of these implementations. | ||
|
||
**Zarr-Python User API**: Zarr-python's [`zarr.api`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains functions and classes for interacting with any concrete implementation of the `zarr.abc.Store` interface. | ||
This allows user applications to use a standard zarr API to read and write from a variety of common storage systems. | ||
|
||
## Component Flexibility | ||
|
||
One of Zarr's greatest strengths is its flexibility. | ||
Here are a few interesting zarr-related projects, with descriptions of how they do or don't make use of different zarr components. | ||
TomNicholas marked this conversation as resolved.
|
||
|
||
TODO |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -32,15 +32,9 @@ can be represented as a key-value store, including most commonly POSIX file | |
systems and cloud object storage but also zip files as well as relational and | ||
document databases. | ||
|
||
See the following GitHub repositories for more information: | ||
|
||
* [Zarr Python](https://github.com/zarr-developers/zarr) | ||
* [Zarr Specs](https://github.com/zarr-developers/zarr-specs) | ||
* [Numcodecs](https://github.com/zarr-developers/numcodecs) | ||
* [Z5](https://github.com/constantinpape/z5) | ||
* [N5](https://github.com/saalfeldlab/n5) | ||
* [Zarr.jl](https://github.com/meggart/Zarr.jl) | ||
* [ndarray.scala](https://github.com/lasersonlab/ndarray.scala) | ||
Comment on lines -35 to -43
Review comment: I think it's deeply unhelpful to immediately point at specific implementations here as the source of further explanation. That's not what their docs are for! |
||
For more details, read about the various [Components of Zarr](https://zarr.dev/components/), | ||
see the canonical [Zarr-Python](https://github.com/zarr-developers/zarr-python) implementation, | ||
or look through [other Zarr implementations](https://zarr.dev/implementations/) for one in your preferred language. | ||
|
||
## Applications | ||
|
||
|
@@ -51,6 +45,7 @@ See the following GitHub repositories for more information: | |
## Features | ||
|
||
* Chunk multi-dimensional arrays along any dimension. | ||
* Compress array chunks via an extensible system of compressors. | ||
Review comment: Seemed like an important omission. |
||
* Store arrays in memory, on disk, inside a Zip file, on S3, etc. | ||
* Read and write arrays concurrently from multiple threads or processes. | ||
* Organize arrays into hierarchies via annotatable groups. | ||
|
Review comment: Small nit: the spec doesn't say the metadata has to be serialized as bytes (e.g. a memory store or other database could keep the metadata in a dict-like object).
Reply: Should be addressed by 3514d41