Add components and flexibility pages #131

Open
wants to merge 29 commits into
base: main
Changes from 7 commits
Commits
53318b9
hagne description to point to components page, zarr-python, and the i…
TomNicholas Apr 4, 2025
d91c743
add compression as another key feature of zarr
TomNicholas Apr 4, 2025
237b674
describe abstract components
TomNicholas Apr 4, 2025
072d3c5
add section on concrete components
TomNicholas Apr 4, 2025
96cbd4d
add heading for section on flexibility
TomNicholas Apr 4, 2025
750e99c
make each sentence a new line
TomNicholas Apr 4, 2025
4a24927
add section on extensions
TomNicholas Apr 4, 2025
dfd10e4
add section on TensorStore
TomNicholas Apr 4, 2025
d6d0c13
add extensions, icechunk, and mongodb
TomNicholas Apr 4, 2025
0469920
NCZarr and Lindi
TomNicholas Apr 5, 2025
3ab1fbd
add virtualizarr
TomNicholas Apr 5, 2025
1bb6238
format onto one sentence per line
TomNicholas Apr 5, 2025
5a655e6
virtualizarr clarifications
TomNicholas Apr 5, 2025
13fc385
linebreak
TomNicholas Apr 5, 2025
3514d41
don't imply metadata are serialized as byte streams
TomNicholas Apr 5, 2025
8179105
add to sidebar and fix link
TomNicholas Apr 5, 2025
d16deab
fix some links
TomNicholas Apr 5, 2025
bda516f
redirection layer
TomNicholas Apr 5, 2025
6d73ece
specification->protocol
TomNicholas Apr 5, 2025
180b5ba
organize sidebar better
TomNicholas Apr 5, 2025
a1dcf26
create separate page to describe flexibility
TomNicholas Apr 5, 2025
1c05a9c
add types of flexibility
TomNicholas Apr 5, 2025
b487799
more links between pages
TomNicholas Apr 5, 2025
1f9613c
link to external example libraries
TomNicholas Apr 5, 2025
22df358
add flexibility as a feature
TomNicholas Apr 5, 2025
76cf0ef
add more applications
TomNicholas Apr 5, 2025
fbdfcda
split the applications and the features better
TomNicholas Apr 5, 2025
87cb05c
be consistent about bullet points
TomNicholas Apr 5, 2025
d499af0
Spelling
TomNicholas Apr 6, 2025
49 changes: 49 additions & 0 deletions components/index.md
@@ -0,0 +1,49 @@
---
layout: single
author_profile: false
title: Zarr Components
sidebar:
title: "Content"
nav: sidebar
---

Zarr consists of several components, both abstract and concrete.
These span both the physical storage layer and the conceptual structural layer.
Zarr-related projects all follow the Zarr specification (and hence data model), but otherwise may choose to implement other layers however they wish.

## Abstract components

These abstract components together describe what type of data can be stored in zarr, and how to store it, without assuming you are working in a particular programming language, or with a particular storage system.

**Specification**: All zarr-related projects obey the [Zarr Specification](https://zarr-specs.readthedocs.io/), which formally describes how to serialize and de-serialize array data and metadata as byte streams via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
Member

> and metadata as byte streams

small nit: the spec doesn't say the metadata has to be serialized as bytes. (e.g. a memorystore or other database could keep the metadata in a dict-like object)

Member Author

Should be addressed by 3514d41

A system of [Codecs](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#chunk-encoding) is used to describe the encoding and serialization steps.

I would try to distinguish how metadata documents are stored vs how chunk data is stored. For example, it's significant that the compressor / filters (v2) and codecs (v3) define the encoding of chunk data, not metadata documents.

Member Author

My wording was intended to make that distinction already, because Joe said the same thing in an earlier comment. Clearly I need to distinguish them better though.


I think the prose only needs a minor adjustment, since in the previous section you distinguish array data and metadata. It might be sufficient to just disambiguate what exactly is encoded and serialized by the codecs (i.e., the chunks of an array).
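
To illustrate the distinction discussed in this thread, here is a minimal Python sketch in which only chunk data passes through a codec pipeline, while the metadata document is kept as plain JSON. It uses only the standard library. `zlib` stands in for whichever compressor an array might declare, and the function names and metadata fields are hypothetical simplifications, not the layout defined by the Zarr specification.

```python
import json
import struct
import zlib


def encode_chunk(values, level=1):
    """Encode one chunk: array values -> raw bytes -> compressed bytes."""
    raw = struct.pack(f"<{len(values)}d", *values)  # little-endian float64
    return zlib.compress(raw, level)


def decode_chunk(blob, count):
    """Reverse the pipeline: decompress, then unpack `count` float64 values."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{count}d", raw))


# The metadata document only *describes* the codecs; no codec is applied to it.
# These field names are illustrative assumptions, not the real schema.
metadata = {"shape": [4], "codecs": ["bytes", "zlib"]}
metadata_doc = json.dumps(metadata)

chunk = [0.0, 1.5, 2.5, 3.0]
blob = encode_chunk(chunk)
restored = decode_chunk(blob, len(chunk))
```

In this sketch the chunk round-trips through `encode_chunk` and `decode_chunk`, whereas `metadata_doc` could equally be held in a dict-like object and never serialized to bytes at all.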


**Data Model**: The specification's description of the [Stored Representation](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#stored-representation) implies a particular data model, based on the [HDF Abstract Data Model](https://support.hdfgroup.org/documentation/hdf5/latest/_h5_d_m__u_g.html).

i feel like it makes more sense to lead with the data model. the spec, i.e. the protocol, defines operations (create group, create array, write chunks to an array, etc) that only make sense in light of that particular data model.

Member Author

@TomNicholas TomNicholas Apr 5, 2025

> the spec, i.e. the protocol

I think I disagree that these are one and the same (see #131 (comment)), but otherwise agree with your suggestion here.


what's the difference between the contents of the zarr v2 / v3 specs and the zarr v2 / v3 protocols?

Member Author

See my long comment below: #131 (comment)

It consists of a hierarchical tree of groups and arrays, with optional arbitrary metadata at every node. This model is completely domain-agnostic.
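
As a rough illustration of this data model, the following Python sketch represents a hierarchy of groups and arrays, each carrying optional user attributes. The `Group` and `Array` classes and the `walk` helper are hypothetical names chosen for this example; they are not the classes provided by Zarr-Python.

```python
from dataclasses import dataclass, field


@dataclass
class Array:
    """Leaf node: an n-dimensional array with optional metadata."""
    shape: tuple
    dtype: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Group:
    """Interior node: holds named child groups and arrays."""
    attributes: dict = field(default_factory=dict)
    children: dict = field(default_factory=dict)

    def walk(self, prefix=""):
        """Yield (path, node) for this group and every descendant."""
        yield prefix or "/", self
        for name, child in self.children.items():
            path = f"{prefix}/{name}"
            if isinstance(child, Group):
                yield from child.walk(path)
            else:
                yield path, child


root = Group(
    attributes={"title": "demo"},
    children={
        "temperature": Array((10, 10), "float32", {"units": "K"}),
        "model": Group(children={"weights": Array((3,), "float64")}),
    },
)
paths = [path for path, _ in root.walk()]
```

Nothing in the structure refers to a scientific domain: the attributes are free-form dictionaries, which is where domain-specific conventions would live.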

**Format**: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format".
Member Author

@TomNicholas TomNicholas Apr 4, 2025

Is it okay for me to enshrine the name "Native Zarr Format" here?


what does "native" mean here?

Member Author

Following #131 (comment), the word "native" is perhaps redundant if we have a clear understanding of what "format" refers to.

Most, but not all, zarr implementations will serialize to this format.
Comment on lines +25 to +26
Member Author

I feel like this needs an explicit section in the specification, even if it's pretty trivial.

Member Author

Turns out it does (at least for filesystems - there's nothing for object storage). See #131 (comment) for more context.
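
The idea of mapping store keys unaltered to filesystem paths can be sketched in a few lines of Python. This is an illustrative example under stated assumptions, not code from any Zarr implementation: `write_store` and `read_store` are hypothetical helpers, and the byte values are placeholders rather than valid metadata documents or encoded chunks.

```python
import tempfile
from pathlib import Path


def write_store(root: Path, entries: dict) -> None:
    """Write each key/value pair, using the key unaltered as a relative path."""
    for key, value in entries.items():
        path = root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(value)


def read_store(root: Path) -> dict:
    """Recover the key/value mapping by walking the directory tree."""
    return {
        p.relative_to(root).as_posix(): p.read_bytes()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


# Placeholder contents; real values would be metadata documents and chunk bytes.
entries = {
    "zarr.json": b'{"node_type": "group"}',
    "temperature/zarr.json": b'{"node_type": "array"}',
    "temperature/c/0/0": b"\x00\x01",
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_store(root, entries)
    round_tripped = read_store(root)
```

Because the mapping is the identity, the directory listing and the set of store keys are the same thing, which is what lets other tools read the files directly without going through the library that wrote them.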


**Extensions**: Zarr provides a core set of generally-useful features, but extensions to this core are encouraged. These might take the form of domain-specific [metadata conventions](https://zarr.dev/conventions/), new codecs, or additions to the data model via [extension points](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points). These can be enforced by implementations or client libraries however they like, but generally should be opt-in.

## Concrete components

Concrete implementations of the abstract components can be implemented in any language.
The canonical reference implementation is [Zarr-Python](https://github.com/zarr-developers/zarr-python), but there are many [other implementations](https://zarr.dev/implementations/).
Zarr-Python contains reference examples of useful constructs that can be re-implemented in other languages.

**Zarr-Python Abstract Base Classes**: Zarr-python's [`zarr.abc`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains abstract base classes enforcing a particular python realization of the specification's Key-Value Store interface, based on a `MutableMapping`-like API.
This component is concrete in the sense that it is implemented in a specific programming language, and enforces particular syntax for getting and setting values in a key-value store.
Member Author

Feels weird to have "abstract" base classes in the "concrete" section, but I think jumping back and forth between talking about zarr-python and language-agnostic concepts would be more confusing.
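
To show how an abstract base class can be "concrete" in the sense described above, here is a minimal sketch of a language-specific realization of a key-value store interface. The `KeyValueStore` and `MemoryStore` classes and their method names are hypothetical and deliberately simplified; they are not the actual classes or signatures in `zarr.abc`.

```python
from abc import ABC, abstractmethod
from typing import Iterator, Optional


class KeyValueStore(ABC):
    """Hypothetical store interface; NOT the real zarr.abc classes."""

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...

    @abstractmethod
    def set(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...

    @abstractmethod
    def list_prefix(self, prefix: str) -> Iterator[str]: ...


class MemoryStore(KeyValueStore):
    """One storage backend: a plain dict held in memory."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = bytes(value)

    def delete(self, key):
        self._data.pop(key, None)

    def list_prefix(self, prefix):
        return iter(sorted(k for k in self._data if k.startswith(prefix)))


store = MemoryStore()
store.set("zarr.json", b"{}")
store.set("a/c/0", b"\x01")
store.set("a/c/1", b"\x02")
listed = list(store.list_prefix("a/"))
store.delete("a/c/1")
```

The base class fixes the Python syntax for getting and setting values, while each subclass decides where the bytes actually live. Code written against `KeyValueStore` would work with any such backend.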


**Zarr-Python Store Implementations**: Zarr-python's [`zarr.storage`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains concrete implementations of the `Store` ABC for interacting with particular storage systems, such as a local filesystem or object storage. These write data in the Native Zarr Format.
It's expected that most users of zarr will just use one of these implementations.

**Zarr-Python User API**: Zarr-python's [`zarr.api`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains functions and classes for interacting with any concrete implementation of the `zarr.abc.Store` interface.
This allows user applications to use a standard zarr API to read and write from a variety of common storage systems.

## Component Flexibility

One of Zarr's greatest strengths is its flexibility.
Here are a few interesting zarr-related projects, with descriptions of how they do or don't make use of different zarr components.

TODO
13 changes: 4 additions & 9 deletions index.md
@@ -32,15 +32,9 @@ can be represented as a key-value store, including most commonly POSIX file
systems and cloud object storage but also zip files as well as relational and
document databases.

See the following GitHub repositories for more information:

* [Zarr Python](https://github.com/zarr-developers/zarr)
* [Zarr Specs](https://github.com/zarr-developers/zarr-specs)
* [Numcodecs](https://github.com/zarr-developers/numcodecs)
* [Z5](https://github.com/constantinpape/z5)
* [N5](https://github.com/saalfeldlab/n5)
* [Zarr.jl](https://github.com/meggart/Zarr.jl)
* [ndarray.scala](https://github.com/lasersonlab/ndarray.scala)
Comment on lines -35 to -43
Member Author

I think it's deeply unhelpful to immediately point at specific implementations here as the source of further explanation. That's not what their docs are for!

For more details read about the various [Components of Zarr](https://zarr.dev/components/),
see the canonical [Zarr-Python](https://github.com/zarr-developers/zarr-python) implementation,
or look through [other Zarr implementations](https://zarr.dev/implementations/) for one in your preferred language.

## Applications

@@ -51,6 +45,7 @@ See the following GitHub repositories for more information:
## Features

* Chunk multi-dimensional arrays along any dimension.
* Compress array chunks via an extensible system of compressors.
Member Author

Seemed like an important omission.

* Store arrays in memory, on disk, inside a Zip file, on S3, etc.
* Read and write arrays concurrently from multiple threads or processes.
* Organize arrays into hierarchies via annotatable groups.