What is your issue?
tl;dr: Could we factor out all of xarray's lazy indexing + backends as fully-featured virtual lazy zarr arrays?
When you do xr.open_dataset, a few main things happen:
1. the data on disk is examined and a lazy representation built (which knows the data's shape and dtype),
2. decoding steps (following CF conventions) are set up, ready to happen upon materialization of bytes,
3. materialization of bytes is delayed by xarray's intermediate lazy indexing classes, which build a representation of successive slicing operations.
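To make the "representation of successive slicing operations" concrete, here is a minimal 1-D sketch. It is not xarray's actual implementation: `RecordingSource` and `LazySlice` are hypothetical names, and the sketch only handles forward slices. It just shows that chained slices can be composed without touching the data, and that a single read happens at materialization.

```python
class RecordingSource:
    """Hypothetical 1-D data source that logs every read it serves."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = []  # every key that was actually fetched

    def __len__(self):
        return len(self._values)

    def __getitem__(self, key):
        self.reads.append(key)
        return self._values[key]


class LazySlice:
    """Hypothetical lazy wrapper: composes slices, reads only on materialize()."""

    def __init__(self, source, selected=None):
        self.source = source
        # A range records which source indices are still selected.
        self.selected = range(len(source)) if selected is None else selected

    def __len__(self):
        return len(self.selected)

    def __getitem__(self, key):
        if not isinstance(key, slice) or (key.step is not None and key.step <= 0):
            raise NotImplementedError("this sketch only composes forward slices")
        # Slicing a range yields another range, so no data is read here.
        return LazySlice(self.source, self.selected[key])

    def materialize(self):
        r = self.selected
        # One read against the source, using the composed slice.
        return self.source[slice(r.start, r.stop, r.step)]


source = RecordingSource(range(100))
lazy = LazySlice(source)[10:90][::5][2:6]
reads_before = list(source.reads)  # still empty: nothing fetched yet
values = lazy.materialize()
```

The same idea extends to N dimensions by keeping one range per axis.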
When you do virtualizarr.open_virtual_dataset then also:
4. a chunk-level metadata-only lazy representation of data on-disk is created (the "chunk Manifest" inside the ManifestArray), which also knows the shape and dtype.
In zarr-developers/zarr-specs#303 we've suggested that, instead of relying on various xarray backends, (1) and (2) could be handled by zarr + chunk manifests + CF-specific zarr codecs.
For (3), note that currently we have lazy indexing in Xarray but not lazy concatenation, and in VirtualiZarr we kind of have lazy chunk-level concatenation without lazy indexing.
(4) is currently implemented separately from zarr-python in virtualizarr, but also notice that a virtualizarr.ManifestArray has all the information needed to actually go fetch data - in other words it could be converted directly to an actual zarr.Array (mentioned by @ayushnag in zarr-developers/VirtualiZarr#124).
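As a rough illustration of why a manifest is enough to both concatenate lazily and fetch data, here is a sketch under stated assumptions. `ChunkRef`, `concat_manifests` and `read_chunk_bytes` are hypothetical names, not the VirtualiZarr or zarr-python API; the only assumptions are that each chunk is described by a path, byte offset and length, and that all chunks along the concatenation axis are full-size.

```python
import os
import tempfile
from typing import NamedTuple


class ChunkRef(NamedTuple):
    """Hypothetical manifest entry: where one chunk's bytes live."""
    path: str
    offset: int
    length: int


def grid_shape(manifest):
    """Number of chunks along each axis of the chunk grid."""
    ndim = len(next(iter(manifest)))
    return tuple(max(key[ax] for key in manifest) + 1 for ax in range(ndim))


def concat_manifests(first, second, axis):
    """Concatenate in chunk-space: only chunk keys move, no bytes are read."""
    shift = grid_shape(first)[axis]
    combined = dict(first)
    for key, ref in second.items():
        shifted = key[:axis] + (key[axis] + shift,) + key[axis + 1:]
        combined[shifted] = ref
    return combined


def read_chunk_bytes(manifest, key):
    """Fetch the (still encoded) bytes of one chunk from its byte range."""
    ref = manifest[key]
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)


with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.bin")
    path_b = os.path.join(tmp, "b.bin")
    with open(path_a, "wb") as f:
        f.write(b"HEADaaaabbbb")  # 4-byte header, then two 4-byte chunks
    with open(path_b, "wb") as f:
        f.write(b"HEADcccc")  # 4-byte header, then one 4-byte chunk

    manifest_a = {(0,): ChunkRef(path_a, 4, 4), (1,): ChunkRef(path_a, 8, 4)}
    manifest_b = {(0,): ChunkRef(path_b, 4, 4)}

    combined = concat_manifests(manifest_a, manifest_b, axis=0)
    last_chunk = read_chunk_bytes(combined, (2,))
```

Decoding those bytes (decompression, CF decoding) would be the job of the codec pipeline, which is exactly the part zarr already knows how to do.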
Imagine that we enabled the zarr.Array type (or some new VirtualZarrArray type) to do both indexing and concatenation lazily (proposed in zarr-developers/zarr-python#1603), and open netCDF / other files via the chunk manifest (see zarr-developers/zarr-specs#287). It could also write out just its metadata to disk via the chunk manifest ZEP. This would then:
- Basically replace the virtualizarr.ManifestArray,
- Be wrapped by Xarray to provide both the "universal reader" of zarr-developers/zarr-specs#303 (Zarr as a "universal reader" for netCDF etc., via new CF decoding codecs) and also lazy slicing & concatenation operations (see Lazy indexing arrays as a stand-alone package, #5081).
The result would be that xarray users would basically open data (netCDF or zarr or otherwise) and see VirtualZarrArrays wrapped by Xarray. They could then do lazy operations as they do now, and either load actual values via .compute or save only the lazy metadata representation to disk as a virtual zarr store (i.e. what virtualizarr does right now). The latter could be created by special serialization functions that understand how to translate a chain of lazy Zarr array operations into a valid metadata-only zarr-compliant format on-disk, or you could even imagine ds.to_zarr having a boolean virtual kwarg to cover both cases.
The lazy layer could either be implemented inside zarr or live on top of it and be importable from other packages (i.e. #5081, see also data-apis/array-api#777).
All together this would give you:
- Zarr arrays that can open and decode netCDF directly (a la zarr-developers/zarr-specs#303, Zarr as a "universal reader" for netCDF etc., via new CF decoding codecs)
- Lazy Zarr arrays even without Xarray
- Ability to save virtual datasets without needing a dedicated ManifestArray type (i.e. the lazy concatenation functionality of VirtualiZarr in zarr-python itself)
- Separation of the metadata-reading logic of kerchunk/VirtualiZarr from the lazy concatenation stuff, so VirtualiZarr gets demoted to just being a repository for readers for specific file formats and codecs for them.
- Complete separation of:
  - finding byte ranges from archival formats (VirtualiZarr / kerchunk readers for specific file formats),
  - reading bytes (zarr.Array),
  - decoding bytes following CF (new CF zarr codecs mentioned in zarr-developers/zarr-specs#303 and in Expose a public interface for CF encoding/decoding functions, #155),
  - lazy operations (new lazy operations package),
  - handling of named variables / dimensions (Xarray),
  - serialization to metadata-only virtual Zarr store (ds.to_zarr(path, virtual=True) calling VirtualZarrArray).
The main subtlety I see here is selection in index-space vs chunk-space - xarray does the former but VirtualiZarr does the latter (see also zarr-developers/VirtualiZarr#183). This is what @d-v-d was getting at in zarr-developers/VirtualiZarr#71.
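One way to see the index-space vs chunk-space distinction is the sketch below. The helper names are hypothetical, and it assumes a 1-D array with regular chunks. A selection that starts and ends on chunk boundaries can be expressed purely by keeping or dropping whole manifest entries; any other selection still touches whole chunks but needs an extra in-chunk slice, which a manifest of whole-chunk references cannot record by itself.

```python
def chunks_touched(start, stop, chunk_len):
    """Half-open range of chunk indices overlapping the index range [start, stop)."""
    if chunk_len <= 0 or not 0 <= start < stop:
        raise ValueError("expected chunk_len > 0 and 0 <= start < stop")
    first = start // chunk_len
    last_exclusive = -(-stop // chunk_len)  # ceiling division
    return range(first, last_exclusive)


def is_chunk_aligned(start, stop, chunk_len, array_len):
    """True if [start, stop) can be selected by picking whole chunks only."""
    starts_on_boundary = start % chunk_len == 0
    # The final chunk may be partial, so ending at the array's end also counts.
    ends_on_boundary = stop % chunk_len == 0 or stop == array_len
    return starts_on_boundary and ends_on_boundary


array_len, chunk_len = 100, 10

# arr[20:50] maps cleanly onto chunks 2, 3 and 4.
aligned = is_chunk_aligned(20, 50, chunk_len, array_len)

# arr[25:50] touches the same chunks but starts partway through chunk 2.
unaligned = is_chunk_aligned(25, 50, chunk_len, array_len)
touched = list(chunks_touched(25, 50, chunk_len))
```

A lazy layer that works in index-space would need to carry that residual in-chunk slice alongside the manifest, or refuse to serialize selections that are not chunk-aligned.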
Whilst this is a longer-term roadmap idea, now is the time to think about it because of the malleability of zarr-python right now (e.g. zarr-developers/zarr-python#2052).
cc @dcherian @jhamman @joshmoore @sharkinsspatial @abarciauskas-bgse