Add best practices for STAC Zarr and N-Dimensional Arrays #29
best-practices-zarr-ndarray.md
Outdated
- Zarr v2: `"application/vnd+zarr; version=2"`
- Zarr v3: `"application/vnd+zarr; version=3"`
Should probably also add those in the table in asset and link best practices
We agreed to leave everything in the best-practices-zarr-ndarray.md for the time being until we consolidate the principles and then eventually move the sections to the appropriate other guides.
cc @m-mohr
Is the version documented in the official ZARR docs or did we invent the version parameter here/in STAC?
No reference found in the official Zarr docs, but the media type is already adopted in pystac (https://pystac.readthedocs.io/en/stable/api/media_type.html)
For background on the pystac definition: stac-utils/pystac#1546
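For illustration, a client could read the `version` parameter discussed above roughly as follows. This is a minimal sketch, not an official API: the helper names `parse_media_type` and `zarr_version` are hypothetical, and the `version` parameter itself is not registered with IANA, as noted in this thread.

```python
def parse_media_type(media_type):
    """Split a media type string into (base type, parameter dict)."""
    base, *params = [part.strip() for part in media_type.split(";")]
    parsed = {}
    for param in params:
        if "=" in param:
            key, value = param.split("=", 1)
            # Parameter names are case-insensitive; values may be quoted.
            parsed[key.strip().lower()] = value.strip().strip('"')
    return base.lower(), parsed


def zarr_version(media_type):
    """Return the Zarr version parameter as a string, or None if absent or not Zarr."""
    base, params = parse_media_type(media_type)
    if base != "application/vnd+zarr":
        return None
    return params.get("version")


print(zarr_version("application/vnd+zarr; version=3"))
```

Returning `None` when the parameter is missing leaves it to the client to decide how to handle assets that only declare `application/vnd+zarr`.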
florianziemen
left a comment
Thanks a lot for your efforts! The Climate and Weather example looks very good to me. I've suggested a few minor edits to match the original data and STAC specs a bit better.
Co-authored-by: Julia Signell <[email protected]>
Co-authored-by: Michele Claus <[email protected]>
Apart from specifying the href to the asset itself, the item should have information on how the asset paths are constructed, and the assets should have information about the "groups" that lead, via a standard template, to the assets themselves. Template information like this could be defined at the item level, and the assets should stick to it.
An example of an item self.href (link) for reference:
self.item.href = "https://objects.eodc.eu:443/e05ab01a9d56408d82ac32d69a5aae2a:202510-s02msil2a-eu/14/products/cpm_v256/S2C_MSIL2A_20251014T142151_N0511_R096_T25WET_20251014T161521.zarr"
Co-authored-by: Matthias Mohr <[email protected]>
1. **A Zarr asset SHALL reference a group containing one or more arrays or groups**

This is equivalent to an xarray Dataset or an xarray DataTree.
I would propose tagging such assets with the role `group` (or similar) to make it easier for clients to find them programmatically.
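To illustrate why such a role would help clients, here is a minimal sketch of a lookup over a STAC item held as a plain dict. The `group` role is only the proposal made in this comment, and the helper `assets_with_role`, the asset keys, and the hrefs are hypothetical placeholders.

```python
def assets_with_role(item, role):
    """Return the assets of a STAC item (as a dict) that carry the given role."""
    return {
        key: asset
        for key, asset in item.get("assets", {}).items()
        if role in asset.get("roles", [])
    }


# Hypothetical item fragment; hrefs and asset keys are placeholders.
example_item = {
    "assets": {
        "measurements": {
            "href": "https://example.com/product.zarr/measurements",
            "type": "application/vnd+zarr; version=3",
            "roles": ["data", "group"],
        },
        "thumbnail": {
            "href": "https://example.com/thumbnail.png",
            "type": "image/png",
            "roles": ["thumbnail"],
        },
    }
}

group_assets = assets_with_role(example_item, "group")
print(sorted(group_assets))
```

Without a dedicated role, a client would have to open each Zarr asset to learn whether it points at a group or a single array.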
…cube extension (#33) @emmanuelmathot this PR will merge the changes we discussed into your existing PR
@clausmichele @fabricebrito @Scartography Discussion for store moved here: radiantearth/stac-spec#1367
- The kerchunk reference file is considered as the data store and thus is referenced as a link with `rel: store`
- Assets include both the reference file and source data
- Role `"reference"` indicates virtual/indirect data access
- Role `"source"` indicates the underlying data files
I don't think these points match the example above, since the assets don't include a reference file. Should virtual Zarr files follow the same principle as normal Zarr files? That is: if they reference a group of arrays, they belong in the assets; if they are a Zarr store, they should be included in the links.
Link for the MIME type: stac-utils/pystac#1546
For the media type, there's no official registration at IANA yet. If OGC(?) would register it as part of the community standard process, we should probably make them aware that we'd appreciate a version parameter being registered.
best-practices-zarr-ndarray.md
Outdated
Individual arrays within the store SHOULD NOT be represented as separate assets.

The appropriate level depends on how users will access the data.
Weather and climate data typically have deep hierarchical nesting, with multiple layers of subgrouping. There is not really a single appropriate level for users as a whole.
This is something we already struggle with for representing datacube data in STAC, when we are trying to map STAC to our raw datacubes (non-Zarr). We proposed the linked templates extension, with the ability to apply this to child links, to handle this problem neatly:
stac-extensions/link-templates#1
I think the same issue is surfacing here (but instead of catalogs and items, it's assets and variables/bands). Flattening the multi-dimensional cube into a list of assets just doesn't do justice to the n-dimensional structure of a datacube. The size also becomes enormous: it can work for lower-dimensional or smaller datacubes, but not for larger ones.
Using the linkTemplate, as described further down this document, for individual arrays helps a little, but I think linkTemplating should be allowed at the asset/group level, then we could do something like:
"assets": {
  "forecast": {
    "type": "application/vnd+zarr; version=3",
    "title": "Ensemble Forecast for <date>",
    "linkTemplate": {
      "rel": "data",
      "title": "Forecast field",
      "uriTemplate": "s3://bucket/path/forecast.zarr/{ensemble_member}/{step}/",
      "variables": {
        "ensemble_member": {
          "description": "Index or identifier of ensemble member",
          "type": "string",
          "enum": [
            "1",
            "2",
            "3",
            ...
            "50"
          ]
        },
        "step": {
          "description": "Forecast lead time (e.g., 6h, 12h, 24h)",
          "type": "string",
          "enum": [
            "1",
            "2",
            "3",
            ...
            "360"
          ]
        }
      }
    }
  }
}
This only shows a 2D example, but it would generalise pretty well to any N-dimensional structure. Plain enums are used here, but you could also use more appropriate template schemas.
Within these groups we still have a large number of variables/"bands" which also benefit from link templates.
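As a rough sketch of how a client might resolve such an asset-level template, the snippet below expands the `uriTemplate` over every combination of the declared `enum` values. It assumes only simple `{name}` placeholders, and the function name `expand_link_template` and the shortened enum lists are hypothetical, chosen for illustration rather than taken from the link-templates extension.

```python
from itertools import product


def expand_link_template(link_template):
    """Expand simple {name} placeholders over all combinations of enum values."""
    variables = link_template["variables"]
    names = list(variables)
    uris = []
    for combo in product(*(variables[name]["enum"] for name in names)):
        uri = link_template["uriTemplate"]
        for name, value in zip(names, combo):
            uri = uri.replace("{" + name + "}", value)
        uris.append(uri)
    return uris


# Shortened, hypothetical enums; the structure follows the example above.
example_template = {
    "rel": "data",
    "uriTemplate": "s3://bucket/path/forecast.zarr/{ensemble_member}/{step}/",
    "variables": {
        "ensemble_member": {"type": "string", "enum": ["1", "2"]},
        "step": {"type": "string", "enum": ["1", "2", "3"]},
    },
}

for uri in expand_link_template(example_template):
    print(uri)
```

The number of expanded URIs is the product of the enum lengths, which is why listing each combination as its own asset does not scale for large datacubes.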
What is the step level pointing to? Group or array?
m-mohr
left a comment
I'm having an issue with this title / scoping:
Either we define best practices that apply generally for n-d arrays (aka datacubes) or it's a ZARR best practice. The mixture right now is not ideal, given that there is also another general datacube best practice evolving here: https://github.com/EOEPCA/datacube-access/blob/main/best_practices/stac_best_practices.md
If it's just for ZARR (and its ancestor netCDF), it should only claim that, but the document can stay as is.
If it's meant to be more generic, then we should merge the two best practices.
I'm open to both variants.
In a nutshell, in the mentioned evolving document we tried to cover data cubes in general. It is split into
Since many collections are based on non-n-d formats, it's also relevant to have these best practices.

**Key Points**:

- The kerchunk reference file is considered as the data store and thus is referenced as a link with `rel: store`
The wording "The kerchunk reference file is considered as the data store" couples the convention to the kerchunk implementation. I think this should instead say that the collection should include a link with `rel: store` which points to the entrypoint of the virtual Zarr store.
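A minimal sketch of what this suggestion could look like from a client's point of view, independent of kerchunk. The `store` relation type is still under discussion (see radiantearth/stac-spec#1367), and the collection id, hrefs, and the helper `store_hrefs` are hypothetical placeholders.

```python
def store_hrefs(stac_object):
    """Return the hrefs of all links with rel 'store' in a STAC collection or item dict."""
    return [
        link["href"]
        for link in stac_object.get("links", [])
        if link.get("rel") == "store" and "href" in link
    ]


# Hypothetical collection fragment; the id and hrefs are placeholders.
example_collection = {
    "type": "Collection",
    "id": "example-virtual-zarr",
    "links": [
        {"rel": "self", "href": "https://example.com/collections/example-virtual-zarr"},
        {
            "rel": "store",
            "href": "https://example.com/data/example-virtual-store",
            "title": "Entrypoint of the virtual Zarr store",
        },
    ],
}

print(store_hrefs(example_collection))
```

The lookup only depends on the link relation, so it behaves the same whether the entrypoint is a native Zarr store or a virtual one.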
**Key Points**:

- The kerchunk reference file is considered as the data store and thus is referenced as a link with `rel: store`
- Assets include both the reference file and source data
Again, I think this ties the convention to the kerchunk implementation. What Zarr stores have in common is that there should be a single entrypoint (i.e. a URL), so I feel like that should be the only standard practice. In my opinion, it's up to the implementer whether to also include source data as assets. For both virtual Zarr stores (where the data source may be netCDF files) and native Zarr stores (where the data source is Zarr chunks), how the data is stored should be largely abstracted from the user.
Looking at the example, I would assume this is an example for an item, not a collection. Is that right? I think we should have an example for both items and collections and make sure the practice is not uniquely applicable to kerchunk.
Co-authored-by: Aimee Barciauskas <[email protected]>
This first PR roughly captures the discussion of Day 1 of the STAC Sprint 2025 in Rome.
It needs to be refined and maybe split in more PRs.