@emmanuelmathot
Contributor

This first PR roughly captures the discussion from Day 1 of the STAC Sprint 2025 in Rome.

It needs to be refined and possibly split into several PRs.

Comment on lines 117 to 118
- Zarr v2: `"application/vnd+zarr; version=2"`
- Zarr v3: `"application/vnd+zarr; version=3"`

We should probably also add those to the table in the asset and link best practices.

Contributor Author

We agreed to leave everything in `best-practices-zarr-ndarray.md` for the time being, until we consolidate the principles, and then eventually move the sections to the appropriate other guides.
cc @m-mohr

Collaborator

@m-mohr m-mohr Oct 15, 2025

Is the version documented in the official Zarr docs, or did we invent the version parameter here in STAC?

Contributor Author

I found no reference in the official Zarr docs, but the media type is already adopted in pystac (https://pystac.readthedocs.io/en/stable/api/media_type.html).

Collaborator

For background on the pystac definition: stac-utils/pystac#1546
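
For clients that need to branch on the Zarr version, the `version` parameter can be read straight from the media type string. The following is a minimal Python sketch, not part of this PR; the helper names `parse_media_type` and `zarr_version` are hypothetical, and the parameter is a convention used in this thread and in pystac rather than a registered one.

```python
def parse_media_type(value):
    """Split a media type string into (base type, parameter dict)."""
    parts = [part.strip() for part in value.split(";")]
    base = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, _, val = part.partition("=")
            params[key.strip().lower()] = val.strip().strip('"')
    return base, params


def zarr_version(media_type):
    """Return the Zarr version as an int, or None if it is absent or not Zarr."""
    base, params = parse_media_type(media_type)
    if base != "application/vnd+zarr":
        return None
    version = params.get("version")
    return int(version) if version and version.isdigit() else None


v2 = zarr_version("application/vnd+zarr; version=2")
v3 = zarr_version("application/vnd+zarr; version=3")
```

A media type without the parameter yields `None`, so a client can fall back to inspecting the store metadata itself.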

@florianziemen florianziemen left a comment

Thanks a lot for your efforts! The Climate and Weather example looks very good to me. I've suggested a few minor edits to match the original data and the STAC specs a bit better.

@Scartography
Scartography commented Oct 15, 2025

Apart from specifying the href to the asset itself, the item should have information on how the asset paths are constructed, and the assets should have information about the "groups" that lead, via a standard template, to the assets themselves.

A template suggestion like this could live at the item level, and the assets should stick to it:
self.item.assets[0].href = self.item.href/{group1}/{group2}/{group3}.../{asset/band_name/data_entry}

Thus:

https://objects.eodc.eu:443/e05ab01a9d56408d82ac32d69a5aae2a:202510-s02msil2a-eu/14/products/cpm_v256/S2C_MSIL2A_20251014T142151_N0511_R096_T25WET_20251014T161521.zarr/meassurement/reflectance/r10m/b02

where group1=meassurement, group2=reflectance, group3=r10m, and data_entry=b02. This should work for any group names, which should be specified in the assets.

An example of the item's self href (link), for reference:

self.item.href = "https://objects.eodc.eu:443/e05ab01a9d56408d82ac32d69a5aae2a:202510-s02msil2a-eu/14/products/cpm_v256/S2C_MSIL2A_20251014T142151_N0511_R096_T25WET_20251014T161521.zarr"
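
The path template suggested above could be applied roughly as follows. This is only a sketch under assumptions: `build_asset_href` is a hypothetical helper, and the URL and group names are placeholders rather than the real store.

```python
def build_asset_href(item_href, groups, data_entry):
    """Join the item href, the nested group names, and the array name."""
    segments = [segment.strip("/") for segment in (*groups, data_entry)]
    return "/".join([item_href.rstrip("/"), *segments])


# Placeholder URL and group names, for illustration only.
href = build_asset_href(
    "https://example.com/product.zarr",
    ["measurements", "reflectance", "r10m"],
    "b02",
)
print(href)
```

Because the groups are passed as a list, the same helper covers any nesting depth.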


1. **A Zarr asset SHALL reference a group containing one or more arrays or groups**

This is equivalent to an xarray Dataset or an xarray DataTree.

I would propose tagging such assets with the role `group` (or similar) to make it easier for clients to find them programmatically.
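
A client-side lookup along these lines might look like the sketch below. It assumes the proposed `group` role, which is a suggestion in this thread and not an established STAC role; `assets_with_role` and the sample item are hypothetical.

```python
def assets_with_role(item, role):
    """Return the assets of a STAC item (as a dict) that carry the given role."""
    return {
        key: asset
        for key, asset in item.get("assets", {}).items()
        if role in asset.get("roles", [])
    }


# Hypothetical item with one Zarr group asset and one thumbnail.
item = {
    "assets": {
        "reflectance": {
            "href": "https://example.com/product.zarr/measurements/reflectance",
            "type": "application/vnd+zarr; version=3",
            "roles": ["data", "group"],
        },
        "thumbnail": {
            "href": "https://example.com/thumb.png",
            "roles": ["thumbnail"],
        },
    }
}
group_assets = assets_with_role(item, "group")
```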

…cube extension (#33)

@emmanuelmathot this PR will merge the changes we discussed into your
existing PR
@emmanuelmathot
Contributor Author

@clausmichele @fabricebrito @Scartography Discussion for store moved here: radiantearth/stac-spec#1367

Comment on lines +719 to +722
- The kerchunk reference file is considered as the data store and thus is reference as a link with `rel: store`
- Assets include both the reference file and source data
- Role `"reference"` indicates virtual/indirect data access
- Role `"source"` indicates the underlying data files

I don't think these points match the example above, since the assets don't include a reference file. Should virtual Zarr files follow the same principle as normal Zarr files? That is, if they reference a group of arrays they belong in the assets, and if they are a Zarr store they should be included in the links.

@emmanuelmathot
Contributor Author

linking for mime-type: stac-utils/pystac#1546

@m-mohr
Collaborator

m-mohr commented Oct 16, 2025

For the media type, there's no official registration at IANA yet. If OGC(?) would register it as part of the community standard process, we should probably make them aware that we'd appreciate a version parameter being registered.


Individual arrays within the store SHOULD NOT be represented as separate assets.

The appropriate level depends on how users will access the data.

Weather and climate data typically have deep hierarchical nesting, with multiple layers of subgrouping. There is not really a single appropriate level for users as a whole.

This is something we already struggle with for representing datacube data in STAC, when we are trying to map STAC to our raw datacubes (non-Zarr). We proposed the linked templates extension, with the ability to apply this to child links, to handle this problem neatly:
stac-extensions/link-templates#1

I think the same issue is surfacing here (but instead of catalogs and items, it's assets and variables/bands). Flattening the multi-dimensional cube into a list of assets just doesn't do justice to the n-dimensional structure of a datacube. The size also becomes enormous: it can work for lower-dimensional or smaller datacubes, but not for larger ones.

Using the linkTemplate, as described further down this document, for individual arrays helps a little, but I think linkTemplating should be allowed at the asset/group level, then we could do something like:

"assets": {
  "forecast": {
    "type": "application/vnd+zarr; version=3",
    "title": "Ensemble Forecast for <date>",
    "linkTemplate": {
      "rel": "data",
      "title": "Forecast field",
      "uriTemplate": "s3://bucket/path/forecast.zarr/{ensemble_member}/{step}/",
      "variables": {
        "ensemble_member": {
          "description": "Index or identifier of ensemble member",
          "type": "string",
          "enum": [
            "1",
            "2",
            "3",
            ...
            "50"
          ]
        },
        "step": {
          "description": "Forecast lead time (e.g., 6h, 12h, 24h)",
          "type": "string",
          "enum": [
            "1",
            "2",
            "3",
            ...
            "360"
          ]
        }
      }
    }
  }
}

This only shows a 2D example, but it would generalise pretty well to any N-dimensional structure. Plain enums are used here, but you could also use more appropriate template schemas.

Within these groups we still have a large number of variables/"bands" which also benefit from link templates.
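
To make the expansion concrete, here is a minimal Python sketch of how a client might enumerate the URIs behind such a `linkTemplate`. It assumes plain `{name}` placeholders and uses `str.format` rather than a full RFC 6570 URI Template implementation, so no percent-encoding is done; `expand_link_template` is a hypothetical helper and the enums are shortened.

```python
from itertools import product


def expand_link_template(link_template):
    """Yield one URI per combination of the template variables' enum values."""
    variables = link_template["variables"]
    names = list(variables)
    value_lists = [variables[name]["enum"] for name in names]
    for combo in product(*value_lists):
        yield link_template["uriTemplate"].format(**dict(zip(names, combo)))


# Shortened enums, for illustration only.
link_template = {
    "uriTemplate": "s3://bucket/path/forecast.zarr/{ensemble_member}/{step}/",
    "variables": {
        "ensemble_member": {"type": "string", "enum": ["1", "2"]},
        "step": {"type": "string", "enum": ["1", "2", "3"]},
    },
}
uris = list(expand_link_template(link_template))
```

If the enums really run from 1 to 50 and from 1 to 360, as the elided example suggests, the same template stands in for 50 × 360 = 18,000 URIs, which is the size problem that a flat asset list runs into.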

Contributor Author

What is the step level pointing to? Group or array?

Collaborator

@m-mohr m-mohr left a comment

I'm having an issue with this title / scoping:

Either we define best practices that apply generally to n-d arrays (aka datacubes), or it's a Zarr best practice. The current mixture is not ideal, given that another general datacube best practice is also evolving here: https://github.com/EOEPCA/datacube-access/blob/main/best_practices/stac_best_practices.md

If it's just for Zarr (and its ancestor netCDF), it should only claim that, but the document can stay as is.
If it's meant to be more generic, then we should merge the two best practices.

I'm open to both variants.

@przell

przell commented Nov 19, 2025

If it's just for Zarr (and its ancestor netCDF), it should only claim that, but the document can stay as is.
If it's meant to be more generic, then we should merge the two best practices.

In a nutshell, in the mentioned evolving document we tried to cover data cubes in general. It is split into

  • "Datacubes"
    • For datacube native formats (e.g. zarr)
    • It's a stub - the best practices defined here are great and would replace this part.
  • "Raster Data"
    • Construct data cubes from formats that are not "data cube native" (COGs, JPEG2000)
    • Deconstruct data cubes into these formats.

Since many collections are based on the non n-d-formats it's also relevant to have these best practices.
How to have them organized is up for discussion: One document or two. A link between them (the two use cases) would be nice though.


**Key Points**:

- The kerchunk reference file is considered as the data store and thus is reference as a link with `rel: store`

Saying "The kerchunk reference file is considered as the data store" couples the convention to the kerchunk implementation. I think this should say that the collection should include a link with `rel: store` that points to the entry point of the virtual Zarr store.

**Key Points**:

- The kerchunk reference file is considered as the data store and thus is reference as a link with `rel: store`
- Assets include both the reference file and source data

Again, I think this ties the convention to the kerchunk implementation. What is common across Zarr stores is that there should be a single entry point (i.e. a URL), so I feel that should be the only standard practice. In my opinion, it's up to the implementer whether they also want to include source data as assets. For both virtual Zarr stores (where the data source may be netCDF files) and native Zarr stores (where the data source is Zarr chunks), how the data is stored should be largely abstracted from the user.
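
From a client's point of view, the single-entry-point practice could be consumed as in the sketch below. It assumes the `rel: store` link discussed in this PR, which is still a proposal; `find_store_link` and the sample collection are hypothetical.

```python
def find_store_link(stac_object):
    """Return the first link with rel "store", or None if there is none."""
    for link in stac_object.get("links", []):
        if link.get("rel") == "store":
            return link
    return None


# Hypothetical collection; the hrefs and media type are placeholders.
collection = {
    "type": "Collection",
    "links": [
        {"rel": "self", "href": "https://example.com/collection.json"},
        {
            "rel": "store",
            "href": "https://example.com/data.zarr",
            "type": "application/vnd+zarr; version=3",
        },
    ],
}
store = find_store_link(collection)
```

The lookup does not depend on whether the store is native or virtual, and it works the same way on an item as on a collection.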

Looking at the example, I would assume this is an example for an item, not a collection. Is that right? I think we should have an example for both items and collections and make sure the practice is not uniquely applicable to kerchunk.
