diff --git a/standard/template/geozarr-spec.adoc b/standard/template/geozarr-spec.adoc index 5e4c5d5..b8cfec5 100644 --- a/standard/template/geozarr-spec.adoc +++ b/standard/template/geozarr-spec.adoc @@ -45,23 +45,23 @@ include::sections/clause_6_informative_text.adoc[] include::sections/clause_7_unified_data_model.adoc[] -include::sections/clause_8_conformance.adoc[] +// Discarded: include::sections/clause_8_conformance.adoc[] include::sections/clause_9_zarr_encoding.adoc[] -include::sections/clause_10_geotiff_encoding.adoc[] +// include::sections/clause_10_geotiff_encoding.adoc[] //// add or remove annexes after "A" as necessary //// -include::sections/annex-a.adoc[] +//include::sections/annex-a.adoc[] -include::sections/annex-n.adoc[] +// include::sections/annex-n.adoc[] //// Revision History should be the last annex before the Bibliography Bibliography should be the last annex //// -include::sections/annex-history.adoc[] +// include::sections/annex-history.adoc[] -include::sections/annex-bibliography.adoc[] +//include::sections/annex-bibliography.adoc[] diff --git a/standard/template/sections/clause_0_front_material.adoc b/standard/template/sections/clause_0_front_material.adoc index b9f7975..d6ba687 100644 --- a/standard/template/sections/clause_0_front_material.adoc +++ b/standard/template/sections/clause_0_front_material.adoc @@ -1,21 +1,21 @@ .Preface -The GeoZarr Unified Data Model and Encoding Standard defines a layered, standards-based framework for representing and encoding geospatial and scientific datasets in the Zarr format. It integrates foundational specifications such as the Unidata Common Data Model (CDM), the CF Conventions, and selected OGC and community standards to enable semantic, structural, and operational interoperability across Earth observation platforms and geospatial ecosystems. 
- -This Standard introduces a unified model that harmonises metadata structures, array-based data representations, coordinate referencing, and multiscale tiling semantics. It provides a coherent framework that facilitates encoding into Zarr v2 and v3, supporting scalable, cloud-native workflows. - -The purpose of this document is to provide implementation guidance and normative structure for consistent, interoperable adoption of GeoZarr across tools, platforms, and services. This work extends prior standardisation efforts within the OGC, including OGC API – Tiles, the Tile Matrix Set Standard, and EO metadata conventions, and anticipates integration with catalogue systems such as STAC. +The GeoZarr Standard defines a layered, standards-based framework for representing and encoding geospatial and scientific datasets in the Zarr format. The purpose of this document is to provide implementation guidance and normative structure for consistent, interoperable adoption of GeoZarr across tools, platforms, and services. This work extends prior standardisation efforts within the OGC, including OGC API – Tiles, the Tile Matrix Set Standard, and EO metadata conventions, and anticipates integration with catalogue systems such as STAC. This Standard has been developed in collaboration with contributors from Earth observation, climate science, geospatial analysis, and cloud-native geodata infrastructure communities. Future work may extend this model to additional storage formats, API services, and semantic layers. [abstract] == Abstract -The GeoZarr Unified Data Model and Encoding Standard specifies a conceptual and implementation framework for representing multidimensional, geospatial datasets using the Zarr format. This Standard builds upon the Unidata Common Data Model (CDM) and the Climate and Forecast (CF) Conventions, and introduces interoperable constructs for tiling, georeferencing, and metadata integration. 
+Zarr provides efficient chunked storage for n-dimensional arrays but does not provide the semantic constructs required for geospatial and scientific data workflows. -The model defines core elements—dimensions, coordinate variables, data variables, attributes—and optional extensions for multi-resolution overviews, affine geotransforms, and STAC metadata. Encoding guidance is provided for Zarr Version 2 and Zarr Version 3, including chunking, group hierarchy, and metadata conventions. +GeoZarr defines an abstract data model and a set of conventions for representing geospatial and scientific datasets in the Zarr format: -GeoZarr aims to bridge scientific and geospatial communities by enabling round-trip transformations with formats such as NetCDF and GeoTIFF, and supporting compatibility with tools in the scientific Python and geospatial ecosystems. This Standard enables scalable, standards-compliant, and semantically rich data structures for cloud-native Earth observation applications. +- Bridges the Unidata Common Data Model (CDM) and the Zarr format by defining how the semantic constructs of the CDM are represented within Zarr’s storage model. +- Supports community metadata standards such as CF, GeoTIFF, and GDAL. +- Extends the CDM for geospatial use through multiscale overviews and affine transformations. + +By providing a standardized framework for geospatial semantics, GeoZarr enables scientific and geospatial applications to fully utilize cloud-native storage architectures while maintaining the rich metadata and coordinate referencing required for Earth observation workflows. The result is a modern, scalable approach to storing and accessing geospatial data that meets the needs of both data providers and consumers. 
== Submitters @@ -29,4 +29,4 @@ All questions regarding this submission should be directed to the editor or the |Brianna Pagán _(editor)_ | DevSeed |Ryan Abernathey| EarthMover | TBD | TBD -|=== \ No newline at end of file +|=== diff --git a/standard/template/sections/clause_1_scope.adoc b/standard/template/sections/clause_1_scope.adoc index 93a5d91..028e2c4 100644 --- a/standard/template/sections/clause_1_scope.adoc +++ b/standard/template/sections/clause_1_scope.adoc @@ -1,7 +1,31 @@ == Scope -The GeoZarr Unified Data Model and Encoding Standard defines a conceptual and implementation framework for representing and encoding geospatial and scientific datasets using the Zarr format. The scope of this Standard includes the definition of a format-agnostic unified data model, the specification of its encoding into Zarr Version 2 and Version 3, and the establishment of extension points to support interoperability with external metadata and tiling standards. +The GeoZarr Standard defines a conceptual and implementation framework for representing and encoding geospatial and scientific datasets using the Zarr format. The scope of this Standard includes the definition of a format-agnostic data model, the specification of its encoding into Zarr Version 2 and Version 3, and a set of extensions to support affine transformations and overviews. -This Standard addresses the needs of Earth observation, environmental monitoring, and geospatial analysis applications that require efficient, scalable access to multidimensional datasets. It enables the harmonisation of existing data models, such as the Unidata Common Data Model (CDM) and the Climate and Forecast (CF) Conventions, with operational encoding formats suitable for cloud-native storage and analysis. +These capabilities are necessary for geospatial data because Zarr does not provide semantic constructs for geospatial data interpretation. 
Applications need to understand not just array shapes and values, but coordinate meanings, projection parameters, and scientific metadata. GeoZarr fills this gap without compromising Zarr's performance characteristics. -Typical use cases include the storage, transformation, discovery, and processing of raster and gridded data, data cubes with temporal or vertical dimensions, and catalogue-enabled datasets integrated with metadata standards such as STAC and OGC Tile Matrix Sets. +=== Why GeoZarr Exists + +Zarr, by design, is a low-level container for storing n-dimensional arrays and metadata. While this simplicity is a strength for performance and interoperability, it means Zarr lacks higher-level concepts that geospatial applications require: + +* *Coordinate Systems:* No native way to associate spatial or temporal meaning with array dimensions +* *Grid Mappings:* No standard mechanism for projection and coordinate reference system metadata +* *Semantic Metadata:* No conventions for units, standard names, or scientific attributes +* *Variable Relationships:* No formal distinction between coordinate variables and data variables + +These concepts are essential for geospatial workflows but must be layered on top of Zarr's array storage. GeoZarr provides this semantic layer through established standards (the Unidata Common Data Model and the CF Conventions) while preserving Zarr's cloud-native advantages. + +=== Relationship to Zarr Core Concepts + +GeoZarr builds upon Zarr's foundational concepts of stores and hierarchies. A Zarr store provides the storage and retrieval interface (e.g., filesystem, cloud object storage), while a hierarchy defines the logical tree structure of groups and arrays within that store. GeoZarr specifies how to organize and structure hierarchies to support geospatial semantics, without modifying the underlying store interface. 
+ +=== Use Cases and Applications + +This Standard addresses the needs of Earth observation, environmental monitoring, and geospatial analysis applications that require efficient, scalable access to multidimensional datasets. It enables the harmonisation of existing data models with operational encoding formats suitable for cloud-native storage and analysis. + +Typical use cases include: +* Storage and processing of raster and gridded data +* Management of data cubes with temporal or vertical dimensions +* Integration with catalogue systems through standardized metadata +* Multi-resolution tiling for efficient visualization and analysis +* Cloud-optimized access to large geospatial datasets diff --git a/standard/template/sections/clause_2_conformance.adoc b/standard/template/sections/clause_2_conformance.adoc index 52e1a7c..fd32b8f 100644 --- a/standard/template/sections/clause_2_conformance.adoc +++ b/standard/template/sections/clause_2_conformance.adoc @@ -1,5 +1,7 @@ == Conformance +> WARNING: This section is a placeholder and should be ignored for now. Requirements classes will be designed and summarized here once the specification is complete. + The GeoZarr Unified Data Model is structured around a modular set of requirements classes. These classes define the conformance criteria for datasets and implementations adopting the GeoZarr specification. Each class provides a distinct set of structural or semantic expectations, facilitating interoperability across a broad spectrum of geospatial and scientific use cases. The *Core* requirements class defines the minimal compliance necessary to claim conformance with the GeoZarr Unified Data Model. It is intentionally open and permissive, supporting incremental adoption and broad compatibility with existing Zarr tools and data models based on the Unidata Common Data Model (CDM). 
diff --git a/standard/template/sections/clause_4_terms_and_definitions.adoc b/standard/template/sections/clause_4_terms_and_definitions.adoc index 007f320..47d75eb 100644 --- a/standard/template/sections/clause_4_terms_and_definitions.adoc +++ b/standard/template/sections/clause_4_terms_and_definitions.adoc @@ -2,18 +2,34 @@ === Terms and definitions +The GeoZarr specification inherits terms from the following sources: + +* https://docs.unidata.ucar.edu/netcdf-java/5.2/userguide/common_data_model_overview.html#data-access-layer-object-model[Unidata Common Data Model] + +* https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#concepts-and-terminology[Zarr concepts and terminology] + + +==== affine transformation + +An affine transformation is a geometric mapping that preserves points, straight lines, and parallelism. It combines linear transformations (such as rotation, scaling, reflection, or shear) with translation. + + ==== array A multidimensional, regularly spaced collection of values (e.g., raster data or gridded measurements), typically indexed by dimensions such as time, latitude, longitude, or spectral band. ==== chunk -A sub-array representing a partition of a larger array, used to optimise data access and storage. In Zarr, data is stored and accessed as a collection of independently compressed chunks. +A sub-array representing a partition of a larger array, used to optimize data access and storage. In Zarr, data is stored and accessed as a collection of independently compressed chunks. ==== coordinate variable A one-dimensional array whose values define the coordinate system for a dimension of one or more data variables. Typical examples include latitude, longitude, time, or vertical levels. +==== data model + +A data model is an *abstract*, conceptual framework that defines how data is structured, organized, and interpreted, independent of any particular storage medium or implementation. 
In contrast, a file format represents a concrete realization of this model, defining how the data is stored on disk. + ==== data variable An array containing the primary geospatial or scientific measurements of interest (e.g., temperature, reflectance). Data variables are defined over one or more dimensions and associated with attributes. @@ -22,29 +38,32 @@ An array containing the primary geospatial or scientific measurements of interes An index axis along which arrays are organised. Dimensions provide a naming and ordering scheme for accessing data in multidimensional arrays (e.g., `time`, `x`, `y`, `band`). -==== group +==== dataset -A container for datasets, variables, dimensions, and metadata in Zarr. Groups may be nested to represent a logical hierarchy (e.g., for resolutions or collections). +*Avoid using:* this term is overloaded and avoided in this document. A dataset usually represents a self-contained group of variables within a hierarchical data structure. The variables often share one or more dimensions, and the group is the unit that can be opened by a data access library (see <>). ==== metadata Structured information describing the content, context, and semantics of datasets, variables, and attributes. GeoZarr metadata includes CF attributes, geotransform definitions, and links to STAC metadata where applicable. -==== multiscale dataset +==== overview + +A downscaled representation of a variable that facilitates rapid data display and efficient zooming. Overviews provide lower-resolution versions of the original data, enabling quick visualization and access without reading the full-resolution array. Multiple overview levels may be generated to support progressive rendering across different scales. + +==== store -A dataset that includes multiple representations of the same data variable at varying spatial resolutions. Each resolution level is associated with a tile matrix from an OGC Tile Matrix Set. 
+A system that provides storage and retrieval operations for Zarr hierarchies, as defined in the https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#stores[Zarr core specification]. A store implements the abstract store interface and can be backed by various storage technologies such as filesystems, cloud object storage, or databases. GeoZarr hierarchies are stored within and accessed through Zarr stores. ==== tile matrix set A spatial tiling scheme defined by a hierarchy of zoom levels and consistent grid parameters (e.g., scale, CRS). Tile Matrix Sets enable spatial indexing and tiling of gridded data. -==== transform +[[variable-group]] +==== variable group -An affine transformation used to convert between grid coordinates and geospatial coordinates, typically defined using the GDAL GeoTransform convention. +A variable group is a container that holds a coherent collection of variables sharing the same dimensional structure and coordinate system (and may contain additional variables or subgroups). It is conceptually equivalent to an xarray Dataset. -==== unified data model (UDM) -A conceptual model that defines how to structure geospatial data in Zarr using CDM-based constructs, including support for coordinate referencing, metadata integration, and multiscale representations. === Abbreviated Terms diff --git a/standard/template/sections/clause_6_informative_text.adoc b/standard/template/sections/clause_6_informative_text.adoc index 591f4f1..1653072 100644 --- a/standard/template/sections/clause_6_informative_text.adoc +++ b/standard/template/sections/clause_6_informative_text.adoc @@ -1,30 +1,14 @@ [[overview]] == Overview -The GeoZarr Unified Data Model and Encoding Standard defines a conceptual and implementation framework for representing multidimensional geospatial data using the Zarr format. 
Developed under the guidance of the OGC GeoZarr Standards Working Group (SWG), the Standard establishes conventions for encoding scientific and Earth observation datasets in a way that promotes scalability, interoperability, and compatibility with cloud-native infrastructure. +The **GeoZarr Standard** defines an **abstract data model** and a set of **conventions** for representing and describing geospatial and scientific datasets using the **Zarr** format. -GeoZarr is built on widely adopted community standards, including the Unidata Common Data Model (CDM) and Climate and Forecast (CF) Conventions. It introduces additional extensions and structural constructs to support multi-resolution tiling, geospatial referencing, and catalogue-enabled metadata integration (e.g., STAC). +Zarr provides efficient, chunked storage for n-dimensional arrays but does not include the semantic constructs required for geospatial and scientific data workflows. The **Unidata Common Data Model (CDM)** addresses this gap by introducing essential concepts that structure information through **variables**, **groups**, **coordinates**, and **metadata**. This abstract data model provides the semantic framework that enables structured interpretation of array-based data on top of Zarr’s storage foundation. -This Standard provides both: +The **primary objective** of GeoZarr is to specify how the **CDM** is encoded within Zarr. GeoZarr provides normative rules for encoding these CDM concepts in Zarr and thereby standardises the encoding practices already adopted by CDM-compatible libraries such as **xarray** and **nczarr**, promoting consistent interpretation and interoperability across tools and platforms. -* **Core requirements**, which define minimal compliance to represent array-based datasets using CDM constructs in Zarr, supporting open and permissive adoption across use cases. 
-* **Modular extension classes**, which define additional capabilities such as time series support, affine geotransform referencing, multi-resolution overviews, and projection coordinates, in line with OGC and community practices. +By defining an **abstract model** based on the **CDM** and a corresponding **encoding for Zarr**, GeoZarr establishes an explicit relationship between **the conceptual structure of the data** and **its physical storage representation**. Zarr defines how data are stored and accessed as chunked, hierarchical arrays, while GeoZarr specifies how this stored structure represents the scientific and geospatial meaning of the dataset. -These modular components enable GeoZarr to serve a wide range of applications—from basic EO data storage to high-performance, cloud-native visualisation and analytics workflows. - -=== Encodings - -GeoZarr supports encoding in both Zarr Version 2 and Zarr Version 3. Each version defines how arrays, groups, and metadata are stored within a directory-based structure. All metadata is encoded in JSON-compatible formats, ensuring both human readability and machine interoperability. - -Encoding guidelines include: - -* Hierarchical grouping of datasets via Zarr groups. -* Dimension indexing and binding via dimension metadata. -* Attribute-based metadata compliant with CF conventions. -* Multi-resolution overviews aligned with OGC Tile Matrix Sets. -* Optional integration of STAC metadata for discovery and cataloguing. - -JSON is the primary format for metadata, attributes, and structural declarations. Implementations are encouraged to support standardised naming conventions, EPSG code references, and structured metadata to facilitate search, validation, and transformation across platforms. - -GeoZarr does not prescribe a single interface for data access. 
Instead, it enables **serverless and cloud-native** data access strategies by aligning its model with chunked, parallelisable storage patterns that are optimised for use in object stores and analytical environments. +As a **secondary objective**, GeoZarr extends the **CDM base layer** with additional capabilities required for geospatial and cloud-native applications. These extensions include **multiscale overviews**, which enable the representation of data at multiple levels of detail, and **affine transformations**, which define the spatial relationship between data coordinates and real-world locations. All extensions remain fully aligned with the CDM framework. +The **CDM** base layer also provides a **generic framework** capable of hosting metadata from a wide range of community standards. GeoZarr encourages the use of the **Climate and Forecast (CF) Conventions**, which are themselves defined around the CDM model, without imposing them as mandatory. This flexibility also supports metadata from other domain-specific standards such as **GeoTIFF**, **GDAL**, and similar geospatial conventions. diff --git a/standard/template/sections/clause_7_unified_data_model.adoc b/standard/template/sections/clause_7_unified_data_model.adoc index 8af7598..d369d62 100644 --- a/standard/template/sections/clause_7_unified_data_model.adoc +++ b/standard/template/sections/clause_7_unified_data_model.adoc @@ -1,340 +1,174 @@ [obligation==informative] -== Unified Data Model +[[data-model]] +== GeoZarr Data Model === Scope and Purpose -This Standard defines a unified data model (UDM) that provides a conceptual framework for representing geospatial and scientific data in Zarr. The purpose of this model is to support standards-based interoperability across Earth observation systems and analytical environments, while preserving compatibility with existing data models and software ecosystems.. 
+The GeoZarr Data Model defines the abstract structure for representing geospatial and scientific gridded data within the GeoZarr framework. -The unified data model incorporates and extends the following established specifications and community standards: +The GeoZarr Data Model serves the following purposes: -- **Unidata Common Data Model (CDM)** – Provides the foundational resource structure for scientific datasets, encompassing dimensions, coordinate systems, variables, and associated metadata elements. -- **CF (Climate and Forecast) Conventions** – Defines a widely adopted metadata profile for describing spatiotemporal semantics in CDM-based datasets. -- **Selected constructs from related Standards and practices**, including: - - The **OGC Tile Matrix Set Standard**, which enables multi-resolution representations of gridded data. - - **GDAL geotransform metadata**, used to express affine transformations and interpolation characteristics. - - **SpatioTemporal Asset Catalog (STAC)** metadata elements for resource discovery and cataloguing (Collection and Item constructs). +* to **clarify the role of the Common Data Model (CDM)** as the structural foundation (recognising its suitability for Zarr as demonstrated by *xarray*, *GDAL*, and *nczarr*); +* to **extend the CDM** with additional geospatial capabilities required for cloud-native and tiled data representations, including *affine transformations* and *multiscale overviews*; +* to **ensure compatibility** with established data models and conventions such as *netCDF*, *CF*, *GDAL*, and *GeoTIFF*. -The unified model is format-agnostic and describes the abstract structure of resources independently of the physical encoding. It does not redefine the semantics of the CDM or CF conventions, but introduces integration and extension points required to support tiled multiscale data, geospatial referencing, and metadata for discovery. 
-This clause specifies the logical composition of the unified model, the external standards it leverages, and the conformance points that facilitate harmonised implementation within the GeoZarr framework. +=== Conceptual Basis -=== Foundational Model and Standards Reuse +The GeoZarr Data Model adopts and extends the **Unidata Common Data Model (CDM)** as its conceptual foundation. +The CDM defines a hierarchy of *groups*, *variables*, *dimensions*, and *attributes* that together describe the logical organisation of scientific data. -The unified data model described in this Standard is derived from established community specifications to maximise interoperability and to enable the reuse of mature tools and practices. The model is grounded in the Unidata Common Data Model (CDM) and the Climate and Forecast (CF) Conventions, which together provide a robust framework for representing scientific and geospatial datasets. +GeoZarr reuses these constructs and introduces an additional layer of **GeoZarr Extensions** that provide explicit geospatial semantics and support for cloud-native scalability: + ** *Affine transformations* defining spatial reference through linear mapping between array indices and real-world coordinates; + ** *Overviews* enabling multiscale representations and efficient visualization of large datasets. -==== Common Data Model (CDM) - -The CDM defines a generalised schema for representing array-based scientific datasets. The following constructs are reused directly within the unified model: - -- **Dimensions** – Integer-valued, named axes that define the extents of data variables. -- **Coordinate Variables** – Variables that supply coordinate values along dimensions, establishing spatial or temporal context. -- **Data Variables** – Multidimensional arrays representing observed or simulated phenomena, associated with dimensions and coordinate variables. -- **Attributes** – Key-value metadata elements used to describe variables and datasets semantically. 
-- **Groups** – Optional hierarchical containers enabling logical organisation of resources and metadata. - -The unified data model adopts these CDM components without modification excluding the user-defined types. Semantic interpretation remains consistent with the original CDM specification. GeoZarr structures are mapped to CDM constructs to ensure compatibility and clarity. - -==== CF Conventions - -The CF Conventions specify standardised metadata attributes and practices to describe spatiotemporal context within CDM-compliant datasets. These conventions support consistent interpretation of: - -- Coordinate systems -- Grid mappings -- Physical units -- Standard variable naming - -The unified data model supports CF-compliant metadata, including attributes such as `standard_name`, `units`, and `grid_mapping`. The unified data model does not prescribe CF compliance but enables it through permissive design. Partial adoption of CF attributes is supported, and non-compliant datasets may selectively adopt CF metadata as needed. - -==== Standards-Based Extensions - -To support additional capabilities, the model defines optional extension points referencing external OGC and community standards: - -- **OGC Tile Matrix Set** – Facilitates the definition of multiscale grid hierarchies for raster overviews. -- **GDAL Geotransform** – Enables geospatial referencing through affine transformations and optional interpolation specifications. -- **STAC Metadata (Collection and Item)** – Provides linkage to SpatioTemporal Asset Catalogs for resource discovery and indexing. - -These extensions are integrated in a modular fashion and do not alter the core semantics of the CDM or CF structures. Implementations may selectively adopt these extensions based on their application requirements. - -=== Model Extension Points - -The unified data model specifies a series of optional, standards-aligned extension points to support functionality beyond the base CDM and CF constructs. 
These extensions enhance applicability to Earth observation and spatial analysis use cases without imposing additional mandatory requirements. - -Each extension is defined as an independent module. Implementation of any given extension does not necessitate support for others. - -==== Multi-Resolution Overviews (OGC Tile Matrix Set) - -Support for multi-resolution imagery is enabled via integration with the OGC Tile Matrix Set Standard: - -- Tile matrix sets define spatial tiling schemes with consistent resolutions and coordinate reference systems across zoom levels. -- Overviews may be represented as separate Zarr arrays or groups, each aligned to a specific tile matrix level. -- Metadata includes identifiers for tile matrices, spatial resolution, and spatial alignment. - -This approach aligns with the OGC API – Tiles and enables efficient access to large gridded datasets. - -==== GeoTransform Metadata (GDAL Interpolation and Affine Transform) - -Geospatial referencing can be further refined through the inclusion of metadata consistent with GDAL conventions: +=== Format Independence -- Affine transformation is specified via the `GeoTransform` attribute or equivalent structures. -- Interpolation methods may be declared to indicate sampling behaviour or sub-pixel alignment strategies. +Although the model was defined to support encoding in **Zarr**, it remains **format-agnostic** at the conceptual level. +Implementations may serialise the same GeoZarr Data Model structure into other compatible encodings, such as NetCDF or alternative object-based formats, provided they preserve the semantics and conformance requirements defined herein. -This extension augments CF grid mappings by providing precise control over grid placement and coordinate transformations. 
+This separation between **conceptual model** and **physical encoding** ensures that GeoZarr can evolve alongside emerging storage technologies while maintaining interoperability with existing CDM- and CF-based infrastructures. -==== STAC Collection and Item Integration +=== Conceptual Layers -To enable discovery of resources within the hierarchical structure of the data model, this Standard supports the inclusion of STAC metadata elements at appropriate locations within the group hierarchy. +The GeoZarr Data Model is organised into two conceptual layers: -A STAC extension consists of embedding or referencing STAC Collection and Item metadata within the data model: +1. the **Common Data Model (CDM)** – structural foundation for multidimensional data; +2. the **GeoZarr Extensions** – additional constructs for geospatial semantics and multiscale representations. -* Each dataset resource MAY reference a corresponding STAC `Collection` or `Item` using an identifier or embedded object. -* STAC properties such as `datetime`, `bbox`, and `eo:bands` MAY be included in the metadata to enable spatial, temporal, and spectral filtering. -* The structure is compatible with external STAC APIs and metadata harvesting systems. -STAC integration is non-intrusive and modular. It does not impose changes on the internal organisation of datasets and MAY be adopted incrementally by implementations requiring catalogue-based discovery capabilities. - - -==== Modularity and Interoperability - -Each extension point is specified independently. Implementations may advertise support for one or more extensions by declaring conformance to corresponding extension modules. This modularity facilitates incremental adoption, promotes reuse, and enhances interoperability across varied implementation environments. - - -=== Unified Model Structure - -This clause defines the structural organisation of datasets conforming to the unified data model (UDM). 
It consolidates the foundational elements and optional extensions into a coherent architecture suitable for Zarr encoding, while remaining format-agnostic. The model establishes a modular and extensible framework that supports structured representation of multidimensional, geospatially-referenced resources. - -The model represents datasets as abstract compositions of dimensions, coordinate variables, data variables, and associated metadata. This abstraction ensures that applications and services can reason about the content and semantics of a dataset without reliance on storage layout or specific serialisation. - -==== Dataset Structure - -A dataset conforming to the Unified Data Model (UDM) is structured as a hierarchy rooted at a top-level dataset entity. This design enables modularity and facilitates the representation of complex, multi-resolution, or thematically partitioned data collections. - -Each dataset node comprises the following core components, aligned with the Unidata Common Data Model (CDM) and Climate and Forecast (CF) Conventions: - -- **Dimensions** – Named, integer-valued axes defining the extent of data variables. Examples include `time`, `x`, `y`, and `band`. -- **Coordinate Variables** – Arrays that supply coordinate values along dimensions, providing spatial, temporal, or contextual referencing. These may be scalar or higher-dimensional, depending on the referencing scheme. -- **Data Variables** – Multidimensional arrays representing physical measurements or derived products. Defined over one or more dimensions, these variables are associated with coordinate variables and annotated with metadata. -- **Attributes** – Key-value pairs attached to variables or dataset components. Attributes convey semantic information such as units, standard names, and geospatial metadata. - -The hierarchy is implemented through **groups**, which function as containers for variables, dimensions, and metadata. 
Groups may define local context while inheriting attributes from parent nodes. This supports the logical subdivision of datasets by theme, resolution, or processing stage, and enhances the clarity and reusability of complex geospatial structures. - -The diagram below represents the structural layer of the unified data model, derived from the Unidata Common Data Model, which serves as the foundational framework for supporting all overlaying model layer. - -//image::udm-core.png[] - -//ifdef::never-shown[] -//Note: Hide until plantuml is supported -.Conformance-class model -[plantuml, cdm_model, svg, opts="debug"] -.... -@startuml CDM_DAL_Object_Model +==== Common Data Model (CDM) -class Dataset { - + String location - + open() - + close() -} +The **Unidata Common Data Model (CDM)** defines the logical structure of scientific datasets through a hierarchy of **Groups**, **Variables**, **Dimensions**, and **Attributes**. +It provides the foundation upon which GeoZarr and many existing libraries (such as *xarray*, *GDAL*, and *nczarr*) operate. +[plantuml, target="cdm_structure_overview", format=svg] +---- +@startuml class Group { - + String name - + List subgroups - + List variables - + List dimensions - + List attributes -} - -class Dimension { - + String name - + int length - + boolean isUnlimited - + boolean isShared } class Variable { - + String name - + DataType dataType - + List shape - + List attributes - + read() } -class DataType { - + String name - <> +class Dimension { } class Attribute { - + String name - + String type - + List values } -Dataset --> Group : rootGroup -Group --> Group : contains > -Group --> Variable : contains > -Group --> Dimension : defines > -Group --> Attribute : has > -Variable --> Dimension : uses > -Variable --> DataType : has > -Variable --> Attribute : has > -@enduml -.... -//endif::never-shown[] - -Note that, conceptually, node within this hierarchy might be treated as a self-contained dataset. 
- -==== Coordinate Referencing - -Coordinate systems are defined using: - -- **CF Conventions** – Including attributes such as `standard_name`, `units`, `axis`, and `grid_mapping` to express spatiotemporal semantics and coordinate system properties. -- **Affine Transformation Extensions** – Optional support for georeferencing via affine transforms and interpolation metadata (e.g., as defined in GDAL practices), providing enhanced flexibility for irregular grids and grid-aligned imagery. - -The model accommodates both standard CF-compatible definitions and extended referencing mechanisms to support use cases that span scientific analysis and geospatial mapping. - -==== Metadata Integration - -Metadata may be declared at various levels within the model structure: - -- **Global Metadata** – Attributes describing the dataset as a whole, including elements such as `title`, `summary`, and `license`. -- **Variable Metadata** – Attributes associated with individual data or coordinate variables, conveying descriptive or semantic information. -- **Extension Metadata** – Structured metadata linked to optional model extensions (e.g., multiscale tiling, catalogue references, geotransform properties). - -All metadata follows harmonised naming and semantics consistent with the CDM and CF standards, enabling machine and human interpretability while supporting metadata exchange across diverse systems. - -==== Overviews - -The *Overviews* construct defines a formal, interoperable abstraction for multiscale gridded data. It ensures structural consistency across zoom levels and provides a semantic model for integration with tiled representations such as GeoTIFF overviews, OGC API – Tiles, and STAC Tiled Assets. - -===== Purpose - -The *Overviews* construct provides a general mechanism for associating a single logical data variable with a collection of resampled representations, referred to as *zoom levels*. 
Each zoom level holds a reduced-resolution version of the original variable, with progressively decreasing spatial resolution from the base (highest detail) to the coarsest level. - -Overviews enable: - -- Fast access to summary representations for visualisation -- Progressive transmission and downsampling -- Multi-resolution analytics and adaptive processing - -===== Conceptual Structure - -An *Overviews* construct is defined as a *hierarchical set of multiscale representations* of one or more data variables. It comprises the following components: - -[horizontal] -*Base Variable*:: The original, highest-resolution variable to which the overview hierarchy is anchored. It is defined using the standard `DataVariable` structure in the model. -*Overview Levels*:: A sequence of variables representing the same logical quantity as the base variable, but sampled at coarser spatial resolutions. -*Zoom Level Identifier*:: A unique identifier associated with each level, ordered from finest (e.g. `"0"`) to coarsest resolution (e.g. `"N"`). -*Tile Grid Definition*:: A mapping that associates each zoom level with a spatial tiling layout, defined in alignment with a `TileMatrixSet`. -*Spatial Alignment*:: Each overview variable MUST be spatially aligned with the base variable using a consistent coordinate reference system and compatible axis orientation. -*Resampling Method*:: A declared method indicating the technique used to derive coarser levels from the base variable (e.g. `nearest`, `average`, `cubic`). - -===== Model Components - -The *Overviews* construct is represented in the unified data model using the following logical elements: - -[cols="1,3"] -|=== -|Element |Definition - -|`OverviewSet` | A logical grouping of variables at multiple zoom levels associated with a single base variable. - -|`OverviewLevel` | A single resampled variable at a specific resolution, identified by a zoom level string. 
- -|`TileMatrixSetRef` | A reference to the tile grid specification applied across all overview levels. May refer to a well-known identifier, a URI, or an inline object. - -|`TileMatrixLimits` | (Optional) Constraints on the tile coverage per zoom level. - -|`resampling_method` | A string indicating the uniform method used to downsample data across all levels. -|=== - -All overview levels MUST preserve: - -- The data variable’s semantic identity (`standard_name`, `units`, etc.) -- The coordinate reference system -- The axis order and dimension semantics - -Only the resolution and extent (through tiling and shape) may differ across levels. - -===== Relationship to Tile Matrix Set +class DataArray { +} -The *Overviews* construct is structurally aligned with the OGC Tile Matrix Set concept. Each zoom level is mapped to a `TileMatrix`, and the chunk layout for the corresponding data variable SHALL match the tile grid’s `tileWidth` and `tileHeight`. +class CoordinateVariable { +} -The `OverviewSet` MAY constrain tile matrix limits using `TileMatrixSetLimits`, which restrict tile indices to actual data coverage, consistent with the spatial extent of the overview variable. +class Subgroup { +} -===== Usage Context +Group "1" *-- "0..*" Variable +Group "1" *-- "0..*" Dimension +Group "1" *-- "0..*" Attribute +Group "1" *-- "0..*" Subgroup +Variable "1" *-- "1" DataArray +Variable <|-- CoordinateVariable -The *Overviews* construct is applicable to any gridded data variable with at least two spatial dimensions. It is primarily designed for: +@enduml +---- -- Raster imagery (e.g. reflectance, temperature) -- Data cubes with spatial slices (e.g. time-series of spatial grids) -- Multi-band products with consistent spatial structure across levels -The structure may be extended for N-dimensional datasets in future revisions, provided that two spatial axes can be unambiguously identified. 
+* A **Group** is a container that may include variables, dimensions, attributes, and subgroups. +* A **Variable** represents a multidimensional array associated with one or more dimensions and attributes. +* A **Dimension** defines an index axis used to organise data within variables. +* An **Attribute** holds descriptive metadata for groups or variables. +* A **Coordinate Variable** supplies coordinate values along dimensions, establishing spatial or temporal context. +* A **Data Array** represents observed or simulated phenomena, associated with dimensions and coordinate variables. -=== Conformance and Extensibility +This structure enables consistent representation of scientific data independently of storage format, providing the base semantic framework for all GeoZarr encodings. -The GeoZarr data model is designed with an open conformance approach to support a wide range of use cases and implementation contexts. Its core model is permissive, allowing partial implementations, while optional extensions and compliance profiles can define stricter requirements for interoperability. -==== Core Conformance +==== GeoZarr Extensions -- Datasets conforming to the core model must: -* Represent data using CDM-compatible constructs (dimensions, variables, attributes). -* Follow attribute conventions where applicable. -* Be parsable as valid Zarr with structured metadata following this specification. +GeoZarr extends the CDM with additional geospatial constructs required for cloud-native applications: -- CF compliance is not mandatory but is recommended for semantic interoperability. +* **Affine transformations** define the mapping between array indices and real-world coordinates using linear coefficients. + This enables compact georeferencing for regularly gridded data. +* **Multiscale overviews** represent downsampled versions of variables for efficient visualisation and scalable access. 
+ Overviews are structured as subordinate variable groups sharing the same coordinate system. -==== Extension Conformance +All extensions remain aligned with the CDM hierarchy and are encoded using the same core constructs (groups, variables, and attributes). +Together, they provide the minimal geospatial extensions necessary for efficient, standards-based representation of Earth observation and scientific data in cloud environments. -- Implementations may optionally support one or more extension modules: -* Multi-resolution overviews (Tile Matrix Set) -* GeoTransform metadata (GDAL) -* STAC metadata integration -- Each extension defines its own requirement class with validation rules and expected metadata structures. +[plantuml, target="geozarr_extension_overview", format=svg] +---- +@startuml +skinparam classAttributeIconSize 0 +skinparam linetype ortho +skinparam packageStyle rectangle +skinparam backgroundColor #FFFFFF +title GeoZarr Extensions – Geospatial Enhancements -- Tools may advertise which extensions they support and validate datasets accordingly. +package "Common Data Model (CDM)" { + class Group + class Variable + class Dimension + class Attribute +} -==== Conformance Classes +package "GeoZarr Extensions" { + class AffineTransform + class Overview +} -- Conformance Classes may be defined to specify required components and extensions for specific application domains (e.g., visualisation clients, EO archives, catalogue indexing). -- Conformance Classes enable selective validation without constraining the general model. +Variable --> AffineTransform : georeference +Group --> Overview : provides +Variable --> Overview : provides -==== Extensibility Principles +@enduml +---- -- All extensions must preserve compatibility with the core model and avoid redefining existing CDM or CF semantics. -- New extensions should be documented with clear identifiers, schemas, and conformance criteria. 
-- The model encourages interoperability by allowing tools to interpret unknown extensions without failure. -This extensibility framework supports both minimum-viable use and high-fidelity metadata integration, enabling incremental adoption across the geospatial and scientific data communities. +// include::clause_7_part_overviews.adoc[] -=== Interoperability Considerations -Interoperability is a core objective of the GeoZarr unified data model. The model is designed to bridge diverse Earth observation and scientific data ecosystems by enabling structural and semantic compatibility with established formats and standards, while providing a forward-looking foundation for scalable, cloud-native workflows. +=== Interoperability with Other Frameworks -This section outlines the principles and mechanisms supporting interoperability across formats, tools, and communities. +The Common Data Model (CDM), with its flexible hierarchy of groups, variables, dimensions, and attributes, allows direct representation of metadata constructs used across multiple scientific and geospatial standards. -==== Format Mapping and Alignment +This design enables the CDM, and therefore GeoZarr, to act as a *host model* for conventions and metadata originating from other frameworks, while preserving their semantics within a unified structure. -The data model is explicitly aligned with foundational standards including the Unidata Common Data Model (CDM), the CF Conventions, and established practices in formats such as NetCDF and GeoTIFF. Where applicable, GeoZarr datasets may be derived from or transformed into these formats using consistent mappings. +* **netCDF and the Enhanced Data Model** – +The netCDF Enhanced Data Model and the CDM share common origins and are conceptually aligned. +Both organise data into variables, dimensions, and attributes. +As a result, most netCDF datasets can be represented as CDM hierarchies without loss of structure or metadata. 
+Conversely, GeoZarr datasets that follow the CDM pattern can be serialised as valid netCDF encodings. -- *NetCDF (classic and enhanced models)*: -* GeoZarr shares a common conceptual structure with NetCDF via CDM. -* Variables, dimensions, coordinate systems, and attributes follow directly mappable patterns. -* Metadata expressed in CF conventions in NetCDF can be preserved in GeoZarr without loss of fidelity. +* **CF Conventions** – +The CF data model is encoding-independent but conceptually compatible with the CDM. +CF metadata constructs (such as coordinate and auxiliary coordinate variables, standard names, units, and grid mappings) map directly onto CDM variables and attributes. +This allows GeoZarr datasets to incorporate CF semantics naturally, achieving partial or full CF compliance without modifying the underlying data model. -- *GeoTIFF*: -* Raster-based datasets in GeoZarr can map to GeoTIFF by interpreting spatial referencing (via CF or GeoTransform) and band structures. -* Overviews aligned to OGC Tile Matrix Sets may correspond to TIFF image pyramids. -* Projection metadata and resolution information can be mapped via standard tags. +* **GDAL Metadata and Geotransform** – +GDAL expresses georeferencing through affine transformation coefficients and projection information. +These map directly to GeoZarr extension attributes (for affine transforms and CRS) stored as CDM attributes. +GDAL domain metadata can likewise be represented as CDM attributes within groups or variables, maintaining equivalence between GDAL and GeoZarr geospatial metadata. -These mappings facilitate round-trip transformations and enable toolchains that consume or produce multiple formats without reengineering semantic models. +* **GeoTIFF Tags and Metadata** – +GeoTIFF georeferencing information, including coordinate reference system definitions, tie points, and pixel scale, corresponds closely to the affine transform and CRS constructs in the GeoZarr Extensions. 
+These elements can be represented as attributes within CDM-compliant groups and variables, ensuring semantic consistency between file-based and cloud-native representations. -==== Semantic Interoperability +Through these mappings, the CDM acts as a **common semantic framework** that integrates metadata from diverse geospatial standards. +This interoperability ensures that GeoZarr can serve as both a native storage model and a bridge between existing ecosystems such as netCDF/CF, GDAL, and GeoTIFF. -Semantic interoperability is supported through adherence to CF conventions, use of standardised attribute names (e.g., `standard_name`, `units`), and alignment with metadata vocabularies used in other ecosystems (e.g., STAC, EPSG codes, ISO 19115 keywords). +[NOTE] +==== +GeoZarr does **not** define the mappings to CDM for metadata from existing conventions or formats such as CF, netCDF, GDAL, or GeoTIFF. +These mappings are already established and maintained by widely used libraries and implementations, including *xarray*, *netCDF-Java*, *GDAL*, etc. The role of GeoZarr is to provide a **data model and encoding framework**, not to redefine or replicate existing translation logic between metadata standards. -The model does not prescribe specific vocabularies beyond CF but encourages reuse and recognition of widely accepted descriptors to promote cross-domain understanding. +Accordingly, the **GeoZarr encoding specification** will only prescribe additional rules where a specific encoding behaviour in **Zarr** is required for interoperability or conformance. 
+==== ==== Metadata and Discovery Integration @@ -344,11 +178,10 @@ This approach enables seamless integration into modern data catalogues and platf ==== Tool and Ecosystem Support -The unified data model facilitates interoperability with tools and libraries across the following domains: +The Unified Data Model facilitates interoperability with tools and libraries across the following domains: - *Scientific computing*: NetCDF-based libraries (e.g., xarray, netCDF4), Zarr-compatible clients. - *Geospatial processing*: GDAL, rasterio, QGIS (via Zarr driver extensions or translations). - *Cloud-native infrastructure*: support for parallel access, chunked storage, and hierarchical grouping compatible with object storage. Tooling support is expected to grow via standard-conformant implementations, easing adoption across domains and infrastructures. - diff --git a/standard/template/sections/clause_9_zarr_encoding.adoc b/standard/template/sections/clause_9_zarr_encoding.adoc index d62edec..3de6155 100644 --- a/standard/template/sections/clause_9_zarr_encoding.adoc +++ b/standard/template/sections/clause_9_zarr_encoding.adoc @@ -1,9 +1,12 @@ -== Unified Data Model Encoding for Zarr +== Encodings for Zarr -This clause defines the encoding of the unified data model into the Zarr format. The encoding supports both Zarr Version 2 and Zarr Version 3. +This clause defines the normative mapping between the **GeoZarr Data Model** and the **Zarr storage format**. +It specifies how the structural elements of the **Common Data Model (CDM)** (groups, variables, dimensions, and attributes) are encoded in **Zarr v2** and **Zarr v3**, and identifies additional constraints introduced by GeoZarr. -TIP: This is a very preliminary draft. The content is primarily for demonstrating the purpose of the proposed sections. +GeoZarr’s encoding rules are limited to cases where explicit guidance is required for interoperability. 
+GeoZarr does **not** redefine how CF, GDAL, or other metadata conventions map to CDM constructs — these mappings are already implemented in community libraries such as *xarray*, *GDAL*, and *netCDF-Java*. +The GeoZarr encoding rules therefore focus on **CDM structure and semantics**, with additional subsections specifying any **Zarr-specific requirements** for supported metadata conventions. include::clause_9_zarr_encoding_core.adoc[] diff --git a/standard/template/sections/clause_9_zarr_encoding_core.adoc b/standard/template/sections/clause_9_zarr_encoding_core.adoc index a2d6a2e..44683f5 100644 --- a/standard/template/sections/clause_9_zarr_encoding_core.adoc +++ b/standard/template/sections/clause_9_zarr_encoding_core.adoc @@ -1,39 +1,36 @@ -=== Hierarchical Structure +=== Common Data Model Encodings -A dataset conforming to the unified data model is represented as a hierarchical structure of groups, variables (arrays), dimensions, and metadata. The dataset is rooted in a *top-level group*, which may contain: - -- Arrays representing coordinate or data variables -- Child groups for modular organisation, including logical sub-collections or resolution levels -- Metadata attributes at group and array levels - -Each group adheres to a consistent structure, allowing recursive composition. This reflects the CDM's use of *groups* and is supported by both Zarr v2 and v3 with differing implementations. +==== Hierarchical Structure +A GeoZarr hierarchy follows the CDM model of a tree of **groups**, **variables** (arrays), **dimensions**, and **attributes**. +Each Zarr store contains a single root group and an arbitrary number of child groups and arrays, organised recursively. 
[cols="1,2,2"] |=== -|Model Element |Zarr v2 Encoding |Zarr v3 Encoding - -|Root Dataset | Directory with `.zgroup` and `.zattrs` | Directory with `zarr.json`, with `node_type: group` - -|Child Group | Subdirectory with `.zgroup` and `.zattrs` | Subdirectory with `zarr.json`, with `node_type: group` - -|Array | Subdirectory with `.zarray` and `.zattrs` | Subdirectory with `zarr.json`, with `node_type: array` +|CDM Element |Zarr v2 Encoding |Zarr v3 Encoding -|Attributes | `.zattrs` file | `attributes` field in `zarr.json` +|Group | Directory with `.zgroup` and `.zattrs` | Directory containing `zarr.json` with `"node_type": "group"` +|Variable (Array) | Directory with `.zarray` and `.zattrs` | Directory containing `zarr.json` with `"node_type": "array"` +|Attributes | `.zattrs` file (JSON object) | `attributes` field in `zarr.json` |=== -Zarr v3 requires `zarr_format: 3` and stores all metadata (including user-defined attributes) in the `zarr.json` document. Each node includes a `node_type` field: either `"group"` or `"array"`. +Zarr v3 nodes must declare `"zarr_format": 3` and include `"node_type"` set to either `"group"` or `"array"`. +All user-defined metadata, including GeoZarr attributes, shall be placed within the `attributes` field. -=== Dimensions +==== Dimensions -Dimensions define the axes along which variables are indexed. +Dimensions define the index axes for variables. -- In Zarr v2, dimensions are inferred from array shape and declared in `_ARRAY_DIMENSIONS` within `.zattrs`. -- In Zarr v3, dimensions are stored using the `dimension_names` field in `zarr.json`. 
+[cols="1,2,2"] +|=== +|Aspect |Zarr v2 |Zarr v3 -Example for a 2D array with dimension names `["lat", "lon"]`: +|Declaration | `_ARRAY_DIMENSIONS` attribute in `.zattrs` | `dimension_names` field in `zarr.json` +|Scope | Implicit, per array | Explicit, per array; names are globally unique within a group hierarchy +|=== +Example (Zarr v3 array with two dimensions): [source,json] ---- { @@ -45,24 +42,31 @@ Example for a 2D array with dimension names `["lat", "lon"]`: } ---- -=== Coordinate Variables +**Shared Dimensions:** + +Zarr does not define dimension entities as standalone objects. +To preserve CDM semantics, GeoZarr requires that dimension names be **unique within each group hierarchy** and reused consistently across variables that share the same axis. -Coordinate variables (excluding GeoTransform Coordinates) define the geospatial or temporal context of data. They are represented as named arrays with metadata attributes. +As in **netCDF-4**, where groups can define their own local dimensions, GeoZarr allows dimensions to be scoped within groups. +When a dimension defined in one group is shared by variables located in descendant groups, implementations may indicate this relationship by prefixing the dimension name with a slash (e.g., `"/time"`). +In this context, the leading slash signifies that the dimension is defined in an **ancestor group**—not necessarily the root of the hierarchy—and should be interpreted as a shared axis accessible to all subordinate groups. -Coordinate variables are represented as named 1D arrays aligned with corresponding dimensions. + +==== Coordinate Variables + +Coordinate variables define the spatial, temporal, or other contextual axes for data variables. +They are stored as one-dimensional arrays associated with their corresponding dimensions. 
[cols="1,2,2"] |=== -|Feature |Zarr v2 |Zarr v3 +|Aspect |Zarr v2 |Zarr v3 |Storage | Zarr array with `.zarray`, `.zattrs` | Zarr array with `zarr.json` - |Dimension Binding | `_ARRAY_DIMENSIONS` in `.zattrs` | `dimension_names` in `zarr.json` - -|CF Metadata | `standard_name`, `units`, `axis` in `.zattrs` | Under `attributes` in `zarr.json` +|Metadata | CF-style attributes (e.g., `standard_name`, `units`, `axis`) | Same under `attributes` |=== -Example `zarr.json` for a coordinate array: +Example (Zarr v3 coordinate array): [source,json] ---- { @@ -71,12 +75,6 @@ Example `zarr.json` for a coordinate array: "shape": [180], "dimension_names": ["lat"], "data_type": "float32", - "chunk_grid": { - "name": "regular", - "configuration": { - "chunk_shape": [180] - } - }, "attributes": { "standard_name": "latitude", "units": "degrees_north", @@ -85,51 +83,53 @@ Example `zarr.json` for a coordinate array: } ---- +Coordinate variables may also reference *grid mapping* variables for coordinate reference systems, as defined in the CF conventions. -=== Data Variables +==== Data Variables -Data variables represent measured or derived quantities. They are stored as multidimensional arrays with metadata attributes. +Data variables represent primary measurements or derived quantities. +They are encoded as multidimensional arrays linked to one or more dimensions and accompanied by descriptive metadata. [cols="1,2,2"] |=== -|Feature |Zarr v2 |Zarr v3 +|Aspect |Zarr v2 |Zarr v3 -|Storage | Multidimensional array with `.zarray` and `.zattrs` | Same structure; v3 supports additional chunk storage formats - -|Dimension Association | `_ARRAY_DIMENSIONS` attribute | Same as v2 - -|CF Metadata | `standard_name`, `units`, `long_name`, `_FillValue`, etc. 
| Same as v2; v3 may support typed attributes +|Storage | Directory containing `.zarray` and `.zattrs` | Directory containing `zarr.json` with `"node_type": "array"` +|Dimension Binding | `_ARRAY_DIMENSIONS` attribute | `dimension_names` field +|Metadata | Attributes such as `standard_name`, `units`, `long_name`, `_FillValue`, `scale_factor`, `add_offset` | Same, with typed attributes permitted in v3 |=== Example: [source,json] ---- { - "_ARRAY_DIMENSIONS": ["time", "lat", "lon"], - "standard_name": "air_temperature", - "units": "K", - "long_name": "Surface air temperature", - "_FillValue": -9999.0 + "zarr_format": 3, + "node_type": "array", + "shape": [12, 180, 360], + "dimension_names": ["time", "lat", "lon"], + "attributes": { + "standard_name": "air_temperature", + "units": "K", + "long_name": "Surface air temperature", + "_FillValue": -9999.0 + } } ---- -=== Global Metadata - -Metadata associated with the dataset as a whole is stored at the root group level. +==== Global and Group Metadata +Metadata applying to the entire hierarchy or subgroup is stored at the group level. [cols="1,2,2"] |=== -|Field |Zarr v2 |Zarr v3 - -|Location | `.zattrs` file of root `.zgroup` | `attributes` field in root `zarr.json` - -|Group Identification | `.zgroup` file | `node_type: group` in `zarr.json` +|Aspect |Zarr v2 |Zarr v3 -|CF Conformance | `Conventions` attribute (e.g., `CF-1.10`) | Same, under `attributes` +|Location | `.zattrs` in root group | `attributes` field in root `zarr.json` +|Identification | `.zgroup` file | `"node_type": "group"` +|Conventions | `Conventions` attribute (e.g., `CF-1.10`) | Same under `attributes` |=== -Example Zarr v3 root `zarr.json`: +Example: [source,json] ---- { @@ -144,17 +144,53 @@ Example Zarr v3 root `zarr.json`: } ---- +==== Variable and Attribute Metadata + +All metadata attributes for groups, coordinate variables, and data variables should follow established community naming and typing conventions. 
+GeoZarr encourages CF-compliant naming where applicable but does not require it. + +Attributes shall: +- use UTF-8–encoded names; +- have JSON-compatible values (string, number, boolean, or array); +- remain consistent across group hierarchies. + +Typical attributes include: + +* CF: `standard_name`, `units`, `axis`, `grid_mapping` +* Generic: `_FillValue`, `scale_factor`, `add_offset`, `long_name`, `missing_value` +* GDAL-compatible: `spatial_ref`, `GeoTransform`, `AREA_OR_POINT` + +Structured metadata values, such as JSON or XML content, may be included directly as objects rather than as serialised text. Implementations are encouraged to **store such metadata in deserialised form** (as native JSON objects) whenever possible, ensuring that attributes remain machine-readable and conform to JSON type rules. + +If serialised representations (e.g., XML strings or JSON text blocks) are used, they shall be valid UTF-8 strings and clearly identified by attribute naming or context. + + +==== CDM Encoding Notes and Special Cases + +* **Shared Dimensions** – +To emulate the CDM concept of shared dimensions, GeoZarr requires that identical dimension names across arrays refer to the same logical axis. +Libraries implementing GeoZarr should preserve this relationship explicitly in their in-memory representations. + +* **Unlimited Dimensions** – +Zarr’s chunked structure inherently supports extensible dimensions. +A dimension can be declared unlimited by allowing its corresponding array dimension to grow dynamically (e.g., time). +The use of `"resizeable": true` (Zarr v3) or dynamic chunk append operations is recommended. + +* **Nested Groups and Subgroups** – +Zarr v3 groups may nest recursively. +Each subgroup represents a CDM group and may hold its own variables and attributes. +This structure supports logical organisation such as multiple collections, products, or resolution levels. 
-=== Variables Metadata +==== Metadata Integration for CF, GDAL, and GeoTIFF -All metadata attributes (for groups, coordinates variables and data variables) are recommended to conform to CF naming and typing conventions. Supported attributes include: +While the GeoZarr Data Model provides the structure for metadata storage, **GeoZarr does not redefine how CF, GDAL, or GeoTIFF metadata are mapped into this structure**. +These mappings are well established in community libraries (e.g., *xarray*, *netCDF-Java*, *GDAL*). -- `standard_name`, `units`, `axis`, `grid_mapping` (CF) -- `_FillValue`, `scale_factor`, `add_offset` -- `long_name`, `missing_value` +GeoZarr encoding rules therefore only specify **when a specific Zarr encoding requirement applies**, such as: -In all cases: +* use of `attributes` fields in `zarr.json` for CF or GDAL metadata; +* preservation of key metadata names (`grid_mapping`, `spatial_ref`, `GeoTransform`); +* ensuring metadata values remain valid JSON types. -- Attribute names are case-sensitive and encoded as UTF-8 strings -- Values shall conform to JSON-compatible types (string, number, boolean, array) +Implementations may rely on existing libraries to populate or interpret such metadata consistently. diff --git a/standard/template/sections/clause_9_zarr_encoding_overviews.adoc b/standard/template/sections/clause_9_zarr_encoding_overviews.adoc index b20092e..c367b6d 100644 --- a/standard/template/sections/clause_9_zarr_encoding_overviews.adoc +++ b/standard/template/sections/clause_9_zarr_encoding_overviews.adoc @@ -1,30 +1,30 @@ === Encoding of Multiscale Overviews in Zarr -This clause specifies how multiscale tiling (also known as overviews or pyramids) is encoded in Zarr-based datasets conforming to the unified data model. The encoding supports both Zarr Version 2 and Version 3 and is aligned with the OGC Two Dimensional Tile Matrix Set Standard. 
+This clause specifies how multiscale tiling (also known as overviews or pyramids) is encoded in Zarr hierarchies conforming to the Unified Data Model. The encoding supports both Zarr Version 2 and Version 3 and is aligned with the OGC Two Dimensional Tile Matrix Set Standard.

-Multiscale datasets are composed of a set of Zarr groups representing multiple zoom levels. Each level stores coarser-resolution resampled versions of the original data variables.
+A <> contains one or more child groups, where each child group is a <> representing a zoom level of the data. Additional resolution levels can be added over time, with each new level storing a coarser-resolution resampled version of the original data variables.

 ==== Hierarchical Layout

-Each zoom level SHALL be represented as a Zarr group, identified by the Tile Matrix identifier (e.g., `"0"`, `"1"`, `"2"`). These groups SHALL be organised hierarchically under a common multiscale root group. Each zoom-level group SHALL contain the complete set of variables (Zarr arrays) corresponding to that resolution.
+Each zoom level SHALL be represented as a child group, identified by the Tile Matrix identifier (e.g., `"0"`, `"1"`, `"2"`). These child groups SHALL be organised hierarchically under a common multiscale group and each SHALL be a <> containing the complete set of variables (arrays) corresponding to that resolution. All zoom-level datasets MUST maintain a consistent structure.

 [cols="1,2,2"]
 |===
 |Structure |Zarr v2 |Zarr v3

-|Zoom level groups | Subdirectories with `.zgroup` and `.zattrs` | Subdirectories with `zarr.json`, `node_type: group`
+|Zoom level datasets | Subdirectories with `.zgroup` and `.zattrs` | Subdirectories with `zarr.json`, `node_type: group`

-|Variables at each level | Zarr arrays (`.zarray`, `.zattrs`) in each group | Zarr arrays (`zarr.json`, `node_type: array`) in each group
+|Variables at each level | Arrays (`.zarray`, `.zattrs`) in each dataset | Arrays (`zarr.json`, `node_type: array`) in each dataset

-|Global metadata | `multiscales` defined in parent `.zattrs` | `multiscales` defined in parent group `zarr.json` under `attributes`
+|Multiscale metadata | `multiscales` defined in multiscale group `.zattrs` | `multiscales` defined in multiscale group `zarr.json` under `attributes`
 |===

-Each multiscale group MUST define chunking (tiling) along the spatial dimensions (`X`, `Y`, or `lon`, `lat`). Recommended chunk sizes are 256×256 or 512×512.
+Each zoom-level dataset MUST define chunking (tiling) along the spatial dimensions (`X`, `Y`, or `lon`, `lat`). Recommended chunk sizes are 256×256 or 512×512.

 ==== Metadata Encoding

-Multiscale metadata SHALL be defined using a `multiscales` attribute located in the parent group of the zoom levels. This attribute SHALL be a JSON object with the following members:
+Multiscale metadata SHALL be defined using a `multiscales` attribute located in the multiscale group. This attribute SHALL be a JSON object with the following members:

 - `tile_matrix_set` – Identifier, URI, or inline JSON object compliant with OGC TileMatrixSet v2
 - `resampling_method` – One of the standard string values (e.g., `"nearest"`, `"average"`)
@@ -98,4 +98,3 @@ The `resampling_method` MUST indicate the method used for downsampling across zo
 `nearest`, `average`, `bilinear`, `cubic`, `cubic_spline`, `lanczos`, `mode`, `max`, `min`, `med`, `sum`, `q1`, `q3`, `rms`, `gauss`

 The same method MUST apply across all levels.
-
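
The `multiscales` rules above could be checked along the following lines. This is a hedged sketch rather than a reference validator: the function names `validate_multiscales` and `inconsistent_levels` are hypothetical, the group attributes are modelled as an already-parsed dictionary, and only the two members visible in this excerpt (`tile_matrix_set` and `resampling_method`) are examined. The set of method names is copied from the list above. The zoom-level check assumes that a "consistent structure" at least means every level holds the same variable names.

```python
# Method names copied from the list of permitted `resampling_method` values.
RESAMPLING_METHODS = frozenset({
    "nearest", "average", "bilinear", "cubic", "cubic_spline", "lanczos",
    "mode", "max", "min", "med", "sum", "q1", "q3", "rms", "gauss",
})


def validate_multiscales(multiscales):
    """Return a list of problems found in a parsed `multiscales` attribute."""
    if not isinstance(multiscales, dict):
        return ["multiscales must be a JSON object"]

    problems = []

    # An identifier or URI is a string; an inline definition is an object.
    tile_matrix_set = multiscales.get("tile_matrix_set")
    if not isinstance(tile_matrix_set, (str, dict)) or not tile_matrix_set:
        problems.append(
            "tile_matrix_set must be an identifier, URI, or inline object"
        )

    method = multiscales.get("resampling_method")
    if not isinstance(method, str) or method not in RESAMPLING_METHODS:
        problems.append(f"resampling_method {method!r} is not a listed value")

    return problems


def inconsistent_levels(levels):
    """Return zoom-level identifiers whose variable names differ from the first level.

    `levels` maps a Tile Matrix identifier to the variable names in that level.
    """
    level_ids = list(levels)
    if not level_ids:
        return []
    reference = set(levels[level_ids[0]])
    return [
        level_id for level_id in level_ids[1:]
        if set(levels[level_id]) != reference
    ]


# Illustrative inputs only; the variable names are made up.
good = {"tile_matrix_set": "WebMercatorQuad", "resampling_method": "average"}
bad = {"tile_matrix_set": "WebMercatorQuad", "resampling_method": "median"}
levels = {"0": ["b02", "b03"], "1": ["b03", "b02"], "2": ["b02"]}

good_problems = validate_multiscales(good)
bad_problems = validate_multiscales(bad)
odd_levels = inconsistent_levels(levels)
print(good_problems, bad_problems, odd_levels)
```

The `bad` example uses `"median"`, which is rejected because the listed spelling is `med`. Because a single `resampling_method` member sits on the multiscale group, the requirement that the same method apply across all levels is satisfied by construction in this model.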