
Epic: Read and write LinkML schema for HDMF (CSRMatrix / base.yaml) #1486

Description

@rly

Background

HDMF defines data standards using a custom schema language (the HDMF Schema Language, or HDMFSL) developed roughly 10 years ago, when no equivalent tooling existed. That custom language and its supporting machinery (schema handling, validation, documentation) carry a real maintenance cost: they create a barrier to entry for new contributors, make it expensive to add schema features that users request, and isolate HDMF-based standards from the broader data-modeling and scientific data ecosystem.

Since then, LinkML has emerged as a modern, widely adopted data modeling language. It is human-readable and machine-processable, integrates with ontologies, supports validation of complex data structures, and interoperates with other schema languages (JSON Schema) and data formats (JSON, YAML, RDF, OWL, SQL). It is used across biomedical projects, such as the NCATS Biomedical Data Translator and the Alliance of Genome Resources, and it is being adopted by the DANDI Archive, BICAN, and openMINDS, with the Allen Institute for Neural Dynamics and OME-Zarr considering adoption. The LinkML Arrays Working Group has added a metamodel for labeled multi-dimensional arrays, which closes the main historical gap between LinkML and HDMF's array-centric modeling.

This epic is the first step in a multi-epic effort to add LinkML support to HDMF. The work is funded by NIH National Library of Medicine grant 1R03LM014996-01 (PI: Ruebel) and falls under Aim 1 of that grant (enhancing interoperability of HDMF data models).

This epic makes a small, self-contained set of types (CSRMatrix and the base types) fully bidirectional between HDMFSL and LinkML. Later epics extend the conventions, reader, and writer to the remaining HDMFSL constructs (links, references, compound dtypes, the DynamicTable family, inheritance roll-down) so that the rest of the HDMF Common and NWB types are covered, and eventually adopt LinkML as the primary schema language.

Goal

Enable HDMF to both read LinkML schema into the existing HDMF Spec classes (GroupSpec, DatasetSpec, AttributeSpec, etc.) and write those Spec objects back out to LinkML, for CSRMatrix and the base types. Spec objects are HDMF's in-memory Python representation of a schema: the objects the rest of HDMF uses to read and write data. Because both directions pass through Spec, it acts as the bridge between the two schema languages. The user-facing experience for reading and writing schema stays the same; LinkML is added as an additional on-disk schema language that HDMF can load and emit. Reading lets HDMF consume LinkML schemas; writing lets HDMF's schema API output LinkML and enables migrating existing HDMFSL schemas to LinkML.

Why Spec: HDMF's runtime (ObjectMapper, BuildManager, TypeMap, validator, dynamic container generation) already consumes Spec, and HDMFSL must keep reading into Spec for backwards compatibility. Reading LinkML into Spec therefore makes LinkML schemas drive real data I/O with no runtime changes. Spec is a deliberately transitional hub, not the permanent center: it mirrors HDMFSL one-to-one and cannot represent LinkML features beyond it. Making it the permanent hub would cap LinkML at HDMFSL's expressiveness and defeat the reasons for adopting LinkML, so a future epic moves the runtime off Spec (likely onto the planned Pydantic models).

Approach and sequencing

We are taking the work smallest-example-first, so that the mapping conventions are settled and validated against a tiny, self-contained type before they propagate into the full HDMF Common and NWB schemas. We do reading before writing: reading can be validated against schema HDMF already loads natively (an unambiguous correctness check), the conventions and fixtures it produces are reused by the writer, and the Spec → LinkML → Spec round-trip then confirms the two directions agree.
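The round-trip check described above can be sketched with a toy model. Everything in this block is an illustrative assumption: `ToyDatasetSpec` is a hypothetical stand-in for HDMF's DatasetSpec, and the slot layout (`range`, `array`, `dimensions`, `exact_cardinality`) only approximates the LinkML array metamodel. The actual mapping is what this epic is meant to decide.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyDatasetSpec:
    """Hypothetical stand-in for a dataset spec; not the HDMF class."""
    name: str
    doc: str
    dtype: Optional[str] = None             # None models an untyped dataset
    shape: Tuple[Optional[int], ...] = ()   # None models an unspecified dimension


def to_linkml_slot(spec: ToyDatasetSpec) -> dict:
    """Write direction: toy spec -> LinkML-like slot dict (assumed layout)."""
    slot = {"description": spec.doc}
    if spec.dtype is not None:
        slot["range"] = spec.dtype
    if spec.shape:
        slot["array"] = {
            "dimensions": [
                {} if n is None else {"exact_cardinality": n} for n in spec.shape
            ]
        }
    return slot


def from_linkml_slot(name: str, slot: dict) -> ToyDatasetSpec:
    """Read direction: LinkML-like slot dict -> toy spec."""
    dims = slot.get("array", {}).get("dimensions", [])
    shape = tuple(d.get("exact_cardinality") for d in dims)
    return ToyDatasetSpec(
        name=name, doc=slot["description"], dtype=slot.get("range"), shape=shape
    )


def round_trips(spec: ToyDatasetSpec) -> bool:
    """True when Spec -> LinkML -> Spec returns an equal spec."""
    return from_linkml_slot(spec.name, to_linkml_slot(spec)) == spec
```

In the real tests the comparison would be between Spec objects loaded natively from HDMFSL and Spec objects produced by the LinkML reader, rather than between toy dataclasses.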

The first concrete target is CSRMatrix (from sparse.yaml) together with the base types in base.yaml (Data, Container, SimpleMultiContainer). Together they exercise every construct in initial scope: a group data_type_def that inherits from Container (data_type_inc in HDMFSL, is_a in LinkML terms), an attribute, typed and untyped datasets, and fixed and unspecified array shapes. They are also a good onboarding vehicle for a new contributor learning HDMF and LinkML.
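For orientation, here is the CSRMatrix definition written as a plain Python dict in HDMFSL's shape, with a small helper that lists which in-scope constructs it exercises. The dict is paraphrased: doc strings are shortened and the `dims` labels are omitted, so sparse.yaml remains the authoritative definition. HDMFSL spells inheritance as `data_type_inc`. The helper `constructs_exercised` is hypothetical and exists only for this illustration.

```python
# Paraphrased from sparse.yaml; docs shortened and "dims" labels omitted.
CSR_MATRIX = {
    "data_type_def": "CSRMatrix",
    "data_type_inc": "Container",  # inheritance from a base type
    "doc": "A compressed sparse row matrix.",
    "attributes": [
        # Fixed shape: exactly two entries (rows, columns).
        {"name": "shape", "dtype": "uint", "shape": [2],
         "doc": "The shape of this sparse matrix."},
    ],
    "datasets": [
        # Typed datasets with one dimension of unspecified length.
        {"name": "indices", "dtype": "uint", "shape": [None],
         "doc": "The column indices."},
        {"name": "indptr", "dtype": "uint", "shape": [None],
         "doc": "The row index pointer."},
        # Untyped dataset: no "dtype" key at all.
        {"name": "data", "shape": [None],
         "doc": "The non-zero values in the matrix."},
    ],
}


def constructs_exercised(group: dict) -> dict:
    """Report which initial-scope constructs a group definition uses."""
    datasets = group.get("datasets", [])
    fields = group.get("attributes", []) + datasets
    dims = [dim for field in fields for dim in field.get("shape", [])]
    return {
        "inherits_from": group.get("data_type_inc"),
        "has_attribute": bool(group.get("attributes")),
        "typed_dataset": any("dtype" in d for d in datasets),
        "untyped_dataset": any("dtype" not in d for d in datasets),
        "fixed_dimension": any(isinstance(n, int) for n in dims),
        "unspecified_dimension": any(n is None for n in dims),
    }
```

Each entry in the returned dict corresponds to a construct that the mapping conventions have to cover before the reader or writer can handle this type.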

We package these in a minimal test namespace: a namespace.yaml that references only base.yaml and sparse.yaml. This keeps out-of-scope schema files (such as table.yaml's DynamicTable family and resources.yaml) out of the slice, is self-contained (no cross-namespace imports), and lets the reader and writer run through HDMF's real namespace/catalog path rather than loading bare schema files.
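A minimal sketch of what such a namespace document looks like once loaded, plus a guard that enforces the two properties named above. The key names (`namespaces`, `schema`, `source`, `namespace`) follow HDMF's existing namespace format; the namespace name, version, and the helper `check_self_contained` are hypothetical placeholders.

```python
TEST_NAMESPACE = {
    "namespaces": [
        {
            "name": "hdmf-linkml-test",  # hypothetical name
            "doc": "Minimal namespace for the CSRMatrix / base.yaml slice.",
            "version": "0.1.0",          # hypothetical version
            "schema": [
                {"source": "base.yaml"},
                {"source": "sparse.yaml"},
            ],
        }
    ]
}

ALLOWED_SOURCES = {"base.yaml", "sparse.yaml"}


def check_self_contained(namespace_doc: dict) -> list:
    """Return the schema sources, rejecting anything outside the slice."""
    sources = []
    for ns in namespace_doc["namespaces"]:
        for entry in ns["schema"]:
            # A "namespace" entry is a cross-namespace import: out of scope.
            if "namespace" in entry:
                raise ValueError(
                    f"{ns['name']}: cross-namespace import of "
                    f"{entry['namespace']!r} is out of scope"
                )
            if entry["source"] not in ALLOWED_SOURCES:
                raise ValueError(
                    f"{ns['name']}: schema file {entry['source']!r} "
                    "is outside the test slice"
                )
            sources.append(entry["source"])
    return sources
```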

Relationship to p2p-ld/nwb-linkml

The p2p-ld/nwb-linkml project already translates NWB/HDMF schema to LinkML and is a valuable reference for mapping conventions and linkml-runtime API usage. We are not reusing its code. It is GPLv3 while HDMF is BSD, so copying is not permitted; it is also a one-way pipeline (NWB → LinkML → its own Pydantic stack) that never reads LinkML back into HDMF Spec classes. Its conventions may not fully suit the breadth of use cases and the backwards compatibility that HDMF and NWB require, so we will review the decisions made in nwb-linkml and borrow ideas where they fit.

Scope

In scope for this epic (CSRMatrix / base.yaml):

  • Decide and document the HDMF ↔ LinkML mapping conventions, covering both type-level constructs and the namespace level (the namespace declaration, its schema files, metadata, and versioning).
  • Hand-author the LinkML translation of a minimal test namespace (a namespace.yaml referencing only base.yaml and sparse.yaml) and those two schema files.
  • Implement a reader that builds HDMF Spec objects from the LinkML test namespace via linkml-runtime's SchemaView, loading through the namespace/catalog path, behind an optional linkml dependency group.
  • Implement a writer that serializes Spec objects to LinkML, exposed through HDMF's schema export API.
  • Verify the two directions agree via a Spec → LinkML → Spec round-trip.

The conventions and fixtures are bidirectional and serve both the reader and the writer.

Explicitly out of scope for this epic:

  • Converting the full HDMF Common and NWB schemas to LinkML, and migration tooling/documentation for community extensions (later work).
  • Adopting LinkML as the default schema language and deprecating HDMFSL (a later-stage goal of the grant).
  • HDMFSL features not needed for CSRMatrix / base.yaml: links, references, and compound dtypes. (Needed later to cover the full HDMF Common and NWB schemas.)
  • Cross-namespace imports (one namespace importing another). The test namespace is self-contained; this is needed later for namespaces like hdmf-experimental that import hdmf-common.
  • All LinkML features with no HDMFSL equivalent (enums, ontology URIs such as class_uri / slot_uri, rules, conditional/cross-field validation, mixins, abstract classes). Deferred as a deliberate policy: they cannot be represented in Spec, and modeling them there would regrow the custom system this effort is retiring. They become available when a future epic moves the runtime off Spec.

The reader and writer must not add LinkML-shaped fields to Spec. Constructs that Spec cannot represent are preserved as annotations or cause a loud failure, never silently dropped.
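One way to picture the "preserve or fail loudly" rule is a three-way split over the keys of a LinkML class definition. This is a sketch under assumptions: which keys land in which bucket is for the conventions document to decide, and the sets, exception name, and function below are all hypothetical.

```python
# Hypothetical buckets; the real assignment belongs to the mapping conventions.
MAPPED_TO_SPEC = {"is_a", "description", "attributes"}
PRESERVED_AS_ANNOTATION = {"comments", "see_also"}


class UnsupportedLinkMLConstruct(ValueError):
    """Raised when a construct can be neither mapped nor preserved."""


def split_class_definition(name: str, class_def: dict):
    """Split a LinkML class dict into Spec-mappable keys and annotations.

    Any key in neither bucket raises, so nothing is silently dropped.
    """
    mapped, annotations = {}, {}
    for key, value in class_def.items():
        if key in MAPPED_TO_SPEC:
            mapped[key] = value
        elif key in PRESERVED_AS_ANNOTATION:
            annotations[key] = value
        else:
            raise UnsupportedLinkMLConstruct(
                f"class {name!r}: LinkML key {key!r} has no Spec equivalent"
            )
    return mapped, annotations
```

The design point is that the final `else` branch raises rather than skips, so a schema that uses, say, `mixins` is rejected up front instead of loading with part of its meaning missing.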

Follow-on epics

  • Support the remaining HDMFSL constructs. Extend the conventions, reader, and writer to links, references, compound dtypes, the DynamicTable family, and inheritance roll-down, so the rest of the HDMF Common and NWB types are covered.

Success criteria

  • Mapping conventions are documented and signed off, covering both type-level and namespace-level mapping.
  • HDMF reads the LinkML test namespace into Spec objects that match those loaded from the existing HDMF schema via the namespace/catalog path.
  • HDMF writes those Spec objects back to LinkML matching the fixtures, and Spec → LinkML → Spec round-trips faithfully.
  • The LinkML dependency remains optional; the core install is unaffected.
  • Work is covered by tests consistent with HDMF's existing coverage standards.
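The optional-dependency criterion is commonly met with an import guard that only fires when a LinkML code path is actually used. The sketch below is one such pattern using the standard library; the helper name `require_optional` and the extras name in the error message are assumptions, not existing HDMF API.

```python
import importlib
import importlib.util


def require_optional(module_name: str, extra: str):
    """Import an optional dependency or raise an actionable ImportError.

    `extra` names the optional dependency group to suggest (assumed here).
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install the optional '{extra}' dependency group"
        )
    return importlib.import_module(module_name)


def load_linkml_runtime():
    """Called lazily by LinkML code paths only, never at package import."""
    return require_optional("linkml_runtime", extra="linkml")
```

Because the check runs inside the function rather than at module import time, a core install without LinkML can still import the package and use every HDMFSL code path.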
