
H5json#420

Draft
jreadey wants to merge 51 commits into master from h5json


@jreadey (Member) commented Apr 23, 2025

Use h5json package for typing and objids

Important

Migrated HSDS to use the h5json library for core utilities, restructured utility modules, added support for client-provided object IDs and timestamps, and updated dependencies to require Python 3.10+ with h5json 1.0.0+.

Library Migration and Utility Restructuring

  • h5json Library Integration: Migrated from local utility modules to h5json library for data type, array, object ID, shape, dataset, filter, link, and time utilities across 30+ files.
  • Deleted Utility Modules: Removed hsds/util/idUtil.py, hsds/util/timeUtil.py, hsds/util/hdf5dtype.py, and hsds/util/arrayUtil.py as their functionality is now provided by h5json.
  • New Utility Module: Created hsds/util/nodeUtil.py with node ID generation, partitioning, and datanode URL resolution functions.
  • Updated Imports: Changed all references from local util modules to h5json equivalents (e.g., util.idUtil → h5json.objid, util.timeUtil → h5json.time_util).

Object ID and Timestamp Handling

  • Client-Provided Object IDs: Added support for creating objects with client-specified IDs in POST_Dataset, POST_Group, POST_Datatype, and related functions in dset_dn.py, group_dn.py, ctype_dn.py, and dset_sn.py.
  • Timestamp Validation: Added max_timestamp_drift configuration parameter to validate client-provided timestamps in attr_dn.py, link_dn.py, and related modules, with fallback to server-generated timestamps when skew exceeds threshold.
  • Deleted Object Tracking: Added logic to check and remove previously deleted object IDs from deleted_ids set when creating new objects with the same ID.
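The timestamp fallback described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function name `resolve_timestamp` is hypothetical, and treating the skew as an absolute difference in seconds is an assumption.

```python
import time


def resolve_timestamp(client_ts, max_drift, now=None):
    """Pick the timestamp to store for a new object, attribute, or link.

    Hypothetical helper: returns the client-provided timestamp when it is
    within max_drift seconds of the server clock, otherwise the server time.
    """
    if now is None:
        now = time.time()  # server-generated timestamp
    if client_ts is None:
        return now  # client did not supply a timestamp
    # Assumption: skew is measured as an absolute difference.
    if abs(now - client_ts) > max_drift:
        return now  # skew exceeds the threshold, fall back to server time
    return client_ts
```

Passing `now` explicitly keeps the helper easy to test without depending on the wall clock.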

Configuration and Dependencies

  • New Configuration Parameters: Added default_vlen_type_size, predate_maxtime, posix_delay, max_compact_dset_size, and max_timestamp_drift to admin/config/config.yml.
  • Updated Dependencies: Modified pyproject.toml to require Python 3.10+, add h5json 1.0.0+, update numpy to 2.0.0+, and constrain numcodecs to ≤0.15.1.
  • Removed Python 3.9: Removed Python 3.9 from CI/CD test matrix in .github/workflows/python-package.yml.

API and Function Refactoring

  • Object Creation Functions: Refactored POST_Dataset, POST_Group, and POST_Datatype handlers to support batch creation of multiple objects using new helper functions (createDatasets, createGroups, createDatatypeObjs) and DomainCrawler for writing initial data.
  • Layout Handling: Changed getChunkLayout calls to getChunkDims throughout codebase; moved layout from top-level response to nested under creationProperties.
  • Link Handling: Changed external link field from h5domain to file in link_dn.py, link_sn.py, and servicenode_lib.py; added per-link timestamp validation in PUT_Links.
  • Attribute Initialization: Added support for initializing attributes from request body in POST_Dataset, POST_Group, and POST_Datatype instead of always creating empty objects.
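For client code that must work against both the old and new response shapes, the layout and external-link changes above might be handled with small accessors like these. The helper names are hypothetical and the sample JSON is illustrative only; the exact response contents are an assumption.

```python
def get_layout(dset_json):
    """Return the layout dict from a dataset response (hypothetical helper).

    Prefers the new location nested under creationProperties and falls back
    to the older top-level key.
    """
    creation_props = dset_json.get("creationProperties") or {}
    if "layout" in creation_props:
        return creation_props["layout"]
    return dset_json.get("layout")


def get_external_target(link_json):
    """Return an external link's target (hypothetical helper).

    Accepts the new "file" key and the older "h5domain" key.
    """
    if "file" in link_json:
        return link_json["file"]
    return link_json.get("h5domain")


# Illustrative payloads, not captured from a real server.
new_style = {"creationProperties": {"layout": {"class": "H5D_CHUNKED", "dims": [10, 10]}}}
old_style = {"layout": {"class": "H5D_CHUNKED", "dims": [10, 10]}}
```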

New Functionality

  • PostCrawler Class: Added hsds/post_crawl.py with PostCrawler class for asynchronously creating multiple HDF5 objects with configurable worker count and error handling.
  • Domain Metadata Consolidation: Added getConsolidatedMetaData function in async_lib.py to create consolidated metadata summaries for all objects in a domain.
  • Data Writing: Added put_data method to DomainCrawler for writing one-chunk dataset values; added doPointWrite and doHyperslabWrite functions in dset_lib.py for writing point and hyperslab selections.
  • Domain Objects Retrieval: Added getobjs parameter to getDomainResponse function to optionally return domain objects from S3 summary file.
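The general pattern behind a crawler with a configurable worker count and per-item error handling can be sketched with an asyncio queue. This is not the PostCrawler implementation; `create_all` and its parameters are hypothetical names chosen for illustration.

```python
import asyncio


async def create_all(items, create_one, max_workers=4):
    """Run create_one(item) for every item with bounded concurrency.

    Returns (results, errors), two dicts keyed by the item's index.
    A failure in one item is recorded and does not stop the other workers.
    """
    queue = asyncio.Queue()
    for index, item in enumerate(items):
        queue.put_nowait((index, item))
    results, errors = {}, {}

    async def worker():
        while True:
            try:
                index, item = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # no work left for this worker
            try:
                results[index] = await create_one(item)
            except Exception as exc:
                errors[index] = exc

    # Never start more workers than there are items (but at least one).
    worker_count = max(1, min(max_workers, len(items)))
    await asyncio.gather(*(worker() for _ in range(worker_count)))
    return results, errors
```

Collecting errors per item rather than raising on the first failure lets the caller decide whether a partial batch is acceptable.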

Bug Fixes and Improvements

  • Typo Fixes: Fixed multiple typos including "coniguous" → "contiguous", "seperated" → "separated", "heirarchy" → "hierarchy", "inital" → "initial", and various attribute/link-related typos.
  • Error Handling: Changed error responses from HTTPInternalServerError to HTTPBadRequest for duplicate object IDs and invalid configurations in ctype_dn.py, dset_dn.py, and group_dn.py.
  • Logging Improvements: Added debug logging for request bodies, object creation, and metadata processing; updated log message prefixes for consistency.
  • POSIX Delay Support: Added posix_delay configuration support to fileClient.py for simulating cloud storage latencies in get_object, put_object, and list_keys methods.
  • Version Update: Updated HSDS_VERSION from 0.9.2 to 1.0.0 in basenode.py.
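The idea behind the posix_delay setting, adding an artificial pause before a local storage call so it behaves more like cloud storage, can be sketched as a thin wrapper. The wrapper name and signature are hypothetical and do not reflect how fileClient.py is structured.

```python
import asyncio


async def with_simulated_latency(storage_call, *args, posix_delay=0.0, **kwargs):
    """Await storage_call(*args, **kwargs) after an optional artificial delay.

    Hypothetical wrapper: posix_delay is the number of seconds to sleep
    before the call; a value of 0 disables the delay.
    """
    if posix_delay > 0:
        await asyncio.sleep(posix_delay)
    return await storage_call(*args, **kwargs)
```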

Test Updates

  • New Test Methods: Added tests for client-provided object IDs (testPostDatasetWithId, testPostTypeWithId, testPostWithId), attribute initialization (testPostDatasetWithAttributes, testPostWithAttributes), timestamp handling (testUseTimestamp), and batch creation (testPostMulti, testDatasetPostMulti).
  • Test Refactoring: Updated tests to access layout from creationProperties instead of top-level; removed CHUNK_MIN/CHUNK_MAX constants and moved them to local scope; updated external link tests to use file field instead of h5domain.
  • Removed Tests: Deleted array_util_test.py, hdf5_dtype_test.py, and id_util_test.py as their functionality is now tested through h5json library.
  • Import Updates: Updated test imports to use h5json functions (e.g., createObjId, getFilterItem) instead of local utilities.

This description was created by Ellipsis for 2bafb51.

Contributor

Since we now depend on hdf5-json to do this testing, it might be a good idea to include hdf5-json's tests as a step in the CI.

mattjala previously approved these changes May 7, 2025

mattjala (Contributor) left a comment

Besides a few minor comments and questions, this is good to go in. I'll try to get the outstanding PRs on hdf5-json reviewed this week so that we can avoid having HSDS depend on a specific branch.

hsds/group_sn.py Outdated
created = link_item["created"]
# allow "pre-dated" attributes if recent enough
predate_max_time = config.get("predate_max_time", default=10.0)
if now - created > predate_max_time:
Contributor

This comparison seems backwards. If I understand correctly, the difference between the current time and the creation time should be under the max time, not above it.
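To make the two readings concrete: if the branch guarded by the condition is the rejection path (an assumption, since the branch body is not shown in the snippet), then `now - created > predate_max_time` already accepts only timestamps whose age is at or under the limit. A small sketch with a hypothetical function name:

```python
def is_too_old(now, created, predate_max_time=10.0):
    """True when a client-supplied creation time falls outside the window.

    Hypothetical helper mirroring the comparison in the snippet: the age
    (now - created) must not exceed predate_max_time seconds.
    """
    return now - created > predate_max_time


# A timestamp 5 seconds old is inside a 10 second window; 20 seconds is not.
recent_rejected = is_too_old(now=100.0, created=95.0)
stale_rejected = is_too_old(now=100.0, created=80.0)
```

Under this reading, "pre-dated" timestamps are accepted exactly when `is_too_old` returns False.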

@mattjala (Contributor) commented Jun 3, 2025

@jreadey Linter issues are preventing CI from running on this right now

@mattjala (Contributor) commented Sep 9, 2025

@jreadey Should we mark this as draft until it's in a final state for review?

@jreadey marked this pull request as draft September 10, 2025 14:23


3 participants