timedelta64 #12

Merged
merged 10 commits into from
May 15, 2025

d-v-b
Contributor

@d-v-b d-v-b commented May 6, 2025

This PR adds timedelta64, based on the data type with the same name defined in numpy.

Zarr v2 deferred to numpy's data type semantics, which means that Zarr v2 users could transparently create arrays using numpy's timedelta64 data type. The data type defined in this PR enables the same usage pattern for Zarr v3. This will be valuable for Zarr v2 users who intend to migrate their data to Zarr v3, or numpy users who want a simple way to store their data using Zarr v3 arrays.

Thus, the goal of this PR is not to specify an excellent data type for representing temporal durations. We should evaluate this spec based on how well it captures the semantics already defined by the numpy timedelta64 data type.

partially addresses #11

Comment on lines 85 to 91
## Fill value representation

`timedelta64` fill values are represented as one of:
- a JSON number with no fraction or exponent part that is within the range `[-2^63, 2^63 - 1]`.
- the string `"NaT"`, which denotes the value `NaT`.

> Note: the `NaT` value may optionally be encoded as the JSON number `-9223372036854775808`, i.e., `-2^63`.

@rabernat rabernat May 6, 2025

I don't understand this part. Here it seems like the user can configure their own custom fill value? If so, shouldn't fill_value be in configuration? And what would be the use case for that?

Isn't it simpler if we just say that, like numpy, the integer -2^63 represents NaT?

Contributor Author

what I'm trying to say here is that the following two cases are the only acceptable fill values:

"fill_value" : <a JSON integer in the range [-2^63, 2^63 - 1]>
"fill_value" : "NaT"

With one degenerate case:

"fill_value": "NaT"

has the same meaning as

"fill_value": -9223372036854775808

Contributor Author

the statement "timedelta64 fill values are represented as one of" is intended to mean "there are two possible forms for the "fill_value" metadata". Maybe I should make this clear. I definitely don't want to convey that users can configure a custom fill value.
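
A minimal sketch of how a reader might normalize the two accepted "fill_value" forms, assuming the value has already been decoded from JSON. `decode_fill_value` is a hypothetical helper, not part of zarr or numpy; it maps both forms to the signed 64-bit integer behind the value, treating "NaT" as -2^63 as described above.

```python
import json

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def decode_fill_value(raw):
    """Map a decoded JSON fill value to its signed 64-bit integer.

    Hypothetical helper for illustration only.
    """
    if raw == "NaT":
        # "NaT" and -2^63 are two spellings of the same value.
        return INT64_MIN
    # bool is a subclass of int in Python, so reject it explicitly.
    # Floats (JSON numbers with a fraction or exponent) are rejected too.
    if isinstance(raw, bool) or not isinstance(raw, int):
        raise TypeError(f'fill value must be a JSON integer or "NaT", got {raw!r}')
    if not INT64_MIN <= raw <= INT64_MAX:
        raise ValueError(f"fill value {raw} is outside [-2^63, 2^63 - 1]")
    return raw


# Both spellings of NaT decode to the same integer.
nat_from_string = decode_fill_value(json.loads('{"fill_value": "NaT"}')["fill_value"])
nat_from_number = decode_fill_value(
    json.loads('{"fill_value": -9223372036854775808}')["fill_value"]
)
print(nat_from_string == nat_from_number)  # True
```

Because Python's `json` module decodes numbers with a fraction or exponent as `float`, the type check also enforces the "no fraction or exponent part" rule from the quoted spec text.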

@normanrz
Member

normanrz commented May 6, 2025

This PR looks good. Are you ready to have it merged or are you still looking for more feedback from the community?

@rabernat rabernat left a comment

I think I get it now. Thanks for walking me through the fill value stuff.

Comment on lines 45 to 46
| Y | year |
| M | month |

Just noting that year and month are super problematic as units because they don't actually have a fixed duration (leap years, variable months). I would hate to see us proliferating data with this encoding into the world. But I guess if the goal is numpy compatibility, we should leave them in.
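
To make the concern concrete, here is a small sketch using only Python's standard `calendar` module. It shows that "one month" and "one year" do not correspond to a single number of days; the helper names are hypothetical and chosen just for this illustration.

```python
import calendar


def days_in_month(year, month):
    # monthrange returns (weekday of the first day, number of days in the month).
    return calendar.monthrange(year, month)[1]


def days_in_year(year):
    return 366 if calendar.isleap(year) else 365


# Collect the distinct lengths seen across a common year and a leap year.
month_lengths = sorted(
    {days_in_month(y, m) for y in (2023, 2024) for m in range(1, 13)}
)
year_lengths = sorted({days_in_year(y) for y in (2023, 2024)})

print(month_lengths)  # [28, 29, 30, 31]
print(year_lengths)   # [365, 366]
```

A duration stored with the `M` or `Y` unit therefore cannot be converted to days or seconds without knowing which calendar months or years it spans.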

Contributor Author

100% agree that the numpy definition is problematic. But I think there's value in a data type that numpy users (or zarr v2 users) can adopt without thinking. We should specify a less problematic, more generally useful datetime data type in a separate PR.

Member

Maybe it would be useful to rename this data type to numpy.timedelta64 to signal the intent that it is only meant for compatibility?

Contributor Author

numpy.timedelta64 is actually my preferred name, but iirc @rabernat was not a fan.

Contributor Author

since this naming concern affects all the numpy dtypes, we should resolve that conversation in #4.

@jbms
Contributor

jbms commented May 6, 2025

For general use I'd suggest a more general "unit" mechanism rather than a data type but this seems reasonable for numpy compatibility.

Note that "year" and "month" still seem like very plausibly useful units even though they can't be precisely converted to seconds --- for example you may have a table listing the ages of people in years, or of children/infants in months. The source data may well not contain any more precise information anyway.

Technically this issue also exists with every other unit because datetime64 excludes leap seconds.

@d-v-b
Contributor Author

d-v-b commented May 6, 2025

> This PR looks good. Are you ready to have it merged or are you still looking for more feedback from the community?

I think we should keep this open for a few days at a minimum. I'm very open to feedback on certain things (e.g., should it be named timedelta64 or numpy.timedelta64), and there's not really a rush to get this merged from my POV.

@d-v-b d-v-b mentioned this pull request May 6, 2025
@d-v-b
Contributor Author

d-v-b commented May 9, 2025

this data type is now identified as numpy.timedelta64.

@normanrz normanrz merged commit b256e2b into zarr-developers:main May 15, 2025