Naming the dtypes from zarr-python#2874 #4

Open
jhamman opened this issue Apr 11, 2025 · 5 comments · May be fixed by #5

jhamman commented Apr 11, 2025

zarr-developers/zarr-python#2874 proposes adding a number of new extension data types to Zarr-Python. Currently, the dtype names that @d-v-b uses in that PR carry a numpy. prefix. This issue proposes removing that prefix. The table below lists the current and proposed names:

| Dtype class | Current name in zarr-developers/zarr-python#2874 | Proposed dtype extension name |
| --- | --- | --- |
| FixedLengthAscii | numpy.fixed_length_ascii | fixed-length-ascii |
| FixedLengthBytes | numpy.void | fixed-length-bytes |
| FixedLengthUnicode | numpy.fixed_length_ucs4 | fixed-length-ucs4 |
| DateTime64 | numpy.datetime64 | datetime64 |

Notes:

  • The PR also includes a Structured Dtype that I am intentionally leaving out of this issue to keep the scope manageable.
  • #1 (adds extensions that zarr-python defines) adds the variable-length string and bytes dtypes already implemented in Zarr-Python 3.
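
As a rough illustration of the renaming in the table above, the mapping could be written as a plain lookup. This is only a sketch: `PROPOSED_NAMES` and `proposed_name` are hypothetical helpers for this discussion, not part of Zarr-Python.

```python
# Hypothetical lookup from the names used in zarr-python#2874 to the
# names proposed in this issue. Not part of Zarr-Python.
PROPOSED_NAMES = {
    "numpy.fixed_length_ascii": "fixed-length-ascii",
    "numpy.void": "fixed-length-bytes",
    "numpy.fixed_length_ucs4": "fixed-length-ucs4",
    "numpy.datetime64": "datetime64",
}


def proposed_name(current_name: str) -> str:
    """Return the proposed extension name, or the input unchanged if unknown."""
    return PROPOSED_NAMES.get(current_name, current_name)


print(proposed_name("numpy.void"))  # fixed-length-bytes
```

Note that the rename is not purely mechanical (numpy.void becomes fixed-length-bytes), which is why the sketch uses an explicit table rather than string manipulation.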

Questions:

cc: @d-v-b, @normanrz, @rabernat

d-v-b commented Apr 11, 2025

we should also consider the configuration for these data types.

  • the fixed-length data types must all be parametrized by their length in bytes (or bits)
  • the datetime64 dtype is parametrized by the unit, which is a string defined here
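
To make the parametrization concrete, here is a minimal sketch of what the data type metadata might look like. The `{"name": ..., "configuration": ...}` shape follows the Zarr v3 convention for extension points; the configuration keys `length_bytes` and `unit`, and both helper functions, are assumptions for illustration rather than agreed names.

```python
import json


def fixed_length_dtype(name: str, length_bytes: int) -> dict:
    """Build metadata for a fixed-length dtype.

    The "length_bytes" key is an assumption, not an agreed name.
    """
    return {"name": name, "configuration": {"length_bytes": length_bytes}}


def datetime64_dtype(unit: str) -> dict:
    """Build metadata for datetime64.

    The "unit" key is an assumption, not an agreed name.
    """
    return {"name": "datetime64", "configuration": {"unit": unit}}


# Serialize the two hypothetical metadata objects as JSON.
print(json.dumps(fixed_length_dtype("fixed-length-ascii", 10)))
print(json.dumps(datetime64_dtype("ns")))
```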

@normanrz

I think the names are ok. The spec documents should definitely contain the configuration params. The spec docs should also describe how these dtypes can be (de)serialized, i.e. which array-bytes codecs support them and how.
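
One possible reading of the (de)serialization question, sketched with the standard library only: a fixed-length ASCII element could be stored as its bytes padded with NUL to the configured length (the padding NumPy's S dtype uses), and a datetime64 element as a signed 64-bit integer count of units since the Unix epoch. The little-endian byte order and all helper names here are assumptions, not something the spec has settled.

```python
import struct


def encode_fixed_ascii(value: str, length: int) -> bytes:
    """Encode to ASCII and right-pad with NUL bytes to exactly `length` bytes."""
    raw = value.encode("ascii")
    if len(raw) > length:
        raise ValueError(f"{value!r} does not fit in {length} bytes")
    return raw.ljust(length, b"\x00")


def decode_fixed_ascii(data: bytes) -> str:
    """Strip trailing NUL padding and decode back to a string."""
    return data.rstrip(b"\x00").decode("ascii")


def encode_datetime64(count: int) -> bytes:
    """Pack a count of units since the epoch as a little-endian int64 (assumed order)."""
    return struct.pack("<q", count)


def decode_datetime64(data: bytes) -> int:
    """Unpack a little-endian int64 count of units since the epoch."""
    return struct.unpack("<q", data)[0]
```

A consequence of NUL padding is that values ending in NUL bytes do not round-trip, which is one of the details a spec document would need to state explicitly.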

mkitti commented Apr 27, 2025

Why is - used in the extension name rather than underscore _ here?

d-v-b commented May 7, 2025

because datetime64 and timedelta64 are so numpy-specific, I am wondering if we want to make this explicit in the name, e.g. by using the names numpy.datetime64, numpy.timedelta64.

If we ever create a solution for dates that doesn't rely on numpy semantics, we will have to tell people "don't use datetime64, that's actually just for numpy compatibility, use <better datetime data type> instead" which might get tedious and confuse users.

rabernat commented May 7, 2025

I think the most feasible alternative is an arrow timestamp. Both are based on a 64-bit integer primitive type. The main differences are:

  • arrow timestamp only allows seconds, milliseconds, microseconds, or nanoseconds as units
  • arrow timestamp optionally includes timezone

We could imagine having both numpy.datetime64 and arrow.timestamp in Zarr. So I would not be opposed to the numpy. prefix.
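
The unit difference can be made concrete with a small check. The unit lists below are written from memory of the NumPy and Arrow documentation and should be verified before relying on them; Arrow's timestamp units are taken here to be second, millisecond, microsecond, and nanosecond. `arrow_compatible` is a hypothetical helper.

```python
# Unit codes accepted by numpy.datetime64 (from memory; verify against the NumPy docs).
NUMPY_UNITS = ("Y", "M", "W", "D", "h", "m", "s", "ms", "us", "ns", "ps", "fs", "as")

# Units assumed for Arrow's timestamp type (verify against the Arrow docs).
ARROW_UNITS = ("s", "ms", "us", "ns")


def arrow_compatible(unit: str) -> bool:
    """Return True if a numpy.datetime64 unit also exists as an Arrow timestamp unit."""
    if unit not in NUMPY_UNITS:
        raise ValueError(f"unknown numpy.datetime64 unit: {unit!r}")
    return unit in ARROW_UNITS


# Units that only the numpy-flavoured dtype could represent under these assumptions.
numpy_only = [u for u in NUMPY_UNITS if not arrow_compatible(u)]
print(numpy_only)
```

Under these assumptions, calendar units such as Y, M, and D have no Arrow timestamp counterpart, which supports keeping the two as separate data types.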

Side note: every time I read about Arrow dtypes I get a strong feeling that we are reinventing the wheel here in Zarr world. The concepts of "physical layout," "primitive type," and "parametric type" would be very useful for us.
