
remove insertion of vlen-string codec for v2 metadata creation #3100


Merged (6 commits) on May 28, 2025

Conversation

d-v-b
Contributor

@d-v-b d-v-b commented May 27, 2025

This PR removes the following code block:

```python
# inject VLenUTF8 for str dtype if not already present
if np.issubdtype(dtype, np.str_):
    filters = filters or []
    from numcodecs.vlen import VLenUTF8

    if not any(isinstance(x, VLenUTF8) or x["id"] == "vlen-utf8" for x in filters):
        filters = list(filters) + [VLenUTF8()]
```

Neither of the two numpy dtypes that can model variable-length strings are subdtypes of np.str_:

```python
>>> np.issubdtype(np.dtypes.ObjectDType(), np.str_)
False
>>> np.issubdtype(np.dtypes.StringDType(), np.str_)
False
```

so there is no reason to insert the vlen-utf8 codec here.
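For contrast (my own illustration, not code from the PR): the removed check *does* match fixed-length unicode dtypes, because `np.str_` covers fixed-length strings rather than variable-length ones. That is why the codec was being injected for the wrong dtypes:

```python
import numpy as np

# Fixed-length unicode dtypes (kind "U") are subdtypes of np.str_,
# so the deleted branch matched them...
print(np.issubdtype(np.dtype("U10"), np.str_))   # True

# ...while the object dtype used for variable-length strings is not.
print(np.issubdtype(np.dtype(object), np.str_))  # False
```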

This was revealed by regression testing against 2.18: zarr-python 2.18 cannot read certain arrays generated by zarr-python main because we were inserting this codec.

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label May 27, 2025
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label May 27, 2025
@d-v-b d-v-b requested a review from dstansby May 27, 2025 09:00
Contributor

@dstansby dstansby left a comment


👍 , but probably worth adding a test to make sure this change doesn't accidentally get reverted in the future? I guess just creating a fixed-length string array and checking the list of filters would work.

Co-authored-by: David Stansby <[email protected]>
@d-v-b
Contributor Author

d-v-b commented May 27, 2025

> 👍 , but probably worth adding a test to make sure this change doesn't accidentally get reverted in the future? I guess just creating a fixed-length string array and checking the list of filters would work.

I don't think we need tests to ensure that deleted code doesn't run. A better way to test this desired behavior end-to-end is with the regression tests in #3099.

Co-authored-by: David Stansby <[email protected]>
@dstansby
Contributor

Independently of cross testing with zarr-python 2, I don't see why we shouldn't put in a test here (or modify an existing one) that checks the filters for a fixed length string array are as expected.

How does one actually hit this bug? I tried the code below, but the list of filters is empty.

```python
# /// script
# dependencies = [
#   "zarr==3.0.8",
#   "numpy"
# ]
# ///
import numpy as np
import zarr

arr = zarr.create_array(store={}, data=np.array(["a", "b"]))
print(arr.filters)
```

@d-v-b
Contributor Author

d-v-b commented May 27, 2025

run this:

```python
# /// script
# dependencies = [
#   "zarr==3.0.8",
#   "numpy"
# ]
# ///
import numpy as np
import zarr

arr = zarr.create_array(store={}, data=np.array(["a", "b"]), zarr_format=2)
print(arr.filters)
```

> I don't see why we shouldn't put in a test here (or modify an existing one) that checks the filters for a fixed length string array are as expected.

We should have a test somewhere that checks that the default filters are empty. In general the default filters should be data type-independent -- I think only variable-length strings need a specific filter (and even then, we should not be adding it by default). So I'm not sure this PR needs to contain any new tests.

@d-v-b
Contributor Author

d-v-b commented May 28, 2025

for example, here is a test that checks that the default filters are empty:

```python
"v3_default_filters": {"numeric": [], "string": [], "bytes": []},
```

I think this is sufficient for this PR. If you feel like another layer of testing is needed, maybe open an issue about that and we can fix it in another PR.

@dstansby
Contributor

Isn't that just checking the default filters in the config though, not the filters that end up in an array after it's created?

I think this should get a test, so I won't approve/merge, but another dev can merge if they're happy with not adding a test here.

@d-v-b
Contributor Author

d-v-b commented May 28, 2025

It doesn't make any sense to me to add a test that specifically checks that the code I deleted has been deleted. Can you find an existing test that we could alter to cover this case?

@d-v-b
Contributor Author

d-v-b commented May 28, 2025

fwiw, I do think we have some real gaps in our tests here, so it would be helpful to identify which test needs to be expanded

@d-v-b
Contributor Author

d-v-b commented May 28, 2025

@dstansby your request for more tests was super helpful, unfortunately it reveals some problems that might be beyond the scope of this PR 😅

In main, this diff will create a failing test, which passes in this PR:

```diff
diff --git a/tests/test_array.py b/tests/test_array.py
index a6bcd17c..c1f3a1eb 100644
--- a/tests/test_array.py
+++ b/tests/test_array.py
@@ -1245,7 +1245,7 @@ class TestCreateArray:
             zarr.create(store=store, dtype="uint8", shape=(10,), zarr_format=3, **kwargs)
 
     @staticmethod
-    @pytest.mark.parametrize("dtype", ["uint8", "float32", "str"])
+    @pytest.mark.parametrize("dtype", ["uint8", "float32", "str", "U10"])
     @pytest.mark.parametrize(
         "compressors",
         [
```
But while investigating this, I discovered that the logic in these two functions is wrong:

```python
def _default_compressor(
    dtype: np.dtype[Any],
) -> dict[str, JSON] | None:
    """Get the default filters and compressor for a dtype.

    https://numpy.org/doc/2.1/reference/generated/numpy.dtype.kind.html
    """
    default_compressor = config.get("array.v2_default_compressor")
    if dtype.kind in "biufcmM":
        dtype_key = "numeric"
    elif dtype.kind in "U":
        dtype_key = "string"
    elif dtype.kind in "OSV":
        dtype_key = "bytes"
    else:
        raise ValueError(f"Unsupported dtype kind {dtype.kind}")
    return cast("dict[str, JSON] | None", default_compressor.get(dtype_key, None))


def _default_filters(
    dtype: np.dtype[Any],
) -> list[dict[str, JSON]] | None:
    """Get the default filters and compressor for a dtype.

    https://numpy.org/doc/2.1/reference/generated/numpy.dtype.kind.html
    """
    default_filters = config.get("array.v2_default_filters")
    if dtype.kind in "biufcmM":
        dtype_key = "numeric"
    elif dtype.kind in "U":
        dtype_key = "string"
    elif dtype.kind in "OS":
        dtype_key = "bytes"
    elif dtype.kind == "V":
        dtype_key = "raw"
    else:
        raise ValueError(f"Unsupported dtype kind {dtype.kind}")
```

We should not be encoding fixed-length strings with the vlen-utf8 encoding by default. I also discovered that it's possible to create a vlen string array without the required vlen-utf8 encoding, which raises a runtime error if you try to write values. I think these should be solved in another PR.

@d-v-b d-v-b merged commit feb4aa2 into zarr-developers:main May 28, 2025
30 checks passed
@d-v-b d-v-b deleted the fix/remove-imposed-vlen-string branch May 29, 2025 06:51