-
-
Notifications
You must be signed in to change notification settings - Fork 331
remove insertion of vlen-string codec for v2 metadata creation #3100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remove insertion of vlen-string codec for v2 metadata creation #3100
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 , but probably worth adding a test to make sure this change doesn't accidentally get reverted in the future? I guess just creating a fixed-length string array and checking the list of filters would work.
Co-authored-by: David Stansby <[email protected]>
I don't think we need tests to ensure that deleted code doesn't run. A better way to test this desired behavior end-to-end is with the regression tests in #3099. |
Co-authored-by: David Stansby <[email protected]>
Independently of cross testing with zarr-python 2, I don't see why we shouldn't put in a test here (or modify an existing one) that checks the filters for a fixed length string array are as expected. How does one actually hit this bug? I tried the code below, but the list of filters is empty. # /// script
# dependencies = [
# "zarr==3.0.8",
# "numpy"
# ]
# ///
import numpy as np
import zarr
arr = zarr.create_array(store={}, data=np.array(["a", "b"]))
print(arr.filters) |
run this:
We should have somewhere a test that checks that the default filters are empty. In general the default filters should be data type-independent -- I think only variable length strings need a specific filter (and even then, we should not be adding them by default). So I'm not sure this PR needs to contain any new tests. |
for example, here is a test that checks that the default filters are empty: zarr-python/tests/test_config.py Line 67 in 0dd797f
I think this is sufficient for this PR. If you feel like another layer of testing is needed, maybe open an issue about that and we can fix it in another PR. |
Isn't that just checking the default filters in the config though, not the filters that end up in an array after it's created? I think this should get a test, so I won't approve/merge, but another dev can merge if they're happy with not adding a test here. |
it doesn't make any sense to me to add a test that specifically checks that the code I deleted has been deleted. Can you find an existing test that we could alter to cover this case? |
fwiw, I do think we have some real gaps in our tests here, so it would be helpful to identify which test needs to be expanded |
…into fix/remove-imposed-vlen-string
@dstansby your request for more tests was super helpful, unfortunately it reveals some problems that might be beyond the scope of this PR 😅 In main, this diff will create a failing test, which passes in this PR: diff --git a/tests/test_array.py b/tests/test_array.py
index a6bcd17c..c1f3a1eb 100644
--- a/tests/test_array.py
+++ b/tests/test_array.py
@@ -1245,7 +1245,7 @@ class TestCreateArray:
zarr.create(store=store, dtype="uint8", shape=(10,), zarr_format=3, **kwargs)
@staticmethod
- @pytest.mark.parametrize("dtype", ["uint8", "float32", "str"])
+ @pytest.mark.parametrize("dtype", ["uint8", "float32", "str", "U10"])
@pytest.mark.parametrize(
"compressors",
[ But while investigating this, I discovered that the logic in these two functions is wrong: zarr-python/src/zarr/core/metadata/v2.py Lines 434 to 471 in ada3b22
We should not be encoding fixed-length strings with the vlen-utf8 encoding by default. I also discovered that it's possible to create a vlen string array without the required vlen-utf8 encoding, which raises a runtime error if you try to write values. I think these should be solved in another PR. |
This PR removes the following code block:
zarr-python/src/zarr/core/array.py
Lines 771 to 777 in 0dd797f
Neither of the two numpy dtypes that can model variable-length strings are subdtypes of
np.str_
:so there is no reason to use vlen-string codec here.
This was revealed by regression testing against 2.18. Zarr-python 2.18 cannot read certain arrays generated by zarr-python main because we inserting this codec.