
remove insertion of vlen-string codec for v2 metadata creation #3100


Merged (6 commits) on May 28, 2025

Conversation

d-v-b
Contributor

@d-v-b d-v-b commented May 27, 2025

This PR removes the following code block:

```python
# inject VLenUTF8 for str dtype if not already present
if np.issubdtype(dtype, np.str_):
    filters = filters or []
    from numcodecs.vlen import VLenUTF8

    if not any(isinstance(x, VLenUTF8) or x["id"] == "vlen-utf8" for x in filters):
        filters = list(filters) + [VLenUTF8()]
```

Neither of the two numpy dtypes that can model variable-length strings are subdtypes of np.str_:

```python
>>> np.issubdtype(np.dtypes.ObjectDType(), np.str_)
False
>>> np.issubdtype(np.dtypes.StringDType(), np.str_)
False
```

so there is no reason to insert the vlen-utf8 codec here.
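For contrast (my own illustration, not code from the PR): the removed check *does* match fixed-length unicode dtypes, because `np.str_` covers fixed-length strings rather than variable-length ones. That is why the codec was being injected for the wrong dtypes:

```python
import numpy as np

# Fixed-length unicode dtypes (kind "U") are subdtypes of np.str_,
# so the deleted branch matched them...
print(np.issubdtype(np.dtype("U10"), np.str_))   # True

# ...while the object dtype used for variable-length strings is not.
print(np.issubdtype(np.dtype(object), np.str_))  # False
```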

This was revealed by regression testing against 2.18: zarr-python 2.18 cannot read certain arrays generated by zarr-python main because we were inserting this codec.

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label May 27, 2025
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label May 27, 2025
@d-v-b d-v-b requested a review from dstansby May 27, 2025 09:00
Contributor

@dstansby dstansby left a comment


👍 , but probably worth adding a test to make sure this change doesn't accidentally get reverted in the future? I guess just creating a fixed-length string array and checking the list of filters would work.

Co-authored-by: David Stansby <[email protected]>
@d-v-b
Contributor Author

d-v-b commented May 27, 2025

> 👍 , but probably worth adding a test to make sure this change doesn't accidentally get reverted in the future? I guess just creating a fixed-length string array and checking the list of filters would work.

I don't think we need tests to ensure that deleted code doesn't run. A better way to test this desired behavior end-to-end is with the regression tests in #3099.

Co-authored-by: David Stansby <[email protected]>
@dstansby
Contributor

Independently of cross testing with zarr-python 2, I don't see why we shouldn't put in a test here (or modify an existing one) that checks the filters for a fixed length string array are as expected.

How does one actually hit this bug? I tried the code below, but the list of filters is empty.

```python
# /// script
# dependencies = [
#   "zarr==3.0.8",
#   "numpy"
# ]
# ///
import numpy as np
import zarr

arr = zarr.create_array(store={}, data=np.array(["a", "b"]))
print(arr.filters)
```

@d-v-b
Contributor Author

d-v-b commented May 27, 2025

run this:

```python
# /// script
# dependencies = [
#   "zarr==3.0.8",
#   "numpy"
# ]
# ///
import numpy as np
import zarr

arr = zarr.create_array(store={}, data=np.array(["a", "b"]), zarr_format=2)
print(arr.filters)
```

> I don't see why we shouldn't put in a test here (or modify an existing one) that checks the filters for a fixed length string array are as expected.

We should have a test somewhere that checks that the default filters are empty. In general the default filters should be data type-independent -- I think only variable-length strings need a specific filter (and even then, we should not be adding it by default). So I'm not sure this PR needs to contain any new tests.

@d-v-b
Contributor Author

d-v-b commented May 28, 2025

for example, here is a test that checks that the default filters are empty:

```python
"v3_default_filters": {"numeric": [], "string": [], "bytes": []},
```

I think this is sufficient for this PR. If you feel like another layer of testing is needed, maybe open an issue about that and we can fix it in another PR.

@dstansby
Contributor

Isn't that just checking the default filters in the config though, not the filters that end up in an array after it's created?

I think this should get a test, so I won't approve/merge, but another dev can merge if they're happy with not adding a test here.

@d-v-b
Contributor Author

d-v-b commented May 28, 2025

It doesn't make any sense to me to add a test that specifically checks that the code I deleted has been deleted. Can you find an existing test that we could alter to cover this case?

@d-v-b
Contributor Author

d-v-b commented May 28, 2025

fwiw, I do think we have some real gaps in our tests here, so it would be helpful to identify which test needs to be expanded

@d-v-b
Contributor Author

d-v-b commented May 28, 2025

@dstansby your request for more tests was super helpful, unfortunately it reveals some problems that might be beyond the scope of this PR 😅

In main, this diff will create a failing test, which passes in this PR:

```diff
diff --git a/tests/test_array.py b/tests/test_array.py
index a6bcd17c..c1f3a1eb 100644
--- a/tests/test_array.py
+++ b/tests/test_array.py
@@ -1245,7 +1245,7 @@ class TestCreateArray:
             zarr.create(store=store, dtype="uint8", shape=(10,), zarr_format=3, **kwargs)
 
     @staticmethod
-    @pytest.mark.parametrize("dtype", ["uint8", "float32", "str"])
+    @pytest.mark.parametrize("dtype", ["uint8", "float32", "str", "U10"])
     @pytest.mark.parametrize(
         "compressors",
         [
```
But while investigating this, I discovered that the logic in these two functions is wrong:

```python
def _default_compressor(
    dtype: np.dtype[Any],
) -> dict[str, JSON] | None:
    """Get the default filters and compressor for a dtype.

    https://numpy.org/doc/2.1/reference/generated/numpy.dtype.kind.html
    """
    default_compressor = config.get("array.v2_default_compressor")
    if dtype.kind in "biufcmM":
        dtype_key = "numeric"
    elif dtype.kind in "U":
        dtype_key = "string"
    elif dtype.kind in "OSV":
        dtype_key = "bytes"
    else:
        raise ValueError(f"Unsupported dtype kind {dtype.kind}")
    return cast("dict[str, JSON] | None", default_compressor.get(dtype_key, None))


def _default_filters(
    dtype: np.dtype[Any],
) -> list[dict[str, JSON]] | None:
    """Get the default filters and compressor for a dtype.

    https://numpy.org/doc/2.1/reference/generated/numpy.dtype.kind.html
    """
    default_filters = config.get("array.v2_default_filters")
    if dtype.kind in "biufcmM":
        dtype_key = "numeric"
    elif dtype.kind in "U":
        dtype_key = "string"
    elif dtype.kind in "OS":
        dtype_key = "bytes"
    elif dtype.kind == "V":
        dtype_key = "raw"
    else:
        raise ValueError(f"Unsupported dtype kind {dtype.kind}")
```

We should not be encoding fixed-length strings with the vlen-utf8 encoding by default. I also discovered that it's possible to create a vlen string array without the required vlen-utf8 encoding, which raises a runtime error if you try to write values. I think these should be solved in another PR.

@d-v-b d-v-b merged commit feb4aa2 into zarr-developers:main May 28, 2025
30 checks passed
@d-v-b d-v-b deleted the fix/remove-imposed-vlen-string branch May 29, 2025 06:51