
@RFLeijenaar

The latest release of zarr-python (3.1.3) added extensibility of chunk key encodings. This PR defines a fanout chunk key encoding that splits each chunk coordinate among multiple levels of hierarchy to reduce the number of files per directory. This is particularly useful for arrays with at least one very large dimension, where the default encoding could yield an inordinate number of files in a single directory, degrading filesystem performance.

I have created a repository with a Python implementation of this encoding: Zarr-Fanout-CKE.

I am curious what your thoughts are. Some relevant choices for the spec:

  • The name fanout is chosen, referring to the parameterized number of children per node (directory). Alternatives like hierarchical are, in my opinion, too generic (the default encoding is also hierarchical), and radix might be too technical (it describes what goes on under the hood).
  • max_children is used as the user-defined parameter name over something like base or radix, as it is more intuitive to the end user what this parameter defines. max_entries_per_dir is more descriptive, but the term 'dir' might be too specific, since the CKE could also be used with stores other than standard filesystems. The only slightly awkward aspect of this parameterization is that the 'base' is max_children - 1, but again, this might not be relevant to the end user.

@jbms
Contributor

jbms commented Oct 20, 2025

What is the purpose of the final /c --- that seems to just ensure that every chunk is in its own directory, and doesn't reduce the number of entries in the parent directory.

One big issue with the existing default encoding is that lexicographical key order does not match chunk index order. Having key order match chunk index order would allow certain queries, like finding all chunks within a given rectangular region, to be performed more efficiently on storage systems like s3 or gcs that support lexicographical range queries.

Using some sort of Morton code could further improve efficiency depending on access pattern, but just having lexicographical order match chunk index order is a strict improvement over the existing default encoding.

One simple way to achieve that is by zero padding the numbers up to a fixed number of digits, but since resizing is supported it can be difficult to choose a reasonable upper bound that wouldn't result in an excessive number of digits. For example, we might want to allow up to 2^64-1 chunks, but would then have to pad up to 20 digits in base 10.

To avoid that problem we can use a length prefix, e.g.:

0 -> 00_
1 -> 01_1
12 -> 02_12
123 -> 03_123

where the length is zero-padded but the number itself is not.
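As an illustration, this length-prefixed scheme can be sketched in a few lines of Python (encode_length_prefixed is a hypothetical helper name, not anything defined by the spec):

```python
def encode_length_prefixed(n: int) -> str:
    """Length-prefixed decimal encoding: the digit count is zero-padded
    to two digits so that keys sort lexicographically in chunk-index
    order; the number itself is not padded. 0 is encoded with zero
    digits, matching the examples above."""
    digits = "" if n == 0 else str(n)
    return f"{len(digits):02d}_{digits}"
```

Because the length field sorts first, two keys compare first by digit count and then by the digits themselves, which coincides with numeric order.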

This by itself works pretty well but if we want to shorten it we could make various changes:

Encode in base-16, which makes the number itself shorter and also means that we can encode up to 2^64-1 with only 16 digits. Therefore the length can be represented in a single digit also:

0 -> 0_
1 -> 1_1
12 -> 1_c
123 -> 2_7b

Note: The _ separator is purely for readability, and not needed for disambiguation. So we could just exclude it:

0 -> 0
1 -> 11
12 -> 1c
123 -> 27b

Prefixing with the length also means it is self-delimiting and therefore we don't need a separator between dimensions:

(0, 0, 0) -> 000
(1, 1, 1) -> 111111
(123, 45678, 9123432435) -> 27b4b26e921fcc87f3

Then we can just insert slashes every N digits, e.g. every 3 digits to limit directories to at most 4096 entries.

27b/4b2/6e9/21f/cc8/7f3
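Putting these pieces together, a minimal sketch (encode_coords is a hypothetical helper; it assumes each coordinate fits in at most 15 hex digits so the length fits in a single hex digit):

```python
def encode_coords(coords) -> str:
    """Hex digits with a single-hex-digit length prefix per coordinate
    (self-delimiting, so no separator between dimensions is needed),
    then a "/" inserted every 3 digits to cap directories at 4096
    entries."""
    s = ""
    for n in coords:
        h = "" if n == 0 else format(n, "x")  # 0 encodes with zero digits
        assert len(h) <= 15, "length must fit in a single hex digit"
        s += format(len(h), "x") + h
    return "/".join(s[i:i + 3] for i in range(0, len(s), 3))
```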

We could also include e.g. dots between components and underscores after the length prefix, and just exclude underscores and dots for the purpose of counting digits when inserting "/":

2_7b./4_b2/6e.9_/21f/cc8/7f3

There are a lot of possible variations. But this would be a nice way to address the existing key ordering issue while also solving your goal of limiting the number of entries per directory.

@RFLeijenaar
Author

What is the purpose of the final /c --- that seems to just ensure that every chunk is in its own directory, and doesn't reduce the number of entries in the parent directory.

It actually does reduce the number of entries in the parent directory.
Let's consider the directory d0/1/ with base 10 (max_children 11) where we just add _c to indicate each chunk
(10,) -> d0/1/0_c
(11,) -> d0/1/1_c
...
(19,) -> d0/1/9_c
(100,) -> d0/1/0/0_c
(110,) -> d0/1/1/0_c
...
(190,) -> d0/1/9/0_c

In this case we are at 20 children in the directory d0/1/.
Note that the chunk file c with my proposed encoding does not always end up in its own directory.

One big issue with the existing default encoding is that lexicographical key order does not match chunk index order. Having key order match chunk index order would allow certain queries, like finding all chunks within a given rectangular region, to be performed more efficiently on storage systems like s3 or gcs that support lexicographical range queries.

A lexicographical key order that aligns with the chunk order is not something I have considered, as this CKE tries to solve a different problem. However, I wonder what the use cases are for listing chunk keys. I believe this is only useful for block-sparse data. Then it must also be combined with array queries that involve a large range query (a couple of chunks wide) over the outer dimension, right? Are there any Zarr back-ends that currently perform a list operation on the storage keys?

That said, I do not necessarily oppose the idea of prepending a length indicator to the dimension indicator. I will look into it.

@jbms
Contributor

jbms commented Oct 21, 2025

What is the purpose of the final /c --- that seems to just ensure that every chunk is in its own directory, and doesn't reduce the number of entries in the parent directory.

It actually does reduce the number of entries in the parent directory.
Let's consider the directory d0/1/ with base 10 (max_children 11) where we just add _c to indicate each chunk
(10,) -> d0/1/0_c
(11,) -> d0/1/1_c
...
(19,) -> d0/1/9_c
(100,) -> d0/1/0/0_c
(110,) -> d0/1/1/0_c
...
(190,) -> d0/1/9/0_c

In this case we are at 20 children in the directory d0/1/.
Note that the chunk file c with my proposed encoding does not always end up in its own directory.

Thanks for the clarification. I see now that a suffix of some sort is necessary with your encoding to avoid a file and a directory having the same name.

One big issue with the existing default encoding is that lexicographical key order does not match chunk index order. Having key order match chunk index order would allow certain queries, like finding all chunks within a given rectangular region, to be performed more efficiently on storage systems like s3 or gcs that support lexicographical range queries.

A lexicographical key order that aligns with the chunk order is not something I have considered, as this CKE tries to solve a different problem. However, I wonder what the use cases are for listing chunk keys. I believe this is only useful for block-sparse data. Then it must also be combined with array queries that involve a large range query (a couple of chunks wide) over the outer dimension, right? Are there any Zarr back-ends that currently perform a list operation on the storage keys?

This is done by tensorstore for the storage statistics API:

https://google.github.io/tensorstore/python/api/tensorstore.TensorStore.storage_statistics.html

That said, I do not necessarily oppose the idea of prepending a length indicator to the dimension indicator. I will look into it.

@RFLeijenaar
Author

To adapt the original proposal, I would suggest replacing the dimension indicator d{i}/ with {num_digits}_, where num_digits is the number of digits of the coordinate in the parameterized base. Each digit is represented as a zero-padded decimal or hexadecimal number; the zero-padding width is determined by the base.

Some examples:

Base 100 in decimals:

() -> c
(123,) -> c/2_01/23
(1234, 5, 67890) -> c/2_12/34/1_05/3_06/78/90

Base 1000 in decimals:

(123, 45678, 9123432435) -> c/1_123/2_045/678/4_009/123/432/435

Base 4096 in hexadecimal:

(123, 45678, 9123432435) -> c/1_07b/2_00b/26e/3_21f/cc8/7f3

I could set the minimum base (max_children) to something like 16 (this is already very low) and represent the length in hexadecimal to cover 2^64 chunks along a single dim.

In this way we keep the original parameterization while maintaining a lexicographical key order that aligns with the chunk coordinate order.
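A sketch of this adapted encoding (adapted_key is a hypothetical helper; the length prefix is written in decimal here, which coincides with hexadecimal for the lengths in these examples):

```python
def adapted_key(coords, base: int = 1000, hexadecimal: bool = False) -> str:
    """Per dimension: a {num_digits}_ prefix followed by the
    base-`base` digits of the coordinate, each zero-padded (decimal
    or hex) to the width needed for base - 1, joined by "/"."""
    fmt = "x" if hexadecimal else "d"
    width = len(format(base - 1, fmt))  # e.g. 3 for base 1000 or 4096
    parts = ["c"]
    for n in coords:
        digits = []
        while True:  # collect digits, least-significant first
            digits.append(format(n % base, f"0{width}{fmt}"))
            n //= base
            if n == 0:
                break
        digits.reverse()
        parts.append(f"{len(digits)}_{digits[0]}")
        parts.extend(digits[1:])
    return "/".join(parts)
```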

What are your thoughts @jbms?

@jbms
Copy link
Contributor

jbms commented Oct 21, 2025

First converting to this parameterized base and then converting each "digit" of that base to decimal or hexadecimal seems to introduce additional complexity in both the implementation and the specification. You then also have to zero-pad the decimal or hexadecimal representation of each "digit" to maintain lexicographical order. Also, I think you no longer strictly adhere to max_children, because the number of children can be multiplied by the number of possible distinct values of num_digits.

Also, if the user specifies a base that is not a power of 10 (if decimal representation is used) or 16 (if hexadecimal representation is used), the keys will be very difficult for humans to interpret.

An advantage of specifying an arbitrary base, rather than just specifying a max number of digits before inserting a slash, is that it allows very precise control over the maximum number of children in a directory. But is there a use case for such precise control? The use cases that occur to me are all just about avoiding excessively large directories while avoiding creating excessively many directories --- for that, it would be sufficient to allow users to specify a limit within a factor of 10 or 16 rather than a precise limit.
