Fanout chunk key encoding #31
What is the purpose of the final

One big issue with the existing default encoding is that lexicographical key order does not match chunk index order. Having key order match chunk index order would allow certain queries, like finding all chunks within a given rectangular region, to be performed more efficiently on storage systems like S3 or GCS that support lexicographical range queries. Using some sort of Morton code could further improve efficiency depending on access pattern, but just having lexicographical order match chunk index order is a strict improvement over the existing default encoding.

One simple way to achieve that is by zero-padding the numbers up to a fixed number of digits, but since resizing is supported it can be difficult to choose a reasonable upper bound that wouldn't result in an excessive number of digits. For example, we might want to allow up to 2^64-1 chunks, but then we would have to pad up to 20 digits in base 10. To avoid that problem we can use a length prefix, e.g. `0 -> 00_`, where the length is zero-padded but the number itself is not.

This by itself works pretty well, but if we want to shorten it we could make various changes:

- Encode in base 16, which makes the number itself shorter and also means that we can encode up to 2^64-1 with only 16 digits. Therefore the length can be represented in a single digit as well: `0 -> 0_`
- Note: the `_` separator is purely for readability and is not needed for disambiguation, so we could just exclude it: `0 -> 0`
- Prefixing with the length also makes each number self-delimiting, so we don't need a separator between dimensions: `(0, 0, 0) -> 000`
- Then we can just insert slashes every N digits, e.g. every 3 digits to limit directories to at most 4096 entries: `27b/4b2/6e9/21f/cc8/7f3`
- We could also include e.g. dots between components and underscores after the length prefix, and just exclude underscores and dots for the purpose of counting digits when inserting `/`: `2_7b./4_b2/6e.9_/21f/cc8/7f3`

There are a lot of possible variations, but this would be a nice way to address the existing key ordering issue while also solving your goal of limiting the number of entries per directory.
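A minimal Python sketch of this length-prefixed hexadecimal variant (the function name and the assumption that a coordinate of 0 encodes as the single hex digit `0` with length prefix `1` are mine, not from the comment; slashes are inserted every 3 digits as in the example above):

```python
def hex_length_prefix_key(coords, group=3):
    # Hypothetical sketch: each coordinate is written in hex and prefixed with
    # its digit count as a single hex digit (so 0 -> "10", 123 -> "27b").
    # A longer number always gets a larger length prefix, so lexicographic key
    # order matches chunk index order (for coordinates below 16**15, where the
    # length still fits in one hex digit).
    digits = "".join(f"{len(f'{c:x}'):x}{c:x}" for c in coords)
    # Insert "/" every `group` digits to cap each directory at
    # 16**group entries (4096 for group=3).
    return "/".join(digits[i:i + group] for i in range(0, len(digits), group))
```

For example, `hex_length_prefix_key((123, 45678, 9123432435))` yields `27b/4b2/6e9/21f/cc8/7f3`, matching the example in the comment above.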
It actually does reduce the number of entries in the parent directory. In this case we are at
A lexicographical key order that aligns with the chunk order is not something I have considered, as this CKE tries to solve a different problem. However, I wonder what the use cases are for listing chunk keys. I believe this is only useful on block-sparse data. Then it must also be combined with array queries that involve a large range query (a couple of chunks wide) over the outer dimension, right? Are there any Zarr back-ends that currently perform a list operation on the storage keys?

That said, I do not necessarily oppose the idea of prepending a length indicator to the dimension indicator. I will look into it.
Thanks for the clarification. I see that a suffix of some sort is necessary with your encoding to avoid a file and a directory with the same name.
This is done by tensorstore for the storage statistics API: https://google.github.io/tensorstore/python/api/tensorstore.TensorStore.storage_statistics.html
To adapt the original proposal, I would suggest replacing the dimension indicator with a length prefix. Some examples:

Base 100 in decimals:

| Coordinates | Chunk key |
|---|---|
| () | c |
| (123,) | c/2_01/23 |
| (1234, 5, 67890) | c/2_12/34/1_05/3_06/78/90 |

Base 1000 in decimals:

| Coordinates | Chunk key |
|---|---|
| (123, 45678, 9123432435) | c/1_123/2_045/678/4_009/123/432/435 |

Base 4096 in hexadecimal:

| Coordinates | Chunk key |
|---|---|
| (123, 45678, 9123432435) | c/1_07b/2_00b/26e/3_21f/cc8/7f3 |
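As a sanity check, the tables above can be reproduced with a short Python sketch of this parameterization (the function name is hypothetical; it assumes the digit count is rendered in the chosen representation base and each digit is zero-padded to a fixed width):

```python
def parameterized_base_key(coords, base=100, repr_base=10):
    # Hypothetical sketch: write each coordinate in `base`, render each
    # base-`base` digit in `repr_base` (10 or 16), zero-padded to the width
    # needed for the largest digit, and prefix the digit count with a "_".
    fmt = "x" if repr_base == 16 else "d"
    width = len(format(base - 1, fmt))  # chars per rendered digit
    parts = ["c"]
    for c in coords:
        digits = []
        while True:  # decompose the coordinate into base-`base` digits
            digits.append(c % base)
            c //= base
            if c == 0:
                break
        digits.reverse()
        rendered = [format(d, f"0{width}{fmt}") for d in digits]
        rendered[0] = f"{format(len(digits), fmt)}_{rendered[0]}"
        parts.extend(rendered)
    return "/".join(parts)
```

With the defaults (base 100, decimal), `parameterized_base_key((1234, 5, 67890))` yields `c/2_12/34/1_05/3_06/78/90`, matching the first table.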
I could set the minimum base (`max_children`) to something like 16 (which is already very low) and represent the length in hexadecimal to cover 2^64 chunks along a single dimension. This way we keep the original parametrization while gaining a lexicographical key order that aligns with the chunk coordinate order.
What are your thoughts @jbms?
First converting to this parameterized base and then converting each "digit" according to that base to decimal or hexadecimal seems to introduce additional complexity in both the implementation and the specification. You then also have to zero-pad the decimal or hexadecimal representation of each "digit" to maintain lexicographical order. Also, I think you no longer strictly adhere to

Also, if the user doesn't specify a base that is a power of 10 (if decimal representation is used) or 16 (if hexadecimal representation is used), it will be very difficult for humans to interpret.

An advantage of specifying an arbitrary base, rather than just specifying a maximum number of digits before inserting a slash, is that it allows very precise control over the maximum number of children in a directory. But is there a use case for such precise control? The use cases that occur to me are all about avoiding excessively large directories while also avoiding creating excessively many directories --- for that, it would be sufficient to allow users to specify a limit within a factor of 10 or 16 rather than a precise limit.
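The zero-padding point can be illustrated with a small hypothetical base-100 example (the digit values 5 and 12 are my own illustration, not from the discussion):

```python
# Two single-digit base-100 coordinates, digit values 5 and 12,
# rendered in decimal WITHOUT zero padding:
unpadded = ["1_5", "1_12"]
# "1_12" sorts before "1_5" lexicographically, the reverse of index order.
assert sorted(unpadded) == ["1_12", "1_5"]

# Zero-padding every digit to the fixed two-character width restores
# the match between lexicographic order and chunk index order:
padded = ["1_05", "1_12"]
assert sorted(padded) == ["1_05", "1_12"]
```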
The latest release of zarr-python (3.1.3) added extensibility of chunk key encodings. This PR defines a `fanout` chunk key encoding that splits each chunk coordinate among multiple levels of hierarchy to reduce the number of files per directory. This is particularly useful for arrays with at least one very large dimension, which under the default encoding could yield an inordinate number of files in a single directory and degraded filesystem performance.

I have created a repository with a Python implementation of this encoding: Zarr-Fanout-CKE.
I am curious what your thoughts are. Some relevant choices for the spec:
- `fanout` is chosen, referring to the parameterized number of children per node (directory). Alternatives like `hierarchical` are imo too generic (the default encoding is also hierarchical), and `radix` might be too technical (what goes on under the hood).
- `max_children` is used as the user-defined parameter name over something like `base` or `radix`, as it is more intuitive to the end user what this parameter defines. `max_entries_per_dir` is more descriptive, but the term 'dir' might be too specific, as the CKE could also be used with stores other than standard filesystems.
- The only thing that might be a bit awkward with this parameterization is that the 'base' is `max_children - 1`, but again, this might not be relevant to the end user.