feat: add conditional codec #27
Draft on HackMD: https://hackmd.io/@zarr/rJRh51apex |
Clarify the optional codec's upper bound mechanism and its benefits for sharded stores. Include detailed calculations for chunk offsets and emphasize performance improvements in write operations.
Corrected code block syntax highlighting and fixed a typo in the example. Use `python` instead of `python=` for code blocks.
Clarify the impact of padded shard layout on disk size and explain the shard compaction process for long-term storage optimization.
* **Parallel I/O Within a Shard:** With predictable offsets, multiple threads or processes can issue concurrent write operations to different chunks within the same shard file using overlapped I/O.

This transforms a shard from a monolithic object that must be written sequentially into a parallel-access container. It significantly boosts write throughput in high-performance computing (HPC) and other concurrent data processing environments where multiple workers need to write to the same dataset simultaneously.
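As a rough illustration of the predictable-offset idea in the excerpt above, here is a minimal Python sketch. It assumes a padded layout in which every chunk gets a fixed-size slot equal to an upper bound on its encoded size; `slot_offset`, `write_chunk`, `write_shard_parallel`, and `slot_size` are hypothetical names for this example and are not part of the proposal or of any Zarr library.

```python
from concurrent.futures import ThreadPoolExecutor


def slot_offset(chunk_index: int, slot_size: int, data_start: int = 0) -> int:
    """Byte offset of a chunk's slot, assuming fixed-size (padded) slots."""
    return data_start + chunk_index * slot_size


def write_chunk(path: str, chunk_index: int, payload: bytes, slot_size: int) -> None:
    """Write one encoded chunk into its own slot using an independent file handle."""
    if len(payload) > slot_size:
        raise ValueError("encoded chunk exceeds the assumed upper bound")
    with open(path, "r+b") as f:
        f.seek(slot_offset(chunk_index, slot_size))
        f.write(payload)


def write_shard_parallel(path: str, payloads: list, slot_size: int) -> None:
    """Pre-size the shard file, then write every chunk concurrently."""
    with open(path, "wb") as f:
        f.truncate(len(payloads) * slot_size)  # unused slot space stays zero-filled
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(write_chunk, path, i, p, slot_size)
            for i, p in enumerate(payloads)
        ]
        for fut in futures:
            fut.result()  # re-raise any error from a worker
```

Because each offset depends only on the chunk index, no writer has to wait for another chunk's encoded size. A reader would still need each chunk's actual length (for example from a shard index) to ignore the padding.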
@jakirkham I included this section based on our discussion of "padding" in shards. Is this along the lines of what you had in mind?
|
Thanks for writing this up. Especially appreciate all the motivating examples. Initially, my mental model was that this would work similar to an "optional" data type in various programming languages. The difference to your model is that it would only hold a single codec that can be toggled. Have you considered that? |
|
Another note, by toggling codecs off selectively, it could become possible to create invalid codec pipelines. Primarily, if the array-to-bytes codec gets toggled off. I think it would be good to have a normative section about what codecs this higher-order codec can be applied to. |
It only survived in the Discussion section, but I only meant to define a bytes-to-bytes codec here. That means all encapsulated codecs must also be bytes-to-bytes codecs.
An array-to-array optional codec would work similarly, in that one can toggle codecs arbitrarily. However, there is an additional validity issue: certain array-to-array codecs may have constraints on the dimensionality of their array inputs and outputs.
An array-to-bytes optional codec would need to be mutually exclusive. In that case, I would argue that we may want an array-to-bytes optional codec to work differently. The header might be interpreted as an integer index rather than a bitfield. A single byte could select among 256 array-to-bytes codecs.
In summary, only a bytes-to-bytes optional codec is intended to be defined here. The encapsulated pipeline thus can only contain other bytes-to-bytes codecs.
Would it even make sense to have a single codec that could be an array-to-array codec, an array-to-bytes codec, or a bytes-to-bytes codec? I believe these should be three separate codecs. Could those three codecs share a name?
|
Other related work to consider: OpenZL (https://openzl.org/) from Facebook.
Regarding future expansion: the current design allows expansion at the end of the codec list but not at the beginning or in the middle. For example, it would not be possible to add an additional pre-filter such as shuffle afterwards. I suppose this can be mitigated by duplicating codecs in the list as needed.
We currently lack an identity bytes-to-bytes codec that could serve as a placeholder. That would be a codec whose output is exactly the input byte sequence or stream. The optional codec could serve as an identity codec if it encapsulated no codecs and had
{
  "name": "optional",
  "configuration": {
    "codecs": []
  }
}
If we want to reserve space to prepend a codec, we could start the encapsulated codec chain with an
{
  "name": "optional",
  "configuration": {
    "codecs": [
      {
        "name": "optional",
        "configuration": {
          "codecs": []
        }
      },
      // additional codecs ...
    ]
  }
}
That is true, but using null to explicitly mean "reserved for future use" may be better, because an implementation would then fail if it encounters it.
|
Do you mean that the element of the codec list should itself be or do you mean that the codec list itself should be |
|
I think you mean that element of the codec list should be |
|
It has been proposed to change the name of this codec to |
Yes that's what I meant. |
|
After thinking about this more, perhaps we should add an explicit Depending on |
This PR meets the requirements for being merged. Let me know when you're ready to.
|
edit: Resolved in 5528471 |
Co-authored-by: Davis Bennett <[email protected]>
|
In anticipation of a future array-to-array or array-to-bytes conditional codec, is there any provision that we would want to make here? For example, we could make Alternatively, the AA or AB codecs could just have a different name. The AB codec would probably be more like a |
|
An array to array version of this has the issue of how to encode the extra bits, since its only output is an array. I think it would need to be an array to bytes codec, where the sub-codecs can be any kind but the selected subset must be in a valid order and include exactly one array to bytes codec. Alternatively there could be a way to specify a non-conditional array to bytes codec to use. |
|
For now, I am thinking about how to distinguish this bytes-to-bytes codec from another conceptually similar codec with a similar name that may be implemented in the future. That said, I am now convinced that an array-to-x analog of this would be different enough to require a completely different name. An array-to-array version might require a per-chunk metadata facility that we do not have yet.
|
Should the
While I do not think we should require the condition to be serialized, would it be worthwhile to have a common way of describing a few conditions as anticipated here? For example, we could have an optional
The alternative is that this all could be additional codecs in and of themselves. |
When reading, you can copy the bitmask, so I don't think you need the original decision-making procedure to re-generate the same bytes. But I also don't think byte-identical round-tripping is important here. Ensuring encoded data can be decoded is probably a better objective. I think the reason for choosing to use a codec or not is akin to the order in which sub-chunks are written -- basically a runtime thing that readers don't need to know about. |
|
I'm good for this to be merged now as is. |
This pull request adds the `conditional` (formerly `optional`) codec as an extension.

Abstract

The `conditional` codec is a meta-codec that enables or disables an encapsulated sequence of other codecs on a per-chunk basis. It achieves this by wrapping a user-defined list of codecs in its configuration and prepending a bitfield header to the byte stream of each chunk. Each bit in the header corresponds to a codec in the encapsulated list, indicating whether it should be applied or skipped. This allows for dynamic, data-dependent optimization of the codec pipeline, such as disabling compression when it provides no benefit.
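
The mechanism described in the abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the specified wire format: it assumes the header occupies `ceil(n / 8)` bytes in little-endian order with bit `i` mapped to the `i`-th encapsulated codec, and it uses "keep the codec only if the output shrinks" as one possible condition. The names `header_size`, `conditional_encode`, and `conditional_decode` are hypothetical.

```python
import zlib
from typing import Callable, List, Tuple

# A codec is modeled here as an (encode, decode) pair of bytes-to-bytes functions.
Codec = Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]


def header_size(n_codecs: int) -> int:
    """Assumed header length: one bit per encapsulated codec, rounded up to bytes."""
    return (n_codecs + 7) // 8


def conditional_encode(data: bytes, codecs: List[Codec]) -> bytes:
    """Apply each codec only if it shrinks the data; record the choices in a bitfield."""
    mask = 0
    for i, (encode, _decode) in enumerate(codecs):
        candidate = encode(data)
        if len(candidate) < len(data):  # example condition, chosen at write time
            data = candidate
            mask |= 1 << i
    return mask.to_bytes(header_size(len(codecs)), "little") + data


def conditional_decode(blob: bytes, codecs: List[Codec]) -> bytes:
    """Read the bitfield, then undo the enabled codecs in reverse order."""
    n = header_size(len(codecs))
    mask = int.from_bytes(blob[:n], "little")
    data = blob[n:]
    for i in reversed(range(len(codecs))):
        if mask & (1 << i):
            data = codecs[i][1](data)
    return data


# Example pipeline with a single zlib codec from the standard library.
ZLIB: Codec = (zlib.compress, zlib.decompress)
```

With this sketch, a highly repetitive chunk is stored compressed with its bit set, while a chunk that zlib cannot shrink is stored raw behind a zero header. The decoder needs only the header, not the condition that produced it.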