feat: add conditional codec #27

mkitti · 2025-10-15T23:42:40Z

This pull request adds the conditional (formerly optional) codec as an extension.

Abstract

The conditional codec is a meta-codec that enables or disables an encapsulated sequence of other codecs on a per-chunk basis. It achieves this by wrapping a user-defined list of codecs in its configuration and prepending a bitfield header to the byte stream of each chunk. Each bit in the header corresponds to a codec in the encapsulated list, indicating whether it should be applied or skipped. This allows for dynamic, data-dependent optimization of the codec pipeline, such as disabling compression when it provides no benefit.
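The mechanism in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the specified format: it assumes a one-byte header, assumes that bit `i` (least significant bit first) corresponds to the `i`-th encapsulated codec, and uses `zlib` and `bz2` purely as stand-in bytes-to-bytes codecs. The names `CODECS`, `encode_chunk`, and `decode_chunk` are hypothetical.

```python
import bz2
import zlib

# Hypothetical codec table: (encode, decode) pairs of bytes-to-bytes functions.
CODECS = [
    (zlib.compress, zlib.decompress),
    (bz2.compress, bz2.decompress),
]


def encode_chunk(data: bytes, enabled: list[bool]) -> bytes:
    """Apply the enabled codecs in order and prepend a one-byte bitfield."""
    header = 0
    for i, ((encode, _), on) in enumerate(zip(CODECS, enabled)):
        if on:
            data = encode(data)
            header |= 1 << i  # assumption: bit i (LSB first) maps to codec i
    return bytes([header]) + data


def decode_chunk(blob: bytes) -> bytes:
    """Read the bitfield, then undo the enabled codecs in reverse order."""
    header, data = blob[0], blob[1:]
    for i in reversed(range(len(CODECS))):
        if header & (1 << i):
            data = CODECS[i][1](data)
    return data
```

With every bit cleared, the chunk is stored unchanged behind the header, which corresponds to the "disable compression when it provides no benefit" case.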

mkitti · 2025-10-15T23:44:48Z

Draft on HackMD: https://hackmd.io/@zarr/rJRh51apex

mkitti · 2025-10-16T08:51:56Z

codecs/conditional/README.md

+
+* **Parallel I/O Within a Shard:** With predictable offsets, multiple threads or processes can issue concurrent write operations to different chunks within the same shard file using overlapped I/O.
+
+This transforms a shard from a monolithic object that must be written sequentially into a parallel-access container. It significantly boosts write throughput in high-performance computing (HPC) and other concurrent data processing environments where multiple workers need to write to the same dataset simultaneously.
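As a rough sketch of the idea in this excerpt, the following assumes every chunk occupies a fixed-size slot (header plus an upper bound on the encoded payload), so the offset of chunk `i` is simply `i * SLOT_NBYTES`. The constants and function names are hypothetical, the shard index is omitted, and `os.pwrite` is available on POSIX systems only.

```python
import os
from concurrent.futures import ThreadPoolExecutor

HEADER_NBYTES = 1  # assumption: one-byte conditional header per chunk
MAX_NBYTES = 16    # assumption: upper bound on the encoded payload size
SLOT_NBYTES = HEADER_NBYTES + MAX_NBYTES


def chunk_offset(index: int) -> int:
    """Offset of chunk `index` in a padded shard with fixed-size slots."""
    return index * SLOT_NBYTES


def write_chunk(fd: int, index: int, encoded: bytes) -> None:
    """Write one encoded chunk at its precomputed offset (POSIX pwrite)."""
    if len(encoded) > SLOT_NBYTES:
        raise ValueError("encoded chunk exceeds its slot")
    os.pwrite(fd, encoded, chunk_offset(index))


def write_shard(path: str, chunks: list[bytes]) -> None:
    """Reserve the padded layout, then write all chunks concurrently."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        os.ftruncate(fd, len(chunks) * SLOT_NBYTES)
        with ThreadPoolExecutor() as pool:
            # list() waits for completion and re-raises any worker exception.
            list(pool.map(lambda ic: write_chunk(fd, *ic), enumerate(chunks)))
    finally:
        os.close(fd)
```

Because no writer needs to know the encoded size of any other chunk, the writes do not have to be ordered or coordinated.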


@jakirkham I included this section based on our discussion of "padding" in shards. Is this along the lines of what you had in mind?

normanrz · 2025-10-16T08:58:07Z

Thanks for writing this up. Especially appreciate all the motivating examples.

Initially, my mental model was that this would work similarly to an "optional" data type in various programming languages. The difference from your model is that it would only hold a single codec that can be toggled. Have you considered that?

normanrz · 2025-10-16T09:02:51Z

Another note, by toggling codecs off selectively, it could become possible to create invalid codec pipelines. Primarily, if the array-to-bytes codec gets toggled off. I think it would be good to have a normative section about what codecs this higher-order codec can be applied to.

mkitti · 2025-10-16T13:27:23Z

> Another note, by toggling codecs off selectively, it could become possible to create invalid codec pipelines. Primarily, if the array-to-bytes codec gets toggled off. I think it would be good to have a normative section about what codecs this higher-order codec can be applied to.

It only survived in the Discussion section, but I only meant to define a bytes-to-bytes codec here. That means all encapsulated codecs must also be bytes-to-bytes codecs.

An array-to-array optional codec would work pretty similarly though in that one can toggle codecs arbitrarily. However, there is an additional validity issue in that certain array-to-array codecs may have constraints on the dimensionality of their array inputs and outputs.

An array-to-bytes optional codec would need to be mutually exclusive. In that case, I would argue that we may want an array-to-bytes optional codec to work differently. The header might be interpreted as an integer index rather than a bitfield. A single byte could select among 256 array-to-bytes codecs.
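To make the integer-index idea concrete, here is a minimal sketch in which a one-byte header selects exactly one array-to-bytes encoding. It is only an illustration: the candidates are plain `struct` packers over Python lists rather than real Zarr codecs, and `CANDIDATES`, `encode_selected`, and `decode_selected` are hypothetical names.

```python
import struct


def _packer(fmt: str, size: int):
    """Build an (encode, decode) pair for a fixed little-endian element format."""
    def encode(values: list) -> bytes:
        return struct.pack(f"<{len(values)}{fmt}", *values)

    def decode(blob: bytes) -> list:
        return list(struct.unpack(f"<{len(blob) // size}{fmt}", blob))

    return encode, decode


# Hypothetical candidates; exactly one is chosen per chunk.
CANDIDATES = [_packer("d", 8), _packer("f", 4)]


def encode_selected(values: list, index: int) -> bytes:
    """Encode with candidate `index` and record the choice in a one-byte header."""
    if not 0 <= index < min(len(CANDIDATES), 256):
        raise ValueError("index out of range")
    return bytes([index]) + CANDIDATES[index][0](values)


def decode_selected(blob: bytes) -> list:
    """Read the index byte and decode with the matching candidate."""
    return CANDIDATES[blob[0]][1](blob[1:])
```

Unlike the bitfield, the header here is mutually exclusive by construction: a single value can never enable two array-to-bytes codecs at once.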

In summary, only a bytes-to-bytes optional codec is intended to be defined here. The encapsulated pipeline thus can only contain other bytes-to-bytes codecs.

Would it even make sense to have a single codec that could be an array-to-array codec, an array-to-bytes codec, or a bytes-to-bytes codec? I believe these should be three separate codecs. Could those three codecs share a name?

jbms · 2025-10-16T15:44:56Z

Other related work to consider: OpenZL (https://openzl.org/) from Facebook

In regards to future expansion: the current design allows future expansion at the end of the codec list but not at the beginning or middle. For example it would not be possible to add an additional pre-filter like shuffle afterwards. I suppose this can be mitigated by duplicating codecs in the list as needed.

mkitti · 2025-10-16T22:11:52Z

> In regards to future expansion: the current design allows future expansion at the end of the codec list but not at the beginning or middle. For example it would not be possible to add an additional pre-filter like shuffle afterwards. I suppose this can be mitigated by duplicating codecs in the list as needed

We currently lack an identity bytes-to-bytes codec that could serve as a placeholder. That would be a codec whose output is exactly the input byte sequence or stream.

The optional codec could serve as an identity codec if it encapsulated no codecs and had header_bits set or default to 0 as in the following JSON.

{
    "name": "optional",
    "configuration": {
        "codecs": []
    }
}

If we want to reserve space to prepend a codec, we could start the encapsulated codec chain with an optional codec configured as an identity codec.

{
    "name": "optional",
    "configuration": {
        "codecs": [
            {
                "name": "optional",
                "configuration": {
                    "codecs": []
                }
            },
            // additional codecs ...
        ]
    }
}

jbms · 2025-10-16T22:19:16Z

> > In regards to future expansion: the current design allows future expansion at the end of the codec list but not at the beginning or middle. For example it would not be possible to add an additional pre-filter like shuffle afterwards. I suppose this can be mitigated by duplicating codecs in the list as needed
>
> We currently lack an identity bytes-to-bytes codec that could serve as a placeholder. That would be a codec whose output is exactly the input byte sequence or stream.
>
> The optional codec could serve as an identity codec if it encapsulated no codecs and had header_bits set or default to 0 as in the following JSON.
>
> {
>     "name": "optional",
>     "configuration": {
>         "codecs": []
>     }
> }
>
> If we want to reserve space to prepend a codec, we could start the encapsulated codec chain with an optional codec configured as an identity codec.

That is true, but using null to explicitly mean "reserved for future use" may be better, because an implementation would then fail if it encountered it.

mkitti · 2025-10-17T08:29:40Z

Do you mean that the element of the codec list should itself be null and that null is the identity codec?

{
    "name": "optional",
    "configuration": {
        "codecs": [
            null,
            // additional codecs ...
        ]
    }
}

or do you mean that the codec list itself should be null?

{
    "name": "optional",
    "configuration": {
        "codecs": [
            {
                "name": "optional",
                "configuration": {
                    "codecs": null
                }
            },
            // additional codecs ...
        ]
    }
}

mkitti · 2025-10-17T08:35:35Z

I think you mean that the element of the codec list should be null, such that if the "null codec" were enabled there would be an error. null does not mean the identity codec but explicitly one that is invalid and would generate an error.

mkitti · 2025-10-17T09:15:03Z

It has been proposed to change the name of this codec to conditional.

jbms · 2025-10-17T15:58:25Z

> I think you mean that the element of the codec list should be null, such that if the "null codec" were enabled there would be an error. null does not mean the identity codec but explicitly one that is invalid and would generate an error.

Yes that's what I meant.

mkitti · 2025-10-18T08:45:40Z

After thinking about this more, perhaps we should add an explicit ErrorCodec that raises an error if enabled and can carry a specific error message.

Depending on null to do this sounds like a continuation of the "billion dollar mistake".
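A minimal sketch of what such a placeholder could look like, under the assumption that codecs expose `encode` and `decode` methods and that header bit `i` enables codec `i`. `ErrorCodec` is not an existing codec, and `ReservedSlotError` and `decode_with_header` are hypothetical names used only for illustration.

```python
class ReservedSlotError(RuntimeError):
    """Raised when a reserved (placeholder) codec slot is enabled."""


class ErrorCodec:
    """Hypothetical bytes-to-bytes placeholder that fails if it is ever used."""

    def __init__(self, message: str = "codec slot reserved for future use"):
        self.message = message

    def encode(self, data: bytes) -> bytes:
        raise ReservedSlotError(self.message)

    def decode(self, data: bytes) -> bytes:
        raise ReservedSlotError(self.message)


def decode_with_header(header: int, data: bytes, codecs: list) -> bytes:
    """Undo enabled codecs in reverse order; a set bit on an ErrorCodec raises."""
    for i in reversed(range(len(codecs))):
        if header & (1 << i):
            data = codecs[i].decode(data)
    return data
```

As long as the reserved bit stays cleared, the placeholder is never invoked and behaves like an identity slot; once the bit is set, the reader fails with the configured message instead of silently passing data through.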

normanrz

This PR meets the requirements for being merged. Let me know when you're ready to.

mkitti · 2025-10-20T10:57:40Z

~~The file needs to be moved from the optional directory to the conditional directory.~~

edit: Resolved in 5528471

mkitti · 2025-10-20T11:28:01Z

In anticipation of a future array-to-array or array-to-bytes conditional codec, is there any provision that we would want to make here?

For example, we could add from_type and to_type parameters where the only currently valid value is "bytes".

Alternatively, the AA or AB codecs could just have a different name. The AB codec would probably be more like a selection codec where the header contains an integer to select exactly one or zero codecs.

jbms · 2025-10-20T13:21:37Z

An array to array version of this has the issue of how to encode the extra bits, since its only output is an array. I think it would need to be an array to bytes codec, where the sub-codecs can be any kind but the selected subset must be in a valid order and include exactly one array to bytes codec. Alternatively there could be a way to specify a non-conditional array to bytes codec to use.

mkitti · 2025-10-20T13:32:55Z

For now, I am thinking about how to distinguish this bytes-to-bytes codec from another conceptually similar codec of a similar name that may be implemented in the future.

That said, I am now convinced that an array-to-x analog of this would be different enough to require a completely different name.

An array-to-array version might require a per-chunk metadata facility that we do not have yet.

mkitti · 2025-10-20T15:00:37Z

Should the condition be serialized somehow?

* The condition does not need to be serialized. It is not needed to read the data. Also, another implementation does not need to use the same condition when writing data.
* The condition does need to be serialized in order to round-trip the encoding of the array.

While I do not think we should require the condition to be serialized, would it be worthwhile to have a common way of describing a few conditions as anticipated here? For example, we could have an optional condition field. Potential conditions could be:

* "if_available": use the encapsulated codec chain if it is available.
* "disabled": do not apply the encapsulated codec chain (reserve it for later use?).
* "if_smaller_size": only encode if the number of output bytes is smaller than the number of input bytes.
* "if_smaller_or_equal_size": only encode if the number of output bytes is smaller than or equal to the number of input bytes.
* { "max_nbytes": N }: the encoded chunk, not including the conditional header, must be at most N bytes, where N is an integer.

The alternative is that this all could be additional codecs in and of themselves.
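For illustration, the conditions listed above could be evaluated by a writer roughly as follows. This is a sketch under stated assumptions: the `condition` field is only a proposal in this thread, `should_apply` and its `available` parameter are hypothetical, and `max_nbytes` is interpreted here as "keep the encoded output only if it fits within N bytes".

```python
import zlib
from typing import Union

Condition = Union[str, dict]


def should_apply(condition: Condition, n_in: int, n_out: int,
                 available: bool = True) -> bool:
    """Decide whether a writer keeps the encoded output for one chunk."""
    if condition == "if_available":
        return available
    if condition == "disabled":
        return False
    if condition == "if_smaller_size":
        return n_out < n_in
    if condition == "if_smaller_or_equal_size":
        return n_out <= n_in
    if isinstance(condition, dict) and "max_nbytes" in condition:
        # Inclusive bound on the encoded size, conditional header excluded.
        return n_out <= condition["max_nbytes"]
    raise ValueError(f"unknown condition: {condition!r}")


# Usage: compare raw and compressed sizes for a highly compressible chunk.
raw = b"a" * 1000
encoded = zlib.compress(raw)
keep_encoded = should_apply("if_smaller_size", len(raw), len(encoded))
```

A reader never calls this function: it only inspects the per-chunk header, which is why the condition itself does not have to be stored to decode the data.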

d-v-b · 2025-10-20T15:17:35Z

> The condition does need to be serialized in order to round-trip the encoding of the array.

When reading, you can copy the bitmask, so I don't think you need the original decision-making procedure to re-generate the same bytes. But I also don't think byte-identical round-tripping is important here. Ensuring encoded data can be decoded is probably a better objective. I think the reason for choosing to use a codec or not is akin to the order in which sub-chunks are written -- basically a runtime thing that readers don't need to know about.

mkitti · 2025-10-23T14:19:09Z

I'm good for this to be merged now as is.

Add optional codec

0107674

normanrz reviewed Oct 16, 2025

codecs/optional/README.md (outdated)

mkitti added 3 commits October 16, 2025 04:00

Enhance README with detailed optional codec explanation

8f3a70d

Clarify the optional codec's upper bound mechanism and its benefits for sharded stores. Include detailed calculations for chunk offsets and emphasize performance improvements in write operations.

Fix code block syntax and typo in README

8c4096a

Corrected code block syntax highlighting and fixed a typo in the example. Use `python` instead of `python=` for code blocks.

Update README with details on shard compaction

e5fb4db

Clarify the impact of padded shard layout on disk size and explain the shard compaction process for long-term storage optimization.

mkitti commented Oct 16, 2025

mkitti changed the title ~~Add optional codec~~ feat: add conditional codec Oct 17, 2025

Rename 'Optional Codec' to 'Conditional Codec'

e1c1e58

normanrz approved these changes Oct 20, 2025

d-v-b reviewed Oct 20, 2025

codecs/optional/README.md (outdated)

Update codecs/optional/README.md

19dad17

Co-authored-by: Davis Bennett

Move directory from optional to conditional

5528471

normanrz approved these changes Oct 28, 2025

normanrz merged commit 0164eeb into zarr-developers:main Oct 28, 2025

LDeakin mentioned this pull request Nov 2, 2025

Support zarr-developers/zarr-extensions#27: conditional codec zarrs/zarrs#295

Open

