Skip to content

Core: Propagate Avro compression settings to manifest writers#15652

Merged
RussellSpitzer merged 4 commits intoapache:mainfrom
RussellSpitzer:ManifestWriterProperties
Mar 20, 2026
Merged

Core: Propagate Avro compression settings to manifest writers#15652
RussellSpitzer merged 4 commits intoapache:mainfrom
RussellSpitzer:ManifestWriterProperties

Conversation

@RussellSpitzer
Copy link
Member

@RussellSpitzer RussellSpitzer commented Mar 16, 2026

While working on #15634 I did some investigation about compression codecs since previously we did not allow changing the compression from the default for manifests. Briefly checking it looked like ZSTD was probably a much better option than the default GZIP but to my surprise we had no user facing properties to allow for changing that. Even the existing write.avro properties are ignored by our manifest writing code.

To fix this I plumb those properties through. I think as we move to V4 Adaptive Metadata we probably will want to reconsider our choice of GZIP as a default or even use uncompressed files for the root manifest. Benchmarks below

It can be argued that this is a breaking change for those folks who previously specified a different Avro compression codec but did not expect their manifest compression to change. IMHO, this is more of a bug fix for them since I would expect a change to the AVRO compression parameter to change the manifest writing compression as well.

For V4, we'll also propagate the Parquet ones.

Summary

  • Adds writerProperties support to ManifestWriter and ManifestFiles factory methods, allowing configuration to be passed through to the underlying Avro writer for all format versions (V1–V4).
  • SnapshotProducer now reads write.avro.compression-codec and write.avro.compression-level from table properties and forwards them when writing manifests, enabling per-table manifest compression configuration.
  • Includes tests verifying default (deflate) and custom (snappy) compression across all format versions, and a JMH benchmark (ManifestBenchmark) for comparing codec write/read performance.
image image

Cursor - cluade-4.6-opus-high

@github-actions github-actions bot added the core label Mar 16, 2026
@pvary
Copy link
Contributor

pvary commented Mar 16, 2026

Nice idea @RussellSpitzer!
Maybe we could even default to snappy?
Maybe the next graph should contain the sizes in a different chart, so it is easier to see not just the cost, but the gains as well 😄

@RussellSpitzer
Copy link
Member Author

Nice idea @RussellSpitzer! Maybe we could even default to snappy? Maybe the next graph should contain the sizes in a different chart, so it is easier to see not just the cost, but the gains as well 😄

Great Idea, I kept trying to figure out how to get that info in and another graph makes sense. I'll regenerate it

@RussellSpitzer
Copy link
Member Author

image

@RussellSpitzer
Copy link
Member Author

As a note for anyone checking those benchmarks, the only really compressible information we are adding to the manifests in these tests are the paths and the counts. The "min and max" for each column are just completely random and that's most of the manifest.

String codec =
metadata.property(
TableProperties.MANIFEST_COMPRESSION, TableProperties.MANIFEST_COMPRESSION_DEFAULT);
builder.put(TableProperties.AVRO_COMPRESSION, codec);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will have to be format specific when we add in Parquet as well

Adds a writerProperties Map<String, String> to ManifestWriter that is
passed through to the underlying file format writer via set() calls.
This allows callers to configure writer behavior (e.g. compression
codec) when creating manifest writers through ManifestFiles.write().

Adds overloaded constructors to V3Writer and V4Writer, and a new
ManifestFiles.write() overload that accepts writer properties.
SnapshotProducer now reads write.avro.compression-codec and
write.avro.compression-level from table properties and passes them
through to ManifestWriter via new ManifestFiles factory overloads.
All ManifestWriter subclasses (V1–V4) apply these properties to the
underlying Avro writer, allowing manifest compression to be
configured per-table.

Includes tests verifying default (deflate) and custom (snappy)
compression across all format versions, and a JMH benchmark for
comparing codec performance.
@RussellSpitzer RussellSpitzer force-pushed the ManifestWriterProperties branch from fc879df to d8a7f0a Compare March 16, 2026 19:07
Long snapshotId,
Long firstRowId) {
return newWriter(
formatVersion, spec, encryptedOutputFile, snapshotId, firstRowId, Collections.emptyMap());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not Map.of()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess i'm old school where emptyMap was a singleton. But I read that in JDK 15+ Map.of() is now also a singleton so it's not a huge deal.

Let me do a quick search to see what we use more often

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method Usages Files
Collections.emptyMap() 148 73
Map.of() (zero-arg) 25 22

Looks like I've been leaving my mark all over :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with both in this PR.
I think we should start using Map.of() as it is part of the JDK now, but we can do it in another PR everywhere before the release along with changing the switch-es

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no problem doing some JDK modernization PRs

* @param writerProperties properties passed through to the underlying file writer
* @return a manifest writer
*/
public static ManifestWriter<DataFile> write(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really just a question. Do we want to keep all of the write versions? Shall we deprecate the old ones?

*/
public static ManifestWriter<DeleteFile> writeDeleteManifest(
int formatVersion, PartitionSpec spec, EncryptedOutputFile outputFile, Long snapshotId) {
return writeDeleteManifest(formatVersion, spec, outputFile, snapshotId, Collections.emptyMap());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question about the Map.of()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted above, we use Collection.emptyMap() more often. Although this may just be me leaving collection.emptyMap all throughout the codebase because in old versions of Java it was a singleton and Map.of() was not.

@RussellSpitzer
Copy link
Member Author

@pvary Thanks for your review, I think I've closed out everything except for the discussion on collections.emptyMap vs map.of and deprecating older methods.

Copy link
Contributor

@pvary pvary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @RussellSpitzer!
LGTM +1

@RussellSpitzer
Copy link
Member Author

Any other comments @danielcweeks ?

@RussellSpitzer
Copy link
Member Author

Merged! Thanks @danielcweeks , @pvary for the review!

@RussellSpitzer RussellSpitzer merged commit 08ac784 into apache:main Mar 20, 2026
34 checks passed
@RussellSpitzer RussellSpitzer deleted the ManifestWriterProperties branch March 20, 2026 17:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants