Core: Propagate Avro compression settings to manifest writers #15652

RussellSpitzer merged 4 commits into apache:main
Conversation
Nice idea @RussellSpitzer!

Great idea, I kept trying to figure out how to get that info in, and another graph makes sense. I'll regenerate it.

As a note for anyone checking those benchmarks, the only really compressible information we are adding to the manifests in these tests is the paths and the counts. The min and max values for each column are completely random, and that is most of the manifest.
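A quick stdlib illustration of that point: deflate (the algorithm behind the default GZIP codec) shrinks repetitive, path-like strings dramatically but gains almost nothing on random bytes, which is what random column bounds look like to a compressor. The class and data below are made up for the demo, not taken from the benchmark code:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

public class CompressDemo {
    // Deflate-compress a byte array and return the compressed size in bytes.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.size();
    }

    public static void main(String[] args) {
        // Repetitive, path-like data (shared prefixes) compresses very well.
        StringBuilder paths = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            paths.append("s3://bucket/warehouse/db/table/data/part-").append(i).append(".parquet\n");
        }
        byte[] pathBytes = paths.toString().getBytes(StandardCharsets.UTF_8);

        // Random bytes (a stand-in for random min/max column bounds) barely compress.
        byte[] randomBytes = new byte[pathBytes.length];
        new Random(42).nextBytes(randomBytes);

        System.out.println("paths:  " + pathBytes.length + " -> " + deflatedSize(pathBytes));
        System.out.println("random: " + randomBytes.length + " -> " + deflatedSize(randomBytes));
    }
}
```

With mostly-random bounds dominating the manifest, codec choice matters less for size than the benchmark's compressible columns might suggest.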
```java
String codec =
    metadata.property(
        TableProperties.MANIFEST_COMPRESSION, TableProperties.MANIFEST_COMPRESSION_DEFAULT);
builder.put(TableProperties.AVRO_COMPRESSION, codec);
```
This will have to be format specific when we add in Parquet as well
Adds a writerProperties Map<String, String> to ManifestWriter that is passed through to the underlying file format writer via set() calls. This allows callers to configure writer behavior (e.g. compression codec) when creating manifest writers through ManifestFiles.write(). Adds overloaded constructors to V3Writer and V4Writer, and a new ManifestFiles.write() overload that accepts writer properties.
SnapshotProducer now reads write.avro.compression-codec and write.avro.compression-level from table properties and passes them through to ManifestWriter via new ManifestFiles factory overloads. All ManifestWriter subclasses (V1–V4) apply these properties to the underlying Avro writer, allowing manifest compression to be configured per-table. Includes tests verifying default (deflate) and custom (snappy) compression across all format versions, and a JMH benchmark for comparing codec performance.
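The pass-through mechanism described above can be sketched with stand-in classes. `FileFormatWriterBuilder` and `SketchManifestWriter` are illustrative names for this demo, not Iceberg's actual classes; only the shape (a `Map<String, String>` forwarded via `set()` calls) mirrors the PR description:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for an underlying file-format writer builder (e.g. an Avro
// appender) that exposes generic set(key, value) configuration.
class FileFormatWriterBuilder {
    private final Map<String, String> config = new HashMap<>();

    FileFormatWriterBuilder set(String key, String value) {
        config.put(key, value);
        return this;
    }

    Map<String, String> config() {
        return config;
    }
}

// Illustrative manifest writer: forwards each writer property to the
// underlying builder via set() calls, as the PR description outlines.
class SketchManifestWriter {
    final FileFormatWriterBuilder builder = new FileFormatWriterBuilder();

    SketchManifestWriter(Map<String, String> writerProperties) {
        writerProperties.forEach(builder::set);
    }
}

public class WriterPropsDemo {
    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("write.avro.compression-codec", "zstd");
        props.put("write.avro.compression-level", "3");

        SketchManifestWriter writer = new SketchManifestWriter(props);
        System.out.println(writer.builder.config());
    }
}
```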
```java
    Long snapshotId,
    Long firstRowId) {
  return newWriter(
      formatVersion, spec, encryptedOutputFile, snapshotId, firstRowId, Collections.emptyMap());
```
I guess I'm old school, where emptyMap() was a singleton. But I read that in JDK 15+ Map.of() is now also a singleton, so it's not a huge deal.
Let me do a quick search to see which one we use more often.
| Method | Usages | Files |
|---|---|---|
| `Collections.emptyMap()` | 148 | 73 |
| `Map.of()` (zero-arg) | 25 | 22 |
Looks like I've been leaving my mark all over :)
I'm fine with either in this PR.
I think we should start using Map.of() since it is part of the JDK now, but we can do that everywhere in another PR before the release, along with changing the switches.
I have no problem doing some JDK modernization PRs
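For reference, both factories return immutable empty maps; the main historical difference the thread discusses is the identity (singleton) guarantee, which varies by JDK version and is not worth relying on either way. A minimal stdlib check of the shared behavior (demo class name is arbitrary):

```java
import java.util.Collections;
import java.util.Map;

public class EmptyMapDemo {
    public static void main(String[] args) {
        Map<String, String> legacy = Collections.emptyMap();
        Map<String, String> modern = Map.of();

        // Both are immutable: any mutation throws UnsupportedOperationException.
        boolean legacyThrew = false;
        try {
            legacy.put("k", "v");
        } catch (UnsupportedOperationException e) {
            legacyThrew = true;
        }
        boolean modernThrew = false;
        try {
            modern.put("k", "v");
        } catch (UnsupportedOperationException e) {
            modernThrew = true;
        }
        System.out.println(legacyThrew && modernThrew); // prints true
    }
}
```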
```java
 * @param writerProperties properties passed through to the underlying file writer
 * @return a manifest writer
 */
public static ManifestWriter<DataFile> write(
```
Really just a question: do we want to keep all of the write() overloads? Shall we deprecate the old ones?
```java
 */
public static ManifestWriter<DeleteFile> writeDeleteManifest(
    int formatVersion, PartitionSpec spec, EncryptedOutputFile outputFile, Long snapshotId) {
  return writeDeleteManifest(formatVersion, spec, outputFile, snapshotId, Collections.emptyMap());
```
Same question about Map.of().
Noted above, we use Collections.emptyMap() more often. Although this may just be me leaving Collections.emptyMap() all throughout the codebase, because in old versions of Java it was a singleton and Map.of() was not.
core/src/test/java/org/apache/iceberg/TestManifestWriterVersions.java (review comments resolved)
@pvary Thanks for your review. I think I've closed out everything except the discussion on Collections.emptyMap() vs Map.of() and deprecating older methods.
pvary left a comment:

Thanks @RussellSpitzer!
LGTM +1
Any other comments, @danielcweeks?
Merged! Thanks @danielcweeks, @pvary for the review!

While working on #15634 I did some investigation into compression codecs, since previously we did not allow changing the compression from the default for manifests. Briefly checking, it looked like ZSTD was probably a much better option than the default GZIP, but to my surprise we had no user-facing properties to allow changing it. Even the existing write.avro properties are ignored by our manifest-writing code.
To fix this, I plumb those properties through. I think as we move to V4 Adaptive Metadata we will probably want to reconsider our choice of GZIP as a default, or even use uncompressed files for the root manifest. Benchmarks below.
It can be argued that this is a breaking change for folks who previously specified a different Avro compression codec but did not expect their manifest compression to change. IMHO, this is more of a bug fix for them, since I would expect a change to the Avro compression parameter to change the manifest-writing compression as well.
For V4, we'll also propagate the Parquet ones.
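To illustrate why exposing a compression-level knob (write.avro.compression-level) matters alongside the codec choice, here is a stdlib sketch using java.util.zip.Deflater levels, which span roughly the same speed/size tradeoff as deflate/GZIP settings. This is an illustration only, not the PR's ManifestBenchmark:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class LevelDemo {
    // Compress input at the given deflate level and return the output size.
    static int sizeAt(int level, byte[] input) {
        Deflater d = new Deflater(level);
        d.setInput(input);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) {
            out.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return out.size();
    }

    public static void main(String[] args) {
        // Repetitive, manifest-like path data; made up for the demo.
        byte[] data = "s3://bucket/warehouse/db/table/data/part-00000.parquet\n"
            .repeat(2000)
            .getBytes(StandardCharsets.UTF_8);

        System.out.println("level 1 (BEST_SPEED):       " + sizeAt(Deflater.BEST_SPEED, data));
        System.out.println("level 9 (BEST_COMPRESSION): " + sizeAt(Deflater.BEST_COMPRESSION, data));
    }
}
```

Higher levels trade CPU time for smaller output; for small, frequently written files like manifests, a low or mid level is often the better default.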
Summary

- Adds `writerProperties` support to `ManifestWriter` and `ManifestFiles` factory methods, allowing configuration to be passed through to the underlying Avro writer for all format versions (V1–V4).
- `SnapshotProducer` now reads `write.avro.compression-codec` and `write.avro.compression-level` from table properties and forwards them when writing manifests, enabling per-table manifest compression configuration.
- Adds a JMH benchmark (`ManifestBenchmark`) for comparing codec write/read performance.