Description
Apache Iceberg version
1.10.0
Query engine
Spark
Please describe the bug 🐞
SPJ Broken After Partition Evolution (Iceberg 1.10.0)
Summary
Storage-Partitioned Join (SPJ) stops working for MERGE INTO operations after a table undergoes partition evolution, even when all data has been rewritten into the new partition layout and old snapshots/specs have been fully cleaned up. The workaround is to recreate the table from scratch with the desired partition spec.
Environment
- Iceberg Spark Runtime: `iceberg-spark-runtime-3.5_2.12-1.10.0`
- Spark: 3.5 (AWS Glue 5.0)
- Catalog: AWS Glue Data Catalog
SPJ Configuration
The following Spark settings are configured and work correctly for tables that were created with their partition spec from the start (no evolution history):

```
spark.sql.sources.v2.bucketing.enabled = true
spark.sql.sources.v2.bucketing.pushPartValues.enabled = true
spark.sql.sources.v2.bucketing.pushPartKeys.enabled = true
spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled = true
spark.sql.requireAllClusterKeysForCoPartition = false
spark.sql.iceberg.planning.preserve-data-grouping = true
```
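For completeness, a sketch of applying the same properties at the session level with `SET` (property names copied verbatim from the list above):

```sql
SET spark.sql.sources.v2.bucketing.enabled = true;
SET spark.sql.sources.v2.bucketing.pushPartValues.enabled = true;
SET spark.sql.sources.v2.bucketing.pushPartKeys.enabled = true;
SET spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled = true;
SET spark.sql.requireAllClusterKeysForCoPartition = false;
SET spark.sql.iceberg.planning.preserve-data-grouping = true;
```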
Table Definition
Tables are created with Iceberg format version 2, bucket partitioning on `_ID`, and hash distribution for writes and merges:

```sql
CREATE TABLE <catalog>.<db>.<table>
USING iceberg
PARTITIONED BY (bucket(64, _ID))
TBLPROPERTIES (
  'commit.manifest.min-count-to-merge' = '50',
  'format' = 'parquet',
  'format-version' = '2',
  'history.expire.max-snapshot-age-ms' = '604800000',
  'history.expire.min-snapshots-to-keep' = '10',
  'write.distribution-mode' = 'hash',
  'write.merge.distribution-mode' = 'hash',
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '10',
  'write.target-file-size-bytes' = '536870912'
)
```

Steps Performed
- Table originally created with `bucket(4, _ID)`
- Evolved partition via `ALTER TABLE DROP PARTITION FIELD` / `ALTER TABLE ADD PARTITION FIELD bucket(64, _ID)`
- Ran `rewrite_data_files` with `rewrite-all: true`; all data files rewritten under the new partition layout
- Ran `expire_snapshots` with `retain_last: 1` and `older_than_minutes: 1`; old snapshots fully removed
- Ran `remove_orphan_files`; old data files cleaned up from S3
- Verified via `DESCRIBE TABLE EXTENDED`: only `Part 0: bucket(64, _ID)` remains (a single clean spec)
- Staging table created with matching `PARTITIONED BY (bucket(64, _ID))`
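For reproducibility, the steps above can be sketched in Spark SQL roughly as follows. Catalog and table names (`glue.db.tbl`) are placeholders, and the `expire_snapshots` call is shown with Iceberg's standard `older_than` parameter, which may differ from the `older_than_minutes` wrapper used in our environment:

```sql
-- 1. Evolve the partition spec from bucket(4, _ID) to bucket(64, _ID)
ALTER TABLE glue.db.tbl DROP PARTITION FIELD bucket(4, _ID);
ALTER TABLE glue.db.tbl ADD PARTITION FIELD bucket(64, _ID);

-- 2. Rewrite all data files under the new spec
CALL glue.system.rewrite_data_files(
  table => 'db.tbl',
  options => map('rewrite-all', 'true')
);

-- 3. Expire old snapshots so only the rewritten state remains
CALL glue.system.expire_snapshots(
  table => 'db.tbl',
  older_than => TIMESTAMP '2025-01-01 00:00:00',  -- placeholder cutoff
  retain_last => 1
);

-- 4. Remove the now-orphaned pre-rewrite data files from S3
CALL glue.system.remove_orphan_files(table => 'db.tbl');
```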
Expected
MERGE INTO should use SPJ (no Exchange/shuffle between staging and target table scans), since both tables share identical bucket(64, _ID) partitioning with a single spec.
Actual
Physical plan shows `Exchange hashpartitioning(_ID, 32)` and `ShuffledHashJoin`, i.e. a full shuffle. The target table's `BatchScan` shows `groupedBy=` (empty), while the staging table's `BatchScan` correctly shows `groupedBy=_ID_bucket`. SPJ works for tables that were created with this partition spec originally (no evolution history).
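A minimal way to reproduce the plan inspection, assuming hypothetical `target` and `staging` tables both partitioned by `bucket(64, _ID)`:

```sql
EXPLAIN FORMATTED
MERGE INTO glue.db.target t
USING glue.db.staging s
ON t._ID = s._ID
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

On a freshly created table the plan shows the two `BatchScan`s joined without an intervening `Exchange`; on the evolved table the `Exchange`/`ShuffledHashJoin` described above appears.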
Root Cause
Partition evolution metadata (even after cleanup) leaves residual state that prevents Spark from recognizing the table's bucket grouping during MERGE planning. Related upstream issue: apache/iceberg#13534 (closed without merge, Sep 2025).
Workaround
Recreate the table fresh via CREATE TABLE ... AS SELECT with the desired partition spec, then swap via ALTER TABLE RENAME. The new table has no evolution history and SPJ works immediately.
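The workaround can be sketched as follows (table names are placeholders and table properties are abbreviated; the rename-swap may behave differently depending on the catalog):

```sql
-- Recreate with the desired spec and no evolution history
CREATE TABLE glue.db.tbl_new
USING iceberg
PARTITIONED BY (bucket(64, _ID))
TBLPROPERTIES ('format-version' = '2')
AS SELECT * FROM glue.db.tbl;

-- Swap the new table into place
ALTER TABLE glue.db.tbl RENAME TO glue.db.tbl_old;
ALTER TABLE glue.db.tbl_new RENAME TO glue.db.tbl;
```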
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time