SPJ Broken After Partition Evolution (Iceberg 1.10.0) #15610

@andyguwc

Description

Apache Iceberg version

1.10.0

Query engine

Spark

Please describe the bug 🐞

Summary

Storage-Partitioned Join (SPJ) stops working for MERGE INTO operations after a table undergoes partition evolution, even when all data has been rewritten into the new partition layout and old snapshots/specs have been fully cleaned up. The workaround is to recreate the table from scratch with the desired partition spec.

Environment

  • Iceberg Spark Runtime: iceberg-spark-runtime-3.5_2.12-1.10.0
  • Spark: 3.5 (AWS Glue 5.0)
  • Catalog: AWS Glue Data Catalog

SPJ Configuration

The following Spark settings are configured and work correctly for tables that were created with their partition spec from the start (no evolution history):

spark.sql.sources.v2.bucketing.enabled = true
spark.sql.sources.v2.bucketing.pushPartValues.enabled = true
spark.sql.sources.v2.bucketing.pushPartKeys.enabled = true
spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled = true
spark.sql.requireAllClusterKeysForCoPartition = false
spark.sql.iceberg.planning.preserve-data-grouping = true

Table Definition

Tables are created with Iceberg format version 2, bucket partitioning on _ID, and hash distribution for writes and merges:

CREATE TABLE <catalog>.<db>.<table>
USING iceberg
PARTITIONED BY (bucket(64, _ID))
TBLPROPERTIES (
    'commit.manifest.min-count-to-merge' = '50',
    'format' = 'parquet',
    'format-version' = '2',
    'history.expire.max-snapshot-age-ms' = '604800000',
    'history.expire.min-snapshots-to-keep' = '10',
    'write.distribution-mode' = 'hash',
    'write.merge.distribution-mode' = 'hash',
    'write.metadata.delete-after-commit.enabled' = 'true',
    'write.metadata.previous-versions-max' = '10',
    'write.target-file-size-bytes' = '536870912'
)

Steps Performed

  1. Table originally created with bucket(4, _ID)
  2. Evolved partition via ALTER TABLE DROP PARTITION FIELD / ALTER TABLE ADD PARTITION FIELD bucket(64, _ID)
  3. Ran rewrite_data_files with rewrite-all: true — all data files rewritten under new partition layout
  4. Ran expire_snapshots with retain_last: 1 and older_than_minutes: 1 — old snapshots fully removed
  5. Ran remove_orphan_files — old data files cleaned up from S3
  6. Verified via DESCRIBE TABLE EXTENDED: only Part 0: bucket(64, _ID) remains (single clean spec)
  7. Staging table created with matching PARTITIONED BY (bucket(64, _ID))
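
The steps above can be sketched roughly as follows (catalog, database, and table names are placeholders; the procedure calls follow the documented Iceberg Spark procedures, and the `older_than` timestamp is elided):

```sql
-- 2. Evolve the partition spec from bucket(4, _ID) to bucket(64, _ID)
ALTER TABLE cat.db.tbl DROP PARTITION FIELD bucket(4, _ID);
ALTER TABLE cat.db.tbl ADD PARTITION FIELD bucket(64, _ID);

-- 3. Rewrite all data files into the new partition layout
CALL cat.system.rewrite_data_files(
  table => 'db.tbl',
  options => map('rewrite-all', 'true'));

-- 4. Expire all but the latest snapshot
CALL cat.system.expire_snapshots(
  table => 'db.tbl',
  older_than => TIMESTAMP '...',
  retain_last => 1);

-- 5. Remove orphaned data files left behind in S3
CALL cat.system.remove_orphan_files(table => 'db.tbl');
```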

Expected

MERGE INTO should use SPJ (no Exchange/shuffle between staging and target table scans), since both tables share identical bucket(64, _ID) partitioning with a single spec.
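
As a hypothetical illustration (table and column names other than `_ID` are placeholders), a MERGE of this shape is where SPJ is expected to kick in, since both sides are partitioned by `bucket(64, _ID)` and joined on `_ID`:

```sql
-- With SPJ, this join should show no Exchange between the two scans
MERGE INTO cat.db.target t
USING cat.db.staging s
ON t._ID = s._ID
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```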

Actual

Physical plan shows Exchange hashpartitioning(_ID, 32) and ShuffledHashJoin — full shuffle. Target table BatchScan shows groupedBy= (empty), while staging table correctly shows groupedBy=_ID_bucket. SPJ works for tables that were created with this partition spec originally (no evolution history).

Root Cause

Partition evolution appears to leave residual metadata state, even after all data has been rewritten and old snapshots/specs removed, that prevents Spark from recognizing the table's bucket grouping during MERGE planning. Related upstream issue: apache/iceberg#13534 (closed without merge, Sep 2025).
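
One way to double-check that no live data file is still attached to an older spec is to query the Iceberg `files` metadata table, which exposes a `spec_id` column (sketch; catalog and table names are placeholders):

```sql
-- All live data files should report the current partition spec id
SELECT spec_id, count(*) AS file_count
FROM cat.db.tbl.files
GROUP BY spec_id;
```

In the scenario above this returns a single `spec_id`, yet the target scan still plans with an empty `groupedBy`.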

Workaround

Recreate the table from scratch via CREATE TABLE ... AS SELECT with the desired partition spec, then swap it into place via ALTER TABLE ... RENAME. The new table has no evolution history, and SPJ works immediately.
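
A minimal sketch of the workaround (names and table properties are placeholders; carry over the full TBLPROPERTIES from the original definition as needed):

```sql
-- Recreate with a clean, single partition spec
CREATE TABLE cat.db.tbl_new
USING iceberg
PARTITIONED BY (bucket(64, _ID))
TBLPROPERTIES ('format-version' = '2')
AS SELECT * FROM cat.db.tbl;

-- Swap the new table into place
ALTER TABLE cat.db.tbl RENAME TO cat.db.tbl_old;
ALTER TABLE cat.db.tbl_new RENAME TO cat.db.tbl;
```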

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

Labels

  • bug (Something isn't working)