Spark: backport stream-results option for remove orphan files#234

Open
dushyantk1509 wants to merge 1 commit into linkedin:openhouse-1.5.2 from dushyantk1509:dushyantk1509/stream-results-orphan-files

Conversation


@dushyantk1509 dushyantk1509 commented Mar 17, 2026

Currently, ~30% of resources are consumed by OFD maintenance jobs because of very high Spark memory configurations. To address this, this PR backports apache/iceberg#14278 to openhouse-1.5.2. It adds a stream-results option to DeleteOrphanFilesSparkAction to prevent driver OOM when removing large numbers of orphan files. Instead of collecting all orphan file paths into driver memory, files are streamed partition-by-partition using toLocalIterator() and deleted in batches of 100K.

When enabled, the result contains a sample of up to 20,000 file paths. The total count of deleted files is logged.
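The streaming behavior described above can be sketched as follows. This is a minimal illustration of the pattern, not the actual Iceberg code: the class, `deleteInBatches`, and the iterator standing in for Spark's `Dataset#toLocalIterator()` are all hypothetical stand-ins for `DeleteOrphanFilesSparkAction`'s internals.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of the stream-results delete path: consume file
// paths one at a time, delete in fixed-size batches, and keep only a
// bounded sample of paths for the returned result.
public class StreamingOrphanDelete {
  static final int BATCH_SIZE = 100_000;   // delete batch size (100K)
  static final int SAMPLE_LIMIT = 20_000;  // max paths kept in the result

  // 'paths' stands in for Dataset#toLocalIterator(); 'deleteFunc' stands
  // in for the bulk-delete call. Returns the bounded sample of paths.
  static List<String> deleteInBatches(
      Iterator<String> paths, Consumer<List<String>> deleteFunc) {
    List<String> sample = new ArrayList<>();
    List<String> batch = new ArrayList<>();
    long total = 0;
    while (paths.hasNext()) {
      String path = paths.next();
      if (sample.size() < SAMPLE_LIMIT) {
        sample.add(path);  // never hold more than SAMPLE_LIMIT paths
      }
      batch.add(path);
      if (batch.size() >= BATCH_SIZE) {
        deleteFunc.accept(batch);  // flush a full batch
        total += batch.size();
        batch = new ArrayList<>();
      }
    }
    if (!batch.isEmpty()) {        // flush the final partial batch
      deleteFunc.accept(batch);
      total += batch.size();
    }
    // Only the total count is logged; the full path list is never
    // materialized on the driver.
    System.out.println("Deleted " + total + " orphan files");
    return sample;
  }
}
```

The key point is that driver memory is bounded by `BATCH_SIZE + SAMPLE_LIMIT` regardless of how many orphan files exist.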

Tested using openhouse local docker setup - https://github.com/linkedin/openhouse/blob/main/SETUP.md#test-through-spark-shell

  • Created 5 orphan files.
  • The Spark procedure call ran successfully and deleted all of them: `spark.sql("CALL openhouse.system.remove_orphan_files(table => 'test_db.stream_test', stream_results => true)").show(false)`
  • The SparkAction API also ran fine: `SparkActions.get(spark).deleteOrphanFiles(icebergTable).olderThan(System.currentTimeMillis()).option("stream-results", "true").execute()`
  • Default (non-streaming) calls also succeeded and deleted the orphan files: `spark.sql("CALL openhouse.system.remove_orphan_files(table => 'test_db.stream_test')").show(false)` and `SparkActions.get(spark).deleteOrphanFiles(icebergTable).olderThan(System.currentTimeMillis()).execute()`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the SPARK label Mar 17, 2026
@dushyantk1509 dushyantk1509 marked this pull request as ready for review March 17, 2026 13:01