Spark 4.1: Pass FileIO on Spark's read path #15448
nastra wants to merge 3 commits into apache:main
Conversation
Force-pushed from 000673a to dc8a4f7
  CatalogProperties.CLIENT_POOL_SIZE,
- "1"));
+ "1",
+ "include-credentials",
We need the server to return some dummy credentials so that we can properly test in TestRESTScanPlanning that the FileIO with the plan ID and storage credentials is propagated.
open-api/src/testFixtures/java/org/apache/iceberg/rest/RESTServerCatalogAdapter.java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SerializableFileIOWithSize
This makes sure that the FileIO instance is only closed on the driver and not on executor nodes; it is similar to SerializableTableWithSize.
Is this needed? What is the advantage? I don't recall why we had to do it with table, so you may want to check with @aokolnychyi.
I would highly prefer not adding an extra class unless it is definitely necessary.
SerializableTableWithSize was added by @bryanck back then. I was initially testing the code changes without this class, but it is definitely needed to avoid closing FileIO on executor nodes.
#7263 is the PR that moved away from serializing the FileIO. The problem is that Spark closed the FileIO as part of broadcast cleanup, which closed the shared S3 client.
I agree with adding this class. I think we ultimately do need to preserve the size-estimation property to prevent any regressions, and we definitely need to keep ensuring that the driver's FileIO is not closed during broadcast cleanup. If we agree that we need to preserve these properties, and there are at least two places where we now want to broadcast the FileIO, I think there's a good argument for a wrapper class. If we manage to shrink down what we need to send over the wire, maybe there's a simpler structure, but whatever we broadcast probably does need to implement FileIO, which brings along all the methods needed to satisfy that interface.
My only question is whether the class needs to be public. It looks like it's only used in this module, so I think it could be package-private.
+1 to package-private. I'm not as clear on the benefit of handling this size estimate, but the closing behavior seems important to match. It feels like Spark should have an easier method for this, like broadcast(var, estimatedSize), to avoid us having to implement an interface.
As a thought, what's stopping us from doing something like SerializableTableWithSize.copyOf(table, scan.io()), where we use the table object and just attach the scan's FileIO to it?
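The close-on-driver-only behavior discussed in this thread can be sketched with a transient serialization marker, the same trick SerializableTableWithSize is described as using. Everything below is a stand-in: CloseableIO, SerializableIOSketch, and roundTrip are hypothetical names, and the guard follows the behavior stated above (close the delegate only on the original driver-side instance), not the actual PR code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for Iceberg's FileIO; only close() matters for this sketch.
interface CloseableIO extends Serializable, AutoCloseable {
  @Override
  void close();
}

// Sketch: the transient marker survives only on the instance that was constructed
// directly (the driver). Deserialized copies on executors see a null marker, so
// close() is a no-op there and shared clients are not torn down during cleanup.
class SerializableIOSketch implements CloseableIO {
  private final CloseableIO delegate;
  private final transient Object serializationMarker;

  SerializableIOSketch(CloseableIO delegate) {
    this.delegate = delegate;
    this.serializationMarker = new Object();
  }

  @Override
  public void close() {
    if (serializationMarker != null) { // original (driver-side) instance only
      delegate.close();
    }
  }

  // Mimics a broadcast by round-tripping through Java serialization.
  @SuppressWarnings("unchecked")
  static <T> T roundTrip(T obj) {
    try {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
        out.writeObject(obj);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
        return (T) in.readObject();
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```

If both broadcast sites on the read path need this guard, a shared wrapper like this avoids duplicating it.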
Force-pushed from 45a44aa to 122ccc8
core/src/test/java/org/apache/iceberg/rest/TestRESTScanPlanning.java
@SuppressWarnings("unchecked")
private <T extends RESTResponse> T maybeAddStorageCredential(T response) {
  if (response instanceof PlanTableScanResponse resp
      && PlanStatus.COMPLETED == resp.planStatus()) {
I'm looking into when we can return a FileIO from the scan, and I was surprised to see that storage-credentials is returned for any CompletedPlanningResult rather than CompletedPlanningWithIDResult. The result with an ID is used for completed responses from the plan endpoint, while the generic result is returned from both the plan and fetch endpoints.
Why can we return storage credentials when fetching tasks? Wouldn't it make more sense to return credentials once per planning operation? That would simplify knowing when we have credentials.
It wouldn't solve the problem of needing to wait until after planFiles is called to get the FileIO, but it would at least simplify the protocol so we don't need to try to create a new FileIO after each call to fetch more tasks. (@danielcweeks, any thoughts on this?)
I guess if we want to change this, we would at the very least need to update the spec that we added in #14563.
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/BaseReader.java
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java
SparkScan(
    SparkSession spark,
    Table table,
    Supplier<FileIO> fileIO,
This is a supplier because it will be a broadcast?
This is a supplier because planFiles() will be called later in the read path, so the actual FileIO with the right credentials will only be available later.
I actually considered using Supplier<FileIO> for the API from the scan. Now that it's clear we will create a supplier anyway, should we just update the scan interface so that the caller doesn't need to wrap it?
The nice thing about that is that we don't rely on docs or runtime exceptions (unless you call get too early). Returning a supplier signals to the caller that they should find out when the FileIO is accessible.
Yeah, I think that makes sense. I've created #15646 for that.
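The deferred-FileIO idea from this thread can be sketched with a supplier that fails fast before planning has run. PlannedIO, ScanSketch, and the error message below are hypothetical stand-ins, not the Iceberg API; the point is only that handing out Supplier<FileIO> makes the "not available until after planFiles()" contract explicit to the caller.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical stand-in for a credential-bearing FileIO produced by remote planning.
class PlannedIO {
  private final String name;

  PlannedIO(String name) {
    this.name = name;
  }

  String name() {
    return name;
  }
}

// Sketch: the scan hands out a Supplier instead of a FileIO, because the FileIO
// configured with the plan ID and storage credentials only exists after
// planFiles() has run. Calling get() too early fails fast instead of silently
// using the table's default FileIO.
class ScanSketch {
  private final AtomicReference<PlannedIO> plannedIO = new AtomicReference<>();

  Supplier<PlannedIO> io() {
    return () -> {
      PlannedIO io = plannedIO.get();
      if (io == null) {
        throw new IllegalStateException("FileIO is only available after planFiles()");
      }
      return io;
    };
  }

  void planFiles() {
    // Hypothetical: remote planning yields a FileIO carrying plan-scoped credentials.
    plannedIO.set(new PlannedIO("io-with-plan-credentials"));
  }
}
```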
Force-pushed from 122ccc8 to 4a69352
Force-pushed from 4a69352 to f9e1bde
Force-pushed from c13cd0a to b5367d7
Force-pushed from 3e3ec04 to 55d2b57
Force-pushed from 55d2b57 to d8d6cb7
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java
Force-pushed from d8d6cb7 to 456b1d5
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/ChangelogRowReader.java
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/PositionDeletesRowReader.java
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkPartitioningAwareScan.java
Force-pushed from 456b1d5 to bce38ca
Force-pushed from bce38ca to 1d5a7bb
@Override
public long estimatedSize() {
  return SIZE_ESTIMATE;
Size estimate of what? The FileIO's serialized representation?
This is just a hardcoded size so that sizes don't have to be re-calculated (similar to how we have it in SerializableTableWithSize).
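The hardcoded-estimate idea can be sketched as follows. KnownSizeEstimationSketch is a stand-in for Spark's org.apache.spark.util.KnownSizeEstimation, and the 32 KB constant is purely illustrative: reporting a fixed size lets Spark's SizeEstimator short-circuit instead of reflectively walking the broadcast object graph.

```java
// Stand-in for Spark's KnownSizeEstimation (assumption: a single estimatedSize() method).
interface KnownSizeEstimationSketch {
  long estimatedSize();
}

class FileIOSizeSketch implements KnownSizeEstimationSketch {
  // Illustrative fixed estimate; a real value would mirror the constant used by
  // SerializableTableWithSize so sizes never have to be re-calculated.
  static final long SIZE_ESTIMATE = 32L * 1024L;

  @Override
  public long estimatedSize() {
    return SIZE_ESTIMATE;
  }
}
```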
Force-pushed from 1d5a7bb to 33733d7
When accessing/reading data files, the codebase uses the Table's FileIO instance through table.io() on Spark's read path. With remote scan planning, the FileIO instance is configured with a plan ID and custom storage credentials inside RESTTableScan, but that instance is never propagated to the place(s) that actually perform the read, thus leading to errors.

This PR passes the FileIO obtained during remote/distributed scan planning alongside the Table instance on Spark's read path.

This is an alternative to #15368 and requires SerializableFileIOWithSize, which makes sure that the FileIO instance is only closed on the driver and not on executor nodes (similar to SerializableTableWithSize).