
API, Core: Add FileIO to Scan API#15561

Merged
nastra merged 6 commits into apache:main from nastra:add-fileio-to-scan
Mar 13, 2026

Conversation

@nastra
Contributor

@nastra nastra commented Mar 9, 2026

This adds FileIO to the Scan API and is extracted from #15448

@RussellSpitzer RussellSpitzer self-requested a review March 10, 2026 02:00
@nastra nastra requested a review from rdblue March 10, 2026 05:45
@nastra nastra force-pushed the add-fileio-to-scan branch from 7e21817 to 824ed65 on March 10, 2026 17:41
/**
 * @return the {@link FileIO} instance to use when reading data files for this scan
 */
default FileIO io() {
  throw new UnsupportedOperationException("io() is not implemented: added in 1.11.0");
}
Member

Not sure the version belongs here

Contributor

This is mostly for unexpected implementations, so I was thinking that a version number would be helpful like a deprecation message that has one. I'm fine removing it, though.

Member

I don't have a strong feeling here. Are you thinking of this as an implementer-facing message? I assume they won't really care when it was added; it's not like they can upgrade their way out of it. So I'll take it either way.

Contributor

I was thinking it could address urgent questions if you have deployed a new version of Iceberg and hit this, like "when was this added so I can roll back?" or "how long has this been broken?"

Member

Who would that message be for, though? I would think it's only for library integrators, who would hopefully break immediately when they bump their dependency. I really don't mind, though; we can keep it.

@nastra nastra force-pushed the add-fileio-to-scan branch from 05711a7 to f9810f7 on March 11, 2026 08:52
@nastra nastra added this to the Iceberg 1.11.0 milestone Mar 11, 2026
@nastra nastra force-pushed the add-fileio-to-scan branch from 4e25802 to 2d1bfa0 on March 11, 2026 16:04
ImmutableMap.Builder<String, String> builder =
    ImmutableMap.<String, String>builder().putAll(catalogProperties);
if (null != planId) {
  builder.put(RESTCatalogProperties.REST_SCAN_PLAN_ID, planId);
}
Contributor

It makes no sense to me that the key we use to internally track the plan ID is a public field in the class where we keep properties to configure the REST catalog. Is this something we can change, or has it been released?

Contributor Author

This hasn't been released yet, so we can still change it, but that should most likely be done in a separate PR because the property is used in a few other places as well.

Contributor

Fixing in a separate PR is fine, but we don't want to replace one blocker with another endlessly as we find these issues.

We should also consider whether there are any alternatives to passing this mixed into catalog properties. Passing state like this in a property map along with config mixes concepts and causes weird API additions like this constant in RESTCatalogProperties.

private final FileIO tableIO;
private String planId = null;
private FileIO fileIOForPlanId = null;
private FileIO scanFileIO = null;
Contributor
@rdblue Mar 11, 2026

The table's FileIO is not needed: it is passed in only once as table.io(), and table is also passed in. And now that I'm looking at it, table is not needed as a field either, because a parent (BaseScan) exposes table(). Both fields should be removed, along with the constructor argument for the table's IO.

Contributor

Having found a couple of unnecessary fields, I looked more closely at the information passed around in this class, and I have a few more questions:

Why does this use Map<String, String> headers rather than Supplier<Map<String, String>> headers that the table has? Doesn't this mean that the headers from the auth session are static? How does credential refresh work if the authentication header is stale?
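To illustrate the concern, here is a minimal, self-contained sketch (AuthSession and the header names are hypothetical, not Iceberg API): a Map captured once never reflects a later token refresh, while a Supplier is re-evaluated on each request and picks up the new token.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: why Supplier<Map<String, String>> matters for auth.
public class HeaderSupplierSketch {
  static class AuthSession {
    private String token;

    AuthSession(String token) { this.token = token; }

    // Builds fresh headers from the current token.
    Map<String, String> headers() {
      Map<String, String> headers = new HashMap<>();
      headers.put("Authorization", "Bearer " + token);
      return headers;
    }

    void refresh(String newToken) { this.token = newToken; }
  }

  public static void main(String[] args) {
    AuthSession session = new AuthSession("token-1");

    Map<String, String> staticHeaders = session.headers();        // snapshot
    Supplier<Map<String, String>> liveHeaders = session::headers; // live view

    session.refresh("token-2");

    System.out.println(staticHeaders.get("Authorization"));     // Bearer token-1 (stale)
    System.out.println(liveHeaders.get().get("Authorization")); // Bearer token-2 (fresh)
  }
}
```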

Was the TableOperations field originally used? It isn't used now, other than to pass it to refined scans. I think it should be removed from both the constructor and the fields.

This passes ResourcePaths and TableIdentifier to build 2 paths, which could instead be passed in as a single path plus the plan ID. I think this should be reconsidered. The plan and fetch endpoints can be passed in as simple strings, and it would simplify this to use a Function<String, String> to construct the plan path with ID. Another option is to create an object similar to ResourcePaths that is specific to a table (hides TableIdentifier) and use that. Either one would be cleaner.
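The Function<String, String> idea could look like this minimal sketch (the path layout and names below are assumptions for illustration, not the actual ResourcePaths output):

```java
import java.util.function.Function;

// Hypothetical sketch: instead of ResourcePaths + TableIdentifier, the scan
// receives one base path and a single function that builds the fetch path
// from a plan ID, hiding the identifier and the path template.
public class PlanPathSketch {
  static Function<String, String> fetchPathFor(String planEndpoint) {
    return planId -> planEndpoint + "/" + planId;
  }

  public static void main(String[] args) {
    String planEndpoint = "v1/prefix/namespaces/db/tables/events/plan"; // assumed layout
    Function<String, String> fetchPath = fetchPathFor(planEndpoint);
    System.out.println(fetchPath.apply("plan-123"));
    // v1/prefix/namespaces/db/tables/events/plan/plan-123
  }
}
```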

This also passes supportedEndpoints, which spreads out the logic for how to handle endpoints that aren't supported. These endpoints are also checked when needed, so this code will create a plan request and could then fail to fetch tasks if the fetch endpoint isn't supported. It also throws error messages that are not helpful, like "Server does not support endpoint: GET /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan/{plan-id}" instead of "Invalid status: submitted (service does not support async planning)" that would be more helpful.

A cleaner way to handle endpoints is to verify required endpoints before creating this scan. Both the plan and fetch endpoints should be required. Then the optional endpoints should be booleans, like supportsAsync and supportsCancel.

Last, this leaks catalog properties and the Hadoop conf so that CatalogUtil.loadFileIO can be called with List<Credential>. Why not pass a function to create the FileIO so that the properties and conf are contained in the REST table operations?
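A rough sketch of that factory idea (all names here are hypothetical): the table operations layer closes over the catalog properties and conf, and the scan only sees a function from plan ID and credentials to a FileIO.

```java
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch: containing properties/conf inside table operations
// and handing the scan only a FileIO factory.
public class FileIoFactorySketch {
  interface SimpleFileIO {
    Map<String, String> properties();
  }

  // Built where the catalog properties (and conf) already live; the scan
  // never sees them directly.
  static BiFunction<String, List<String>, SimpleFileIO> ioFactory(
      Map<String, String> catalogProperties) {
    return (planId, credentials) -> () -> {
      Map<String, String> props = new java.util.HashMap<>(catalogProperties);
      props.put("plan-id", planId); // scoped to this FileIO, not the catalog map
      return props;
    };
  }

  public static void main(String[] args) {
    SimpleFileIO io =
        ioFactory(Map.of("uri", "http://rest")).apply("plan-123", List.of());
    System.out.println(io.properties().get("plan-id")); // plan-123
  }
}
```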

I think this works right now and these aren't blockers (other than the potential auth session issue), but I would really like to see this class simplified by reducing the number of things that have to be passed to it and remove some of the things that are done here, like handling endpoint checks.

Contributor Author

Thanks @singhpk234 for opening #15595 to address those things.

      .withUseSnapshotSchema(true);
} else if (snapshotId != null) {
-  boolean useSnapShotSchema = snapshotId != table.currentSnapshot().snapshotId();
+  boolean useSnapShotSchema = snapshotId != table().currentSnapshot().snapshotId();
Contributor

I don't think this is a blocker, but it doesn't look correct. Additional nit: "snapshot" is one word and should be capitalized "Snapshot".

@singhpk234, the snapshot schema should be used when time traveling to a tag or a specific snapshot ID, but not when reading from a branch. That context comes from how the ref or snapshot was configured. Choosing a specific snapshot should generally send useSnapshotSchema=true, but just reading from a branch should not. Here is the description from the REST spec:

Whether to use the schema at the time the snapshot was written. When time travelling, the snapshot schema should be used (true). When scanning a branch, the table schema should be used (false).

This comparison isn't going to be sufficient because the distinction is whether the snapshot was selected via branch name vs directly by ID or by tag name. This needs to know whether useRef was called and whether the ref was a branch.

We also can't rely on comparing the result schema because DataTableScan (and its children) always override the schema to the branch/tag/snapshot schema and when useSnapshot or useRef are called. That's because the fields passed to select are applied lazily in planFiles. The snapshot/ref selection changes the base schema and columns are selected by name, or a specific projection is used directly.

To fix this, I think you need to override some of the API methods to detect how the scan is configured.
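One way to sketch that detection (names hypothetical, not the actual Scan API): record how the snapshot was selected and derive the flag from it, following the spec rule quoted above.

```java
// Hedged sketch: time travel (explicit snapshot ID or tag) uses the snapshot
// schema; reading a branch (or the current state) uses the table schema.
public class SnapshotSelectionSketch {
  enum Selection { CURRENT, SNAPSHOT_ID, TAG, BRANCH }

  static boolean useSnapshotSchema(Selection selection) {
    return selection == Selection.SNAPSHOT_ID || selection == Selection.TAG;
  }

  public static void main(String[] args) {
    System.out.println(useSnapshotSchema(Selection.SNAPSHOT_ID)); // true
    System.out.println(useSnapshotSchema(Selection.TAG));         // true
    System.out.println(useSnapshotSchema(Selection.BRANCH));      // false
  }
}
```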

And while we're looking at schema projection, I think the projection code is also incorrect:

    List<String> selectedColumns =
        schema().columns().stream().map(Types.NestedField::name).collect(Collectors.toList());

The problem here is that it will select top-level fields only because NestedField#name returns the local field's name and not any children. To get the full field names, you'd need to call TypeUtil.getProjectedIds and then pass each ID through schema().findColumnName(id) like you do for stats fields. This is what SnapshotScan does:

    List<Integer> projectedFieldIds = Lists.newArrayList(TypeUtil.getProjectedIds(schema()));
    List<String> projectedFieldNames =
        projectedFieldIds.stream().map(schema()::findColumnName).collect(Collectors.toList());
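A self-contained sketch of the difference (Field here is a stand-in for Iceberg's nested types, not the real API): mapping a local name accessor yields only top-level names, while walking to the leaves yields the full dotted paths that findColumnName would produce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class ProjectionNamesSketch {
  record Field(String name, List<Field> children) {
    static Field leaf(String name) { return new Field(name, List.of()); }
  }

  // Mirrors mapping NestedField::name: local names of top-level columns only.
  static List<String> topLevelNames(List<Field> columns) {
    return columns.stream().map(Field::name).collect(Collectors.toList());
  }

  // Mirrors findColumnName over projected IDs: full dotted path of every leaf.
  static List<String> fullNames(List<Field> columns, String prefix) {
    List<String> out = new ArrayList<>();
    for (Field field : columns) {
      String path = prefix.isEmpty() ? field.name() : prefix + "." + field.name();
      if (field.children().isEmpty()) {
        out.add(path);
      } else {
        out.addAll(fullNames(field.children(), path));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Field> schema =
        List.of(
            Field.leaf("id"),
            new Field("location", List.of(Field.leaf("lat"), Field.leaf("lon"))));

    System.out.println(topLevelNames(schema)); // [id, location]
    System.out.println(fullNames(schema, "")); // [id, location.lat, location.lon]
  }
}
```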


// make sure remote scan planning is called and FileIO gets the planId
assertThat(tableScan.planFiles()).hasSize(1);
assertThat(table.io().properties()).doesNotContainKey(RESTCatalogProperties.REST_SCAN_PLAN_ID);
Contributor

I noted above that mixing the plan ID into properties is not a great solution, and I want to point out that this assertion is necessary because of it. We have to make sure we're not modifying the wrong property map, and worry about cases where user-driven config passes in a hard-coded plan ID (see buildKeepingLast()). I don't know that there's an easy fix, but this is not a pattern we want in the codebase.

Contributor
@rdblue left a comment

These changes are fine in isolation, although this made me look into schema handling and I found a couple of bugs to be fixed in a follow-up. We also need to move the plan ID constant out of public catalog properties.

@nastra
Contributor Author

nastra commented Mar 13, 2026

Thanks for the reviews, @RussellSpitzer and @rdblue. I'll get this merged so that we can address the other items in separate PRs.

@nastra nastra merged commit cf6f835 into apache:main Mar 13, 2026
35 checks passed
@nastra nastra deleted the add-fileio-to-scan branch March 13, 2026 05:42