
API, Core: Add FileIO to Scan API#15561

Merged
nastra merged 6 commits into apache:main from nastra:add-fileio-to-scan
Mar 13, 2026

Conversation

@nastra
Contributor

@nastra nastra commented Mar 9, 2026

This adds FileIO to the Scan API and is extracted from #15448

@RussellSpitzer RussellSpitzer self-requested a review March 10, 2026 02:00
@nastra nastra requested a review from rdblue March 10, 2026 05:45
@nastra nastra force-pushed the add-fileio-to-scan branch from 7e21817 to 824ed65 on March 10, 2026 17:41
/**
 * @return the {@link FileIO} instance to use when reading data files for this scan
 */
default FileIO io() {
  throw new UnsupportedOperationException("io() is not implemented: added in 1.11.0");
}
Member

Not sure the version belongs here

Contributor

This is mostly for unexpected implementations, so I was thinking that a version number would be helpful like a deprecation message that has one. I'm fine removing it, though.

Member

I don't have a strong feeling here. Are you thinking of this as an implementer-facing message? I assume they won't really care when it was added; it's not like they can upgrade their way out of it. So I'll take it either way.

Contributor

I was thinking it could address urgent questions if you have deployed a new version of Iceberg and hit this, like "when was this added so I can roll back?" or "how long has this been broken?"

Member

Who would that message be for, though? I would think it's only for library integrators, who would hopefully break immediately when they bump their dependency. I really don't mind, though; we can keep it.

@nastra nastra force-pushed the add-fileio-to-scan branch from 05711a7 to f9810f7 on March 11, 2026 08:52
@nastra nastra added this to the Iceberg 1.11.0 milestone Mar 11, 2026
@nastra nastra force-pushed the add-fileio-to-scan branch from 4e25802 to 2d1bfa0 on March 11, 2026 16:04
ImmutableMap.Builder<String, String> builder =
    ImmutableMap.<String, String>builder().putAll(catalogProperties);
if (null != planId) {
  builder.put(RESTCatalogProperties.REST_SCAN_PLAN_ID, planId);
}
Contributor

It makes no sense to me that the key we use to internally track the plan ID is a public field in the class where we keep properties to configure the REST catalog. Is this something we can change, or has it been released?

Contributor Author

This hasn't been released yet, so we can still change it, but that should most likely be done in a separate PR because the property is used in a few other places as well.

Contributor

Fixing in a separate PR is fine, but we don't want to replace one blocker with another endlessly as we find these issues.

We should also consider whether there are any alternatives to passing this mixed into catalog properties. Passing state like this in a property map along with config mixes concepts and causes weird API additions like this constant in RESTCatalogProperties.

private final FileIO tableIO;
private String planId = null;
private FileIO fileIOForPlanId = null;
private FileIO scanFileIO = null;
Contributor
@rdblue Mar 11, 2026

The table's FileIO is not needed: it is passed in only once as table.io(), and table is also passed in. And now that I'm looking at it, table is not needed as a field either, because a parent (BaseScan) exposes table(). Both fields should be removed, along with the constructor argument for the table's IO.

Contributor

Having found a couple of unnecessary fields, I looked more closely at the information passed around in this class, and I have a few more questions:

Why does this use Map<String, String> headers rather than Supplier<Map<String, String>> headers that the table has? Doesn't this mean that the headers from the auth session are static? How does credential refresh work if the authentication header is stale?
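To illustrate the concern, here is a minimal, self-contained sketch (AuthSession and the header names are hypothetical, not Iceberg API): a Map captured once never reflects a later token refresh, while a Supplier is re-evaluated on each request and picks up the new token.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: why Supplier<Map<String, String>> matters for auth.
public class HeaderSupplierSketch {
  static class AuthSession {
    private String token;

    AuthSession(String token) { this.token = token; }

    // Builds fresh headers from the current token.
    Map<String, String> headers() {
      Map<String, String> headers = new HashMap<>();
      headers.put("Authorization", "Bearer " + token);
      return headers;
    }

    void refresh(String newToken) { this.token = newToken; }
  }

  public static void main(String[] args) {
    AuthSession session = new AuthSession("token-1");

    Map<String, String> staticHeaders = session.headers();        // snapshot
    Supplier<Map<String, String>> liveHeaders = session::headers; // live view

    session.refresh("token-2");

    System.out.println(staticHeaders.get("Authorization"));     // Bearer token-1 (stale)
    System.out.println(liveHeaders.get().get("Authorization")); // Bearer token-2 (fresh)
  }
}
```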

Was the TableOperations field originally used? It isn't used now, other than to pass it to refined scans. I think it should be removed from both the constructor and the fields.

This passes ResourcePaths and TableIdentifier to build 2 paths, which could instead be passed in as a single path plus the plan ID. I think this should be reconsidered. The plan and fetch endpoints can be passed in as simple strings, and it would simplify this to use a Function<String, String> to construct the plan path with ID. Another option is to create an object similar to ResourcePaths that is specific to a table (hides TableIdentifier) and use that. Either one would be cleaner.
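The Function<String, String> idea could look like this minimal sketch (the path layout and names below are assumptions for illustration, not the actual ResourcePaths output):

```java
import java.util.function.Function;

// Hypothetical sketch: instead of ResourcePaths + TableIdentifier, the scan
// receives one base path and a single function that builds the fetch path
// from a plan ID, hiding the identifier and the path template.
public class PlanPathSketch {
  static Function<String, String> fetchPathFor(String planEndpoint) {
    return planId -> planEndpoint + "/" + planId;
  }

  public static void main(String[] args) {
    String planEndpoint = "v1/prefix/namespaces/db/tables/events/plan"; // assumed layout
    Function<String, String> fetchPath = fetchPathFor(planEndpoint);
    System.out.println(fetchPath.apply("plan-123"));
    // v1/prefix/namespaces/db/tables/events/plan/plan-123
  }
}
```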

This also passes supportedEndpoints, which spreads out the logic for how to handle endpoints that aren't supported. These endpoints are also checked when needed, so this code will create a plan request and could then fail to fetch tasks if the fetch endpoint isn't supported. It also throws error messages that are not helpful, like "Server does not support endpoint: GET /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan/{plan-id}" instead of "Invalid status: submitted (service does not support async planning)" that would be more helpful.

A cleaner way to handle endpoints is to verify required endpoints before creating this scan. Both the plan and fetch endpoints should be required. Then the optional endpoints should be booleans, like supportsAsync and supportsCancel.

Last, this leaks catalog properties and the Hadoop conf so that CatalogUtil.loadFileIO can be called with List<Credential>. Why not pass a function to create the FileIO so that the properties and conf are contained in the REST table operations?
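A rough sketch of that factory idea (all names here are hypothetical): the table operations layer closes over the catalog properties and conf, and the scan only sees a function from plan ID and credentials to a FileIO.

```java
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch: containing properties/conf inside table operations
// and handing the scan only a FileIO factory.
public class FileIoFactorySketch {
  interface SimpleFileIO {
    Map<String, String> properties();
  }

  // Built where the catalog properties (and conf) already live; the scan
  // never sees them directly.
  static BiFunction<String, List<String>, SimpleFileIO> ioFactory(
      Map<String, String> catalogProperties) {
    return (planId, credentials) -> () -> {
      Map<String, String> props = new java.util.HashMap<>(catalogProperties);
      props.put("plan-id", planId); // scoped to this FileIO, not the catalog map
      return props;
    };
  }

  public static void main(String[] args) {
    SimpleFileIO io =
        ioFactory(Map.of("uri", "http://rest")).apply("plan-123", List.of());
    System.out.println(io.properties().get("plan-id")); // plan-123
  }
}
```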

I think this works right now and these aren't blockers (other than the potential auth session issue), but I would really like to see this class simplified by reducing the number of things that have to be passed to it and remove some of the things that are done here, like handling endpoint checks.

Contributor Author

Thanks @singhpk234 for opening #15595 to address those things.

      .withUseSnapshotSchema(true);
} else if (snapshotId != null) {
-  boolean useSnapShotSchema = snapshotId != table.currentSnapshot().snapshotId();
+  boolean useSnapShotSchema = snapshotId != table().currentSnapshot().snapshotId();
Contributor

I don't think this is a blocker, but it doesn't look correct. Additional nit: "snapshot" is one word and should be capitalized "Snapshot".

@singhpk234, the snapshot schema should be used when time traveling to a tag or a specific snapshot ID, but not when reading from a branch. That context comes from how the ref or snapshot was configured. Choosing a specific snapshot should generally send useSnapshotSchema=true, but just reading from a branch should not. Here is the description from the REST spec:

Whether to use the schema at the time the snapshot was written. When time travelling, the snapshot schema should be used (true). When scanning a branch, the table schema should be used (false).

This comparison isn't going to be sufficient because the distinction is whether the snapshot was selected via branch name vs directly by ID or by tag name. This needs to know whether useRef was called and whether the ref was a branch.

We also can't rely on comparing the result schema because DataTableScan (and its children) always override the schema to the branch/tag/snapshot schema and when useSnapshot or useRef are called. That's because the fields passed to select are applied lazily in planFiles. The snapshot/ref selection changes the base schema and columns are selected by name, or a specific projection is used directly.

To fix this, I think you need to override some of the API methods to detect how the scan is configured.
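One way to sketch that detection (names hypothetical, not the actual Scan API): record how the snapshot was selected and derive the flag from it, following the spec rule quoted above.

```java
// Hedged sketch: time travel (explicit snapshot ID or tag) uses the snapshot
// schema; reading a branch (or the current state) uses the table schema.
public class SnapshotSelectionSketch {
  enum Selection { CURRENT, SNAPSHOT_ID, TAG, BRANCH }

  static boolean useSnapshotSchema(Selection selection) {
    return selection == Selection.SNAPSHOT_ID || selection == Selection.TAG;
  }

  public static void main(String[] args) {
    System.out.println(useSnapshotSchema(Selection.SNAPSHOT_ID)); // true
    System.out.println(useSnapshotSchema(Selection.TAG));         // true
    System.out.println(useSnapshotSchema(Selection.BRANCH));      // false
  }
}
```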

And while we're looking at schema projection, I think the projection code is also incorrect:

    List<String> selectedColumns =
        schema().columns().stream().map(Types.NestedField::name).collect(Collectors.toList());

The problem here is that it will select top-level fields only because NestedField#name returns the local field's name and not any children. To get the full field names, you'd need to call TypeUtil.getProjectedIds and then pass each ID through schema().findColumnName(id) like you do for stats fields. This is what SnapshotScan does:

    List<Integer> projectedFieldIds = Lists.newArrayList(TypeUtil.getProjectedIds(schema()));
    List<String> projectedFieldNames =
        projectedFieldIds.stream().map(schema()::findColumnName).collect(Collectors.toList());
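A self-contained sketch of the difference (Field here is a stand-in for Iceberg's nested types, not the real API): mapping a local name accessor yields only top-level names, while walking to the leaves yields the full dotted paths that findColumnName would produce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class ProjectionNamesSketch {
  record Field(String name, List<Field> children) {
    static Field leaf(String name) { return new Field(name, List.of()); }
  }

  // Mirrors mapping NestedField::name: local names of top-level columns only.
  static List<String> topLevelNames(List<Field> columns) {
    return columns.stream().map(Field::name).collect(Collectors.toList());
  }

  // Mirrors findColumnName over projected IDs: full dotted path of every leaf.
  static List<String> fullNames(List<Field> columns, String prefix) {
    List<String> out = new ArrayList<>();
    for (Field field : columns) {
      String path = prefix.isEmpty() ? field.name() : prefix + "." + field.name();
      if (field.children().isEmpty()) {
        out.add(path);
      } else {
        out.addAll(fullNames(field.children(), path));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Field> schema =
        List.of(
            Field.leaf("id"),
            new Field("location", List.of(Field.leaf("lat"), Field.leaf("lon"))));

    System.out.println(topLevelNames(schema)); // [id, location]
    System.out.println(fullNames(schema, "")); // [id, location.lat, location.lon]
  }
}
```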


// make sure remote scan planning is called and FileIO gets the planId
assertThat(tableScan.planFiles()).hasSize(1);
assertThat(table.io().properties()).doesNotContainKey(RESTCatalogProperties.REST_SCAN_PLAN_ID);
Contributor

I noted above that mixing the plan ID into properties is not a great solution, and I want to point out that this assertion is necessary because of it. We have to make sure we're not modifying the wrong property map, and worry about cases where user-driven config passes in a hard-coded plan ID (see buildKeepingLast()). I don't know that there's an easy fix, but this is not a pattern we want in the codebase.

Contributor
@rdblue left a comment

These changes are fine in isolation, although this made me look into schema handling and I found a couple of bugs to be fixed in a follow-up. We also need to move the plan ID constant out of public catalog properties.

@nastra
Contributor Author

nastra commented Mar 13, 2026

Thanks for the reviews, @RussellSpitzer and @rdblue. I'll get this merged so that we can address the other items in separate PRs.

@nastra nastra merged commit cf6f835 into apache:main Mar 13, 2026
35 checks passed
@nastra nastra deleted the add-fileio-to-scan branch March 13, 2026 05:42