
Spark 4.1: Pass FileIO on Spark's read path#15448

Open
nastra wants to merge 3 commits into apache:main from nastra:remote-planning-file-io-alternative

Conversation


@nastra nastra commented Feb 26, 2026

When accessing/reading data files, the codebase uses the table's FileIO instance through table.io() on Spark's read path. With remote scan planning, the FileIO instance is configured with a plan ID and custom storage credentials inside RESTTableScan, but that instance is never propagated to the place(s) that actually perform the read, which leads to errors.

This PR passes the FileIO obtained during remote/distributed scan planning alongside the Table instance on Spark's read path.

This is an alternative to #15368 and requires SerializableFileIOWithSize, which makes sure that the FileIO instance is only closed on the driver and not on executor nodes (similar to SerializableTableWithSize).
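For illustration, the driver-only close semantics can be sketched with a small self-contained class. Everything here (SerializableIOSketch, Resource, the exact shape of the marker trick) is invented for the sketch and is not the actual Iceberg API: a transient marker field is non-null only in the JVM that constructed the instance, so deserialized copies treat close() as a no-op.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializableIOSketch implements Serializable, AutoCloseable {
  // fixed size estimate so broadcast accounting doesn't walk the object graph
  private static final long SIZE_ESTIMATE = 32L * 1024 * 1024;

  private final Resource io;
  // non-null only in the JVM that constructed this instance; Java
  // serialization leaves transient fields at their default (null)
  private final transient Object serializationMarker;

  public SerializableIOSketch(Resource io) {
    this.io = io;
    this.serializationMarker = new Object();
  }

  public long estimatedSize() {
    return SIZE_ESTIMATE;
  }

  public Resource io() {
    return io;
  }

  @Override
  public void close() {
    if (serializationMarker != null) {
      io.close(); // original instance: really close
    }
    // deserialized copies: no-op
  }

  /** Serializes and deserializes, simulating shipping the instance to an executor. */
  public static SerializableIOSketch roundTrip(SerializableIOSketch original) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(original);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (SerializableIOSketch) in.readObject();
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  /** Minimal stand-in for FileIO so the sketch stays self-contained. */
  public static class Resource implements Serializable {
    public boolean closed = false;

    public void close() {
      this.closed = true;
    }
  }
}
```

The transient field is what makes such a wrapper safe to broadcast: serialization resets it to null, so only the originating JVM ever performs the real close.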

@nastra nastra force-pushed the remote-planning-file-io-alternative branch 2 times, most recently from 000673a to dc8a4f7 Compare February 26, 2026 10:44
@nastra nastra marked this pull request as draft March 3, 2026 09:00
```java
CatalogProperties.CLIENT_POOL_SIZE,
"1",
"include-credentials",
```
@nastra (Contributor Author):
We need the server to return some dummy credentials so that we can properly test in TestRESTScanPlanning that the FileIO with the plan ID and storage credentials is propagated.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SerializableFileIOWithSize
```
@nastra nastra (Contributor Author) commented Mar 6, 2026:

This makes sure that the FileIO instance is only closed on the driver and not on executor nodes; it is similar to SerializableTableWithSize.

@rdblue rdblue (Contributor) commented Mar 13, 2026:

Is this needed? What is the advantage? I don't recall why we had to do it with table, so you may want to check with @aokolnychyi.

I would highly prefer not adding an extra class unless it is definitely necessary.

@nastra nastra (Contributor Author) commented Mar 16, 2026:

SerializableTableWithSize was added by @bryanck back then. I initially tested the code changes without this class, but it is definitely needed to avoid closing the FileIO on executor nodes.

Contributor:

#7263 is the PR that moved away from serializing the FileIO. The problem is that Spark closed the FileIO as part of broadcast cleanup, which closed the shared S3 client.

@amogh-jahagirdar amogh-jahagirdar (Contributor) commented Mar 17, 2026:

I agree with adding this class. We ultimately need to preserve the size-estimation property to prevent regressions, and we definitely need to ensure the driver's FileIO is not closed during broadcast cleanup. If we agree on preserving these properties, and given there are now at least two places where we want to broadcast the FileIO, there is a good argument for a wrapper class. If we manage to shrink down what we need to send over the wire, maybe a simpler structure is possible, but whatever we broadcast probably needs to implement FileIO, which brings along all the methods required to satisfy that interface.

My only question is: does the class need to be public? It looks like it's only used in this module, so I think it could be package-private.

Member:

+1 to package-private. I'm not as clear on the benefit of handling this size estimate, but the closing behavior seems important to match. It feels like Spark should have an easier method for doing this, like broadcast(var, estimatedSize) or something, to avoid us having to implement an interface.

@RussellSpitzer RussellSpitzer (Member) commented Mar 18, 2026:

As a thought, what's stopping us from doing something like

```java
SerializableTableWithSize.copyOf(table, scan.io())
```

Where we use the table object and just attach the scan io to it?

@nastra nastra force-pushed the remote-planning-file-io-alternative branch 2 times, most recently from 45a44aa to 122ccc8 Compare March 6, 2026 17:30
```java
@SuppressWarnings("unchecked")
private <T extends RESTResponse> T maybeAddStorageCredential(T response) {
  if (response instanceof PlanTableScanResponse resp
      && PlanStatus.COMPLETED == resp.planStatus()) {
```
Contributor:

I'm looking into when we can return a FileIO from the scan, and I was surprised to see that storage-credentials is returned for any CompletedPlanningResult rather than only CompletedPlanningWithIDResult. The one with an ID is used for completed responses from the plan endpoint, while the generic result is returned from both the plan and fetch endpoints.

Why can we return storage credentials when fetching tasks? Wouldn't it make more sense to return credentials once per planning operation? That would simplify knowing when we have credentials.

That wouldn't solve the problem of needing to wait until after planFiles is called to get the FileIO, but it would at least simplify the protocol so we don't need to try to create a new FileIO after each call to fetch more tasks. (@danielcweeks, any thoughts on this?)

@nastra (Contributor Author):

I guess if we want to change this, we would at the very least need to update the spec that we added in #14563.

```java
SparkScan(
    SparkSession spark,
    Table table,
    Supplier<FileIO> fileIO,
```
Contributor:

This is a supplier because it will be a broadcast?

@nastra (Contributor Author):

This is a supplier because planFiles() will be called later in the read path, so the actual FileIO with the right credentials is only available later.
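As an illustration of that deferral, a memoizing supplier can trigger planning on first use and cache the result. This is a hypothetical sketch (LazyScanIO and its methods are invented names, and String stands in for FileIO), not the PR's actual implementation:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Sketch: the scan exposes a Supplier that resolves the credentialed
// FileIO on first get() and caches it for all later calls.
public class LazyScanIO {
  private final AtomicReference<String> resolved = new AtomicReference<>();
  private final Supplier<String> planAndResolve;

  public LazyScanIO(Supplier<String> planAndResolve) {
    this.planAndResolve = planAndResolve;
  }

  /** Handed to readers: planning only runs when the FileIO is first needed. */
  public Supplier<String> ioSupplier() {
    return () -> {
      String io = resolved.get();
      if (io == null) {
        // compareAndSet keeps a single cached value even under races
        // (the resolver may run more than once, but callers see one result)
        resolved.compareAndSet(null, planAndResolve.get());
        io = resolved.get();
      }
      return io;
    };
  }
}
```

Handing out the supplier rather than the FileIO itself also documents the contract: callers must not expect a usable value before planning has run.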

Contributor:

I actually considered using Supplier<FileIO> for the API from scan. Now that it's clear that we will create a supplier anyway, should we just update the scan interface so that the caller doesn't need to wrap it?

The nice thing about that is that we don't rely on docs or runtime exceptions (unless you call get too early). Returning a supplier signals to the caller that they should find out when the FileIO is accessible.

@nastra (Contributor Author):

Yeah, I think that makes sense. I've created #15646 for that.

@nastra nastra force-pushed the remote-planning-file-io-alternative branch from 122ccc8 to 4a69352 Compare March 9, 2026 11:34
@github-actions github-actions bot removed the OPENAPI label Mar 9, 2026
@nastra nastra force-pushed the remote-planning-file-io-alternative branch from 4a69352 to f9e1bde Compare March 9, 2026 13:22
@nastra nastra force-pushed the remote-planning-file-io-alternative branch 2 times, most recently from c13cd0a to b5367d7 Compare March 12, 2026 07:04
@nastra nastra marked this pull request as ready for review March 12, 2026 07:06
@nastra nastra force-pushed the remote-planning-file-io-alternative branch 2 times, most recently from 3e3ec04 to 55d2b57 Compare March 13, 2026 05:43
@github-actions github-actions bot removed the API label Mar 13, 2026
@nastra nastra requested a review from rdblue March 13, 2026 05:44
@nastra nastra force-pushed the remote-planning-file-io-alternative branch from 55d2b57 to d8d6cb7 Compare March 13, 2026 07:56
@nastra nastra force-pushed the remote-planning-file-io-alternative branch from d8d6cb7 to 456b1d5 Compare March 13, 2026 15:01
@nastra nastra changed the title API, Core, Spark: Pass FileIO on Spark's read path Spark 4.1: Pass FileIO on Spark's read path Mar 13, 2026
@singhpk234 singhpk234 added this to the Iceberg 1.11.0 milestone Mar 13, 2026
@nastra nastra force-pushed the remote-planning-file-io-alternative branch from 456b1d5 to bce38ca Compare March 16, 2026 08:12
@nastra nastra force-pushed the remote-planning-file-io-alternative branch from bce38ca to 1d5a7bb Compare March 16, 2026 09:43

```java
@Override
public long estimatedSize() {
  return SIZE_ESTIMATE;
```
Member:

Size estimate of what? The FileIO's serialized representation?

@nastra (Contributor Author):

This is just a hardcoded size so that sizes don't have to be re-calculated (similar to how we have it in SerializableTableWithSize).

@nastra nastra force-pushed the remote-planning-file-io-alternative branch from 1d5a7bb to 33733d7 Compare March 18, 2026 04:44
@github-actions github-actions bot added the API label Mar 18, 2026
6 participants