[SPEC] Add location keyword for GenericTable API #1543

Draft
wants to merge 2 commits into
base: main
from

Conversation

Contributor

@gh-yzou gh-yzou commented May 7, 2025

Add location keyword for GenericTable create request and GenericTable load response.

flyrain
flyrain previously approved these changes May 7, 2025
@@ -199,6 +199,8 @@ components:
type: object
additionalProperties:
type: string
location:
Contributor

Nit: can we move it above the field properties?

Contributor Author

yes, updated

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board May 7, 2025
@@ -199,6 +199,8 @@ components:
type: object
additionalProperties:
type: string
location:
Contributor

This should be optional

Contributor Author

@@ -212,6 +214,8 @@ components:
- `properties` properties for the generic table passed on creation

- `doc` comment or description for the generic table

- `location` location for the table
Contributor

What does this location mean? Is it supposed to point to a particular file? Is it supposed to be the common prefix of the locations of all files within the table?

Contributor Author

Hi @dimas-b, I want to resume the discussion. The location refers to the table root location in URI format; I have also updated the comment.

I am also copying some of the discussion points over from the email thread.

1) Shall we introduce an explicit definition of location, and what is `location`?
The location refers to the root location of the table.
The table root location is required information for different engines to access tables in formats like Delta, CSV, etc. It is important that we define this information explicitly to provide robust cross-engine interoperability. Furthermore, it is also needed for credential vending.

2) Do we support a single table root location or multiple root locations?
Today, only Iceberg tables allow multiple root locations; other table formats, including Delta, Hudi, and Hive-style tables (CSV, Parquet), support only a single table root location. Since Generic Table is not designed to serve Iceberg functionality today, there is no use case for multiple table root locations, and starting with a single location should be sufficient.

If in the future we want to repurpose Generic Table to also support Iceberg capabilities, and we encounter the following use case:
"people may want to move their data to some other location. As an option users may want to write new files into another location but keep old files in place"

One option is to introduce an extra `additional location` to record the old data locations, which makes a clear separation between the current location and the old data locations, similar to what Glue has introduced. Another option is to handle this in a V2 spec.

Glue Table Catalog: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-Table

3) Should the location allow all URI schemes and special characters, especially s3a and s3n?
Various issues have been raised in the Iceberg community about handling all those S3 schemes during path matching. However, since the root table location is not an absolute path, and Generic Table has restricted support in the short and mid term (Polaris is still promoting native Iceberg support), no complicated matching operations will be done on the path, so allowing all schemes shouldn't cause the issues seen in the Iceberg community. Furthermore, people may use other schemes for specific reasons, such as performance or engine limitations.

4) Should the location be an explicit field or a reserved property key?
The table root location is important information for most non-Iceberg table formats, so having an explicit field makes things clearer when sharing across engines.
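To make points 3 and 4 concrete, here is a minimal Python sketch of an explicit, optional `location` field that accepts any URI scheme. The `GenericTableRequest` class and the `location_scheme` helper are hypothetical names used only for illustration; they are not the Polaris model or API, and requiring a scheme is an assumption based on the location being in URI format.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional
from urllib.parse import urlsplit


@dataclass
class GenericTableRequest:
    """Hypothetical client-side mirror of a create request (not the Polaris model)."""
    name: str
    doc: Optional[str] = None
    # Explicit field rather than a reserved property key; optional per review.
    location: Optional[str] = None
    properties: Dict[str, str] = field(default_factory=dict)


def location_scheme(location: str) -> str:
    """Return the URI scheme of a table root location.

    Any scheme is accepted (s3, s3a, s3n, abfss, file, ...); the only
    assumption enforced here is that a scheme is present at all.
    """
    scheme = urlsplit(location).scheme
    if not scheme:
        raise ValueError(f"location must be a URI with a scheme: {location!r}")
    return scheme
```

With this sketch, `location_scheme("s3a://bucket/warehouse/t1")` returns `"s3a"`, and a request built without a location simply carries `location=None`.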

@dimas-b Could you take a look? Please let me know if you have other concerns.

Contributor

I am very hesitant to accept that all table files always have a common base URI 🤔

From my POV it is preferable to resolve these concerns on the dev ML, then resume this PR.

Contributor

@dimas-b dimas-b May 19, 2025

Here's an example of an Iceberg table where files are outside the "location" URI: projectnessie/nessie#10817 (comment)

I'm pretty sure Generic Tables will run into similar situations eventually, so we need to be prepared to deal with them.

If the "location" property is singular, we should put clear language in the spec to indicate that no table files should be outside the location. Even in that case, I think we ought to expect user demand for multiple locations at some point.
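As an illustration of what "no table files outside the location" could mean in practice, here is a minimal Python sketch of a containment check. The function name `is_under_location` is hypothetical, and the rule it encodes (same scheme, same authority, path prefix on a directory boundary) is only one possible reading; the spec does not define it.

```python
from urllib.parse import urlsplit


def is_under_location(file_uri: str, location: str) -> bool:
    """Return True if file_uri lies under the table root location.

    Assumed rule: scheme and authority must match exactly, and the file
    path must start with the root path followed by a "/" separator.
    """
    f, root = urlsplit(file_uri), urlsplit(location)
    if (f.scheme, f.netloc) != (root.scheme, root.netloc):
        return False
    # Normalize the root so that ".../t1" does not match ".../t10/...".
    root_path = root.path.rstrip("/") + "/"
    return f.path.startswith(root_path)
```

Note that this strict comparison treats `s3://` and `s3a://` as different locations, which is exactly the kind of scheme-matching question that would need a decision if the spec required containment.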

Contributor Author

Hi @dimas-b, thanks a lot for the information! Is that for Iceberg tables? From the description, it does seem to be. I think we all agree that Iceberg tables can have multiple table locations; in fact, @dennishuo is looking into proposing new "locations" fields for Iceberg. Since Generic Table is not designed for Iceberg usage today, so far I don't see it as necessary.

I am also open to having `locations` instead of `location` to accommodate future possibilities, but I am a little bit worried that no one will need it that way, and that it could confuse users.
An alternative: if in the future we encounter table formats that allow multiple locations, we can introduce an extra reserved property key to record the extra locations for those table formats.

WDYT? I will also send this to the mailing list thread for the record.

Contributor

Let's continue this on the email thread.

@gh-yzou gh-yzou marked this pull request as draft May 8, 2025 17:40
@gh-yzou gh-yzou force-pushed the yzou-generic-table-location branch from 5deacb4 to 774cf34 Compare May 14, 2025 18:55
4 participants