[SPEC] Add location keyword for GenericTable API #1543
@@ -199,6 +199,8 @@ components:
type: object
additionalProperties:
type: string
location:
Nit: can we move it above the field `properties`?
yes, updated
@@ -199,6 +199,8 @@ components:
type: object
additionalProperties:
type: string
location:
This should be optional
yes, it is optional, it is not part of the required fields https://github.com/apache/polaris/blob/main/spec/polaris-catalog-apis/generic-tables-api.yaml#L215
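To illustrate what "optional" means for a client payload, here is a minimal Python sketch. It assumes the required list is `name` and `format` (an assumption based on the linked spec line, not confirmed in this thread), and `validate_create_request` is a hypothetical helper, not part of Polaris.

```python
# Assumption: the spec's required list contains only these two fields.
REQUIRED_FIELDS = ("name", "format")


def validate_create_request(payload: dict) -> dict:
    """Check a generic table create payload; `location` may be absent."""
    missing = [f for f in REQUIRED_FIELDS if f not in payload]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    # `location` is optional, but when present it must be a string.
    location = payload.get("location")
    if location is not None and not isinstance(location, str):
        raise TypeError("location must be a string when present")
    return payload
```

A payload without `location` passes this check, while a payload without `name` is rejected.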
@@ -212,6 +214,8 @@ components:
- `properties` properties for the generic table passed on creation

- `doc` comment or description for the generic table

- `location` location for the table
What does this location mean? Is it supposed to point to a particular file? Is it supposed to be the common prefix of the locations of all files within the table?
Hi @dimas-b, I want to resume the discussion. The location refers to the table root location in URI format; I also updated the comment.
I am also copying some of the discussion points over from the email thread.
1) Shall we introduce an explicit definition of location, and what is `location`?
The location refers to the root location of the table.
The table root location is required information for different engines to access tables in formats like Delta, CSV, etc. It is important that we define it explicitly to provide robust cross-engine interoperability. It is also needed for credential vending.
2) Do we support a single table root location or multiple root locations?
Today, only Iceberg tables allow multiple root locations; other table formats, including Delta, Hudi, and Hive-style tables (CSV, Parquet), support only a single table root location. Since Generic Table is not designed to serve Iceberg functionality today, there is no use case for multiple table root locations, and starting with a single location should be sufficient.
If in the future we want to repurpose Generic Table to also support Iceberg capabilities, we may encounter the following use case:
"people may want to move their data to some other location. As an option users may want to write new
files into another location but keep old files in place"
One option is to introduce an extra `additional location` field to record the old data locations, which would clearly separate the current location from the old data locations, similar to what Glue has introduced. Another option is to defer this to a V2 spec.
Glue Table Catalog: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-Table
3) Should the location allow all URI schemes and special characters, especially s3a and s3n?
Various issues have been raised in the Iceberg community around handling all of those S3 schemes during path matching. However, since the root table location is not an absolute path, and Generic Table has restricted support in the short and mid term (Polaris is still promoting native Iceberg support), no complicated matching is done on the path, so allowing all schemes shouldn't cause the issues seen in the Iceberg community. Furthermore, people may use other schemes for specific reasons such as performance or engine limitations.
4) Should the location be an explicit field or a reserved property key?
Given that the table root location is important information for most non-Iceberg table formats, having an explicit field could make things clearer when sharing across engines.
@dimas-b Could you take a look? Please let me know if you have other concerns.
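For point 3 above, a rough Python sketch of what "allow all schemes" could look like: the location is parsed as a URI and only a missing scheme is rejected, with no normalization between `s3`, `s3a`, and `s3n`. The helper name `location_scheme` and the reject-if-no-scheme rule are assumptions for illustration, not behavior defined by the spec.

```python
from urllib.parse import urlsplit


def location_scheme(location: str) -> str:
    """Return the URI scheme of a table root location.

    No scheme allow-list is applied and s3/s3a/s3n are not unified;
    the value is accepted as long as it carries some scheme.
    """
    scheme = urlsplit(location).scheme
    if not scheme:
        raise ValueError(f"location is not a URI with a scheme: {location!r}")
    return scheme
```

Because nothing is matched or rewritten here, `s3a://bucket/t1` and `s3://bucket/t1` would simply be stored as two different strings.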
I am very hesitant to accept that all table files always have a common base URI 🤔
From my POV it is preferable to resolve these concerns on the dev ML, then resume this PR.
Here's an example of an Iceberg table where files are outside the "location" URI: projectnessie/nessie#10817 (comment)
I'm pretty sure Generic Tables will run into similar situations eventually, so we need to be prepared to deal with them.
If the "location" property is singular, we should put clear language in the spec to indicate that no table files should be outside the location. Even in that case, I think we ought to expect user demand for multiple locations at some point.
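A hedged sketch of the "no table files outside the location" rule, written as a plain string-prefix check in Python. `is_under_location` is a hypothetical helper; the spec does not define such a check, and a real implementation would need to decide how to treat scheme aliases and URI encoding.

```python
def is_under_location(file_uri: str, location: str) -> bool:
    """Return True if file_uri sits below the table root location.

    A trailing slash is enforced on the root so that a sibling such as
    `.../tbl2` is not mistaken for a child of `.../tbl`.
    """
    root = location.rstrip("/") + "/"
    return file_uri.startswith(root)
```

With this naive comparison, a file written via `s3a://` is not considered to be under an `s3://` root, which is one concrete way a single-location rule can be violated in practice.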
Hi @dimas-b, thanks a lot for the information! Is that for Iceberg tables? From the description, it does seem to be. I think we all agree that Iceberg tables can have multiple table locations; in fact, @dennishuo is looking into proposing a new "locations" field for Iceberg. Since Generic Table is not designed for Iceberg usage today, so far I don't see it as necessary.
I am also open to having `locations` instead of `location` to accommodate future possibilities, but I am a little worried that no one will need it that way, and it could confuse users.
An alternative is that, if we later encounter table formats that allow multiple locations, we can introduce an extra reserved property key to record the extra locations for those formats.
WDYT? I will also send this to the mailing thread for the record.
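The reserved-property alternative could be sketched roughly as follows in Python. The key name `extra-locations` and the comma-separated encoding are purely hypothetical (property values are plain strings in the spec, so some encoding would have to be chosen), and URIs that themselves contain commas would need a different encoding.

```python
from typing import List

# Hypothetical reserved key; nothing like it exists in the spec today.
EXTRA_LOCATIONS_KEY = "extra-locations"


def all_locations(table: dict) -> List[str]:
    """Collect the root location plus any extra locations from properties."""
    locations = []
    if table.get("location"):
        locations.append(table["location"])
    # Assumed encoding: a single comma-separated string value.
    extra = table.get("properties", {}).get(EXTRA_LOCATIONS_KEY, "")
    locations.extend(p.strip() for p in extra.split(",") if p.strip())
    return locations
```

Tables that never set the key keep the single-location behavior, so the extra key would only matter for formats that actually need it.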
Let's continue this on the email thread.
5deacb4 to 774cf34
Add location keyword for GenericTable create request and GenericTable load response.