-
Couldn't load subscription status.
- Fork 188
Add XTable Spark Catalog Plugin for unified Hudi/Iceberg table access #752
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add XTable Spark Catalog Plugin for unified Hudi/Iceberg table access #752
Conversation
| <groupId>org.apache.iceberg</groupId> | ||
| <artifactId>iceberg-spark-runtime-${spark.version.prefix}_${scala.binary.version}</artifactId> | ||
| <version>${iceberg.version}</version> | ||
| <scope>compile</scope> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this as compile, or can we mark them as provided?
| <!-- Hudi Spark bundle for HoodieCatalog --> | ||
| <dependency> | ||
| <groupId>org.apache.hudi</groupId> | ||
| <artifactId>hudi-spark${spark.version.prefix}-bundle_${scala.binary.version}</artifactId> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar for this comment
| this.glueClient = buildGlueClient(options); | ||
|
|
||
| SparkSession spark = SparkSession.active(); | ||
| spark.conf().set("hoodie.schema.on.read.enable", "true"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remind me what was the historical reason for setting this, I think my draft had this but curious what was the original exception?
|
|
||
| private GlueClient buildGlueClient(CaseInsensitiveStringMap options) { | ||
| String region = | ||
| options.getOrDefault("glue.region", options.getOrDefault("aws.region", "us-west-2")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be explicit and if this is not set have the user pass the region property via spark shell, etc
| } | ||
|
|
||
| // Pass catalog-id for cross-account Glue access if specified | ||
| if (options.containsKey("catalog-id")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we already checked in buildGlueClient right? Maybe we can remove the log from build glueClient?
| try { | ||
| SparkSession spark = SparkSession.active(); | ||
| CatalogPlugin sparkCatalog = spark.sessionState().catalogManager().catalog("spark_catalog"); | ||
| hudiCatalog.setDelegateCatalog(sparkCatalog); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we require to setDelegateCatlaog, my assumption is this would be the default, maybe in your testing you can try removing it and see what happens.
|
|
||
| Map<String, String> hudiOptions = new HashMap<>(options); | ||
|
|
||
| hudiOptions.put("provider", "hudi"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also try removing these properties I think it should be handled by default.
| String tableFormat = TableFormatUtils.getTableFormat(parameters); | ||
| LOG.debug("Detected table format '{}' for table: {}", tableFormat, ident); | ||
| return tableFormat; | ||
| } catch (IllegalArgumentException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we leave comment here since this is basically the case where during intial creation the table does not exist, hence why we are catching this exception. I wonder though if instead it would be better to catch the EntityNotFoundException, and in that case you might not be using the util.
| } | ||
|
|
||
| @VisibleForTesting | ||
| boolean isHudiFormat(String inputFormat, String outputFormat, String serdeLib) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During your testing can you check the AWS Glue console to see what the table and storage properties are for hudi and iceberg table, just to make sure how we match on a table is not fickle.
| break; | ||
|
|
||
| default: | ||
| LOG.info("No specific format specified, defaulting to Iceberg for table: {}", ident); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we instead be throwing an exception here, as im not sure if we are supporting other formats for this current testing
What is the purpose of the pull request
Brief change log
Verify this pull request
Added TestXTableSparkCatalog for unit tests