@shubhampatel28
What is the purpose of the pull request

  • This PR introduces a new Spark catalog plugin that provides unified access to both Hudi and Iceberg tables using AWS Glue as the metastore. The plugin automatically detects table formats and delegates operations to the appropriate catalog implementation (Iceberg's SparkCatalog or Hudi's HoodieCatalog), enabling seamless querying of mixed table formats through a single catalog interface.

Brief change log

  • Added new module xtable-spark-plugin to parent POM
  • Implemented XTableSparkCatalog class with unified catalog interface for Hudi and Iceberg tables
  • Added automatic table format detection using multiple fallback strategies
  • Integrated AWS Glue client for table metadata lookup with cross-account support

Verify this pull request

Added TestXTableSparkCatalog for unit tests

<groupId>org.apache.iceberg</groupId>
<artifactId>iceberg-spark-runtime-${spark.version.prefix}_${scala.binary.version}</artifactId>
<version>${iceberg.version}</version>
<scope>compile</scope>

Contributor

Do we need this at compile scope, or can we mark it as provided?

<!-- Hudi Spark bundle for HoodieCatalog -->
<dependency>
<groupId>org.apache.hudi</groupId>
<artifactId>hudi-spark${spark.version.prefix}-bundle_${scala.binary.version}</artifactId>

Contributor

Same question here: does this need compile scope, or can it be marked as provided?

this.glueClient = buildGlueClient(options);

SparkSession spark = SparkSession.active();
spark.conf().set("hoodie.schema.on.read.enable", "true");

Contributor

Can you remind me of the historical reason for setting this? I think my draft had it, but I'm curious what the original exception was.


private GlueClient buildGlueClient(CaseInsensitiveStringMap options) {
  String region =
      options.getOrDefault("glue.region", options.getOrDefault("aws.region", "us-west-2"));

Contributor

I think we should be explicit here: if the region is not set, we should require the user to pass the region property (via spark shell, etc.) rather than fall back to a default.
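
A minimal sketch of what that could look like, assuming the same two option keys as the snippet above. The class and method names (`RegionResolver`, `resolveRegion`) are hypothetical, and the helper takes a plain `Map<String, String>` (which `CaseInsensitiveStringMap` implements) so it can be tested without Spark:

```java
import java.util.Map;

// Hypothetical helper, not part of the PR: resolves the Glue region from the
// catalog options and fails fast instead of defaulting to a hard-coded region.
final class RegionResolver {

  static String resolveRegion(Map<String, String> options) {
    // Prefer the Glue-specific key, then fall back to the generic AWS key.
    String region = options.get("glue.region");
    if (region == null || region.trim().isEmpty()) {
      region = options.get("aws.region");
    }
    if (region == null || region.trim().isEmpty()) {
      throw new IllegalArgumentException(
          "No AWS region configured: set 'glue.region' or 'aws.region' on the catalog");
    }
    return region.trim();
  }
}
```

With this shape, a missing region surfaces as a clear configuration error at catalog initialization rather than as a lookup against the wrong region.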

}

// Pass catalog-id for cross-account Glue access if specified
if (options.containsKey("catalog-id")) {

Contributor

I think we already check this in buildGlueClient, right? Maybe we can remove the log from buildGlueClient?

try {
  SparkSession spark = SparkSession.active();
  CatalogPlugin sparkCatalog = spark.sessionState().catalogManager().catalog("spark_catalog");
  hudiCatalog.setDelegateCatalog(sparkCatalog);

Contributor

Do we need to call setDelegateCatalog? My assumption is that this would be the default. Maybe in your testing you can try removing it and see what happens.


Map<String, String> hudiOptions = new HashMap<>(options);

hudiOptions.put("provider", "hudi");

Contributor

Can you also try removing these properties? I think they should be handled by default.

  String tableFormat = TableFormatUtils.getTableFormat(parameters);
  LOG.debug("Detected table format '{}' for table: {}", tableFormat, ident);
  return tableFormat;
} catch (IllegalArgumentException e) {

Contributor

Should we leave a comment here? This is basically the case where the table does not exist yet during initial creation, which is why we are catching this exception. I wonder, though, whether it would be better to catch EntityNotFoundException instead, in which case you might not be using the util.
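
A rough sketch of the narrower catch, under stated assumptions: `TableNotFoundException` is a hypothetical local stand-in for the Glue SDK's EntityNotFoundException so the example compiles without the AWS SDK, the `fetchParameters` function stands in for the Glue table lookup, and reading the format from a `table_type` parameter is an assumption rather than what the PR does:

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch, not the PR's implementation.
final class FormatLookup {

  // Stand-in for the Glue SDK's EntityNotFoundException (assumption).
  static final class TableNotFoundException extends RuntimeException {
    TableNotFoundException(String message) {
      super(message);
    }
  }

  // Returns the detected format, or empty if the table does not exist yet.
  static Optional<String> detectFormat(
      String ident, Function<String, Map<String, String>> fetchParameters) {
    try {
      Map<String, String> parameters = fetchParameters.apply(ident);
      // Assumed key; the real detection logic may look at other properties.
      return Optional.ofNullable(parameters.get("table_type"));
    } catch (TableNotFoundException e) {
      // Expected during initial creation: the table is not in the metastore yet.
      return Optional.empty();
    }
  }
}
```

Catching only the not-found case lets any other failure (bad credentials, malformed metadata) propagate instead of being mistaken for a missing table.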

}

@VisibleForTesting
boolean isHudiFormat(String inputFormat, String outputFormat, String serdeLib) {

Contributor

During your testing, can you check the AWS Glue console to see what the table and storage properties are for Hudi and Iceberg tables? I just want to make sure the way we match on a table is not fragile.

break;

default:
LOG.info("No specific format specified, defaulting to Iceberg for table: {}", ident);

Contributor

@rahil-c Oct 27, 2025

Should we throw an exception here instead? I'm not sure we are supporting other formats for this current testing.
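
Something along these lines, as a sketch only. The `FormatDispatch` class, the `TableFormat` enum, and the choice of `UnsupportedOperationException` are all hypothetical; the point is just that the default branch rejects unknown formats instead of silently routing them to Iceberg:

```java
import java.util.Locale;

// Hypothetical sketch, not the PR's code: strict dispatch on the detected format.
final class FormatDispatch {

  enum TableFormat { HUDI, ICEBERG }

  static TableFormat requireSupportedFormat(String tableFormat, String ident) {
    if (tableFormat == null) {
      throw new UnsupportedOperationException(
          "No table format detected for table: " + ident);
    }
    switch (tableFormat.toUpperCase(Locale.ROOT)) {
      case "HUDI":
        return TableFormat.HUDI;
      case "ICEBERG":
        return TableFormat.ICEBERG;
      default:
        // Fail loudly rather than defaulting to Iceberg for an unknown format.
        throw new UnsupportedOperationException(
            "Unsupported table format '" + tableFormat + "' for table: " + ident);
    }
  }
}
```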

@shubhampatel28 shubhampatel28 marked this pull request as draft October 27, 2025 21:58