[FLINK-29549] Flink Glue Catalog integration #191


Open · FranMorilloAWS wants to merge 3 commits into main from flink-glue-integration

Conversation

FranMorilloAWS

Purpose of the change

Implements an AWS Glue Catalog integration for Flink (FLINK-29549).

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment
  • Added unit tests
  • Manually verified by running the Kinesis connector on a local Flink cluster.

Significant changes

(Please check any boxes [x] if the answer is "yes". You can first publish the PR and check them afterwards, for convenience.)

  • Dependencies have been added or upgraded
  • Public API has been changed (Public API is any class annotated with @Public(Evolving))
  • Serializers have been changed
  • New feature has been introduced
    • If yes, how is this documented? (not applicable / docs / JavaDocs / not documented)


boring-cyborg bot commented Mar 3, 2025

Thanks for opening this pull request! Please check out our contributing guidelines. (https://flink.apache.org/contributing/how-to-contribute.html)

Contributor @leekeiabstraction left a comment:

Provided early comments/questions

Access tables from different catalogs in the same query:

```sql
-- Join tables from different catalogs
```
Contributor:

This example joins on a customer id match. IIUC, Glue shouldn't keep actual values, only metadata on the schema. Should the example join on something more concrete, e.g. a field name match?
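For context, a minimal sketch of the kind of cross-catalog join being discussed, written against Flink's Table API. The catalog name, option keys, table names, and join key are illustrative assumptions, not taken from this PR:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CrossCatalogJoinExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Register a Glue-backed catalog via SQL; these option keys are assumptions,
        // not necessarily the final names used by this PR.
        tEnv.executeSql(
                "CREATE CATALOG glue_catalog WITH ("
                        + "'type' = 'glue',"
                        + "'region' = 'us-east-1',"
                        + "'default-database' = 'sales')");

        // Join a table resolved through the Glue catalog with a table from the
        // default in-memory catalog in a single query.
        tEnv.executeSql(
                        "SELECT o.order_id, c.customer_name "
                                + "FROM glue_catalog.sales.orders AS o "
                                + "JOIN default_catalog.default_database.customers AS c "
                                + "ON o.customer_id = c.customer_id")
                .print();
    }
}
```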

<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.1</version>
Contributor:

Is there a motivation for specifying 3.8.1 here? (Similarly for shade plugin and flink dependencies)

The pom.xml at repo root already defines build plugins. Should we point to pom.xml at repo root as parent and rely on convention over configuration?

```xml
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
```

Contributor:

Also applies to flink-catalog-aws-glue/pom.xml

Contributor:

+1 - let's delegate this to the project setup on flink-connector-aws

<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-json</artifactId>
<version>1.18.0</version>
Contributor:

Why do we use 1.18 here over ${flink.version}?

Contributor:

Similar comments apply to the rest of this pom.xml: the Flink packages, jackson-databind, the connector, etc.

</dependency>
<dependency>
<groupId>software.amazon.awssdk</groupId>
<artifactId>bedrockruntime</artifactId>
Contributor:

Can you elaborate why bedrockruntime is needed?


// Define configuration options that users must provide
public static final ConfigOption<String> REGION =
ConfigOptions.key("region")
Contributor:

Should we have a qualifier/prefix so that it's clear that the config is for the GlueCatalogConnector?

For example:

catalog.glue.region
catalog.glue.default-database
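For illustration, a minimal sketch of how the prefixed keys suggested above could be declared with Flink's ConfigOptions builder (the key names follow the reviewer's suggestion and are not the PR's actual option names):

```java
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

// Hedged sketch of the prefixed keys suggested above; the PR keeps the
// unprefixed names, so these identifiers are illustrative only.
public class GlueCatalogOptionsSketch {

    public static final ConfigOption<String> REGION =
            ConfigOptions.key("catalog.glue.region")
                    .stringType()
                    .noDefaultValue()
                    .withDescription("AWS region used by the Glue catalog client.");

    public static final ConfigOption<String> DEFAULT_DATABASE =
            ConfigOptions.key("catalog.glue.default-database")
                    .stringType()
                    .defaultValue("default")
                    .withDescription("Database used when none is specified in the query.");
}
```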

Author:

They don't do that distinction for the other catalogs.

Since users would already set the catalog type to glue, I'm not sure it makes sense to be repetitive and add glue to all configs.

return true;
} catch (EntityNotFoundException e) {
return false;
} catch (Exception e) {
Contributor:

Let's use a more specific exception here.
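As an illustration, a hedged sketch of what a narrower catch could look like, assuming this is the database-existence check; the method and parameter names are assumptions, not the PR's exact code:

```java
import org.apache.flink.table.catalog.exceptions.CatalogException;
import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.EntityNotFoundException;
import software.amazon.awssdk.services.glue.model.GetDatabaseRequest;
import software.amazon.awssdk.services.glue.model.GlueException;

// Sketch only: catch the SDK's GlueException instead of a blanket Exception,
// and wrap unexpected failures in Flink's CatalogException.
public boolean glueDatabaseExists(GlueClient glueClient, String databaseName) {
    try {
        glueClient.getDatabase(GetDatabaseRequest.builder().name(databaseName).build());
        return true;
    } catch (EntityNotFoundException e) {
        return false;
    } catch (GlueException e) {
        throw new CatalogException("Error checking existence of database " + databaseName, e);
    }
}
```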

* @param functionPath fully qualified function path
* @throws CatalogException In case of Unexpected errors.
*/
public void dropGlueFunction(ObjectPath functionPath) throws CatalogException {
Contributor:

It seems like we do not throw CatalogException within the method; do we need to catch, wrap, and rethrow?
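If the wrap-and-rethrow route is taken, it could look roughly like the sketch below. The Glue delete call and the glueClient field are assumptions about the surrounding operations class, not the PR's exact implementation:

```java
import org.apache.flink.table.catalog.ObjectPath;
import org.apache.flink.table.catalog.exceptions.CatalogException;
import software.amazon.awssdk.services.glue.model.DeleteUserDefinedFunctionRequest;
import software.amazon.awssdk.services.glue.model.GlueException;

// Hedged sketch of the catch/wrap/rethrow option discussed above;
// glueClient is an assumed field on the surrounding operations class.
public void dropGlueFunction(ObjectPath functionPath) throws CatalogException {
    try {
        glueClient.deleteUserDefinedFunction(
                DeleteUserDefinedFunctionRequest.builder()
                        .databaseName(functionPath.getDatabaseName())
                        .functionName(functionPath.getObjectName())
                        .build());
    } catch (GlueException e) {
        throw new CatalogException("Error dropping function " + functionPath.getFullName(), e);
    }
}
```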

Comment on lines 36 to 41
case JAVA:
return GlueCatalogConstants.FLINK_JAVA_FUNCTION_PREFIX + function.getClassName();
case SCALA:
return GlueCatalogConstants.FLINK_SCALA_FUNCTION_PREFIX + function.getClassName();
case PYTHON:
return GlueCatalogConstants.FLINK_PYTHON_FUNCTION_PREFIX + function.getClassName();
Contributor:

It seems like we are using a self-defined pattern for class names:

    public static final String FLINK_SCALA_FUNCTION_PREFIX = "flink:scala:";
    public static final String FLINK_PYTHON_FUNCTION_PREFIX = "flink:python:";
    public static final String FLINK_JAVA_FUNCTION_PREFIX = "flink:java:";

Documentation on Glue's UserDefinedFunction actually says "The Java class that contains the function code." Should we use the Java namespace + class name format instead?

https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/glue/model/UserDefinedFunction.html#className()

Contributor @Samrat002 left a comment:

Hi @FranMorilloAWS,

It's great to see efforts toward integrating AWS Glue Catalog with Flink. I previously submitted a pull request that implements proposed FLIP-277: apache/flink-connector-aws#47.

I have a couple of questions:

  1. Could you please share the reasons for not continuing the existing work in PR [FLINK-30481][FLIP-277] GlueCatalog Implementation #47? What restriction led you to take over the ongoing work and duplicate the effort?

  2. I noticed that significant portions of the code are copied from PR [FLINK-30481][FLIP-277] GlueCatalog Implementation #47. While I'm happy to assist with the review, it would be in the collaborative spirit of open source to retain the original commits when incorporating code from previous contributions.

Contributor @hlteoh37 left a comment:

Thanks for the contribution @FranMorilloAWS! It's really nice to see detailed docs and well-thought-through APIs!

Added some comments around the structure of the repo - will continue looking at the code structure

<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.1</version>
Contributor:

+1 - let's delegate this to the project setup on flink-connector-aws

Contributor @Samrat002 left a comment:

Thank you @FranMorilloAWS.

I have done a very high-level pass.

Key notices:

  1. The Glue client offers paginated requests. These are generally helpful for responses that have a very large number of records, for example listing tables. Please incorporate those changes (see the paginator sketch after this comment).

  2. Missing unit tests for type conversion.

  3. IMO, this PR can be reduced to the scope of databases and tables; functions, partitions, and other features like stats can be part of a future PR.

Cheers, Samrat
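On point 1, a minimal sketch of what a paginated table listing could look like with the AWS SDK v2 paginator (method and parameter names are assumptions, not the PR's code):

```java
import java.util.ArrayList;
import java.util.List;
import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.GetTablesRequest;
import software.amazon.awssdk.services.glue.model.GetTablesResponse;

// Hedged sketch: use the SDK's built-in paginator so large listings are
// fetched page by page instead of relying on a single response.
public List<String> listGlueTables(GlueClient glueClient, String databaseName) {
    List<String> tableNames = new ArrayList<>();
    GetTablesRequest request = GetTablesRequest.builder().databaseName(databaseName).build();
    for (GetTablesResponse page : glueClient.getTablesPaginator(request)) {
        page.tableList().forEach(table -> tableNames.add(table.name()));
    }
    return tableNames;
}
```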


/** Utilities for Glue catalog Function related operations. */
@Internal
public class GlueFunctionsOperations extends AbstractGlueOperations {
Contributor:

IMO, function support can be part of a follow-up PR. WDYT?

Author:

Unfortunately, functions are needed to be able to select specific columns.

throw new IllegalArgumentException("Glue type cannot be null or empty");
}

// Trim but don't lowercase - we'll handle case-insensitivity per type
Contributor:

Why is case-insensitivity handled per type?
Doesn't handling case-insensitivity per type increase code complexity?

Author:

This approach is necessary because:

  • Glue types like "string", "int", "boolean" should be matched case-insensitively (i.e., "STRING" and "string" are the same type).
  • It matches AWS Glue's behavior, where type names are case-insensitive.
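A minimal sketch of the per-type, case-insensitive matching described above; the helper name and the handful of covered types are illustrative only, not the PR's actual mapping:

```java
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.types.DataType;

// Illustrative only: trim the raw Glue type and compare case-insensitively per type.
public DataType glueTypeToFlinkType(String glueType) {
    if (glueType == null || glueType.trim().isEmpty()) {
        throw new IllegalArgumentException("Glue type cannot be null or empty");
    }
    String trimmed = glueType.trim();
    if ("string".equalsIgnoreCase(trimmed)) {
        return DataTypes.STRING();
    } else if ("int".equalsIgnoreCase(trimmed)) {
        return DataTypes.INT();
    } else if ("boolean".equalsIgnoreCase(trimmed)) {
        return DataTypes.BOOLEAN();
    }
    throw new UnsupportedOperationException("Unsupported Glue type: " + glueType);
}
```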

// Create a synchronized client builder to avoid concurrent modification exceptions
this.glueClient = GlueClient.builder()
.region(Region.of(region))
.credentialsProvider(software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider.create())
Contributor:

The catalog will only work with DefaultCredentialsProvider.

Add other modes via aws.credentials.provider:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=235836075#FLIP277:NativeGlueCatalogSupportinFlink-Configurations

Contributor:

The Glue client should at least support:

  1. glueendpoint
  2. httpClient
  3. different credential configurations

Not supporting these minimal requirements may create constraints on adoption (see the client-builder sketch after this comment).
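A hedged sketch of a more configurable client setup along those lines; the parameter names and how they would be wired to catalog options are assumptions:

```java
import java.net.URI;
import software.amazon.awssdk.auth.credentials.AwsCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.GlueClientBuilder;

// Hedged sketch only; the surrounding configuration plumbing is assumed.
public GlueClient createGlueClient(
        String region, String glueEndpoint, AwsCredentialsProvider credentialsProvider) {
    GlueClientBuilder builder =
            GlueClient.builder()
                    .region(Region.of(region))
                    // e.g. resolved from an aws.credentials.provider option instead of
                    // hard-coding DefaultCredentialsProvider
                    .credentialsProvider(credentialsProvider);
    if (glueEndpoint != null) {
        // Optional endpoint override, e.g. for a VPC endpoint or a local Glue mock.
        builder.endpointOverride(URI.create(glueEndpoint));
    }
    return builder.build();
}
```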

Author:

Can we add those after the first release?

@FranMorilloAWS force-pushed the flink-glue-integration branch from 1725a0f to b1f2e33 on April 21, 2025, 13:07
Comment on lines +138 to +139
GetDatabaseResponse response = glueClient.getDatabase(
GetDatabaseRequest.builder()
.name(databaseName)
.build()
);
Contributor:

validateDatabaseName only does the pattern match using !VALID_NAME_PATTERN.matcher(databaseName).matches().

I am more concerned about

  1. When a user tries to run SHOW CREATE DATABASE MyDataBase, which database will be returned? MyDatabase, myDatabase, or some other casing variant?

  2. What will the SHOW DATABASES output look like?

I don't see any mechanism in the code that can identify this anomaly.

Here is one way to fix this problem:

Add an identifier before the character in the name to mark that the character is uppercase.

myDatabase will translate to -my-database; a character succeeding the identifier - will be interpreted as uppercase.

@FranMorilloAWS (Author):

@Samrat002 Considering that Glue will lowercase all database names, we shouldn't allow users to create databases with uppercase characters. Therefore, if they run SHOW CREATE DATABASE they must give the database name in lowercase; otherwise the command will fail. I added additional tests in the GlueCatalog and GlueDatabaseOperations tests to show this.
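A minimal sketch of the lower-case-only validation described above; the regex and the choice of exception are assumptions, not the PR's exact code:

```java
import java.util.Locale;
import java.util.regex.Pattern;
import org.apache.flink.table.catalog.exceptions.CatalogException;

// Illustrative only: reject names that are not already lower case, matching
// the behaviour described in the comment above. The pattern is an assumption.
private static final Pattern VALID_NAME_PATTERN = Pattern.compile("[a-z0-9_]+");

private static void validateDatabaseName(String databaseName) {
    if (databaseName == null
            || !databaseName.equals(databaseName.toLowerCase(Locale.ROOT))
            || !VALID_NAME_PATTERN.matcher(databaseName).matches()) {
        throw new CatalogException(
                "Database name must be lower case and match "
                        + VALID_NAME_PATTERN
                        + ": "
                        + databaseName);
    }
}
```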


4 participants