[improve][io] Support Protobuf schema for Kafka source connector #23954
jiangpengcheng wants to merge 13 commits into apache:master from
Conversation
/pulsarbot rerun-failure-checks
Codecov Report
Attention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #23954 +/- ##
============================================
+ Coverage 73.57% 74.20% +0.63%
+ Complexity 32624 32438 -186
============================================
Files 1877 1863 -14
Lines 139502 144332 +4830
Branches 15299 16467 +1168
============================================
+ Hits 102638 107104 +4466
+ Misses 28908 28771 -137
- Partials 7956 8457 +501
Flags with carried forward coverage won't be shown.
Hi @lhotari, could you help review this PR? Thanks!
Force-pushed from ba65b6b to 07d4919
@jiangpengcheng I have resolved the merge conflict after the #24201 changes, which upgraded the Confluent Platform version to 7.8.2 and the Kafka client version to 3.8.1. The integration test seems to fail now, and I'm not sure exactly what the problem is. I guess you are already aware that the way to run the integration test locally is to first build the docker image with this command: And then run the test
lhotari left a comment
There seems to be even more involved in resolving the complete protobuf schema, as can be seen in the documentation at https://docs.confluent.io/platform/current/schema-registry/fundamentals/serdes-develop/serdes-protobuf.html .
For example:
- support for multiple message types defined in the same schema
- support for schema references
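As a concrete illustration of the first point, a single .proto file registered as one schema can define several top-level message types (a hypothetical example, not taken from this PR):

```protobuf
syntax = "proto3";
package example;

// One registered schema containing two top-level message types.
message Order {     // message index path [0]
  string id = 1;
}

message Payment {   // message index path [1]
  string order_id = 1;
  double amount = 2;
}
```

A payload encoding a Payment would carry the message indexes [1] in its header, which is why the indexes cannot always be skipped when more than one message type is registered under one schema.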
// the kafka protobuf serializer encodes the MessageIndexes in the payload, we need to skip them
if (schemaType == SchemaType.PROTOBUF_NATIVE) {
    MessageIndexes.readFrom(buffer);
}
There is a reason the MessageIndexes exist: each protobuf schema can include multiple message definitions, and the MessageIndexes identify which specific message a payload encodes.
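For context, Confluent's protobuf wire format prepends a magic byte, a big-endian 4-byte schema id, and then the message-index array, encoded as zig-zag varints (with the common case of [0] optimized to a single 0 byte). A minimal sketch of parsing that header, based only on Confluent's public wire-format documentation; the class and method names here are made up for illustration and are not from this PR:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class ConfluentProtobufHeader {

    // Reads one zig-zag-encoded varint, as in protobuf's sint32 encoding.
    static int readZigZagVarint(ByteBuffer buf) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = buf.get();
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (value >>> 1) ^ -(value & 1); // zig-zag decode
    }

    // Skips the magic byte and schema id, then returns the message indexes.
    static List<Integer> readHeader(ByteBuffer buf) {
        if (buf.get() != 0) {
            throw new IllegalArgumentException("Unknown magic byte");
        }
        int schemaId = buf.getInt(); // big-endian 4-byte schema id
        int count = readZigZagVarint(buf);
        List<Integer> indexes = new ArrayList<>();
        if (count == 0) {
            indexes.add(0); // optimization: indexes [0] encoded as a single 0 byte
        } else {
            for (int i = 0; i < count; i++) {
                indexes.add(readZigZagVarint(buf));
            }
        }
        return indexes;
    }

    public static void main(String[] args) {
        // header: magic 0, schema id 7, index count 1 (zig-zag 2), index 1 (zig-zag 2)
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {0, 0, 0, 0, 7, 2, 2});
        System.out.println(readHeader(buf)); // prints [1]
    }
}
```

The returned index list is a path into the schema's (possibly nested) message definitions, so skipping it is only safe when the schema defines a single message type.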
Once the license issue is resolved, we could come back to this detail. If the code for KafkaProtobufDeserializer were Apache 2.0 licensed, we could safely look at the code and see how proper schema resolution can be handled for protobuf-encoded messages. As long as we have the license issue, we'd better not copy-paste code, due to the risk of an IPR violation.
One potential problem is the Confluent schema registry licenses. For example, https://github.com/confluentinc/schema-registry/blob/master/protobuf-serializer/src/main/java/io/confluent/kafka/serializers/protobuf/KafkaProtobufDeserializer.java doesn't have an Apache 2.0 license, while https://github.com/confluentinc/schema-registry/blob/master/avro-serializer/src/main/java/io/confluent/kafka/serializers/KafkaAvroDeserializer.java is under the Apache 2.0 license.
<dependency>
    <groupId>io.confluent</groupId>
    <artifactId>kafka-protobuf-serializer</artifactId>
According to the source repository, https://github.com/confluentinc/schema-registry, this library has a license of Confluent Community License. This didn't change between the previous
https://github.com/confluentinc/schema-registry/tree/v6.2.8 version and the currently used https://github.com/confluentinc/schema-registry/tree/v7.8.2 version.
The information provided by Confluent about the license is slightly conflicting: it says that the client libraries are under the Apache 2.0 license, but at the same time it says "See LICENSE file in each subfolder for detailed license agreement". However, the license headers for the protobuf classes are very explicit about the license, while the Avro-related classes carry an Apache 2.0 license header.
In the pom file https://packages.confluent.io/maven/io/confluent/kafka-protobuf-serializer/7.8.2/kafka-protobuf-serializer-7.8.2.pom, there's an Apache 2.0 license. This is also the case for the 6.2.8 version, https://packages.confluent.io/maven/io/confluent/kafka-protobuf-serializer/6.2.8/kafka-protobuf-serializer-6.2.8.pom . I wonder if they have just forgotten to make things consistent in the repository.
I found an existing issue about the license for protobuf libraries: confluentinc/schema-registry#1558
there's even a branch to fix the issue, dating back to 2021, but it was never merged: https://github.com/confluentinc/schema-registry/compare/protobuf-licensing
Confluent employee replied in 2021: confluentinc/schema-registry#1558 (comment)
Oh, I didn't check the library license. I will close this then.
We cannot add a dependency on the io.confluent:kafka-protobuf-serializer library. This is how Claude AI explained the licensing issue:
We could also consider supporting protobuf-encoded messages that aren't encoded using Confluent's schema registry client libraries and don't use the schema registry for dynamically retrieving the schema. In those cases, the schema would have to be provided in the source connector config. That would be a different feature from what this PR implements.
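If that alternative were pursued, the connector config might carry the schema location directly. A purely hypothetical sketch of what such a config could look like; none of these keys exist in the current connector:

```json
{
  "topic": "payments",
  "bootstrapServers": "kafka:9092",
  "valueSchemaType": "PROTOBUF_NATIVE",
  "protobufDescriptorFile": "/conf/schema.desc",
  "protobufMessageName": "example.Payment"
}
```

Here a compiled FileDescriptorSet and a fully qualified message name would let the connector deserialize payloads without any schema registry lookup.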
We cannot add a dependency on the io.confluent:kafka-protobuf-serializer library since it is under the Confluent Community License, as briefly commented in an issue discussion.
It looks like the AWS Glue Schema Registry Library contains the classes that we'd need for parsing the Confluent Schema Registry protobuf wire format. Under https://github.com/awslabs/aws-glue-schema-registry/tree/master/serializer-deserializer/src/main/java/com/amazonaws/services/schemaregistry there are both serializers and deserializers which we could use as a replacement for the Confluent Community Licensed libraries, while continuing to use the Apache 2.0 licensed Confluent Schema Registry libraries. There are limitations in the AWS Glue libs, such as not supporting imported schemas the way the Confluent Schema Registry does: https://github.com/awslabs/aws-glue-schema-registry/blob/72cbca0b05a758f0a775c39e580a15e7f19613fb/serializer-deserializer/src/main/java/com/amazonaws/services/schemaregistry/serializers/protobuf/MessageIndexFinder.java#L71-L72
I also found the Apicurio Schema Registry libs, which are Apache 2.0 licensed: https://github.com/Apicurio/apicurio-registry/tree/main/serdes/kafka/protobuf-serde/src/main/java/io/apicurio/registry/serde/protobuf . However, it looks like the wire format is Apicurio-specific and not compatible with the Confluent Schema Registry.
More OSS schema registry clients with protobuf support:
Motivation
The Kafka source connector supports schemas registered in a schema registry, but it only supports Avro for now. Since the schema registry also supports Protobuf, it would be better to support that as well.
Modifications
Update the Kafka source connector to support Protobuf schemas.
Verifying this change
Make sure that the change passes the CI checks.
This change added tests and can be verified as follows:
- Use KafkaProtobufDeserializer to create a Kafka source connector

Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
- doc
- doc-required
- doc-not-needed
- doc-complete

Matching PR in forked repository
PR in forked repository: jiangpengcheng#39