Skip to content

[FLINK-39200][mysql] Skip insert/update/delete binlog deserialization of unsubscribed tables in MySql CDC binary log client.#4302

Open
chengcongchina wants to merge 2 commits intoapache:masterfrom
chengcongchina:FLINK-39200
Open

[FLINK-39200][mysql] Skip insert/update/delete binlog deserialization of unsubscribed tables in MySql CDC binary log client.#4302
chengcongchina wants to merge 2 commits intoapache:masterfrom
chengcongchina:FLINK-39200

Conversation

@chengcongchina
Copy link

@chengcongchina chengcongchina commented Mar 5, 2026

This closes FLINK-39200.

What is the purpose of the change

Currently, the MySQL row event deserializers in Flink CDC always fully deserialize WRITE/UPDATE/DELETE binlog events for all tables, regardless of whether the tables are actually subscribed by the connector. This leads to unnecessary CPU and memory overhead, especially when only a small portion of tables are monitored.

This PR implements early skipping of deserialization for unsubscribed tables in binary log client to improve performance. This optimization is applied by default. No table option is added to allow turning off this improvement, since there is no valid use case to deserialize events for unsubscribed tables.

Brief change log

  • Introduce TableIdFilter to support filtering tables during event deserialization.
  • Add optimized event data deserializers (WriteRowsEventDataDeserializer, UpdateRowsEventDataDeserializer, DeleteRowsEventDataDeserializer) that support skipping deserialization for excluded tables.
  • Update MySqlStreamingChangeEventSource to use the TableIdFilter based on the connector's table subscription configuration.

Verifying this change

This change is verified by existing tests:
- MySqlPipelineITCase
- BinlogSplitReaderTest

The second commit add unit tests.

Performance benchmarks can be added later to quantify the CPU/memory savings, which could be discussed in FLINK-37743

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): yes
  • Anything that affects deployment or recovery: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

… of unsubscribed tables in MySql CDC binary log client.
@ThorneANN
Copy link
Contributor

ThorneANN commented Mar 9, 2026

Thank u for your contributor and i left some comment.
Why not filter tableId instand of event actions?

@chengcongchina
Copy link
Author

@ThorneANN Thank you for your comment.
In the MySQL binlog protocol, apart from events like GTID events, there are mainly two types of events related to table names: DDL changes and row data changes (INSERT/UPDATE/DELETE).
Currently, DDL changes are captured for all tables, partly because downstream DDL capture for online DDL tools may require DDL parsing for shadow tables. Aside from that, the main deserialization workload in the binlog client lies in parsing row data changes. Therefore, adding the table ID filtering logic specifically within the deserialization of these row change events is sufficient to achieve the optimization.

@ThorneANN
Copy link
Contributor

@ThorneANN Thank you for your comment. In the MySQL binlog protocol, apart from events like GTID events, there are mainly two types of events related to table names: DDL changes and row data changes (INSERT/UPDATE/DELETE). Currently, DDL changes are captured for all tables, partly because downstream DDL capture for online DDL tools may require DDL parsing for shadow tables. Aside from that, the main deserialization workload in the binlog client lies in parsing row data changes. Therefore, adding the table ID filtering logic specifically within the deserialization of these row change events is sufficient to achieve the optimization.

Yes, the point that confuses me is that I noticed you added a lot of event serialization classes because I think you only need to filter tableids

@ThorneANN
Copy link
Contributor

May be we modify TableMapEventDataDeserializer and MySqlStreamingChangeEventSource 's classes to achieve this feature
?

@ThorneANN
Copy link
Contributor

 eventDeserializer.setEventDataDeserializer(
                EventType.TABLE_MAP,
                new com.github.shyiko.mysql.binlog.event.deserialization
                        .TableMapEventDataDeserializer(
                        connectorConfig.getTableFilters().dataCollectionFilter()::isIncluded));

As the code show , flink-cdc also skip the data without capture table ,but always deserializer dataEvent's mata

@ThorneANN
Copy link
Contributor

This pr i edit for last month ,u can cc .https://github.com/ThorneANN/flink-cdc/tree/serrizer-only-tablels

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants