[FLINK-32609] Support Projection Pushdown #174

Draft: wants to merge 1 commit into base: main
Conversation


@fqshopify commented May 9, 2025

Implements the SupportsProjectionPushDown interface for KafkaDynamicSource
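For context, Flink's SupportsProjectionPushDown hands the source an int[][] of index paths via applyProjection. The following is a minimal, self-contained sketch (illustrative only, not the connector's actual code) of how such index paths select top-level and nested fields from a row:

```java
import java.util.Arrays;

public class ProjectionDemo {
    // Apply projection index paths (shaped like the int[][] passed to
    // SupportsProjectionPushDown#applyProjection) to a row represented as
    // an Object[], where nested rows are themselves Object[] values.
    static Object[] project(Object[] row, int[][] projectedFields) {
        Object[] out = new Object[projectedFields.length];
        for (int i = 0; i < projectedFields.length; i++) {
            Object value = row;
            for (int idx : projectedFields[i]) {
                value = ((Object[]) value)[idx]; // descend one level per index
            }
            out[i] = value;
        }
        return out;
    }

    public static void main(String[] args) {
        // Row schema: (id, name, address ROW<city, zip>)
        Object[] row = {42L, "alice", new Object[] {"berlin", "10115"}};
        // SELECT name, address.city  ->  index paths [[1], [2, 0]]
        Object[] projected = project(row, new int[][] {{1}, {2, 0}});
        System.out.println(Arrays.toString(projected)); // prints [alice, berlin]
    }
}
```

Here the path `{2, 0}` stands for the nested field address.city; nested paths like this are what the "nested projections" item under next steps would add support for.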

Benefits

  1. Improved performance
    • Unneeded columns will be filtered out at an earlier stage of processing (specifically in the TableSourceScan node).
    • The amount of improvement in performance will vary depending on:
      • The number of columns selected
      • The number of columns in the source data
      • The DecodingFormat
      • etc.
  2. Improved resiliency
    • Currently SQL queries will fail if any field in the table undergoes a breaking schema change, even if the SQL query itself does not depend on that field.
    • After the changes in this PR, SQL queries will continue to work even if fields that they do not depend on experience breaking schema changes. See testBreakingSchemaChanges for an example.
    • This improvement is generally only applicable to a ProjectableDecodingFormat that can decode individual fields independently, e.g. json, avro-confluent, debezium-avro-confluent.
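The resiliency benefit above can be illustrated with a toy projection-aware decoder (a deliberately simplified stand-in for a ProjectableDecodingFormat, not Flink's actual JSON format): because only projected fields are parsed, a breaking change in an unused field never touches the query.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class ProjectedDecodeDemo {
    // Toy record format: "field=type:payload;field=type:payload;...".
    // A projection-aware decoder parses only the requested fields, so a
    // breaking schema change in an unprojected field is never observed.
    static Map<String, Object> decode(String record, Set<String> projected) {
        Map<String, Object> out = new HashMap<>();
        for (String part : record.split(";")) {
            String[] kv = part.split("=", 2);
            if (!projected.contains(kv[0])) continue; // skip unprojected fields
            String[] tv = kv[1].split(":", 2);
            out.put(kv[0], tv[0].equals("int") ? Integer.parseInt(tv[1]) : tv[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // "count" underwent a breaking change: its payload is no longer an int.
        String record = "name=str:alice;count=int:oops";
        // A query projecting only "name" still decodes fine.
        System.out.println(decode(record, Set.of("name"))); // prints {name=alice}
    }
}
```

Projecting "count" from the same record would throw a NumberFormatException, which is the failure mode that non-projectable formats hit for every query against the table.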

Limitations

  1. We cannot push projections all the way down into Kafka.
    • Projection pushdown is primarily used as an I/O optimization technique by pushing projections down all the way into the storage layer.
    • Unfortunately our storage layer, Kafka, does not support projection pushdown. As a result, we are not able to push projections down further than the deserialization step of our Flink pipelines.
    • We're still able to improve performance by eliminating unnecessary columns at an earlier stage of processing within our Flink pipeline, but the performance benefits are smaller than they would be with true storage-level pushdown.

Challenges

  1. AvroDecodingFormat is not actually projectable
    • The AvroDecodingFormat claims to be a ProjectableDecodingFormat but is not actually projectable AFAICT.
    • It appears that I’m not the first person to discover this issue: FLINK-35324
    • Possible remediations:
      • Fix the AvroDecodingFormat to actually be projectable
        • I think this is impossible based on how Avro works
      • Change AvroDecodingFormat so that it implements just DecodingFormat
        • I think this is the right solution but this is likely a breaking change
      • Add an optional configuration to disable pushing down projections into the decoder
        • This is what I've done in the prototype to unblock myself temporarily
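The third remediation could look something like the sketch below. The option name and method are invented for illustration (the prototype's actual option may differ): when pushdown into the decoder is disabled, the decoding format is asked to produce the full schema and the planner projects afterwards.

```java
import java.util.Arrays;

public class DecoderProjectionGuard {
    // Hypothetical option name, invented for this sketch.
    static final String OPTION = "scan.projection.push-into-decoder";

    // Decide which field list the decoding format should produce. When the
    // decoder (e.g. avro) cannot safely skip fields, hand it the full schema
    // and let the planner drop the unused columns after deserialization.
    static String[] producedFields(
            String[] fullSchema, String[] projected, boolean pushIntoDecoder) {
        return pushIntoDecoder ? projected : fullSchema;
    }

    public static void main(String[] args) {
        String[] full = {"id", "name", "ts"};
        String[] proj = {"name"};
        // Pushdown disabled: the decoder still sees the full schema.
        System.out.println(
                Arrays.toString(producedFields(full, proj, false))); // prints [id, name, ts]
    }
}
```

This keeps the TableSourceScan-level projection benefit in both cases; only the deserialization-level benefit is given up when the option is off.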

Next steps

  • Get feedback from the community
  • Raise a FLIP if necessary
  • Finish PR:
    • Align on a solution for AvroDecodingFormat issue
    • Add support for nested projections
    • Clean up code, more unit tests, etc.
  • Open up for formal PR reviews (early reviews/comments are still welcome though!)

Notes for reviewers

Please note that the code included in this PR is currently in the working-prototype stage and is mostly intended to facilitate discussion. I can definitely clean up the code and I'm open to changing things.

@fqshopify force-pushed the support_projection_pushdown branch 2 times, most recently from dbc1310 to 218aec0 on May 9, 2025 12:53
@fqshopify force-pushed the support_projection_pushdown branch from 218aec0 to 1192c9b on May 9, 2025 13:00