Feature Request:
I would like to request support for the Variant type in Apache Hudi, along with compatibility with the Spark 4.0.0 preview. The Variant type introduced in Spark 4.0 significantly improves performance on semi-structured data such as JSON, and is reported to be up to 8x faster than working with raw JSON strings. Supporting it could greatly improve Hudi’s efficiency when processing complex data formats.
Why is this needed:
Hudi’s support for large-scale data management would benefit greatly from the ability to handle semi-structured data types, such as those managed through the Variant Type.
As more organizations transition to Spark 4.0 (once it is officially released), maintaining compatibility will keep Hudi up to date with modern data processing pipelines.
Efficient handling of complex data formats like JSON and XML would make Hudi a more versatile solution for data lakes.
Use Case:
Processing and storing large datasets that contain a mix of structured and semi-structured data.
Leveraging the Variant type for faster querying and reduced overhead when dealing with nested and complex data structures.
Compatibility with Spark 4.0 will help early adopters of the latest Apache Spark features to continue using Hudi seamlessly in their pipelines.
References:
Spark 4.0.0 Preview with Variant Type Support
Delta Lake Variant Type Documentation
Additional Context: Currently, AWS services such as EMR and Glue do not fully support Spark 4.0. These platforms are expected to adopt it in the near future, so adding early support in Hudi would make the transition smoother for users.