[SUPPORT] Add Support for Variant Type and Spark 4.0.0 Preview #12022

soumilshah1995 · 2024-09-29T12:35:22Z

Feature Request:

I would like to request support for the Variant Type in Apache Hudi, along with compatibility with the Spark 4.0.0 preview. The Variant Type introduced in Spark 4.0 is designed to speed up the handling of semi-structured data such as JSON, and is reported to be up to 8x faster than working with raw JSON strings. Supporting it could make Hudi noticeably more efficient when processing complex data formats.
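The performance argument rests on parsing the JSON once at write time instead of on every read. Here is a minimal plain-Python sketch of that idea. It is not Spark's actual Variant encoding, and the sample rows and function names are hypothetical, made up purely for illustration:

```python
import json

# Hypothetical sample rows: each record carries its payload as a raw JSON string.
raw_rows = [
    '{"user": {"id": 1, "country": "US"}, "event": "click"}',
    '{"user": {"id": 2, "country": "IN"}, "event": "view"}',
    '{"user": {"id": 3, "country": "US"}, "event": "click"}',
]


def countries_from_strings(rows):
    """Raw-string approach: every query re-parses every JSON document."""
    return [json.loads(row)["user"]["country"] for row in rows]


# Parse-once approach: decode at write time and keep the structured form.
parsed_rows = [json.loads(row) for row in raw_rows]


def countries_from_parsed(rows):
    """Pre-parsed approach: field access is a plain lookup with no parsing."""
    return [row["user"]["country"] for row in rows]


string_result = countries_from_strings(raw_rows)
parsed_result = countries_from_parsed(parsed_rows)
```

Both functions return the same values; the difference is that the second one pays the parsing cost once rather than per query, which is the kind of saving a native semi-structured type is meant to provide.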

Why is this needed:

Hudi’s support for large-scale data management would benefit greatly from the ability to handle semi-structured data types, such as those managed through the Variant Type.
As more organizations transition to Spark 4.0 (once officially released), maintaining compatibility will ensure that Hudi remains up-to-date with modern data processing pipelines.
Efficient handling of complex data formats like JSON and XML would make Hudi a more versatile solution for data lakes.

Use Case:

Processing and storing large datasets that contain a mix of structured and semi-structured data.
Leveraging Variant Type for faster querying and reduced overhead when dealing with nested and complex data structures.
Compatibility with Spark 4.0 will help early adopters of the latest Apache Spark features to continue using Hudi seamlessly in their pipelines.

References:

Spark 4.0.0 Preview with Variant Type Support
Delta Lake Variant Type Documentation

Additional Context:

AWS services such as EMR and Glue do not yet fully support Spark 4.0. These platforms are expected to adopt it in the near future, so adding early support in Hudi would make the transition smoother for users.

ad1happy2go added the feature-enquiry label (issue contains feature enquiries/requests or great improvement ideas) Oct 1, 2024

ad1happy2go added this to Hudi Issue Support Oct 1, 2024

github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support Oct 1, 2024
