Replies: 8 comments 15 replies
-
I'm curious: if StringView/BinaryView [1] and ListView in Arrow are approved, would it matter to just use the Arrow Parquet writer?
-
@pedroerp Slight correction: we used to use DuckDB for the Parquet reader.
-
+1 I like this proposal. TableWriter performance doesn't matter for interactive queries, but it uses a significant amount of CPU in larger batch / ETL queries. It would be great to build a high-performance table writer for Parquet. Aside from that, simplifying dependencies and speeding up build times is always welcome.
-
@pedroerp I agree with the first step of copying the Arrow Parquet writer into Velox. In fact, this was my original proposal when we started the Parquet reader as well. As for improvements, the encoders will play the crucial role, and I'm confident we can make them faster and less costly. We can help improve the Parquet writer in the future, hopefully next half.
-
@pedroerp Thank you for your excellent work on the native Parquet write feature. We are currently integrating this functionality into Gluten. However, we encountered an issue when using Snappy compression: it throws a 'Support for codec SNAPPY not built' error.
Do you have any suggestions or insights regarding this problem?
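That error usually means the Arrow build being linked was compiled without Snappy support, rather than a problem in the writer code itself. A minimal way to confirm this at runtime, assuming Arrow's public C++ headers are available (this sketch relies only on Arrow's codec API, not on Velox or Gluten code):

```cpp
#include <iostream>

#include <arrow/util/compression.h>

// Prints whether the linked Arrow build was compiled with Snappy support.
// If this prints "false", the 'Support for codec SNAPPY not built' error is
// expected, and the fix lies in the build configuration, not the query code.
int main() {
  std::cout << std::boolalpha
            << arrow::util::Codec::IsAvailable(arrow::Compression::SNAPPY)
            << std::endl;
  return 0;
}
```

If it prints `false`, rebuilding the bundled Arrow with Snappy enabled (for example, setting `ARROW_WITH_SNAPPY=ON` in its CMake configuration) should make the codec available.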
-
I will pick up the task of moving GZIP compression from Arrow, if it hasn't been picked up yet.
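For reference, the surface being migrated is Arrow's codec API. A minimal sketch of round-tripping a buffer through Arrow's GZIP codec (assuming an Arrow build with zlib support), which is roughly the Compress/Decompress contract a Velox-native codec would need to match:

```cpp
#include <cstdint>
#include <string>
#include <vector>

#include <arrow/result.h>
#include <arrow/util/compression.h>

// Round-trips a string through Arrow's GZIP codec.
std::string roundTripGzip(const std::string& input) {
  auto codec =
      arrow::util::Codec::Create(arrow::Compression::GZIP).ValueOrDie();

  const auto* in = reinterpret_cast<const uint8_t*>(input.data());
  const int64_t inLen = static_cast<int64_t>(input.size());

  // Compress into a buffer sized by the codec's worst-case estimate.
  std::vector<uint8_t> compressed(codec->MaxCompressedLen(inLen, in));
  const int64_t compressedLen =
      codec->Compress(inLen, in, compressed.size(), compressed.data())
          .ValueOrDie();

  // Decompress back; the caller must know (or store) the original size.
  std::vector<uint8_t> decompressed(input.size());
  codec->Decompress(
           compressedLen, compressed.data(), decompressed.size(),
           decompressed.data())
      .ValueOrDie();

  return std::string(
      reinterpret_cast<char*>(decompressed.data()), decompressed.size());
}
```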
-
The tasks for this migration are as follows:
-
I recently came across this thread and noticed that the discussion on compression might overlap with the topics covered in #7471. However, I have submitted several PRs (#7589, #7603) that primarily utilize Arrow's compression codecs and include an API used in the Velox dwio reader. @majetideepak and @karteekmurthys, could you please review them? Also, can we consider using these PRs as a means to unify the compression codecs within dwio and Arrow? cc: @FelixYBW
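To make the unification idea concrete, a hypothetical sketch of a shared codec abstraction (the names below are illustrative only and are not the API from those PRs): both dwio readers/writers and the Parquet writer would program against one interface, so each compression algorithm is implemented exactly once.

```cpp
#include <cstdint>
#include <memory>

namespace facebook::velox::common {

// Hypothetical: compression kinds shared by dwio and the Parquet writer.
enum class CompressionKind { NONE, ZLIB, SNAPPY, ZSTD, LZ4, GZIP };

// Hypothetical unified codec interface.
class Codec {
 public:
  virtual ~Codec() = default;

  // Upper bound on compressed size, used to size output buffers.
  virtual uint64_t maxCompressedLength(uint64_t inputLength) const = 0;

  // Returns the number of bytes written to 'output'.
  virtual uint64_t compress(
      const uint8_t* input,
      uint64_t inputLength,
      uint8_t* output,
      uint64_t outputLength) = 0;

  // Returns the number of bytes written to 'output'.
  virtual uint64_t decompress(
      const uint8_t* input,
      uint64_t inputLength,
      uint8_t* output,
      uint64_t outputLength) = 0;
};

// Factory both dwio and the Parquet writer would call.
std::unique_ptr<Codec> createCodec(CompressionKind kind);

} // namespace facebook::velox::common
```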
-
The first Velox implementations of the Parquet reader and writer were done by wrapping the ones provided by DuckDB and Arrow, respectively.
Since DuckDB's Parquet reader did not provide the right APIs to support push-downs and other advanced features we needed (considering how performance sensitive table scans are), we eventually replaced it with our own implementation, known in the codebase as the "native Parquet reader", which is faster than the one provided by DuckDB.
Because table writes are less performance sensitive, we continued using Arrow's Parquet writer. The proposal is to replace Arrow's Parquet writer in Velox with a "native Parquet writer" for the following reasons:
As a first step, the proposal is to take Arrow's Parquet writer from upstream as an initial implementation (along with tests, docs, etc.), and make the adaptations required to make it part of Velox.
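For context, the current wrapper-based approach drives Arrow's Parquet writer roughly as follows (a minimal sketch using Arrow's public C++ API; the Velox-side conversion from Velox vectors to Arrow data is omitted). Bringing this writer code into Velox would let it be adapted to operate on Velox vectors directly.

```cpp
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>

// Minimal sketch: writes an Arrow table to a Parquet file using the upstream
// arrow/parquet C++ API.
arrow::Status writeParquet(
    const std::shared_ptr<arrow::Table>& table,
    const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  return parquet::arrow::WriteTable(
      *table,
      arrow::default_memory_pool(),
      sink,
      /*chunk_size=*/1024);
}
```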
Thoughts? Cc: @mbasmanova @oerling @xiaoxmeng @majetideepak @Yuhta @yingsu00