Replies: 8 comments 15 replies
-
I'm curious: if StringView/BinaryView [1] and ListView in Arrow are approved, would it matter to just use the Arrow Parquet writer?
-
@pedroerp Slight correction: we used to use DuckDB for the Parquet reader.
-
+1 I like this proposal. TableWriter performance doesn't matter for interactive queries, but it uses a significant amount of CPU in larger batch / ETL queries. It would be great to build a high-performance table writer for Parquet. Aside from that, simplifying dependencies and speeding up build times is always welcome.
-
@pedroerp I agree with the first step of copying the Arrow Parquet writer into Velox. In fact, this was my original proposal when we started the Parquet reader as well. As for improvements, the encoders will play the crucial role, and I'm confident we can make them faster and less costly. We can help improve the Parquet writer in the future, hopefully next half.
-
@pedroerp Thank you for your excellent work on the native Parquet write feature. We are currently integrating this functionality into Gluten. However, we encountered an issue when using Snappy compression: it throws a 'Support for codec SNAPPY not built' error.
Do you have any suggestions or insights regarding this problem?
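That error usually means the Arrow build being linked was compiled without Snappy support, rather than a problem in the writer code itself. A minimal way to confirm this at runtime, assuming Arrow's public C++ headers are available (this sketch relies only on Arrow's codec API, not on Velox or Gluten code):

```cpp
#include <iostream>

#include <arrow/util/compression.h>

// Prints whether the linked Arrow build was compiled with Snappy support.
// If this prints "false", the 'Support for codec SNAPPY not built' error is
// expected, and the fix lies in the build configuration, not the query code.
int main() {
  std::cout << std::boolalpha
            << arrow::util::Codec::IsAvailable(arrow::Compression::SNAPPY)
            << std::endl;
  return 0;
}
```

If it prints `false`, rebuilding the bundled Arrow with Snappy enabled (for example, setting `ARROW_WITH_SNAPPY=ON` in its CMake configuration) should make the codec available.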
-
I will pick up the task of moving GZIP compression from Arrow, if it hasn't been picked up yet.
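For reference, the surface being migrated is Arrow's codec API. A minimal sketch of round-tripping a buffer through Arrow's GZIP codec (assuming an Arrow build with zlib support), which is roughly the Compress/Decompress contract a Velox-native codec would need to match:

```cpp
#include <cstdint>
#include <string>
#include <vector>

#include <arrow/result.h>
#include <arrow/util/compression.h>

// Round-trips a string through Arrow's GZIP codec.
std::string roundTripGzip(const std::string& input) {
  auto codec =
      arrow::util::Codec::Create(arrow::Compression::GZIP).ValueOrDie();

  const auto* in = reinterpret_cast<const uint8_t*>(input.data());
  const int64_t inLen = static_cast<int64_t>(input.size());

  // Compress into a buffer sized by the codec's worst-case estimate.
  std::vector<uint8_t> compressed(codec->MaxCompressedLen(inLen, in));
  const int64_t compressedLen =
      codec->Compress(inLen, in, compressed.size(), compressed.data())
          .ValueOrDie();

  // Decompress back; the caller must know (or store) the original size.
  std::vector<uint8_t> decompressed(input.size());
  codec->Decompress(
           compressedLen, compressed.data(), decompressed.size(),
           decompressed.data())
      .ValueOrDie();

  return std::string(
      reinterpret_cast<char*>(decompressed.data()), decompressed.size());
}
```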
-
The tasks for this migration are as follows:
-
I recently came across this thread and noticed that the discussion on compression might overlap with the topics covered in #7471. However, I have submitted several PRs (#7589, #7603) that primarily utilize Arrow's compression codecs and include an API used in the Velox dwio reader. @majetideepak and @karteekmurthys, could you please review them? Also, can we consider using these PRs as a means to unify the compression codecs within dwio and Arrow? cc: @FelixYBW
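To make the unification idea concrete, a hypothetical sketch of a shared codec abstraction (the names below are illustrative only and are not the API from those PRs): both dwio readers/writers and the Parquet writer would program against one interface, so each compression algorithm is implemented exactly once.

```cpp
#include <cstdint>
#include <memory>

namespace facebook::velox::common {

// Hypothetical: compression kinds shared by dwio and the Parquet writer.
enum class CompressionKind { NONE, ZLIB, SNAPPY, ZSTD, LZ4, GZIP };

// Hypothetical unified codec interface.
class Codec {
 public:
  virtual ~Codec() = default;

  // Upper bound on compressed size, used to size output buffers.
  virtual uint64_t maxCompressedLength(uint64_t inputLength) const = 0;

  // Returns the number of bytes written to 'output'.
  virtual uint64_t compress(
      const uint8_t* input,
      uint64_t inputLength,
      uint8_t* output,
      uint64_t outputLength) = 0;

  // Returns the number of bytes written to 'output'.
  virtual uint64_t decompress(
      const uint8_t* input,
      uint64_t inputLength,
      uint8_t* output,
      uint64_t outputLength) = 0;
};

// Factory both dwio and the Parquet writer would call.
std::unique_ptr<Codec> createCodec(CompressionKind kind);

} // namespace facebook::velox::common
```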
-
The first Velox implementations of the Parquet reader and writer were done by wrapping the ones provided by DuckDB and Arrow, respectively.
Since DuckDB's Parquet reader did not provide the right APIs to support push-downs and other advanced features we needed (considering how performance sensitive table scans are), we eventually replaced it with our own implementation, known in the codebase as the "native Parquet reader", which is faster than the one provided by DuckDB.
Because table writes are less performance sensitive, we continued using Arrow's Parquet writer. The proposal is to replace Arrow's Parquet writer in Velox with a "native Parquet writer" for the following reasons:
As a first step, the proposal is to take Arrow's Parquet writer from upstream as an initial implementation (along with tests, docs, etc.), and make the adaptations required to make it part of Velox.
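For context, the current wrapper-based approach drives Arrow's Parquet writer roughly as follows (a minimal sketch using Arrow's public C++ API; the Velox-side conversion from Velox vectors to Arrow data is omitted). Bringing this writer code into Velox would let it be adapted to operate on Velox vectors directly.

```cpp
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>

// Minimal sketch: writes an Arrow table to a Parquet file using the upstream
// arrow/parquet C++ API.
arrow::Status writeParquet(
    const std::shared_ptr<arrow::Table>& table,
    const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  return parquet::arrow::WriteTable(
      *table,
      arrow::default_memory_pool(),
      sink,
      /*chunk_size=*/1024);
}
```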
Thoughts? Cc: @mbasmanova @oerling @xiaoxmeng @majetideepak @Yuhta @yingsu00