-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-40592: [C++][Parquet] Implement SizeStatistics #40594
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few high level questions/suggestions.
8661324
to
90caf32
Compare
Finally this PR is complete on my side. Please take a look when you have time. Thanks! @emkornfield @pitrou @mapleFU |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay @wgtmac . This is a first partial review, I'll go over the rest once these comments are answered or addressed :-)
/// \param size_statistics pointer to the thrift SizeStatistics structure. | ||
/// \param descr column descriptor for the column. | ||
/// \returns SizeStatistics object. Its lifetime is not bound to the input. | ||
static std::unique_ptr<SizeStatistics> Make(const void* size_statistics, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're using the pimpl idiom, then you should just return a SizeStatistics
here, since all the implementation is already inside a std::unique_ptr
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conversely, you could also remove the pimpl idiom and return a subclass here instead. This is better if you want to be able to pass an optionally null pointer, or store a shared_ptr at some pointer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was following the pimpl idiom of class FileMetaData
:
arrow/cpp/src/parquet/metadata.h
Lines 275 to 279 in 14384ac
/// \brief Create a FileMetaData from a serialized thrift message. | |
static std::shared_ptr<FileMetaData> Make( | |
const void* serialized_metadata, uint32_t* inout_metadata_len, | |
const ReaderProperties& properties = default_reader_properties(), | |
std::shared_ptr<InternalFileDecryptor> file_decryptor = NULLPTR); |
Returning a SizeStatistics
instead of std::unique_ptr<SizeStatistics>
make it impossible to store it in a smart pointer, which is on the contrary of the convention in this codebase.
Returning a subclass requires implementing virtual functions, which will be called frequently at every batch. This is something I want to avoid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning a subclass requires implementing virtual functions, which will be called frequently at every batch. This is something I want to avoid.
What do you mean, frequently?
(ideally this would be a simple struct:
struct SizeStatistics {
std::optional<int64_t> unencoded_byte_array_data_bytes;
std::vector<int64_t> definition_level_histogram;
std::vector<int64_t> repetition_level_histogram;
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let me change this.
void AddValuesSpaced(const ByteArray* values, const uint8_t* valid_bits, | ||
int64_t valid_bits_offset, int64_t num_spaced_values); | ||
|
||
/// \brief Add dense BYTE_ARRAY values. | ||
/// \param values pointer to values of BYTE_ARRAY type. | ||
/// \param num_values length of values. | ||
void AddValues(const ByteArray* values, int64_t num_values); | ||
|
||
/// \brief Add BYTE_ARRAY values in the arrow array. | ||
void AddValues(const ::arrow::Array& values); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be more logical for the BYTE_ARRAY encoders to accumulate the unencoded_byte_array_data_bytes
, instead of visiting the input data again here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two cases where BYTE_ARRAY encoders do not work:
- When dictionary encoding is enabled.
- When the input data is in a
arrow::DictionaryArray
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these two cases supposed to produce a unencoded_byte_array_data_bytes
? In any case, the approach used here seems wasteful.
going to do another pass through, CI failure looks like a formatting issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I'm OK with this as long as @pitrou is thank you for driving this.
@emkornfield @mapleFU Thanks for the feedback! I haven't addressed all comments from @pitrou yet. Will let you know once ready for review again. |
0449426
to
a83ed41
Compare
@pitrou @emkornfield @mapleFU Gentle ping :) |
page_statistics_->Update(*referenced_dictionary, /*update_counts=*/false); | ||
} | ||
if (page_size_stats_builder_) { | ||
page_size_stats_builder_->AddValues(*referenced_dictionary); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we are dictionary-decoding the entire array just to run basic statistics? This seems incredibly wasteful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dislike the approach as you did. It seems that it has been used for collecting page statistics already for a long time. Do you think it is better to fix it in a separate PR or just do it in this one altogether?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is generally ill-designed and would deserve a rethink to avoid glaring inefficiencies. -1 from me on this PR.
(see the comments I posted above for more focussed complaints)
Rationale for this change
Parquet format 2.10.0 has introduced SizeStatistics. parquet-mr has also implemented this: apache/parquet-java#1177. Now it is time for parquet-cpp to pick the ball.
What changes are included in this PR?
Implement reading and writing size statistics for parquet-cpp.
Are these changes tested?
Yes, a bunch of test cases have been added.
Are there any user-facing changes?
Yes, now parquet users are able to read and write size statistics.