GH-40592: [C++][Parquet] Implement SizeStatistics #40594

Open: wants to merge 6 commits into base: main

Conversation

wgtmac
Member

@wgtmac wgtmac commented Mar 16, 2024

Rationale for this change

Parquet format 2.10.0 introduced SizeStatistics. parquet-mr has also implemented this: apache/parquet-java#1177. Now it is time for parquet-cpp to pick up the ball.

What changes are included in this PR?

Implement reading and writing size statistics for parquet-cpp.

Are these changes tested?

Yes, a bunch of test cases have been added.

Are there any user-facing changes?

Yes, now parquet users are able to read and write size statistics.

@github-actions bot added the "awaiting changes" label and removed the "awaiting review" label on Mar 17, 2024
@github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on Mar 19, 2024
@wgtmac wgtmac marked this pull request as ready for review April 5, 2024 15:39
@github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label on Apr 10, 2024
Contributor

@emkornfield emkornfield left a comment

A few high level questions/suggestions.

@wgtmac wgtmac force-pushed the size_stats branch 2 times, most recently from 8661324 to 90caf32, on July 10, 2024 15:32
Member Author

wgtmac commented Jul 10, 2024

Finally this PR is complete on my side. Please take a look when you have time. Thanks! @emkornfield @pitrou @mapleFU

Member

@pitrou pitrou left a comment

Sorry for the delay @wgtmac . This is a first partial review, I'll go over the rest once these comments are answered or addressed :-)

/// \param size_statistics pointer to the thrift SizeStatistics structure.
/// \param descr column descriptor for the column.
/// \returns SizeStatistics object. Its lifetime is not bound to the input.
static std::unique_ptr<SizeStatistics> Make(const void* size_statistics,
Member

If you're using the pimpl idiom, then you should just return a SizeStatistics here, since all the implementation is already inside a std::unique_ptr.
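To illustrate the point about returning by value, here is a minimal sketch of a pimpl-style class whose factory returns the object itself rather than a std::unique_ptr to it. The class name PimplStats and its histogram accessor are hypothetical and are not part of parquet-cpp; in a real pimpl the Impl struct would be defined in the .cc file rather than inline.

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a pimpl-style statistics class; not the Arrow API.
class PimplStats {
 public:
  // The factory returns by value. The only data member is a unique_ptr,
  // so moving the object just transfers one pointer.
  static PimplStats Make(std::vector<int64_t> histogram) {
    return PimplStats(std::make_unique<Impl>(Impl{std::move(histogram)}));
  }

  PimplStats(PimplStats&&) noexcept = default;
  PimplStats& operator=(PimplStats&&) noexcept = default;
  ~PimplStats() = default;

  const std::vector<int64_t>& histogram() const { return impl_->histogram; }

 private:
  // Defined inline only to keep the sketch in one block.
  struct Impl {
    std::vector<int64_t> histogram;
  };

  explicit PimplStats(std::unique_ptr<Impl> impl) : impl_(std::move(impl)) {}

  std::unique_ptr<Impl> impl_;
};
```

Under these assumptions a caller writes `PimplStats stats = PimplStats::Make({1, 2, 3});`, and a heap-allocated copy is still available through `std::make_unique<PimplStats>(PimplStats::Make({7}))` because the class is movable.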

Member

Conversely, you could also remove the pimpl idiom and return a subclass here instead. This is better if you want to be able to pass an optionally null pointer, or store a shared_ptr at some point.

Member Author

I was following the pimpl idiom of class FileMetaData:

/// \brief Create a FileMetaData from a serialized thrift message.
static std::shared_ptr<FileMetaData> Make(
    const void* serialized_metadata, uint32_t* inout_metadata_len,
    const ReaderProperties& properties = default_reader_properties(),
    std::shared_ptr<InternalFileDecryptor> file_decryptor = NULLPTR);

Returning a SizeStatistics instead of std::unique_ptr<SizeStatistics> makes it impossible to store it in a smart pointer, which is contrary to the convention in this codebase.

Returning a subclass requires implementing virtual functions, which will be called frequently at every batch. This is something I want to avoid.

Member

Returning a subclass requires implementing virtual functions, which will be called frequently at every batch. This is something I want to avoid.

What do you mean, frequently?

Ideally, this would be a simple struct:

struct SizeStatistics {
  std::optional<int64_t> unencoded_byte_array_data_bytes;
  std::vector<int64_t> definition_level_histogram;
  std::vector<int64_t> repetition_level_histogram;
};
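As a rough illustration of how such a plain struct could be used, the sketch below fills a level histogram and folds page-level statistics into chunk-level ones. The struct fields are the ones proposed above; the helpers UpdateLevelHistogram and MergeInto are hypothetical names invented for this example, and the merge rule for the optional byte count is a simplifying assumption (a page without a value is simply ignored). It requires C++17 for std::optional.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Same fields as the struct suggested in the review comment.
struct SizeStatistics {
  std::optional<int64_t> unencoded_byte_array_data_bytes;
  std::vector<int64_t> definition_level_histogram;
  std::vector<int64_t> repetition_level_histogram;
};

// Hypothetical helper: count how often each level value occurs.
// The histogram gets at least max_level + 1 slots; slot i counts levels equal to i.
inline void UpdateLevelHistogram(const std::vector<int16_t>& levels, int16_t max_level,
                                 std::vector<int64_t>* histogram) {
  const size_t num_slots = static_cast<size_t>(max_level) + 1;
  if (histogram->size() < num_slots) histogram->resize(num_slots, 0);
  for (int16_t level : levels) {
    // Levels outside [0, max_level] are skipped in this sketch.
    if (level >= 0 && level <= max_level) ++(*histogram)[static_cast<size_t>(level)];
  }
}

// Hypothetical helper: fold page-level statistics into chunk-level ones.
inline void MergeInto(const SizeStatistics& page, SizeStatistics* chunk) {
  auto add = [](const std::vector<int64_t>& src, std::vector<int64_t>* dst) {
    if (dst->size() < src.size()) dst->resize(src.size(), 0);
    for (size_t i = 0; i < src.size(); ++i) (*dst)[i] += src[i];
  };
  add(page.definition_level_histogram, &chunk->definition_level_histogram);
  add(page.repetition_level_histogram, &chunk->repetition_level_histogram);
  // Assumption: a page with no byte count contributes nothing to the total.
  if (page.unencoded_byte_array_data_bytes.has_value()) {
    chunk->unencoded_byte_array_data_bytes =
        chunk->unencoded_byte_array_data_bytes.value_or(0) +
        *page.unencoded_byte_array_data_bytes;
  }
}
```

With a plain struct like this there are no virtual calls on the per-batch path, which is the concern raised earlier in the thread.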

Member Author

Sure, let me change this.

Comment on lines +136 to +144
void AddValuesSpaced(const ByteArray* values, const uint8_t* valid_bits,
                     int64_t valid_bits_offset, int64_t num_spaced_values);

/// \brief Add dense BYTE_ARRAY values.
/// \param values pointer to values of BYTE_ARRAY type.
/// \param num_values length of values.
void AddValues(const ByteArray* values, int64_t num_values);

/// \brief Add BYTE_ARRAY values in the arrow array.
void AddValues(const ::arrow::Array& values);
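For readers unfamiliar with the dense versus spaced distinction in these signatures, the following sketch shows one way the byte lengths could be summed in each case. ByteArrayView, BitIsSet, SumLengths and SumLengthsSpaced are hypothetical names for this example only, and the code assumes an LSB-first validity bitmap (the bit order Arrow bitmaps use) and that only the value bytes, not any length prefix, are counted.

```cpp
#include <cstdint>

// Hypothetical stand-in for a BYTE_ARRAY value: a length plus a data pointer.
struct ByteArrayView {
  uint32_t len;
  const uint8_t* ptr;
};

// Test bit i of an LSB-first validity bitmap.
inline bool BitIsSet(const uint8_t* bits, int64_t i) {
  return ((bits[i / 8] >> (i % 8)) & 1) != 0;
}

// Dense case: every slot holds a real value, so all lengths are summed.
inline int64_t SumLengths(const ByteArrayView* values, int64_t num_values) {
  int64_t total = 0;
  for (int64_t i = 0; i < num_values; ++i) total += values[i].len;
  return total;
}

// Spaced case: null slots are present in the array, so slot i is counted
// only if bit (valid_bits_offset + i) of the validity bitmap is set.
inline int64_t SumLengthsSpaced(const ByteArrayView* values, const uint8_t* valid_bits,
                                int64_t valid_bits_offset, int64_t num_spaced_values) {
  int64_t total = 0;
  for (int64_t i = 0; i < num_spaced_values; ++i) {
    if (BitIsSet(valid_bits, valid_bits_offset + i)) total += values[i].len;
  }
  return total;
}
```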
Member

Wouldn't it be more logical for the BYTE_ARRAY encoders to accumulate the unencoded_byte_array_data_bytes, instead of visiting the input data again here?

Member Author

There are two cases where BYTE_ARRAY encoders do not work:

  1. When dictionary encoding is enabled.
  2. When the input data is in an arrow::DictionaryArray.

Member

Are these two cases supposed to produce an unencoded_byte_array_data_bytes value? In any case, the approach used here seems wasteful.
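As a sketch of what a less wasteful computation might look like for dictionary-encoded input, the byte total can in principle be derived from the dictionary entry lengths and the indices alone, without materializing decoded values. This is only an illustration with standard containers; UnencodedBytesFromDictionary is a hypothetical helper, not an existing parquet-cpp or Arrow function, and it assumes the dictionary and indices are available as plain vectors.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, not part of Arrow or parquet-cpp: total the byte lengths
// of dictionary-encoded values without building the decoded strings.
inline int64_t UnencodedBytesFromDictionary(const std::vector<std::string>& dictionary,
                                            const std::vector<int32_t>& indices) {
  // Compute each dictionary entry's length once.
  std::vector<int64_t> lengths;
  lengths.reserve(dictionary.size());
  for (const std::string& entry : dictionary) {
    lengths.push_back(static_cast<int64_t>(entry.size()));
  }

  // Each index contributes the length of the entry it refers to.
  int64_t total = 0;
  for (int32_t index : indices) {
    // Indices outside the dictionary are skipped in this sketch.
    if (index >= 0 && static_cast<size_t>(index) < lengths.size()) {
      total += lengths[static_cast<size_t>(index)];
    }
  }
  return total;
}
```

The work here is proportional to the dictionary size plus the number of indices, and no value bytes are copied.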

@emkornfield
Contributor

Going to do another pass through; the CI failure looks like a formatting issue.

@github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label on Aug 6, 2024
Contributor

@emkornfield emkornfield left a comment

LGTM. I'm OK with this as long as @pitrou is. Thank you for driving this.

Member Author

wgtmac commented Aug 7, 2024

@emkornfield @mapleFU Thanks for the feedback! I haven't addressed all comments from @pitrou yet. Will let you know once ready for review again.

Member Author

wgtmac commented Nov 12, 2024

@pitrou @emkornfield @mapleFU Gentle ping :)

  page_statistics_->Update(*referenced_dictionary, /*update_counts=*/false);
}
if (page_size_stats_builder_) {
  page_size_stats_builder_->AddValues(*referenced_dictionary);
Member

So we are dictionary-decoding the entire array just to run basic statistics? This seems incredibly wasteful.

Member Author

I dislike the approach as much as you do. It seems that it has already been used for collecting page statistics for a long time. Do you think it is better to fix it in a separate PR or to do it in this one altogether?

Member

@pitrou pitrou left a comment

I think this is generally ill-designed and would deserve a rethink to avoid glaring inefficiencies. -1 from me on this PR.

(see the comments I posted above for more focussed complaints)
