GH-43598: [C++][Parquet] Parquet Metadata Printer supports print sort-columns #43599
After this patch: {
"Version": "2.6",
"CreatedBy": "parquet-cpp-arrow version 16.1.0",
"TotalRows": "3",
"NumberOfRowGroups": "1",
"NumberOfRealColumns": "2",
"NumberOfColumns": "2",
"Columns": [
{ "Id": "0", "Name": "a", "PhysicalType": "INT64", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
{ "Id": "1", "Name": "b", "PhysicalType": "BYTE_ARRAY", "ConvertedType": "UTF8", "LogicalType": {"Type": "String"} }
],
"RowGroups": [
{
"Id": "0", "TotalBytes": "166", "TotalCompressedBytes": "174", "SortColumns": [{"column_idx":0, "descending":1, "nulls_first": 1}, {"column_idx":1, "descending":0, "nulls_first": 0}], "Rows": "3",
"ColumnChunks": [
{"Id": "0", "Values": "3", "StatsSet": "True", "Stats": {"NumNulls": "1", "Max": "2", "Min": "1" },
"Compression": "SNAPPY", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "100", "CompressedSize": "104" },
{"Id": "1", "Values": "3", "StatsSet": "True", "Stats": {"NumNulls": "0", "Max": "c", "Min": "a" },
"Compression": "SNAPPY", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "66", "CompressedSize": "70" }
]
}
]
}
Should we break a new line for each item in |
Updated output: {
"FileName": "sort_columns.parquet",
"Version": "2.6",
"CreatedBy": "parquet-cpp-arrow version 16.1.0",
"TotalRows": "6",
"NumberOfRowGroups": "2",
"NumberOfRealColumns": "2",
"NumberOfColumns": "2",
"Columns": [
{ "Id": "0", "Name": "a", "PhysicalType": "INT64", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
{ "Id": "1", "Name": "b", "PhysicalType": "BYTE_ARRAY", "ConvertedType": "UTF8", "LogicalType": {"Type": "String"} }
],
"RowGroups": [
{
"Id": "0", "TotalBytes": "166", "TotalCompressedBytes": "174", "SortColumns": [
{"column_idx":0, "descending":1, "nulls_first": 1},
{"column_idx":1, "descending":0, "nulls_first": 0}
], "Rows": "3",
"ColumnChunks": [
{"Id": "0", "Values": "3", "StatsSet": "True", "Stats": {"NumNulls": "1", "Max": "2", "Min": "1" },
"Compression": "SNAPPY", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "100", "CompressedSize": "104" },
{"Id": "1", "Values": "3", "StatsSet": "True", "Stats": {"NumNulls": "0", "Max": "c", "Min": "a" },
"Compression": "SNAPPY", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "66", "CompressedSize": "70" }
]
},
{
"Id": "1", "TotalBytes": "166", "TotalCompressedBytes": "174", "SortColumns": [
{"column_idx":0, "descending":1, "nulls_first": 1},
{"column_idx":1, "descending":0, "nulls_first": 0}
], "Rows": "3",
"ColumnChunks": [
{"Id": "0", "Values": "3", "StatsSet": "True", "Stats": {"NumNulls": "1", "Max": "2", "Min": "1" },
"Compression": "SNAPPY", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "100", "CompressedSize": "104" },
{"Id": "1", "Values": "3", "StatsSet": "True", "Stats": {"NumNulls": "0", "Max": "c", "Min": "a" },
"Compression": "SNAPPY", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "66", "CompressedSize": "70" }
]
}
]
}
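The one-entry-per-line `SortColumns` layout in the updated output can be sketched as follows. This is a minimal, self-contained illustration and not the code from this PR: the `SortingColumn` struct here is a hypothetical stand-in that mirrors the three fields visible in the output, and `FormatSortingColumns` is an illustrative helper name.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in with the three fields shown in the JSON output,
// so that the sketch compiles without the Parquet headers.
struct SortingColumn {
  int32_t column_idx;
  bool descending;
  bool nulls_first;
};

// Illustrative helper: writes each sorting column on its own line.
// Booleans are streamed without std::boolalpha, so they print as 1 or 0.
std::string FormatSortingColumns(const std::vector<SortingColumn>& columns) {
  std::ostringstream out;
  out << "\"SortColumns\": [\n";
  for (std::size_t i = 0; i < columns.size(); ++i) {
    const SortingColumn& c = columns[i];
    out << "  {\"column_idx\":" << c.column_idx
        << ", \"descending\":" << c.descending
        << ", \"nulls_first\": " << c.nulls_first << "}";
    // Separate entries with a comma, but leave the last one bare.
    if (i + 1 != columns.size()) out << ",";
    out << "\n";
  }
  out << "]";
  return out.str();
}
```

Calling `FormatSortingColumns({{0, true, true}, {1, false, false}})` yields two entries, one per line, in the same shape as the row-group output above.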
+1
Thanks @mapleFU!
LGTM, just 2 nits
@@ -142,6 +142,15 @@ void ParquetFilePrinter::DebugPrint(std::ostream& stream, std::list<int> selecte
stream << "--- Total Bytes: " << group_metadata->total_byte_size() << " ---\n";
stream << "--- Total Compressed Bytes: " << group_metadata->total_compressed_size()
       << " ---\n";
auto sorting_columns = group_metadata->sorting_columns();
Nit
- auto sorting_columns = group_metadata->sorting_columns();
+ const auto& sorting_columns = group_metadata->sorting_columns();
std::vector<SortingColumn> RowGroupMetaData::sorting_columns() const
It seems this returns a vector by value, so plain auto is fine here.
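To illustrate the point about by-value returns, here is a small sketch under stated assumptions: `Tracked` and `Holder` are hypothetical types invented for this example, with `Holder::get()` standing in for an accessor that returns a copy of a member, as `sorting_columns()` is described as doing. It counts copy constructions for both spellings.

```cpp
// Hypothetical type that counts how often it is copy-constructed.
struct Tracked {
  static int copies;
  Tracked() = default;
  Tracked(const Tracked&) { ++copies; }
  Tracked(Tracked&&) noexcept {}
};
int Tracked::copies = 0;

// Hypothetical holder whose accessor returns by value.
struct Holder {
  Tracked get() const { return value_; }  // copies the member into the result
  Tracked value_;
};

// Binds the returned temporary to a new object via plain auto.
int CopiesWithAuto(const Holder& h) {
  Tracked::copies = 0;
  auto v = h.get();
  (void)v;
  return Tracked::copies;
}

// Binds the returned temporary to a const reference, which extends its lifetime.
int CopiesWithConstRef(const Holder& h) {
  Tracked::copies = 0;
  const auto& v = h.get();
  (void)v;
  return Tracked::copies;
}
```

Both functions report the same number of copies, because the only copy happens inside `get()`. With a by-value accessor, `const auto&` neither saves a copy nor dangles; it only matters when the accessor returns a reference.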
@@ -285,6 +294,21 @@ void ParquetFilePrinter::JSONPrint(std::ostream& stream, std::list<int> selected
stream << " \"TotalBytes\": \"" << group_metadata->total_byte_size() << "\", ";
stream << " \"TotalCompressedBytes\": \"" << group_metadata->total_compressed_size()
       << "\", ";
auto row_group_sorting_columns = group_metadata->sorting_columns();
Similar nit
- auto row_group_sorting_columns = group_metadata->sorting_columns();
+ const auto& row_group_sorting_columns = group_metadata->sorting_columns();
ditto
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 03c3f8e. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 32 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
The Parquet spec, Python ( https://github.com/apache/arrow/pull/37665/files ) and C++ now all support "sort-columns", so the metadata printer can print them as well.
What changes are included in this PR?
Add "SortingColumns" support to the Parquet metadata printer.
Are these changes tested?
Are there any user-facing changes?
No