GH-43598: [C++][Parquet] Parquet Metadata Printer supports print sort-columns #43599

Merged

Conversation

mapleFU
Member

@mapleFU mapleFU commented Aug 7, 2024

Rationale for this change

The Parquet spec, Python ( https://github.com/apache/arrow/pull/37665/files ), and C++ now all support "sort-columns", so the metadata printer can print them as well.

What changes are included in this PR?

Add "SortingColumns" support to the Parquet printer.

Are these changes tested?

Are there any user-facing changes?

No

@mapleFU mapleFU requested a review from wgtmac as a code owner August 7, 2024 07:33

github-actions bot commented Aug 7, 2024

⚠️ GitHub issue #43598 has been automatically assigned in GitHub to PR creator.

@mapleFU
Member Author

mapleFU commented Aug 7, 2024

After this patch:

{
  "Version": "2.6",
  "CreatedBy": "parquet-cpp-arrow version 16.1.0",
  "TotalRows": "3",
  "NumberOfRowGroups": "1",
  "NumberOfRealColumns": "2",
  "NumberOfColumns": "2",
  "Columns": [
     { "Id": "0", "Name": "a", "PhysicalType": "INT64", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
     { "Id": "1", "Name": "b", "PhysicalType": "BYTE_ARRAY", "ConvertedType": "UTF8", "LogicalType": {"Type": "String"} }
  ],
  "RowGroups": [
     {
       "Id": "0",  "TotalBytes": "166",  "TotalCompressedBytes": "174",  "SortColumns": [{"column_idx":0, "descending":1, "nulls_first": 1}, {"column_idx":1, "descending":0, "nulls_first": 0}],  "Rows": "3",
       "ColumnChunks": [
          {"Id": "0", "Values": "3", "StatsSet": "True", "Stats": {"NumNulls": "1", "Max": "2", "Min": "1" },
           "Compression": "SNAPPY", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "100", "CompressedSize": "104" },
          {"Id": "1", "Values": "3", "StatsSet": "True", "Stats": {"NumNulls": "0", "Max": "c", "Min": "a" },
           "Compression": "SNAPPY", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "66", "CompressedSize": "70" }
        ]
     }
  ]
}

@mapleFU mapleFU force-pushed the parquet-metadata-printer-add-sorting-columns branch from e116c3c to 826fdb7 Compare August 7, 2024 09:00
@wgtmac
Member

wgtmac commented Aug 7, 2024

Should we put each item in SortColumns on its own line? The output currently looks a bit long to me.

@mapleFU
Member Author

mapleFU commented Nov 8, 2024

Updated output:

{
  "FileName": "sort_columns.parquet",
  "Version": "2.6",
  "CreatedBy": "parquet-cpp-arrow version 16.1.0",
  "TotalRows": "6",
  "NumberOfRowGroups": "2",
  "NumberOfRealColumns": "2",
  "NumberOfColumns": "2",
  "Columns": [
     { "Id": "0", "Name": "a", "PhysicalType": "INT64", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} },
     { "Id": "1", "Name": "b", "PhysicalType": "BYTE_ARRAY", "ConvertedType": "UTF8", "LogicalType": {"Type": "String"} }
  ],
  "RowGroups": [
     {
       "Id": "0",  "TotalBytes": "166",  "TotalCompressedBytes": "174",  "SortColumns": [
         {"column_idx":0, "descending":1, "nulls_first": 1},
         {"column_idx":1, "descending":0, "nulls_first": 0}
       ],  "Rows": "3",
       "ColumnChunks": [
          {"Id": "0", "Values": "3", "StatsSet": "True", "Stats": {"NumNulls": "1", "Max": "2", "Min": "1" },
           "Compression": "SNAPPY", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "100", "CompressedSize": "104" },
          {"Id": "1", "Values": "3", "StatsSet": "True", "Stats": {"NumNulls": "0", "Max": "c", "Min": "a" },
           "Compression": "SNAPPY", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "66", "CompressedSize": "70" }
        ]
     },
     {
       "Id": "1",  "TotalBytes": "166",  "TotalCompressedBytes": "174",  "SortColumns": [
         {"column_idx":0, "descending":1, "nulls_first": 1},
         {"column_idx":1, "descending":0, "nulls_first": 0}
       ],  "Rows": "3",
       "ColumnChunks": [
          {"Id": "0", "Values": "3", "StatsSet": "True", "Stats": {"NumNulls": "1", "Max": "2", "Min": "1" },
           "Compression": "SNAPPY", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "100", "CompressedSize": "104" },
          {"Id": "1", "Values": "3", "StatsSet": "True", "Stats": {"NumNulls": "0", "Max": "c", "Min": "a" },
           "Compression": "SNAPPY", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "66", "CompressedSize": "70" }
        ]
     }
  ]
}
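
The one-entry-per-line layout of "SortColumns" shown above could be produced by a loop along these lines. This is a minimal sketch, not the code merged in this PR: the `SortingColumn` struct is a local stand-in that mirrors the three fields visible in the output, `PrintSortColumnsJson` and `SortColumnsToJson` are hypothetical helper names, and skipping the key when the list is empty is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Local stand-in (assumption) with the three fields seen in the JSON output.
struct SortingColumn {
  int32_t column_idx;
  bool descending;
  bool nulls_first;
};

// Hypothetical helper: writes one sort column per line, comma-separated.
// Booleans are streamed without std::boolalpha, so they appear as 1 or 0.
void PrintSortColumnsJson(std::ostream& stream,
                          const std::vector<SortingColumn>& columns) {
  if (columns.empty()) {
    return;  // Assumption: omit the key when no sort order is recorded.
  }
  stream << " \"SortColumns\": [\n";
  for (std::size_t i = 0; i < columns.size(); ++i) {
    const SortingColumn& column = columns[i];
    stream << "         {\"column_idx\":" << column.column_idx
           << ", \"descending\":" << column.descending
           << ", \"nulls_first\": " << column.nulls_first << "}";
    if (i + 1 != columns.size()) {
      stream << ",";
    }
    stream << "\n";
  }
  stream << "       ], ";
}

// Convenience wrapper (hypothetical) that returns the fragment as a string.
std::string SortColumnsToJson(const std::vector<SortingColumn>& columns) {
  std::ostringstream stream;
  PrintSortColumnsJson(stream, columns);
  return stream.str();
}
```

Writing the separator only between entries keeps the fragment valid JSON once it is embedded in the surrounding row-group object.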

@mapleFU
Member Author

mapleFU commented Nov 8, 2024

@wgtmac @pitrou would you mind taking a look?

Member

@wgtmac wgtmac left a comment

+1

Thanks @mapleFU!

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Nov 9, 2024
Member

@pitrou pitrou left a comment

LGTM, just 2 nits

@@ -142,6 +142,15 @@ void ParquetFilePrinter::DebugPrint(std::ostream& stream, std::list<int> selecte
stream << "--- Total Bytes: " << group_metadata->total_byte_size() << " ---\n";
stream << "--- Total Compressed Bytes: " << group_metadata->total_compressed_size()
<< " ---\n";
auto sorting_columns = group_metadata->sorting_columns();
Member

Nit

Suggested change
- auto sorting_columns = group_metadata->sorting_columns();
+ const auto& sorting_columns = group_metadata->sorting_columns();

Member Author

std::vector<SortingColumn> RowGroupMetaData::sorting_columns() const

It seems this returns a vector by value, so plain auto is OK here...
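
To illustrate why the choice between `auto` and `const auto&` makes little difference for a getter that returns by value, here is a small self-contained sketch. It does not use the Arrow classes: `CountedVector` and `FakeRowGroupMetaData` are hypothetical toy types that only count copy constructions. In both forms the single copy happens inside the getter; `auto` receives the returned object directly, while `const auto&` binds to the temporary and extends its lifetime to the end of the scope.

```cpp
#include <utility>
#include <vector>

// Toy type (assumption, not part of Arrow) that counts copy constructions.
struct CountedVector {
  std::vector<int> values;
  static int copies;

  explicit CountedVector(std::vector<int> v) : values(std::move(v)) {}
  CountedVector(const CountedVector& other) : values(other.values) { ++copies; }
  CountedVector(CountedVector&& other) noexcept
      : values(std::move(other.values)) {}
};
int CountedVector::copies = 0;

// Toy metadata class whose getter returns by value, like the signature
// quoted in the comment above.
class FakeRowGroupMetaData {
 public:
  CountedVector sorting_columns() const { return columns_; }  // one copy here

 private:
  CountedVector columns_ = CountedVector(std::vector<int>{0, 1});
};

// Number of copies made when the result is stored with plain `auto`.
int CopiesWithAuto() {
  FakeRowGroupMetaData metadata;
  CountedVector::copies = 0;
  auto columns = metadata.sorting_columns();
  (void)columns;
  return CountedVector::copies;
}

// Number of copies made when the result is bound to `const auto&`.
// The temporary lives until `columns` goes out of scope.
int CopiesWithConstRef() {
  FakeRowGroupMetaData metadata;
  CountedVector::copies = 0;
  const auto& columns = metadata.sorting_columns();
  (void)columns;
  return CountedVector::copies;
}
```

Both helpers report the same copy count, so the nit is a matter of style rather than performance when the getter returns by value.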

@@ -285,6 +294,21 @@ void ParquetFilePrinter::JSONPrint(std::ostream& stream, std::list<int> selected
stream << " \"TotalBytes\": \"" << group_metadata->total_byte_size() << "\", ";
stream << " \"TotalCompressedBytes\": \"" << group_metadata->total_compressed_size()
<< "\", ";
auto row_group_sorting_columns = group_metadata->sorting_columns();
Member

Similar nit

Suggested change
- auto row_group_sorting_columns = group_metadata->sorting_columns();
+ const auto& row_group_sorting_columns = group_metadata->sorting_columns();

Member Author

ditto

@mapleFU mapleFU merged commit 03c3f8e into apache:main Nov 9, 2024
40 checks passed
@mapleFU mapleFU removed the awaiting committer review Awaiting committer review label Nov 9, 2024

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 03c3f8e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 32 possible false positives for unstable benchmarks that are known to sometimes produce them.
