
Conversation

torgebo commented Oct 18, 2025

This is done by iterating over the file set.
We check that the schemas agree before concatenating.
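A minimal sketch of the "iterate and check schemas" idea (illustrative only; the function name is hypothetical, and the actual row-group copying into the output writer is elided):

```rust
use std::fs::File;

use parquet::errors::{ParquetError, Result};
use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::schema::types::TypePtr;

/// Hypothetical helper: visit the inputs one at a time, verifying that every
/// file carries the same schema as the first one.
fn concat_streaming(paths: &[String]) -> Result<()> {
    let mut expected: Option<TypePtr> = None;
    for path in paths {
        // Open only the current input; the handle is dropped at the end of
        // the iteration, so at most one input descriptor is held at a time.
        let file = File::open(path).map_err(|e| ParquetError::General(e.to_string()))?;
        let reader = SerializedFileReader::new(file)?;
        let schema = reader
            .metadata()
            .file_metadata()
            .schema_descr()
            .root_schema_ptr();
        match &expected {
            None => expected = Some(schema),
            Some(prev) if prev.as_ref() == schema.as_ref() => {}
            Some(_) => {
                return Err(ParquetError::General(format!(
                    "schema of {path} does not match the first input"
                )));
            }
        }
        // ...append this file's row groups to the output writer here...
    }
    Ok(())
}
```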

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements, as this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Are these changes tested?

Tested manually by checking file outputs.

Are there any user-facing changes?

No changes to CLI.

github-actions bot added the parquet (Changes to the parquet crate) label on Oct 18, 2025
tustvold (Contributor) commented

How large are the files being concatenated? I ask because parquet-concat copies row groups as-is, whereas ideally row groups should be on the order of 1MB per column. I worry this might just move the problem and generate very degenerate parquet files...

torgebo (Author) commented Oct 19, 2025

Hi, good point.

It is true that the proposed change does not affect the row group size distribution of the output. So if you pass a "degenerate" dataset to parquet-concat, its output file will exhibit those same degeneracies.

It might be that with greater power comes greater responsibility, but I don't see that as a strong argument against making our tools more powerful. Indeed, if there were one true way of doing compute, you would likely not need a tool like parquet-concat at all.

The suggested change brings the behaviour of parquet-concat closer to that of the traditional Unix cat tool by handling the files as a "stream". The Linux open-file limit (ulimit -n) is 1024 or even lower on many systems. Many compute professionals (e.g. at universities) work on shared, time-sharing systems where they do not control the system settings (or they may need to reserve the file descriptors for other uses). It seems reasonable to let them concatenate their parquet files even so.

tustvold (Contributor) commented Oct 19, 2025

Let's say you concatenate 1000 files, each with 10 columns. Since row groups are copied as-is, the output contains at least one row group per input file. For the output not to be degenerate (i.e. for no column chunk to be smaller than 1MB), each input would need at least 1MB per column, which works out to an output parquet file of at least 1000 × 10 × 1MB = 10GB.

Or to phrase it more explicitly, this tool is not meant for concatenating large numbers of small files into larger files. You would need to use a different tool that also rewrites the actual row group contents.

As an aside, your comment reads a lot like something written by ChatGPT; it's generally courteous to disclose such usage.


Labels

parquet (Changes to the parquet crate)


Development

Successfully merging this pull request may close these issues.

parquet-concat - handle large number of files
