
Conversation

torgebo commented Oct 18, 2025

This is done by iterating over the file set.
We check that the schemas agree before concatenating.
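A minimal sketch of the "iterate and check schemas" idea (illustrative only; the function name is hypothetical, and the actual row-group copying into the output writer is elided):

```rust
use std::fs::File;

use parquet::errors::{ParquetError, Result};
use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::schema::types::TypePtr;

/// Hypothetical helper: visit the inputs one at a time, verifying that every
/// file carries the same schema as the first one.
fn concat_streaming(paths: &[String]) -> Result<()> {
    let mut expected: Option<TypePtr> = None;
    for path in paths {
        // Open only the current input; the handle is dropped at the end of
        // the iteration, so at most one input descriptor is held at a time.
        let file = File::open(path).map_err(|e| ParquetError::General(e.to_string()))?;
        let reader = SerializedFileReader::new(file)?;
        let schema = reader
            .metadata()
            .file_metadata()
            .schema_descr()
            .root_schema_ptr();
        match &expected {
            None => expected = Some(schema),
            Some(prev) if prev.as_ref() == schema.as_ref() => {}
            Some(_) => {
                return Err(ParquetError::General(format!(
                    "schema of {path} does not match the first input"
                )));
            }
        }
        // ...append this file's row groups to the output writer here...
    }
    Ok(())
}
```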

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements, as this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Are these changes tested?

Tested manually by checking file outputs.

Are there any user-facing changes?

No changes to CLI.

github-actions bot added the parquet (Changes to the parquet crate) label on Oct 18, 2025
tustvold (Contributor) commented

How large are the files being concatenated? I ask because parquet-concat copies row groups as-is, whereas ideally row groups should be on the order of 1MB per column. I worry this might just move the problem and generate very degenerate parquet files...

torgebo (Author) commented Oct 19, 2025

Hi, good point.

It is true that the proposed change does not affect the row group size distribution of the output. So if you pass a "degenerate" dataset to parquet-concat, its output file will exhibit those same degeneracies.

It might be that with greater power comes greater responsibility, but I don't see that as a strong argument against making our tools more powerful. Indeed, if there were one true way of doing compute, you would likely not need a tool like parquet-concat at all.

The suggested change brings the behaviour of parquet-concat closer to that of the traditional Unix cat tool by handling the files as a "stream". The Linux open-file limit (ulimit -n) is 1024 or even lower on many systems. Many compute professionals (e.g. at universities) work on shared, time-sharing systems where they do not control the system settings (or they may need to reserve the file descriptors for other uses). It seems reasonable to let them concatenate their parquet files even so.

tustvold (Contributor) commented Oct 19, 2025

Let's say you concatenate 1000 files, each with 10 columns. Since row groups are copied as-is, the output contains at least one row group per input file. For the output not to be degenerate (i.e. for no column chunk to be smaller than 1MB), each input would need at least 1MB per column, which works out to an output parquet file of at least 1000 × 10 × 1MB = 10GB.

Or to phrase it more explicitly, this tool is not meant for concatenating large numbers of small files into larger files. You would need to use a different tool that also rewrites the actual row group contents.

As an aside, your comment reads a lot like something written by ChatGPT; it's generally courteous to disclose such usage.


Labels

parquet (Changes to the parquet crate)


Development

Successfully merging this pull request may close these issues.

parquet-concat - handle large number of files
