Conversation

wence-
Contributor

@wence- wence- commented Oct 14, 2025

Description

Although reading a single table of more than size_type::max() rows is not possible, we may wish to specify options to read greater than size_type::max() rows from a large parquet file and require that the consumer of the options splits them up appropriately.

This is useful in chunked/streaming pipelines because we do not have to recreate the read_parquet options API and can instead get the caller to provide an options object that we pick apart.

Internally the reader implementation already uses int64 for num_rows, so this is really just a (breaking) interface change.
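
As an illustration, here is a minimal sketch (not part of this PR) of how a consumer might pick such an options object apart into size_type-sized reads. It assumes get_num_rows() returns an optional int64 count after this change and that the caller actually set num_rows; the helper name read_in_slices is hypothetical, and a real pipeline would more likely use the chunked reader than repeated read_parquet calls.

#include <cudf/io/parquet.hpp>
#include <cudf/types.hpp>

#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical helper: split a request for more than size_type::max() rows
// into slices that each fit in a single cudf table.
std::vector<cudf::io::table_with_metadata> read_in_slices(cudf::io::parquet_reader_options options)
{
  auto constexpr max_rows = static_cast<int64_t>(std::numeric_limits<cudf::size_type>::max());
  // Assumes num_rows was set by the caller; the nullopt ("read everything") case is omitted.
  int64_t remaining = options.get_num_rows().value();
  int64_t offset    = options.get_skip_rows();

  std::vector<cudf::io::table_with_metadata> slices;
  while (remaining > 0) {
    auto const rows_this_slice = std::min(remaining, max_rows);
    options.set_skip_rows(offset);
    options.set_num_rows(rows_this_slice);
    slices.push_back(cudf::io::read_parquet(options));
    offset += rows_this_slice;
    remaining -= rows_this_slice;
  }
  return slices;
}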

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested review from a team as code owners October 14, 2025 16:55
@wence- wence- added improvement Improvement / enhancement to an existing function breaking Breaking change labels Oct 14, 2025
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels Oct 14, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Oct 14, 2025
Contributor

@bdice bdice left a comment

Do we need docs somewhere that explain the limitation (when reading, the row count must fit into size_type)? Maybe adding a test for that error would be good. Otherwise LGTM.

@mhaseeb123
Member

mhaseeb123 commented Oct 14, 2025

We need to add the following check here to make sure we don't try to read more than 2B rows with the read_parquet API:

table_with_metadata reader_impl::read()
{
  CUDF_EXPECTS(_output_chunk_read_limit == 0,
               "Reading the whole file must not have non-zero byte_limit.");
  prepare_data(read_mode::READ_ALL);
  return read_chunk_internal(read_mode::READ_ALL);
}

CUDF_EXPECTS(std::cmp_less_equal(options.get_num_rows(), std::numeric_limits<cudf::size_type>::max()),
             "Reading the whole file at once must not read more rows than cudf's column size limits");

Member

@mhaseeb123 mhaseeb123 left a comment

We need to add a check making sure we don't try to read more than 2B rows with the read_parquet() API.

Maybe adding a test for that error would be good.

I think adding a test would also be a good idea.

Although reading a single table of more than size_type::max() rows is not
possible, we may wish to specify options to read greater than
size_type::max() rows from a large parquet file and require that the
consumer of the options splits them up appropriately.

This is useful in chunked/streaming pipelines because we do not have to
recreate the read_parquet options API and can instead get the caller to
provide an options object that we pick apart.

Internally the reader implementation already uses int64 for num_rows, so
this is really just a (breaking) interface change.
@wence- wence- force-pushed the wence/fea/read-parquet-int64-num-rows branch from 4b1a6cf to ba89ada Compare October 15, 2025 13:46
@wence-
Contributor Author

wence- commented Oct 15, 2025

We need to add a check making sure we don't try to read more than 2B rows with the read_parquet() API.

Maybe adding a test for that error would be good.

I think adding a test would also be a good idea.

Done.

@wence-
Contributor Author

wence- commented Oct 15, 2025

We need to add the following check here to make sure we don't try to read more than 2B rows with the read_parquet API:

table_with_metadata reader_impl::read()
{
  CUDF_EXPECTS(_output_chunk_read_limit == 0,
               "Reading the whole file must not have non-zero byte_limit.");
  prepare_data(read_mode::READ_ALL);
  return read_chunk_internal(read_mode::READ_ALL);
}

CUDF_EXPECTS(std::cmp_less_equal(options.get_num_rows(), std::numeric_limits<cudf::size_type>::max()),
             "Reading the whole file at once must not read more rows than cudf's column size limits");

Done, but handling the case where num_rows is nullopt.
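
For reference, a hedged sketch (not the exact code in this PR) of how the guard might look at the top of reader_impl::read(), skipping the check when no row count was requested; options_num_rows stands in for however the reader stores the requested count and is hypothetical:

table_with_metadata reader_impl::read()
{
  CUDF_EXPECTS(_output_chunk_read_limit == 0,
               "Reading the whole file must not have non-zero byte_limit.");
  // options_num_rows: hypothetical std::optional<int64_t> carrying the requested row count.
  CUDF_EXPECTS(!options_num_rows.has_value() ||
                 std::cmp_less_equal(*options_num_rows, std::numeric_limits<cudf::size_type>::max()),
               "Reading the whole file at once must not read more rows than cudf's column size limit");
  prepare_data(read_mode::READ_ALL);
  return read_chunk_internal(read_mode::READ_ALL);
}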

@mhaseeb123
Member

mhaseeb123 commented Oct 15, 2025

Done, but handling the case where num_rows is nullopt.

Thanks. I don't really see the check in reader_impl::read() currently. Perhaps a commit hasn't been pushed yet? I know that read_parquet() does catch num_rows() being > 2B eventually while selecting row groups to read, but I would rather catch it early (right at the start of reader_impl::read()) and throw.

Also, in the test, can we read the same file (with the same num_rows) using the chunked parquet reader (with both finite and zero chunk and pass read limits) and make sure it does not throw?

Thanks for working on this again!

@wence-
Contributor Author

wence- commented Oct 15, 2025

Done, but handling the case where num_rows is nullopt.

Thanks. I don't really see the check in reader_impl::read() currently. Perhaps a commit hasn't been pushed yet? I know that read_parquet() will catch num_rows() being > 2B eventually while selecting row groups to read, but I would rather catch it early (right at the start of reader_impl::read()) and throw.

Ah sorry, too many in-flight branches; done now.

@mhaseeb123
Member

mhaseeb123 commented Oct 15, 2025

Also, in the test, can we read the same file (with the same num_rows) using the chunked parquet reader (with both finite and zero chunk and pass read limits) and make sure it does not throw?

Thanks. This looks good to me; I can approve as soon as we add the chunked reader to the new test as well (maybe we can move this test to parquet_chunked_reader_test.cu if needed).
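
For what it's worth, a hedged sketch of that extra coverage (assuming an options object built with the same oversized num_rows as in the new test; not the PR's actual test code):

for (auto const& [chunk_read_limit, pass_read_limit] :
     std::vector<std::pair<std::size_t, std::size_t>>{{0, 0}, {1024, 1024}}) {
  EXPECT_NO_THROW({
    auto reader = cudf::io::chunked_parquet_reader(chunk_read_limit, pass_read_limit, options);
    while (reader.has_next()) {
      // Drain all chunks; each chunk individually fits within size_type limits.
      [[maybe_unused]] auto const chunk = reader.read_chunk();
    }
  });
}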

Member

@mhaseeb123 mhaseeb123 left a comment

Minor stuff

@wence-
Contributor Author

wence- commented Oct 16, 2025

Also, in the test, can we read the same file (with the same num_rows) using the chunked parquet reader (with both finite and zero chunk and pass read limits) and make sure it does not throw?

Thanks. This looks good to me; I can approve as soon as we add the chunked reader to the new test as well (maybe we can move this test to parquet_chunked_reader_test.cu if needed).

I think I did this.

@wence- wence- requested a review from mhaseeb123 October 20, 2025 09:35
@wence- wence- force-pushed the wence/fea/read-parquet-int64-num-rows branch from a2c1b9f to 79d8de7 Compare October 21, 2025 13:02
Member

@mhaseeb123 mhaseeb123 left a comment

Thank you for iterating on this. LGTM!

@wence-
Contributor Author

wence- commented Oct 22, 2025

/merge

@rapids-bot rapids-bot bot merged commit 1b4b981 into rapidsai:main Oct 22, 2025
137 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python Oct 22, 2025
@wence- wence- deleted the wence/fea/read-parquet-int64-num-rows branch October 22, 2025 08:08

Labels

breaking: Breaking change
improvement: Improvement / enhancement to an existing function
libcudf: Affects libcudf (C++/CUDA) code.
pylibcudf: Issues specific to the pylibcudf package
Python: Affects Python cuDF API.

Projects

Status: Done


3 participants