Conversation

wence-
Contributor

@wence- wence- commented Oct 14, 2025

Description

Although reading a single table of more than size_type::max() rows is not possible, we may wish to specify options to read greater than size_type::max() rows from a large parquet file and require that the consumer of the options splits them up appropriately.

This is useful in chunked/streaming pipelines because we do not have to recreate the read_parquet options API and can instead get the caller to provide an options object that we pick apart.

Internally the reader implementation already uses int64 for num_rows, so this is really just a (breaking) interface change.
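
As an illustration, here is a minimal sketch (not part of this PR) of how a consumer might pick such an options object apart into size_type-sized reads. It assumes get_num_rows() returns an optional int64 count after this change and that the caller actually set num_rows; the helper name read_in_slices is hypothetical, and a real pipeline would more likely use the chunked reader than repeated read_parquet calls.

#include <cudf/io/parquet.hpp>
#include <cudf/types.hpp>

#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical helper: split a request for more than size_type::max() rows
// into slices that each fit in a single cudf table.
std::vector<cudf::io::table_with_metadata> read_in_slices(cudf::io::parquet_reader_options options)
{
  auto constexpr max_rows = static_cast<int64_t>(std::numeric_limits<cudf::size_type>::max());
  // Assumes num_rows was set by the caller; the nullopt ("read everything") case is omitted.
  int64_t remaining = options.get_num_rows().value();
  int64_t offset    = options.get_skip_rows();

  std::vector<cudf::io::table_with_metadata> slices;
  while (remaining > 0) {
    auto const rows_this_slice = std::min(remaining, max_rows);
    options.set_skip_rows(offset);
    options.set_num_rows(rows_this_slice);
    slices.push_back(cudf::io::read_parquet(options));
    offset += rows_this_slice;
    remaining -= rows_this_slice;
  }
  return slices;
}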

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested review from a team as code owners October 14, 2025 16:55
@wence- wence- added improvement Improvement / enhancement to an existing function breaking Breaking change labels Oct 14, 2025
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. pylibcudf Issues specific to the pylibcudf package labels Oct 14, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Oct 14, 2025
Contributor

@bdice bdice left a comment

Do we need docs somewhere that explain the limitation (when reading, the row count must fit into size_type)? Maybe adding a test for that error would be good. Otherwise LGTM.

@mhaseeb123
Member

mhaseeb123 commented Oct 14, 2025

We need to add the following check here to make sure we don't try to read more than 2B rows with the read_parquet API:

table_with_metadata reader_impl::read()
{
  CUDF_EXPECTS(_output_chunk_read_limit == 0,
               "Reading the whole file must not have non-zero byte_limit.");
  prepare_data(read_mode::READ_ALL);
  return read_chunk_internal(read_mode::READ_ALL);
}

CUDF_EXPECTS(std::cmp_less_equal(options.get_num_rows(), std::numeric_limits<cudf::size_type>::max()),
             "Reading the whole file at once must not read more rows than cudf's column size limits");

Member

@mhaseeb123 mhaseeb123 left a comment

We need to add a check making sure we don't try to read more than 2B rows with the read_parquet() API.

Maybe adding a test for that error would be good.

I think adding a test would also be a good idea.

Although reading a single table of more than size_type::max() rows is not
possible, we may wish to specify options to read greater than
size_type::max() rows from a large parquet file and require that the
consumer of the options splits them up appropriately.

This is useful in chunked/streaming pipelines because we do not have to
recreate the read_parquet options API and can instead get the caller to
provide an options object that we pick apart.

Internally the reader implementation already uses int64 for num_rows, so
this is really just a (breaking) interface change.
@wence- wence- force-pushed the wence/fea/read-parquet-int64-num-rows branch from 4b1a6cf to ba89ada Compare October 15, 2025 13:46
@wence-
Contributor Author

wence- commented Oct 15, 2025

We need to add a check making sure we don't try to read more than 2B rows with the read_parquet() API.

Maybe adding a test for that error would be good.

I think adding a test would also be a good idea.

Done.

@wence-
Contributor Author

wence- commented Oct 15, 2025

We need to add the following check here to make sure we don't try to read more than 2B rows with the read_parquet API:

table_with_metadata reader_impl::read()
{
  CUDF_EXPECTS(_output_chunk_read_limit == 0,
               "Reading the whole file must not have non-zero byte_limit.");
  prepare_data(read_mode::READ_ALL);
  return read_chunk_internal(read_mode::READ_ALL);
}

CUDF_EXPECTS(std::cmp_less_equal(options.get_num_rows(), std::numeric_limits<cudf::size_type>::max()),
             "Reading the whole file at once must not read more rows than cudf's column size limits");

Done, but handling the case where num_rows is nullopt.
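
For reference, a hedged sketch (not the exact code in this PR) of how the guard might look at the top of reader_impl::read(), skipping the check when no row count was requested; options_num_rows stands in for however the reader stores the requested count and is hypothetical:

table_with_metadata reader_impl::read()
{
  CUDF_EXPECTS(_output_chunk_read_limit == 0,
               "Reading the whole file must not have non-zero byte_limit.");
  // options_num_rows: hypothetical std::optional<int64_t> carrying the requested row count.
  CUDF_EXPECTS(!options_num_rows.has_value() ||
                 std::cmp_less_equal(*options_num_rows, std::numeric_limits<cudf::size_type>::max()),
               "Reading the whole file at once must not read more rows than cudf's column size limit");
  prepare_data(read_mode::READ_ALL);
  return read_chunk_internal(read_mode::READ_ALL);
}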

@mhaseeb123
Member

mhaseeb123 commented Oct 15, 2025

Done, but handling the case where num_rows is nullopt.

Thanks. I don't really see the check in reader_impl::read() currently. Perhaps a commit hasn't been pushed yet? I know that read_parquet() does catch num_rows() being > 2B eventually while selecting row groups to read, but I would rather catch it early (right at the start of reader_impl::read()) and throw.

Also, in the test, can we read the same file (with the same num_rows) using the chunked parquet reader (with both finite and zero chunk and pass read limits) and make sure it does not throw?

Thanks for working on this again!

@wence-
Contributor Author

wence- commented Oct 15, 2025

Done, but handling the case where num_rows is nullopt.

Thanks. I don't really see the check in reader_impl::read() currently. Perhaps a commit hasn't been pushed yet? I know that read_parquet() will catch num_rows() being > 2B eventually while selecting row groups to read, but I would rather catch it early (right at the start of reader_impl::read()) and throw.

Ah sorry, too many in-flight branches; done now.

@mhaseeb123
Member

mhaseeb123 commented Oct 15, 2025

Also, in the test, can we read the same file (with the same num_rows) using the chunked parquet reader (with both finite and zero chunk and pass read limits) and make sure it does not throw?

Thanks. This looks good to me; I can approve as soon as we add the chunked reader to the new test as well (maybe we can move this test to parquet_chunked_reader_test.cu if needed).
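
For what it's worth, a hedged sketch of that extra coverage (assuming an options object built with the same oversized num_rows as in the new test; not the PR's actual test code):

for (auto const& [chunk_read_limit, pass_read_limit] :
     std::vector<std::pair<std::size_t, std::size_t>>{{0, 0}, {1024, 1024}}) {
  EXPECT_NO_THROW({
    auto reader = cudf::io::chunked_parquet_reader(chunk_read_limit, pass_read_limit, options);
    while (reader.has_next()) {
      // Drain all chunks; each chunk individually fits within size_type limits.
      [[maybe_unused]] auto const chunk = reader.read_chunk();
    }
  });
}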

Member

@mhaseeb123 mhaseeb123 left a comment

Minor stuff

@wence-
Contributor Author

wence- commented Oct 16, 2025

Also, in the test, can we read the same file (with the same num_rows) using the chunked parquet reader (with both finite and zero chunk and pass read limits) and make sure it does not throw?

Thanks. This looks good to me; I can approve as soon as we add the chunked reader to the new test as well (maybe we can move this test to parquet_chunked_reader_test.cu if needed).

I think I did this.

@wence- wence- requested a review from mhaseeb123 October 20, 2025 09:35
@wence- wence- force-pushed the wence/fea/read-parquet-int64-num-rows branch from a2c1b9f to 79d8de7 Compare October 21, 2025 13:02
Member

@mhaseeb123 mhaseeb123 left a comment

Thank you for iterating on this. LGTM!

@wence-
Contributor Author

wence- commented Oct 22, 2025

/merge

@rapids-bot rapids-bot bot merged commit 1b4b981 into rapidsai:main Oct 22, 2025
137 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python Oct 22, 2025
@wence- wence- deleted the wence/fea/read-parquet-int64-num-rows branch October 22, 2025 08:08

Labels

breaking: Breaking change
improvement: Improvement / enhancement to an existing function
libcudf: Affects libcudf (C++/CUDA) code.
pylibcudf: Issues specific to the pylibcudf package
Python: Affects Python cuDF API.

Projects

Status: Done


3 participants