Use int64_t for the num_rows slot in parquet_reader_options #20256
Conversation
Do we need docs somewhere that explain the limitation (when reading, the row count must fit into size_type)? Maybe adding a test for that error would be good. Otherwise LGTM.
We need to add a check here, around cudf/cpp/src/io/parquet/reader_impl.cpp lines 877 to 884 (at 7bbc981), to make sure we don't try to read more than 2B rows with the read_parquet() API.
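For reference, a minimal sketch of the kind of guard being discussed; the helper name and its placement are hypothetical, and the real check belongs in the reader implementation referenced above:

```cpp
#include <cudf/types.hpp>
#include <cudf/utilities/error.hpp>

#include <cstdint>
#include <limits>

// Hypothetical helper sketching the requested guard: a single read_parquet()
// call must not materialize more rows than fit in cudf::size_type (int32_t),
// even though the num_rows option itself is now int64_t.
void check_num_rows_fits_size_type(int64_t num_rows_to_read)
{
  CUDF_EXPECTS(num_rows_to_read <= std::numeric_limits<cudf::size_type>::max(),
               "Number of rows to read exceeds cudf::size_type limit");
}
```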
We need to add a check making sure we don't try to read more than 2B rows with the read_parquet() API.
Maybe adding a test for that error would be good.
I think adding a test as well would also be a good idea.
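A hedged sketch of such a test, assuming a guard like the one above is in place; the test name, file path, and exact exception type are placeholders rather than the test that actually landed:

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/error.hpp>

#include <gtest/gtest.h>

#include <cstdint>
#include <limits>

// Hypothetical test: ask read_parquet() for more rows than fit in
// cudf::size_type and expect the reader to refuse.
TEST(ParquetReaderTest, NumRowsPastSizeTypeThrows)
{
  auto const too_many_rows =
    static_cast<int64_t>(std::numeric_limits<cudf::size_type>::max()) + 1;

  auto const opts = cudf::io::parquet_reader_options::builder(
                      cudf::io::source_info{"large.parquet"})  // placeholder path
                      .num_rows(too_many_rows)
                      .build();

  EXPECT_THROW(cudf::io::read_parquet(opts), cudf::logic_error);
}
```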
Although reading a single table of more than size_type::max() rows is not possible, we may wish to specify options to read greater than size_type::max() rows from a large parquet file and require that the consumer of the options splits them up appropriately. This is useful in chunked/streaming pipelines because we do not have to recreate the read_parquet options API and can instead get the caller to provide an options object that we pick apart. Internally the reader implementation already uses int64 for num_rows, so this is really just a (breaking) interface change.
Force-pushed from 4b1a6cf to ba89ada
Done.
Done, but handling the case that the …
Thanks. I don't really see the check in the … Also, in the test, can we read the same file (with the same num_rows) using the chunked parquet reader (with finite and zero chunk and pass read limits) and make sure it does not throw? Thanks for working on this again!
Ah sorry, too many in-flight branches, done now.
Thanks. This looks good to me; I can approve as soon as we add the chunked reader to the new test as well (maybe we can move this test to …
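For the chunked-reader half of that request, a sketch of what the additional assertions might look like; the helper name and the limit values are assumptions, and zero limits rely on the existing "0 means no limit" chunked-reader semantics:

```cpp
#include <cudf/io/parquet.hpp>

#include <gtest/gtest.h>

#include <cstddef>

// Sketch of the chunked-reader assertions the review asks for: the same
// options object (with num_rows possibly above size_type::max()) should be
// consumable chunk by chunk without throwing, both with zero (unlimited)
// and finite chunk/pass read limits.
void expect_chunked_read_succeeds(cudf::io::parquet_reader_options const& opts,
                                  std::size_t chunk_read_limit,
                                  std::size_t pass_read_limit)
{
  auto reader =
    cudf::io::chunked_parquet_reader(chunk_read_limit, pass_read_limit, opts);
  EXPECT_NO_THROW({
    while (reader.has_next()) {
      auto chunk = reader.read_chunk();  // each chunk fits in size_type rows
    }
  });
}

// e.g. expect_chunked_read_succeeds(opts, 0, 0);              // no limits
//      expect_chunked_read_succeeds(opts, 1 << 20, 1 << 28);  // finite limits
```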
Minor stuff
I think I did this.
Force-pushed the …ead-parquet-int64-num-rows branch from a2c1b9f to 79d8de7
Thank you for iterating on this. LGTM!
/merge
Description
Although reading a single table of more than size_type::max() rows is not possible, we may wish to specify options to read greater than size_type::max() rows from a large parquet file and require that the consumer of the options splits them up appropriately.
This is useful in chunked/streaming pipelines because we do not have to recreate the read_parquet options API and can instead get the caller to provide an options object that we pick apart.
Internally the reader implementation already uses int64 for num_rows, so this is really just a (breaking) interface change.
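As a hedged illustration of the intended usage pattern (the file name, the zero read limits, and the assumption that the chunked reader honors num_rows this way are mine, not part of the PR):

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/types.hpp>

#include <cstdint>
#include <limits>

int main()
{
  // More rows than fit in cudf::size_type: only expressible now that the
  // num_rows slot is int64_t.
  int64_t const total_rows =
    static_cast<int64_t>(std::numeric_limits<cudf::size_type>::max()) + 1'000'000;

  auto const opts = cudf::io::parquet_reader_options::builder(
                      cudf::io::source_info{"very_large.parquet"})  // placeholder
                      .num_rows(total_rows)
                      .build();

  // A streaming consumer hands the whole options object to the chunked
  // reader and receives tables that each fit in size_type rows.
  auto reader = cudf::io::chunked_parquet_reader(
    /*chunk_read_limit=*/0, /*pass_read_limit=*/0, opts);
  while (reader.has_next()) {
    auto result = reader.read_chunk();
    // ... hand result.tbl to the next pipeline stage ...
  }
  return 0;
}
```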
Checklist