-
Notifications
You must be signed in to change notification settings - Fork 3.9k
GH-48076: [C++][Flight] fix GeneratorStream for Tables #48082
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
ef7f961 to
c11ea33
Compare
|
Unfortunately, there is more at play than I expected and I am not sure how to proceed :( but I believe the tests I added should be included in upstream and should be passing so that all the possible types backing the GeneratorStream are tested. |
raulcd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for raising the issue and working on this! @no23reason
| if (!writer_) { | ||
| return Status::UnknownError( | ||
| "Writer should be initialized before reading Next batches"); | ||
| RETURN_NOT_OK(InitializeWriter()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this specific scenario we might have to drop the first message after calling writer_->WriteRecordBatch, which will be a schema message, after creation of the ipc::internal::OpenRecordBatchWriter and we want to keep the rest of the messages which should be the expected RecordBatch for those cases.
I would have to debug but from what I understand that might be the cause of the current problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the suggestion, that is a good idea. I will take a look as soon as I can, unfortunately something more urgent cropped up I have to deal with first :( I will let you know what I find out
c11ea33 to
72d4258
Compare
After the changes in apache#47115, GeneratorStreams backed by anything else than RecordBatches failed. This includes Tables and RecordBatchReaders. This was caused by a too strict assumption that the RecordBatchStream#GetSchemaPayload would always get called, which is not the case when the GeneratorStream is backed by a Table or a RecordBatchReader. So to fix this, remove the assertion and instead initialize the writer on first access. Also, to accommodate for this case, drop the incoming message when initializing the writer in Next, as the message there is of the SCHEMA type and we want RECORD_BATCH one.
72d4258 to
921e6b7
Compare
raulcd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current CI failures are unrelated. Thanks for the fix! It seems dropping the schema message on those cases made the trick.
This looks good to me. @lidavidm what do you think?
|
Thank you for guiding me through it! :) |
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A schema message may be followed by dictionary batches - so in this context would that effectively be treated as dictionary replacement? Or would those batches be unexpected?
|
@lidavidm I added some more tests that use dictionaries and they seem to be passing. I ran a debugger on one of them and saw a DICTIONARY_BATCH coming in and being returned from So I am not sure it answers your question, though. |
To ensure the changes in the previous commit do not break dictionaries, add more tests with data using them.
6014860 to
1c024c7
Compare
|
I am not sure whether I should do a rebase to try fixing the remaining failures (they all seem SQL/Substrait related), but I do not want to waste the CI resources 🤔 |
|
The CI failures are unrelated see: We are waiting for a new version of protobuf (33.1) released yesterday to be available on homebrew. It is expected to fix the problem. |
|
Right, thank you for the details, so I will wait until that is resolved and then do a rebase, so that we have "all green" CI :) |
I don't think is strictly required, once @lidavidm can take another look, if he's happy we can merge as we know those CI failures are unrelated. Nevertheless if those are fixed before a review, a rebase to have a green CI is always welcome :) |
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f5b3fc7. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
After the changes in #47115, GeneratorStreams backed by anything else than RecordBatches failed. This includes Tables and RecordBatchReaders.
This was caused by a too strict assumption that the RecordBatchStream#GetSchemaPayload would always get called, which is not the case when the GeneratorStream is backed by a Table or a RecordBatchReader.
What changes are included in this PR?
Removal of the problematic assertion and initialization of the writer object when it is needed first.
Also, to accommodate for this case, drop the incoming message when initializing the writer in Next, as the message there
is of the SCHEMA type and we want RECORD_BATCH or DICTIONARY_BATCH one.
Are these changes tested?
Yes, via CI. Tests for the GeneratorStreams were extended so that they test GeneratorStreams backed by Tables and RecordBatchReaders, not just RecordBatches.
Are there any user-facing changes?
No, just a fix for a regression restoring the functionality from version 21.0.0 and earlier.