Conversation

@no23reason (Contributor) commented Nov 7, 2025

Rationale for this change

After the changes in #47115, GeneratorStreams backed by anything other than RecordBatches failed. This includes Tables and RecordBatchReaders.

This was caused by an overly strict assumption that RecordBatchStream#GetSchemaPayload would always be called, which is not the case when the GeneratorStream is backed by a Table or a RecordBatchReader.

What changes are included in this PR?

Removal of the problematic assertion; the writer object is now initialized when it is first needed. Also, to accommodate this case, the incoming message is dropped when the writer is initialized in Next, because that message is of the SCHEMA type while a RECORD_BATCH or DICTIONARY_BATCH one is expected.
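
To make the control flow concrete, here is a minimal, self-contained sketch of the fixed Next logic. StreamState, Payload, MessageType, and the queue-based model are illustrative stand-ins, not the actual Arrow Flight classes; only Next, InitializeWriter, and the lazily created writer come from this PR.

#include <cassert>
#include <deque>
#include <optional>

// Illustrative message kinds mirroring the Arrow IPC message types named above.
enum class MessageType { SCHEMA, RECORD_BATCH, DICTIONARY_BATCH };

struct Payload {
  MessageType type;
};

class StreamState {
 public:
  // Before the fix, Next() returned an error unless the writer had already
  // been created by an earlier GetSchemaPayload() call; Table- and
  // RecordBatchReader-backed streams never make that call, so they failed.
  std::optional<Payload> Next() {
    if (!writer_initialized_) {
      InitializeWriter();  // the fix: lazily create the writer on first use
      // Initializing the writer emits a SCHEMA message first; drop it, since
      // Next() must yield RECORD_BATCH / DICTIONARY_BATCH payloads only.
      if (!queue_.empty() && queue_.front().type == MessageType::SCHEMA) {
        queue_.pop_front();
      }
    }
    if (queue_.empty()) return std::nullopt;  // end of stream
    Payload out = queue_.front();
    queue_.pop_front();
    return out;
  }

 private:
  void InitializeWriter() {
    writer_initialized_ = true;
    // A freshly opened IPC writer produces the schema, then any dictionary
    // batches, then the record batches.
    queue_ = {{MessageType::SCHEMA},
              {MessageType::DICTIONARY_BATCH},
              {MessageType::RECORD_BATCH}};
  }

  bool writer_initialized_ = false;
  std::deque<Payload> queue_;
};

int main() {
  StreamState stream;
  // The first Next() call skips the SCHEMA message and yields the dictionary
  // batch, mirroring the behavior this PR restores.
  assert(stream.Next()->type == MessageType::DICTIONARY_BATCH);
  assert(stream.Next()->type == MessageType::RECORD_BATCH);
  assert(!stream.Next().has_value());
  return 0;
}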

Are these changes tested?

Yes, via CI. The GeneratorStream tests were extended to cover streams backed by Tables and RecordBatchReaders, not just RecordBatches.

Are there any user-facing changes?

No, just a fix for a regression, restoring the functionality of version 21.0.0 and earlier.

@github-actions bot commented Nov 7, 2025

⚠️ GitHub issue #48076 has been automatically assigned in GitHub to PR creator.

@no23reason (Contributor, Author):

Unfortunately, there is more at play than I expected and I am not sure how to proceed :( but I believe the tests I added should be included upstream and should pass, so that all the possible types backing the GeneratorStream are tested.

@raulcd (Member) left a comment:

Thank you very much for raising the issue and working on this! @no23reason

 if (!writer_) {
-  return Status::UnknownError(
-      "Writer should be initialized before reading Next batches");
+  RETURN_NOT_OK(InitializeWriter());
@raulcd (Member) commented on the diff:

I think in this specific scenario we might have to drop the first message produced after calling writer_->WriteRecordBatch: it will be a schema message, emitted right after the ipc::internal::OpenRecordBatchWriter is created, and we want to keep the rest of the messages, which should be the expected RecordBatch payloads for those cases.
I would have to debug, but from what I understand that might be the cause of the current problem.
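
As a standalone illustration of the fact this suggestion relies on, here is a sketch using Arrow's public C++ IPC APIs (not code from the PR): a freshly opened stream writer emits a SCHEMA message before the first RECORD_BATCH, so the first payload produced after opening the writer has to be skipped when only batch payloads are wanted.

#include <arrow/api.h>
#include <arrow/io/memory.h>
#include <arrow/ipc/message.h>
#include <arrow/ipc/writer.h>
#include <iostream>

arrow::Status Run() {
  // Build a tiny batch to feed the writer.
  auto schema = arrow::schema({arrow::field("x", arrow::int64())});
  arrow::Int64Builder builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 3}));
  ARROW_ASSIGN_OR_RAISE(auto array, builder.Finish());
  auto batch = arrow::RecordBatch::Make(schema, /*num_rows=*/3, {array});

  // Write it as an IPC stream into an in-memory buffer.
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::BufferOutputStream::Create());
  ARROW_ASSIGN_OR_RAISE(auto writer, arrow::ipc::MakeStreamWriter(sink, schema));
  ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
  ARROW_RETURN_NOT_OK(writer->Close());
  ARROW_ASSIGN_OR_RAISE(auto buffer, sink->Finish());

  // Walk the raw messages: the first is SCHEMA, only then comes RECORD_BATCH.
  arrow::io::BufferReader source(buffer);
  auto reader = arrow::ipc::MessageReader::Open(&source);
  while (true) {
    ARROW_ASSIGN_OR_RAISE(auto message, reader->ReadNextMessage());
    if (!message) break;  // end of stream
    std::cout << arrow::ipc::FormatMessageType(message->type()) << std::endl;
  }
  return arrow::Status::OK();
}

int main() {
  arrow::Status st = Run();
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;  // prints the SCHEMA message type first, then the RECORD_BATCH one
}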

@no23reason (Contributor, Author) replied:

Thank you for the suggestion, that is a good idea. I will take a look as soon as I can; unfortunately, something more urgent cropped up that I have to deal with first :( I will let you know what I find out.

@github-actions bot added the awaiting changes label and removed the awaiting review label Nov 7, 2025
@no23reason force-pushed the gh-48076-generator-stream branch from c11ea33 to 72d4258 on November 10, 2025 15:33
@github-actions bot added the awaiting change review label and removed the awaiting changes label Nov 10, 2025
@no23reason force-pushed the gh-48076-generator-stream branch from 72d4258 to 921e6b7 on November 10, 2025 16:10
@raulcd (Member) left a comment:

Current CI failures are unrelated. Thanks for the fix! It seems dropping the schema message in those cases did the trick.
This looks good to me. @lidavidm what do you think?

@github-actions bot added the awaiting merge label and removed the awaiting change review label Nov 10, 2025
@no23reason (Contributor, Author):

Thank you for guiding me through it! :)

@lidavidm (Member) left a comment:

A schema message may be followed by dictionary batches - so in this context would that effectively be treated as dictionary replacement? Or would those batches be unexpected?

@github-actions bot added the awaiting changes and awaiting change review labels and removed the awaiting merge and awaiting changes labels Nov 11, 2025
@no23reason (Contributor, Author) commented Nov 11, 2025

@lidavidm I added some more tests that use dictionaries and they seem to be passing. I ran a debugger on one of them and saw a DICTIONARY_BATCH coming in and being returned from Next after skipping the SCHEMA one.

So Next "returning" a dictionary batch seems to be supported out of the box, and my change preserves that behavior.

I am not sure it answers your question, though.
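
For reference, here is a hedged sketch (not the PR's actual test code) of building a Table with a dictionary-encoded column, the kind of input the new tests feed to a GeneratorStream; it assumes the standard Arrow C++ builder APIs and DictionaryArray::FromArrays.

#include <arrow/api.h>
#include <iostream>

// Build a single-column Table whose column is dictionary-encoded.
arrow::Result<std::shared_ptr<arrow::Table>> MakeDictTable() {
  // The dictionary: the distinct string values.
  arrow::StringBuilder dict_builder;
  ARROW_RETURN_NOT_OK(dict_builder.AppendValues({"red", "green", "blue"}));
  ARROW_ASSIGN_OR_RAISE(auto dictionary, dict_builder.Finish());

  // The indices into the dictionary, one per row.
  arrow::Int8Builder index_builder;
  ARROW_RETURN_NOT_OK(index_builder.AppendValues({0, 2, 1, 0}));
  ARROW_ASSIGN_OR_RAISE(auto indices, index_builder.Finish());

  auto type = arrow::dictionary(arrow::int8(), arrow::utf8());
  ARROW_ASSIGN_OR_RAISE(
      auto column, arrow::DictionaryArray::FromArrays(type, indices, dictionary));

  auto schema = arrow::schema({arrow::field("color", type)});
  // A stream backed by such a Table must carry the dictionary as a
  // DICTIONARY_BATCH message ahead of the RECORD_BATCH messages.
  return arrow::Table::Make(schema, {column});
}

int main() {
  auto table = MakeDictTable();
  if (!table.ok()) {
    std::cerr << table.status().ToString() << std::endl;
    return 1;
  }
  std::cout << (*table)->ToString() << std::endl;
  return 0;
}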

To ensure the changes in the previous commit do not break
dictionaries, add more tests with data using them.
@no23reason force-pushed the gh-48076-generator-stream branch from 6014860 to 1c024c7 on November 11, 2025 15:33
@no23reason (Contributor, Author):

I am not sure whether I should do a rebase to try to fix the remaining failures (they all seem SQL/Substrait-related), but I do not want to waste CI resources 🤔

@raulcd (Member) commented Nov 13, 2025

The CI failures are unrelated, see:

We are waiting for a new version of protobuf (33.1), released yesterday, to be available on Homebrew. It is expected to fix the problem.

@no23reason (Contributor, Author):

Right, thank you for the details, so I will wait until that is resolved and then do a rebase, so that we have "all green" CI :)

@raulcd requested a review from lidavidm on November 13, 2025 09:52
@raulcd (Member) commented Nov 13, 2025

> Right, thank you for the details, so I will wait until that is resolved and then do a rebase, so that we have "all green" CI :)

I don't think it is strictly required; once @lidavidm can take another look, if he's happy we can merge, as we know those CI failures are unrelated. Nevertheless, if those are fixed before a review, a rebase to have a green CI is always welcome :)

@lidavidm (Member) left a comment:

Thanks!

@lidavidm merged commit f5b3fc7 into apache:main on Nov 14, 2025
42 of 45 checks passed
@lidavidm removed the awaiting change review label Nov 14, 2025
@github-actions bot added the awaiting merge label Nov 14, 2025
@conbench-apache-arrow commented:

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f5b3fc7.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
