
DRILL-8545: Disable HashAgg for collect_to_list_varchar due to ordering requirements #3042

Merged
cgivre merged 1 commit into apache:master from rymarm:DRILL-8545
Mar 30, 2026


rymarm commented Mar 23, 2026

DRILL-8545: COLLECT_TO_LIST_VARCHAR function returns incorrect result when Hash Aggregator operator used

Description

Root cause

The collect_to_list_varchar function is incompatible with the Hash Aggregator because the aggregator processes data in a non-sequential manner, while the underlying ValueVector framework requires sequential writes for variable-length data. Furthermore, the Drill UDF framework lacks a straightforward mechanism to buffer these values internally before flushing them to the output vector, making it impossible to reorder them on the fly during the aggregation phase.
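
To make the sequential-write constraint concrete, here is a minimal sketch of an offsets-based variable-length buffer. `VarLenBufferSketch` and its methods are hypothetical names for illustration only; this is a simplified model, not Drill's actual ValueVector API. Because the start offset of value `i` is derived from the end of value `i - 1`, a write that skips ahead has no valid offset to build on.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical, simplified model of an offsets-based variable-length buffer.
 * It is NOT Drill's ValueVector implementation; it only illustrates why
 * values must be appended in index order.
 */
public class VarLenBufferSketch {
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();
    private final int[] offsets;  // offsets[i]..offsets[i + 1] delimit value i
    private int nextIndex = 0;    // the only index that may be written next

    public VarLenBufferSketch(int capacity) {
        this.offsets = new int[capacity + 1];
    }

    /** Appends a value; rejects any index other than the next sequential one. */
    public void set(int index, String value) {
        if (index != nextIndex) {
            throw new IllegalStateException(
                "Non-sequential write: expected index " + nextIndex + " but got " + index);
        }
        if (nextIndex >= offsets.length - 1) {
            throw new IndexOutOfBoundsException("Buffer is full");
        }
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        data.write(bytes, 0, bytes.length);
        // The end offset of this value is the start offset of the next one.
        offsets[index + 1] = offsets[index] + bytes.length;
        nextIndex++;
    }

    /** Reads back a previously written value using its offset pair. */
    public String get(int index) {
        if (index < 0 || index >= nextIndex) {
            throw new IndexOutOfBoundsException("No value at index " + index);
        }
        int start = offsets[index];
        int length = offsets[index + 1] - start;
        return new String(data.toByteArray(), start, length, StandardCharsets.UTF_8);
    }
}
```

In this model, a hash-based aggregator that updates groups in arbitrary order would hit the non-sequential branch, whereas sorted input that visits each group once, in order, would not.
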

Solution

To ensure data integrity and prevent index out-of-bounds exceptions, I have modified the Hash Aggregator physical planning rule. The planner will now explicitly disallow the Hash Aggregator if a collect_to_list_varchar call is detected in the aggregate expression. This forces the optimizer to fall back to the Streaming Aggregator, which provides the necessary ordered input.
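
The shape of that check can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: `HashAggEligibilitySketch`, `canUseHashAgg`, and the representation of aggregate calls as plain function names are all hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical sketch of the eligibility check described above.
 * The real planning rule works on the planner's aggregate call objects;
 * plain function names are used here only to keep the example self-contained.
 */
public class HashAggEligibilitySketch {
    // Functions that need ordered, sequential input (assumed set for illustration).
    private static final Set<String> ORDER_SENSITIVE = Set.of("collect_to_list_varchar");

    /** Returns false if any aggregate call requires ordered input. */
    public static boolean canUseHashAgg(List<String> aggregateFunctionNames) {
        for (String name : aggregateFunctionNames) {
            if (ORDER_SENSITIVE.contains(name.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }
}
```

A rule built this way would simply decline to produce a Hash Aggregator plan when the check returns false, leaving the Streaming Aggregator as the remaining candidate.
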

Documentation

No changes.

Testing

Updated the existing unit tests so that they cover this problem.

rymarm requested review from cgivre and jnturton March 23, 2026 19:22

cgivre left a comment

@rymarm Thank you for this enhancement. I'm fine with the changes, but I have a question before I approve. If a Drill cluster is configured with HashAgg enabled, and a user executes a query with collect_to_list_varchar, will Drill fall back to StreamAgg or will Drill throw an error? I was a little confused by the unit tests.

cgivre added the minor-update and performance (PRs that Improve Performance) labels Mar 29, 2026
rymarm commented Mar 30, 2026

@cgivre Hi Charles!

> If a Drill cluster is configured with HashAgg enabled, and a user executes a query with collect_to_list_varchar, will Drill fall back to StreamAgg or will Drill throw an error?

In the default configuration, Drill will fall back. If a Drill cluster is configured with HashAgg enabled and StreamAgg is enabled as well (by default, both the HashAgg and StreamAgg operators are enabled), Drill will simply fall back to the StreamAgg operator. Otherwise, if HashAgg is enabled and StreamAgg is DISABLED, Drill will throw a CannotPlanException, because Drill has only two aggregation operator implementations: in this case one is not acceptable and the other is disabled.
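
The combinations above can be summarized in a small sketch. `AggCandidatesSketch`, its enum, and the boolean parameters are hypothetical and stand in for the planner's enable options; an empty candidate set corresponds to the situation where planning fails with CannotPlanException.

```java
import java.util.EnumSet;

/**
 * Hypothetical summary of which aggregation operators remain available
 * to the planner. Not Drill code; the names are for illustration only.
 */
public class AggCandidatesSketch {
    public enum Operator { HASH_AGG, STREAM_AGG }

    public static EnumSet<Operator> candidates(boolean hashAggEnabled,
                                               boolean streamAggEnabled,
                                               boolean usesCollectToListVarchar) {
        EnumSet<Operator> result = EnumSet.noneOf(Operator.class);
        // HashAgg is ruled out whenever collect_to_list_varchar is present.
        if (hashAggEnabled && !usesCollectToListVarchar) {
            result.add(Operator.HASH_AGG);
        }
        if (streamAggEnabled) {
            result.add(Operator.STREAM_AGG);
        }
        // An empty set models the case where no plan can be produced.
        return result;
    }
}
```

With both operators enabled and collect_to_list_varchar in the query, only `STREAM_AGG` remains; with StreamAgg disabled as well, the set is empty.
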


cgivre left a comment

+1 LGTM
Thanks @rymarm for this!

cgivre merged commit 3c3238c into apache:master Mar 30, 2026
6 checks passed