DRILL-8545: Disable HashAgg for collect_to_list_varchar due to ordering requirements#3042
cgivre merged 1 commit into apache:master
Conversation
cgivre left a comment
@rymarm Thank you for this enhancement. I'm fine with the changes, but I have a question before I approve. If a Drill cluster is configured with HashAgg enabled, and a user executes a query with collect_to_list_varchar, will Drill fall back to StreamAgg or will Drill throw an error? I was a little confused by the unit tests.
@cgivre Hi Charles!
Yes, definitely. If a Drill cluster is configured with |
DRILL-8545: COLLECT_TO_LIST_VARCHAR function returns incorrect result when Hash Aggregator operator used
Description
Root cause
The collect_to_list_varchar function is incompatible with the Hash Aggregator because the aggregator processes data in a non-sequential manner, while the underlying ValueVector framework requires sequential writes for variable-length data. Furthermore, the Drill UDF framework lacks a straightforward mechanism to buffer these values internally before flushing them to the output vector, making it impossible to reorder them on the fly during the aggregation phase.
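The ordering constraint can be illustrated with a toy model. The sketch below is not Drill's ValueVector API; the class RepeatedVarCharSketch and its methods are hypothetical and only assume what the description states, namely that variable-length entries must be written sequentially. In the real operator the failure shows up as incorrect results or index out-of-bounds exceptions rather than the explicit check used here.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, NOT Drill's ValueVector API: a list-of-strings column whose
// entries are packed back to back, so only the newest group can still grow.
final class RepeatedVarCharSketch {
    private final List<List<String>> groups = new ArrayList<>();

    /** Appends value to the list of the given group, enforcing sequential writes. */
    void append(int group, String value) {
        int last = groups.size() - 1;
        if (group == last + 1) {
            groups.add(new ArrayList<>()); // start the next group
        } else if (group != last || group < 0) {
            // An earlier group's data is already followed by later data.
            throw new IllegalStateException(
                "non-sequential write to group " + group + ", current group is " + last);
        }
        groups.get(group).add(value);
    }

    /** Replays writes in the given group order; returns false if any write is rejected. */
    static boolean replay(int[] groupOrder) {
        RepeatedVarCharSketch vector = new RepeatedVarCharSketch();
        try {
            for (int i = 0; i < groupOrder.length; i++) {
                vector.append(groupOrder[i], "v" + i);
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Rows arrive grouped by key, as the ordered input of a streaming aggregator would be.
        System.out.println(replay(new int[] {0, 0, 1, 1}));
        // Rows arrive interleaved across keys, as a hash aggregator may see them.
        System.out.println(replay(new int[] {0, 1, 0, 1}));
    }
}
```

The grouped order succeeds because each group is finished before the next one starts; the interleaved order fails as soon as a row for group 0 arrives after group 1 has been opened.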
Solution
To ensure data integrity and prevent index out-of-bounds exceptions, I have modified the Hash Aggregator physical planning rule. The planner will now explicitly disallow the Hash Aggregator if a collect_to_list_varchar call is detected in the aggregate expression. This forces the optimizer to fall back to the Streaming Aggregator, which provides the necessary ordered input.
Documentation
No changes.
Testing
Updated the existing unit test cases so that they cover the problem described above.
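
The planner behavior described under Solution can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: AggregatorChoiceSketch, hashAggAllowed and choose are hypothetical names, and Drill's actual planner selects between physical plans by cost rather than through a single boolean check. The sketch only assumes what the description states: a collect_to_list_varchar call rules out the Hash Aggregator, and the query then runs with the Streaming Aggregator instead of failing.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the planning decision; names are illustrative only.
final class AggregatorChoiceSketch {
    enum Aggregator { HASH, STREAMING }

    // Functions that need ordered input. The PR description names only this one.
    private static final Set<String> ORDER_DEPENDENT = Set.of("collect_to_list_varchar");

    /** Returns false if any aggregate call requires ordered input. */
    static boolean hashAggAllowed(List<String> aggCalls) {
        for (String name : aggCalls) {
            if (ORDER_DEPENDENT.contains(name.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    /** Falls back to the streaming aggregator instead of raising an error. */
    static Aggregator choose(boolean hashAggEnabled, List<String> aggCalls) {
        return hashAggEnabled && hashAggAllowed(aggCalls)
            ? Aggregator.HASH
            : Aggregator.STREAMING;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, List.of("sum", "count")));
        System.out.println(choose(true, List.of("collect_to_list_varchar")));
    }
}
```

Under these assumptions, enabling HashAgg on the cluster does not change the outcome for queries that call collect_to_list_varchar: the check disqualifies the Hash Aggregator for that aggregate only, and other aggregates are unaffected.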