perf: implement convert_to_state for SparkAvg #21548

Open
azhangd wants to merge 1 commit into apache:main from azhangd:spark-avg-skip-partial-agg


azhangd commented Apr 11, 2026

Which issue does this PR close?

Rationale for this change

SparkAvg's AvgGroupsAccumulator doesn't implement supports_convert_to_state (defaults to false), which prevents the skip-partial-aggregation optimization from kicking in for queries that use Spark's avg().

I ran into this while benchmarking a Spark Connect engine built on DataFusion. On TPC-H q17 at SF10, the partial aggregate for avg(l_quantity) grouped by l_partkey (~2M groups out of 60M rows) was not triggering skip-aggregation:

| Metric | Without convert_to_state | With convert_to_state |
| --- | --- | --- |
| Partial aggregate memory | 923 MB | 40 MB |
| Partial aggregate elapsed | 4.75 s | 109 ms |

The skip-aggregation probe (#11627) detects when a partial aggregate isn't reducing cardinality and falls back to passing rows through as state directly. This needs convert_to_state so the accumulator can produce [sum, count] state arrays from raw input. The built-in Avg already has this (#11734), but it wasn't carried over when SparkAvg was migrated from Comet in #17871.
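To make the probe's decision concrete, here is a minimal std-only sketch. It is not DataFusion's implementation: `SkipProbe`, its fields, and the threshold values used in the usage note are hypothetical names and numbers chosen for illustration. The idea is simply to watch the first rows, compare the number of distinct groups to the number of rows seen, and stop aggregating if the ratio stays high.

```rust
/// Hypothetical probe (not DataFusion's type): decides whether the partial
/// aggregate is reducing cardinality enough to be worth running.
struct SkipProbe {
    /// Rows to observe before a decision is made (assumed knob).
    rows_threshold: usize,
    /// groups / rows ratio at or above which aggregation is skipped (assumed knob).
    ratio_threshold: f64,
    rows_seen: usize,
    groups_seen: usize,
}

impl SkipProbe {
    fn new(rows_threshold: usize, ratio_threshold: f64) -> Self {
        Self { rows_threshold, ratio_threshold, rows_seen: 0, groups_seen: 0 }
    }

    /// Record one input batch: its row count and the total number of
    /// distinct groups held by the hash table after processing it.
    fn observe(&mut self, batch_rows: usize, total_groups: usize) {
        self.rows_seen += batch_rows;
        self.groups_seen = total_groups;
    }

    /// True once enough rows were seen and nearly every row opened a new group.
    fn should_skip(&self) -> bool {
        self.rows_seen >= self.rows_threshold
            && self.groups_seen as f64 / self.rows_seen as f64 >= self.ratio_threshold
    }
}
```

With `SkipProbe::new(100, 0.8)`, observing 100 rows that produced 95 groups makes `should_skip()` return true, while 100 rows that produced 10 groups keeps the aggregate running. Once the probe fires, the operator needs each accumulator to turn raw rows into state, which is where `convert_to_state` comes in.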

What changes are included in this PR?

Adds convert_to_state() and supports_convert_to_state() to AvgGroupsAccumulator in datafusion-spark.

Follows the same approach as the built-in Avg, adapted for SparkAvg's differences:

  • State order is [sum, count] (vs [count, sum] in the built-in)
  • Count type is Int64 (vs UInt64 in the built-in)
  • Null handling uses NullBuffer::union directly instead of pulling in datafusion-functions-aggregate-common as a dep
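The points above can be sketched without Arrow, using `Option` in place of validity bitmaps. This is an illustration only, not the code in this PR: `AvgState` and this free-standing `convert_to_state` are hypothetical, and emitting a null state row for a null or filtered-out input is an assumption modeled on how the description says validity is combined. Each surviving row becomes a one-row state with `sum = value` and `count = 1` (as `i64`), in `[sum, count]` order.

```rust
/// Hypothetical per-row partial state, in the [sum, count] order used by SparkAvg.
#[derive(Debug, PartialEq)]
struct AvgState {
    sums: Vec<Option<f64>>,
    counts: Vec<Option<i64>>, // Int64 counts, unlike the UInt64 of the built-in Avg
}

/// Convert raw input rows into state rows without aggregating them.
/// A row is valid in the output only if the value is non-null AND the
/// optional filter is `Some(true)` for that row (a union of the two null masks).
fn convert_to_state(values: &[Option<f64>], filter: Option<&[Option<bool>]>) -> AvgState {
    let mut sums = Vec::with_capacity(values.len());
    let mut counts = Vec::with_capacity(values.len());
    for (i, value) in values.iter().enumerate() {
        // No filter means every row passes; a null filter entry rejects the row.
        let keep = match filter {
            Some(f) => f[i] == Some(true),
            None => true,
        };
        match (keep, value) {
            (true, Some(x)) => {
                sums.push(Some(*x));
                counts.push(Some(1));
            }
            _ => {
                sums.push(None);
                counts.push(None);
            }
        }
    }
    AvgState { sums, counts }
}
```

The output has exactly one state row per input row, so the group keys computed for the input batch still line up with the state arrays passed downstream.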

Also cleaned up the fully-qualified arrow::array::BooleanArray references in update_batch / merge_batch since adding BooleanArray to the import block triggered the unused_qualifications lint.

Are these changes tested?

Yes. Unit tests cover basic conversion, null propagation, filter handling, and a roundtrip through merge_batch to verify that the converted state produces correct results end-to-end.
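The roundtrip property can be illustrated with a small std-only model. This is not the PR's test code: `avg_direct`, `to_state`, and `merge` are hypothetical helpers that stand in for `update_batch`, `convert_to_state`, and `merge_batch` respectively. The check is that averaging directly gives the same result as converting each row to `(sum, count)` state and then merging.

```rust
use std::collections::HashMap;

/// (sum, count) partial state for one group or one row.
type State = (f64, i64);

/// Turn accumulated (sum, count) pairs into averages, dropping empty groups.
fn finalize(acc: HashMap<u32, State>) -> HashMap<u32, f64> {
    acc.into_iter()
        .filter(|(_, (_, count))| *count > 0)
        .map(|(key, (sum, count))| (key, sum / count as f64))
        .collect()
}

/// Stand-in for update_batch: aggregate raw values per group key.
fn avg_direct(keys: &[u32], values: &[Option<f64>]) -> HashMap<u32, f64> {
    let mut acc: HashMap<u32, State> = HashMap::new();
    for (key, value) in keys.iter().zip(values) {
        if let Some(x) = value {
            let entry = acc.entry(*key).or_insert((0.0, 0));
            entry.0 += *x;
            entry.1 += 1;
        }
    }
    finalize(acc)
}

/// Stand-in for convert_to_state: one state row per input row, null stays null.
fn to_state(values: &[Option<f64>]) -> Vec<Option<State>> {
    values.iter().map(|v| v.map(|x| (x, 1))).collect()
}

/// Stand-in for merge_batch: combine state rows per group key, skipping nulls.
fn merge(keys: &[u32], states: &[Option<State>]) -> HashMap<u32, f64> {
    let mut acc: HashMap<u32, State> = HashMap::new();
    for (key, state) in keys.iter().zip(states) {
        if let Some((sum, count)) = state {
            let entry = acc.entry(*key).or_insert((0.0, 0));
            entry.0 += *sum;
            entry.1 += *count;
        }
    }
    finalize(acc)
}
```

For keys `[1, 1, 2, 2]` and values `[1.0, 3.0, null, 5.0]`, both paths yield an average of 2.0 for group 1 and 5.0 for group 2.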

Are there any user-facing changes?

No. Queries using avg() through the Spark function registry will automatically benefit from skip-partial-aggregation on high-cardinality groupings.

github-actions bot added the spark label Apr 11, 2026