perf: implement convert_to_state for SparkAvg by azhangd · Pull Request #21548 · apache/datafusion

azhangd · 2026-04-11T07:19:58Z

Which issue does this PR close?

Part of Deduplicate Spark function code with native/default datafusion function code #17964.

Rationale for this change

SparkAvg's AvgGroupsAccumulator doesn't implement supports_convert_to_state (defaults to false), which prevents the skip-partial-aggregation optimization from kicking in for queries that use Spark's avg().

I ran into this while benchmarking a Spark Connect engine built on DataFusion. On TPC-H q17 at SF10, the partial aggregate for avg(l_quantity) grouped by l_partkey (~2M groups out of 60M rows) was not triggering skip-aggregation:

Metric	Without convert_to_state	With convert_to_state
Partial aggregate memory	923 MB	40 MB
Partial aggregate elapsed	4.75s	109ms

The skip-aggregation probe (#11627) detects when a partial aggregate isn't reducing cardinality and falls back to passing rows through as state directly. This needs convert_to_state so the accumulator can produce [sum, count] state arrays from raw input. The built-in Avg already has this (#11734), but it wasn't carried over when SparkAvg was migrated from Comet in #17871.

What changes are included in this PR?

Adds convert_to_state() and supports_convert_to_state() to AvgGroupsAccumulator in datafusion-spark.

Follows the same approach as the built-in Avg, adapted for SparkAvg's differences:

State order is [sum, count] (vs [count, sum] in the built-in)
Count type is Int64 (vs UInt64 in the built-in)
Null handling uses NullBuffer::union directly instead of pulling in datafusion-functions-aggregate-common as a dep

Also cleaned up the fully-qualified arrow::array::BooleanArray references in update_batch / merge_batch since adding BooleanArray to the import block triggered the unused_qualifications lint.

Are these changes tested?

Yes, unit tests covering basic conversion, null propagation, filter handling, and a roundtrip through merge_batch to verify the converted state produces correct results end-to-end.

Are there any user-facing changes?

No. Queries using avg() through the Spark function registry will automatically benefit from skip-partial-aggregation on high-cardinality groupings.

perf: implement convert_to_state for SparkAvg

902c4ea

github-actions bot added the spark label Apr 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: implement convert_to_state for SparkAvg#21548

perf: implement convert_to_state for SparkAvg#21548
azhangd wants to merge 1 commit intoapache:mainfrom
azhangd:spark-avg-skip-partial-agg

azhangd commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

azhangd commented Apr 11, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant