Is your feature request related to a problem or challenge?
The existing sort_pushdown_sorted benchmark covers the Exact path (sort elimination, scan limit). However, the Inexact path optimizations — reverse scan (#19064) and row group reorder by statistics (#21580) — are not benchmarked.
Without an Inexact benchmark, we can't:
Describe the solution you'd like
Extend benchmarks/bench.sh and queries under benchmarks/queries/sort_pushdown/ to add Inexact scenarios:
- Data: Generate a single large file with multiple row groups whose statistics overlap or are out of order (this forces the Inexact path). This can be done by:
  - Writing data in non-sorted order with a small max_row_group_size
  - Creating synthetic data with controlled row group boundaries
- Queries (benchmarks/queries/sort_pushdown/q5.sql, q6.sql, ...):
  - SELECT * FROM t ORDER BY col ASC LIMIT 10 (TopK + RG reorder)
  - SELECT * FROM t ORDER BY col DESC LIMIT 10 (TopK + reverse scan + RG reorder)
  - SELECT * FROM t ORDER BY col ASC LIMIT 1000 (larger LIMIT)
  - Wide-row variant: SELECT * with many columns to show the row-level filter benefit
- Baseline comparison: Run each query with and without datafusion.optimizer.enable_sort_pushdown to isolate the optimization's impact.
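
To make the "controlled row group boundaries" idea concrete, here is a minimal stdlib-only Python sketch of the layout the Data item describes. It only models the row groups in memory and does not write Parquet; the function names and the `step`/`span` parameters are hypothetical, chosen for illustration. A real generator would presumably write each list as one row group (for example by capping max_row_group_size) in whatever language the benchmark tooling uses.

```python
import random


def make_row_groups(num_groups, rows_per_group, step, span, seed=0):
    """Return row groups (lists of ints) in shuffled file order.

    Group i draws values from [i * step, i * step + span). Choosing
    span > step makes neighbouring value ranges overlap.
    """
    rng = random.Random(seed)
    groups = [
        [rng.randrange(i * step, i * step + span) for _ in range(rows_per_group)]
        for i in range(num_groups)
    ]
    rng.shuffle(groups)  # place the row groups out of order within the "file"
    return groups


def min_max_stats(groups):
    """Per-row-group (min, max), standing in for Parquet column statistics."""
    return [(min(g), max(g)) for g in groups]


def is_strictly_ordered(stats):
    """True only if every group's max is below the next group's min."""
    return all(a_max < b_min for (_, a_max), (b_min, _) in zip(stats, stats[1:]))


groups = make_row_groups(num_groups=16, rows_per_group=1000, step=100, span=250)
stats = min_max_stats(groups)
print(is_strictly_ordered(stats))
```

If `is_strictly_ordered` reports that the groups are not strictly ordered, the file as a whole cannot be treated as sorted on the column, which is the condition the Inexact scenarios are meant to exercise. Varying `span` relative to `step` would give a knob for how much the row group ranges overlap.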
Additional context