perf: Optimize ExternalSorter with chunked sort pipeline and radix sort kernel by mbutrovich · Pull Request #21600 · apache/datafusion

mbutrovich · 2026-04-13T22:11:45Z

Draft for benchmarking, builds on #21525.

Which issue does this PR close?

Partially addresses #21543.

Rationale for this change

ExternalSorter's merge path sorts each incoming batch individually (typically 8192 rows), then k-way merges all of them. This creates two problems:

Too many small sorted runs. At scale (TPC-H SF10, ~60M rows in lineitem), ~7300 individually-sorted batches feed the k-way merge with high fan-in.
Radix sort can't amortize encoding. MSD radix sort (apache/arrow-rs#9683, "feat(arrow-row): add MSD radix sort kernel for row-encoded keys") is 2-3x faster than lexsort_to_indices at 32K+ rows for multi-column sorts, but at 8K rows the RowConverter encoding cost dominates. TPC-H SF10 benchmarks on #21525 ("perf: Bring over apache/arrow-rs/9683 radix sort, integrate into ExternalSorter") confirmed this: naively swapping in radix sort made 12/22 queries slower (by up to 1.20x).
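
To make the fan-in figure concrete, here is a back-of-the-envelope sketch. It assumes a round 60M rows handled by a single sorter (real plans split rows across partitions, so the fan-in of each merge would be lower), and `sorted_runs` is a hypothetical helper, not a DataFusion function.

```rust
/// Number of sorted runs produced when `total_rows` rows are sorted in
/// units of `rows_per_run` rows (the last run may be partial).
/// Hypothetical helper for illustration only.
fn sorted_runs(total_rows: u64, rows_per_run: u64) -> u64 {
    total_rows.div_ceil(rows_per_run)
}
```

Under these assumptions, `sorted_runs(60_000_000, 8192)` gives 7325 runs, in line with the ~7300 quoted above, while sorting in 32768-row units gives 1832 runs, roughly a 4x reduction in merge fan-in.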

What changes are included in this PR?

Chunked sort pipeline

Replaces ExternalSorter's buffer-then-sort architecture with a coalesce-then-sort pipeline:

Incoming batches accumulate in a BatchCoalescer until sort_coalesce_target_rows (default 32768) is reached
Each coalesced batch is sorted (radix or lexsort) and chunked back to batch_size
On memory pressure, sorted runs spill to disk (merged into one file when headroom is available, one file per run otherwise)
At query completion, runs are k-way merged via the existing StreamingMergeBuilder
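
The steps above can be sketched with a toy model. This is a minimal illustration, not the PR's code: a "batch" is a plain `Vec<i64>` of keys, the coalescer is a simple buffer rather than Arrow's BatchCoalescer, spilling is omitted, and `coalesce_sort_chunk`, `sort_and_chunk`, and `merge_runs` are hypothetical names.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Toy stand-in for a record batch: a single column of sort keys.
type Batch = Vec<i64>;

/// Sort one coalesced buffer and chunk it back to `batch_size` rows.
fn sort_and_chunk(mut rows: Batch, batch_size: usize) -> Vec<Batch> {
    rows.sort_unstable();
    rows.chunks(batch_size).map(|chunk| chunk.to_vec()).collect()
}

/// Accumulate incoming batches until `target_rows` is reached, then emit
/// one sorted run (a list of `batch_size` chunks) per coalesced buffer.
fn coalesce_sort_chunk(
    input: impl IntoIterator<Item = Batch>,
    target_rows: usize,
    batch_size: usize,
) -> Vec<Vec<Batch>> {
    assert!(target_rows > 0 && batch_size > 0);
    let mut runs = Vec::new();
    let mut buffer: Batch = Vec::new();
    for batch in input {
        buffer.extend(batch);
        // Emit exactly `target_rows` rows at a time; keep the remainder.
        while buffer.len() >= target_rows {
            let rest = buffer.split_off(target_rows);
            let full = std::mem::replace(&mut buffer, rest);
            runs.push(sort_and_chunk(full, batch_size));
        }
    }
    // Partial flush at end of input.
    if !buffer.is_empty() {
        runs.push(sort_and_chunk(buffer, batch_size));
    }
    runs
}

/// K-way merge of the sorted runs using a min-heap keyed on
/// (value, run index, position within the run).
fn merge_runs(runs: &[Vec<Batch>]) -> Vec<i64> {
    let flat: Vec<Vec<i64>> = runs.iter().map(|run| run.concat()).collect();
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in flat.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }
    let mut out = Vec::with_capacity(flat.iter().map(Vec::len).sum());
    while let Some(Reverse((value, run_idx, pos))) = heap.pop() {
        out.push(value);
        if let Some(&next) = flat[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    out
}
```

The merge fan-in equals the number of runs, so a larger `target_rows` means fewer heap entries competing at merge time, which is the effect the pipeline is after.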

Uniform coalescing, per-batch algorithm selection

All schemas coalesce to sort_coalesce_target_rows. This reduces merge fan-in for all queries, including single-column sorts like sort-merge join keys.

Per batch, radix sort is used when the schema is eligible (multi-column, primitives/strings) and the batch reached sort_coalesce_target_rows. Otherwise lexsort is used. A sort_use_radix config (default true) allows disabling radix entirely to isolate the pipeline's contribution.
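
The per-batch rule described above can be written as a small predicate. This is a sketch under assumptions: `SortAlgorithm`, `SortContext`, and `choose_algorithm` are hypothetical names, and schema eligibility is reduced to a single boolean rather than the actual schema check.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SortAlgorithm {
    Radix,
    Lexsort,
}

/// Hypothetical summary of the inputs to the per-batch decision.
struct SortContext {
    /// Mirrors the `sort_use_radix` config.
    use_radix: bool,
    /// Whether the sort schema is radix-eligible (assumed precomputed once).
    schema_radix_eligible: bool,
    /// Mirrors the `sort_coalesce_target_rows` config.
    coalesce_target_rows: usize,
}

/// Radix only when enabled, eligible, and the batch reached the coalesce
/// target; otherwise lexsort (e.g. for a partial flush).
fn choose_algorithm(ctx: &SortContext, batch_rows: usize) -> SortAlgorithm {
    if ctx.use_radix
        && ctx.schema_radix_eligible
        && batch_rows >= ctx.coalesce_target_rows
    {
        SortAlgorithm::Radix
    } else {
        SortAlgorithm::Lexsort
    }
}
```

With `coalesce_target_rows` at 32768, a full coalesced batch on an eligible schema picks radix, while a partially filled final batch falls back to lexsort.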

Metrics

New radix_sorted_batches and lexsort_sorted_batches counters in ExternalSorterMetrics, visible in EXPLAIN ANALYZE.

Dead code removal

Sorted runs no longer require an in-memory merge before spilling. Removes in_mem_sort_stream, sort_batch_stream, consume_and_spill_append, spill_finish, organize_stringview_arrays, and in_progress_spill_file.

Config changes

New: sort_coalesce_target_rows (default 32768)
New: sort_use_radix (default true)
Deprecated: sort_in_place_threshold_bytes (no longer read, warn attribute per API health policy)

Are these changes tested?

4 new unit tests (coalescing, partial flush, per-run spill, merged spill)
All 52 sort unit tests pass
All sort fuzz, sort query fuzz, and spilling fuzz tests pass
information_schema.slt updated for new configs

Are there any user-facing changes?

New config sort_coalesce_target_rows (default 32768) controls coalesce target
New config sort_use_radix (default true) enables/disables radix sort
New metrics radix_sorted_batches and lexsort_sorted_batches in EXPLAIN ANALYZE
sort_in_place_threshold_bytes is deprecated
The pipeline is more memory-efficient (shrinks reservations after sorting) so some workloads may spill less frequently


gratus00 · 2026-04-14T12:10:16Z

datafusion/physical-plan/src/sorts/sort.rs

+        } else if self.sorted_runs_memory > reservation_size {
+            self.reservation
+                .grow(self.sorted_runs_memory - reservation_size);
+        }


The comment says it would only exceed the limit by a small amount, but wouldn't this compound across partitions when there are many of them?

Maybe we could still use try_grow here and remedy the failure, or at least cap the grow amount?

I updated the comments to explain the scenario a bit more, but let me know if you still think we should do something more strict.
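
For reference, the fallible alternative raised in this thread could look roughly like the sketch below. It illustrates the reviewer's suggestion, not what the PR does: `Pool`, `Outcome`, and `reconcile` are hypothetical, and the pool is a toy counter rather than DataFusion's memory pool.

```rust
/// Toy memory pool with a hard limit (stand-in for a shared pool).
struct Pool {
    limit: usize,
    used: usize,
}

impl Pool {
    /// Reserve `bytes` only if that keeps the pool within its limit.
    fn try_grow(&mut self, bytes: usize) -> bool {
        match self.used.checked_add(bytes) {
            Some(total) if total <= self.limit => {
                self.used = total;
                true
            }
            _ => false,
        }
    }

    /// Return `bytes` to the pool.
    fn shrink(&mut self, bytes: usize) {
        self.used = self.used.saturating_sub(bytes);
    }
}

#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    /// Reservation already covered the sorted runs (possibly shrunk).
    Fits,
    /// Reservation was grown within the pool limit.
    Grown,
    /// Growing would exceed the limit; the caller should spill instead.
    MustSpill,
}

/// Align `reserved` with the memory actually held by sorted runs,
/// refusing to overshoot the pool limit.
fn reconcile(pool: &mut Pool, reserved: &mut usize, sorted_runs_memory: usize) -> Outcome {
    if sorted_runs_memory > *reserved {
        let extra = sorted_runs_memory - *reserved;
        if pool.try_grow(extra) {
            *reserved = sorted_runs_memory;
            Outcome::Grown
        } else {
            Outcome::MustSpill
        }
    } else {
        pool.shrink(*reserved - sorted_runs_memory);
        *reserved = sorted_runs_memory;
        Outcome::Fits
    }
}
```

The trade-off is that a failed `try_grow` needs a recovery path (here, a signal to spill), whereas an infallible grow keeps the code simpler at the cost of a possible overshoot per partition.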

gratus00 · 2026-04-14T12:24:06Z

datafusion/physical-plan/src/sorts/sort.rs

+        let use_radix_for_this_batch =
+            self.use_radix && batch.num_rows() > self.batch_size;
+


nit: since there is already a gate in sort_batch that handles radix vs lexsort maybe we could change the name of this variable for readability to something like use_chunked_radix?

Also, I just realized: the "graceful degradation" section says the pipeline falls back to lexsort below batch_size rows, but wouldn't the else branch here still take the radix path? When use_radix_for_this_batch is false, this calls sort_batch, and sort_batch independently checks use_radix_sort when fetch.is_none(), so for radix-eligible schemas it takes the radix path regardless of row count. Wouldn't that be slower?

Thanks for the feedback @gratus00! I addressed both of these. Checking per-batch is wasteful too. I created an inner function that we can call directly because sort_batch is a public API and I don't want to change it.

… regression.

- Always coalesce to `sort_coalesce_target_rows` regardless of schema (removed conditional that fell back to `batch_size` for non-radix)
- Both radix and lexsort paths now go through `sort_batch_chunked` (both chunk output to `batch_size`)
- Per-batch radix decision uses `sort_coalesce_target_rows` as threshold instead of `batch_size`
- Added `radix_sorted_batches` and `lexsort_sorted_batches` counters to `ExternalSorterMetrics`
- Added `sort_coalesce_target_rows` and `sort_use_radix` config fields to `ExternalSorter`
- New `sort_use_radix` parameter gates the `use_radix_sort()` schema check

## `datafusion/common/src/config.rs`

- New config: `sort_use_radix: bool, default = true`
- Updated `sort_coalesce_target_rows` doc

## `datafusion/execution/src/config.rs`

- New builder method: `with_sort_use_radix()`

## `datafusion/core/tests/fuzz_cases/sort_fuzz.rs`

- `(20000, false)` → `(50000, true)` to fix flaky test

## `datafusion/sqllogictest/test_files/information_schema.slt` + `docs/source/user-guide/configs.md`

- Added `sort_use_radix` entry, updated `sort_coalesce_target_rows` description

adriangbot · 2026-04-14T16:00:20Z

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4245342572-1236-8dldw 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing sort_redesign (73bb06b) to 0143dfe (merge-base) diff using: tpch10
Results will be posted here when complete

adriangbot · 2026-04-14T16:16:46Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

Comparing HEAD and sort_redesign
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                   HEAD ┃                         sort_redesign ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │      370.78 / 372.71 ±1.53 / 375.02 ms │     367.32 / 369.51 ±1.62 / 371.85 ms │     no change │
│ QQuery 2  │     479.03 / 498.26 ±12.67 / 512.09 ms │     441.93 / 449.16 ±4.75 / 456.30 ms │ +1.11x faster │
│ QQuery 3  │     550.68 / 651.96 ±51.73 / 692.64 ms │     504.72 / 514.53 ±5.69 / 521.96 ms │ +1.27x faster │
│ QQuery 4  │     382.30 / 478.86 ±50.61 / 522.66 ms │     341.16 / 343.59 ±2.52 / 346.66 ms │ +1.39x faster │
│ QQuery 5  │  1094.76 / 1119.98 ±14.85 / 1136.27 ms │  989.92 / 1035.74 ±30.16 / 1083.34 ms │ +1.08x faster │
│ QQuery 6  │      134.61 / 137.54 ±3.24 / 143.63 ms │     132.58 / 135.76 ±4.78 / 145.26 ms │     no change │
│ QQuery 7  │   1529.12 / 1544.02 ±8.52 / 1554.66 ms │ 1352.09 / 1364.35 ±13.84 / 1390.36 ms │ +1.13x faster │
│ QQuery 8  │ 1495.43 / 1983.26 ±252.68 / 2161.10 ms │ 1178.60 / 1195.20 ±16.42 / 1219.19 ms │ +1.66x faster │
│ QQuery 9  │ 1985.32 / 2251.76 ±135.70 / 2348.90 ms │ 1769.17 / 1861.72 ±83.52 / 1962.04 ms │ +1.21x faster │
│ QQuery 10 │      530.10 / 533.74 ±4.88 / 543.33 ms │    496.72 / 511.79 ±15.67 / 531.85 ms │     no change │
│ QQuery 11 │      455.90 / 464.13 ±5.22 / 470.06 ms │     416.63 / 426.94 ±9.64 / 440.31 ms │ +1.09x faster │
│ QQuery 12 │      288.98 / 292.38 ±2.53 / 295.72 ms │     277.24 / 280.50 ±3.22 / 285.47 ms │     no change │
│ QQuery 13 │      366.95 / 373.42 ±4.66 / 379.47 ms │     346.27 / 354.40 ±4.90 / 358.95 ms │ +1.05x faster │
│ QQuery 14 │      195.18 / 198.85 ±2.46 / 202.91 ms │     192.87 / 197.00 ±2.91 / 200.35 ms │     no change │
│ QQuery 15 │      323.95 / 331.40 ±6.53 / 342.87 ms │     319.56 / 326.97 ±6.54 / 339.16 ms │     no change │
│ QQuery 16 │      121.75 / 123.85 ±2.25 / 127.96 ms │     114.45 / 116.88 ±2.91 / 122.43 ms │ +1.06x faster │
│ QQuery 17 │ 1574.15 / 1819.60 ±123.13 / 1892.65 ms │ 1372.85 / 1388.43 ±10.80 / 1402.63 ms │ +1.31x faster │
│ QQuery 18 │  1535.10 / 1560.14 ±19.95 / 1594.45 ms │ 1407.54 / 1451.07 ±36.80 / 1513.73 ms │ +1.08x faster │
│ QQuery 19 │     276.90 / 290.88 ±17.67 / 325.49 ms │    277.65 / 291.86 ±25.11 / 342.04 ms │     no change │
│ QQuery 20 │      451.87 / 457.31 ±4.64 / 464.09 ms │    417.57 / 429.48 ±12.70 / 453.17 ms │ +1.06x faster │
│ QQuery 21 │ 2981.79 / 3226.93 ±156.76 / 3396.23 ms │ 2602.88 / 2639.31 ±24.65 / 2668.51 ms │ +1.22x faster │
│ QQuery 22 │      190.75 / 194.35 ±5.46 / 205.18 ms │     153.72 / 160.32 ±4.83 / 168.41 ms │ +1.21x faster │
└───────────┴────────────────────────────────────────┴───────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary            ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)            │ 18905.31ms │
│ Total Time (sort_redesign)   │ 15844.52ms │
│ Average Time (HEAD)          │   859.33ms │
│ Average Time (sort_redesign) │   720.21ms │
│ Queries Faster               │         15 │
│ Queries Slower               │          0 │
│ Queries with No Change       │          7 │
│ Queries with Failure         │          0 │
└──────────────────────────────┴────────────┘

Resource Usage

tpch10 — base (merge-base)

Metric	Value
Wall time	94.9s
Peak memory	12.8 GiB
Avg memory	8.6 GiB
CPU user	868.4s
CPU sys	74.1s
Peak spill	0 B

tpch10 — branch

Metric	Value
Wall time	79.5s
Peak memory	10.8 GiB
Avg memory	8.0 GiB
CPU user	782.1s
CPU sys	67.4s
Peak spill	0 B

mbutrovich · 2026-04-14T16:37:42Z

So this is showing the improvement afforded by both the ExternalSorter rewrite (which helps lexsort by reducing fan-in) and radix sorting. I will push a commit that defaults radix sort off, run the benchmarks again to get a baseline understanding of the ExternalSorter changes.

This reverts commit 482e72c.

adriangbot · 2026-04-14T16:45:59Z

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4245617152-1237-trv8q 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing sort_redesign (482e72c) to 0143dfe (merge-base) diff using: tpch10
Results will be posted here when complete

adriangbot · 2026-04-14T17:02:07Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

Comparing HEAD and sort_redesign
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                   HEAD ┃                         sort_redesign ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │      369.16 / 371.50 ±1.50 / 373.82 ms │     368.07 / 370.91 ±1.62 / 372.36 ms │     no change │
│ QQuery 2  │      482.08 / 492.17 ±8.28 / 506.22 ms │     433.33 / 444.34 ±5.98 / 450.13 ms │ +1.11x faster │
│ QQuery 3  │     615.12 / 652.70 ±25.94 / 683.44 ms │     505.60 / 511.45 ±3.64 / 516.46 ms │ +1.28x faster │
│ QQuery 4  │     465.10 / 493.57 ±17.87 / 509.76 ms │     338.03 / 340.79 ±2.28 / 343.59 ms │ +1.45x faster │
│ QQuery 5  │  1064.46 / 1096.65 ±28.41 / 1137.47 ms │ 1004.03 / 1053.30 ±30.21 / 1086.10 ms │     no change │
│ QQuery 6  │      132.93 / 137.08 ±6.96 / 150.97 ms │     133.75 / 136.14 ±3.28 / 142.56 ms │     no change │
│ QQuery 7  │  1518.69 / 1545.85 ±32.91 / 1607.84 ms │ 1344.80 / 1370.15 ±25.64 / 1415.11 ms │ +1.13x faster │
│ QQuery 8  │ 1471.85 / 2012.19 ±270.24 / 2155.94 ms │ 1173.34 / 1225.25 ±60.85 / 1308.03 ms │ +1.64x faster │
│ QQuery 9  │ 2026.03 / 2167.95 ±127.83 / 2335.75 ms │ 1738.14 / 1801.47 ±57.88 / 1875.72 ms │ +1.20x faster │
│ QQuery 10 │      519.68 / 533.14 ±8.21 / 545.06 ms │     506.37 / 512.62 ±7.24 / 526.36 ms │     no change │
│ QQuery 11 │      447.97 / 455.81 ±7.10 / 467.81 ms │     415.12 / 426.13 ±6.22 / 433.26 ms │ +1.07x faster │
│ QQuery 12 │      284.55 / 288.58 ±3.24 / 294.03 ms │     274.70 / 281.79 ±4.01 / 286.60 ms │     no change │
│ QQuery 13 │      362.47 / 371.52 ±7.65 / 384.95 ms │     337.67 / 341.43 ±3.52 / 347.69 ms │ +1.09x faster │
│ QQuery 14 │      194.71 / 197.06 ±1.55 / 198.77 ms │     191.35 / 195.28 ±3.56 / 201.28 ms │     no change │
│ QQuery 15 │      324.22 / 331.51 ±5.40 / 341.04 ms │     320.89 / 323.67 ±3.29 / 329.87 ms │     no change │
│ QQuery 16 │      119.57 / 123.10 ±3.64 / 129.71 ms │     114.08 / 114.91 ±0.74 / 116.29 ms │ +1.07x faster │
│ QQuery 17 │ 1562.46 / 1637.28 ±123.05 / 1882.73 ms │  1362.70 / 1369.87 ±4.39 / 1375.48 ms │ +1.20x faster │
│ QQuery 18 │  1520.50 / 1555.74 ±33.25 / 1617.16 ms │ 1375.78 / 1415.36 ±26.77 / 1450.02 ms │ +1.10x faster │
│ QQuery 19 │     278.50 / 290.44 ±14.92 / 318.03 ms │    272.95 / 288.41 ±24.93 / 338.10 ms │     no change │
│ QQuery 20 │      444.79 / 451.07 ±3.45 / 453.90 ms │     436.88 / 443.21 ±5.19 / 448.41 ms │     no change │
│ QQuery 21 │ 2975.83 / 3236.75 ±151.25 / 3419.22 ms │ 2612.11 / 2652.91 ±34.67 / 2702.19 ms │ +1.22x faster │
│ QQuery 22 │      192.70 / 199.78 ±7.65 / 213.77 ms │     151.75 / 161.45 ±8.55 / 176.35 ms │ +1.24x faster │
└───────────┴────────────────────────────────────────┴───────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary            ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)            │ 18641.44ms │
│ Total Time (sort_redesign)   │ 15780.86ms │
│ Average Time (HEAD)          │   847.34ms │
│ Average Time (sort_redesign) │   717.31ms │
│ Queries Faster               │         13 │
│ Queries Slower               │          0 │
│ Queries with No Change       │          9 │
│ Queries with Failure         │          0 │
└──────────────────────────────┴────────────┘

Resource Usage

tpch10 — base (merge-base)

Metric	Value
Wall time	93.6s
Peak memory	11.1 GiB
Avg memory	8.6 GiB
CPU user	865.9s
CPU sys	71.6s
Peak spill	0 B

tpch10 — branch

Metric	Value
Wall time	79.2s
Peak memory	10.9 GiB
Avg memory	8.0 GiB
CPU user	782.3s
CPU sys	66.2s
Peak spill	0 B

mbutrovich · 2026-04-14T17:07:53Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Details

Comparing HEAD and sort_redesign
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                   HEAD ┃                         sort_redesign ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │      369.16 / 371.50 ±1.50 / 373.82 ms │     368.07 / 370.91 ±1.62 / 372.36 ms │     no change │
│ QQuery 2  │      482.08 / 492.17 ±8.28 / 506.22 ms │     433.33 / 444.34 ±5.98 / 450.13 ms │ +1.11x faster │
│ QQuery 3  │     615.12 / 652.70 ±25.94 / 683.44 ms │     505.60 / 511.45 ±3.64 / 516.46 ms │ +1.28x faster │
│ QQuery 4  │     465.10 / 493.57 ±17.87 / 509.76 ms │     338.03 / 340.79 ±2.28 / 343.59 ms │ +1.45x faster │
│ QQuery 5  │  1064.46 / 1096.65 ±28.41 / 1137.47 ms │ 1004.03 / 1053.30 ±30.21 / 1086.10 ms │     no change │
│ QQuery 6  │      132.93 / 137.08 ±6.96 / 150.97 ms │     133.75 / 136.14 ±3.28 / 142.56 ms │     no change │
│ QQuery 7  │  1518.69 / 1545.85 ±32.91 / 1607.84 ms │ 1344.80 / 1370.15 ±25.64 / 1415.11 ms │ +1.13x faster │
│ QQuery 8  │ 1471.85 / 2012.19 ±270.24 / 2155.94 ms │ 1173.34 / 1225.25 ±60.85 / 1308.03 ms │ +1.64x faster │
│ QQuery 9  │ 2026.03 / 2167.95 ±127.83 / 2335.75 ms │ 1738.14 / 1801.47 ±57.88 / 1875.72 ms │ +1.20x faster │
│ QQuery 10 │      519.68 / 533.14 ±8.21 / 545.06 ms │     506.37 / 512.62 ±7.24 / 526.36 ms │     no change │
│ QQuery 11 │      447.97 / 455.81 ±7.10 / 467.81 ms │     415.12 / 426.13 ±6.22 / 433.26 ms │ +1.07x faster │
│ QQuery 12 │      284.55 / 288.58 ±3.24 / 294.03 ms │     274.70 / 281.79 ±4.01 / 286.60 ms │     no change │
│ QQuery 13 │      362.47 / 371.52 ±7.65 / 384.95 ms │     337.67 / 341.43 ±3.52 / 347.69 ms │ +1.09x faster │
│ QQuery 14 │      194.71 / 197.06 ±1.55 / 198.77 ms │     191.35 / 195.28 ±3.56 / 201.28 ms │     no change │
│ QQuery 15 │      324.22 / 331.51 ±5.40 / 341.04 ms │     320.89 / 323.67 ±3.29 / 329.87 ms │     no change │
│ QQuery 16 │      119.57 / 123.10 ±3.64 / 129.71 ms │     114.08 / 114.91 ±0.74 / 116.29 ms │ +1.07x faster │
│ QQuery 17 │ 1562.46 / 1637.28 ±123.05 / 1882.73 ms │  1362.70 / 1369.87 ±4.39 / 1375.48 ms │ +1.20x faster │
│ QQuery 18 │  1520.50 / 1555.74 ±33.25 / 1617.16 ms │ 1375.78 / 1415.36 ±26.77 / 1450.02 ms │ +1.10x faster │
│ QQuery 19 │     278.50 / 290.44 ±14.92 / 318.03 ms │    272.95 / 288.41 ±24.93 / 338.10 ms │     no change │
│ QQuery 20 │      444.79 / 451.07 ±3.45 / 453.90 ms │     436.88 / 443.21 ±5.19 / 448.41 ms │     no change │
│ QQuery 21 │ 2975.83 / 3236.75 ±151.25 / 3419.22 ms │ 2612.11 / 2652.91 ±34.67 / 2702.19 ms │ +1.22x faster │
│ QQuery 22 │      192.70 / 199.78 ±7.65 / 213.77 ms │     151.75 / 161.45 ±8.55 / 176.35 ms │ +1.24x faster │
└───────────┴────────────────────────────────────────┴───────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary            ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)            │ 18641.44ms │
│ Total Time (sort_redesign)   │ 15780.86ms │
│ Average Time (HEAD)          │   847.34ms │
│ Average Time (sort_redesign) │   717.31ms │
│ Queries Faster               │         13 │
│ Queries Slower               │          0 │
│ Queries with No Change       │          9 │
│ Queries with Failure         │          0 │
└──────────────────────────────┴────────────┘

So the big win here appears to be the ExternalSorter refactor reducing merge fan-in: this run has radix sort off, and the speedup is still strong (total time drops from 18641.44 ms to 15780.86 ms, about 1.18x).

This PR is mostly for experimenting anyway, but maybe this is motivation to structure the future work as:

Refactor ExternalSorter to use BatchCoalescer to reduce merge fan-in.
After the radix sort kernel lands in arrow-rs and DataFusion updates to that arrow-rs version, add radix sort support.
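
The fan-in argument can be checked with a little arithmetic. The sketch below is illustrative only: `sorted_runs` is a hypothetical helper, not part of DataFusion, and it assumes the roughly 60M-row lineitem input and the 8192 and 32768 row sizes quoted in the PR description. It ignores partitioning and spilling, which change the run count each merge actually sees.

```rust
/// Hypothetical helper (not a DataFusion API): number of sorted runs produced
/// when `total_rows` rows arrive and each run holds at most `rows_per_run`
/// rows. Uses ceiling division so a partial final run is counted.
fn sorted_runs(total_rows: u64, rows_per_run: u64) -> u64 {
    total_rows.div_ceil(rows_per_run)
}

/// Ratio of run counts between sorting each incoming batch on its own and
/// sorting coalesced batches. Assumed sizes: 8192-row batches versus a
/// 32768-row coalesce target, as quoted in the PR description.
fn fan_in_reduction(total_rows: u64) -> f64 {
    let per_batch = sorted_runs(total_rows, 8_192);
    let coalesced = sorted_runs(total_rows, 32_768);
    per_batch as f64 / coalesced as f64
}
```

With 60,000,000 rows, `sorted_runs` gives 7325 runs at 8192 rows per run, in line with the ~7300 batches cited in the description, and 1832 runs at 32768 rows per run, so the k-way merge sees roughly a quarter as many inputs.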

mbutrovich added 10 commits April 9, 2026 16:47

Bring over apache/arrow-rs/9683, integrate into sorts, add heuristic to choose between sort implementations.

e1e1d09

Merge branch 'main' into arrow_rs_9683

9371dd1

Stash with implementation, need to fix accounting for one test.

fc35b08

Fix more tests.

d822a50

Tests pass.

54be475

Cleanup.

a42ba49

More cleanup.

24dbdb7

More tests.

e83ffc6

More tests.

60fbdfb

Cleanup before pushing.

8e8f774

github-actions bot added the core, sqllogictest, common, execution, and physical-plan labels Apr 13, 2026

mbutrovich added 2 commits April 13, 2026 18:57

Fix CI failures.

aa93a6e

Fix configs.md.

af514fe

github-actions bot added the documentation label Apr 13, 2026

mbutrovich added 3 commits April 13, 2026 19:52

Fix CI failures.

c527553

Avoid radix sort for decimal types for now.

83860c0

Fix information_schema.slt test.

a88eacf

mbutrovich mentioned this pull request Apr 14, 2026

feat(arrow-row): add MSD radix sort kernel for row-encoded keys apache/arrow-rs#9683

Open

mbutrovich added the performance label Apr 14, 2026

Update to latest radix kernel from arrow-rs PR.

175dcf6

gratus00 reviewed Apr 14, 2026

mbutrovich added 5 commits April 14, 2026 09:41

Use lexsort for single columns, radix otherwise. Should help with Q11 regression.

4bdf871

Address some PR feedback.

a1f0193

Address some PR feedback.

bbda50f

Update failing test.

2698b85

Update failing test for realsies.

d5d3ef7

mbutrovich added 2 commits April 14, 2026 11:44

Cleanup.

73bb06b

Fix copy.slt test to show new metrics.

7a75c38

mbutrovich added 2 commits April 14, 2026 12:42

Temporarily default radix to false to get benchmarks.

482e72c

Revert "Temporarily default radix to false to get benchmarks."

96dd0df

This reverts commit 482e72c.

apache deleted a comment from adriangbot Apr 14, 2026

mbutrovich changed the title ~~perf: Optimize ExternalSorter with chunked sort pipeline~~ perf: Optimize ExternalSorter with chunked sort pipeline and radix sort kernel Apr 14, 2026

mbutrovich wants to merge 26 commits into apache:main from mbutrovich:sort_redesign

3 participants

let use_radix_for_this_batch =
    self.use_radix && batch.num_rows() > self.batch_size;
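
A minimal sketch of how a per-batch check like this might sit inside kernel selection. Everything here is hypothetical scaffolding rather than the PR's actual code: `SortKernel`, `SorterConfig`, and `choose_kernel` are invented names, and the multi-column requirement is an assumption taken from the PR description and the "Use lexsort for single columns, radix otherwise" commit. Type-based eligibility (for example the decimal exclusion) is left out.

```rust
/// Which sort kernel to run on one coalesced batch (hypothetical enum).
#[derive(Debug, PartialEq, Eq)]
enum SortKernel {
    Radix,
    Lexsort,
}

/// Hypothetical stand-in for the sorter state; the field names mirror the
/// condition quoted above, the struct itself is not from the PR.
struct SorterConfig {
    use_radix: bool,
    batch_size: usize,
}

impl SorterConfig {
    /// Pick a kernel for a batch with `num_rows` rows sorted on
    /// `num_sort_columns` columns.
    fn choose_kernel(&self, num_rows: usize, num_sort_columns: usize) -> SortKernel {
        // Same shape as the quoted condition: radix must be enabled and the
        // batch must be larger than a single output batch.
        let use_radix_for_this_batch = self.use_radix && num_rows > self.batch_size;
        // Assumption: single-column sorts always take the lexsort path.
        if use_radix_for_this_batch && num_sort_columns > 1 {
            SortKernel::Radix
        } else {
            SortKernel::Lexsort
        }
    }
}
```

Under these assumptions, a 32768-row batch with two sort columns and `batch_size` 8192 selects radix, while a batch of exactly 8192 rows, a single-column sort, or `use_radix: false` falls back to lexsort.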
