Optimize ByteViewGroupValueBuilder vectorized_append #21567

Dandandan wants to merge 8 commits into apache:main from optimize-byteview-vectorized-append
Conversation
… views

Instead of calling make_view to reconstruct views from scratch, reuse the input array's views directly. For inline values (len <= 12) the view is copied as-is. For non-inline values, the data is copied and the view is updated with the new buffer_index/offset using ByteView's builder API.

Also switches the Nulls::None and Nulls::Some branches from for loops to extend, and removes the now-unused do_append_val_inner and ensure_in_progress_big_enough methods.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
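The reuse idea above can be sketched with plain bit manipulation on Arrow's documented u128 view layout (a hypothetical helper for illustration, not DataFusion's actual code): the low 32 bits hold the length; for lengths of at most 12 the value bytes are stored inline, otherwise bits 32..64 hold the prefix, 64..96 the buffer index, and 96..128 the offset.

```rust
// Hypothetical sketch: copy inline views as-is, rewrite buffer_index/offset
// for non-inline views, keeping the already-correct length and prefix.
fn remap_view(view: u128, new_buffer_index: u32, new_offset: u32) -> u128 {
    let len = view as u32;
    if len <= 12 {
        // Inline value: the view is self-contained and can be copied as-is.
        view
    } else {
        // Non-inline: keep length + prefix (low 64 bits), rewrite index/offset.
        (view & u64::MAX as u128)
            | ((new_buffer_index as u128) << 64)
            | ((new_offset as u128) << 96)
    }
}

fn main() {
    // Non-inline view: len = 20, prefix = 0xAABBCCDD, buffer 3, offset 100.
    let v: u128 = 20 | (0xAABB_CCDDu128 << 32) | (3u128 << 64) | (100u128 << 96);
    let r = remap_view(v, 0, 5);
    assert_eq!(r as u64, v as u64); // length + prefix preserved
    assert_eq!((r >> 64) as u32, 0); // buffer_index rewritten
    assert_eq!((r >> 96) as u32, 5); // offset rewritten
    // Inline view (len = 7) is returned unchanged.
    assert_eq!(remap_view(7, 9, 9), 7);
}
```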
Force-pushed 98ca861 to 8b65fee
run benchmark aggregate_vectorized

🤖 Criterion benchmark running (GKE): comparing optimize-byteview-vectorized-append (8b65fee) to 16e578d (merge-base).

Benchmark for this request failed.

run benchmarks

🤖 Benchmark running (GKE): comparing optimize-byteview-vectorized-append (8b65fee) to 16e578d (merge-base) using tpch, tpcds, and clickbench_partitioned.
…ppend

Instead of copying non-inline values one by one into in_progress, pre-calculate the total non-inline bytes, allocate a single buffer, and copy all values into it. For all-inline inputs, skip buffer allocation entirely and just copy the views.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
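The pre-sizing pass above amounts to one scan over the views before any copying happens. A minimal sketch (hypothetical helper name; views are the packed u128s whose low 32 bits hold the length):

```rust
// Hypothetical sketch: sum the lengths of non-inline values (length > 12) so a
// single buffer of exactly that size can be allocated up front, instead of
// growing a buffer while copying row by row.
fn total_non_inline_bytes(views: &[u128]) -> usize {
    views
        .iter()
        .map(|v| *v as u32 as usize) // low 32 bits of a view hold the length
        .filter(|&len| len > 12)
        .sum()
}

fn main() {
    // Lengths 5 and 12 are inline; 20 and 13 spill to a data buffer.
    assert_eq!(total_non_inline_bytes(&[5, 20, 12, 13]), 33);
    // All-inline input needs no data buffer at all.
    assert_eq!(total_non_inline_bytes(&[1, 2, 12]), 0);
}
```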
🤖 Benchmark completed (GKE) for tpcds and clickbench_partitioned.
run benchmarks

🤖 Benchmark running (GKE): comparing optimize-byteview-vectorized-append (ac29b32) to 16e578d (merge-base) using clickbench_partitioned, tpcds, and tpch.

🤖 Benchmark completed (GKE) for clickbench_partitioned and tpcds.

run benchmark clickbench_partitioned

🤖 Benchmark running (GKE): comparing optimize-byteview-vectorized-append (ac29b32) to 16e578d (merge-base) using clickbench_partitioned.

🤖 Benchmark completed (GKE) for clickbench_partitioned.
Instead of allocating a single buffer for all non-inline bytes (which could exceed max_block_size), use the existing in_progress/completed flow that rotates buffers at max_block_size boundaries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
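The rotation described above can be sketched as follows (hypothetical function, assuming in_progress/completed are a plain Vec and a list of sealed blocks):

```rust
// Hypothetical sketch of the in_progress/completed rotation: when the current
// block cannot fit the next value, seal it into `completed` and start a new
// one, so no single block grows past max_block_size.
fn append_bytes(
    in_progress: &mut Vec<u8>,
    completed: &mut Vec<Vec<u8>>,
    data: &[u8],
    max_block_size: usize,
) {
    if !in_progress.is_empty() && in_progress.len() + data.len() > max_block_size {
        completed.push(std::mem::take(in_progress));
    }
    in_progress.extend_from_slice(data);
}

fn main() {
    let (mut cur, mut done) = (Vec::new(), Vec::new());
    append_bytes(&mut cur, &mut done, &[0u8; 5], 8);
    append_bytes(&mut cur, &mut done, &[0u8; 5], 8); // 5 + 5 > 8: rotate
    assert_eq!(done.len(), 1);
    assert_eq!(done[0].len(), 5);
    assert_eq!(cur.len(), 5);
}
```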
run benchmarks

🤖 Benchmark running (GKE): comparing optimize-byteview-vectorized-append (8867a6e) to 16e578d (merge-base) using tpcds, tpch, and clickbench_partitioned.
Instead of copying non-inline data, reuse the input array's data buffers directly by adding them to completed and remapping buffer_index in the views. This eliminates all memcpy for non-inline values.

Add gc_buffers(), which compacts all referenced data into fresh in_progress/completed buffers when reused_buffer_bytes exceeds 2 * max_block_size, preventing unbounded memory growth from holding entire input buffers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
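The GC trigger above boils down to a running byte count against a fixed multiple of the block size. A minimal sketch (hypothetical struct and method names, following the commit's reused_buffer_bytes / max_block_size terminology):

```rust
// Hypothetical sketch: track how many bytes of input buffers are held alive by
// zero-copy reuse, and signal compaction once that exceeds 2 * max_block_size.
struct ReusedBufferTracker {
    reused_buffer_bytes: usize,
    max_block_size: usize,
}

impl ReusedBufferTracker {
    /// Record a reused input buffer; returns true when gc_buffers() should run.
    fn needs_gc_after_reuse(&mut self, buffer_len: usize) -> bool {
        self.reused_buffer_bytes += buffer_len;
        self.reused_buffer_bytes > 2 * self.max_block_size
    }
}

fn main() {
    let mut t = ReusedBufferTracker { reused_buffer_bytes: 0, max_block_size: 1024 };
    assert!(!t.needs_gc_after_reuse(1500)); // 1500 <= 2048: keep reusing
    assert!(t.needs_gc_after_reuse(1000));  // 2500 > 2048: compact
}
```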
🤖 Benchmark completed (GKE) for clickbench_partitioned and tpcds.

run benchmarks

🤖 Benchmark running (GKE): comparing optimize-byteview-vectorized-append (faeabd2) to 16e578d (merge-base) using tpcds, clickbench_partitioned, and tpch.

🤖 Benchmark completed (GKE) for tpcds and clickbench_partitioned.
Instead of always compacting all buffers, GC now checks each buffer individually: if >= 90% of the buffer is referenced, keep it as-is (it already has good locality); otherwise copy only the referenced data into a new compacted buffer.

Also changes the block size strategy:
- Start at 2MB (BYTE_VIEW_INITIAL_BLOCK_SIZE)
- Double on each GC, up to 16MB (BYTE_VIEW_MAX_BLOCK_SIZE)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
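The per-buffer decision and the doubling strategy can be sketched in a few lines (helper names are hypothetical; the constant names come from the commit message above):

```rust
// Hypothetical sketch of the per-buffer GC decision and the doubling block
// size strategy described in the commit message.
const BYTE_VIEW_INITIAL_BLOCK_SIZE: usize = 2 * 1024 * 1024;
const BYTE_VIEW_MAX_BLOCK_SIZE: usize = 16 * 1024 * 1024;

/// Keep a reused buffer as-is when at least 90% of it is referenced by views
/// (integer arithmetic avoids floating point).
fn keep_buffer(referenced_bytes: usize, buffer_len: usize) -> bool {
    referenced_bytes * 10 >= buffer_len * 9
}

/// Double the block size on each GC, capped at the maximum.
fn next_block_size(current: usize) -> usize {
    (current * 2).min(BYTE_VIEW_MAX_BLOCK_SIZE)
}

fn main() {
    assert!(keep_buffer(90, 100));  // 90% referenced: keep
    assert!(!keep_buffer(89, 100)); // below threshold: compact
    let mut size = BYTE_VIEW_INITIAL_BLOCK_SIZE;
    for _ in 0..5 {
        size = next_block_size(size); // 2 -> 4 -> 8 -> 16 -> 16 -> 16 MB
    }
    assert_eq!(size, BYTE_VIEW_MAX_BLOCK_SIZE);
}
```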
run benchmarks
A count of reused buffers (> 10) is a better trigger than total bytes, since each reused buffer is a separate allocation held alive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
🤖 Benchmark running (GKE): comparing optimize-byteview-vectorized-append (9fc0180) to 16e578d (merge-base) using tpch, tpcds, and clickbench_partitioned.
Phase 1: extend views quickly by temporarily referencing the input array's data buffers (no per-row memcpy in the hot loop).

Phase 2: immediately compact the newly added non-inline views by copying their data from the input buffers into our own in_progress/completed buffers.

This gives perfect data locality with no deferred GC needed. Removes the gc_buffers() method, the reused_buffer_count tracking, and the doubling block size strategy, since they are no longer needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In compact_new_views, count the referenced bytes per reused buffer. If >= 90% of a buffer is referenced, keep it as-is in completed (just remap the view indices); only compact buffers with low utilization. This avoids unnecessary copies when most of an input buffer's data is actually used, which is the common case in hash grouping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
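The per-buffer utilization count described here is a single walk over the newly added views (hypothetical helper; the packed layout is Arrow's documented view layout, with the buffer index in bits 64..96 of non-inline views):

```rust
// Hypothetical sketch of the utilization count in compact_new_views: total the
// referenced bytes per reused data buffer, skipping inline views (len <= 12),
// which reference no buffer at all.
fn referenced_bytes_per_buffer(views: &[u128], num_buffers: usize) -> Vec<usize> {
    let mut counts = vec![0usize; num_buffers];
    for v in views {
        let len = *v as u32 as usize;
        if len > 12 {
            counts[(*v >> 64) as u32 as usize] += len;
        }
    }
    counts
}

fn main() {
    let views = [
        20u128,                 // 20 bytes in buffer 0
        13u128 | (1u128 << 64), // 13 bytes in buffer 1
        5u128,                  // inline, references no buffer
        30u128 | (1u128 << 64), // 30 more bytes in buffer 1
    ];
    assert_eq!(referenced_bytes_per_buffer(&views, 2), vec![20, 43]);
}
```

These per-buffer totals feed the 90% keep-or-compact decision above.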
run benchmark clickbench_partitioned clickbench_extended

🤖 Benchmark running (GKE): comparing optimize-byteview-vectorized-append (c6044ad) to 16e578d (merge-base) using clickbench_partitioned and clickbench_extended.

🤖 Benchmark completed (GKE) for tpcds, clickbench_partitioned, and clickbench_extended.
Which issue does this PR close?
N/A - Performance optimization
Rationale for this change
The `vectorized_append_inner` method in `ByteViewGroupValueBuilder` was using `make_view` to reconstruct views from scratch and copying non-inline data byte by byte in a per-row loop. This is unnecessary, since the input array already has correctly formed views with length and prefix.

What changes are included in this PR?

- Use `extend` over views, temporarily referencing input array buffers (no per-row memcpy in the hot loop)
- Drop `make_view`; use the `ByteView` builder API for buffer_index/offset updates
- Add a `copy_from_view` helper for the single-row `append_val_inner`
- Remove `do_append_val_inner`, `ensure_in_progress_big_enough`, and the `make_view` import

Benchmark results
Are these changes tested?
Existing tests cover the functionality. Verified with `cargo check` and `cargo clippy`.

Are there any user-facing changes?

No, this is an internal performance optimization only.
🤖 Generated with Claude Code