Rich-T-kid/Dictionary encoding hash optimize #21589
Rich-T-kid wants to merge 7 commits into apache:main from rich-t-kid/Dictionary-encoding-Hash-optmize
Conversation
|
Currently the CI is breaking because my current emit implementation returns the values directly instead of returning a dictionary-encoded column, and I'm not handling nulls. |
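For context, a dictionary-encoded column maps each row to a key into an array of distinct values, with nulls carried as null keys rather than values. A std-only sketch of that shape (the real code builds an Arrow `DictionaryArray`; the names here are illustrative):

```rust
// Minimal std-only sketch of dictionary encoding with null handling.
// Nulls become `None` keys; they never enter the distinct-values array.
use std::collections::HashMap;

fn dictionary_encode(rows: &[Option<&str>]) -> (Vec<Option<usize>>, Vec<String>) {
    let mut values: Vec<String> = Vec::new();            // distinct values
    let mut index: HashMap<String, usize> = HashMap::new();
    let mut keys: Vec<Option<usize>> = Vec::with_capacity(rows.len());
    for row in rows {
        match row {
            None => keys.push(None),                     // null row -> null key
            Some(s) => {
                let k = *index.entry(s.to_string()).or_insert_with(|| {
                    values.push(s.to_string());
                    values.len() - 1
                });
                keys.push(Some(k));
            }
        }
    }
    (keys, values)
}
```

Emitting `keys` plus `values` (instead of the materialized values themselves) is what keeps the output dictionary-encoded.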
|
run benchmarks |
|
Hi @Rich-T-kid, thanks for the request (#21589 (comment)). Only whitelisted users can trigger benchmarks. Allowed users: Dandandan, Fokko, Jefffrey, Omega359, adriangb, alamb, asubiotto, brunal, buraksenn, cetra3, codephage2020, comphead, erenavsarogullari, etseidl, friendlymatthew, gabotechs, geoffreyclaude, grtlr, haohuaijin, jonathanc-n, kevinjqliu, klion26, kosiew, kumarUjjawal, kunalsinghdadhwal, liamzwbao, mbutrovich, mzabaluev, neilconway, rluvaton, sdf-jkl, timsaucer, xudong963, zhuqi-lucas. File an issue against this benchmark runner |
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger | CPU Details (lscpu) | Comparing rich-t-kid/Dictionary-encoding-Hash-optmize (aa69892) to 37cd3de (merge-base) diff using: clickbench_partitioned | File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger | CPU Details (lscpu) | Comparing rich-t-kid/Dictionary-encoding-Hash-optmize (aa69892) to 37cd3de (merge-base) diff using: tpch | File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger | CPU Details (lscpu) | Comparing rich-t-kid/Dictionary-encoding-Hash-optmize (aa69892) to 37cd3de (merge-base) diff using: tpcds | File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger | Instance: CPU Details (lscpu) | Details
Resource Usage: tpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger | Instance: CPU Details (lscpu) | Details
Resource Usage: tpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger | Instance: CPU Details (lscpu) | Details
Resource Usage: clickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
Wrote some criterion benchmarks, and they reflect what the query benchmarks show. I suspect storing `ScalarValue`s directly in the hash map is causing the performance issues; I need to profile to see the heap allocations. |
|
ScalarValue::try_from_array (38%) — a new ScalarValue heap object is allocated for every single row, so lots of heap allocations are happening on the hot path (_xzm_free). One solution may be to pre-allocate space for the unique-values array and hash map. |
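The pre-allocation idea amounts to reserving capacity before the intern loop so neither container reallocates mid-batch. A hedged std-only sketch (how `expected_distinct` is obtained is left out here; the function name is illustrative):

```rust
use std::collections::HashMap;

// Sketch: reserve the unique-values array and hash map up front so the
// intern loop never reallocates. `expected_distinct` would come from the
// incoming batch (e.g. the dictionary's values length).
fn preallocated_state(expected_distinct: usize) -> (Vec<String>, HashMap<String, usize>) {
    let values: Vec<String> = Vec::with_capacity(expected_distinct);
    let map: HashMap<String, usize> = HashMap::with_capacity(expected_distinct);
    (values, map)
}
```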
Optimization 1 — Hash caching. Instead of hashing each row's value individually, pre-compute hashes for all d distinct values in the values array once per batch and cache them in a Vec indexed by dictionary key. For a batch of n rows with d distinct values this reduces hash computations from O(n) to O(d).

Optimization 2 — Eliminate ScalarValue. Replace HashMap<ScalarValue, usize> with a structure that stores raw hashes and raw string slices pointing directly into the Arrow buffer. This eliminates per-row heap allocation (ScalarValue::try_from_array) and deallocation (drop_in_place), which the profiler shows account for ~60% of intern time combined.

Optimization 3 — Pre-allocate using occupancy. Use dict_array.occupancy().count_set_bits() to determine the number of truly distinct non-null values in the batch upfront and pre-allocate internal storage accordingly. This avoids incremental Vec growth during intern. |
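Optimization 1 can be sketched in std-only Rust (the PR operates on an Arrow DictionaryArray, which this sketch approximates with a keys slice and a values slice; function names are illustrative):

```rust
// Std-only sketch of hash caching: hash each of the d distinct values
// once per batch, then look hashes up by dictionary key for all n rows.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_one(v: &str) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

/// Returns one hash per row while computing only `values.len()` hashes.
fn row_hashes(keys: &[usize], values: &[&str]) -> Vec<u64> {
    // O(d): hash every distinct value once, cached in a Vec indexed by key.
    let cache: Vec<u64> = values.iter().map(|v| hash_one(v)).collect();
    // O(n): per-row work is a plain index lookup, no hashing.
    keys.iter().map(|&k| cache[k]).collect()
}
```

Rows sharing a dictionary key get identical hashes for free, which is exactly what makes the per-batch cache pay off when n ≫ d.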

Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?