Reduce reallocation costs in high-cardinality GROUP BY #21385
Dandandan wants to merge 1 commit into apache:main
- Combine AvgGroupsAccumulator's separate `counts` and `sums` Vecs into a single `Vec<AvgState>`, halving reallocations and improving cache locality for high-cardinality GROUP BY queries
- Add `reserve(rows.len())` in PrimitiveGroupValueBuilder::vectorized_append to avoid repeated small reallocations during push loops
- Add `reserve(rows.len())` in ByteViewGroupValueBuilder::vectorized_append_inner for the same reason

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
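The `reserve(rows.len())` change listed in the commit message can be illustrated with a minimal sketch. `SketchBuilder` and its `i64` element type are hypothetical stand-ins, not DataFusion's actual builder types; only `Vec::reserve` and `Vec::push` are real standard-library calls.

```rust
/// Hypothetical stand-in for a group value builder (not DataFusion's type).
struct SketchBuilder {
    values: Vec<i64>,
}

impl SketchBuilder {
    /// Appends `input[row]` for every index in `rows`.
    fn vectorized_append(&mut self, input: &[i64], rows: &[usize]) {
        // Make room for all incoming rows up front, so the push loop
        // below does not trigger a reallocation partway through.
        self.values.reserve(rows.len());
        for &row in rows {
            self.values.push(input[row]);
        }
    }
}
```

`Vec::reserve` guarantees capacity for at least `rows.len()` additional elements, so a batch costs at most one reallocation rather than several as the vector grows mid-loop.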
run benchmarks

🤖 Benchmark running (GKE): comparing reduce-reallocation-costs (45c2e44) to c17c87c (merge-base) using clickbench_partitioned
🤖 Benchmark running (GKE): comparing reduce-reallocation-costs (45c2e44) to c17c87c (merge-base) using tpcds
🤖 Benchmark running (GKE): comparing reduce-reallocation-costs (45c2e44) to c17c87c (merge-base) using tpch
🤖 Benchmark completed (GKE): tpcds
🤖 Benchmark completed (GKE): clickbench_partitioned

run benchmark clickbench_partitioned

🤖 Benchmark running (GKE): comparing reduce-reallocation-costs (45c2e44) to c17c87c (merge-base) using clickbench_partitioned
🤖 Benchmark completed (GKE): clickbench_partitioned
Which issue does this PR close?
Performance optimization - no specific issue.
Rationale for this change
Profiling high-cardinality GROUP BY queries (e.g. `GROUP BY WatchID, ClientIP` on ClickBench) shows ~40% of CPU time spent in `memmove`/`realloc`, caused by vector growth in accumulators and group value builders. This PR reduces that overhead.
What changes are included in this PR?
- AvgGroupsAccumulator: combine `counts` and `sums` into a single `Vec<AvgState>`. The AVG accumulator previously maintained two separate `Vec`s (`counts: Vec<u64>` and `sums: Vec<T::Native>`), both of which were resized to `total_num_groups` on every batch. Combining them into a single `Vec<AvgState>` halves the number of reallocations and improves cache locality (the count and sum for the same group are now adjacent in memory).
- PrimitiveGroupValueBuilder: add `reserve(rows.len())` before push loops. `vectorized_append` was pushing elements one by one without pre-reserving capacity, causing repeated small reallocations whenever capacity was exceeded mid-loop.
- ByteViewGroupValueBuilder: add `reserve(rows.len())` before push loops. Same issue as above, for the byte view variant.
Are these changes tested?
Covered by existing aggregate sqllogic tests (all pass).
Are there any user-facing changes?
No. Internal optimization only.
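For readers unfamiliar with the layout change, the combined count/sum state in AvgGroupsAccumulator can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the field names of `AvgState`, the fixed `f64` sum type (the real accumulator is generic over `T::Native`), and the helper functions `update_batch` and `evaluate` are all hypothetical.

```rust
/// Hypothetical per-group state; field names and types are assumptions.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct AvgState {
    count: u64,
    sum: f64,
}

/// Accumulates one batch: `group_indices[i]` is the group of `values[i]`.
fn update_batch(
    states: &mut Vec<AvgState>,
    values: &[f64],
    group_indices: &[usize],
    total_num_groups: usize,
) {
    // A single resize covers both count and sum, where two separate
    // Vecs would each need their own resize (and possible reallocation).
    states.resize(total_num_groups, AvgState::default());
    for (&value, &group) in values.iter().zip(group_indices) {
        // Count and sum of a group sit next to each other in memory.
        let state = &mut states[group];
        state.count += 1;
        state.sum += value;
    }
}

/// Final average per group; groups that received no rows yield `None`.
fn evaluate(states: &[AvgState]) -> Vec<Option<f64>> {
    states
        .iter()
        .map(|s| (s.count > 0).then(|| s.sum / s.count as f64))
        .collect()
}
```

Because each group's `count` and `sum` live in one struct, an update touches a single contiguous region of memory instead of two separate allocations.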