Deduplicate non-inline StringView values in GroupValuesColumn #21332
Dandandan wants to merge 1 commit into apache:main
For values > 12 bytes in the vectorized append path, use a hash-based dedup map to avoid storing duplicate string bytes in the buffer. This reduces memory usage when the same string values appear across multiple batches in GROUP BY columns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
run benchmarks
🤖 Benchmark running (GKE). Comparing dedup-stringview-group-values (4413ff4) to 1e93a67 (merge-base) using: clickbench_partitioned
🤖 Benchmark running (GKE). Comparing dedup-stringview-group-values (4413ff4) to 1e93a67 (merge-base) using: tpch
🤖 Benchmark running (GKE). Comparing dedup-stringview-group-values (4413ff4) to 1e93a67 (merge-base) using: tpcds
🤖 Benchmark completed (GKE). Resource usage: tpch, base (merge-base) vs. branch
🤖 Benchmark completed (GKE). Resource usage: tpcds, base (merge-base) vs. branch
🤖 Benchmark completed (GKE). Resource usage: clickbench_partitioned, base (merge-base) vs. branch

Which issue does this PR close?
Closes #.
Rationale for this change
In multi-column GROUP BY, ByteViewGroupValueBuilder stores every non-inline string value (>12 bytes) by copying the full bytes into its buffer, even when the same value has already been stored. For columns with low cardinality (e.g., state, country), this leads to significant memory waste.
What changes are included in this PR?
Adds a per-builder hash-based dedup map (HashMap<u64, u128>) that maps value hashes to their views. In the vectorized append path (the hot path), before copying bytes for a non-inline value, we check if a value with the same hash was already appended. If so, we reuse the existing view, skipping the byte copy entirely.
Key details:
- … append_val is unchanged)
- … take_n since buffer indices shift
- … append_value_to_buffer() helper to avoid code duplication
Are these changes tested?
Yes, added test_byte_view_vectorized_append_dedup, which verifies:
Are there any user-facing changes?
No — this is an internal optimization. Query results are unchanged.
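For reviewers, the dedup lookup described under "What changes are included in this PR?" can be sketched roughly as follows. This is a simplified, standalone illustration and not the code in this PR: `DedupViewBuilder`, `inline_view`, `buffer_view`, and `stored_bytes` are hypothetical names, a single data buffer (index 0) is assumed, and the byte comparison on a hash hit is an assumption added here to guard against hash collisions. The view layout follows the Arrow view format (4-byte length, then either up to 12 inline bytes or a 4-byte prefix, buffer index, and offset).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Values up to this many bytes are stored inline in the 16-byte view.
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical, simplified stand-in for a view builder with one data buffer.
#[derive(Default)]
pub struct DedupViewBuilder {
    buffer: Vec<u8>,           // bytes of non-inline values
    views: Vec<u128>,          // one view per appended value
    dedup: HashMap<u64, u128>, // value hash -> view of the stored copy
}

fn hash_bytes(value: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// View for a value of at most 12 bytes: length followed by the bytes.
fn inline_view(value: &[u8]) -> u128 {
    let mut raw = [0u8; 16];
    raw[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    raw[4..4 + value.len()].copy_from_slice(value);
    u128::from_le_bytes(raw)
}

/// View for a longer value: length, 4-byte prefix, buffer index, offset.
fn buffer_view(value: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut raw = [0u8; 16];
    raw[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    raw[4..8].copy_from_slice(&value[0..4]);
    raw[8..12].copy_from_slice(&buffer_index.to_le_bytes());
    raw[12..16].copy_from_slice(&offset.to_le_bytes());
    u128::from_le_bytes(raw)
}

impl DedupViewBuilder {
    /// Bytes a non-inline view points at (single-buffer assumption).
    fn stored_bytes(&self, view: u128) -> &[u8] {
        let raw = view.to_le_bytes();
        let len = u32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]) as usize;
        let offset = u32::from_le_bytes([raw[12], raw[13], raw[14], raw[15]]) as usize;
        &self.buffer[offset..offset + len]
    }

    pub fn append(&mut self, value: &[u8]) {
        if value.len() <= MAX_INLINE_LEN {
            // Inline values never touch the buffer, so there is nothing to dedup.
            self.views.push(inline_view(value));
            return;
        }
        let hash = hash_bytes(value);
        if let Some(&view) = self.dedup.get(&hash) {
            // Assumption: confirm the bytes match before reusing the view.
            if self.stored_bytes(view) == value {
                self.views.push(view);
                return;
            }
        }
        // First occurrence (or a hash collision): copy the bytes once.
        let offset = self.buffer.len() as u32;
        self.buffer.extend_from_slice(value);
        let view = buffer_view(value, 0, offset);
        // Keep the first view recorded for a hash if there was a collision.
        self.dedup.entry(hash).or_insert(view);
        self.views.push(view);
    }

    pub fn views(&self) -> &[u128] {
        &self.views
    }

    pub fn buffer_len(&self) -> usize {
        self.buffer.len()
    }
}
```

In this sketch, appending the same long string twice yields two identical views while the buffer grows only once; short strings bypass the map entirely.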
🤖 Generated with Claude Code