perf: optimize count distinct fast path and sliding count distinct #21782

lyne7-sc wants to merge 10 commits into apache:main from perf-count-distinct-clean
Conversation
🤖 Benchmark completed (GKE): tpch, comparing perf-count-distinct-clean (d71d55b) to merge-base (4bff17e)

🤖 Benchmark completed (GKE): clickbench_partitioned, comparing perf-count-distinct-clean (d71d55b) to merge-base (4bff17e)

🤖 Benchmark completed (GKE): tpcds, comparing perf-count-distinct-clean (d71d55b) to merge-base (4bff17e)
EeshanBembi left a comment:

Thanks for the optimization — the null-free fast path for primitive COUNT(DISTINCT) is a nice win and the benchmark numbers look solid. One correctness concern in the sliding accumulator's retract_batch before merging, plus a couple of smaller suggestions.
```rust
let arr = as_primitive_array::<T>(&values[0])?;
if arr.null_count() == 0 {
    for value in arr.values().iter() {
        if let Some(count) = self.counts.get_mut(value) {
```
The retract path silently ignores values not found in self.counts:

```rust
if let Some(count) = self.counts.get_mut(value) {
    *count -= 1;
    if *count == 0 { self.counts.remove(value); }
}
// values absent from the map: silently skipped
```

In the current window-frame infrastructure the caller guarantees symmetry (every retract matches a prior update), so this never fires in practice. But a silent no-op means bugs in the caller become invisible and corrupt the distinct count without any diagnostic. Please add at least a debug_assert!:

```rust
let count = self.counts.get_mut(value);
debug_assert!(count.is_some(), "retract_batch called for a value not in the accumulator");
if let Some(count) = count {
    *count -= 1;
    if *count == 0 { self.counts.remove(value); }
}
```

Zero cost in release builds, catches invariant violations immediately in debug/test.
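For illustration, here is a minimal, self-contained sketch of the accumulator shape under discussion — a sliding COUNT(DISTINCT) that keeps a value → occurrence-count map so retraction is O(1) per value, with the suggested debug_assert on asymmetric retracts. This is not the DataFusion implementation; the type and method names (`SlidingDistinct`, `update`, `retract`, `distinct`) are made up for the sketch.

```rust
use std::collections::HashMap;

/// Sketch only: sliding COUNT(DISTINCT) over i64 values.
/// The value -> count map lets us retract a value cheaply when it
/// leaves the window, removing the entry once its count hits zero.
#[derive(Default)]
struct SlidingDistinct {
    counts: HashMap<i64, usize>,
}

impl SlidingDistinct {
    fn update(&mut self, v: i64) {
        *self.counts.entry(v).or_insert(0) += 1;
    }

    fn retract(&mut self, v: i64) {
        let count = self.counts.get_mut(&v);
        // Surface caller bugs in debug builds instead of silently skipping.
        debug_assert!(count.is_some(), "retract called for a value not in the accumulator");
        if let Some(count) = count {
            *count -= 1;
            if *count == 0 {
                self.counts.remove(&v);
            }
        }
    }

    fn distinct(&self) -> usize {
        self.counts.len()
    }
}

fn main() {
    let mut acc = SlidingDistinct::default();
    // Window [1, 2, 2]: two distinct values.
    for v in [1, 2, 2] {
        acc.update(v);
    }
    assert_eq!(acc.distinct(), 2);
    // Slide: retract 1, add 3 -> window [2, 2, 3], still two distinct.
    acc.retract(1);
    acc.update(3);
    assert_eq!(acc.distinct(), 2);
    println!("distinct = {}", acc.distinct());
}
```

The map-of-counts design is what makes retraction possible at all: a plain HashSet cannot tell whether the value leaving the window still occurs elsewhere inside it.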
EeshanBembi left a comment:

Two smaller follow-ups (non-blocking):
```rust
let arr = as_primitive_array::<T>(&values[0])?;
arr.iter().for_each(|value| {
    if let Some(value) = value {
        if arr.null_count() > 0 {
```
Nit: a short comment here clarifies why the per-batch null_count check is correct even when earlier batches had nulls:

```rust
// Fast path: no nulls in this batch, skip per-element validity checks.
// Earlier batches with nulls are already handled; the HashSet union is correct
// regardless of insertion order.
if arr.null_count() == 0 { ... }
```
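The per-batch branching can be illustrated with a self-contained sketch. Here a hypothetical `Batch` type stands in for Arrow's PrimitiveArray (which tracks a null count separately from its values buffer); `update_distinct` is a made-up helper showing the fast/slow split, not the PR's actual code.

```rust
use std::collections::HashSet;

// Sketch only: `Batch` stands in for Arrow's PrimitiveArray.
struct Batch {
    values: Vec<i64>,
    validity: Option<Vec<bool>>, // None means the batch has no nulls
}

impl Batch {
    fn null_count(&self) -> usize {
        self.validity
            .as_ref()
            .map(|v| v.iter().filter(|b| !**b).count())
            .unwrap_or(0)
    }
}

fn update_distinct(seen: &mut HashSet<i64>, batch: &Batch) {
    if batch.null_count() == 0 {
        // Fast path: no nulls in this batch, skip per-element validity checks.
        // Earlier batches with nulls are already handled; the HashSet union is
        // correct regardless of insertion order.
        seen.extend(batch.values.iter().copied());
    } else {
        // Slow path: consult the validity mask per element.
        let validity = batch.validity.as_ref().unwrap();
        for (v, valid) in batch.values.iter().zip(validity) {
            if *valid {
                seen.insert(*v);
            }
        }
    }
}

fn main() {
    let mut seen = HashSet::new();
    // Null-free batch takes the fast path.
    update_distinct(&mut seen, &Batch { values: vec![1, 2, 2], validity: None });
    // Batch with a null takes the slow path; the masked slot is skipped.
    update_distinct(
        &mut seen,
        &Batch { values: vec![3, 0], validity: Some(vec![true, false]) },
    );
    assert_eq!(seen.len(), 3); // {1, 2, 3}
    println!("distinct = {}", seen.len());
}
```

Because distinct-set union is order-insensitive, deciding per batch which path to take is safe even when earlier batches contained nulls, which is exactly the point the suggested comment makes.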
EeshanBembi left a comment:

Benchmark coverage suggestion (non-blocking):
The benchmarks only generate Some(...) values. Please consider adding a variant with ~10% nulls to confirm the slow path (the null_count > 0 branch) doesn't regress — that's a common production workload.
Added null benchmarks for the i64 path (10% and 30% nulls) for both regular and sliding count(distinct):

```
group                                                    main_i64_nulls                   pr_i64_nulls
-----                                                    --------------                   ------------
count_distinct i64 80% distinct                          1.90    74.7±2.52µs ? ?/sec      1.00    39.2±2.05µs ? ?/sec
count_distinct i64 80% distinct 10% nulls                1.00    71.4±2.08µs ? ?/sec      1.03    73.5±2.64µs ? ?/sec
count_distinct i64 80% distinct 30% nulls                1.03    62.0±8.24µs ? ?/sec      1.00    60.3±2.06µs ? ?/sec
count_distinct_sliding i64 mid window_1024               1.42  1167.2±29.39µs ? ?/sec     1.00   824.5±27.08µs ? ?/sec
count_distinct_sliding i64 mid window_256                1.47  1238.5±96.19µs ? ?/sec     1.00   841.9±16.41µs ? ?/sec
count_distinct_sliding i64 mid_10pct_nulls window_1024   1.30  1503.1±60.08µs ? ?/sec     1.00  1155.0±56.14µs ? ?/sec
count_distinct_sliding i64 mid_10pct_nulls window_256    1.27  1539.5±34.95µs ? ?/sec     1.00  1213.3±96.65µs ? ?/sec
count_distinct_sliding i64 mid_30pct_nulls window_1024   1.27  1442.2±34.19µs ? ?/sec     1.00  1139.5±91.77µs ? ?/sec
count_distinct_sliding i64 mid_30pct_nulls window_256    1.21  1469.4±31.00µs ? ?/sec     1.00  1212.5±220.52µs ? ?/sec
```

No null-path regression observed compared to main.
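For reference, a deterministic null-injection pattern keeps such benchmarks reproducible run to run. The sketch below (hypothetical helper `gen_with_nulls`, not the PR's benchmark code) shows one way to generate an input with an exact null fraction by blanking every Nth element:

```rust
// Sketch: deterministic benchmark input with an exact null fraction.
// Every `null_every`-th element is None; the rest cycle through a
// bounded value range so the distinct ratio is also controlled.
fn gen_with_nulls(len: usize, null_every: usize) -> Vec<Option<i64>> {
    (0..len)
        .map(|i| {
            if i % null_every == 0 {
                None
            } else {
                Some((i % 800) as i64)
            }
        })
        .collect()
}

fn main() {
    let data = gen_with_nulls(1000, 10);
    let nulls = data.iter().filter(|v| v.is_none()).count();
    assert_eq!(nulls, 100); // exactly 10% nulls
    println!("{} nulls in {} rows", nulls, data.len());
}
```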
@EeshanBembi Thanks for the suggestions, that makes sense — I've updated it accordingly.
Which issue does this PR close?
Rationale for this change
This PR improves two hot paths for count distinct.
For the primitive non-sliding path, it adds a fast path to avoid per-row null handling when the input batch contains no nulls.
For the sliding path, it adds a primitive accumulator and keeps other types on the generic path for now.
What changes are included in this PR?
This PR adds a null-free fast path for primitive COUNT(DISTINCT) and a primitive sliding distinct accumulator for integer types.

Benchmarks
Are these changes tested?
Yes.
Are there any user-facing changes?
No.