Commit cad3865
fix: correct weight handling in approx_percentile_cont_with_weight (#19941)
The approx_percentile_cont_with_weight function was producing incorrect
results due to wrong weight handling in the TDigest implementation.
Root cause: In TDigest::new_with_centroid(), the count field was
hardcoded to 1 regardless of the actual centroid weight, while the
weight was correctly used in the sum calculation. This mismatch caused
incorrect percentile calculations since estimate_quantile() uses count
to compute the rank.
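
The mismatch described above can be illustrated with a minimal sketch. It is not the DataFusion source: `MiniDigest`, `new_with_centroid_buggy`, `new_with_centroid_fixed`, and `target_rank` are hypothetical names, and the rank formula `q * count` is an assumption based on the statement that `estimate_quantile()` uses count to compute the rank.

```rust
/// Hypothetical, simplified centroid: a mean value and the weight it carries.
#[derive(Debug, Clone, Copy)]
struct Centroid {
    mean: f64,
    weight: f64,
}

/// Hypothetical digest holding only the two fields discussed above.
#[derive(Debug)]
struct MiniDigest {
    sum: f64,   // weighted sum of the values
    count: f64, // total weight; the rank is computed against this
}

/// Pre-fix behavior as described: `sum` honors the weight, `count` does not.
fn new_with_centroid_buggy(c: Centroid) -> MiniDigest {
    MiniDigest { sum: c.mean * c.weight, count: 1.0 }
}

/// Post-fix behavior as described: both fields honor the weight.
fn new_with_centroid_fixed(c: Centroid) -> MiniDigest {
    MiniDigest { sum: c.mean * c.weight, count: c.weight }
}

/// Assumed rank formula: the amount of weight that must lie at or below
/// the answer for quantile `q`.
fn target_rank(d: &MiniDigest, q: f64) -> f64 {
    q * d.count
}
```

With a centroid of weight 4, the buggy constructor reports a total weight of 1, so every rank derived from it is scaled down by a factor of 4 relative to the weights actually stored in the centroids.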
Changes:
- Changed TDigest::count from u64 to f64 to properly support fractional
weights (consistent with ClickHouse's TDigest implementation)
- Fixed new_with_centroid() to use centroid.weight for count
- Updated state_fields() in approx_percentile_cont and approx_median to
use Float64 for the count field
- Added early return in merge_digests() when all centroids have zero
weight to prevent panic
- Updated test expectations to reflect correct weighted percentile
behavior
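
As a rough illustration of why an integer count cannot carry fractional weights, the sketch below accumulates the same weights into a `u64` and into an `f64`. The function names are hypothetical, and the truncating cast is only an assumption about what an integer counter would have to do; the pre-fix code hardcoded the count to 1 rather than casting.

```rust
/// Total weight accumulated in an integer counter.
/// Each fractional weight is truncated toward zero before it is added.
fn total_weight_u64(weights: &[f64]) -> u64 {
    weights.iter().map(|w| *w as u64).sum()
}

/// Total weight accumulated in a floating-point counter.
/// Fractional weights contribute their full value.
fn total_weight_f64(weights: &[f64]) -> f64 {
    weights.iter().sum()
}
```

For weights such as 0.5, 0.25, and 0.25 the integer total is 0 while the floating-point total is 1.0, which is the kind of loss the `f64` count avoids.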
## Which issue does this PR close?
- Closes #19940
## Rationale for this change
The `approx_percentile_cont_with_weight` function produces incorrect
weighted percentile results. The bug is in the TDigest implementation
where `new_with_centroid()` sets `count: 1` regardless of the actual
centroid weight, while the weight is used elsewhere in centroid merging.
This mismatch corrupts the percentile calculation.
## What changes are included in this PR?
- Changed `TDigest::count` from `u64` to `f64` to properly support
fractional weights (consistent with [ClickHouse's TDigest
implementation](https://github.com/ClickHouse/ClickHouse/blob/927af1255adb37ace1b95cc3ec4316553b4cb4b4/src/AggregateFunctions/QuantileTDigest.h#L71-L87))
- Fixed `new_with_centroid()` to use `centroid.weight` for count
- Updated `state_fields()` in `approx_percentile_cont` and
`approx_median` to use `Float64` for the count field
- Added early return in `merge_digests()` when all centroids have zero
weight to prevent panic
- Updated test expectations to reflect correct weighted percentile
behavior
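
The zero-weight guard can be sketched as follows. This is a simplified stand-in, not `merge_digests()` itself: `merge_centroids` is a hypothetical helper that collapses centroids into one, and the exact panic site in the real code is not shown on this page. The point is only that the total weight is checked before anything depends on it.

```rust
/// Hypothetical, simplified centroid: a mean value and the weight it carries.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Centroid {
    mean: f64,
    weight: f64,
}

/// Collapse centroids into a single weighted centroid.
/// Returns `None` when there is no weight to merge (empty input or all
/// weights zero), so the division below never sees a zero total.
fn merge_centroids(centroids: &[Centroid]) -> Option<Centroid> {
    let total: f64 = centroids.iter().map(|c| c.weight).sum();
    if total == 0.0 {
        // Early return: nothing meaningful can be computed from zero weight.
        return None;
    }
    let weighted_sum: f64 = centroids.iter().map(|c| c.mean * c.weight).sum();
    Some(Centroid { mean: weighted_sum / total, weight: total })
}
```

Returning `None` here corresponds loosely to the SQL-level result of NULL when all weights are zero.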
## Are these changes tested?
Yes.
- All existing unit tests in `tdigest.rs` pass (7 tests)
- All SQL logic tests for aggregate functions pass
- Manual testing confirms correct behavior with various weight
distributions (equal weights, heavy low/high values, linear weights,
fractional weights)
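
For manual checks like these, an exact reference can help decide whether an approximate result is plausible. The sketch below is one possible reference, assuming a nearest-rank definition of the weighted percentile; `exact_weighted_percentile` is a hypothetical helper, and a t-digest interpolates between centroids, so its output will generally be close to this value rather than identical.

```rust
/// Exact weighted percentile by nearest rank.
/// `samples` holds `(value, weight)` pairs; `q` is in `[0, 1]`.
/// Returns `None` when there are no samples or the total weight is zero.
fn exact_weighted_percentile(samples: &[(f64, f64)], q: f64) -> Option<f64> {
    let total: f64 = samples.iter().map(|&(_, w)| w).sum();
    if samples.is_empty() || total <= 0.0 {
        return None;
    }

    // Sort by value so cumulative weight grows in value order.
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.0.total_cmp(&b.0));

    // Walk the sorted samples until the cumulative weight reaches the target.
    let target = q * total;
    let mut cumulative = 0.0;
    for &(value, weight) in &sorted {
        cumulative += weight;
        if cumulative >= target {
            return Some(value);
        }
    }

    // Rounding in the running sum may leave it just short of the target.
    sorted.last().map(|&(value, _)| value)
}
```

Calling it with equal weights, with most of the weight on the lowest or highest value, and with fractional weights covers the distributions listed above.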
## Are there any user-facing changes?
Yes, this is a breaking change:
1. Result changes: `approx_percentile_cont_with_weight` now returns
correct weighted percentiles. Queries relying on the previous
(incorrect) behavior will see different results.
2. Serialized state format change: The TDigest state field `count` changes
from `UInt64` to `Float64`. Any existing serialized/checkpointed TDigest
state will be incompatible and cannot be restored.
3. Edge case behavior change: When all weights are zero, the function
now returns NULL instead of the previous undefined behavior.

1 parent f0de02f commit cad3865
4 files changed
Lines changed: 40 additions & 45 deletions
File tree
- datafusion
- functions-aggregate-common/src
- functions-aggregate/src
- sqllogictest/test_files