Commit ca1d39d
perf: implement convert_to_state for SparkAvg (#21548)
## Which issue does this PR close?
- Part of #17964.
## Rationale for this change
SparkAvg's AvgGroupsAccumulator doesn't implement
supports_convert_to_state (defaults to false), which prevents the
skip-partial-aggregation optimization from kicking in for queries that
use Spark's avg().
I ran into this while benchmarking a Spark Connect engine built on
DataFusion. On TPC-H q17 at SF10, the partial aggregate for
avg(l_quantity) grouped by l_partkey (~2M groups out of 60M rows) was
not triggering skip-aggregation:
| Metric | Without convert_to_state | With convert_to_state |
|--------|-------------------------|-----------------------|
| Partial aggregate memory | 923 MB | 40 MB |
| Partial aggregate elapsed | 4.75s | 109ms |
The skip-aggregation probe (#11627) detects when a partial aggregate
isn't reducing cardinality and falls back to passing rows through as
state directly. This needs convert_to_state so the accumulator can
produce [sum, count] state arrays from raw input. The built-in Avg
already has this (#11734), but it wasn't carried over when SparkAvg was
migrated from Comet in #17871.
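
To illustrate what converting raw input to state means for an average, here is a minimal sketch in plain Rust. It is not the code from this PR: `AvgState` and this free-standing `convert_to_state` are hypothetical names, and `Vec<Option<_>>` stands in for nullable Arrow arrays. It assumes each valid, unfiltered input row becomes a one-row state with `sum = value` and `count = 1`, and that null or filtered-out rows become null state entries.

```rust
/// Per-row partial state for an average, in [sum, count] order.
/// `None` models a null entry in the corresponding Arrow array.
/// (Hypothetical type for illustration; not part of datafusion-spark.)
#[derive(Debug, PartialEq)]
pub struct AvgState {
    pub sums: Vec<Option<f64>>,
    pub counts: Vec<Option<i64>>,
}

/// Turn raw input rows into state rows without doing any grouping.
/// A row is kept only if it is non-null and the optional filter is `Some(true)`.
pub fn convert_to_state(
    values: &[Option<f64>],
    filter: Option<&[Option<bool>]>,
) -> AvgState {
    let mut sums = Vec::with_capacity(values.len());
    let mut counts = Vec::with_capacity(values.len());
    for (i, v) in values.iter().enumerate() {
        // A null or false filter entry drops the row.
        let keep = match filter {
            Some(f) => f[i] == Some(true),
            None => true,
        };
        match (keep, v) {
            (true, Some(x)) => {
                // One input row contributes its value as the sum and a count of 1.
                sums.push(Some(*x));
                counts.push(Some(1));
            }
            _ => {
                // Null input or filtered-out row: both state entries are null.
                sums.push(None);
                counts.push(None);
            }
        }
    }
    AvgState { sums, counts }
}
```

Because every output row is already valid partial state, the final aggregate can merge it as if it came from a real partial aggregation, which is what lets the partial stage stop hashing.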
## What changes are included in this PR?
Adds convert_to_state() and supports_convert_to_state() to
AvgGroupsAccumulator in datafusion-spark.
Follows the same approach as the built-in Avg, adapted for SparkAvg's
differences:
- State order is [sum, count] (vs [count, sum] in the built-in)
- Count type is Int64 (vs UInt64 in the built-in)
- Null handling uses NullBuffer::union directly instead of pulling in
datafusion-functions-aggregate-common as a dep
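
The null-handling point can be sketched as follows. This is a plain-Rust model of combining two optional validity masks, which is my understanding of what `NullBuffer::union` does for Arrow null buffers; `union_validity` and `filter_as_validity` are hypothetical helper names, and `&[bool]` stands in for a real `NullBuffer` (`true` meaning the row is valid).

```rust
/// Combine two optional validity masks (`true` = valid row).
/// `None` means "no nulls at all"; a row is valid in the result only if it
/// is valid in every mask that is present.
pub fn union_validity(a: Option<&[bool]>, b: Option<&[bool]>) -> Option<Vec<bool>> {
    match (a, b) {
        (Some(a), Some(b)) => Some(a.iter().zip(b).map(|(x, y)| *x && *y).collect()),
        (Some(m), None) | (None, Some(m)) => Some(m.to_vec()),
        (None, None) => None,
    }
}

/// Interpret a nullable boolean filter as a validity mask:
/// only `Some(true)` keeps the row; `Some(false)` and null both drop it.
pub fn filter_as_validity(filter: &[Option<bool>]) -> Vec<bool> {
    filter.iter().map(|f| *f == Some(true)).collect()
}
```

The resulting mask would then be applied to both the sum and count state arrays, so a row that is null in the input or rejected by the filter contributes nothing when merged.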
Also cleaned up the fully-qualified arrow::array::BooleanArray
references in update_batch / merge_batch since adding BooleanArray to
the import block triggered the unused_qualifications lint.
## Are these changes tested?
Yes, unit tests covering basic conversion, null propagation, filter
handling, and a roundtrip through merge_batch to verify the converted
state produces correct results end-to-end.
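
The roundtrip property can be sketched in plain Rust as below. This is an illustration of the idea, not the PR's actual tests: `GroupAvg`, `to_state`, and the free functions `update_batch` / `merge_batch` are simplified stand-ins for the accumulator methods, with `Vec<Option<_>>` modelling nullable arrays. The property is that feeding raw values through `update_batch` gives the same per-group state as converting them to state first and then calling `merge_batch`.

```rust
/// Running [sum, count] for one group (hypothetical, simplified).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct GroupAvg {
    pub sum: f64,
    pub count: i64,
}

/// Normal partial aggregation: fold raw values into their groups, skipping nulls.
pub fn update_batch(groups: &mut [GroupAvg], idx: &[usize], values: &[Option<f64>]) {
    for (&g, v) in idx.iter().zip(values) {
        if let Some(x) = v {
            groups[g].sum += x;
            groups[g].count += 1;
        }
    }
}

/// Convert raw values to per-row state: sum = value, count = 1, nulls stay null.
pub fn to_state(values: &[Option<f64>]) -> (Vec<Option<f64>>, Vec<Option<i64>>) {
    let sums = values.to_vec();
    let counts = values.iter().map(|v| v.map(|_| 1i64)).collect();
    (sums, counts)
}

/// Final aggregation: fold state rows into their groups, skipping null state.
pub fn merge_batch(
    groups: &mut [GroupAvg],
    idx: &[usize],
    sums: &[Option<f64>],
    counts: &[Option<i64>],
) {
    for ((&g, s), c) in idx.iter().zip(sums).zip(counts) {
        if let (Some(s), Some(c)) = (s, c) {
            groups[g].sum += s;
            groups[g].count += c;
        }
    }
}
```

With group indices `[0, 1, 0, 1]` and values `[1.0, 10.0, NULL, 30.0]`, both paths should leave group 1 with sum 40.0 and count 2, and group 0 with count 1.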
## Are there any user-facing changes?
No. Queries using avg() through the Spark function registry will
automatically benefit from skip-partial-aggregation on high-cardinality
groupings.
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

1 parent 40b209e commit ca1d39d
3 files changed
Lines changed: 161 additions & 13 deletions