Commit 3996178
feat: initialize TopK dynamic filter threshold from parquet statistics
Before reading any parquet data, scan row group min/max statistics to
compute an initial threshold for TopK's dynamic filter. This allows
row-level filtering to benefit immediately from the first file opened,
rather than waiting until TopK processes enough rows to build a
threshold organically.
Algorithm (single-column sort):
- DESC LIMIT K: threshold = max(min) across RGs with num_rows >= K
  Filter: col > threshold
- ASC LIMIT K: threshold = min(max) across RGs with num_rows >= K
  Filter: col < threshold
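
The threshold rule above can be sketched as follows. This is a minimal illustration, not the code in this commit: `RowGroupStats`, `Direction`, `initial_threshold`, and `passes` are hypothetical names, and the sort column is assumed to be a non-null `i64`. The strict comparisons in `passes` mirror the filters as written in the commit message.

```rust
/// Hypothetical, simplified view of one row group's statistics for the
/// sort column. Real parquet statistics are typed per column; i64 is an
/// assumption made for this sketch.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    num_rows: usize,
    min: Option<i64>, // None when the statistic is unavailable
    max: Option<i64>,
}

/// Hypothetical stand-in for the sort direction read from sort_options.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Direction {
    Asc,
    Desc,
}

/// Compute an initial threshold from row group statistics alone.
/// Only row groups with at least `k` rows are considered, as in the
/// commit message. Returns None when no eligible row group has the
/// needed statistic, which corresponds to skipping initialization.
fn initial_threshold(row_groups: &[RowGroupStats], k: usize, dir: Direction) -> Option<i64> {
    let eligible = row_groups.iter().filter(|rg| rg.num_rows >= k);
    match dir {
        // DESC LIMIT K: threshold = max(min)
        Direction::Desc => eligible.filter_map(|rg| rg.min).max(),
        // ASC LIMIT K: threshold = min(max)
        Direction::Asc => eligible.filter_map(|rg| rg.max).min(),
    }
}

/// Row-level filter as stated in the commit message:
/// `col > threshold` for DESC, `col < threshold` for ASC.
fn passes(value: i64, threshold: i64, dir: Direction) -> bool {
    match dir {
        Direction::Desc => value > threshold,
        Direction::Asc => value < threshold,
    }
}
```

For example, with K = 10 and row groups whose (num_rows, min, max) are (100, 10, 50), (100, 30, 90), and (5, 80, 95), the third row group is ignored because it has fewer than K rows; the DESC threshold is 30 and the ASC threshold is 50.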
Sort direction is read from sort_options on DynamicFilterPhysicalExpr,
which is now set by SortExec::create_filter() for TopK queries. This
makes the optimization work for ALL TopK queries on parquet, not just
those with sort pushdown.
The DynamicFilterPhysicalExpr is shared across all partitions, so
each file's threshold update is visible to subsequent files globally.
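
The effect of sharing one filter across partitions can be sketched as follows. This is a simplified model under stated assumptions, not the actual `DynamicFilterPhysicalExpr`: the hypothetical `SharedThreshold` holds a bare `i64` for a DESC sort, and `simulate_partitions` stands in for partitions opening files concurrently.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical stand-in for a filter shared by all partitions.
/// It holds only a DESC threshold; None means "not yet initialized".
#[derive(Debug, Default)]
struct SharedThreshold {
    inner: RwLock<Option<i64>>,
}

impl SharedThreshold {
    /// Raise the DESC threshold if `candidate` is tighter than the
    /// current value. A looser candidate never overwrites a tighter one.
    fn tighten_desc(&self, candidate: i64) {
        let mut guard = self.inner.write().unwrap();
        match *guard {
            Some(current) if current >= candidate => {}
            _ => *guard = Some(candidate),
        }
    }

    /// Read the threshold currently visible to every partition.
    fn get(&self) -> Option<i64> {
        *self.inner.read().unwrap()
    }
}

/// Simulate several partitions, each contributing the threshold derived
/// from its own file, and return the value all of them end up sharing.
fn simulate_partitions(per_file_thresholds: Vec<i64>) -> Option<i64> {
    let shared = Arc::new(SharedThreshold::default());
    let handles: Vec<_> = per_file_thresholds
        .into_iter()
        .map(|t| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || shared.tighten_desc(t))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    shared.get()
}
```

Whatever order the partitions run in, the shared value ends at the tightest threshold any file contributed, so a file opened later starts filtering with the best bound seen so far.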
Graceful fallback: skips initialization when sort_options is absent,
statistics are unavailable, column not found, or multi-column sort.
1 parent 88bdaac commit 3996178
3 files changed
Lines changed: 655 additions & 2 deletions
File tree
- datafusion
- datasource-parquet/src
- physical-expr/src/expressions
- physical-plan/src/sorts