Commit 80badc0

fix: restrict stats init to sort pushdown path to avoid over-pruning
Stats init with a max(min) threshold can over-prune non-sorted data: the threshold may exceed the actual Kth value when the rows are distributed across multiple row groups (RGs). This caused output_rows=0 in the explain_analyze tests. Restrict stats init to the sort pushdown path, where data ordering guarantees that the threshold is a valid lower bound. Keep the fuzz test tiebreaker fix, as it is independently correct (SQL does not guarantee tie-breaking order).
1 parent 18939c9 commit 80badc0

1 file changed

Lines changed: 4 additions & 4 deletions

datafusion/datasource-parquet/src/opener.rs

@@ -838,10 +838,10 @@ impl MetadataLoadedParquetOpen {
 // BEFORE building the pruning predicate. The PruningPredicate compiles
 // the expression at build time, so the DynamicFilterPhysicalExpr must
 // already have the threshold set for pruning to be effective.
-// Works for ALL TopK queries. The filter is null-aware (NULLS FIRST
-// adds `col IS NULL OR ...`) to avoid incorrectly pruning RGs with
-// NULL values that belong in the result.
-{
+// Only when sort pushdown is active (sort_order_for_reorder set).
+// Stats init threshold may over-prune for non-sorted data where
+// max(min) across RGs can exceed the actual Kth value.
+if prepared.sort_order_for_reorder.is_some() {
     let file_metadata = reader_metadata.metadata();
     let rg_metadata = file_metadata.row_groups();
     let topk_limit = prepared.limit.unwrap_or(1);
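
The over-pruning described in the commit message can be reproduced with a small standalone sketch. This is not the code in opener.rs: `RowGroup`, `max_of_mins`, `kth_largest`, and `surviving` are hypothetical names, and the sketch assumes a descending TopK (`ORDER BY v DESC LIMIT k`) whose threshold is seeded with the largest row-group minimum and applied as `v >= threshold`.

```rust
/// Hypothetical stand-in for a Parquet row group; only the values matter here.
struct RowGroup {
    values: Vec<i64>,
}

impl RowGroup {
    /// Minimum statistic of the row group (assumes it is non-empty).
    fn stat_min(&self) -> i64 {
        *self.values.iter().min().expect("non-empty row group")
    }

    /// Maximum statistic of the row group (assumes it is non-empty).
    fn stat_max(&self) -> i64 {
        *self.values.iter().max().expect("non-empty row group")
    }
}

/// Assumed stats-init threshold: the largest minimum across all row groups.
fn max_of_mins(rgs: &[RowGroup]) -> Option<i64> {
    rgs.iter().map(RowGroup::stat_min).max()
}

/// Exact Kth largest value over all rows, i.e. the true TopK boundary.
fn kth_largest(rgs: &[RowGroup], k: usize) -> Option<i64> {
    let mut all: Vec<i64> = rgs
        .iter()
        .flat_map(|rg| rg.values.iter().copied())
        .collect();
    all.sort_unstable_by(|a, b| b.cmp(a)); // descending
    all.get(k.checked_sub(1)?).copied()
}

/// Indices of row groups that survive pruning with `v >= threshold`.
fn surviving(rgs: &[RowGroup], threshold: i64) -> Vec<usize> {
    rgs.iter()
        .enumerate()
        .filter(|(_, rg)| rg.stat_max() >= threshold)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Non-sorted layout: the row group with the highest minimum holds only
    // two rows, so the third-largest value lives in the other row group.
    let rgs = vec![
        RowGroup { values: vec![10, 11] },
        RowGroup { values: vec![1, 2, 3] },
    ];
    let k = 3;

    let threshold = max_of_mins(&rgs).unwrap();
    let kth = kth_largest(&rgs, k).unwrap();
    let kept = surviving(&rgs, threshold);

    // The threshold (10) exceeds the true Kth value (3), so the second row
    // group is pruned even though it contains a row that belongs in the result.
    assert!(threshold > kth);
    assert_eq!(kept, vec![0]);
    println!("threshold={threshold}, kth={kth}, surviving row groups={kept:?}");
}
```

In this layout the true top 3 are 11, 10, and 3, but only the first row group survives the seeded threshold, so the query would return two rows instead of three.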
