Skip to content

Commit 9d6b67b

Browse files
committed
fix: stats init only safe for sorted (non-overlapping) RGs
The max(min)/min(max) algorithm is only a valid threshold bound when RGs are non-overlapping (guaranteed by sorted data with sort pushdown). For overlapping RGs, top-K values may span multiple RGs and the threshold can over-prune, producing fewer results than expected. Keep stats init restricted to sort pushdown path. Keep fuzz test tiebreaker fix (independently correct).
1 parent 80badc0 commit 9d6b67b

1 file changed

Lines changed: 9 additions & 7 deletions

File tree

datafusion/datasource-parquet/src/opener.rs

Lines changed: 9 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -834,13 +834,9 @@ impl MetadataLoadedParquetOpen {
834834
}
835835
prepared.physical_file_schema = Arc::clone(&physical_file_schema);
836836

837-
// Initialize TopK dynamic filter threshold from row group statistics
838-
// BEFORE building the pruning predicate. The PruningPredicate compiles
839-
// the expression at build time, so the DynamicFilterPhysicalExpr must
840-
// already have the threshold set for pruning to be effective.
841-
// Only when sort pushdown is active (sort_order_for_reorder set).
842-
// Stats init threshold may over-prune for non-sorted data where
843-
// max(min) across RGs can exceed the actual Kth value.
837+
// For sort pushdown path: initialize TopK threshold BEFORE building
838+
// PruningPredicate so that the compiled predicate includes the threshold.
839+
// This is safe because sort pushdown data is sorted and non-overlapping.
844840
if prepared.sort_order_for_reorder.is_some() {
845841
let file_metadata = reader_metadata.metadata();
846842
let rg_metadata = file_metadata.row_groups();
@@ -969,6 +965,12 @@ impl FiltersPreparedParquetOpen {
969965
.add_matched(remaining);
970966
}
971967

968+
// Note: stats init for non-sort-pushdown TopK is not safe because
969+
// max(min)/min(max) is only a valid threshold bound when RGs are
970+
// non-overlapping (sorted data). For overlapping RGs, the top-K
971+
// values may span multiple RGs and the threshold can over-prune.
972+
// Future: use a safer algorithm (e.g. sum of qualifying rows).
973+
972974
Ok(RowGroupsPrunedParquetOpen {
973975
prepared: self,
974976
row_groups,

0 commit comments

Comments
 (0)