update slts

adriangb · adriangb · commit 491d4951c45d · 2026-02-15T07:48:07.000-05:00
diff --git a/datafusion/common/src/config.rs b/datafusion/common/src/config.rs
@@ -754,9 +754,14 @@ config_namespace! {
         /// (reading) Minimum bytes/sec throughput for adaptive filter pushdown.
         /// Filters that achieve at least this throughput (bytes_saved / eval_time)
         /// are promoted to row filters.
-        /// f64::INFINITY (default) = no filters promoted (feature disabled).
+        /// f64::INFINITY = no filters promoted (feature disabled).
         /// 0.0 = all filters pushed as row filters (no adaptive logic).
-        pub filter_pushdown_min_bytes_per_sec: f64, default = f64::INFINITY
+        /// Default: 104857600.0 (100 MB/s) — empirically tuned across
+        /// TPC-H, TPC-DS, and ClickBench benchmarks on an m4 MacBook Pro.
+        /// The optimal value for this setting likely depends on the relative
+        /// cost of CPU vs. IO in your environment, and to some extent the shape
+        /// of your query.
+        pub filter_pushdown_min_bytes_per_sec: f64, default = 104_857_600.0
 
         /// (reading) Correlation ratio threshold for grouping filters.
         /// The ratio is P(A ∧ B) / (P(A) * P(B)):
diff --git a/datafusion/sqllogictest/test_files/information_schema.slt b/datafusion/sqllogictest/test_files/information_schema.slt
@@ -245,8 +245,8 @@ datafusion.execution.parquet.dictionary_page_size_limit 1048576
 datafusion.execution.parquet.enable_page_index true
 datafusion.execution.parquet.encoding NULL
 datafusion.execution.parquet.filter_correlation_threshold 1.5
-datafusion.execution.parquet.filter_pushdown_min_bytes_per_sec inf
-datafusion.execution.parquet.filter_statistics_collection_fraction 0
+datafusion.execution.parquet.filter_pushdown_min_bytes_per_sec 104857600
+datafusion.execution.parquet.filter_statistics_collection_fraction 0.05
 datafusion.execution.parquet.filter_statistics_collection_min_rows 10000
 datafusion.execution.parquet.force_filter_selections false
 datafusion.execution.parquet.max_predicate_cache_size NULL
@@ -387,8 +387,8 @@ datafusion.execution.parquet.dictionary_page_size_limit 1048576 (writing) Sets b
 datafusion.execution.parquet.enable_page_index true (reading) If true, reads the Parquet data page level metadata (the Page Index), if present, to reduce the I/O and number of rows decoded.
 datafusion.execution.parquet.encoding NULL (writing)  Sets default encoding for any column. Valid values are: plain, plain_dictionary, rle, bit_packed, delta_binary_packed, delta_length_byte_array, delta_byte_array, rle_dictionary, and byte_stream_split. These values are not case sensitive. If NULL, uses default parquet writer setting
 datafusion.execution.parquet.filter_correlation_threshold 1.5 (reading) Correlation ratio threshold for grouping filters. The ratio is P(A ∧ B) / (P(A) * P(B)):   1.0 = independent (keep separate for late materialization benefit)   1.5 = filters co-pass 50% more often than chance (default threshold)   2.0 = filters co-pass twice as often as chance (conservative) Higher values = less grouping = more late materialization, more overhead. Lower values = more grouping = less overhead, less late materialization. Set to f64::MAX to disable grouping entirely.
-datafusion.execution.parquet.filter_pushdown_min_bytes_per_sec inf (reading) Minimum bytes/sec throughput for adaptive filter pushdown. Filters that achieve at least this throughput (bytes_saved / eval_time) are promoted to row filters. f64::INFINITY (default) = no filters promoted (feature disabled). 0.0 = all filters pushed as row filters (no adaptive logic).
-datafusion.execution.parquet.filter_statistics_collection_fraction 0 (reading) Fraction of total dataset rows to use for the statistics collection phase. When > 0 and the dataset row count is known, the effective collection threshold is max(min_rows, fraction * total_rows). 0.0 (default) = disabled, use filter_statistics_collection_min_rows only. 0.05 = collect stats on at least 5% of the dataset. Must be in [0.0, 1.0].
+datafusion.execution.parquet.filter_pushdown_min_bytes_per_sec 104857600 (reading) Minimum bytes/sec throughput for adaptive filter pushdown. Filters that achieve at least this throughput (bytes_saved / eval_time) are promoted to row filters. f64::INFINITY = no filters promoted (feature disabled). 0.0 = all filters pushed as row filters (no adaptive logic). Default: 104857600.0 (100 MB/s) — empirically tuned across TPC-H, TPC-DS, and ClickBench benchmarks on an m4 MacBook Pro.
+datafusion.execution.parquet.filter_statistics_collection_fraction 0.05 (reading) Fraction of total dataset rows to use for the statistics collection phase. When > 0 and the dataset row count is known, the effective collection threshold is max(min_rows, fraction * total_rows). 0.0 = disabled, use filter_statistics_collection_min_rows only. 0.05 (default) = collect stats on at least 5% of the dataset. Must be in [0.0, 1.0].
 datafusion.execution.parquet.filter_statistics_collection_min_rows 10000 (reading) Minimum rows of post-scan evaluation before statistics-based optimization activates. During collection, all filters are evaluated as post-scan to gather accurate marginal and joint selectivity statistics. Used for BOTH individual filter effectiveness decisions AND correlation- based grouping. Larger values = more accurate estimates, longer collection. Set to 0 to disable the collection phase entirely.
 datafusion.execution.parquet.force_filter_selections false (reading) Force the use of RowSelections for filter results, when pushdown_filters is enabled. If false, the reader will automatically choose between a RowSelection and a Bitmap based on the number and pattern of selected rows.
 datafusion.execution.parquet.max_predicate_cache_size NULL (reading) The maximum predicate cache size, in bytes. When `pushdown_filters` is enabled, sets the maximum memory used to cache the results of predicate evaluation between filter evaluation and output generation. Decreasing this value will reduce memory usage, but may increase IO and CPU usage. None means use the default parquet reader setting. 0 means no caching.
diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md
@@ -92,7 +92,7 @@ The following configuration settings are available:
 | datafusion.execution.parquet.coerce_int96                               | NULL                      | (reading) If true, parquet reader will read columns of physical type int96 as originating from a different resolution than nanosecond. This is useful for reading data from systems like Spark which stores microsecond resolution timestamps in an int96 allowing it to write values with a larger date range than 64-bit timestamps with nanosecond resolution.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
 | datafusion.execution.parquet.bloom_filter_on_read                       | true                      | (reading) Use any available bloom filters when reading parquet files                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
 | datafusion.execution.parquet.max_predicate_cache_size                   | NULL                      | (reading) The maximum predicate cache size, in bytes. When `pushdown_filters` is enabled, sets the maximum memory used to cache the results of predicate evaluation between filter evaluation and output generation. Decreasing this value will reduce memory usage, but may increase IO and CPU usage. None means use the default parquet reader setting. 0 means no caching.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
-| datafusion.execution.parquet.filter_pushdown_min_bytes_per_sec          | inf                       | (reading) Minimum bytes/sec throughput for adaptive filter pushdown. Filters that achieve at least this throughput (bytes_saved / eval_time) are promoted to row filters. f64::INFINITY (default) = no filters promoted (feature disabled). 0.0 = all filters pushed as row filters (no adaptive logic).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+| datafusion.execution.parquet.filter_pushdown_min_bytes_per_sec          | 104857600                 | (reading) Minimum bytes/sec throughput for adaptive filter pushdown. Filters that achieve at least this throughput (bytes_saved / eval_time) are promoted to row filters. f64::INFINITY = no filters promoted (feature disabled). 0.0 = all filters pushed as row filters (no adaptive logic). Default: 104857600.0 (100 MB/s) — empirically tuned across TPC-H, TPC-DS, and ClickBench benchmarks.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
 | datafusion.execution.parquet.filter_correlation_threshold               | 1.5                       | (reading) Correlation ratio threshold for grouping filters. The ratio is P(A ∧ B) / (P(A) \* P(B)): 1.0 = independent (keep separate for late materialization benefit) 1.5 = filters co-pass 50% more often than chance (default threshold) 2.0 = filters co-pass twice as often as chance (conservative) Higher values = less grouping = more late materialization, more overhead. Lower values = more grouping = less overhead, less late materialization. Set to f64::MAX to disable grouping entirely.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
 | datafusion.execution.parquet.filter_statistics_collection_min_rows      | 10000                     | (reading) Minimum rows of post-scan evaluation before statistics-based optimization activates. During collection, all filters are evaluated as post-scan to gather accurate marginal and joint selectivity statistics. Used for BOTH individual filter effectiveness decisions AND correlation- based grouping. Larger values = more accurate estimates, longer collection. Set to 0 to disable the collection phase entirely.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
 | datafusion.execution.parquet.filter_statistics_collection_fraction      | 0.05                      | (reading) Fraction of total dataset rows to use for the statistics collection phase. When > 0 and the dataset row count is known, the effective collection threshold is max(min_rows, fraction \* total_rows). 0.0 = disabled, use filter_statistics_collection_min_rows only. 0.05 (default) = collect stats on at least 5% of the dataset. Must be in [0.0, 1.0].                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
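
Reviewer note: the two rules documented in this diff (promote a filter when `bytes_saved / eval_time` reaches `filter_pushdown_min_bytes_per_sec`, and collect statistics over `max(min_rows, fraction * total_rows)` rows) can be sketched in plain Rust as below. This is a minimal illustration, not DataFusion's implementation: the function names `should_promote` and `effective_collection_rows`, the use of `Duration`, the handling of a zero evaluation time, and the truncation of the fractional row count are all assumptions made for the example.

```rust
use std::time::Duration;

/// Hypothetical helper (not a DataFusion API): decide whether a filter is
/// promoted to a row filter, following the documented threshold semantics.
fn should_promote(bytes_saved: u64, eval_time: Duration, min_bytes_per_sec: f64) -> bool {
    // f64::INFINITY = no filters promoted (feature disabled).
    // Checked explicitly because `inf >= inf` is true in IEEE 754.
    if min_bytes_per_sec == f64::INFINITY {
        return false;
    }
    // 0.0 = all filters pushed as row filters (no adaptive logic).
    if min_bytes_per_sec <= 0.0 {
        return true;
    }
    let secs = eval_time.as_secs_f64();
    // Assumption: a filter with no measurable cost is promoted only if it saved bytes.
    if secs == 0.0 {
        return bytes_saved > 0;
    }
    // Throughput is bytes_saved / eval_time, compared against the minimum.
    (bytes_saved as f64) / secs >= min_bytes_per_sec
}

/// Hypothetical helper (not a DataFusion API): effective number of rows for
/// the statistics collection phase, max(min_rows, fraction * total_rows).
fn effective_collection_rows(min_rows: u64, fraction: f64, total_rows: Option<u64>) -> u64 {
    match total_rows {
        // The fraction applies only when it is > 0 and the row count is known.
        Some(total) if fraction > 0.0 => {
            // Assumption: the fractional row count is truncated toward zero.
            let by_fraction = (fraction * total as f64) as u64;
            min_rows.max(by_fraction)
        }
        // Otherwise only filter_statistics_collection_min_rows is used.
        _ => min_rows,
    }
}
```

Under these assumptions and the new defaults, a filter that saves 200 MiB in one second of evaluation clears the 100 MiB/s (104857600 bytes/sec) threshold while one that saves 50 MiB in the same time does not, and a dataset with 1,000,000 known rows collects statistics over 50,000 rows rather than the 10,000-row minimum.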