Commit 491d495

update slts
1 parent 1bd60e9 commit 491d495

3 files changed

Lines changed: 12 additions & 7 deletions

datafusion/common/src/config.rs

Lines changed: 7 additions & 2 deletions
@@ -754,9 +754,14 @@ config_namespace! {
  /// (reading) Minimum bytes/sec throughput for adaptive filter pushdown.
  /// Filters that achieve at least this throughput (bytes_saved / eval_time)
  /// are promoted to row filters.
- /// f64::INFINITY (default) = no filters promoted (feature disabled).
+ /// f64::INFINITY = no filters promoted (feature disabled).
  /// 0.0 = all filters pushed as row filters (no adaptive logic).
- pub filter_pushdown_min_bytes_per_sec: f64, default = f64::INFINITY
+ /// Default: 104857600.0 (100 MB/s) — empirically tuned across
+ /// TPC-H, TPC-DS, and ClickBench benchmarks on an m4 MacBook Pro.
+ /// The optimal value for this setting likely depends on the relative
+ /// cost of CPU vs. IO in your environment, and to some extent the shape
+ /// of your query.
+ pub filter_pushdown_min_bytes_per_sec: f64, default = 104_857_600.0

  /// (reading) Correlation ratio threshold for grouping filters.
  /// The ratio is P(A ∧ B) / (P(A) * P(B)):
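
The doc comment for `filter_pushdown_min_bytes_per_sec` describes a threshold rule: a filter is promoted to a row filter when `bytes_saved / eval_time` is at least the configured value. A minimal sketch of that rule follows. The function name `should_promote`, its signature, and the handling of a zero `eval_time` are assumptions made for illustration, not DataFusion's implementation.

```rust
use std::time::Duration;

/// Default introduced by this commit: 100 MB/s, in bytes per second.
const DEFAULT_MIN_BYTES_PER_SEC: f64 = 104_857_600.0;

/// Hypothetical helper: decides whether a filter should be promoted to a
/// row filter, following the rule stated in the doc comment.
fn should_promote(bytes_saved: u64, eval_time: Duration, min_bytes_per_sec: f64) -> bool {
    // Documented special case: f64::INFINITY disables promotion entirely.
    if min_bytes_per_sec == f64::INFINITY {
        return false;
    }
    // Documented special case: 0.0 pushes every filter down.
    if min_bytes_per_sec <= 0.0 {
        return true;
    }
    let secs = eval_time.as_secs_f64();
    if secs == 0.0 {
        // Assumption: a filter with no measurable cost is worth promoting
        // only if it saved at least one byte.
        return bytes_saved > 0;
    }
    // Throughput as defined in the doc comment: bytes_saved / eval_time.
    (bytes_saved as f64) / secs >= min_bytes_per_sec
}
```

Under this sketch, a filter that saves 200 MiB in one second of evaluation clears the 100 MB/s default, while one that saves 50 MiB in the same time does not.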

datafusion/sqllogictest/test_files/information_schema.slt

Lines changed: 4 additions & 4 deletions
@@ -245,8 +245,8 @@ datafusion.execution.parquet.dictionary_page_size_limit 1048576
  datafusion.execution.parquet.enable_page_index true
  datafusion.execution.parquet.encoding NULL
  datafusion.execution.parquet.filter_correlation_threshold 1.5
- datafusion.execution.parquet.filter_pushdown_min_bytes_per_sec inf
- datafusion.execution.parquet.filter_statistics_collection_fraction 0
+ datafusion.execution.parquet.filter_pushdown_min_bytes_per_sec 104857600
+ datafusion.execution.parquet.filter_statistics_collection_fraction 0.05
  datafusion.execution.parquet.filter_statistics_collection_min_rows 10000
  datafusion.execution.parquet.force_filter_selections false
  datafusion.execution.parquet.max_predicate_cache_size NULL
@@ -387,8 +387,8 @@ datafusion.execution.parquet.dictionary_page_size_limit 1048576 (writing) Sets b
  datafusion.execution.parquet.enable_page_index true (reading) If true, reads the Parquet data page level metadata (the Page Index), if present, to reduce the I/O and number of rows decoded.
  datafusion.execution.parquet.encoding NULL (writing) Sets default encoding for any column. Valid values are: plain, plain_dictionary, rle, bit_packed, delta_binary_packed, delta_length_byte_array, delta_byte_array, rle_dictionary, and byte_stream_split. These values are not case sensitive. If NULL, uses default parquet writer setting
  datafusion.execution.parquet.filter_correlation_threshold 1.5 (reading) Correlation ratio threshold for grouping filters. The ratio is P(A ∧ B) / (P(A) * P(B)): 1.0 = independent (keep separate for late materialization benefit) 1.5 = filters co-pass 50% more often than chance (default threshold) 2.0 = filters co-pass twice as often as chance (conservative) Higher values = less grouping = more late materialization, more overhead. Lower values = more grouping = less overhead, less late materialization. Set to f64::MAX to disable grouping entirely.
- datafusion.execution.parquet.filter_pushdown_min_bytes_per_sec inf (reading) Minimum bytes/sec throughput for adaptive filter pushdown. Filters that achieve at least this throughput (bytes_saved / eval_time) are promoted to row filters. f64::INFINITY (default) = no filters promoted (feature disabled). 0.0 = all filters pushed as row filters (no adaptive logic).
- datafusion.execution.parquet.filter_statistics_collection_fraction 0 (reading) Fraction of total dataset rows to use for the statistics collection phase. When > 0 and the dataset row count is known, the effective collection threshold is max(min_rows, fraction * total_rows). 0.0 (default) = disabled, use filter_statistics_collection_min_rows only. 0.05 = collect stats on at least 5% of the dataset. Must be in [0.0, 1.0].
+ datafusion.execution.parquet.filter_pushdown_min_bytes_per_sec 104857600 (reading) Minimum bytes/sec throughput for adaptive filter pushdown. Filters that achieve at least this throughput (bytes_saved / eval_time) are promoted to row filters. f64::INFINITY = no filters promoted (feature disabled). 0.0 = all filters pushed as row filters (no adaptive logic). Default: 104857600.0 (100 MB/s) — empirically tuned across TPC-H, TPC-DS, and ClickBench benchmarks on an m4 MacBook Pro.
+ datafusion.execution.parquet.filter_statistics_collection_fraction 0.05 (reading) Fraction of total dataset rows to use for the statistics collection phase. When > 0 and the dataset row count is known, the effective collection threshold is max(min_rows, fraction * total_rows). 0.0 = disabled, use filter_statistics_collection_min_rows only. 0.05 (default) = collect stats on at least 5% of the dataset. Must be in [0.0, 1.0].
  datafusion.execution.parquet.filter_statistics_collection_min_rows 10000 (reading) Minimum rows of post-scan evaluation before statistics-based optimization activates. During collection, all filters are evaluated as post-scan to gather accurate marginal and joint selectivity statistics. Used for BOTH individual filter effectiveness decisions AND correlation- based grouping. Larger values = more accurate estimates, longer collection. Set to 0 to disable the collection phase entirely.
  datafusion.execution.parquet.force_filter_selections false (reading) Force the use of RowSelections for filter results, when pushdown_filters is enabled. If false, the reader will automatically choose between a RowSelection and a Bitmap based on the number and pattern of selected rows.
  datafusion.execution.parquet.max_predicate_cache_size NULL (reading) The maximum predicate cache size, in bytes. When `pushdown_filters` is enabled, sets the maximum memory used to cache the results of predicate evaluation between filter evaluation and output generation. Decreasing this value will reduce memory usage, but may increase IO and CPU usage. None means use the default parquet reader setting. 0 means no caching.
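
The description of `filter_statistics_collection_fraction` gives the effective collection threshold as `max(min_rows, fraction * total_rows)` when the fraction is positive and the row count is known. The sketch below works through that formula. The function name `effective_collection_rows` and the choice to round the fractional row count up are assumptions for illustration, not DataFusion's code.

```rust
/// Hypothetical helper: number of rows to evaluate post-scan before
/// statistics-based optimization activates.
fn effective_collection_rows(min_rows: u64, fraction: f64, total_rows: Option<u64>) -> u64 {
    match total_rows {
        // Fraction applies only when it is > 0 and the row count is known.
        Some(total) if fraction > 0.0 => {
            // Assumption: round a fractional row count up to a whole row.
            let by_fraction = (fraction * total as f64).ceil() as u64;
            min_rows.max(by_fraction)
        }
        // Otherwise fall back to filter_statistics_collection_min_rows.
        _ => min_rows,
    }
}
```

With the defaults shown in the diff (`min_rows = 10000`, `fraction = 0.05`), a dataset of 100,000 rows still collects on 10,000 rows, because 5% of the dataset is below the minimum; the fraction only takes over once the dataset exceeds 200,000 rows.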

docs/source/user-guide/configs.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -92,7 +92,7 @@ The following configuration settings are available:
  | datafusion.execution.parquet.coerce_int96 | NULL | (reading) If true, parquet reader will read columns of physical type int96 as originating from a different resolution than nanosecond. This is useful for reading data from systems like Spark which stores microsecond resolution timestamps in an int96 allowing it to write values with a larger date range than 64-bit timestamps with nanosecond resolution. |
  | datafusion.execution.parquet.bloom_filter_on_read | true | (reading) Use any available bloom filters when reading parquet files |
  | datafusion.execution.parquet.max_predicate_cache_size | NULL | (reading) The maximum predicate cache size, in bytes. When `pushdown_filters` is enabled, sets the maximum memory used to cache the results of predicate evaluation between filter evaluation and output generation. Decreasing this value will reduce memory usage, but may increase IO and CPU usage. None means use the default parquet reader setting. 0 means no caching. |
- | datafusion.execution.parquet.filter_pushdown_min_bytes_per_sec | inf | (reading) Minimum bytes/sec throughput for adaptive filter pushdown. Filters that achieve at least this throughput (bytes_saved / eval_time) are promoted to row filters. f64::INFINITY (default) = no filters promoted (feature disabled). 0.0 = all filters pushed as row filters (no adaptive logic). |
+ | datafusion.execution.parquet.filter_pushdown_min_bytes_per_sec | 104857600 | (reading) Minimum bytes/sec throughput for adaptive filter pushdown. Filters that achieve at least this throughput (bytes_saved / eval_time) are promoted to row filters. f64::INFINITY = no filters promoted (feature disabled). 0.0 = all filters pushed as row filters (no adaptive logic). Default: 104857600.0 (100 MB/s) — empirically tuned across TPC-H, TPC-DS, and ClickBench benchmarks. |
  | datafusion.execution.parquet.filter_correlation_threshold | 1.5 | (reading) Correlation ratio threshold for grouping filters. The ratio is P(A ∧ B) / (P(A) \* P(B)): 1.0 = independent (keep separate for late materialization benefit) 1.5 = filters co-pass 50% more often than chance (default threshold) 2.0 = filters co-pass twice as often as chance (conservative) Higher values = less grouping = more late materialization, more overhead. Lower values = more grouping = less overhead, less late materialization. Set to f64::MAX to disable grouping entirely. |
  | datafusion.execution.parquet.filter_statistics_collection_min_rows | 10000 | (reading) Minimum rows of post-scan evaluation before statistics-based optimization activates. During collection, all filters are evaluated as post-scan to gather accurate marginal and joint selectivity statistics. Used for BOTH individual filter effectiveness decisions AND correlation- based grouping. Larger values = more accurate estimates, longer collection. Set to 0 to disable the collection phase entirely. |
  | datafusion.execution.parquet.filter_statistics_collection_fraction | 0.05 | (reading) Fraction of total dataset rows to use for the statistics collection phase. When > 0 and the dataset row count is known, the effective collection threshold is max(min_rows, fraction \* total_rows). 0.0 = disabled, use filter_statistics_collection_min_rows only. 0.05 (default) = collect stats on at least 5% of the dataset. Must be in [0.0, 1.0]. |
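
The `filter_correlation_threshold` row defines the correlation ratio as P(A ∧ B) / (P(A) \* P(B)) and says higher thresholds mean less grouping. The sketch below estimates that ratio from pass counts gathered during the collection phase. The names `correlation_ratio` and `should_group`, the count-based inputs, and the use of `>=` for the comparison are assumptions for illustration; the source does not say how DataFusion computes or compares the ratio.

```rust
/// Hypothetical helper: estimates P(A ∧ B) / (P(A) * P(B)) from row counts.
/// Returns None when the ratio is undefined (no rows, or a filter that
/// never passes).
fn correlation_ratio(rows: u64, pass_a: u64, pass_b: u64, pass_both: u64) -> Option<f64> {
    if rows == 0 || pass_a == 0 || pass_b == 0 {
        return None;
    }
    let n = rows as f64;
    let p_a = pass_a as f64 / n;
    let p_b = pass_b as f64 / n;
    let p_ab = pass_both as f64 / n;
    Some(p_ab / (p_a * p_b))
}

/// Hypothetical helper: group two filters when they co-pass at least
/// `threshold` times as often as independence would predict.
/// Assumption: the comparison is inclusive.
fn should_group(ratio: f64, threshold: f64) -> bool {
    ratio >= threshold
}
```

For two filters that each pass half of 1,000 rows, passing together on 250 rows gives a ratio of 1.0 (independent), which stays below the default threshold of 1.5; passing together on all 500 rows gives 2.0, which would be grouped under this sketch.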
