@@ -387,8 +387,8 @@ datafusion.execution.parquet.dictionary_page_size_limit 1048576 (writing) Sets b
 datafusion.execution.parquet.enable_page_index true (reading) If true, reads the Parquet data page level metadata (the Page Index), if present, to reduce the I/O and number of rows decoded.
 datafusion.execution.parquet.encoding NULL (writing) Sets default encoding for any column. Valid values are: plain, plain_dictionary, rle, bit_packed, delta_binary_packed, delta_length_byte_array, delta_byte_array, rle_dictionary, and byte_stream_split. These values are not case sensitive. If NULL, uses default parquet writer setting
 datafusion.execution.parquet.filter_correlation_threshold 1.5 (reading) Correlation ratio threshold for grouping filters. The ratio is P(A ∧ B) / (P(A) * P(B)): 1.0 = independent (keep separate for late materialization benefit) 1.5 = filters co-pass 50% more often than chance (default threshold) 2.0 = filters co-pass twice as often as chance (conservative) Higher values = less grouping = more late materialization, more overhead. Lower values = more grouping = less overhead, less late materialization. Set to f64::MAX to disable grouping entirely.
-datafusion.execution.parquet.filter_pushdown_min_bytes_per_sec inf (reading) Minimum bytes/sec throughput for adaptive filter pushdown. Filters that achieve at least this throughput (bytes_saved / eval_time) are promoted to row filters. f64::INFINITY (default) = no filters promoted (feature disabled). 0.0 = all filters pushed as row filters (no adaptive logic).
-datafusion.execution.parquet.filter_statistics_collection_fraction 0 (reading) Fraction of total dataset rows to use for the statistics collection phase. When > 0 and the dataset row count is known, the effective collection threshold is max(min_rows, fraction * total_rows). 0.0 (default) = disabled, use filter_statistics_collection_min_rows only. 0.05 = collect stats on at least 5% of the dataset. Must be in [0.0, 1.0].
+datafusion.execution.parquet.filter_pushdown_min_bytes_per_sec 104857600 (reading) Minimum bytes/sec throughput for adaptive filter pushdown. Filters that achieve at least this throughput (bytes_saved / eval_time) are promoted to row filters. f64::INFINITY = no filters promoted (feature disabled). 0.0 = all filters pushed as row filters (no adaptive logic). Default: 104857600.0 (100 MB/s) — empirically tuned across TPC-H, TPC-DS, and ClickBench benchmarks on an m4 MacBook Pro.
+datafusion.execution.parquet.filter_statistics_collection_fraction 0.05 (reading) Fraction of total dataset rows to use for the statistics collection phase. When > 0 and the dataset row count is known, the effective collection threshold is max(min_rows, fraction * total_rows). 0.0 = disabled, use filter_statistics_collection_min_rows only. 0.05 (default) = collect stats on at least 5% of the dataset. Must be in [0.0, 1.0].
 datafusion.execution.parquet.filter_statistics_collection_min_rows 10000 (reading) Minimum rows of post-scan evaluation before statistics-based optimization activates. During collection, all filters are evaluated as post-scan to gather accurate marginal and joint selectivity statistics. Used for BOTH individual filter effectiveness decisions AND correlation-based grouping. Larger values = more accurate estimates, longer collection. Set to 0 to disable the collection phase entirely.
 datafusion.execution.parquet.force_filter_selections false (reading) Force the use of RowSelections for filter results, when pushdown_filters is enabled. If false, the reader will automatically choose between a RowSelection and a Bitmap based on the number and pattern of selected rows.
 datafusion.execution.parquet.max_predicate_cache_size NULL (reading) The maximum predicate cache size, in bytes. When `pushdown_filters` is enabled, sets the maximum memory used to cache the results of predicate evaluation between filter evaluation and output generation. Decreasing this value will reduce memory usage, but may increase IO and CPU usage. None means use the default parquet reader setting. 0 means no caching.
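The options above describe two pieces of arithmetic: the effective statistics-collection threshold, `max(min_rows, fraction * total_rows)`, and the correlation ratio `P(A ∧ B) / (P(A) * P(B))` used to decide filter grouping. A minimal sketch of that arithmetic, assuming the semantics described in the option text (the function names here are illustrative, not DataFusion's internal API):

```rust
/// Effective row threshold for the statistics collection phase,
/// combining filter_statistics_collection_min_rows and
/// filter_statistics_collection_fraction as described above.
fn effective_collection_threshold(min_rows: u64, fraction: f64, total_rows: Option<u64>) -> u64 {
    match total_rows {
        // Row count known and fraction > 0: max(min_rows, fraction * total_rows).
        Some(total) if fraction > 0.0 => min_rows.max((fraction * total as f64) as u64),
        // Otherwise only the min_rows floor applies.
        _ => min_rows,
    }
}

/// Correlation ratio P(A ∧ B) / (P(A) * P(B)) compared against
/// filter_correlation_threshold when grouping two filters.
fn correlation_ratio(p_a: f64, p_b: f64, p_ab: f64) -> f64 {
    p_ab / (p_a * p_b)
}

fn main() {
    // Defaults on a 1M-row dataset: 5% (50_000 rows) beats the 10_000-row floor.
    assert_eq!(effective_collection_threshold(10_000, 0.05, Some(1_000_000)), 50_000);
    // Unknown row count: fall back to the minimum alone.
    assert_eq!(effective_collection_threshold(10_000, 0.05, None), 10_000);

    // Two filters that each pass 40% of rows and co-pass 24%:
    // 0.24 / (0.4 * 0.4) = 1.5, exactly the default grouping threshold.
    let r = correlation_ratio(0.4, 0.4, 0.24);
    assert!((r - 1.5).abs() < 1e-9);
}
```

With the default threshold of 1.5, filters at or above this ratio co-pass often enough that grouping them saves more evaluation overhead than late materialization would recover.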
docs/source/user-guide/configs.md: 1 addition & 1 deletion
@@ -92,7 +92,7 @@ The following configuration settings are available:
 | datafusion.execution.parquet.coerce_int96 | NULL | (reading) If true, parquet reader will read columns of physical type int96 as originating from a different resolution than nanosecond. This is useful for reading data from systems like Spark which stores microsecond resolution timestamps in an int96 allowing it to write values with a larger date range than 64-bit timestamps with nanosecond resolution. |
 | datafusion.execution.parquet.bloom_filter_on_read | true | (reading) Use any available bloom filters when reading parquet files |
 | datafusion.execution.parquet.max_predicate_cache_size | NULL | (reading) The maximum predicate cache size, in bytes. When `pushdown_filters` is enabled, sets the maximum memory used to cache the results of predicate evaluation between filter evaluation and output generation. Decreasing this value will reduce memory usage, but may increase IO and CPU usage. None means use the default parquet reader setting. 0 means no caching. |
-| datafusion.execution.parquet.filter_pushdown_min_bytes_per_sec | inf | (reading) Minimum bytes/sec throughput for adaptive filter pushdown. Filters that achieve at least this throughput (bytes_saved / eval_time) are promoted to row filters. f64::INFINITY (default) = no filters promoted (feature disabled). 0.0 = all filters pushed as row filters (no adaptive logic). |
+| datafusion.execution.parquet.filter_pushdown_min_bytes_per_sec | 104857600 | (reading) Minimum bytes/sec throughput for adaptive filter pushdown. Filters that achieve at least this throughput (bytes_saved / eval_time) are promoted to row filters. f64::INFINITY = no filters promoted (feature disabled). 0.0 = all filters pushed as row filters (no adaptive logic). Default: 104857600.0 (100 MB/s) — empirically tuned across TPC-H, TPC-DS, and ClickBench benchmarks. |
 | datafusion.execution.parquet.filter_correlation_threshold | 1.5 | (reading) Correlation ratio threshold for grouping filters. The ratio is P(A ∧ B) / (P(A) \* P(B)): 1.0 = independent (keep separate for late materialization benefit) 1.5 = filters co-pass 50% more often than chance (default threshold) 2.0 = filters co-pass twice as often as chance (conservative) Higher values = less grouping = more late materialization, more overhead. Lower values = more grouping = less overhead, less late materialization. Set to f64::MAX to disable grouping entirely. |
 | datafusion.execution.parquet.filter_statistics_collection_min_rows | 10000 | (reading) Minimum rows of post-scan evaluation before statistics-based optimization activates. During collection, all filters are evaluated as post-scan to gather accurate marginal and joint selectivity statistics. Used for BOTH individual filter effectiveness decisions AND correlation-based grouping. Larger values = more accurate estimates, longer collection. Set to 0 to disable the collection phase entirely. |
 | datafusion.execution.parquet.filter_statistics_collection_fraction | 0.05 | (reading) Fraction of total dataset rows to use for the statistics collection phase. When > 0 and the dataset row count is known, the effective collection threshold is max(min_rows, fraction \* total_rows). 0.0 = disabled, use filter_statistics_collection_min_rows only. 0.05 (default) = collect stats on at least 5% of the dataset. Must be in [0.0, 1.0]. |
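The promotion rule behind `filter_pushdown_min_bytes_per_sec` can be sketched as follows. This is a hypothetical illustration of the throughput test described in the option text, not DataFusion's implementation; `promote_to_row_filter` is an invented name:

```rust
/// Default threshold from the diff above: 104857600.0 bytes/sec (100 MB/s).
const MIN_BYTES_PER_SEC: f64 = 104_857_600.0;

/// A filter is promoted to a row filter only if the I/O it saves,
/// divided by the time spent evaluating it, meets the threshold.
fn promote_to_row_filter(bytes_saved: u64, eval_time_secs: f64) -> bool {
    let throughput = bytes_saved as f64 / eval_time_secs;
    throughput >= MIN_BYTES_PER_SEC
}

fn main() {
    // Saving 300 MiB for 1 s of evaluation clears 100 MB/s: promoted.
    assert!(promote_to_row_filter(300 * 1024 * 1024, 1.0));
    // Saving 10 MiB at the same cost does not: evaluated post-scan instead.
    assert!(!promote_to_row_filter(10 * 1024 * 1024, 1.0));
}
```

Setting the option to `f64::INFINITY` makes the comparison unsatisfiable (no filters promoted), while `0.0` makes it always true (all filters pushed down), matching the two extremes documented above.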