Commit f0c5306

docs(optimizer): add generated optimizer rules reference (#21824)
## Which issue does this PR close?

- Closes #21771.

## Rationale for this change

There is no reference page that lists the built-in analyzer, logical optimizer, and physical optimizer rules. This PR adds that missing reference.

## What changes are included in this PR?

- Add a generated optimizer rules reference page.
- Add a renderer and generator binary for the optimizer rules docs.
- Keep the rule documentation metadata private in core so this does not add new public optimizer APIs.
- Link the new page from the query optimizer guide and the docs index.
- Add tests that check the documented rule order matches the default analyzer, logical optimizer, and physical optimizer pipelines.
- Fix the `eliminate_join` description so it matches the actual rule behavior.

## Are these changes tested?

Yes

## Are there any user-facing changes?

No Public API Change

---------

Co-authored-by: Oleks V <comphead@users.noreply.github.com>
1 parent 66980e2 commit f0c5306

5 files changed

Lines changed: 187 additions & 1 deletion

datafusion/core/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@
 name = "datafusion"
 description = "DataFusion is an in-memory query engine that uses Apache Arrow as the memory model"
 keywords = ["arrow", "query", "sql"]
-include = ["benches/*.rs", "src/**/*.rs", "Cargo.toml", "LICENSE.txt", "NOTICE.txt"]
+include = ["benches/*.rs", "src/**/*.md", "src/**/*.rs", "Cargo.toml", "LICENSE.txt", "NOTICE.txt"]
 readme = "../../README.md"
 version = { workspace = true }
 edition = { workspace = true }

datafusion/core/src/lib.rs

Lines changed: 4 additions & 0 deletions
@@ -761,6 +761,7 @@
 //! [`RecordBatch`]: arrow::array::RecordBatch
 //! [`RecordBatchReader`]: arrow::record_batch::RecordBatchReader
 //! [`Array`]: arrow::array::Array
+#![doc = include_str!("optimizer_rule_reference.md")]
 
 extern crate core;
 #[cfg(feature = "sql")]
@@ -786,6 +787,9 @@ pub use parquet;
 #[cfg(feature = "avro")]
 pub use datafusion_datasource_avro::arrow_avro;
 
+#[cfg(test)]
+mod optimizer_rule_reference;
+
 // re-export DataFusion sub-crates at the top level. Use `pub use *`
 // so that the contents of the subcrates appears in rustdocs
 // for details, see https://github.com/apache/datafusion/issues/6648

datafusion/core/src/optimizer_rule_reference.md

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
+<!---
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Built-in Optimizer Rules
+
+DataFusion applies a default analyzer, logical optimizer, and physical
+optimizer pipeline.
+
+The rule names listed here match the names shown by `EXPLAIN VERBOSE`.
+
+Rule order matters. The default pipeline may change between releases.
+
+### Analyzer Rules
+
+| order | rule | summary |
+| ----- | --------------------------- | --------------------------------------------------------------------------------------- |
+| 1 | `resolve_grouping_function` | Rewrites `GROUPING(...)` calls into expressions over DataFusion's internal grouping id. |
+| 2 | `type_coercion` | Adds implicit casts so operators and functions receive valid input types. |
+
+### Logical Optimizer Rules
+
+| order | rule | summary |
+| ----- | ----------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- |
+| 1 | `rewrite_set_comparison` | Rewrites `ANY` and `ALL` set-comparison subqueries into `EXISTS`-based boolean expressions with correct SQL NULL semantics. |
+| 2 | `optimize_unions` | Flattens nested unions and removes unions with a single input. |
+| 3 | `simplify_expressions` | Constant-folds and simplifies expressions while preserving output names. |
+| 4 | `replace_distinct_aggregate` | Rewrites `DISTINCT` and `DISTINCT ON` operators into aggregate-based plans that later rules can optimize further. |
+| 5 | `eliminate_join` | Replaces keyless inner joins with a literal `false` filter by an empty relation. |
+| 6 | `decorrelate_predicate_subquery` | Converts eligible `IN` and `EXISTS` predicate subqueries into semi or anti joins. |
+| 7 | `scalar_subquery_to_join` | Rewrites eligible scalar subqueries into joins and adds schema-preserving projections. |
+| 8 | `decorrelate_lateral_join` | Rewrites eligible lateral joins into regular joins. |
+| 9 | `extract_equijoin_predicate` | Splits join filters into equijoin keys and residual predicates. |
+| 10 | `eliminate_duplicated_expr` | Removes duplicate expressions from projections, aggregates, and similar operators. |
+| 11 | `eliminate_filter` | Drops always-true filters and replaces always-false or NULL filters with empty relations. |
+| 12 | `eliminate_cross_join` | Uses filter predicates to replace cross joins with inner joins when join keys can be found. |
+| 13 | `eliminate_limit` | Removes no-op limits and simplifies trivial limit shapes. |
+| 14 | `propagate_empty_relation` | Pushes empty-relation knowledge upward so operators fed by no rows collapse early. |
+| 15 | `filter_null_join_keys` | Adds `IS NOT NULL` filters to nullable equijoin keys that can never match. |
+| 16 | `eliminate_outer_join` | Rewrites outer joins to inner joins when later filters reject the NULL-extended rows. |
+| 17 | `push_down_limit` | Moves literal limits closer to scans and unions and merges adjacent limits. |
+| 18 | `push_down_filter` | Moves filters as early as possible through filter-commutative operators. |
+| 19 | `single_distinct_aggregation_to_group_by` | Rewrites single-column `DISTINCT` aggregations into two-stage `GROUP BY` plans. |
+| 20 | `eliminate_group_by_constant` | Removes constant or functionally redundant expressions from `GROUP BY`. |
+| 21 | `common_sub_expression_eliminate` | Computes repeated subexpressions once and reuses the result. |
+| 22 | `extract_leaf_expressions` | Pulls cheap leaf expressions closer to data sources so later pruning and filter rules can act earlier. |
+| 23 | `push_down_leaf_projections` | Pushes the helper projections created by leaf extraction toward leaf inputs. |
+| 24 | `optimize_projections` | Prunes unused columns and removes unnecessary logical projections. |
+
+### Physical Optimizer Rules
+
+The same rule name may appear more than once when the default pipeline runs it
+in multiple phases.
+
+| order | rule | phase | summary |
+| ----- | ------------------------------ | ----------------------- | ------------------------------------------------------------------------------------------------------------ |
+| 1 | `OutputRequirements` | add phase | Adds helper nodes so output requirements survive later physical rewrites. |
+| 2 | `aggregate_statistics` | - | Uses exact source statistics to answer some aggregates without scanning data. |
+| 3 | `join_selection` | - | Chooses join implementation, build side, and partition mode from statistics and stream properties. |
+| 4 | `LimitedDistinctAggregation` | - | Pushes limit hints into grouped distinct-style aggregations when only a small result is needed. |
+| 5 | `FilterPushdown` | pre-optimization phase | Pushes supported physical filters down toward data sources before distribution and sorting are enforced. |
+| 6 | `EnforceDistribution` | - | Adds repartitioning only where needed to satisfy physical distribution requirements. |
+| 7 | `CombinePartialFinalAggregate` | - | Collapses adjacent partial and final aggregates when the distributed shape makes them redundant. |
+| 8 | `EnforceSorting` | - | Adds or removes local sorts to satisfy required input orderings. |
+| 9 | `OptimizeAggregateOrder` | - | Updates aggregate expressions to use the best ordering once sort requirements are known. |
+| 10 | `WindowTopN` | - | Replaces eligible row-number window and filter patterns with per-partition TopK execution. |
+| 11 | `ProjectionPushdown` | early pass | Pushes projections toward inputs before later physical rewrites add more limit and TopK structure. |
+| 12 | `OutputRequirements` | remove phase | Removes the temporary output-requirement helper nodes after requirement-sensitive planning is done. |
+| 13 | `LimitAggregation` | - | Passes a limit hint into eligible aggregations so they can keep fewer accumulator buckets. |
+| 14 | `LimitPushPastWindows` | - | Pushes fetch limits through bounded window operators when doing so keeps the result correct. |
+| 15 | `HashJoinBuffering` | - | Adds buffering on the probe side of hash joins so probing can start before build completion. |
+| 16 | `LimitPushdown` | - | Moves physical limits into child operators or fetch-enabled variants to cut data early. |
+| 17 | `TopKRepartition` | - | Pushes TopK below hash repartition when the partition key is a prefix of the sort key. |
+| 18 | `ProjectionPushdown` | late pass | Runs projection pushdown again after limit and TopK rewrites expose new pruning opportunities. |
+| 19 | `PushdownSort` | - | Pushes sort requirements into data sources that can already return sorted output. |
+| 20 | `EnsureCooperative` | - | Wraps non-cooperative plan parts so long-running tasks yield fairly. |
+| 21 | `FilterPushdown(Post)` | post-optimization phase | Pushes dynamic filters at the end of optimization, after plan references stop moving. |
+| 22 | `SanityCheckPlan` | - | Validates that the final physical plan meets ordering, distribution, and infinite-input safety requirements. |
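
The reference page above states that rule order matters, and its logical table lists `simplify_expressions` (order 3) ahead of `eliminate_filter` (order 11). The following is a toy sketch of why such an ordering can matter. It is not DataFusion code: `Plan`, `Pred`, `toy_simplify_expressions`, `toy_eliminate_filter`, and `run_pipeline` are hypothetical stand-ins invented for illustration, and each toy rule handles only the single pattern needed here.

```rust
/// Toy predicate: either a boolean literal or a comparison of two integer literals.
#[derive(Debug, Clone, PartialEq)]
enum Pred {
    Lit(bool),
    EqInts(i64, i64),
}

/// Toy logical plan with just enough shape to show the effect.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan,
    Empty,
    Filter(Pred, Box<Plan>),
}

/// Toy stand-in for constant folding: `a = b` over literals becomes `true`/`false`.
fn toy_simplify_expressions(plan: Plan) -> Plan {
    match plan {
        Plan::Filter(Pred::EqInts(a, b), input) => {
            Plan::Filter(Pred::Lit(a == b), Box::new(toy_simplify_expressions(*input)))
        }
        Plan::Filter(pred, input) => {
            Plan::Filter(pred, Box::new(toy_simplify_expressions(*input)))
        }
        other => other,
    }
}

/// Toy stand-in for filter elimination: it only recognizes literal predicates.
fn toy_eliminate_filter(plan: Plan) -> Plan {
    match plan {
        Plan::Filter(Pred::Lit(true), input) => toy_eliminate_filter(*input),
        Plan::Filter(Pred::Lit(false), _) => Plan::Empty,
        Plan::Filter(pred, input) => {
            Plan::Filter(pred, Box::new(toy_eliminate_filter(*input)))
        }
        other => other,
    }
}

/// Applies each rule exactly once, in the order given.
fn run_pipeline(plan: Plan, rules: &[fn(Plan) -> Plan]) -> Plan {
    rules.iter().fold(plan, |plan, rule| rule(plan))
}
```

In this toy model, running `[toy_simplify_expressions, toy_eliminate_filter]` over a filter on `1 = 2` collapses the plan to `Plan::Empty`, while the reversed order leaves a `Filter(Lit(false), Scan)` behind because the elimination step ran before the predicate had been folded to a literal.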

datafusion/core/src/optimizer_rule_reference.rs

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use datafusion_optimizer::analyzer::Analyzer;
+use datafusion_optimizer::optimizer::Optimizer;
+use datafusion_physical_optimizer::optimizer::PhysicalOptimizer;
+
+const OPTIMIZER_RULE_REFERENCE: &str = include_str!("optimizer_rule_reference.md");
+
+fn documented_rules(section_heading: &str) -> Vec<String> {
+    let mut in_section = false;
+    let mut names = vec![];
+
+    for line in OPTIMIZER_RULE_REFERENCE.lines() {
+        if line == section_heading {
+            in_section = true;
+            continue;
+        }
+
+        if in_section && line.starts_with("### ") {
+            break;
+        }
+
+        if !in_section || !line.starts_with('|') || line.contains("---") {
+            continue;
+        }
+
+        let columns: Vec<_> = line.split('|').map(str::trim).collect();
+
+        if columns.len() < 4 || columns[1] == "order" {
+            continue;
+        }
+
+        names.push(columns[2].trim_matches('`').to_string());
+    }
+
+    names
+}
+
+#[test]
+fn analyzer_rules_match_documented_order() {
+    let rules: Vec<_> = Analyzer::new()
+        .rules
+        .iter()
+        .map(|rule| rule.name().to_string())
+        .collect();
+
+    assert_eq!(documented_rules("### Analyzer Rules"), rules);
+}
+
+#[test]
+fn logical_rules_match_documented_order() {
+    let rules: Vec<_> = Optimizer::new()
+        .rules
+        .iter()
+        .map(|rule| rule.name().to_string())
+        .collect();
+
+    assert_eq!(documented_rules("### Logical Optimizer Rules"), rules);
+}
+
+#[test]
+fn physical_rules_match_documented_order() {
+    let rules: Vec<_> = PhysicalOptimizer::new()
+        .rules
+        .iter()
+        .map(|rule| rule.name().to_string())
+        .collect();
+
+    assert_eq!(documented_rules("### Physical Optimizer Rules"), rules);
+}
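
The table parsing done by `documented_rules` can be tried without the DataFusion crates. The following is a minimal sketch and not part of the commit: `rule_names` is a hypothetical variant that takes the Markdown text as a parameter instead of reading it through `include_str!`, and `SAMPLE` is an abbreviated, made-up excerpt of the reference page with shortened summaries.

```rust
/// Hypothetical variant of `documented_rules` that takes the Markdown text
/// as an argument so it can run on any input.
fn rule_names(markdown: &str, section_heading: &str) -> Vec<String> {
    let mut in_section = false;
    let mut names = Vec::new();

    for line in markdown.lines() {
        if line == section_heading {
            in_section = true;
            continue;
        }
        // The next `###` heading ends the section being read.
        if in_section && line.starts_with("### ") {
            break;
        }
        // Skip lines outside the section, prose lines, and the `| --- |` separator row.
        if !in_section || !line.starts_with('|') || line.contains("---") {
            continue;
        }
        // "| 1 | `name` | summary |" splits into ["", "1", "`name`", "summary", ""].
        let columns: Vec<&str> = line.split('|').map(str::trim).collect();
        // Skip malformed rows and the header row.
        if columns.len() < 4 || columns[1] == "order" {
            continue;
        }
        names.push(columns[2].trim_matches('`').to_string());
    }

    names
}

/// Abbreviated sample input; the summaries are shortened for illustration.
const SAMPLE: &str = "\
### Analyzer Rules

| order | rule | summary |
| ----- | ---- | ------- |
| 1 | `resolve_grouping_function` | Rewrites `GROUPING(...)` calls. |
| 2 | `type_coercion` | Adds implicit casts. |

### Logical Optimizer Rules

| order | rule | summary |
| ----- | ---- | ------- |
| 1 | `rewrite_set_comparison` | Rewrites `ANY` and `ALL` subqueries. |
";
```

Calling `rule_names(SAMPLE, "### Analyzer Rules")` yields the two analyzer rule names in table order, and a heading that does not occur in the text yields an empty list. One property of this parsing approach worth keeping in mind: any table row whose text contains `---` is treated as a separator and skipped, so a rule summary containing that sequence would silently drop its row from the comparison.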

docs/source/library-user-guide/query-optimizer.md

Lines changed: 4 additions & 0 deletions
@@ -28,6 +28,9 @@ This crate is a submodule of DataFusion that provides a query optimizer for logi
 contains an extensive set of [`OptimizerRule`]s and [`PhysicalOptimizerRule`]s that may rewrite the plan and/or its expressions so
 they execute more quickly while still computing the same result.
 
+For a reference list of the built-in analyzer, logical optimizer, and physical optimizer rules,
+see [Optimizer Rule Reference].
+
 For a deeper background on optimizer architecture and rule types and predicates, see
 [Optimizing SQL (and DataFrames) in DataFusion, Part 1], [Part 2],
 [Using Ordering for Better Plans in Apache DataFusion], and
@@ -39,6 +42,7 @@ For a deeper background on optimizer architecture and rule types and predicates,
 [part 2]: https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two
 [using ordering for better plans in apache datafusion]: https://datafusion.apache.org/blog/2025/03/11/ordering-analysis
 [dynamic filters: passing information between operators during execution for 25x faster queries]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters
+[optimizer rule reference]: https://docs.rs/datafusion/latest/datafusion/index.html#built-in-optimizer-rules
 [`logicalplan`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.LogicalPlan.html
 
 ## Running the Optimizer
