Informs: datafusion-contrib/datafusion-distributed#180
Closes: #20418
Consider a plan with a `HashJoinExec` and a `DataSourceExec`:
```
HashJoinExec(dynamic_filter_1 on a@0)
(...left side of join)
ProjectionExec(a := Column("a", source_index))
DataSourceExec
ParquetSource(predicate = dynamic_filter_2)
```
Suppose you serialize the plan, deserialize it, and execute it. The dynamic filter should still "work", meaning:
1. When you deserialize the plan, both the `HashJoinExec` and `DataSourceExec` should have pointers to the same `DynamicFilterPhysicalExpr`
2. The `DynamicFilterPhysicalExpr` should be updated during execution by the `HashJoinExec` and the `DataSourceExec` should filter out rows
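As a rough illustration of the two properties above, here is a minimal, std-only sketch. `SharedFilter` and `scan` are hypothetical stand-ins and not DataFusion APIs; the only point is that the producer and the consumer must hold the same `Arc` for an update to be observed.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for `DynamicFilterPhysicalExpr` (not the real type):
/// a filter whose bound can be tightened while the query runs.
#[derive(Debug)]
struct SharedFilter {
    /// Rows with `a < min_a` are pruned; `None` means no bound is known yet.
    min_a: RwLock<Option<i64>>,
}

impl SharedFilter {
    fn new() -> Arc<Self> {
        Arc::new(Self { min_a: RwLock::new(None) })
    }

    /// Called by the producer (the join, in the plan above).
    fn update(&self, bound: i64) {
        *self.min_a.write().unwrap() = Some(bound);
    }

    /// Called by the consumer (the scan, in the plan above).
    fn keep(&self, a: i64) -> bool {
        match *self.min_a.read().unwrap() {
            Some(min) => a >= min,
            None => true,
        }
    }
}

/// Hypothetical scan: returns the rows that pass the current filter.
fn scan(filter: &SharedFilter, rows: &[i64]) -> Vec<i64> {
    rows.iter().copied().filter(|a| filter.keep(*a)).collect()
}
```

If the join and the scan each hold an `Arc::clone` of one `SharedFilter`, a call to `update` on the join side changes what `scan` returns. If deserialization instead builds two separate filters, the scan never sees the update, which is the failure described in this PR.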
This does not happen today for a few reasons, a couple of which this PR aims to address:
1. `DynamicFilterPhysicalExpr` does not survive round-tripping. The internal exprs get inlined (e.g. it may be serialized as a `Literal`) due to the `PhysicalExpr::snapshot()` API
2. Even if `DynamicFilterPhysicalExpr` survives round-tripping, the one pushed down to the `DataSourceExec` often has different children. In this case, you end up with two `DynamicFilterPhysicalExpr`s which are not recognized as the same expression during deduping, causing referential integrity to be lost.
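The first problem can be pictured with a toy expression type. This is a sketch under assumptions: `Expr` and its `snapshot` method are hypothetical and only mimic the inlining behaviour described above, not the real `PhysicalExpr::snapshot()` signature.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical toy expression tree (not a DataFusion type).
#[derive(Debug, Clone)]
enum Expr {
    Literal(bool),
    /// A dynamic node: a shared handle whose contents may be replaced later.
    Dynamic(Arc<RwLock<Expr>>),
}

impl Expr {
    /// Toy analogue of snapshotting: replace a dynamic node with whatever
    /// it currently contains, dropping the shared handle.
    fn snapshot(&self) -> Expr {
        match self {
            Expr::Dynamic(inner) => inner.read().unwrap().snapshot(),
            other => other.clone(),
        }
    }
}
```

A snapshot taken while the dynamic node still holds its placeholder is a plain `Literal`. Later writes through the shared handle do not reach it, so a plan serialized from snapshots has no dynamic filter left to update.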
This PR aims to fix those problems by:
1. Removing the `snapshot()` call from the serialization process
2. Adding protos for `DynamicFilterPhysicalExpr` so it can be serialized and deserialized
3. Adding a new concept, a `PhysicalExprId`, which has two identifiers,
a "shallow" identifier to indicate two equal expressions which may
have different children, and an "exact" identifier to indicate two
exprs that are exactly the same.
4. Updating the deduping deserializer and protos to now be aware of the
new "shallow" id, deduping exprs which are the same but have
different children accordingly.
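One way to picture the two identifiers is the sketch below. It is an assumption about the general shape, not the code in this PR: `ExprId`, `DynFilter`, `SerializedFilter`, and `Deduper` are hypothetical names, and the real protos and deserializer may be structured differently. The idea shown is that an "exact" match returns the very same expression, while a "shallow" match builds a new wrapper with its own children that shares the same inner state.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical two-part identifier, modelled on the description above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ExprId {
    shallow: u64, // same expression, children may differ
    exact: u64,   // exactly the same expression
}

/// Mutable state shared by every view of one dynamic filter.
type Inner = Arc<RwLock<String>>;

/// Hypothetical dynamic filter: shared inner state plus per-view children.
#[derive(Debug)]
struct DynFilter {
    children: Vec<String>,
    inner: Inner,
}

/// Hypothetical serialized form of one dynamic filter occurrence.
struct SerializedFilter {
    id: ExprId,
    children: Vec<String>,
    inner: String,
}

#[derive(Default)]
struct Deduper {
    by_exact: HashMap<u64, Arc<DynFilter>>,
    by_shallow: HashMap<u64, Inner>,
}

impl Deduper {
    fn decode(&mut self, msg: &SerializedFilter) -> Arc<DynFilter> {
        // Exact match: hand back the identical expression.
        if let Some(expr) = self.by_exact.get(&msg.id.exact) {
            return Arc::clone(expr);
        }
        // Shallow match: reuse the shared inner state, keep this view's children.
        let inner = Arc::clone(
            self.by_shallow
                .entry(msg.id.shallow)
                .or_insert_with(|| Arc::new(RwLock::new(msg.inner.clone()))),
        );
        let expr = Arc::new(DynFilter { children: msg.children.clone(), inner });
        self.by_exact.insert(msg.id.exact, Arc::clone(&expr));
        expr
    }
}
```

Decoding two messages with the same `shallow` id but different `exact` ids yields two distinct `DynFilter` values whose `inner` fields point to the same allocation, so an update written through one is visible through the other.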
This change adds tests which roundtrip dynamic filters and assert that
referential integrity is maintained.
datafusion/physical-expr/src/expressions/dynamic_filters.rs
252 additions & 1 deletion
@@ -27,6 +27,7 @@ use datafusion_common::{
 };
 use datafusion_expr::ColumnarValue;
 use datafusion_physical_expr_common::physical_expr::DynHash;
+use rand::random;

 /// State of a dynamic filter, tracking both updates and completion.
 #[derive(Debug, Clone, Copy, PartialEq, Eq)]
@@ -55,7 +56,6 @@ impl FilterState {
 /// For more background, please also see the [Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries blog]
 ///
 /// [Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries blog]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters
-#[derive(Debug)]
 pub struct DynamicFilterPhysicalExpr {
     /// The original children of this PhysicalExpr, if any.
     /// This is necessary because the dynamic filter may be initialized with a placeholder (e.g. `lit(true)`)