Informs: datafusion-contrib/datafusion-distributed#180
Closes: #20418
Consider a plan with a `HashJoinExec` and a `DataSourceExec`:
```
HashJoinExec(dynamic_filter_1 on a@0)
(...left side of join)
ProjectionExec(a := Column("a", source_index))
DataSourceExec
ParquetSource(predicate = dynamic_filter_2)
```
You serialize the plan, deserialize it, and execute it. The dynamic filter should still "work", meaning:
1. When you deserialize the plan, both the `HashJoinExec` and `DataSourceExec` should have pointers to the same `DynamicFilterPhysicalExpr`
2. The `DynamicFilterPhysicalExpr` should be updated during execution by the `HashJoinExec` and the `DataSourceExec` should filter out rows
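The two expectations above can be sketched with a toy model. This is not DataFusion code: `SharedFilter`, `update`, `keep`, and `scan` are hypothetical names, and the filter is reduced to a single lower bound on column `a`. It only illustrates why both operators must hold the same pointer.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for a dynamic filter: a lower bound on column `a`
/// that starts as "accept everything" and is tightened during execution.
#[derive(Debug)]
struct SharedFilter {
    min_a: RwLock<Option<i64>>,
}

impl SharedFilter {
    fn new() -> Arc<Self> {
        Arc::new(Self { min_a: RwLock::new(None) })
    }

    /// Called by the (hypothetical) join side once it knows the bound.
    fn update(&self, min_a: i64) {
        *self.min_a.write().unwrap() = Some(min_a);
    }

    /// Called by the (hypothetical) scan side for every row.
    fn keep(&self, a: i64) -> bool {
        match *self.min_a.read().unwrap() {
            Some(min) => a >= min,
            None => true, // placeholder state: nothing is filtered yet
        }
    }
}

/// Pass rows through the filter and return the survivors.
fn scan(filter: &SharedFilter, rows: &[i64]) -> Vec<i64> {
    rows.iter().copied().filter(|a| filter.keep(*a)).collect()
}
```

If the join side and the scan side each hold an `Arc::clone` of the same `SharedFilter`, an `update` made by the join is visible to the next `scan`. If deserialization produced two independent copies instead, the scan would stay in its placeholder state forever.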
This does not happen today for a few reasons, a couple of which this PR aims to address:
1. `DynamicFilterPhysicalExpr` does not survive round-tripping. Its internal expressions get inlined (e.g. it may be serialized as a `Literal`) because of the `PhysicalExpr::snapshot()` API.
2. Even if `DynamicFilterPhysicalExpr` survived round-tripping, the one pushed down to the `DataSourceExec` often has different children. In that case you have two `DynamicFilterPhysicalExpr` instances that do not survive deduping, so referential integrity is lost.
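The first problem can be illustrated with a small, self-contained model. The `Expr` enum and `Dynamic` struct below are hypothetical simplifications, and this `snapshot` only mimics the idea of capturing the current inner expression; it is not the signature of `PhysicalExpr::snapshot()`.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical, tiny expression language for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(bool),
    GreaterEq(i64),
}

/// Toy dynamic filter: a shared, mutable "current" expression.
struct Dynamic {
    current: Arc<RwLock<Expr>>,
}

impl Dynamic {
    fn new() -> Self {
        // Starts as a placeholder that accepts every row.
        Self { current: Arc::new(RwLock::new(Expr::Literal(true))) }
    }

    /// Copies the expression as it is right now. The copy has no link
    /// back to the shared state, so later updates never reach it.
    fn snapshot(&self) -> Expr {
        self.current.read().unwrap().clone()
    }

    fn update(&self, new_expr: Expr) {
        *self.current.write().unwrap() = new_expr;
    }
}
```

In this model, a plan serialized from `snapshot()` before execution would carry `Expr::Literal(true)` forever, which is the inlining behaviour described in item 1.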
This PR aims to fix those problems by:
1. Removing the `snapshot()` call from the serialization process
2. Adding protos for `DynamicFilterPhysicalExpr` so it can be serialized and deserialized
3. Removing `Arc`-based deduplication. We now dedupe only on `expression_id`, and only if the `PhysicalExpr` reports an `expression_id`. After this change, only `DynamicFilterPhysicalExpr` reports one.
4. `expression_id` is now just a random u64. Since a given query likely has only a few `DynamicFilterPhysicalExpr` instances, the odds of a collision are very low.
5. There's no need for a `DedupingSerializer` anymore, since the `expression_id` is already stored in the dynamic filter proto itself.
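A minimal sketch of id-based deduplication on the decode side, under stated assumptions: `FilterProto`, `Filter`, and `Decoder` are hypothetical types, not the protos or decoder added by this PR, and the shared state is reduced to a `String`. The point is that two serialized filters with the same `expression_id` but different children end up sharing one piece of state. The last function is a union-bound estimate of the collision odds for `n` ids drawn uniformly from the u64 range.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical serialized form: the id travels inside the message itself.
#[derive(Debug, Clone)]
struct FilterProto {
    expression_id: u64,
    children: Vec<String>,
}

/// Hypothetical decoded filter: its own children, shared mutable state.
#[derive(Debug)]
struct Filter {
    children: Vec<String>,
    state: Arc<RwLock<String>>,
}

/// Remembers the shared state already created for each id.
#[derive(Default)]
struct Decoder {
    seen: HashMap<u64, Arc<RwLock<String>>>,
}

impl Decoder {
    fn decode(&mut self, proto: &FilterProto) -> Filter {
        // Reuse the state for a known id; otherwise start from a placeholder.
        let state = self
            .seen
            .entry(proto.expression_id)
            .or_insert_with(|| Arc::new(RwLock::new("true".to_string())));
        Filter {
            children: proto.children.clone(),
            state: Arc::clone(state),
        }
    }
}

/// Union bound on the chance that any two of `n` uniformly random u64 ids
/// are equal: (number of pairs) / 2^64.
fn collision_probability_upper_bound(n: u64) -> f64 {
    let pairs = (n as f64) * (n as f64 - 1.0) / 2.0;
    pairs / 2f64.powi(64)
}
```

Decoding the join-side and scan-side protos with the same `Decoder` yields two `Filter` values whose `state` fields point at the same allocation, even though their `children` differ. For ten filters in a query, the bound is 45 / 2^64, which is below 1e-17.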
Testing
- Adds tests that round-trip dynamic filters and assert that referential integrity is maintained.
- Removes tests that cover `Arc`-based deduplication and session id rotation.
datafusion/physical-expr/src/expressions/dynamic_filters.rs
219 additions & 6 deletions
@@ -27,6 +27,7 @@ use datafusion_common::{
 };
 use datafusion_expr::ColumnarValue;
 use datafusion_physical_expr_common::physical_expr::DynHash;
+use rand::random;
 
 /// State of a dynamic filter, tracking both updates and completion.
 #[derive(Debug, Clone, Copy, PartialEq, Eq)]
@@ -55,7 +56,6 @@ impl FilterState {
 /// For more background, please also see the [Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries blog]
 ///
 /// [Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries blog]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters
-#[derive(Debug)]
 pub struct DynamicFilterPhysicalExpr {
     /// The original children of this PhysicalExpr, if any.
     /// This is necessary because the dynamic filter may be initialized with a placeholder (e.g. `lit(true)`)