Skip to content

Commit 5989c6e

Browse files
committed
UNPICK fix 2603301701
1 parent 4d85851 commit 5989c6e

File tree

1 file changed

+192
-0
lines changed

1 file changed

+192
-0
lines changed

FIX_10.md

Lines changed: 192 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,192 @@
1+
# FIX 10: make restored syntactic null-restriction fast path conservative again
2+
3+
## What this plan is for
4+
5+
This plan is specifically for the new sqllogictest panics reported after the later
6+
commits in `b9828cabc^..a565c732b`.
7+
8+
It is **not** a retry of earlier fixes:
9+
10+
- `FIX_05.md` / `FIX_06.md`: scalar-subquery / cross-join filter-promotion regressions in
11+
`push_down_filter.rs`
12+
- `FIX_07.md`: semantic split between `is_restrict_null_predicate` and `evaluates_to_null`
13+
- `FIX_08.md`: benchmark drift and over-broad production work in the benchmark branch
14+
- `FIX_09.md`: restoring the syntactic fast path after it had been removed
15+
16+
Those threads should stay fixed. The new failure appears to come from the **restored**
17+
syntactic evaluator itself, not from the mixed-reference early return, not from
18+
`evaluates_to_null`, and not from scalar-subquery join-promotion logic.
19+
20+
## Best current root-cause hypothesis
21+
22+
The most likely culprit is the restored syntactic fast path in
23+
[`datafusion/optimizer/src/utils/null_restriction.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs),
24+
re-entered from [`is_restrict_null_predicate`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs).
25+
26+
The key problem is here:
27+
28+
- [`datafusion/optimizer/src/utils/null_restriction.rs:65`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs#L65)
29+
- [`datafusion/optimizer/src/utils/null_restriction.rs:88`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs#L88)
30+
- [`datafusion/optimizer/src/utils.rs:152`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs#L152)
31+
- [`datafusion/optimizer/src/utils.rs:161`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs#L161)
32+
33+
`binary_boolean_value(...)` models SQL three-valued logic using a reduced lattice:
34+
35+
- `Null`
36+
- `NonNull`
37+
- `Boolean(bool)`
38+
39+
That is fine only if every combination that reaches the fallback arm is genuinely equivalent.
40+
The current code assumes that in the final arm:
41+
42+
```rust
43+
(left, right) => {
44+
debug_assert_eq!(left, right);
45+
left
46+
}
47+
```
48+
49+
but that assumption is not generally valid once the evaluator is fed real optimizer
50+
predicates composed from rewrites, wrappers, and partially-supported subexpressions.
51+
52+
The result is:
53+
54+
1. the syntactic evaluator returns `Some(...)` for expressions it cannot fully model safely
55+
2. `binary_boolean_value(...)` reaches mixed states it did not expect
56+
3. the new `debug_assert_eq!` panics in debug/test builds during planning
57+
4. sqllogictest reports that panic against whichever statement was active in that worker task,
58+
which explains why the surfaced failures span `aggregate.slt`, `subquery.slt`, and `union.slt`
59+
even though those SQL texts are not obviously about this helper
60+
61+
Even without the assertion, returning `left` in that arm would still be unsound, so the
62+
assert is exposing a real correctness hole in the fast path rather than creating one by itself.
63+
64+
## Why this matches the reported failures
65+
66+
The reported failures are broad and seemingly unrelated:
67+
68+
- aggregate regression query
69+
- correlated subquery regression query
70+
- union / except regression query
71+
72+
That pattern fits a planner-time panic in a shared optimizer helper much better than a
73+
query-specific execution bug.
74+
75+
Within this PR range, the restored syntactic null-restriction path is the most plausible
76+
shared change that:
77+
78+
- runs during optimization
79+
- is exercised across many plan shapes
80+
- added fresh `debug_assert_eq!` assumptions
81+
- was reintroduced after earlier notes had already identified risk in this area
82+
83+
## Fix plan
84+
85+
1. Make the syntactic evaluator conservative instead of assertive.
86+
87+
In [`datafusion/optimizer/src/utils/null_restriction.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs),
88+
change `binary_boolean_value(...)` so that any mixed residual state it cannot prove
89+
equivalent returns `None`, not `left`, and never asserts.
90+
91+
Concretely:
92+
93+
- remove the `debug_assert_eq!(left, right)` fallback
94+
- treat unmodelled combinations as "unknown to syntactic evaluator"
95+
- let the caller fall back to `authoritative_restrict_null_predicate(...)`
96+
97+
2. Tighten when `syntactic_restrict_null_predicate(...)` is allowed to return `Some(...)`.
98+
99+
Audit the current supported nodes in
100+
[`datafusion/optimizer/src/utils/null_restriction.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs)
101+
and make sure it only returns a definitive answer for expressions whose null-substitution
102+
result is proven by construction.
103+
104+
In particular, re-check:
105+
106+
- `AND` / `OR` combinations of partially-known boolean states
107+
- wrappers like `IS [NOT] NULL`
108+
- casts / negation / LIKE
109+
- interactions where one side is known boolean and the other side is only "non-null"
110+
111+
3. Keep the mixed-reference early return from Sub-issue B.
112+
113+
Do **not** remove the cheap bailout in
114+
[`datafusion/optimizer/src/utils.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs#L135).
115+
That early `Ok(false)` for predicates referencing columns outside the join-key set is still
116+
the intended optimization and is not the leading suspect for this regression.
117+
118+
4. Do not reopen the earlier semantic fixes.
119+
120+
Preserve:
121+
122+
- the caller split between `is_restrict_null_predicate(...)` and `evaluates_to_null(...)`
123+
- the scalar-subquery / cross-join `PushDownFilter` fixes already captured in earlier plans
124+
125+
This fix should stay narrowly focused on the restored syntactic fast path.
126+
127+
5. Add differential regression tests that would have caught this.
128+
129+
In [`datafusion/optimizer/src/utils.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs),
130+
extend the fast-path-vs-authoritative tests so they cover **composed boolean predicates**,
131+
not just simple scalar cases.
132+
133+
Add reduced cases for:
134+
135+
- `AND` with one identity side and one partially-known side
136+
- `OR` with one identity side and one partially-known side
137+
- combinations involving `IS NULL` / `IS NOT NULL`
138+
- CASE-derived boolean predicates if they can now reach the syntactic path indirectly
139+
140+
The invariant should be:
141+
142+
- if syntactic evaluation returns `Some(x)`, authoritative evaluation must also return `x`
143+
- otherwise syntactic evaluation must return `None`
144+
145+
6. Add one optimizer-level repro derived from real failing traffic.
146+
147+
Because sqllogictest surfaced the panic through parallel worker tasks, add at least one
148+
optimizer-level regression that exercises `PushDownFilter` null-restriction analysis on a
149+
realistic composed predicate instead of relying only on hand-picked helper-unit cases.
150+
151+
Preferred location:
152+
153+
- [`datafusion/optimizer/src/push_down_filter.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/push_down_filter.rs)
154+
- or an optimizer integration test if that is easier to minimize
155+
156+
7. Reproduce with backtrace once the workspace compiles cleanly.
157+
158+
This checkout currently has an unrelated compile blocker in
159+
[`datafusion/physical-expr-common/src/metrics/baseline.rs`](/Users/kosiew/GitHub/datafusion/datafusion/physical-expr-common/src/metrics/baseline.rs)
160+
(`MetricType::Dev` vs `MetricType::DEV`), which prevented a local sqllogictest backtrace.
161+
162+
After that is resolved, rerun the smallest useful reproductions first:
163+
164+
```bash
165+
cargo test -p datafusion-optimizer -- utils
166+
cargo test -p datafusion-optimizer -- push_down_filter
167+
cargo test -p datafusion-sqllogictest --test sqllogictests -- subquery.slt
168+
cargo test -p datafusion-sqllogictest --test sqllogictests -- union.slt
169+
cargo test -p datafusion-sqllogictest --test sqllogictests -- aggregate.slt
170+
```
171+
172+
8. Finish with required repo checks.
173+
174+
```bash
175+
cargo fmt --all
176+
cargo clippy --all-targets --all-features -- -D warnings
177+
```
178+
179+
## Expected code touch points
180+
181+
- [`datafusion/optimizer/src/utils/null_restriction.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs)
182+
- [`datafusion/optimizer/src/utils.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs)
183+
- possibly one optimizer regression test in
184+
[`datafusion/optimizer/src/push_down_filter.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/push_down_filter.rs)
185+
186+
## Short version
187+
188+
The likely regression is that the restored syntactic null-restriction evaluator is no longer
189+
fully conservative: it reaches mixed boolean states it cannot soundly model, then asserts
190+
instead of declining and falling back to authoritative evaluation. The fix is to make that
191+
fast path conservative again: unsupported or mixed states must return `None`, not panic and
192+
not guess.

0 commit comments

Comments
 (0)