UNPICK fix 2603301701

kosiew · kosiew · commit 5989c6e74f00 · 2026-03-30T19:49:33.000+08:00
diff --git a/FIX_10.md b/FIX_10.md
@@ -0,0 +1,192 @@
+# FIX 10: make restored syntactic null-restriction fast path conservative again
+
+## What this plan is for
+
+This plan is specifically for the new sqllogictest panics reported after the later
+commits in `b9828cabc^..a565c732b`.
+
+It is **not** a retry of earlier fixes:
+
+- `FIX_05.md` / `FIX_06.md`: scalar-subquery / cross-join filter-promotion regressions in
+  `push_down_filter.rs`
+- `FIX_07.md`: semantic split between `is_restrict_null_predicate` and `evaluates_to_null`
+- `FIX_08.md`: benchmark drift and over-broad production work in the benchmark branch
+- `FIX_09.md`: restoring the syntactic fast path after it had been removed
+
+Those threads should stay fixed. The new failure appears to come from the **restored**
+syntactic evaluator itself, not from the mixed-reference early return, not from
+`evaluates_to_null`, and not from scalar-subquery join-promotion logic.
+
+## Best current root-cause hypothesis
+
+The most likely culprit is the restored syntactic fast path in
+[`datafusion/optimizer/src/utils/null_restriction.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs),
+re-entered from [`is_restrict_null_predicate`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs).
+
+The key problem is here:
+
+- [`datafusion/optimizer/src/utils/null_restriction.rs:65`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs#L65)
+- [`datafusion/optimizer/src/utils/null_restriction.rs:88`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs#L88)
+- [`datafusion/optimizer/src/utils.rs:152`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs#L152)
+- [`datafusion/optimizer/src/utils.rs:161`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs#L161)
+
+`binary_boolean_value(...)` models SQL three-valued logic using a reduced lattice:
+
+- `Null`
+- `NonNull`
+- `Boolean(bool)`
+
+That is sound only if every pair of operands that reaches the fallback arm is genuinely equal.
+The current code assumes exactly that in the final arm:
+
+```rust
+(left, right) => {
+    debug_assert_eq!(left, right);
+    left
+}
+```
+
+but that assumption is not generally valid once the evaluator is fed real optimizer
+predicates composed from rewrites, wrappers, and partially-supported subexpressions.
+
+The result is:
+
+1. the syntactic evaluator returns `Some(...)` for expressions it cannot fully model safely
+2. `binary_boolean_value(...)` reaches mixed states it did not expect
+3. the new `debug_assert_eq!` panics in debug/test builds during planning
+4. sqllogictest reports that panic against whichever statement was active in that worker task,
+   which would explain why the surfaced failures span `aggregate.slt`, `subquery.slt`, and `union.slt`
+   even though those SQL texts are not obviously related to this helper
+
+Even without the assertion (for example in release builds, where `debug_assert_eq!` is compiled out),
+returning `left` in that arm would still be unsound, so the assert is exposing a real correctness
+hole in the fast path rather than creating one by itself.
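+
+To make the suspected failure mode concrete, here is a minimal, self-contained sketch. The
+`TriState` enum and `and_with_guessing_fallback` are hypothetical stand-ins, not the real code in
+`null_restriction.rs`; only the shape of the final arm is taken from the snippet above, and the
+earlier arms are an assumption about what a reduced three-valued `AND` might look like. The
+assertion is replaced by a returned flag so the sketch does not panic.
+
+```rust
+/// Hypothetical stand-in for the reduced lattice described above.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum TriState {
+    Null,
+    NonNull,
+    Boolean(bool),
+}
+
+/// Illustrative AND whose final arm returns `left`, like the snippet above.
+/// The second tuple element reports whether that guess was justified,
+/// i.e. whether the `debug_assert_eq!(left, right)` would have held.
+fn and_with_guessing_fallback(left: TriState, right: TriState) -> (TriState, bool) {
+    use TriState::*;
+    match (left, right) {
+        // FALSE dominates AND under SQL three-valued logic.
+        (Boolean(false), _) | (_, Boolean(false)) => (Boolean(false), true),
+        (Boolean(true), Boolean(true)) => (Boolean(true), true),
+        // NULL AND TRUE and NULL AND NULL are both NULL.
+        (Null, Boolean(true)) | (Boolean(true), Null) | (Null, Null) => (Null, true),
+        // Fallback: returning `left` is only justified when both sides agree.
+        (left, right) => (left, left == right),
+    }
+}
+```
+
+In this sketch, `Boolean(true) AND NonNull` reaches the fallback with unequal operands: the arm
+answers `Boolean(true)` even though the right side could still be `FALSE`.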
+
+## Why this matches the reported failures
+
+The reported failures are broad and seemingly unrelated:
+
+- aggregate regression query
+- correlated subquery regression query
+- union / except regression query
+
+That pattern fits a planner-time panic in a shared optimizer helper much better than a
+query-specific execution bug.
+
+Within this PR range, the restored syntactic null-restriction path is the most plausible
+shared change that:
+
+- runs during optimization
+- is exercised across many plan shapes
+- added fresh `debug_assert_eq!` assumptions
+- was reintroduced after earlier notes had already identified risk in this area
+
+## Fix plan
+
+1. Make the syntactic evaluator conservative instead of assertive.
+
+   In [`datafusion/optimizer/src/utils/null_restriction.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs),
+   change `binary_boolean_value(...)` so that any mixed residual state it cannot prove
+   equivalent returns `None`, not `left`, and never asserts.
+
+   Concretely:
+
+   - remove the `debug_assert_eq!(left, right)` fallback
+   - treat unmodelled combinations as "unknown to syntactic evaluator"
+   - let the caller fall back to `authoritative_restrict_null_predicate(...)`
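+
+   A rough sketch of that shape, under stated assumptions: `TriState`, `conservative_and`, and
+   `and_or_fallback` are hypothetical names, and the closure merely stands in for the fallback to
+   `authoritative_restrict_null_predicate(...)`, whose real signature is not shown in this plan.
+
+   ```rust
+   /// Hypothetical stand-in for the evaluator's reduced lattice.
+   #[derive(Debug, Clone, Copy, PartialEq, Eq)]
+   enum TriState {
+       Null,
+       NonNull,
+       Boolean(bool),
+   }
+
+   /// Conservative AND: `None` means "unknown to the syntactic evaluator".
+   fn conservative_and(left: TriState, right: TriState) -> Option<TriState> {
+       use TriState::*;
+       match (left, right) {
+           // FALSE dominates AND under SQL three-valued logic.
+           (Boolean(false), _) | (_, Boolean(false)) => Some(Boolean(false)),
+           (Boolean(true), Boolean(true)) => Some(Boolean(true)),
+           // NULL AND TRUE and NULL AND NULL are both NULL.
+           (Null, Boolean(true)) | (Boolean(true), Null) | (Null, Null) => Some(Null),
+           // Anything else involves `NonNull`, whose concrete value could be
+           // TRUE or FALSE, so decline instead of guessing or asserting.
+           // (`NonNull AND NonNull` could be modelled as `NonNull`; it is left
+           // out here to keep the sketch maximally conservative.)
+           _ => None,
+       }
+   }
+
+   /// Hypothetical caller: use the fast-path answer when there is one,
+   /// otherwise run the (here simulated) authoritative evaluation.
+   fn and_or_fallback(
+       left: TriState,
+       right: TriState,
+       authoritative: impl FnOnce() -> TriState,
+   ) -> TriState {
+       conservative_and(left, right).unwrap_or_else(authoritative)
+   }
+   ```
+
+   The point of returning `Option` is that declining is always safe: the worst case is a slower
+   authoritative evaluation, never a wrong answer or a panic.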
+
+2. Tighten when `syntactic_restrict_null_predicate(...)` is allowed to return `Some(...)`.
+
+   Audit the current supported nodes in
+   [`datafusion/optimizer/src/utils/null_restriction.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs)
+   and make sure it only returns a definitive answer for expressions whose null-substitution
+   result is proven by construction.
+
+   In particular, re-check:
+
+   - `AND` / `OR` combinations of partially-known boolean states
+   - wrappers like `IS [NOT] NULL`
+   - casts / negation / LIKE
+   - interactions where one side is known boolean and the other side is only "non-null"
+
+3. Keep the mixed-reference early return from Sub-issue B.
+
+   Do **not** remove the cheap bailout in
+   [`datafusion/optimizer/src/utils.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs#L135).
+   That early `Ok(false)` for predicates referencing columns outside the join-key set is still
+   the intended optimization and is not the leading suspect for this regression.
+
+4. Do not reopen the earlier semantic fixes.
+
+   Preserve:
+
+   - the caller split between `is_restrict_null_predicate(...)` and `evaluates_to_null(...)`
+   - the scalar-subquery / cross-join `PushDownFilter` fixes already captured in earlier plans
+
+   This fix should stay narrowly focused on the restored syntactic fast path.
+
+5. Add differential regression tests that would have caught this.
+
+   In [`datafusion/optimizer/src/utils.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs),
+   extend the fast-path-vs-authoritative tests so they cover **composed boolean predicates**,
+   not just simple scalar cases.
+
+   Add reduced cases for:
+
+   - `AND` with one identity side and one partially-known side
+   - `OR` with one identity side and one partially-known side
+   - combinations involving `IS NULL` / `IS NOT NULL`
+   - CASE-derived boolean predicates if they can now reach the syntactic path indirectly
+
+   The invariant should be:
+
+   - if syntactic evaluation returns `Some(x)`, authoritative evaluation must also return `x`
+   - whenever that agreement cannot be guaranteed, syntactic evaluation must return `None`
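+
+   The invariant can be phrased as a small checker. This is only a sketch: it assumes the fast
+   path yields an `Option<bool>` and the authoritative path a plain `bool`, and both helper names
+   are hypothetical. A real test would obtain the two values by running both evaluators on the
+   same predicate.
+
+   ```rust
+   /// Checks the differential invariant for a single predicate.
+   /// `syntactic` is the fast-path result, `authoritative` the slow-path result.
+   fn check_fast_path_agreement(syntactic: Option<bool>, authoritative: bool) -> Result<(), String> {
+       match syntactic {
+           // Declining is always allowed.
+           None => Ok(()),
+           // A definitive answer must match the authoritative one.
+           Some(fast) if fast == authoritative => Ok(()),
+           Some(fast) => Err(format!(
+               "fast path returned {fast}, authoritative returned {authoritative}"
+           )),
+       }
+   }
+
+   /// Returns the name of the first case that violates the invariant, if any.
+   /// Each case is (name, syntactic result, authoritative result).
+   fn first_disagreement<'a>(cases: &[(&'a str, Option<bool>, bool)]) -> Option<&'a str> {
+       cases
+           .iter()
+           .find(|(_, syntactic, authoritative)| {
+               check_fast_path_agreement(*syntactic, *authoritative).is_err()
+           })
+           .map(|(name, _, _)| *name)
+   }
+   ```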
+
+6. Add one optimizer-level repro derived from real failing traffic.
+
+   Because sqllogictest surfaced the panic through parallel worker tasks, add at least one
+   optimizer-level regression that exercises `PushDownFilter` null-restriction analysis on a
+   realistic composed predicate instead of relying only on hand-picked helper-unit cases.
+
+   Preferred location:
+
+   - [`datafusion/optimizer/src/push_down_filter.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/push_down_filter.rs)
+   - or an optimizer integration test if that is easier to minimize
+
+7. Reproduce with backtrace once the workspace compiles cleanly.
+
+   This checkout currently has an unrelated compile blocker in
+   [`datafusion/physical-expr-common/src/metrics/baseline.rs`](/Users/kosiew/GitHub/datafusion/datafusion/physical-expr-common/src/metrics/baseline.rs)
+   (`MetricType::Dev` vs `MetricType::DEV`), which prevented a local sqllogictest backtrace.
+
+   After that is resolved, rerun the smallest useful reproductions first:
+
+   ```bash
+   export RUST_BACKTRACE=1  # capture a backtrace if the planner panics
+   cargo test -p datafusion-optimizer -- utils
+   cargo test -p datafusion-optimizer -- push_down_filter
+   cargo test -p datafusion-sqllogictest --test sqllogictests -- subquery.slt
+   cargo test -p datafusion-sqllogictest --test sqllogictests -- union.slt
+   cargo test -p datafusion-sqllogictest --test sqllogictests -- aggregate.slt
+   ```
+
+8. Finish with required repo checks.
+
+   ```bash
+   cargo fmt --all
+   cargo clippy --all-targets --all-features -- -D warnings
+   ```
+
+## Expected code touch points
+
+- [`datafusion/optimizer/src/utils/null_restriction.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs)
+- [`datafusion/optimizer/src/utils.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs)
+- possibly one optimizer regression test in
+  [`datafusion/optimizer/src/push_down_filter.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/push_down_filter.rs)
+
+## Short version
+
+The likely regression is that the restored syntactic null-restriction evaluator is no longer
+fully conservative: it reaches mixed boolean states it cannot soundly model, then asserts
+instead of declining and falling back to authoritative evaluation. The fix is to make that
+fast path conservative again: unsupported or mixed states must return `None`, not panic and
+not guess.