|
| 1 | +# FIX 10: make restored syntactic null-restriction fast path conservative again |
| 2 | + |
| 3 | +## What this plan is for |
| 4 | + |
| 5 | +This plan is specifically for the new sqllogictest panics reported after the later |
| 6 | +commits in `b9828cabc^..a565c732b`. |
| 7 | + |
| 8 | +It is **not** a retry of earlier fixes: |
| 9 | + |
| 10 | +- `FIX_05.md` / `FIX_06.md`: scalar-subquery / cross-join filter-promotion regressions in |
| 11 | + `push_down_filter.rs` |
| 12 | +- `FIX_07.md`: semantic split between `is_restrict_null_predicate` and `evaluates_to_null` |
| 13 | +- `FIX_08.md`: benchmark drift and over-broad production work in the benchmark branch |
| 14 | +- `FIX_09.md`: restoring the syntactic fast path after it had been removed |
| 15 | + |
| 16 | +Those threads should stay fixed. The new failure appears to come from the **restored** |
| 17 | +syntactic evaluator itself, not from the mixed-reference early return, not from |
| 18 | +`evaluates_to_null`, and not from scalar-subquery join-promotion logic. |
| 19 | + |
| 20 | +## Best current root-cause hypothesis |
| 21 | + |
| 22 | +The most likely culprit is the restored syntactic fast path in |
| 23 | +[`datafusion/optimizer/src/utils/null_restriction.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs), |
| 24 | +re-entered from [`is_restrict_null_predicate`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs). |
| 25 | + |
| 26 | +The key problem is here: |
| 27 | + |
| 28 | +- [`datafusion/optimizer/src/utils/null_restriction.rs:65`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs#L65) |
| 29 | +- [`datafusion/optimizer/src/utils/null_restriction.rs:88`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs#L88) |
| 30 | +- [`datafusion/optimizer/src/utils.rs:152`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs#L152) |
| 31 | +- [`datafusion/optimizer/src/utils.rs:161`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs#L161) |
| 32 | + |
| 33 | +`binary_boolean_value(...)` models SQL three-valued logic using a reduced lattice: |
| 34 | + |
| 35 | +- `Null` |
| 36 | +- `NonNull` |
| 37 | +- `Boolean(bool)` |
| 38 | + |
| 39 | +That is fine only if every combination that reaches the fallback arm is genuinely equivalent. |
| 40 | +The current code assumes that in the final arm: |
| 41 | + |
| 42 | +```rust |
| 43 | +(left, right) => { |
| 44 | + debug_assert_eq!(left, right); |
| 45 | + left |
| 46 | +} |
| 47 | +``` |
| 48 | + |
| 49 | +but that assumption is not generally valid once the evaluator is fed real optimizer |
| 50 | +predicates composed from rewrites, wrappers, and partially-supported subexpressions. |
| 51 | + |
| 52 | +The result is: |
| 53 | + |
| 54 | +1. the syntactic evaluator returns `Some(...)` for expressions it cannot fully model safely |
| 55 | +2. `binary_boolean_value(...)` reaches mixed states it did not expect |
| 56 | +3. the new `debug_assert_eq!` panics in debug/test builds during planning |
| 57 | +4. sqllogictest reports that panic against whichever statement was active in that worker task, |
| 58 | + which explains why the surfaced failures span `aggregate.slt`, `subquery.slt`, and `union.slt` |
| 59 | + even though those SQL texts are not obviously about this helper |
| 60 | + |
| 61 | +Even without the assertion, returning `left` in that arm would still be unsound, so the |
| 62 | +assert is exposing a real correctness hole in the fast path rather than creating one by itself. |
| 63 | + |
| 64 | +## Why this matches the reported failures |
| 65 | + |
| 66 | +The reported failures are broad and seemingly unrelated: |
| 67 | + |
| 68 | +- aggregate regression query |
| 69 | +- correlated subquery regression query |
| 70 | +- union / except regression query |
| 71 | + |
| 72 | +That pattern fits a planner-time panic in a shared optimizer helper much better than a |
| 73 | +query-specific execution bug. |
| 74 | + |
| 75 | +Within this PR range, the restored syntactic null-restriction path is the most plausible |
| 76 | +shared change that: |
| 77 | + |
| 78 | +- runs during optimization |
| 79 | +- is exercised across many plan shapes |
| 80 | +- added fresh `debug_assert_eq!` assumptions |
| 81 | +- was reintroduced after earlier notes had already identified risk in this area |
| 82 | + |
| 83 | +## Fix plan |
| 84 | + |
| 85 | +1. Make the syntactic evaluator conservative instead of assertive. |
| 86 | + |
| 87 | + In [`datafusion/optimizer/src/utils/null_restriction.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs), |
| 88 | + change `binary_boolean_value(...)` so that any mixed residual state it cannot prove |
| 89 | + equivalent returns `None`, not `left`, and never asserts. |
| 90 | + |
| 91 | + Concretely: |
| 92 | + |
| 93 | + - remove the `debug_assert_eq!(left, right)` fallback |
| 94 | + - treat unmodelled combinations as "unknown to syntactic evaluator" |
| 95 | + - let the caller fall back to `authoritative_restrict_null_predicate(...)` |
| 96 | + |
| 97 | +2. Tighten when `syntactic_restrict_null_predicate(...)` is allowed to return `Some(...)`. |
| 98 | + |
| 99 | + Audit the current supported nodes in |
| 100 | + [`datafusion/optimizer/src/utils/null_restriction.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs) |
| 101 | + and make sure it only returns a definitive answer for expressions whose null-substitution |
| 102 | + result is proven by construction. |
| 103 | + |
| 104 | + In particular, re-check: |
| 105 | + |
| 106 | + - `AND` / `OR` combinations of partially-known boolean states |
| 107 | + - wrappers like `IS [NOT] NULL` |
| 108 | + - casts / negation / LIKE |
| 109 | + - interactions where one side is known boolean and the other side is only "non-null" |
| 110 | + |
| 111 | +3. Keep the mixed-reference early return from Sub-issue B. |
| 112 | + |
| 113 | + Do **not** remove the cheap bailout in |
| 114 | + [`datafusion/optimizer/src/utils.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs#L135). |
| 115 | + That early `Ok(false)` for predicates referencing columns outside the join-key set is still |
| 116 | + the intended optimization and is not the leading suspect for this regression. |
| 117 | + |
| 118 | +4. Do not reopen the earlier semantic fixes. |
| 119 | + |
| 120 | + Preserve: |
| 121 | + |
| 122 | + - the caller split between `is_restrict_null_predicate(...)` and `evaluates_to_null(...)` |
| 123 | + - the scalar-subquery / cross-join `PushDownFilter` fixes already captured in earlier plans |
| 124 | + |
| 125 | + This fix should stay narrowly focused on the restored syntactic fast path. |
| 126 | + |
| 127 | +5. Add differential regression tests that would have caught this. |
| 128 | + |
| 129 | + In [`datafusion/optimizer/src/utils.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs), |
| 130 | + extend the fast-path-vs-authoritative tests so they cover **composed boolean predicates**, |
| 131 | + not just simple scalar cases. |
| 132 | + |
| 133 | + Add reduced cases for: |
| 134 | + |
| 135 | + - `AND` with one identity side and one partially-known side |
| 136 | + - `OR` with one identity side and one partially-known side |
| 137 | + - combinations involving `IS NULL` / `IS NOT NULL` |
| 138 | + - CASE-derived boolean predicates if they can now reach the syntactic path indirectly |
| 139 | + |
| 140 | + The invariant should be: |
| 141 | + |
| 142 | + - if syntactic evaluation returns `Some(x)`, authoritative evaluation must also return `x` |
| 143 | + - otherwise syntactic evaluation must return `None` |
| 144 | + |
| 145 | +6. Add one optimizer-level repro derived from real failing traffic. |
| 146 | + |
| 147 | + Because sqllogictest surfaced the panic through parallel worker tasks, add at least one |
| 148 | + optimizer-level regression that exercises `PushDownFilter` null-restriction analysis on a |
| 149 | + realistic composed predicate instead of relying only on hand-picked helper-unit cases. |
| 150 | + |
| 151 | + Preferred location: |
| 152 | + |
| 153 | + - [`datafusion/optimizer/src/push_down_filter.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/push_down_filter.rs) |
| 154 | + - or an optimizer integration test if that is easier to minimize |
| 155 | + |
| 156 | +7. Reproduce with backtrace once the workspace compiles cleanly. |
| 157 | + |
| 158 | + This checkout currently has an unrelated compile blocker in |
| 159 | + [`datafusion/physical-expr-common/src/metrics/baseline.rs`](/Users/kosiew/GitHub/datafusion/datafusion/physical-expr-common/src/metrics/baseline.rs) |
| 160 | + (`MetricType::Dev` vs `MetricType::DEV`), which prevented a local sqllogictest backtrace. |
| 161 | + |
| 162 | + After that is resolved, rerun the smallest useful reproductions first: |
| 163 | + |
| 164 | + ```bash |
| 165 | + cargo test -p datafusion-optimizer -- utils |
| 166 | + cargo test -p datafusion-optimizer -- push_down_filter |
| 167 | + cargo test -p datafusion-sqllogictest --test sqllogictests -- subquery.slt |
| 168 | + cargo test -p datafusion-sqllogictest --test sqllogictests -- union.slt |
| 169 | + cargo test -p datafusion-sqllogictest --test sqllogictests -- aggregate.slt |
| 170 | + ``` |
| 171 | + |
| 172 | +8. Finish with required repo checks. |
| 173 | + |
| 174 | + ```bash |
| 175 | + cargo fmt --all |
| 176 | + cargo clippy --all-targets --all-features -- -D warnings |
| 177 | + ``` |
| 178 | + |
| 179 | +## Expected code touch points |
| 180 | + |
| 181 | +- [`datafusion/optimizer/src/utils/null_restriction.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils/null_restriction.rs) |
| 182 | +- [`datafusion/optimizer/src/utils.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/utils.rs) |
| 183 | +- possibly one optimizer regression test in |
| 184 | + [`datafusion/optimizer/src/push_down_filter.rs`](/Users/kosiew/GitHub/datafusion/datafusion/optimizer/src/push_down_filter.rs) |
| 185 | + |
| 186 | +## Short version |
| 187 | + |
| 188 | +The likely regression is that the restored syntactic null-restriction evaluator is no longer |
| 189 | +fully conservative: it reaches mixed boolean states it cannot soundly model, then asserts |
| 190 | +instead of declining and falling back to authoritative evaluation. The fix is to make that |
| 191 | +fast path conservative again: unsupported or mixed states must return `None`, not panic and |
| 192 | +not guess. |
0 commit comments