fix(gfql): harden optional-arm OR null-safety token detection (#1219) (#1286)

lmeyerov · web-flow · commit c069e8f70844 · 2026-05-03T23:18:22.000-07:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -33,6 +33,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
 - **GFQL / Cypher lowering — #1273 acceptance slice admits scalar multi-alias WITH shape A**: First-stage bounded reentry now admits `MATCH ... WITH a.id AS a_id, b.id AS b_id ... MATCH ...` multi-alias scalar/property projections (shape A) while keeping whole-row multi-alias carries (for example `WITH a, b`) outside the admitted lane. The admission gate is explicitly narrowed to scalar/property projections only, so known TCK xfail contract shape `WITH n, x` (`with-where3-3`) remains fail-fast and unchanged pending broader #1273 closure. Tests were converted to positive execution assertions for the admitted shape, and remaining rejected shapes continue to point explicitly to #1273.
 - **GFQL / Cypher row-boolean residual topology amplification (#1219)**: Added two cross-alias OR stress tests in `test_lowering.py` for previously under-covered topology classes: (1) disconnected cartesian-product `MATCH (a:A), (b:B)` with `WHERE a.x = 1 OR b.y = 2`, and (2) broader two-hop fanout `MATCH (a:A)-->(m:M)-->(b:B)` with the same disjunction. These lock row-wise disjunction semantics over wider join/fanout shapes on top of the existing 2x2 cross-alias union test.
 - **GFQL / predicate pushdown safety — quote-aware null-safety classification hardening (#1219)**: `is_null_rejecting()` now masks quoted/backticked segments before scanning for boolean operators and null-safe forms (`IS NULL`, `IS NOT NULL`, `COALESCE`, `NULLIF`). This closes false-safe substring matches like `n.note = 'x is not null'` / `'coalesce('` that could previously be misclassified as null-safe for OPTIONAL-arm pushdown. Added regression coverage in `test_pushdown_safety.py` and pass-level lock in `test_predicate_pushdown_pass.py` to ensure quoted literal text does not trigger null-safe admission.
+- **GFQL / predicate pushdown safety — whitespace-delimited OR detection hardening (#1219)**: `is_null_rejecting()` now normalizes unquoted whitespace and detects boolean operators via word-boundary token checks instead of exact `" and "` / `" or "` substrings. This closes a false-safe gap where newline/tab-delimited OR compounds (for example `n IS NULL\\nOR n.age > 5`) could bypass conservative OR handling and be incorrectly pushed into OPTIONAL arms. Added regression coverage in `test_pushdown_safety.py` plus a pass-level lock in `test_predicate_pushdown_pass.py`.
 - **GFQL / Cypher lowering — runtime carry forwarding through trailing WITH narrowing (#1272, #999 partial)**: Fixed the IC5 sub-A collect/unwind reentry residual where carried scalars could degrade to null/NaN across a trailing `WITH d, bid` stage. Hidden reentry-property references are now excluded from one-source alias guard counting while still forcing bindings-row lowering for stages that reference hidden carry columns, so carry fields survive narrowing and remain readable in subsequent RETURN/ORDER BY. Updated regression coverage by converting the two #1272 failfast locks into positive execution assertions for both plain trailing-WITH and ORDER BY/LIMIT variants.
 - **GFQL / Cypher lowering — residual multi-source row-projection failfast now points to issue #1273**: Remaining unsupported one-MATCH-source row-lowering residuals now raise the same `E108` boundary with an explicit `#1273` pointer so rejected shapes route directly to the tracked closure lane. Updated rejection-lock tests assert the issue pointer is present in emitted validation messages.
 - **GFQL / Cypher lowering — admit simple free-form intermediate MATCH after WITH (#1263, #999 partial)**: Lifts the failfast at the `first_alias != reentry_alias` site of `_compile_bounded_reentry_query` for the conservative case: trailing MATCH whose first alias is not in the prefix WITH's carried whole-row set, with no carried-alias property reference in the trailing scope. Three-commit shape: (1) IR adds `ReentryPlan.free_form: bool = False` to mark the new shape (`graphistry/compute/gfql/cypher/reentry_plan.py`); (2) compile sets `reentry_alias = first_alias` and `non_source_alias_names = whole_row_carried` so the trailing MATCH compiles as a regular fresh-MATCH and the existing property-carry rewriter (#1248) treats every carried alias as non-source (`graphistry/compute/gfql/cypher/lowering.py`); (3) runtime adds `_compiled_query_freeform_reentry_state` in `gfql_unified.py` — for a single-prefix-row, broadcasts every `__cypher_reentry_*` hidden column from the prefix row onto every base node and returns `start_nodes=None` so the suffix runs as a global MATCH and the row pipeline carries the broadcast values through whichever alias the trailing MATCH binds. Compile-time failfast added for the carried-property variant (`free_form` AND any `<carried>.<prop>` ref in trailing scope) — the existing demote (#1071) + property-rewriter (#1248) chain double-wraps the synthesized hidden column name, and closing that case requires a rewrite-order refactor that lands as its own focused follow-up. Runtime-side multi-prefix-row free-form raises a clear failfast pointing at the multi-row follow-up. Tests retargeted: `..._failfast_rejects_intermediate_reentry_match_with_no_carried_source` → `..._failfast_rejects_intermediate_reentry_match_with_carried_property_in_trailing_where`; `..._failfast_rejects_simple_freeform_intermediate_reentry_match` → `..._executes_simple_freeform_intermediate_reentry_match`. Adds cuDF parity test and a multi-prefix-row failfast regression. Three TCK xfail-contract scenarios admitted to `MATCHES_EXPECTED` (`expr-typeconversion2-7`, `expr-typeconversion3-5`, `expr-typeconversion4-7` — all `MATCH (single-node) WITH * MATCH (n) RETURN <typecast>(n.<prop>)` shape) (#1263, #999, #989).
diff --git a/graphistry/compute/gfql/ir/pushdown_safety.py b/graphistry/compute/gfql/ir/pushdown_safety.py
@@ -14,6 +14,7 @@
 """
 from __future__ import annotations
 
+import re
 from typing import FrozenSet, Sequence
 
 from graphistry.compute.gfql.ir.bound_ir import ScopeFrame
@@ -62,6 +63,11 @@ def _mask_quoted_and_backticked(expr: str) -> str:
     return "".join(chars)
 
 
+def _normalize_unquoted_whitespace(expr: str) -> str:
+    """Collapse unquoted whitespace runs for stable token/substr checks."""
+    return re.sub(r"\s+", " ", expr).strip()
+
+
 def is_null_rejecting(
     predicate: BoundPredicate,
     null_extended_aliases: FrozenSet[str],
@@ -96,14 +102,16 @@ def is_null_rejecting(
         return False
     if not predicate.expression:
         return True
-    expr_lower = _mask_quoted_and_backticked(predicate.expression.lower())
+    expr_lower = _normalize_unquoted_whitespace(
+        _mask_quoted_and_backticked(predicate.expression.lower())
+    )
     # AND compounds are always null-rejecting: even if one conjunct is null-safe,
     # the other may not be, and True AND NULL = NULL (row filtered).
-    if " and " in expr_lower:
+    if re.search(r"\band\b", expr_lower):
         return True
     # OR compounds are conservatively treated as null-rejecting; we do not
     # perform disjunct-level alias analysis in this helper.
-    if " or " in expr_lower:
+    if re.search(r"\bor\b", expr_lower):
         return True
     for form in _NULL_SAFE_FORMS:
         if form in expr_lower:
diff --git a/graphistry/tests/compute/gfql/test_predicate_pushdown_pass.py b/graphistry/tests/compute/gfql/test_predicate_pushdown_pass.py
@@ -152,6 +152,31 @@ def test_predicate_pushdown_keeps_or_with_null_safe_subexpression_on_optional_ar
     assert result.predicate.expression == "n IS NOT NULL OR m.flag = 1"
 
 
+def test_predicate_pushdown_keeps_newline_delimited_or_on_optional_arm() -> None:
+    input_scan = NodeScan(op_id=1, label="Person", output_schema=_schema("m"))
+    optional_match = PatternMatch(
+        op_id=2,
+        input=input_scan,
+        pattern={"aliases": ("n",)},
+        optional=True,
+        arm_id="arm1",
+        output_schema=_schema("m", "n"),
+    )
+    plan = Filter(
+        op_id=3,
+        input=optional_match,
+        predicate=_pred("n IS NULL\nOR n.age > 5", frozenset({"n"})),
+        output_schema=_schema("m", "n"),
+    )
+
+    result = PredicatePushdownPass().run(plan, PlanContext()).plan
+
+    assert isinstance(result, Filter)
+    assert isinstance(result.input, PatternMatch)
+    assert result.input.predicates == []
+    assert result.predicate.expression == "n IS NULL\nOR n.age > 5"
+
+
 def test_predicate_pushdown_does_not_treat_null_keywords_inside_string_literals_as_safe() -> None:
     input_scan = NodeScan(op_id=1, label="Person", output_schema=_schema("m"))
     optional_match = PatternMatch(
diff --git a/graphistry/tests/compute/gfql/test_pushdown_safety.py b/graphistry/tests/compute/gfql/test_pushdown_safety.py
@@ -127,6 +127,18 @@ def test_or_without_null_safe_form_is_rejecting(self):
             frozenset({"n"}),
         )
 
+    def test_or_with_newline_delimiter_is_rejecting(self):
+        assert is_null_rejecting(
+            _pred("n IS NULL\nOR n.age > 5", frozenset({"n"})),
+            frozenset({"n"}),
+        )
+
+    def test_or_with_tab_delimiter_is_rejecting(self):
+        assert is_null_rejecting(
+            _pred("n IS NULL\tOR\tn.age > 5", frozenset({"n"})),
+            frozenset({"n"}),
+        )
+
     # --- property-level IS NULL --- #
 
     def test_property_is_null_is_not_rejecting(self):