feat(gfql/polars): native toFloat, collect/collect_distinct, WHERE IN — NIE->native

lmeyerov · claude · lmeyerov · commit 1609331ec76b · 2026-06-30T18:14:44.000-07:00
Three previously-NIE cypher row surfaces now run natively on engine='polars',
parity-validated vs the pandas oracle across pandas/cudf/polars/polars-gpu:

- toFloat(x): int/uint/bool/float -> Float64 (NaN preserved; no fillna step,
  unlike toInteger, because float64 has no separate null sentinel). A non-numeric
  String declines (NIE) because pandas astype(float) RAISES rather than nulling
  on failure.
- collect(x) / collect(DISTINCT x) aggregations complete the native group_by
  surface: drop nulls, preserve within-group first-occurrence order (collect
  keeps dups, DISTINCT dedups keep-first), all-null group -> []. Lowered as
  drop_nulls() / drop_nulls().unique(maintain_order=True), with no .implode().
- where_rows / WHERE ... IN [list] membership -> is_in (null cell excluded, 3VL).
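
For readers of the diff, the cell-level semantics of the toFloat and IN bullets
can be sketched in plain Python. This is a reference model only, not the polars
lowering: None stands in for null, the IN list is assumed to contain no nulls,
and the helper names (to_float_cell, in_3vl, where_in) are hypothetical, not
graphistry or polars APIs.

```python
import math
from typing import Any, Iterable, List, Optional


def to_float_cell(x: Any) -> Optional[float]:
    """Model toFloat on one cell: null stays null, NaN stays NaN, bool/int/float widen."""
    if x is None:
        return None
    if isinstance(x, (bool, int, float)):
        return float(x)  # True/False -> 1.0/0.0; NaN passes through unchanged
    # Anything else (e.g. a string) is declined instead of being turned into null.
    raise NotImplementedError("toFloat on non-numeric input is declined")


def in_3vl(x: Any, items: Iterable[Any]) -> Optional[bool]:
    """Three-valued membership: a null left operand yields null, not False."""
    if x is None:
        return None
    return x in list(items)


def where_in(rows: List[dict], col: str, items: Iterable[Any]) -> List[dict]:
    """A WHERE filter keeps only rows whose predicate is exactly True."""
    wanted = list(items)
    return [r for r in rows if in_3vl(r[col], wanted) is True]


rows = [{"id": 0, "k": "a"}, {"id": 1, "k": None}, {"id": 2, "k": "c"}]
kept = where_in(rows, "k", ["a", "b"])          # the null-k row is excluded
floats = [to_float_cell(v) for v in [1, True, 2.5, None]]
nan_out = to_float_cell(float("nan"))           # stays NaN, is not nulled
```

The model makes the two design points visible: a null cell never satisfies IN,
and an unsupported toFloat input fails loudly rather than producing a null.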

Removed the stale tofloat conformance-ledger waiver; +tofloat matrix cases,
tofloat-string-NIE test, collect parity test. Validated on dgx: 4242 gfql tests
pass, conformance matrix+ledger+row-pipeline green, 4-engine parity, mypy clean.
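
The collect contract that the parity test asserts can be modeled in plain
Python. This is a sketch under stated assumptions, not the polars or pandas
implementation: None stands in for null and collect_by_group is a hypothetical
helper. The input reuses the k/v columns from the collect parity test.

```python
from typing import Any, Dict, Hashable, Iterable, List, Tuple


def collect_by_group(
    pairs: Iterable[Tuple[Hashable, Any]], distinct: bool = False
) -> Dict[Hashable, List[Any]]:
    """Group values by key, dropping nulls and keeping first-occurrence order."""
    out: Dict[Hashable, List[Any]] = {}
    for key, value in pairs:
        bucket = out.setdefault(key, [])  # the group exists even if every value is null
        if value is None:
            continue  # collect drops nulls
        if distinct and value in bucket:
            continue  # DISTINCT keeps only the first occurrence
        bucket.append(value)
    return out


keys = ["a", "a", "a", "b", "b", "c"]
values = ["x", "w", None, "y", "y", None]
pairs = list(zip(keys, values))

collected = collect_by_group(pairs)
collected_distinct = collect_by_group(pairs, distinct=True)
```

Group "c" has only a null value, so both variants return an empty list for it
rather than a list containing null.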

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -16,6 +16,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
 - **GFQL lazy Polars engine + GPU target (`engine='polars-gpu'`, cudf_polars)**: The Polars traversal engine now builds a single deferred `pl.LazyFrame` plan per single-hop and materializes `out_edges`+`out_nodes` in ONE `collect_all` on a chosen **execution target** (CPU or GPU). `engine='polars-gpu'` (`Engine.POLARS_GPU`, explicit opt-in only — AUTO never selects it) runs that same lazy plan on the RAPIDS cudf_polars backend (`pl.GPUEngine(raise_on_fail=True)` — NO-CHEATING: a GPU-incapable plan node **raises** rather than silently running on CPU and being reported as a GPU result; see Fixed). The collect-once design is what makes GPU pay off: a benchmark showed per-op eager GPU collect was a *regression* (repeated H2D), while collect-once is a **2.84× single-hop GPU win @1M** with CPU parity. Frames stay `pl.DataFrame` (handled like `POLARS` everywhere); the target is carried by a context var set at the chain/hop dispatch boundary, so `engine='polars'` (CPU) is byte-for-byte unchanged. Validated by differential parity `engine='polars-gpu' == engine='polars'` across the cypher conformance corpus + traversals (`test_engine_polars_gpu.py`, skips when no cudf_polars/GPU). Multi-hop and the chain forward/backward fusion (where the GPU win currently dilutes) are follow-up optimizations.
 
 ### Added
+- **GFQL native Polars engine — more cypher row coverage (`toFloat`, `collect`/`collect(DISTINCT)`, `WHERE … IN`)**: three surfaces that previously raised `NotImplementedError` on `engine='polars'` now run natively, parity-validated vs the pandas oracle across all four engines (and honest-NIE where pandas can't be matched). **`toFloat(x)`** lowers int/uint/bool/float → `Float64` (NaN preserved — float64 has no separate null sentinel, unlike `toInteger`); a non-numeric String declines (NIE) because pandas `astype(float)` *raises* rather than null-on-failure. **`collect(x)` / `collect(DISTINCT x)`** aggregations complete the native `group_by` surface (every other agg was already native): drop nulls, preserve within-group first-occurrence order (`collect` keeps dups; `DISTINCT` dedups keep-first), all-null group → `[]`. **`where_rows`/`WHERE … IN [list]`** membership lowers to `is_in` (a null cell is excluded per openCypher 3VL). No change to any already-native path.
 - **GFQL Polars-CPU streaming collect (opt-in, large traversals)**: `GFQL_POLARS_CPU_STREAMING=1` runs the polars-CPU lazy collects (`hop`/`chain`) on the polars **streaming** executor instead of the default in-memory collect. Benchmarked ~1.04–1.11× faster on big multi-hop traversals (10M nodes / 80M edges: 20.0→18.0 s) and parity-identical, but ~0.86× (slower) on small/interactive sizes (streaming overhead) — so it is **opt-in, default off** (no change to default behavior). Use for large batch traversals where CPU is the target.
 
 ### Fixed
diff --git a/graphistry/compute/gfql/lazy/engine/polars/row_pipeline.py b/graphistry/compute/gfql/lazy/engine/polars/row_pipeline.py
@@ -190,6 +190,21 @@ def _lower_function(node: Any, columns: Sequence[str]) -> Optional[Any]:
         # String: pandas astype(float) RAISES on non-numeric content (NOT null-on-failure),
         # which polars strict=False would silently turn into nulls — a divergence. DECLINE (NIE).
         return None
+    if name == "tofloat" and len(args) == 1:
+        import polars as pl
+        # cypher toFloat: pandas oracle = inner.astype(float) with the isna() null_mask restored
+        # via .where(~mask, pd.NA). CRUCIALLY there is NO .fillna(0)/int step (contrast toInteger):
+        # float64 has no separate null sentinel, so an isna()-masked NaN re-materializes as NaN —
+        # NaN is PRESERVED, not nulled. A plain cast preserves both NaN and null, so NO explicit
+        # NaN mask is needed. Admit only dtypes whose pandas astype(float) polars reproduces:
+        dt = _expr_output_dtype(args[0])
+        if _dtype_is_int(dt) or dt == pl.Boolean or _dtype_is_float(dt):
+            # Int/UInt/Bool/Float -> Float64: exact IEEE widening (bool True/False -> 1.0/0.0;
+            # nulls preserved; NaN preserved). Matches inner.astype(float) on pandas.
+            return args[0].cast(pl.Float64)
+        # String: pandas astype(float) RAISES on non-numeric content (data-dependent, NOT
+        # null-on-failure); polars strict=False would silently null -> divergence. DECLINE (NIE).
+        return None
     if name == "toboolean" and len(args) == 1:
         import polars as pl
         # cypher toBoolean: the pandas oracle parses a fixed token set ("true"/"t"/"1"/"yes" vs
@@ -663,9 +678,15 @@ def where_rows_polars(
     preds: List[Any] = []
     if filter_dict:
         for col, val in filter_dict.items():
-            if col not in columns or isinstance(val, (list, tuple, set, dict)):
-                return None  # missing column / IN-list etc. -> defer (NIE)
-            preds.append(pl.col(col) == val)
+            if col not in columns or isinstance(val, dict):
+                return None  # missing column / nested-struct value -> defer (NIE)
+            if isinstance(val, (list, tuple, set)):
+                # membership / IN: polars `is_in` over a null cell yields null -> filter drops it,
+                # i.e. openCypher 3VL (`null IN [...]` = null -> excluded), matching the filter_by_dict
+                # membership fix. (Equality below also drops nulls: `null == v` -> null -> dropped.)
+                preds.append(pl.col(col).is_in(list(val)))
+            else:
+                preds.append(pl.col(col) == val)
     if expr is not None:
         if not isinstance(expr, str):
             return None
@@ -694,8 +715,8 @@ def order_by_polars(g: Plottable, keys: Sequence[Any]) -> Optional[Plottable]:
     return _rewrap(g, table.sort(exprs, descending=descending, nulls_last=True))
 
 
-# Aggregation funcs lowered to native polars; collect/collect_distinct/stdev/
-# percentile etc. return None → caller declines (NIE, no pandas bridge).
+# Aggregation funcs lowered to native polars (count/sum/avg/min/max/count_distinct/collect/
+# collect_distinct); stdev/percentile etc. return None → caller declines (NIE, no pandas bridge).
 def _agg_expr(func: str, expr: Optional[str], columns: Sequence[str], alias: str) -> Optional[Any]:
     import polars as pl
     func = func.lower()
@@ -718,6 +739,19 @@ def _agg_expr(func: str, expr: Optional[str], columns: Sequence[str], alias: str
         # cypher count(DISTINCT x) drops nulls (pandas nunique(dropna=True)); polars n_unique()
         # counts null as a value, so drop nulls first for parity.
         return col.drop_nulls().n_unique().alias(alias)
+    if func == "collect":
+        # cypher collect(x) DROPS nulls and preserves within-group row order (pandas
+        # row/pipeline.py:4552-4582 filters ~isna() then agg(list)). In a polars
+        # group_by(maintain_order=True).agg, a multi-valued expr yields a List column, so
+        # drop_nulls() alone reproduces it; an all-null/empty group yields [] (an empty list),
+        # never [null] — matching the oracle's []-coercion (4597-4614). NO .implode() (that would
+        # double-wrap to List(List)).
+        return col.drop_nulls().alias(alias)
+    if func == "collect_distinct":
+        # collect(DISTINCT x): drop nulls, dedup keep-first preserving first-occurrence order
+        # (pandas drop_duplicates(keep="first") + agg(list)). polars unique(maintain_order=True)
+        # is keep-first order-preserving; empty/all-null group -> [].
+        return col.drop_nulls().unique(maintain_order=True).alias(alias)
     return None
 
 
diff --git a/graphistry/tests/compute/gfql/test_conformance_ledger.py b/graphistry/tests/compute/gfql/test_conformance_ledger.py
@@ -176,7 +176,6 @@ def test_known_uncovered_reasons_are_nonempty():
 # honest one-liner (all currently honest-NIE-or-unasserted; none has a dedicated test that the
 # parser misses, unlike the predicate temporal entries).
 KNOWN_UNCOVERED_FUNCTIONS: dict[str, str] = {
-    "tofloat": "pandas-native (astype float); polars _lower_function has NO branch -> honest NIE, not yet asserted. TODO: add a tofloat native-or-NIE case.",
     "keys": "map/entity key-extraction; polars declines (no _lower_function branch) -> NIE; not yet asserted. TODO.",
     "labels": "node-label text function; polars declines -> NIE; not yet asserted. TODO.",
     "type": "edge-type function; polars declines -> NIE; not yet asserted. TODO.",
diff --git a/graphistry/tests/compute/gfql/test_engine_polars_conformance_matrix.py b/graphistry/tests/compute/gfql/test_engine_polars_conformance_matrix.py
@@ -263,6 +263,9 @@ def _cypher_expression_queries():
         ("tointeger_int", "MATCH (n) RETURN n.id AS id, toInteger(n.num) AS i"),
         ("tointeger_float", "MATCH (n) RETURN n.id AS id, toInteger(n.f) AS i"),
         ("tointeger_bool", "MATCH (n) RETURN n.id AS id, toInteger(n.flag) AS i"),
+        ("tofloat_int", "MATCH (n) RETURN n.id AS id, toFloat(n.num) AS f"),
+        ("tofloat_float", "MATCH (n) RETURN n.id AS id, toFloat(n.f) AS f"),
+        ("tofloat_bool", "MATCH (n) RETURN n.id AS id, toFloat(n.flag) AS f"),
         ("toboolean_bool", "MATCH (n) RETURN n.id AS id, toBoolean(n.flag) AS b"),
         ("tostring_bool", "MATCH (n) RETURN n.id AS id, toString(n.flag) AS s"),
         ("tostring_int", "MATCH (n) RETURN n.id AS id, toString(n.num) AS s"),
@@ -306,6 +309,41 @@ def test_substring_runs_natively_on_polars():
     assert res == base, f"substring polars must match pandas oracle: {res} != {base}"
 
 
+def test_tofloat_string_honest_nie_polars():
+    """toFloat(<non-numeric String>) RAISES on the pandas oracle (astype(float) fails) — NOT
+    null-on-failure. polars MUST decline with an honest NIE rather than fabricate strict=False
+    nulls. (Int/Float/Bool toFloat is native + parity — covered by the matrix tofloat_* cases.)"""
+    if "polars" not in _NONPANDAS_ENGINES:
+        pytest.skip("polars not installed")
+    g = _graph(4)
+    q = "MATCH (n) RETURN n.id AS id, toFloat(n.name) AS f"
+    assert _run(g, q, "pandas")[0] != "ok", "pandas toFloat(non-numeric string) raises (not null-on-failure)"
+    assert _run(g, q, "polars")[0] == "nie", "string toFloat must be an honest NIE on polars (no strict=False null fabrication)"
+
+
+def test_collect_aggregations_native_parity_polars():
+    """collect(x) / collect(DISTINCT x) lower NATIVELY on polars and match pandas: drop nulls,
+    preserve within-group order (collect keeps dups; distinct dedups keep-first), all-null group
+    -> []. List cells normalized to python lists for the cross-engine compare."""
+    if "polars" not in _NONPANDAS_ENGINES:
+        pytest.skip("polars not installed")
+    import pandas as pd
+    e = pd.DataFrame({"s": [0, 1, 2, 3, 4, 5], "d": [1, 2, 3, 4, 5, 6],
+                      "k": ["a", "a", "a", "b", "b", "c"], "v": ["x", "w", None, "y", "y", None]})
+    g = graphistry.edges(e, "s", "d")
+
+    def collected(engine, q):
+        df = _to_pd(g.gfql(q, engine=engine)._nodes).sort_values("k")
+        return {r["k"]: list(r["vs"]) for _, r in df.iterrows()}
+
+    for q, expected in [
+        ("MATCH ()-[r]->() RETURN r.k AS k, collect(r.v) AS vs", {"a": ["x", "w"], "b": ["y", "y"], "c": []}),
+        ("MATCH ()-[r]->() RETURN r.k AS k, collect(DISTINCT r.v) AS vs", {"a": ["x", "w"], "b": ["y"], "c": []}),
+    ]:
+        assert collected("pandas", q) == expected
+        assert collected("polars", q) == expected
+
+
 def test_size_list_runs_natively_on_polars():
     """size(<List column>) MUST lower NATIVELY (list.len). The List column is built by
     the already-native list-literal with_ so the operand dtype is List, and size is an