Skip to content

Commit 1609331

Browse files
lmeyerovclaude
andcommitted
feat(gfql/polars): native toFloat, collect/collect_distinct, WHERE IN — NIE->native
Three previously-NIE cypher row surfaces now run natively on engine='polars', parity-validated vs the pandas oracle across pandas/cudf/polars/polars-gpu: - toFloat(x): int/uint/bool/float -> Float64 (NaN preserved; no fillna step, unlike toInteger — float64 has no null sentinel). Non-numeric String declines (NIE) because pandas astype(float) RAISES, not null-on-failure. - collect(x) / collect(DISTINCT x) aggregations complete the native group_by surface: drop nulls, preserve within-group first-occurrence order (collect keeps dups, DISTINCT dedups keep-first), all-null group -> []. drop_nulls() /unique(maintain_order=True), no .implode(). - where_rows / WHERE ... IN [list] membership -> is_in (null cell excluded, 3VL). Removed the stale tofloat conformance-ledger waiver; +tofloat matrix cases, tofloat-string-NIE test, collect parity test. Validated on dgx: 4242 gfql tests pass, conformance matrix+ledger+row-pipeline green, 4-engine parity, mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent e295ddf commit 1609331

4 files changed

Lines changed: 78 additions & 6 deletions

File tree

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
1616
- **GFQL lazy Polars engine + GPU target (`engine='polars-gpu'`, cudf_polars)**: The Polars traversal engine now builds a single deferred `pl.LazyFrame` plan per single-hop and materializes `out_edges`+`out_nodes` in ONE `collect_all` on a chosen **execution target** (CPU or GPU). `engine='polars-gpu'` (`Engine.POLARS_GPU`, explicit opt-in only — AUTO never selects it) runs that same lazy plan on the RAPIDS cudf_polars backend (`pl.GPUEngine(raise_on_fail=True)` — NO-CHEATING: a GPU-incapable plan node **raises** rather than silently running on CPU and being reported as a GPU result; see Fixed). The collect-once design is what makes GPU pay off: a benchmark showed per-op eager GPU collect was a *regression* (repeated H2D), while collect-once is a **2.84× single-hop GPU win @1M** with CPU parity. Frames stay `pl.DataFrame` (handled like `POLARS` everywhere); the target is carried by a context var set at the chain/hop dispatch boundary, so `engine='polars'` (CPU) is byte-for-byte unchanged. Validated by differential parity `engine='polars-gpu' == engine='polars'` across the cypher conformance corpus + traversals (`test_engine_polars_gpu.py`, skips when no cudf_polars/GPU). Multi-hop and the chain forward/backward fusion (where the GPU win currently dilutes) are follow-up optimizations.
1717

1818
### Added
19+
- **GFQL native Polars engine — more cypher row coverage (`toFloat`, `collect`/`collect(DISTINCT)`, `WHERE … IN`)**: three surfaces that previously raised `NotImplementedError` on `engine='polars'` now run natively, parity-validated vs the pandas oracle across all four engines (and honest-NIE where pandas can't be matched). **`toFloat(x)`** lowers int/uint/bool/float → `Float64` (NaN preserved — float64 has no separate null sentinel, unlike `toInteger`); a non-numeric String declines (NIE) because pandas `astype(float)` *raises* rather than null-on-failure. **`collect(x)` / `collect(DISTINCT x)`** aggregations complete the native `group_by` surface (every other agg was already native): drop nulls, preserve within-group first-occurrence order (`collect` keeps dups; `DISTINCT` dedups keep-first), all-null group → `[]`. **`where_rows`/`WHERE … IN [list]`** membership lowers to `is_in` (a null cell is excluded per openCypher 3VL). No change to any already-native path.
1920
- **GFQL Polars-CPU streaming collect (opt-in, large traversals)**: `GFQL_POLARS_CPU_STREAMING=1` runs the polars-CPU lazy collects (`hop`/`chain`) on the polars **streaming** executor instead of the default in-memory collect. Benchmarked ~1.04–1.11× faster on big multi-hop traversals (10M nodes / 80M edges: 20.0→18.0 s) and parity-identical, but ~0.86× (slower) on small/interactive sizes (streaming overhead) — so it is **opt-in, default off** (no change to default behavior). Use for large batch traversals where CPU is the target.
2021

2122
### Fixed

graphistry/compute/gfql/lazy/engine/polars/row_pipeline.py

Lines changed: 39 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -190,6 +190,21 @@ def _lower_function(node: Any, columns: Sequence[str]) -> Optional[Any]:
190190
# String: pandas astype(float) RAISES on non-numeric content (NOT null-on-failure),
191191
# which polars strict=False would silently turn into nulls — a divergence. DECLINE (NIE).
192192
return None
193+
if name == "tofloat" and len(args) == 1:
194+
import polars as pl
195+
# cypher toFloat: pandas oracle = inner.astype(float) with the isna() null_mask restored
196+
# via .where(~mask, pd.NA). CRUCIALLY there is NO .fillna(0)/int step (contrast toInteger):
197+
# float64 has no separate null sentinel, so an isna()-masked NaN re-materializes as NaN —
198+
# NaN is PRESERVED, not nulled. A plain cast preserves both NaN and null, so NO explicit
199+
# NaN mask is needed. Admit only dtypes whose pandas astype(float) polars reproduces:
200+
dt = _expr_output_dtype(args[0])
201+
if _dtype_is_int(dt) or dt == pl.Boolean or _dtype_is_float(dt):
202+
# Int/UInt/Bool/Float -> Float64: exact IEEE widening (bool True/False -> 1.0/0.0;
203+
# nulls preserved; NaN preserved). Matches inner.astype(float) on pandas.
204+
return args[0].cast(pl.Float64)
205+
# String: pandas astype(float) RAISES on non-numeric content (data-dependent, NOT
206+
# null-on-failure); polars strict=False would silently null -> divergence. DECLINE (NIE).
207+
return None
193208
if name == "toboolean" and len(args) == 1:
194209
import polars as pl
195210
# cypher toBoolean: the pandas oracle parses a fixed token set ("true"/"t"/"1"/"yes" vs
@@ -663,9 +678,15 @@ def where_rows_polars(
663678
preds: List[Any] = []
664679
if filter_dict:
665680
for col, val in filter_dict.items():
666-
if col not in columns or isinstance(val, (list, tuple, set, dict)):
667-
return None # missing column / IN-list etc. -> defer (NIE)
668-
preds.append(pl.col(col) == val)
681+
if col not in columns or isinstance(val, dict):
682+
return None # missing column / nested-struct value -> defer (NIE)
683+
if isinstance(val, (list, tuple, set)):
684+
# membership / IN: polars `is_in` over a null cell yields null -> filter drops it,
685+
# i.e. openCypher 3VL (`null IN [...]` = null -> excluded), matching the filter_by_dict
686+
# membership fix. (Equality below also drops nulls: `null == v` -> null -> dropped.)
687+
preds.append(pl.col(col).is_in(list(val)))
688+
else:
689+
preds.append(pl.col(col) == val)
669690
if expr is not None:
670691
if not isinstance(expr, str):
671692
return None
@@ -694,8 +715,8 @@ def order_by_polars(g: Plottable, keys: Sequence[Any]) -> Optional[Plottable]:
694715
return _rewrap(g, table.sort(exprs, descending=descending, nulls_last=True))
695716

696717

697-
# Aggregation funcs lowered to native polars; collect/collect_distinct/stdev/
698-
# percentile etc. return None → caller declines (NIE, no pandas bridge).
718+
# Aggregation funcs lowered to native polars (count/sum/avg/min/max/count_distinct/collect/
719+
# collect_distinct); stdev/percentile etc. return None → caller declines (NIE, no pandas bridge).
699720
def _agg_expr(func: str, expr: Optional[str], columns: Sequence[str], alias: str) -> Optional[Any]:
700721
import polars as pl
701722
func = func.lower()
@@ -718,6 +739,19 @@ def _agg_expr(func: str, expr: Optional[str], columns: Sequence[str], alias: str
718739
# cypher count(DISTINCT x) drops nulls (pandas nunique(dropna=True)); polars n_unique()
719740
# counts null as a value, so drop nulls first for parity.
720741
return col.drop_nulls().n_unique().alias(alias)
742+
if func == "collect":
743+
# cypher collect(x) DROPS nulls and preserves within-group row order (pandas
744+
# row/pipeline.py:4552-4582 filters ~isna() then agg(list)). In a polars
745+
# group_by(maintain_order=True).agg, a multi-valued expr yields a List column, so
746+
# drop_nulls() alone reproduces it; an all-null/empty group yields [] (an empty list),
747+
# never [null] — matching the oracle's []-coercion (4597-4614). NO .implode() (that would
748+
# double-wrap to List(List)).
749+
return col.drop_nulls().alias(alias)
750+
if func == "collect_distinct":
751+
# collect(DISTINCT x): drop nulls, dedup keep-first preserving first-occurrence order
752+
# (pandas drop_duplicates(keep="first") + agg(list)). polars unique(maintain_order=True)
753+
# is keep-first order-preserving; empty/all-null group -> [].
754+
return col.drop_nulls().unique(maintain_order=True).alias(alias)
721755
return None
722756

723757

graphistry/tests/compute/gfql/test_conformance_ledger.py

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -176,7 +176,6 @@ def test_known_uncovered_reasons_are_nonempty():
176176
# honest one-liner (all currently honest-NIE-or-unasserted; none has a dedicated test that the
177177
# parser misses, unlike the predicate temporal entries).
178178
KNOWN_UNCOVERED_FUNCTIONS: dict[str, str] = {
179-
"tofloat": "pandas-native (astype float); polars _lower_function has NO branch -> honest NIE, not yet asserted. TODO: add a tofloat native-or-NIE case.",
180179
"keys": "map/entity key-extraction; polars declines (no _lower_function branch) -> NIE; not yet asserted. TODO.",
181180
"labels": "node-label text function; polars declines -> NIE; not yet asserted. TODO.",
182181
"type": "edge-type function; polars declines -> NIE; not yet asserted. TODO.",

graphistry/tests/compute/gfql/test_engine_polars_conformance_matrix.py

Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -263,6 +263,9 @@ def _cypher_expression_queries():
263263
("tointeger_int", "MATCH (n) RETURN n.id AS id, toInteger(n.num) AS i"),
264264
("tointeger_float", "MATCH (n) RETURN n.id AS id, toInteger(n.f) AS i"),
265265
("tointeger_bool", "MATCH (n) RETURN n.id AS id, toInteger(n.flag) AS i"),
266+
("tofloat_int", "MATCH (n) RETURN n.id AS id, toFloat(n.num) AS f"),
267+
("tofloat_float", "MATCH (n) RETURN n.id AS id, toFloat(n.f) AS f"),
268+
("tofloat_bool", "MATCH (n) RETURN n.id AS id, toFloat(n.flag) AS f"),
266269
("toboolean_bool", "MATCH (n) RETURN n.id AS id, toBoolean(n.flag) AS b"),
267270
("tostring_bool", "MATCH (n) RETURN n.id AS id, toString(n.flag) AS s"),
268271
("tostring_int", "MATCH (n) RETURN n.id AS id, toString(n.num) AS s"),
@@ -306,6 +309,41 @@ def test_substring_runs_natively_on_polars():
306309
assert res == base, f"substring polars must match pandas oracle: {res} != {base}"
307310

308311

312+
def test_tofloat_string_honest_nie_polars():
313+
"""toFloat(<non-numeric String>) RAISES on the pandas oracle (astype(float) fails) — NOT
314+
null-on-failure. polars MUST decline with an honest NIE rather than fabricate strict=False
315+
nulls. (Int/Float/Bool toFloat is native + parity — covered by the matrix tofloat_* cases.)"""
316+
if "polars" not in _NONPANDAS_ENGINES:
317+
pytest.skip("polars not installed")
318+
g = _graph(4)
319+
q = "MATCH (n) RETURN n.id AS id, toFloat(n.name) AS f"
320+
assert _run(g, q, "pandas")[0] != "ok", "pandas toFloat(non-numeric string) raises (not null-on-failure)"
321+
assert _run(g, q, "polars")[0] == "nie", "string toFloat must be an honest NIE on polars (no strict=False null fabrication)"
322+
323+
324+
def test_collect_aggregations_native_parity_polars():
325+
"""collect(x) / collect(DISTINCT x) lower NATIVELY on polars and match pandas: drop nulls,
326+
preserve within-group order (collect keeps dups; distinct dedups keep-first), all-null group
327+
-> []. List cells normalized to python lists for the cross-engine compare."""
328+
if "polars" not in _NONPANDAS_ENGINES:
329+
pytest.skip("polars not installed")
330+
import pandas as pd
331+
e = pd.DataFrame({"s": [0, 1, 2, 3, 4, 5], "d": [1, 2, 3, 4, 5, 6],
332+
"k": ["a", "a", "a", "b", "b", "c"], "v": ["x", "w", None, "y", "y", None]})
333+
g = graphistry.edges(e, "s", "d")
334+
335+
def collected(engine, q):
336+
df = _to_pd(g.gfql(q, engine=engine)._nodes).sort_values("k")
337+
return {r["k"]: list(r["vs"]) for _, r in df.iterrows()}
338+
339+
for q, expected in [
340+
("MATCH ()-[r]->() RETURN r.k AS k, collect(r.v) AS vs", {"a": ["x", "w"], "b": ["y", "y"], "c": []}),
341+
("MATCH ()-[r]->() RETURN r.k AS k, collect(DISTINCT r.v) AS vs", {"a": ["x", "w"], "b": ["y"], "c": []}),
342+
]:
343+
assert collected("pandas", q) == expected
344+
assert collected("polars", q) == expected
345+
346+
309347
def test_size_list_runs_natively_on_polars():
310348
"""size(<List column>) MUST lower NATIVELY (list.len). The List column is built by
311349
the already-native list-literal with_ so the operand dtype is List, and size is an

0 commit comments

Comments
 (0)