Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
- **Polars support**: `polars.DataFrame` and `polars.LazyFrame` now work in `plot()`, `materialize_nodes()`, `get_degrees()`, `get_indegrees()`, `get_outdegrees()`, and `hypergraph()`. Polars is an optional dependency — no behavior change when not installed. Upload path uses efficient Arrow conversion (`to_arrow()` with schema-metadata stripping and memoization); compute/hypergraph paths coerce to pandas at entry. `LazyFrame` is materialized via `.collect()` at each boundary. Adds `test_polars.py` with 17 tests; skips gracefully when polars is absent (#1133).

### Internal
- **GFQL / Cypher lowering — secondary whole-row carry survives chained reentry boundaries (#1256, #989, #999 partial)**: Two extensions to `_demote_secondary_whole_row_aliases` in `graphistry/compute/gfql/cypher/lowering.py` so multi-alias carry survives a chained reentry compile. (1) **Forwarding-item drop** — bare-identifier projection items in downstream `WITH` stages whose name is a secondary alias are dropped at compile time (the same intent as slice 4.3c, integrated into the post-#1071 active rewrite path). Without this, a forwarding pattern like `WITH a, x, friend, ...` triggered the bare-ref failfast even though the carry already lives as a hidden scalar on the reentry-source's row table. (2) **Hidden-column forwarding** — for every `(secondary_alias, prop)` reference collected from trailing clauses, the synthesized `__cypher_reentry_<S>_<X>__` hidden alias is appended as a bare passthrough item to every downstream `WITH` stage so each recursive `compile_cypher_query` call sees it as a scalar carry. Without this, the inner compile failed alias resolution on the rewritten hidden identifier (`Unknown Cypher alias '__cypher_reentry_x_id__' in RETURN clause`) once a chained reentry stage narrowed the projection scope. New positive tests cover (a) chained reentry where the trailing MATCH continues to use the same primary alias, (b) chained reentry where the source is rebound between boundaries (`MATCH (a)-[:R]->(friend) ... MATCH (friend)-[:S]->(c)`), and (c) multi-alias `DISTINCT` forwarding through a single boundary. New failfast test pins the remaining gap — a trailing MATCH that does not start from any carried alias (free-form intermediate MATCH, the LDBC SNB IC3 prefix shape) — to the existing scoped error so future closure of that lane is regression-locked. Closes the chained-reentry portion of #1256; the free-form intermediate-MATCH case remains open follow-up.
- **GFQL / Cypher lowering — `ReentryPlan` IR + multi-alias whole-row carry slices (#989, #1026, #999 partial)**: Introduces an explicit `ReentryPlan` + `CarriedAlias` dataclass (`graphistry/compute/gfql/cypher/reentry_plan.py`) as the compile-time contract between a prefix `WITH` stage and the trailing `MATCH`, replacing the implicit handshake spread across tuple returns from `_bounded_reentry_carry_columns`, the `scalar_reentry_alias` / `scalar_reentry_columns` fields on `CompiledCypherExecutionExtras`, and runtime contract re-extraction in `_compiled_query_reentry_contract` (#987 step 1). Plan is exposed via `compiled_query.reentry_plan` and threaded through `_map_terminal_reentry_query` + `_attach_graph_context`. Builds three additional admit slices on top of the #1071 lift: (1) **slice 4.3a** lifts the residual single-whole-row gate at the compile site (`lowering.py`) and runtime contract (`gfql_unified.py`) so `WITH a, x` admits whenever only the trailing-MATCH source alias is referenced downstream — `ReentryPlan.aliases` records all whole-row aliases as `CarriedAlias` entries, with downstream non-source-alias bare references emitting an actionable failfast pointing to #989; (2) **slice 4.3b** adds a compile-time prefix rewrite (`_rewrite_multi_whole_row_prefix`) that turns `WITH a, x` into `WITH a, x.id AS __carry_x__id__` for every property of `x` referenced in trailing clauses (collected via `_collect_non_source_alias_property_refs` walking `WhereClause.expr_tree`, `ProjectionStage.where`, `ReturnItem`, `OrderItem`), and AST-rewrites trailing `<non_source>.<prop>` references to property access on the reentry-alias's hidden column — closes the multi-alias case of `#1026` regression-lock (`MATCH (a), (x) WITH a, x OPTIONAL MATCH (a)-->(b) RETURN x.id, b.id` now executes correctly with left-outer-join semantics, previously raised `GFQLValidationError`); (3) **slice 4.3c** drops bare carried-alias items at compile time when downstream `WITH a, x, y, collect(...)` re-projects them, so the bare-ref failfast does not false-positive on forwarding patterns. Adds 5 new positive/structure tests + 2 failfast-scope tests, retargets one prior `pytest.raises(GFQLValidationError)` regression-lock to a positive row assertion, and adds the multi-alias `OPTIONAL MATCH` regression test `test_issue_1026_multi_alias_with_optional_match_carries_secondary_property`. Cross-reentry-boundary forwarding (carry survival across `MATCH (a)-[:KNOWS]-(friend)`, the IC3 LDBC SNB shape) remains follow-up work tracked as #999 / slice 4.3d (#989, #1026, #999, #987).
- **GFQL / Cypher lowering — multi-alias carry through `WITH` before MATCH re-entry (#1071)**: Lifted the residual constraint that `WITH` before a MATCH re-entry must project exactly one whole-row node alias. The local Cypher compiler now accepts patterns like `MATCH (p)-[:KNOWS]-(friend) WITH p, friend MATCH (friend)-[:IS_LOCATED_IN]->(c) RETURN p.firstName, c.name` (LDBC SNB IC1 shape). Implementation pre-rewrites the prefix `WITH` in `_compile_bounded_reentry_query`: the trailing-MATCH primary alias is preserved as the sole whole-row carry; secondary aliases are demoted to scalar property carries via synthesized `S.X AS __cypher_reentry_<S>_<X>__` items, with downstream `S.X` references rewritten to bare hidden identifiers that compose with the existing single-alias carried-scalar machinery (#1047 / #1068). Returning a secondary alias as a whole-row entity (`RETURN s`) and re-binding a secondary alias as a node variable in the trailing MATCH remain unsupported with precise errors. Two prior failfast tests retargeted to assert correct multi-alias behavior; new tests cover IC1-shape, three-alias carry, and the secondary-whole-row-return rejection.
- **GFQL / Cypher lowering — OR/XOR around pattern predicates (#1236, #1031 slice 4)**: Local Cypher `WHERE` boolean trees now support disjunctive compositions that mix pattern-existence leaves with row expressions (for example `pattern OR expr`, `pattern XOR expr`). Lowering rewrites pattern leaves into correlated `semi_apply_mark(...)` row pre-filters that emit boolean marker columns, then evaluates the rewritten boolean expression in `where_rows(...)`. Added direct row-pipeline/safelist coverage for `semi_apply_mark` plus parser/lowering runtime regression tests for OR/XOR-around-pattern query shapes.
Expand Down
76 changes: 75 additions & 1 deletion graphistry/compute/gfql/cypher/lowering.py
Original file line number Diff line number Diff line change
Expand Up @@ -7894,9 +7894,23 @@ def rewrite_text(expr: ExpressionText, field: str) -> ExpressionText:
where_clause if where_clause is None else _rewrite_where_clause_and_resync(where_clause, rewrite_text, "where")
for where_clause in query.reentry_wheres
)
# Slice 4.3d.1 (#1256): Drop bare-identifier projection items that simply
# forward a secondary alias (`WITH a, x, y, collect(...)` pattern in IC3
# multi-stage chains). Their property carries already live as hidden columns
# on the reentry-source's row table; the bare items are pure forwarding
# noise. Without this pre-clean the bare-ref scanner inside
# `_collect_secondary_property_refs` would fail-fast on what is in fact a
# forwarding pattern, blocking IC3 even after #1248 admits the prefix WITH.
secondary_forwarding_re = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
cleaned_with_stages_tail = tuple(
_drop_bare_alias_items_from_stage(
stage, secondary_aliases, identifier_re=secondary_forwarding_re
)
for stage in query.with_stages[1:]
)
rewritten_with_stages_tail = tuple(
_rewrite_reentry_projection_stage(stage, rewrite_expr=rewrite_text)
for stage in query.with_stages[1:]
for stage in cleaned_with_stages_tail
)
rewritten_unwinds = tuple(
replace(unwind, expression=rewrite_text(unwind.expression, "unwind"))
Expand Down Expand Up @@ -7946,6 +7960,66 @@ def rewrite_text(expr: ExpressionText, field: str) -> ExpressionText:
clause=replace(prefix_stage.clause, items=tuple(new_items)),
)

# Slice 4.3d.2 (#1256): forward hidden carry columns through every downstream
# WITH stage so each recursive bounded-reentry compile sees them as scalar
# carries. Without this, the hidden column survives the first boundary (it
# is in the prefix) but the subsequent WITH/RETURN scope-narrowing drops it,
# and references like `RETURN x.id AS xid` (rewritten to a bare hidden
# identifier) fail at the inner compile's alias resolution.
#
# Interaction with DISTINCT in downstream stages: appending the carry as a
# bare item makes it a participant in DISTINCT key sets. For multi-alias
# carry semantics this is what callers want — DISTINCT over `(friend, x.id)`
# is the desired behavior when `x.id` is referenced downstream. Multi-row
# `x` cases that could observably mutate row count are blocked upstream by
# the pre-existing `unique carried node rows` failfast in `gfql_unified.py`.
#
# Aggregate guard (W2-IMPORTANT-1): if any downstream WITH stage contains
# an aggregate function call, refuse to forward the carry through it. The
# alternative — silently appending the hidden alias next to `count(*)` — can
# produce a wrong NULL value in the projected column when the trailing MATCH
# has no relationship to trigger the existing aggregate failfast. Better to
# raise a scoped #1256 error pointing at the gap than to risk silent wrong
# results. The relationship-pattern aggregate path is also covered by an
# earlier failfast; this guard is a single tighter rule that subsumes both.
if refs_collected and rewritten_with_stages_tail:
for stage in rewritten_with_stages_tail:
for item in stage.clause.items:
try:
item_node = parse_expr(item.expression.text)
except (GFQLExprParseError, ImportError):
continue
if _contains_aggregate_call(item_node):
raise _unsupported_at_span(
"Cypher MATCH after WITH chained-reentry secondary-alias carry "
"does not yet survive a downstream aggregating WITH stage; "
"tracked under #1256",
field="with",
value=item.expression.text,
span=stage.span,
)
forwarded_items: List[ReturnItem] = []
for alias_name, prop in sorted(refs_collected):
hidden_alias = _secondary_reentry_hidden_column_name(alias_name, prop)
forwarded_items.append(
ReturnItem(
expression=ExpressionText(text=hidden_alias, span=template_span),
alias=None,
span=template_span,
)
)
forwarded_tuple = tuple(forwarded_items)
rewritten_with_stages_tail = tuple(
replace(
stage,
clause=replace(
stage.clause,
items=stage.clause.items + forwarded_tuple,
),
)
for stage in rewritten_with_stages_tail
)

rewritten_query = replace(
query,
with_stages=(rewritten_prefix_stage,) + rewritten_with_stages_tail,
Expand Down
154 changes: 154 additions & 0 deletions graphistry/tests/compute/gfql/cypher/test_lowering.py
Original file line number Diff line number Diff line change
Expand Up @@ -8022,6 +8022,160 @@ def test_string_cypher_failfast_rejects_multi_whole_row_prefix_when_non_source_a
_mk_multi_stage_reentry_graph().gfql(query)


def test_string_cypher_chained_reentry_with_repeated_primary_hits_unique_carried_rows_failfast() -> None:
"""#1256 known limitation: when the trailing MATCH re-uses the SAME primary
alias across multiple reentry boundaries (``WITH a, x MATCH (a)-[:R]->(friend)
WITH a, x, friend MATCH (a)-[:R]->(other) ...``), the second reentry's
recursive compile receives multiple rows that share the same carried ``a``
value (one per friend), tripping the pre-existing
``unique carried node rows`` runtime check at ``gfql_unified.py:~1091``.

This is not a regression introduced by slice 4.3d.2 — the carry-uniqueness
constraint is fundamental to the current scalar-carry runtime model. The
rebinding shape (Q2-style ``(a)-[:R]->(friend) ... (friend)-[:S]->(c)``)
avoids this because the primary alias rebinds each boundary so each
recursive prefix has unique carried-node identity. Lock this here so a
future slice that lifts the unique-rows constraint must update this test
alongside the runtime change.
"""
query = (
"MATCH (a:A {id: 'a'}), (x:B {id: 'b'}) "
"WITH a, x "
"MATCH (a)-[:R]->(friend) "
"WITH a, x, friend "
"MATCH (a)-[:R]->(other) "
"WHERE other.id <> friend.id "
"RETURN other.id AS oid, x.id AS xid ORDER BY oid"
)
with pytest.raises(
GFQLValidationError,
match=r"unique carried node rows",
):
_mk_multi_stage_reentry_graph().gfql(query)


def test_string_cypher_admits_secondary_alias_carry_across_reentry_source_rebinding() -> None:
"""#1256 slice 4.3d.2: secondary alias carry survives a reentry-source rebinding.

`WITH a, x MATCH (a)-[:R]->(friend) WITH friend, x MATCH (friend)-[:S]->(c)
RETURN c.id, x.id` rebinds the trailing-MATCH source from `a` to `friend`
between the two reentry boundaries. The carry `x.id` lives as a hidden
column on `a`'s row table; after the join with friend, on friend's table;
the second boundary's prefix WITH must continue to forward it as a scalar
so the inner compile can resolve it.
"""
query = (
"MATCH (a:A {id: 'a'}), (x:B {id: 'b'}) "
"WITH a, x "
"MATCH (a)-[:R]->(friend) "
"WITH friend, x "
"MATCH (friend)-[:S]->(c) "
"RETURN c.id AS cid, x.id AS xid ORDER BY cid"
)
result = _mk_multi_stage_reentry_graph().gfql(query)
assert result._nodes.to_dict(orient="records") == [
{"cid": "c", "xid": "b"},
{"cid": "e", "xid": "b"},
]


def test_string_cypher_admits_multi_alias_distinct_forwarding_through_reentry() -> None:
"""#1256 slice 4.3d.1: bare-identifier projection items in downstream
``WITH a, x, friend`` (pure forwarding through a reentry boundary) are
dropped at compile time so the bare-ref scanner does not false-positive.

Before slice 4.3d.1, the active rewrite path
(``_demote_secondary_whole_row_aliases``) failed at line ~7900
("does not yet support carrying secondary whole-row aliases as whole-row
outputs") because the bare ``x`` in stage[1] was treated as a true USE.
"""
query = (
"MATCH (a:A {id: 'a'}), (x:B {id: 'b'}) "
"WITH a, x "
"MATCH (a)-[:R]->(friend) "
"WITH DISTINCT a, x, friend "
"RETURN friend.id AS fid, x.id AS xid ORDER BY fid"
)
result = _mk_multi_stage_reentry_graph().gfql(query)
assert result._nodes.to_dict(orient="records") == [
{"fid": "b", "xid": "b"},
{"fid": "e", "xid": "b"},
]


def test_string_cypher_chained_reentry_carry_with_aggregate_relationship_match_failfast() -> None:
"""#1256 wave-1 review W1-I1 regression-lock: chained-reentry secondary-alias
carry combined with an aggregating downstream WITH stage following a
relationship-pattern MATCH is rejected with the scoped #1256 failfast.
The aggregate guard added in slice 4.3d.2 fires before the existing
relationship-multiplicity aggregate check would; the scoped error is
clearer about the actual gap.
"""
query = (
"MATCH (a:A {id: 'a'}), (x:B {id: 'b'}) "
"WITH a, x "
"MATCH (a)-[:R]->(friend) "
"WITH a, x, count(*) AS n "
"RETURN x.id AS xid, n"
)
with pytest.raises(
GFQLValidationError,
match=r"chained-reentry secondary-alias carry does not yet survive",
):
_mk_multi_stage_reentry_graph().gfql(query)


def test_string_cypher_chained_reentry_carry_with_aggregate_node_only_match_failfast() -> None:
"""#1256 wave-2 review W2-IMPORTANT-1 regression-lock: chained-reentry
secondary-alias carry combined with an aggregating downstream WITH stage
following a node-only MATCH (no relationship) is rejected with the scoped
#1256 failfast.

Without the aggregate guard, this shape returned a silent NULL because the
pre-existing relationship-pattern aggregate failfast did NOT fire (no
relationship in the trailing MATCH) and the appended hidden carry column
interacted poorly with the count grouping path. The guard ensures the user
gets a clear, scoped error pointing at #1256 instead of silent wrong data.
"""
query = (
"MATCH (a:A {id: 'a'}), (x:B {id: 'b'}) "
"WITH a, x "
"MATCH (a) "
"WITH a, x, count(*) AS n "
"RETURN x.id AS xid, n"
)
with pytest.raises(
GFQLValidationError,
match=r"chained-reentry secondary-alias carry does not yet survive",
):
_mk_multi_stage_reentry_graph().gfql(query)


def test_string_cypher_failfast_rejects_intermediate_reentry_match_with_no_carried_source() -> None:
"""#1256 slice 4.3d remaining gap: trailing MATCH that does NOT start from a
carried alias (free-form intermediate MATCH) is still unsupported.

The literal LDBC SNB IC3 query begins ``WITH a, x, y MATCH (city:City)
-[:IS_PART_OF]->(country:Country) ...`` — neither ``city`` nor ``country``
is a carried alias. Closing this requires admitting the trailing MATCH as
a fresh seed pattern that cross-joins with the carried row table; the
runtime then propagates carries onto the new bindings. This is the
remaining slice 4.3d work tracked under #1256.
"""
query = (
"MATCH (a:A {id: 'a'}), (x:B {id: 'b'}) "
"WITH a, x "
"MATCH (c:C)-[:T]->(d:D) "
"WHERE c.id = x.id OR c.id = 'c' "
"RETURN d.id AS did"
)
with pytest.raises(
GFQLValidationError,
match=r"(trailing MATCH to start from the same carried node alias|carries non-source whole-row aliases)",
):
_mk_multi_stage_reentry_graph().gfql(query)


def test_string_cypher_failfast_rejects_with_match_reentry_multiple_trailing_match_clauses() -> None:
query = (
"MATCH (a:A {id: $seed})-[:R]->(b:B) "
Expand Down
Loading