Commit cf59dd3

feat(gfql): infer typed graph schemas
1 parent 9b4a272 commit cf59dd3

8 files changed

Lines changed: 661 additions & 17 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
 <!-- Do Not Erase This Section - Used for tracking unreleased changes -->
 
 ### Added
+- **GFQL schema inference API (#1338)**: Added experimental `graphistry.infer_schema(g)`, `g.infer_schema()`, and `g.bind(infer_schema=True)` for opt-in public `GraphSchema` inference from bound local graph data. Inference derives node/edge property logical types, presence/nullability report details, `label__*` node and relationship labels, and source/destination topology when node label evidence is available. Declared schemas remain explicit and take precedence when passed to `infer_schema(..., schema=...)`; `bind(schema=..., infer_schema=True)` is rejected instead of silently merging contracts.
 - **GFQL NetworkX CALL parity (#1058)**: Expanded the local Cypher `graphistry.nx.*` CALL surface with explicit NetworkX dispatch for `degree_centrality`, `closeness_centrality`, `eigenvector_centrality`, `katz_centrality`, `connected_components`, `strongly_connected_components`, `core_number`, and multi-output `hits`, including row and `.write()` coverage.
 - **NetworkX/SciPy optional dependency policy (#1618)**: Declared supported `networkx>=2.5,<4` and optional `scipy>=1.5,<2` ranges for NetworkX-backed GFQL CALL procedures, with runtime version guards and a focused lower/current-upper CI matrix.
 - **GFQL public schema declarations (#1337)**: Added experimental `graphistry.schema` exports for `NodeType`, `EdgeType`, `GraphSchema`, and `EdgeTopology`, plus top-level `graphistry` re-exports. `NodeType` and `EdgeType` accept Arrow-first `pyarrow.Schema` declarations, preserve dtype/nullability through GFQL `RowSchema`, and export back to Arrow with label/type columns via `to_arrow()`. `graphistry.bind(..., schema=schema)` / `g.bind(schema=schema)` attach public schema declarations to plotters, and Cypher preflight validation consumes the adapted internal `GraphSchemaCatalog` for declared labels, properties, relationship types, and source/destination topology checks. `GraphSchema(strict=False)` makes schema-bound `g.gfql_validate(...)` permissive by default while explicit call-level `strict=True` still forces strict validation.
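The topology rule in the schema inference entry ("source/destination topology when node label evidence is available") can be illustrated with a small pure-Python sketch. This is not the code in `graphistry.schema_inference`; the helper names `labels_of` and `infer_topology`, the row-dict representation, and the `id`/`src`/`dst` column names are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple

LABEL_PREFIX = "label__"  # boolean label columns follow the label__<Name> convention


def labels_of(row: Dict[str, object]) -> List[str]:
    """Return the labels whose boolean label__<Name> column is True on this row."""
    return sorted(
        key[len(LABEL_PREFIX):]
        for key, value in row.items()
        if key.startswith(LABEL_PREFIX) and value is True
    )


def infer_topology(
    nodes: List[Dict[str, object]],
    edges: List[Dict[str, object]],
    node_id: str = "id",   # hypothetical column names
    src: str = "src",
    dst: str = "dst",
) -> Set[Tuple[str, str, str]]:
    """Collect (source label, edge type, destination label) triples.

    An endpoint contributes labels only when the edge references a bound
    node id that carries label columns; otherwise nothing is invented.
    """
    node_labels = {row[node_id]: labels_of(row) for row in nodes}
    triples: Set[Tuple[str, str, str]] = set()
    for edge in edges:
        src_labels = node_labels.get(edge[src], [])
        dst_labels = node_labels.get(edge[dst], [])
        for edge_type in labels_of(edge):
            for s in src_labels:
                for d in dst_labels:
                    triples.add((s, edge_type, d))
    return triples


nodes = [
    {"id": "a", "label__Person": True, "label__Company": False},
    {"id": "b", "label__Person": False, "label__Company": True},
]
edges = [{"src": "a", "dst": "b", "label__WORKS_AT": True}]

topology = infer_topology(nodes, edges)
edge_only_topology = infer_topology([], edges)  # no node table bound
```

With both tables bound, the sketch yields one `Person`/`WORKS_AT`/`Company` triple. With edges only, the edge type is still discoverable through `labels_of`, but no endpoint labels are produced.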

docs/source/gfql/schema.rst

Lines changed: 89 additions & 15 deletions
@@ -5,8 +5,9 @@ GFQL accepts public schema declarations through the stable
 ``graphistry.schema`` import path. Use this when application code owns a graph
 contract and wants Cypher preflight checks to fail before query execution.
 The API is experimental in this release: the import path and core declaration
-objects are intended to be stable, while inference, coercion, remote transport,
-and planner use are still follow-on surfaces.
+objects are intended to be stable, while coercion, remote transport, and
+planner use are still follow-on surfaces. Inference is also experimental and
+must be requested explicitly.
 
 The schema is optional. When you provide one, PyGraphistry uses it as the
 declared contract for local GFQL validation. When you do not provide one,
@@ -118,27 +119,100 @@ This is a correctness and documentation surface first: applications can state
 what labels, relationship types, properties, and topology they expect, then
 validate user-authored or generated Cypher before running it. The same typed
 contract is also the foundation for later inference, coercion, remote transport,
-and planner/performance work, but this page covers the declared local contract.
+and planner/performance work.
 
-Provided vs. Inferred Schema
-----------------------------
+Provided Or Inferred Schema
+---------------------------
 
-In this release, schemas are **provided**, not inferred. You create
-``NodeType``, ``EdgeType``, and ``GraphSchema`` objects directly and attach them
-with ``graphistry.bind(..., schema=schema)`` or ``g.bind(schema=schema)``.
+You can provide a schema directly or infer one from bound local data.
 
-Without an explicit ``GraphSchema``:
+Use a provided schema when application code owns the contract:
 
-* ``g.gfql_validate(...)`` can still use local dataframe columns already bound
-  on ``g._nodes`` and ``g._edges`` for schema-aware checks.
-* It does not infer node types, edge types, Arrow dtypes, nullability, or
-  topology from data.
+.. code-block:: python
+
+    g = graphistry.edges(edges_df, "src", "dst").nodes(nodes_df, "id")
+    g = g.bind(schema=schema)
+
+Use inference when the graph data should define the first draft contract:
+
+.. code-block:: python
+
+    schema = g.infer_schema()
+    g = g.bind(schema=schema)
+
+For one-step local binding, use:
+
+.. code-block:: python
+
+    g = g.bind(infer_schema=True)
+
+Inference is opt-in. ``graphistry.bind(...)`` and ``g.bind(...)`` do not infer a
+schema unless ``infer_schema=True`` is passed.
+
+Inference Rules
+---------------
+
+``graphistry.infer_schema(g)`` and ``g.infer_schema()`` return a public
+``GraphSchema``. They inspect currently bound ``nodes`` and ``edges`` dataframes:
+
+* Node types come from boolean ``label__<Label>`` columns on the node table.
+* Edge types come from boolean ``label__<TYPE>`` columns on the edge table.
+* Node properties are non-label node columns observed on rows for a label.
+* Edge properties are non-label edge columns, excluding the bound source,
+  destination, and edge-id columns.
+* Source/destination topology is inferred when edges reference bound node ids
+  and those nodes have label columns. Edge-only graphs keep edge types and
+  properties, but do not invent endpoint labels.
 * A remote-only graph such as ``graphistry.bind(dataset_id="...")`` has no
   local dataframe columns, so local validation is limited to syntax, compile,
   and structural checks unless you also bind a declared schema.
 
-Schema inference from existing plottables is tracked separately from this
-declared-schema API.
+Inference uses the same Arrow/GFQL row-schema bridge as declared schemas for
+logical property types. The returned ``GraphSchema`` can be passed to
+``g.bind(schema=schema)`` and used by ``g.gfql_validate(...)``.
+
+Presence And Nullability
+------------------------
+
+The public ``GraphSchema`` stores the inferred logical type and scalar
+nullability needed by validation. For more detail, request the experimental
+report:
+
+.. code-block:: python
+
+    schema, report = g.infer_schema(return_report=True)
+
+The report tracks property presence separately from type:
+
+``required``
+    The property has observed values on every row for that node label or edge
+    type.
+
+``optional``
+    The property has observed values on some rows and nulls on other rows for
+    that node label or edge type.
+
+``maybe_absent``
+    The column exists on the dataframe but has no observed value for that node
+    label or edge type. This commonly means another label/type uses the column.
+
+``unknown``
+    No rows were available for that node label or edge type.
+
+Declared Overrides
+------------------
+
+Declared schemas stay explicit. Passing ``schema=...`` to ``infer_schema()``
+uses declared node and edge definitions in preference to inferred definitions
+with the same names, while keeping inferred definitions for other names.
+
+.. code-block:: python
+
+    schema = g.infer_schema(schema=declared_schema)
+
+``g.bind(schema=..., infer_schema=True)`` is rejected. Use either a provided
+schema or inferred schema in a single bind call so the validation contract is
+unambiguous.
 
 Local vs. Remote GFQL
 ---------------------
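The four presence states documented in the "Presence And Nullability" section of this file can be made concrete with a minimal sketch. It works on plain row dictionaries with `None` as the null marker and is not how `graphistry.schema_inference` is implemented; `classify_presence`, `presence_by_label`, and the decision to skip the `id` column are assumptions of this example.

```python
from typing import Dict, List, Optional

LABEL_PREFIX = "label__"


def classify_presence(values: List[Optional[object]]) -> str:
    """Map the values seen for one property under one label to a presence state."""
    if not values:
        return "unknown"        # no rows carried this label at all
    observed = sum(value is not None for value in values)
    if observed == len(values):
        return "required"       # a value on every row
    if observed == 0:
        return "maybe_absent"   # column exists, but never populated for this label
    return "optional"           # a mix of values and nulls


def presence_by_label(
    rows: List[Dict[str, object]],
    id_column: str = "id",  # assumption: the id column is not reported as a property
) -> Dict[str, Dict[str, str]]:
    """Classify every non-label column for every label__<Name> column."""
    columns = sorted({key for row in rows for key in row})
    label_columns = [c for c in columns if c.startswith(LABEL_PREFIX)]
    property_columns = [
        c for c in columns if c not in label_columns and c != id_column
    ]
    report: Dict[str, Dict[str, str]] = {}
    for label_column in label_columns:
        members = [row for row in rows if row.get(label_column) is True]
        report[label_column[len(LABEL_PREFIX):]] = {
            prop: classify_presence([row.get(prop) for row in members])
            for prop in property_columns
        }
    return report


nodes = [
    {"id": 1, "label__Person": True, "label__Company": False, "label__Robot": False,
     "name": "Ada", "age": 36, "ticker": None},
    {"id": 2, "label__Person": True, "label__Company": False, "label__Robot": False,
     "name": "Bob", "age": None, "ticker": None},
    {"id": 3, "label__Person": False, "label__Company": True, "label__Robot": False,
     "name": "Acme", "age": None, "ticker": "ACME"},
]

report = presence_by_label(nodes)
```

Here `ticker` is `maybe_absent` for `Person` because only `Company` rows populate it, which matches the "another label/type uses the column" case described in the docs. `Robot` has no rows, so every property is `unknown`.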

graphistry/Plottable.py

Lines changed: 4 additions & 0 deletions
@@ -364,9 +364,13 @@ def bind(
         nodes_file_id: Optional[str] = None,
         edges_file_id: Optional[str] = None,
         schema: Optional[Any] = None,
+        infer_schema: Any = False,
     ) -> 'Plottable':
         ...
 
+    def infer_schema(self, *, schema: Optional[Any] = None, return_report: bool = False) -> Any:
+        ...
+
     def copy(self) -> 'Plottable':
         ...
 
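The `infer_schema` signature above takes keyword-only `schema` and `return_report` arguments and is typed as returning `Any`, because the docs describe either a schema or a `(schema, report)` pair. The following sketch shows that calling shape together with the documented precedence rule (declared definitions win for matching names, inferred ones are kept for the rest). The function `infer_definitions`, the dict-based definitions, and the `overridden` report field are hypothetical stand-ins, not part of the PyGraphistry API.

```python
from typing import Dict, List, Optional, Tuple, Union

# type name -> {property name: logical type}; a stand-in for real schema objects
Definitions = Dict[str, Dict[str, str]]
Report = Dict[str, List[str]]


def infer_definitions(
    inferred: Definitions,
    *,
    schema: Optional[Definitions] = None,
    return_report: bool = False,
) -> Union[Definitions, Tuple[Definitions, Report]]:
    """Merge declared definitions over inferred ones, by name."""
    merged: Definitions = dict(inferred)
    overridden: List[str] = []
    if schema is not None:
        for name, definition in schema.items():
            if name in merged:
                overridden.append(name)
            merged[name] = definition  # declared definition takes precedence
    if return_report:
        # Hypothetical report content; the real report tracks property presence.
        return merged, {"overridden": sorted(overridden)}
    return merged


inferred = {"Person": {"age": "float64"}, "Company": {"ticker": "string"}}
declared = {"Person": {"age": "int64"}}

merged = infer_definitions(inferred, schema=declared)
merged_again, report = infer_definitions(inferred, schema=declared, return_report=True)
```

Making both options keyword-only keeps call sites readable when the return shape changes with `return_report`.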

graphistry/PlotterBase.py

Lines changed: 20 additions & 1 deletion
@@ -1618,6 +1618,7 @@ def bind(self,
         nodes_file_id: Optional[str] = None,
         edges_file_id: Optional[str] = None,
         schema: Optional[Any] = None,
+        infer_schema: Any = False,
     ) -> Plottable:
         """Relate data attributes to graph structure and visual representation. To facilitate reuse and replayable notebooks, the binding call is chainable. Invocation does not effect the old binding: it instead returns a new Plotter instance with the new bindings added to the existing ones. Both the old and new bindings can then be used for different graphs.
 
@@ -1690,6 +1691,9 @@ def bind(self,
         :param schema: Optional experimental public GFQL schema declaration from ``graphistry.schema``.
         :type schema: Optional[Any]
 
+        :param infer_schema: Infer an experimental public GFQL schema from currently bound data and attach it.
+        :type infer_schema: bool
+
         :returns: Plotter
         :rtype: Plotter
 
@@ -1769,7 +1773,16 @@ def bind(self,
         res._url = url or self._url
         res._nodes_file_id = nodes_file_id or self._nodes_file_id
         res._edges_file_id = edges_file_id or self._edges_file_id
-        res._gfql_schema = schema if schema is not None else self._gfql_schema
+        if schema is not None and infer_schema:
+            raise ValueError("schema and infer_schema cannot both be set")
+        if infer_schema and self._gfql_schema is not None:
+            raise ValueError("schema and infer_schema cannot both be set")
+        if infer_schema:
+            from graphistry.schema_inference import infer_schema as _infer_schema
+
+            res._gfql_schema = _infer_schema(res)
+        else:
+            res._gfql_schema = schema if schema is not None else self._gfql_schema
 
         # Invalidate dataset_id if we're changing encodings, not setting IDs
         encoding_params_changed = any([
@@ -1788,6 +1801,12 @@ def bind(self,
     def copy(self) -> Plottable:
         return copy.copy(self)
 
+    def infer_schema(self, *, schema: Optional[Any] = None, return_report: bool = False) -> Any:
+        """Infer an experimental public GFQL schema from currently bound data."""
+        from graphistry.schema_inference import infer_schema
+
+        return infer_schema(self, schema=schema, return_report=return_report)
+
 
     def nodes(self, nodes: Union[Callable, Any], node=None, *args, **kwargs) -> Plottable:
         """Specify the set of nodes and associated data.

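The schema-resolution branch added to `PlotterBase.bind` can be read as a small decision table. The sketch below restates it as a standalone function so each case can be exercised without a plotter; `resolve_schema`, `is_rejected`, and `fake_infer` are hypothetical names, and strings stand in for real schema objects. It follows the diff in one detail the docs do not spell out: `infer_schema=True` is also rejected when the plotter already carries a schema from an earlier bind.

```python
from typing import Any, Callable, Optional


def resolve_schema(
    current: Optional[Any],
    schema: Optional[Any],
    infer_schema: bool,
    infer: Callable[[], Any],
) -> Optional[Any]:
    """Pick the schema to attach, mirroring the branch order in the diff."""
    if schema is not None and infer_schema:
        raise ValueError("schema and infer_schema cannot both be set")
    if infer_schema and current is not None:
        # A schema attached by an earlier bind also conflicts with inference.
        raise ValueError("schema and infer_schema cannot both be set")
    if infer_schema:
        return infer()
    # Otherwise an explicit schema wins, else the existing one is carried over.
    return schema if schema is not None else current


def is_rejected(**kwargs: Any) -> bool:
    """Return True when resolve_schema raises ValueError for these arguments."""
    try:
        resolve_schema(**kwargs)
    except ValueError:
        return True
    return False


def fake_infer() -> str:
    return "inferred"  # placeholder for a real inference call


explicit = resolve_schema(current=None, schema="declared", infer_schema=False, infer=fake_infer)
inferred = resolve_schema(current=None, schema=None, infer_schema=True, infer=fake_infer)
inherited = resolve_schema(current="existing", schema=None, infer_schema=False, infer=fake_infer)
both_rejected = is_rejected(current=None, schema="declared", infer_schema=True, infer=fake_infer)
existing_rejected = is_rejected(current="existing", schema=None, infer_schema=True, infer=fake_infer)
```

Both rejection paths raise the same message in the diff, so a caller cannot tell from the error text alone whether the conflict came from the current call or from a previously bound schema.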
graphistry/__init__.py

Lines changed: 7 additions & 0 deletions
@@ -42,6 +42,7 @@
     encode_point_badge,
     encode_edge_badge,
     apply_encodings,
+    infer_schema,
     hypergraph,
     bolt,
     cypher,
@@ -140,6 +141,12 @@
     NodeType,
 )
 
+from graphistry.schema_inference import (
+    InferredProperty,
+    PresenceState,
+    SchemaInferenceReport,
+)
+
 from graphistry.privacy import (
     Mode, Privacy
 )

graphistry/pygraphistry.py

Lines changed: 13 additions & 1 deletion
@@ -1925,12 +1925,14 @@ def bind(
         nodes_file_id: Optional[str] = None,
         edges_file_id: Optional[str] = None,
         schema: Optional[Any] = None,
+        infer_schema: Any = False,
     ) -> Plotter:
         """Create a base plotter.
 
         Typically called at start of a program. For parameters, see ``plotter.bind()`` .
         The ``schema`` parameter accepts the experimental public GFQL schema
-        declarations from ``graphistry.schema``.
+        declarations from ``graphistry.schema``. ``infer_schema=True`` infers
+        that schema from currently bound local data.
 
         :returns: Plotter
         :rtype: Plotter
@@ -1974,8 +1976,17 @@ def bind(
             nodes_file_id=nodes_file_id,
             edges_file_id=edges_file_id,
             schema=schema,
+            infer_schema=infer_schema,
         ))
 
+    def infer_schema(self, g: Optional[Any] = None, *, schema: Optional[Any] = None, return_report: bool = False) -> Any:
+        """Infer an experimental public GFQL schema from a plotter."""
+        from graphistry.schema_inference import infer_schema
+
+        if g is None:
+            raise ValueError("graphistry.infer_schema(g) requires a plotter; use g.infer_schema() for bound graphs")
+        return infer_schema(g, schema=schema, return_report=return_report)
+
     def from_dataset_id(self, dataset_id: str, api_token: Optional[str] = None) -> Plotter:
         """Fetch existing remote dataset metadata and hydrate a Plotter.
 
@@ -2763,6 +2774,7 @@ def _handle_api_response(self, response):
 encode_point_badge = PyGraphistry.encode_point_badge
 encode_edge_badge = PyGraphistry.encode_edge_badge
 apply_encodings = PyGraphistry.apply_encodings
+infer_schema = PyGraphistry.infer_schema
 infer_labels = PyGraphistry.infer_labels
 name = PyGraphistry.name
 description = PyGraphistry.description
