Commit a2c9c81

feat(gfql): infer typed graph schemas
1 parent dcdf93b commit a2c9c81

8 files changed

Lines changed: 670 additions & 16 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
 <!-- Do Not Erase This Section - Used for tracking unreleased changes -->
 
 ### Added
+- **GFQL schema inference API (#1338)**: Added experimental `graphistry.infer_schema(g)`, `g.infer_schema()`, and `g.bind(infer_schema=True)` for opt-in public `GraphSchema` inference from bound local graph data. Inference derives node/edge property logical types, presence/nullability report details, `label__*` node and relationship labels, and source/destination topology when node label evidence is available. Declared schemas remain explicit and take precedence when passed to `infer_schema(..., schema=...)`; `bind(schema=..., infer_schema=True)` is rejected instead of silently merging contracts.
 - **GFQL NetworkX CALL parity (#1058)**: Expanded the local Cypher `graphistry.nx.*` CALL surface with explicit NetworkX dispatch for `degree_centrality`, `closeness_centrality`, `eigenvector_centrality`, `katz_centrality`, `connected_components`, `strongly_connected_components`, `core_number`, and multi-output `hits`, including row and `.write()` coverage.
 - **NetworkX/SciPy optional dependency policy (#1618)**: Declared supported `networkx>=2.5,<4` and optional `scipy>=1.5,<2` ranges for NetworkX-backed GFQL CALL procedures, with runtime version guards and a focused lower/current-upper CI matrix.
 - **GFQL schema Arrow boundary APIs (#1339)**: Added experimental public schema↔Arrow import/export helpers, graph-level Arrow declaration payloads, and opt-in `schema_validate='strict'|'autofix'` enforcement for `plot()`, `upload()`, `to_arrow()`, and `validate_arrow_schema()` when a `GraphSchema` is bound.
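
Review note: the changelog entry above says labels come from `label__*` columns and that topology is inferred only when node label evidence is available. The following is a minimal sketch of that idea in plain Python, assuming rows are dictionaries and `True` marks label membership. The helper names `labels_of` and `infer_topology` and the sample data are hypothetical; this is not the code in `graphistry/schema_inference.py`, which is not shown on this page.

```python
from typing import Dict, List, Set, Tuple

LABEL_PREFIX = "label__"
Row = Dict[str, object]


def labels_of(row: Row) -> List[str]:
    """Return the labels whose boolean label__<Label> column is True on this row."""
    return sorted(
        key[len(LABEL_PREFIX):]
        for key, value in row.items()
        if key.startswith(LABEL_PREFIX) and value is True
    )


def infer_topology(
    nodes: List[Row],
    edges: List[Row],
    node_id: str = "id",
    src: str = "src",
    dst: str = "dst",
) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each edge type to the (source label, destination label) pairs observed."""
    node_labels = {row[node_id]: labels_of(row) for row in nodes}
    topology: Dict[str, Set[Tuple[str, str]]] = {}
    for edge in edges:
        # Endpoints that are not bound nodes contribute no label evidence.
        sources = node_labels.get(edge[src], [])
        targets = node_labels.get(edge[dst], [])
        for edge_type in labels_of(edge):
            pairs = topology.setdefault(edge_type, set())
            pairs.update((s, d) for s in sources for d in targets)
    return topology


# Hypothetical sample data for illustration only.
example_nodes: List[Row] = [
    {"id": "a", "label__Person": True, "label__Company": False},
    {"id": "b", "label__Person": False, "label__Company": True},
]
example_edges: List[Row] = [{"src": "a", "dst": "b", "label__WORKS_AT": True}]

with_nodes = infer_topology(example_nodes, example_edges)
edge_only = infer_topology([], example_edges)
```

In this sketch an edge-only graph still yields the edge type, but with an empty set of endpoint pairs, which mirrors the documented rule that inference does not invent endpoint labels.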

docs/source/gfql/schema.rst

Lines changed: 98 additions & 14 deletions
@@ -5,8 +5,9 @@ GFQL accepts public schema declarations through the stable
 ``graphistry.schema`` import path. Use this when application code owns a graph
 contract and wants Cypher preflight checks to fail before query execution.
 The API is experimental in this release: the import path and core declaration
-objects are intended to be stable, while inference, coercion, remote transport,
-and planner use are still follow-on surfaces.
+objects are intended to be stable, while coercion, remote transport, and
+planner use are still follow-on surfaces. Inference is also experimental and
+must be requested explicitly.
 
 The schema is optional. When you provide one, PyGraphistry uses it as the
 declared contract for local GFQL validation. When you do not provide one,
@@ -128,8 +129,8 @@ Invalid queries raise ``GFQLValidationError`` with structured context.
 This is a correctness and documentation surface first: applications can state
 what labels, relationship types, properties, and topology they expect, then
 validate user-authored or generated Cypher before running it. The same typed
-contract is also the foundation for later inference, coercion, remote transport,
-and planner/performance work, but this page covers the declared local contract.
+contract is also used by inference and is the foundation for later coercion,
+remote transport, and planner/performance work.
 
 Arrow Boundary Validation
 -------------------------
@@ -163,22 +164,105 @@ boundaries. This is off by default so existing ``plot()``, ``upload()``, and
 Provided vs. Inferred Schema
 ----------------------------
 
-In this release, schemas are **provided**, not inferred. You create
-``NodeType``, ``EdgeType``, and ``GraphSchema`` objects directly and attach them
-with ``graphistry.bind(..., schema=schema)`` or ``g.bind(schema=schema)``.
+You can provide a schema directly or infer one from bound local data.
 
-Without an explicit ``GraphSchema``:
+Use a provided schema when application code owns the contract:
 
-* ``g.gfql_validate(...)`` can still use local dataframe columns already bound
-  on ``g._nodes`` and ``g._edges`` for schema-aware checks.
-* It does not infer node types, edge types, Arrow dtypes, nullability, or
-  topology from data.
+.. code-block:: python
+
+    declared_g = (
+        graphistry
+        .edges(edges_df, "src", "dst")
+        .nodes(nodes_df, "id")
+        .bind(schema=schema)
+    )
+
+Use inference when the graph data should define the first draft contract:
+
+.. code-block:: python
+
+    inferred_base_g = graphistry.edges(edges_df, "src", "dst").nodes(nodes_df, "id")
+    inferred_schema = inferred_base_g.infer_schema()
+    inferred_g = inferred_base_g.bind(schema=inferred_schema)
+
+For one-step local binding, use:
+
+.. code-block:: python
+
+    inferred_g = (
+        graphistry
+        .edges(edges_df, "src", "dst")
+        .nodes(nodes_df, "id")
+        .bind(infer_schema=True)
+    )
+
+Inference is opt-in. ``graphistry.bind(...)`` and ``g.bind(...)`` do not infer a
+schema unless ``infer_schema=True`` is passed.
+
+Inference Rules
+---------------
+
+``graphistry.infer_schema(g)`` and ``g.infer_schema()`` return a public
+``GraphSchema``. They inspect currently bound ``nodes`` and ``edges`` dataframes:
+
+* Node types come from boolean ``label__<Label>`` columns on the node table.
+* Edge types come from boolean ``label__<TYPE>`` columns on the edge table.
+* Node properties are non-label node columns observed on rows for a label.
+* Edge properties are non-label edge columns, excluding the bound source,
+  destination, and edge-id columns.
+* Source/destination topology is inferred when edges reference bound node ids
+  and those nodes have label columns. Edge-only graphs keep edge types and
+  properties, but do not invent endpoint labels.
 * A remote-only graph such as ``graphistry.bind(dataset_id="...")`` has no
   local dataframe columns, so local validation is limited to syntax, compile,
   and structural checks unless you also bind a declared schema.
 
-Schema inference from existing plottables is tracked separately from this
-declared-schema API.
+Inference uses the same Arrow/GFQL row-schema bridge as declared schemas for
+logical property types. The returned ``GraphSchema`` can be passed to
+``g.bind(schema=schema)`` and used by ``g.gfql_validate(...)``.
+
+Presence And Nullability
+------------------------
+
+The public ``GraphSchema`` stores the inferred logical type and scalar
+nullability needed by validation. For more detail, request the experimental
+report:
+
+.. code-block:: python
+
+    schema, report = g.infer_schema(return_report=True)
+
+The report tracks property presence separately from type:
+
+``required``
+    The property has observed values on every row for that node label or edge
+    type.
+
+``optional``
+    The property has observed values on some rows and nulls on other rows for
+    that node label or edge type.
+
+``maybe_absent``
+    The column exists on the dataframe but has no observed value for that node
+    label or edge type. This commonly means another label/type uses the column.
+
+``unknown``
+    No rows were available for that node label or edge type.
+
+Declared Overrides
+------------------
+
+Declared schemas stay explicit. Passing ``schema=...`` to ``infer_schema()``
+uses declared node and edge definitions in preference to inferred definitions
+with the same names, while keeping inferred definitions for other names.
+
+.. code-block:: python
+
+    refined_schema = g.infer_schema(schema=schema)
+
+``g.bind(schema=..., infer_schema=True)`` is rejected. Use either a provided
+schema or inferred schema in a single bind call so the validation contract is
+unambiguous.
 
 Local vs. Remote GFQL
 ---------------------
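
Review note: the four presence states documented in this file (`required`, `optional`, `maybe_absent`, `unknown`) can be read as a small decision procedure. Here is a minimal sketch in plain Python, assuming rows are dictionaries, `None` stands in for a dataframe null, and `True` in a `label__*` column marks membership. The function name `presence_state` and the sample rows are hypothetical and are not taken from the commit.

```python
from typing import Dict, List

Row = Dict[str, object]


def presence_state(rows: List[Row], label_column: str, prop: str) -> str:
    """Classify how often `prop` is observed on the rows that carry `label_column`."""
    members = [row for row in rows if row.get(label_column) is True]
    if not members:
        # No rows for this label or type, so nothing can be said.
        return "unknown"
    observed = sum(1 for row in members if row.get(prop) is not None)
    if observed == len(members):
        return "required"
    if observed == 0:
        # The column exists, but never holds a value for this label or type.
        return "maybe_absent"
    return "optional"


# Hypothetical node rows: two Person rows and one Company row.
example_rows: List[Row] = [
    {"id": 1, "label__Person": True, "label__Company": False,
     "name": "Ada", "age": 36, "ticker": None},
    {"id": 2, "label__Person": True, "label__Company": False,
     "name": "Bob", "age": None, "ticker": None},
    {"id": 3, "label__Person": False, "label__Company": True,
     "name": "Acme", "age": None, "ticker": "ACM"},
]

person_states = {
    prop: presence_state(example_rows, "label__Person", prop)
    for prop in ("name", "age", "ticker")
}
```

Here `ticker` is only ever populated for the Company row, so for `Person` it lands in `maybe_absent`, matching the documented case where another label uses the column.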

graphistry/Plottable.py

Lines changed: 4 additions & 0 deletions
@@ -364,9 +364,13 @@ def bind(
         nodes_file_id: Optional[str] = None,
         edges_file_id: Optional[str] = None,
         schema: Optional[Any] = None,
+        infer_schema: Any = False,
     ) -> 'Plottable':
         ...
 
+    def infer_schema(self, *, schema: Optional[Any] = None, return_report: bool = False) -> Any:
+        ...
+
     def copy(self) -> 'Plottable':
         ...
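
Review note: the `schema=` keyword on `infer_schema` is documented as giving declared definitions precedence over inferred definitions with the same name, while keeping inferred definitions for other names. One way to read that is as a name-keyed merge at the definition level. The sketch below assumes definitions are held in dictionaries keyed by type name; `merge_definitions` and the sample definitions are hypothetical, and the real `GraphSchema` objects may be structured differently.

```python
from typing import Any, Dict


def merge_definitions(inferred: Dict[str, Any], declared: Dict[str, Any]) -> Dict[str, Any]:
    """Keep every inferred definition, then let declared ones win on name clashes."""
    merged = dict(inferred)   # copy so the inferred mapping is left untouched
    merged.update(declared)   # a declared definition replaces the inferred one whole
    return merged


# Hypothetical node definitions: property name -> logical type.
inferred_nodes = {
    "Person": {"name": "string", "age": "int64"},
    "Company": {"ticker": "string"},
}
declared_nodes = {"Person": {"name": "string", "age": "int32"}}

merged_nodes = merge_definitions(inferred_nodes, declared_nodes)
```

In this reading the declared `Person` definition replaces the inferred one entirely rather than being merged property by property; whether the library merges at a finer grain is not stated on this page.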

graphistry/PlotterBase.py

Lines changed: 20 additions & 1 deletion
@@ -1622,6 +1622,7 @@ def bind(self,
         nodes_file_id: Optional[str] = None,
         edges_file_id: Optional[str] = None,
         schema: Optional[Any] = None,
+        infer_schema: Any = False,
     ) -> Plottable:
         """Relate data attributes to graph structure and visual representation. To facilitate reuse and replayable notebooks, the binding call is chainable. Invocation does not effect the old binding: it instead returns a new Plotter instance with the new bindings added to the existing ones. Both the old and new bindings can then be used for different graphs.
 
@@ -1694,6 +1695,9 @@ def bind(self,
         :param schema: Optional experimental public GFQL schema declaration from ``graphistry.schema``.
         :type schema: Optional[Any]
 
+        :param infer_schema: Infer an experimental public GFQL schema from currently bound data and attach it.
+        :type infer_schema: bool
+
         :returns: Plotter
         :rtype: Plotter
 
@@ -1773,7 +1777,16 @@ def bind(self,
         res._url = url or self._url
         res._nodes_file_id = nodes_file_id or self._nodes_file_id
         res._edges_file_id = edges_file_id or self._edges_file_id
-        res._gfql_schema = schema if schema is not None else self._gfql_schema
+        if schema is not None and infer_schema:
+            raise ValueError("schema and infer_schema cannot both be set")
+        if infer_schema and self._gfql_schema is not None:
+            raise ValueError("schema and infer_schema cannot both be set")
+        if infer_schema:
+            from graphistry.schema_inference import infer_schema as _infer_schema
+
+            res._gfql_schema = _infer_schema(res)
+        else:
+            res._gfql_schema = schema if schema is not None else self._gfql_schema
 
         # Invalidate dataset_id if we're changing encodings, not setting IDs
         encoding_params_changed = any([
@@ -1792,6 +1805,12 @@ def bind(self,
     def copy(self) -> Plottable:
         return copy.copy(self)
 
+    def infer_schema(self, *, schema: Optional[Any] = None, return_report: bool = False) -> Any:
+        """Infer an experimental public GFQL schema from currently bound data."""
+        from graphistry.schema_inference import infer_schema
+
+        return infer_schema(self, schema=schema, return_report=return_report)
+
 
     def nodes(self, nodes: Union[Callable, Any], node=None, *args, **kwargs) -> Plottable:
         """Specify the set of nodes and associated data.

graphistry/__init__.py

Lines changed: 7 additions & 0 deletions
@@ -42,6 +42,7 @@
     encode_point_badge,
     encode_edge_badge,
     apply_encodings,
+    infer_schema,
     hypergraph,
     bolt,
     cypher,
@@ -140,6 +141,12 @@
     NodeType,
 )
 
+from graphistry.schema_inference import (
+    InferredProperty,
+    PresenceState,
+    SchemaInferenceReport,
+)
+
 from graphistry.privacy import (
     Mode, Privacy
 )

graphistry/pygraphistry.py

Lines changed: 13 additions & 1 deletion
@@ -1925,12 +1925,14 @@ def bind(
         nodes_file_id: Optional[str] = None,
         edges_file_id: Optional[str] = None,
         schema: Optional[Any] = None,
+        infer_schema: Any = False,
     ) -> Plotter:
         """Create a base plotter.
 
         Typically called at start of a program. For parameters, see ``plotter.bind()`` .
         The ``schema`` parameter accepts the experimental public GFQL schema
-        declarations from ``graphistry.schema``.
+        declarations from ``graphistry.schema``. ``infer_schema=True`` infers
+        that schema from currently bound local data.
 
         :returns: Plotter
         :rtype: Plotter
@@ -1974,8 +1976,17 @@ def bind(
             nodes_file_id=nodes_file_id,
             edges_file_id=edges_file_id,
             schema=schema,
+            infer_schema=infer_schema,
         ))
 
+    def infer_schema(self, g: Optional[Any] = None, *, schema: Optional[Any] = None, return_report: bool = False) -> Any:
+        """Infer an experimental public GFQL schema from a plotter."""
+        from graphistry.schema_inference import infer_schema
+
+        if g is None:
+            raise ValueError("graphistry.infer_schema(g) requires a plotter; use g.infer_schema() for bound graphs")
+        return infer_schema(g, schema=schema, return_report=return_report)
+
     def from_dataset_id(self, dataset_id: str, api_token: Optional[str] = None) -> Plotter:
         """Fetch existing remote dataset metadata and hydrate a Plotter.
 
@@ -2763,6 +2774,7 @@ def _handle_api_response(self, response):
 encode_point_badge = PyGraphistry.encode_point_badge
 encode_edge_badge = PyGraphistry.encode_edge_badge
 apply_encodings = PyGraphistry.apply_encodings
+infer_schema = PyGraphistry.infer_schema
 infer_labels = PyGraphistry.infer_labels
 name = PyGraphistry.name
 description = PyGraphistry.description
