
Commit 135ed78

lmeyerov and claude committed
docs(gfql): seeded-traversal CSR adjacency index guide (persona P0-3)
The engine-selection guide (#1661) documented all four engines + a decision matrix but the CSR adjacency index — the strongest competitive claim and the exact answer to 'Neo4j has an index, does GFQL?' — was only a footnote. Adds a full guide: create_index/gfql_index_all/show_indexes/drop_index, index_policy (use/auto/force/off), gfql_explain, Cypher DDL + wire protocol, and the sourced numbers (flat-in-N 0.12ms @8M-117M edges; 9-28x vs Kuzu/Neo4j on selective lookups; CPU-wins-seeded vs GPU floor). Honest build-cost + parity-or-fallback section. Wires into the toctree + a seeded-lookup recommended path; shrinks the engines.rst F5 footnote to a cross-link. Persona-driven (round-1 user-testing: Priya/Neo4j-migrant + Maya's slow seeded lookup). Numbers already measured (benchmarks/gfql/index_*bench.py, dgx-spark). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 4ca874b commit 135ed78

3 files changed

Lines changed: 183 additions & 7 deletions


docs/source/gfql/engines.rst

Lines changed: 8 additions & 7 deletions
@@ -222,13 +222,14 @@ as its in-memory GPU executor comes under memory pressure. Prefer ``cudf`` for t
222222
result as a GPU run (see *Honesty* below).
223223

224224
**[F5] Selective traversal is an indexing problem, not an engine choice.** A seeded ``hop``
225-
from a few nodes is fastest with the opt-in **CSR adjacency index** (``g.create_index(...)``,
226-
``index_policy=``), which turns the O(E) scan into an O(degree) gather. The index works on all
227-
four engines, but seeded work is so small that **CPU wins**: on LiveJournal 35M a typical-seed
228-
1-hop is ~0.13 ms on pandas and ~0.16 ms on Polars (numpy ``searchsorted``) vs ~3 ms on cuDF
229-
(GPU kernel-launch floor) — the clean inverse of bulk, where the GPU pulls ahead. So pick the
230-
index for selective traversal and a CPU engine to drive it. (A dedicated index guide is in
231-
progress; the methods live under the GFQL API.)
225+
from a few nodes is fastest with the opt-in **CSR adjacency index** (``g.gfql_index_all()`` /
226+
``g.create_index(...)``, ``index_policy=``), which turns the O(E) scan into an O(degree)
227+
gather — flat in graph size, and 9–28× faster than Kuzu / Neo4j on selective lookups. It works
228+
on all four engines, but seeded work is so small that **CPU wins**: on LiveJournal 35M a
229+
typical-seed 1-hop is ~0.13 ms on pandas and ~0.16 ms on Polars (numpy ``searchsorted``) vs
230+
~3 ms on cuDF (GPU kernel-launch floor) — the clean inverse of bulk, where the GPU pulls
231+
ahead. So pick the index for selective traversal and a CPU engine to drive it. See
232+
:doc:`index_adjacency` for the full guide.
232233

233234
cuDF vs Polars-GPU
234235
------------------

docs/source/gfql/index.rst

Lines changed: 2 additions & 0 deletions
@@ -38,6 +38,7 @@ Recommended paths:
3838
- Running Cypher syntax in GFQL: :doc:`cypher` -> :doc:`quick` -> :doc:`return` -> :doc:`spec/cypher_mapping`
3939
- Faster on CPU (no GPU): :doc:`engines` -> :doc:`performance` (one keyword, ``engine='polars'``, up to ~38x over pandas)
4040
- Performance path (intro -> engine choice -> GPU -> remote GPU): :doc:`about` -> :doc:`engines` -> :doc:`performance` -> :doc:`remote`
41+
- Fast seeded lookups (start from known nodes, like a DB index): :doc:`index_adjacency` (O(degree), flat in graph size, 9-28x vs Kuzu/Neo4j)
4142
- Translating existing Cypher to native GFQL: :doc:`spec/cypher_mapping`
4243
- Building agents/integrations: :doc:`spec/language` + :doc:`spec/python_embedding` + :doc:`spec/wire_protocol`
4344

@@ -52,6 +53,7 @@ See also:
5253
overview
5354
remote
5455
Choosing an Engine <engines>
56+
Seeded Traversal Indexes <index_adjacency>
5557
GFQL CPU & GPU Acceleration <performance>
5658
End-to-End Benchmark <benchmark_filter_pagerank>
5759
translate
Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
1+
Seeded Traversal Indexes (CSR Adjacency)
2+
========================================
3+
4+
A **seeded** graph query starts from a known set of nodes — "the neighbors of these
5+
50 accounts", "2 hops out from this device" — rather than scanning the whole graph.
6+
By default GFQL answers a seeded ``hop`` with an ``O(E)`` pass over every edge. With an
7+
opt-in **CSR adjacency index**, the same hop becomes an ``O(degree)`` gather: its cost
8+
depends on how many edges the *seeds* touch, not on how big the graph is. The result is
9+
**flat in graph size**, and in the benchmarks below it is 9–28× faster than Kuzu and Neo4j on selective lookups.
10+
11+
Nothing changes about the answer. The index is a pay-as-you-go accelerator: a query either
12+
uses a resident index or falls back to the scan, and any feature the index does not cover
13+
also falls back — never a different result.
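
To make the ``O(E)`` versus ``O(degree)`` distinction concrete, here is a minimal NumPy sketch of a sorted-adjacency lookup checked against a full scan. It illustrates the general technique only and is not GFQL's implementation: the helper names (``build_out_index``, ``indexed_hop``, ``scan_hop``) and the toy edge arrays are hypothetical.

```python
import numpy as np

def build_out_index(src):
    """One stable sort of edge row positions by source id: O(E log E)."""
    order = np.argsort(src, kind="stable")  # edge row positions grouped by source
    return order, src[order]                # a sidecar; the edge arrays are untouched

def indexed_hop(index, seeds):
    """Edge rows leaving the seeds: a binary search plus a gather per seed."""
    order, sorted_src = index
    seeds = np.unique(seeds)
    lo = np.searchsorted(sorted_src, seeds, side="left")
    hi = np.searchsorted(sorted_src, seeds, side="right")
    rows = [order[a:b] for a, b in zip(lo, hi)]
    if not rows:
        return np.empty(0, dtype=order.dtype)
    return np.sort(np.concatenate(rows))

def scan_hop(src, seeds):
    """Baseline: test every edge against the seed set, O(E)."""
    return np.flatnonzero(np.isin(src, seeds))

# Toy graph: 7 edges, seeds are nodes 0 and 3.
src = np.array([3, 0, 2, 0, 1, 3, 0])
dst = np.array([1, 2, 3, 4, 0, 0, 1])
index = build_out_index(src)
seeds = np.array([0, 3])
rows = indexed_hop(index, seeds)
print(rows, dst[rows])
```

The indexed path only touches the slices of ``order`` that belong to the seeds, so its work tracks the seeds' degree, while ``scan_hop`` always reads all of ``src``.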
14+
15+
When to use it
16+
--------------
17+
18+
- **Seeded traversals**: you start from specific node ids (a watchlist, a session, a fraud
19+
ring's known members) and hop out 1–3 steps.
20+
- **Repeated queries** against the same graph: build the index once, amortize it over many
21+
seeded lookups.
22+
- **Interactive / point-lookup latency**: sub-millisecond neighbor expansion.
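
As an illustration of hopping out a few steps from a seed set, the sketch below expands a frontier over a CSR layout (``indptr`` plus a neighbor array). It assumes node ids are the integers ``0..n_nodes-1``; ``build_csr`` and ``k_hop`` are hypothetical helper names, not GFQL functions, and the result here includes the seeds themselves.

```python
import numpy as np

def build_csr(src, dst, n_nodes):
    """CSR layout: neighbors of node n are neighbors[indptr[n]:indptr[n + 1]]."""
    order = np.argsort(src, kind="stable")
    indptr = np.zeros(n_nodes + 1, dtype=np.int64)
    np.cumsum(np.bincount(src, minlength=n_nodes), out=indptr[1:])
    return indptr, dst[order]

def k_hop(indptr, neighbors, seeds, hops):
    """Nodes reachable from the seeds within `hops` forward steps."""
    seen = np.unique(seeds)
    frontier = seen
    for _ in range(hops):
        parts = [neighbors[indptr[n]:indptr[n + 1]] for n in frontier]
        if not parts:                      # frontier is empty, nothing left to expand
            break
        reached = np.unique(np.concatenate(parts))
        frontier = np.setdiff1d(reached, seen)   # only newly discovered nodes
        seen = np.union1d(seen, frontier)
    return seen

# Toy graph: 0->1, 1->2, 2->3, 0->4, 4->2
src = np.array([0, 1, 2, 0, 4])
dst = np.array([1, 2, 3, 4, 2])
indptr, neighbors = build_csr(src, dst, n_nodes=5)
one_hop = k_hop(indptr, neighbors, np.array([0]), hops=1)
two_hop = k_hop(indptr, neighbors, np.array([0]), hops=2)
print(one_hop, two_hop)
```

Each step reads only the adjacency slices of the current frontier, which is why the cost follows the size of the explored neighborhood instead of the size of the graph.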
23+
24+
It does **not** help a full-graph scan (a property filter over every node, a global
25+
PageRank). For those, choose an *engine* instead — see :doc:`engines`.
26+
27+
Quick start
28+
-----------
29+
30+
.. code-block:: python
31+
32+
import graphistry
33+
34+
g = graphistry.edges(edges_df, "src", "dst").nodes(nodes_df, "id")
35+
36+
# Build the indexes once (out+in adjacency, plus a node-id accelerator when ids are unique)
37+
g = g.gfql_index_all()
38+
39+
# Seeded query — the index is used automatically (default index_policy='use')
40+
out = g.gfql("MATCH (a)-[e]->(b) WHERE a.id IN $seeds RETURN a, e, b",
41+
params={"seeds": my_seed_ids})
42+
43+
``gfql_index_all()`` is the one-liner. For finer control, build a single kind:
44+
45+
.. code-block:: python
46+
47+
g = g.create_index("edge_out_adj") # outgoing adjacency (forward hops)
48+
g = g.create_index("edge_in_adj") # incoming adjacency (reverse hops)
49+
g = g.create_index("node_id") # node-id lookup accelerator (unique ids only)
50+
51+
g.show_indexes() # inspect what's resident
52+
g = g.drop_index() # drop all (or drop_index("edge_out_adj"))
53+
54+
The index is a **sidecar over edge row positions** — it never reorders your ``.edges`` /
55+
``.nodes`` frames, and it is fingerprint-validated: rebinding ``.edges()`` safely
56+
invalidates a stale index (treated as absent, never a wrong answer).
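
The "stale index is treated as absent" behavior can be sketched as a fingerprint check in front of the lookup. Everything here is an assumption made for illustration: the hashing scheme, the dictionary layout, and the names ``fingerprint``, ``build_index`` and ``out_edges`` are hypothetical, and how GFQL actually computes its fingerprint is not described in this guide.

```python
import hashlib
import numpy as np

def fingerprint(src, dst):
    """Hypothetical fingerprint: a hash over the edge arrays' dtype, shape and bytes."""
    h = hashlib.sha256()
    for arr in (src, dst):
        a = np.ascontiguousarray(arr)
        h.update(str((a.dtype, a.shape)).encode())
        h.update(a.tobytes())
    return h.hexdigest()

def build_index(src, dst):
    """Sorted-adjacency sidecar that remembers which edges it was built from."""
    order = np.argsort(src, kind="stable")
    return {"order": order, "sorted_src": src[order], "fp": fingerprint(src, dst)}

def out_edges(src, dst, seed, index=None):
    """Return (edge rows, path taken); a stale index is ignored, never trusted."""
    if index is not None and index["fp"] == fingerprint(src, dst):
        lo = np.searchsorted(index["sorted_src"], seed, side="left")
        hi = np.searchsorted(index["sorted_src"], seed, side="right")
        return np.sort(index["order"][lo:hi]), "index"
    return np.flatnonzero(src == seed), "scan"

src = np.array([0, 1, 0, 2])
dst = np.array([1, 2, 2, 0])
idx = build_index(src, dst)
rows, path = out_edges(src, dst, 0, idx)               # fingerprint matches

rebound_src = np.array([1, 0, 2, 0])                   # edges were rebound
rows_stale, path_stale = out_edges(rebound_src, dst, 0, idx)
print(path, rows, path_stale, rows_stale)
```

Hashing every byte is itself ``O(E)``, so this shows only the control flow (validate, then use or fall back); a practical fingerprint would need to be much cheaper than the scan it protects.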
57+
58+
Controlling the planner
59+
-----------------------
60+
61+
``gfql(..., index_policy=...)`` decides whether a resident index is used:
62+
63+
.. list-table::
64+
:header-rows: 1
65+
:widths: 18 82
66+
67+
* - ``index_policy``
68+
- Behavior
69+
* - ``'use'`` *(default)*
70+
- Use a resident index when one covers the query; never build one. Zero overhead if
71+
no index exists.
72+
* - ``'auto'``
73+
- Build an index on the fly when the planner predicts it pays off (selective seed set).
74+
* - ``'force'``
75+
- Require the index path (useful for benchmarking / asserting it is engaged).
76+
* - ``'off'``
77+
- Ignore indexes entirely (the plain ``O(E)`` scan).
78+
79+
Use ``g.gfql_explain(query, index_policy=...)`` to see whether the index path was taken.
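
The policy table can be read as a small decision function. The sketch below is a hypothetical model of that table, not the GFQL planner: ``plan_seeded_hop``, the ``seed_fraction`` heuristic and its ``auto_threshold`` are invented for illustration, coverage checks are omitted, and the choice to build an index under ``'force'`` when none is resident is an assumption (a real planner might raise instead).

```python
def plan_seeded_hop(index_policy, index_resident, seed_fraction=1.0,
                    auto_threshold=0.01):
    """Return (path, build_first) for one seeded hop.

    seed_fraction is the share of nodes used as seeds; auto_threshold is an
    assumed stand-in for "the planner predicts the index pays off".
    """
    if index_policy not in ("use", "auto", "force", "off"):
        raise ValueError(f"unknown index_policy: {index_policy!r}")
    if index_policy == "off":
        return "scan", False            # ignore indexes entirely
    if index_resident:
        return "index", False           # use / auto / force all take a resident index
    if index_policy == "use":
        return "scan", False            # never builds; no overhead without an index
    if index_policy == "auto":
        selective = seed_fraction <= auto_threshold
        return ("index", True) if selective else ("scan", False)
    return "index", True                # 'force': assumed to build first

for policy in ("use", "auto", "force", "off"):
    print(policy, plan_seeded_hop(policy, index_resident=False, seed_fraction=0.001))
```

Read this way, ``'use'`` is the only policy that is guaranteed to add no work when no index exists, which matches its role as the default.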
80+
81+
The indexes are **engine-uniform**: numpy host arrays for pandas / Polars, cupy on-device
82+
for cuDF. They are also exposed as **Cypher DDL** (``CREATE GFQL INDEX FOR edge_out_adj``,
83+
``DROP GFQL INDEX``, ``SHOW GFQL INDEXES`` — the mandatory ``GFQL`` token distinguishes them
84+
from standard property ``CREATE INDEX``) and in the **JSON wire protocol**
85+
(``{"type": "CreateIndex", ...}`` ops plus ``index_policy`` in the request envelope), so a
86+
remote ``gfql_remote`` call can carry the same index intent.
87+
88+
Performance
89+
-----------
90+
91+
**Flat in graph size.** A seeded 1-hop stays sub-millisecond as the graph grows 10×, while
92+
the ``O(E)`` scan grows linearly. Synthetic power-law graphs, GFQL-pandas, warm median,
93+
every cell guarded so the index path was taken *and* the indexed result equals the scan
94+
result:
95+
96+
.. list-table::
97+
:header-rows: 1
98+
:widths: 40 30 30
99+
100+
* - Seeded 1-hop
101+
- 0.8M nodes / 6.4M edges
102+
- 8M nodes / 64M edges
103+
* - **Indexed (O(degree))**
104+
- **0.124 ms**
105+
- **0.122 ms** *(flat)*
106+
* - Scan (O(E))
107+
- 105 ms
108+
- 1045 ms
109+
110+
The same holds on real power-law graphs: a typical-seed 1-hop is ~0.13 ms on LiveJournal
111+
(35M edges) and ~0.14 ms on Orkut (117M edges), versus ``O(E)`` scans of 367 ms and 1208 ms, respectively.
112+
113+
**Faster than Kuzu and Neo4j on selective lookups.** Same graph (0.8M nodes / 6.4M
114+
edges), matched result counts, warm median. GFQL is CPU-pandas with the index; Kuzu and
115+
Neo4j use their native indexes:
116+
117+
.. list-table::
118+
:header-rows: 1
119+
:widths: 24 22 18 18 18
120+
121+
* - Task
122+
- GFQL (indexed)
123+
- Kuzu
124+
- Neo4j
125+
- GFQL speedup (vs Kuzu / vs Neo4j)
126+
* - 1-hop seeded
127+
- **0.123 ms**
128+
- 1.15 ms
129+
- 1.45 ms
130+
- 9.4× / 11.8×
131+
* - 1–2-hop seeded
132+
- **0.150 ms**
133+
- 4.25 ms
134+
- 2.54 ms
135+
- 28× / 16.9×
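
The speedup column is each database's latency divided by the GFQL latency. The short check below recomputes it from the rounded values printed in the table; the dictionary layout and the ``speedups`` helper are just for illustration. Recomputing from rounded inputs gives 9.3× for the 1-hop Kuzu cell, not the 9.4× shown, which would be consistent with the table having been derived from unrounded timings.

```python
# Latencies in milliseconds, copied from the table above.
latency_ms = {
    "1-hop seeded": {"gfql": 0.123, "kuzu": 1.15, "neo4j": 1.45},
    "1-2-hop seeded": {"gfql": 0.150, "kuzu": 4.25, "neo4j": 2.54},
}

def speedups(row):
    """Ratio of each database's latency to the GFQL latency, to one decimal."""
    return {db: round(row[db] / row["gfql"], 1) for db in ("kuzu", "neo4j")}

ratios = {task: speedups(row) for task, row in latency_ms.items()}
for task, r in ratios.items():
    print(f"{task}: {r['kuzu']}x vs Kuzu, {r['neo4j']}x vs Neo4j")
```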
136+
137+
On a fairer, fully-prepared, in-process Kuzu re-run (LiveJournal 35M), GFQL is still
138+
**17× faster** on a typical seed (0.126 ms vs 2.13 ms) and **~6× faster** on a hub seed (3.76 ms vs
139+
22.6 ms). *(Kuzu's worst-case-optimal joins can win on cyclic / multi-way-join patterns —
140+
triangles, cliques — which these forward-expansion lookups do not exercise; we do not
141+
claim those.)*
142+
143+
**Selective traversal is CPU's game.** The indexed hop is tiny work, so the GPU's
144+
kernel-launch floor (~3 ms on cuDF) loses to a ~0.13 ms pandas / ~0.16 ms Polars
145+
``searchsorted`` — the clean inverse of *bulk* analytics, where the GPU pulls ahead
146+
(see :doc:`engines`). Pick the index for selective traversal and a **CPU engine** to
147+
drive it.
148+
149+
Reproduce: ``benchmarks/gfql/index_takeover_bench.py``,
150+
``benchmarks/gfql/index_vs_dbs.py``, ``benchmarks/gfql/index_vs_kuzu_prepared.py``.
151+
Hardware: DGX ``dgx-spark``, GB10 GPU.
152+
153+
Honesty and cost
154+
----------------
155+
156+
- **Build cost** is one ``O(E log E)`` sort, amortized over subsequent queries.
157+
``index_policy='auto'`` only builds when the planner predicts a selective query will
158+
pay it back.
159+
- **No change to default behavior.** With no index resident and ``index_policy='use'``
160+
(the default), queries run exactly as before.
161+
- **Parity-or-fallback.** The index accelerates the seeded scan sites it covers (forward /
162+
reverse hop, the Polars hop, the single-hop chain fast path). Any uncovered feature —
163+
edge / source / destination match, ``target_wave_front``, ``min_hops>1``, labeling —
164+
falls back to the scan/join path. The indexed subgraph is verified equal to the scan
165+
subgraph in differential tests across pandas / cuDF / Polars / Polars-GPU. It is an
166+
accelerator, never a source of a different answer.
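
Amortization can be estimated with a simple break-even count: the build pays for itself once the per-query savings add up to the build time. In the sketch below the scan and indexed latencies come from the 6.4M-edge row of the scaling table, while ``build_ms`` is a made-up placeholder (this guide does not report a build time) and ``break_even_queries`` is a hypothetical helper.

```python
import math

def break_even_queries(build_ms, scan_ms, indexed_ms):
    """Smallest number of seeded queries after which building the index pays off."""
    saved_per_query = scan_ms - indexed_ms
    if saved_per_query <= 0:
        return math.inf                 # the index never pays for itself
    return math.ceil(build_ms / saved_per_query)

# build_ms=2000.0 is a placeholder, NOT a measured number; substitute your own.
n = break_even_queries(build_ms=2000.0, scan_ms=105.0, indexed_ms=0.124)
print(n)
```

With these inputs the index is ahead after 20 seeded queries; a one-off lookup on a fresh graph would be better served by the plain scan.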
167+
168+
See also
169+
--------
170+
171+
- :doc:`engines` — choosing pandas / Polars / cuDF / Polars-GPU for non-seeded work.
172+
- :doc:`performance` — the vectorization + GPU design behind GFQL.
173+
- :doc:`benchmark_filter_pagerank` — an end-to-end filter → PageRank → filter comparison vs Neo4j.
