| 1 | +Seeded Traversal Indexes (CSR Adjacency) |
| 2 | +======================================== |
| 3 | + |
| 4 | +A **seeded** graph query starts from a known set of nodes — "the neighbors of these |
| 5 | +50 accounts", "2 hops out from this device" — rather than scanning the whole graph. |
| 6 | +By default GFQL answers a seeded ``hop`` with an ``O(E)`` pass over every edge. With an |
| 7 | +opt-in **CSR adjacency index**, the same hop becomes an ``O(degree)`` gather: its cost |
| 8 | +depends on how many edges the *seeds* touch, not on how big the graph is. Latency is |
| 9 | +therefore **flat in graph size**, and in the benchmarks below it is roughly 6× to 28× faster than Kuzu and Neo4j on selective lookups. |
| 10 | + |
| 11 | +The answer never changes. The index is a pay-as-you-go accelerator: a query either |
| 12 | +uses a resident index or falls back to the scan, and any feature the index does not cover |
| 13 | +also falls back, so the result always matches the scan's. |
| 14 | + |
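The scan-versus-gather contrast can be made concrete with a minimal NumPy sketch. This is not GFQL's implementation: the helper names `build_out_adjacency` and `seeded_out_edges` are hypothetical, and the only assumption is that edges are given as an integer source-id array. The sketch sorts edge row positions by source once, then answers a seeded lookup with binary searches, and checks that the result equals the full scan.

```python
import numpy as np

def build_out_adjacency(src):
    """One-time O(E log E) build: edge row positions grouped by source id.

    Hypothetical helper for illustration only. The original edge order is
    untouched; `order` is a sidecar of row positions.
    """
    order = np.argsort(src, kind="stable")
    return src[order], order

def seeded_out_edges(sorted_src, order, seeds):
    """Row positions of edges leaving any seed.

    Two binary searches per seed plus a slice, so the work scales with the
    seeds' degree rather than with the total edge count.
    """
    seeds = np.unique(seeds)  # dedupe so no edge is returned twice
    lo = np.searchsorted(sorted_src, seeds, side="left")
    hi = np.searchsorted(sorted_src, seeds, side="right")
    parts = [order[a:b] for a, b in zip(lo, hi)]
    if not parts:
        return np.empty(0, dtype=np.int64)
    return np.sort(np.concatenate(parts))

# Tiny example graph: six edges, given as source ids per edge row.
src = np.array([0, 2, 0, 1, 2, 3])
seeds = [0, 3]

sorted_src, order = build_out_adjacency(src)
indexed_rows = seeded_out_edges(sorted_src, order, seeds)

# Reference answer: the O(E) scan over every edge.
scan_rows = np.flatnonzero(np.isin(src, seeds))
```

Both paths select the same edge rows here; only the amount of work differs, which is the property the index relies on.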
| 15 | +When to use it |
| 16 | +-------------- |
| 17 | + |
| 18 | +- **Seeded traversals**: you start from specific node ids (a watchlist, a session, a fraud |
| 19 | + ring's known members) and hop out 1–3 steps. |
| 20 | +- **Repeated queries** against the same graph: build the index once, amortize it over many |
| 21 | + seeded lookups. |
| 22 | +- **Interactive / point-lookup latency**: sub-millisecond neighbor expansion. |
| 23 | + |
| 24 | +It does **not** help a full-graph scan (a property filter over every node, a global |
| 25 | +PageRank). For those, choose an *engine* instead — see :doc:`engines`. |
| 26 | + |
| 27 | +Quick start |
| 28 | +----------- |
| 29 | + |
| 30 | +.. code-block:: python |
| 31 | +
|
| 32 | + import graphistry |
| 33 | +
|
| 34 | + g = graphistry.edges(edges_df, "src", "dst").nodes(nodes_df, "id") |
| 35 | +
|
| 36 | + # Build the indexes once (out+in adjacency, plus a node-id accelerator when ids are unique) |
| 37 | + g = g.gfql_index_all() |
| 38 | +
|
| 39 | + # Seeded query — the index is used automatically (default index_policy='use') |
| 40 | + out = g.gfql("MATCH (a)-[e]->(b) WHERE a.id IN $seeds RETURN a, e, b", |
| 41 | + params={"seeds": my_seed_ids}) |
| 42 | +
|
| 43 | +``gfql_index_all()`` is the one-liner. For finer control, build a single kind: |
| 44 | + |
| 45 | +.. code-block:: python |
| 46 | +
|
| 47 | + g = g.create_index("edge_out_adj") # outgoing adjacency (forward hops) |
| 48 | + g = g.create_index("edge_in_adj") # incoming adjacency (reverse hops) |
| 49 | + g = g.create_index("node_id") # node-id lookup accelerator (unique ids only) |
| 50 | +
|
| 51 | + g.show_indexes() # inspect what's resident |
| 52 | + g = g.drop_index() # drop all (or drop_index("edge_out_adj")) |
| 53 | +
|
| 54 | +The index is a **sidecar over edge row positions** — it never reorders your ``.edges`` / |
| 55 | +``.nodes`` frames, and it is fingerprint-validated: rebinding ``.edges()`` safely |
| 56 | +invalidates a stale index (treated as absent, never a wrong answer). |
| 57 | + |
| 58 | +Controlling the planner |
| 59 | +----------------------- |
| 60 | + |
| 61 | +``gfql(..., index_policy=...)`` decides whether a resident index is used: |
| 62 | + |
| 63 | +.. list-table:: |
| 64 | + :header-rows: 1 |
| 65 | + :widths: 18 82 |
| 66 | + |
| 67 | + * - ``index_policy`` |
| 68 | + - Behavior |
| 69 | + * - ``'use'`` *(default)* |
| 70 | + - Use a resident index when one covers the query; never build one. Zero overhead if |
| 71 | + no index exists. |
| 72 | + * - ``'auto'`` |
| 73 | + - Build an index on the fly when the planner predicts it pays off (selective seed set). |
| 74 | + * - ``'force'`` |
| 75 | + - Require the index path (useful for benchmarking / asserting it is engaged). |
| 76 | + * - ``'off'`` |
| 77 | + - Ignore indexes entirely (the plain ``O(E)`` scan). |
| 78 | + |
| 79 | +Use ``g.gfql_explain(query, index_policy=...)`` to see whether the index path was taken. |
| 80 | + |
| 81 | +The indexes are **engine-uniform**: numpy host arrays for pandas / Polars, cupy on-device |
| 82 | +for cuDF. They are also exposed as **Cypher DDL** (``CREATE GFQL INDEX FOR edge_out_adj``, |
| 83 | +``DROP GFQL INDEX``, ``SHOW GFQL INDEXES`` — the mandatory ``GFQL`` token distinguishes them |
| 84 | +from standard property ``CREATE INDEX``) and in the **JSON wire protocol** |
| 85 | +(``{"type": "CreateIndex", ...}`` ops plus ``index_policy`` in the request envelope), so a |
| 86 | +remote ``gfql_remote`` call can carry the same index intent. |
| 87 | + |
| 88 | +Performance |
| 89 | +----------- |
| 90 | + |
| 91 | +**Flat in graph size.** A seeded 1-hop stays sub-millisecond as the graph grows 10×, while |
| 92 | +the ``O(E)`` scan grows linearly. Synthetic power-law graphs, GFQL-pandas, warm median, |
| 93 | +every cell guarded so the index path was taken *and* the indexed result equals the scan |
| 94 | +result: |
| 95 | + |
| 96 | +.. list-table:: |
| 97 | + :header-rows: 1 |
| 98 | + :widths: 40 30 30 |
| 99 | + |
| 100 | + * - Seeded 1-hop |
| 101 | + - 0.8M nodes / 6.4M edges |
| 102 | + - 8M nodes / 64M edges |
| 103 | + * - **Indexed (O(degree))** |
| 104 | + - **0.124 ms** |
| 105 | + - **0.122 ms** *(flat)* |
| 106 | + * - Scan (O(E)) |
| 107 | + - 105 ms |
| 108 | + - 1045 ms |
| 109 | + |
| 110 | +The same holds on real power-law graphs: a typical-seed 1-hop is ~0.13 ms on LiveJournal |
| 111 | +(35M edges) and ~0.14 ms on Orkut (117M edges), versus ``O(E)`` scans of 367 ms and 1208 ms respectively. |
| 112 | + |
| 113 | +**Beats embedded graph databases on selective lookups.** Same graph (0.8M nodes / 6.4M |
| 114 | +edges), matched result counts, warm median. GFQL is CPU-pandas with the index; Kuzu and |
| 115 | +Neo4j use their native indexes: |
| 116 | + |
| 117 | +.. list-table:: |
| 118 | + :header-rows: 1 |
| 119 | + :widths: 24 22 18 18 18 |
| 120 | + |
| 121 | + * - Task |
| 122 | + - GFQL (indexed) |
| 123 | + - Kuzu |
| 124 | + - Neo4j |
| 125 | + - GFQL speedup |
| 126 | + * - 1-hop seeded |
| 127 | + - **0.123 ms** |
| 128 | + - 1.15 ms |
| 129 | + - 1.45 ms |
| 130 | + - 9.4× / 11.8× |
| 131 | + * - 1–2-hop seeded |
| 132 | + - **0.150 ms** |
| 133 | + - 4.25 ms |
| 134 | + - 2.54 ms |
| 135 | + - 28× / 16.9× |
| 136 | + |
| 137 | +On a fairer, fully-prepared, in-process Kuzu re-run (LiveJournal 35M), GFQL is still |
| 138 | +**17×** faster on a typical seed (0.126 ms vs 2.13 ms) and **6×** faster on a hub seed (3.76 ms vs |
| 139 | +22.6 ms). *(Kuzu's worst-case-optimal joins can win on cyclic / multi-way-join patterns — |
| 140 | +triangles, cliques — which these forward-expansion lookups do not exercise; we do not |
| 141 | +claim those.)* |
| 142 | + |
| 143 | +**Selective traversal is CPU's game.** The indexed hop is tiny work, so the GPU's |
| 144 | +kernel-launch floor (~3 ms on cuDF) loses to a ~0.13 ms pandas / ~0.16 ms Polars |
| 145 | +``searchsorted`` — the clean inverse of *bulk* analytics, where the GPU pulls ahead |
| 146 | +(see :doc:`engines`). Pick the index for selective traversal and a **CPU engine** to |
| 147 | +drive it. |
| 148 | + |
| 149 | +Reproduce: ``benchmarks/gfql/index_takeover_bench.py``, |
| 150 | +``benchmarks/gfql/index_vs_dbs.py``, ``benchmarks/gfql/index_vs_kuzu_prepared.py``. |
| 151 | +Hardware: DGX ``dgx-spark``, GB10 GPU. |
| 152 | + |
| 153 | +Honesty and cost |
| 154 | +---------------- |
| 155 | + |
| 156 | +- **Build cost** is one ``O(E log E)`` sort, amortized over subsequent queries. |
| 157 | + ``index_policy='auto'`` only builds when the planner predicts a selective query will |
| 158 | + pay it back. |
| 159 | +- **No change to default behavior.** With no index resident and ``index_policy='use'`` |
| 160 | + (the default), queries run exactly as before. |
| 161 | +- **Parity-or-fallback.** The index accelerates the seeded scan sites it covers (forward / |
| 162 | + reverse hop, the Polars hop, the single-hop chain fast path). Any uncovered feature — |
| 163 | + edge / source / destination match, ``target_wave_front``, ``min_hops>1``, labeling — |
| 164 | + falls back to the scan/join path. The indexed subgraph is verified equal to the scan |
| 165 | + subgraph in differential tests across pandas / cuDF / Polars / Polars-GPU. It is an |
| 166 | + accelerator, never a source of a different answer. |
| 167 | + |
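The amortization argument in the build-cost bullet can be worked through as simple arithmetic. The sketch below is illustrative only: the scan and indexed latencies come from the 0.8M-node / 6.4M-edge row of the performance table, but the build time is a made-up placeholder (this page does not report one), and `break_even_queries` is a hypothetical helper, not part of GFQL.

```python
import math

def break_even_queries(build_ms, scan_ms, indexed_ms):
    """Smallest number of seeded queries for which build + indexed beats scanning.

    Hypothetical helper: returns None when the indexed path is not faster
    per query, since the build cost is then never recovered.
    """
    saved_per_query = scan_ms - indexed_ms
    if saved_per_query <= 0:
        return None
    return math.ceil(build_ms / saved_per_query)

SCAN_MS = 105.0           # from the table: O(E) scan, 6.4M edges
INDEXED_MS = 0.124        # from the table: indexed 1-hop, same graph
ASSUMED_BUILD_MS = 500.0  # ASSUMPTION: placeholder, not a measured build time

queries_needed = break_even_queries(ASSUMED_BUILD_MS, SCAN_MS, INDEXED_MS)
print(queries_needed)
```

Under that assumed build time the index pays for itself after a handful of seeded queries; substitute a measured build time for your own graph to get a real figure.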
| 168 | +See also |
| 169 | +-------- |
| 170 | + |
| 171 | +- :doc:`engines` — choosing pandas / Polars / cuDF / Polars-GPU for non-seeded work. |
| 172 | +- :doc:`performance` — the vectorization + GPU design behind GFQL. |
| 173 | +- :doc:`benchmark_filter_pagerank` — an end-to-end filter → PageRank → filter comparison vs Neo4j. |