diff --git a/CHANGELOG.md b/CHANGELOG.md index 8dcb427b35..189ce1c134 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,6 +8,9 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm ## [Development] +### Documentation +- **GFQL engine-selection docs (pandas / polars / cuDF / polars-gpu)**: New :doc:`Choosing a GFQL Engine ` page — a numbers-first, persona-tested guide to the four interchangeable engines. Adds the one-keyword `engine='polars'` speedup (up to ~38× over pandas on real graphs, no GPU), a motivating warm-median comparison table on real public graphs (LiveJournal 35M / Orkut 117M), a decision matrix (workload shape × size × hardware → engine, with the measured ~10K-edge CPU crossover, GPU-work-bound rule, polars-gpu memory-pressure caveat, and GPU-or-error contract), a cuDF-vs-polars-gpu disambiguation (eager-op vs fused-lazy; cuDF is not deprecated), an honest "when *not* to use Polars" section, the differential-parity guarantee, and a methodology + reproducer-script disclosure. Rewrote the top of `gfql/performance.rst` to lead with the engine comparison (de-marketed the prose), wired the new page into the GFQL toctree + recommended paths, and added Polars/polars-gpu to the engine examples in `gfql/quick.rst` and `gfql/about.rst` (previously only pandas/cuDF were documented). Driven by 4-persona doc user-testing (pandas DS, RAPIDS user, perf engineer, skeptical evaluator). + ### Added - **GFQL polars execution config is Python-settable and live**: `set_cpu_streaming(bool)` and `set_gpu_executor('in-memory'|'streaming')` in `graphistry.compute.gfql.lazy` (plus the public `GPU_EXECUTORS` options and `GpuExecutor` type) set the CPU-streaming / GPU-executor knobs from Python. 
They resolve **Python override > environment variable > default**, read **live** per collect — previously these were env-only (`GFQL_POLARS_CPU_STREAMING` / `GFQL_POLARS_GPU_EXECUTOR`) and frozen at import, so neither a Python setting nor a post-import env change took effect. `None` resets a setter to env/default. - **GFQL engine conversion honors the `validate`/`warn` convention**: `Engine.df_to_engine(df, engine, *, validate=, warn=)` threads the repo-wide `validate` (`'strict'`/`'strict-fast'`/`'autofix'`; `True`→strict, `False`→autofix) + `warn` protocol into the pandas→polars and pandas→cuDF converters. On a mixed-type object column that Arrow/polars/cuDF cannot represent, `strict` raises (`NotImplementedError` for polars, `ArrowConversionError` for cuDF) and `autofix` coerces the column to string and warns — the same convention as `plot()`/`upload()`. Each engine keeps its established default (polars `strict` = parity-or-raise; cuDF `autofix` = its shipped best-effort coercion, now `warn`-suppressible). diff --git a/benchmarks/gfql/index_crossover_bench.py b/benchmarks/gfql/index_crossover_bench.py new file mode 100644 index 0000000000..4406fe5102 --- /dev/null +++ b/benchmarks/gfql/index_crossover_bench.py @@ -0,0 +1,64 @@ +#!/usr/bin/env python3 +"""Small-N pandas-vs-polars CROSSOVER bench (CPU). Answers "where does polars start +beating pandas?" per workload SHAPE, on a real graph subsampled to N edges. + +The crossover is shape-dependent: row-pipeline shapes (filter / WHERE+ORDER) cross over +much earlier than traversal (chain orchestration is the residual small-N fixed cost). +CPU only (the crossover question is pandas-CPU vs polars-CPU); no GPU needed. 
+ +Env: PARQUET=/data/edges.parquet EDGES=10000,100000,1000000 REPS=15 WARM=3 OUT=/tmp/x.jsonl +""" +from __future__ import annotations +import json, os, statistics, time +import numpy as np +import pandas as pd +import graphistry +from graphistry.compute.ast import n, e_forward + + +def med(fn, reps, warm): + for _ in range(warm): + fn() + ts = [] + for _ in range(reps): + t = time.perf_counter(); fn(); ts.append((time.perf_counter() - t) * 1e3) + ts.sort() + return statistics.median(ts) + + +def main(): + edf_full = pd.read_parquet(os.environ["PARQUET"]).astype({"src": np.int64, "dst": np.int64}) + sizes = [int(x) for x in os.environ.get("EDGES", "10000,100000,1000000").split(",")] + reps = int(os.environ.get("REPS", "15")); warm = int(os.environ.get("WARM", "3")) + outf = open(os.environ["OUT"], "a") if os.environ.get("OUT") else None + print(f"{'shape':10} {'edges':>9} {'pandas_ms':>10} {'polars_ms':>10} {'polars_speedup':>15}") + for E in sizes: + edf = edf_full.head(E).reset_index(drop=True) + nodes = np.unique(np.concatenate([edf["src"].values, edf["dst"].values])) + ndf = pd.DataFrame({"id": nodes, "val": (nodes % 100).astype(np.int64)}) + g = graphistry.nodes(ndf, "id").edges(edf, "src", "dst") + seeds = nodes[: max(1, len(nodes) // 100)].tolist() # ~1% frontier + shapes = { + "filter": lambda eng: g.gfql([n({"val": 50})], engine=eng), + "hop1": lambda eng: g.gfql([n({"id": seeds}), e_forward()], engine=eng), + "where_ord": lambda eng: g.gfql( + "MATCH (a) WHERE a.val > 50 RETURN a.id ORDER BY a.id LIMIT 100", engine=eng), + } + for name, fn in shapes.items(): + try: + rp = fn("pandas"); rl = fn("polars") # warm + sanity + pm = med(lambda: fn("pandas"), reps, warm) + lm = med(lambda: fn("polars"), reps, warm) + sp = pm / lm if lm else float("nan") + print(f"{name:10} {E:>9} {pm:>10.3f} {lm:>10.3f} {('polars '+format(sp,'.2f')+'x') if sp>=1 else ('PANDAS '+format(1/sp,'.2f')+'x'):>15}") + if outf: + outf.write(json.dumps(dict(shape=name, edges=E, 
pandas_ms=pm, polars_ms=lm, + polars_speedup=sp)) + "\n"); outf.flush() + except Exception as ex: + print(f"{name:10} {E:>9} FAILED {type(ex).__name__}: {ex}") + if outf: + outf.close() + + +if __name__ == "__main__": + main() diff --git a/docs/source/gfql/about.rst b/docs/source/gfql/about.rst index fa57ea92e7..bfb32d6c24 100644 --- a/docs/source/gfql/about.rst +++ b/docs/source/gfql/about.rst @@ -27,7 +27,7 @@ GFQL fills a critical gap in the data community by providing an in-process, high **Key Benefits:** -- **Dataframe-Native:** Works directly with Pandas, cuDF, and other dataframe libraries. +- **Dataframe-Native:** Works directly with Pandas, Polars, cuDF, and other dataframe libraries. - **High Performance:** Optimized for both CPU and GPU execution. - **Ease of Use:** No need for external databases or new infrastructure. - **Interoperability:** Integrates with the Python data science ecosystem, including PyGraphistry for visualization. @@ -372,21 +372,30 @@ GFQL is optimized for GPU acceleration using ``cudf`` and ``rapids``. When using - GFQL detects ``cudf`` dataframes and runs the query on the GPU. - Achieves significant performance improvements on large datasets. -7. Forcing GPU Mode -~~~~~~~~~~~~~~~~~~~~ +7. Selecting an Engine (CPU and GPU) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -You can explicitly set the engine to ensure GPU execution. +You can explicitly set the execution engine. The same query returns identical +results on every engine — see :doc:`Choosing an Engine `. -**Example: Force GFQL to use GPU engine** +**Example: CPU columnar speedup (no GPU)** :: - g_result = g_gpu.gfql([ ... ], engine='cudf') + g_result = g.gfql([ ... ], engine='polars') # up to ~38x over pandas on real graphs + +**Example: Force GFQL to use a GPU engine** + +:: + + g_result = g_gpu.gfql([ ... ], engine='cudf') # NVIDIA GPU, eager + g_result = g_gpu.gfql([ ... 
], engine='polars-gpu') # NVIDIA GPU, fused plan **Explanation:** -- ``engine='cudf'`` forces the use of the GPU-accelerated engine. -- Useful when you want to ensure the query runs on the GPU. +- ``engine='polars'`` runs the columnar CPU engine — the biggest win without a GPU. +- ``engine='cudf'`` / ``'polars-gpu'`` force GPU-accelerated execution. +- Useful when you want to ensure the query runs on a specific engine. Integration with PyData Ecosystem --------------------------------- diff --git a/docs/source/gfql/benchmark_filter_pagerank.rst b/docs/source/gfql/benchmark_filter_pagerank.rst index 174fa893ab..0d5e6ebba1 100644 --- a/docs/source/gfql/benchmark_filter_pagerank.rst +++ b/docs/source/gfql/benchmark_filter_pagerank.rst @@ -30,7 +30,10 @@ no database required. This benchmark compares **Graphistry's local Cypher** - **3.33s** - **>56x** -*Warm median of 5 runs, 2 warmup iterations. DGX dgx-spark, GB10 GPU.* +*Pipeline time (search + PageRank + search), warm median of 5 runs, 2 warmup iterations. DGX +dgx-spark, GB10 GPU. The per-graph sections below report full-lifecycle totals that also include +one-time ETL/load — hence the larger numbers there (e.g. GPlus GPU 3.33s pipeline vs +~7.1s lifecycle).* The pipeline ------------ @@ -173,8 +176,23 @@ pandas / cuDF). That is what makes the CPU-to-GPU switch a configuration flag (``engine="cudf"``) rather than a rewrite, and what keeps ETL, search, and analytics in the same in-process pipeline. +**Same answer on every engine.** The CPU and GPU results above are not just +comparable — they are *identical*. Differential parity across ``pandas`` / +``polars`` / ``cudf`` / ``polars-gpu`` is a GFQL release gate: an engine either +returns the same result or raises ``NotImplementedError`` — never a silently +different answer. So the speedups here are purely a hardware/engine choice, not a +change in what the query means.
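The parity claim above can be spot-checked locally. The following is a minimal sketch, not the project's release-gate test: ``rows_match`` is a hypothetical helper (not part of ``graphistry``), it assumes results have already been exported as lists of row dicts with hashable values, and it assumes row order is not significant (drop that assumption for ``ORDER BY`` queries).

```python
from collections import Counter


def rows_match(rows_a, rows_b):
    """Order-insensitive comparison of two result sets given as lists of dicts.

    Hypothetical helper for illustration only; not part of graphistry.
    """
    def canon(rows):
        # Each row becomes a sorted tuple of (column, value) pairs, so neither
        # column order nor row order affects the comparison. Counter keeps
        # duplicate rows, so a missing or extra duplicate is still detected.
        return Counter(tuple(sorted(row.items())) for row in rows)

    return canon(rows_a) == canon(rows_b)


# Assumed usage against real results (not executed here):
#   rows_pd = g.gfql(query, engine="pandas")._nodes.to_dict("records")
#   rows_pl = g.gfql(query, engine="polars")._nodes.to_dicts()
#   assert rows_match(rows_pd, rows_pl)

# Stand-in data: same rows, different row and column order.
pandas_rows = [{"id": 1, "val": 50}, {"id": 2, "val": 50}]
polars_rows = [{"val": 50, "id": 2}, {"val": 50, "id": 1}]
print(rows_match(pandas_rows, polars_rows))  # True
```

Comparing multisets of rows rather than frames keeps the check independent of the engine's frame type. Float columns containing NaN would need separate handling, since NaN never compares equal to itself.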
+ +This page is one workload (a filter → PageRank → filter pipeline) against one +external baseline (Neo4j+GDS). For the full four-engine picture — when Polars +beats pandas on CPU, when the GPU pulls ahead, and how to choose — see +:doc:`engines`. For sub-millisecond *seeded* lookups that beat Kuzu and Neo4j +by 9–28×, see :doc:`index_adjacency`. + For more on the GFQL design and supported surface: +- :doc:`engines` — choosing pandas / Polars / cuDF / Polars-GPU +- :doc:`index_adjacency` — seeded-traversal CSR adjacency index - :doc:`cypher` — Cypher syntax through ``g.gfql("MATCH ...")`` - :doc:`overview` — GFQL design, features, and GPU acceleration - :doc:`about` — 10-minute introduction to GFQL diff --git a/docs/source/gfql/engines.rst b/docs/source/gfql/engines.rst new file mode 100644 index 0000000000..dae255f51f --- /dev/null +++ b/docs/source/gfql/engines.rst @@ -0,0 +1,596 @@ +.. _gfql-engines: + +Choosing a GFQL Engine: pandas, Polars, cuDF, Polars-GPU +======================================================== + +GFQL runs the **same query** on four interchangeable execution engines. You pick +the engine with one keyword — ``engine=``, accepted uniformly by ``g.gfql()`` and +``g.hop()`` — and GFQL returns **identical results** on every one (differential parity +is a release gate). Pick the engine that fits your hardware and workload; nothing else changes. + +.. note:: + **New to GFQL?** This page assumes you already have a graph ``g`` and a ``query``. If not, + build one first — see :doc:`about` (10 Minutes to GFQL). + +The one-line speedup +-------------------- + +On real graphs, switching the default ``pandas`` engine to the columnar **Polars** +engine is a one-keyword change — no GPU, same results: + +.. doc-test: skip + +.. 
code-block:: python + + import graphistry + g = graphistry.edges(df, 'src', 'dst') # df: your edges dataframe (pandas / Polars / cuDF) + query = "MATCH (a)-[e]->(b) RETURN b" # any GFQL / Cypher query + + g.gfql(query) # default engine='auto': pandas unless df is cuDF + g.gfql(query, engine='polars') # up to ~38x faster on real graphs, no GPU, identical results + +Your existing pandas, Polars, or cuDF graph works as-is: the input frames are accepted and +coerced once; the only change is the keyword. The catch: a few exotic Cypher features still +require ``engine='pandas'`` (they raise rather than silently bridge), and the GPU engines only +pay off on larger work. On CPU, Polars wins the common graph-query shapes (traversal, +``WHERE``/``ORDER``, aggregation) from ~10K edges up — see *When not to use Polars* below. + +.. warning:: + **Already a Polars user? Pass** ``engine='polars'`` **— the default does not select it.** With the + default ``engine='auto'``, a graph built from ``polars.DataFrame`` is **silently coerced to + pandas** (``auto`` resolves to ``cudf`` for cuDF input and ``pandas`` for everything else, + *including Polars*; it never selects the Polars engine). To stay native end-to-end, pass + ``engine='polars'`` explicitly: + + .. code-block:: python + + import polars as pl, graphistry + g = graphistry.edges(edges_pl, 'src', 'dst').nodes(nodes_pl, 'id') # polars frames + out = g.gfql(query) # auto -> coerced to PANDAS (out._nodes is pandas!) + out = g.gfql(query, engine='polars') # native Polars in and out (out._nodes is polars) + +.. note:: + **Result frames match the engine.** With ``engine='polars'`` or ``'polars-gpu'`` the + output is Polars — ``result._nodes`` and ``result._edges`` are ``polars.DataFrame`` (and + ``cudf.DataFrame`` for ``engine='cudf'``). If downstream code is pandas-specific (``.iloc``, + ``.loc``, ``groupby().apply()``), call ``result._nodes.to_pandas()`` to convert back. + +The four engines +---------------- + +.. 
list-table:: + :header-rows: 1 + :widths: 16 14 18 12 40 + + * - Engine + - Hardware + - Frame type + - Opt-in? + - In one line + * - ``pandas`` + - CPU + - ``pandas`` + - default + - Universal default; best on small/interactive graphs. + * - ``polars`` + - CPU + - ``polars`` + - explicit + - Columnar + fused lazy plan; the CPU speed win, **no GPU needed**. + * - ``cudf`` + - NVIDIA GPU + - ``cudf`` + - explicit + - RAPIDS GPU, eager op-by-op; great for one very large materialization. + * - ``polars-gpu`` + - NVIDIA GPU + - ``polars`` + - explicit + - The Polars fused plan executed on GPU (cudf_polars); fastest on heavy multi-hop. + +``engine='auto'`` resolves to ``cudf`` for cuDF input and ``pandas`` otherwise. **AUTO +never selects Polars or Polars-GPU** — they are explicit opt-in (see *Why opt-in?* below). + +Motivating comparison (real graphs) +----------------------------------- + +Same query, same answers, four engines. Warm-median latency on **Orkut** (3.1M nodes / +**117M edges**, SNAP), measured on a single machine: + +.. list-table:: + :header-rows: 1 + :widths: 34 16 16 16 16 + + * - Workload (Orkut, 117M edges) + - ``pandas`` + - ``polars`` + - ``cudf`` + - ``polars-gpu`` + * - 1-hop from 10K seeds + - 2613 ms + - **68 ms** + - 1005 ms + - 63 ms + * - 2-hop from 10K seeds + - 18161 ms + - 2695 ms + - 2774 ms + - **1518 ms** + * - Full out-degree aggregation + - 799 ms + - 205 ms + - 314 ms + - **167 ms** + * - 2-hop from 100K seeds (~85M output rows) + - 28822 ms + - 8215 ms + - **6002 ms** + - 8559 ms + +*Warm median, identical result rows across all four engines. Reproducer:* +``benchmarks/gfql/index_bulk_olap_bench.py``. *See Methodology below.* + +Reading the table: + +- **Polars-CPU beats pandas up to ~38x** on bulk traversal and ~4x on aggregation — **with no + GPU**. On the 1-hop workload it is ~38x faster than pandas (68 ms vs 2613 ms). +- **Polars-CPU also beats cuDF** on these shapes (68 ms vs 1005 ms on 1-hop). 
cuDF runs + GFQL *eagerly*, op by op (a kernel launch + a materialized intermediate per hop), while + Polars builds **one fused lazy plan and collects once**. The fused plan wins until the + work is large enough to amortize GPU launch costs. +- **Polars-GPU is fastest on heavy multi-hop** (2-hop from 10K seeds: 1518 ms) and on + aggregation — the same fused plan, executed on the GPU. +- **cuDF wins the one extreme case** — a 2-hop from 100K seeds materializing ~85M output rows + (6.0 s) — where raw GPU throughput on a single massive join overtakes everything and + Polars-GPU comes under memory pressure (footnote F3). +- On a smaller graph (**LiveJournal**, 35M edges) the pattern holds: 1-hop from 10K seeds is + pandas 1129 ms → polars **37 ms** (~30x). Filter- and lookup-heavy workloads favor Polars + even more strongly — a separate **LDBC SNB sf1** benchmark shows order-of-magnitude gains + (tens of × over pandas; see ``benchmarks/gfql/`` and the GFQL benchmark notes). + +.. note:: + Route by workload shape and size (next section). **CPU Polars wins the common graph-query + shapes from ~10K edges up** — on LiveJournal subsampled (CPU, warm-median): 1-hop traversal + 2.7× / 4.5× / 7.6× and ``WHERE``+``ORDER`` 3.0× / 3.0× / 18× over pandas at 10K / 100K / 1M. + The **GPU** engines (cuDF / Polars-GPU) are the ones with a real small-size floor — they need + enough work to amortize kernel-launch cost (work-bound, [F2]). The only case pandas edges out + is a trivial sub-millisecond operation (e.g. a bare node-equality filter), where its boolean + mask beats Polars' plan overhead — but at <1 ms the difference is immaterial. Reproducer: + ``benchmarks/gfql/index_crossover_bench.py``. + +.. _gfql-vs-external-tools: + +GFQL vs external graph tools +---------------------------- + +GFQL is **dataframe-native**: ``pip install``, then query your existing pandas / Polars / +cuDF frame in-process — no separate database to stand up, no ETL to load, no cluster. 
Graph +databases (Neo4j, Kuzu) are a **system-of-record** you provision and ingest into first. The +table below is deliberately conservative: every speedup is stated with its condition, ``>`` +and did-not-finish markers are kept, and where we have no head-to-head we say **not +benchmarked** rather than guess. + +.. list-table:: + :header-rows: 1 + :widths: 14 22 30 34 + + * - Tool + - What it is / Setup + - Where GFQL wins (with condition) + - Where it complements / GFQL doesn't claim + * - **Neo4j + GDS** + - Server + GDS library; stand up a DB and ETL your data in. + - **Filter→PageRank→filter pipeline**, dgx-spark GB10, warm median: Twitter 2.4M — + 13.83 s Neo4j vs 2.55 s GFQL-CPU / **0.30 s GFQL-GPU (46×)**; GPlus 30M — + **>187 s (did-not-finish)** vs 75.78 s CPU / **3.33 s GPU (>56×)**. + - Neo4j remains the transactional system-of-record; run the read-heavy analytics in + GFQL. See :doc:`benchmark_filter_pagerank`. + * - **Kuzu** + - Embedded graph DB; still a separate store to load + index. + - **Seeded index lookup** (0.8M nodes / 6.4M edges): 1-hop **0.123 ms vs 1.15 ms + (9.4×)**, 2-hop **0.150 ms vs 4.25 ms (28×)**; prepared-Kuzu LiveJournal 35M ~ **17×** + typical seed, 6× hub. **Bulk frontier expansion** (LiveJournal 35M, 1-hop, many + seeds): **22× Kuzu**, up to **87× at k=100k**. See :doc:`index_adjacency`. + - **Not claimed:** cyclic / multi-way-join patterns (triangles, cliques) where Kuzu's + worst-case-optimal joins can win. Use Kuzu as the store; GFQL for bulk read analytics. + * - **igraph** + - Pure-Python/C graph library. + - — (not a standalone competitor here) + - **Complement, not competitor:** igraph is the CPU PageRank backend *inside* GFQL. + No head-to-head benchmarked. + * - **networkx** + - Pure-Python graph library; the floor most analysts start from. + - **not benchmarked** — expect order-of-magnitude headroom qualitatively (no measured + head-to-head). 
+ - Fine for small/interactive graphs; GFQL is the columnar/GPU path when they grow. + * - **Spark GraphFrames** + - *Distributed* graph engine on a Spark cluster; provision + tune the cluster. + - GFQL is *single-node* (CPU or one GPU): 100M+ edges in-process on **one machine**, + no cluster to stand up, interactive latency — and a single GPU often matches or beats + a Spark cluster on read-heavy traversal + PageRank at a fraction of the cost. + *Head-to-head not yet published.* + - Reach for GraphFrames when the graph genuinely exceeds one machine's memory. Motif / + triangle / multi-way-join queries **run** in GFQL but are not yet perf-benchmarked. + * - **PuppyGraph** + - Graph query layer *over your warehouse tables in place* (zero-ETL, query pushdown). + - GFQL adds GPU/CPU graph **analytics PuppyGraph does not offer — PageRank, centrality, + community** — on a pulled subgraph, in one pipeline. *No head-to-head yet.* + - **Complement:** use PuppyGraph for ad-hoc graph SQL across the whole warehouse; pull the + relevant subgraph into GFQL when you need GPU-accelerated analytics on it. + +GFQL **complements** a graph database more than it replaces one: keep Neo4j or Kuzu as the +system-of-record, and do the read-heavy search + analytics in GFQL so ETL, traversal, and +scoring stay in one in-process dataframe pipeline. Route by shape — **selective** seeded +lookups favor the GFQL index (up to 28× Kuzu, 16.9× Neo4j on 2-hop), **bulk** frontier +expansion and full pipelines favor Polars / GPU (22–87× Kuzu; **46–56× Neo4j** on the +filter→PageRank→filter pipeline). Against the **distributed** engines the axis is different: +GFQL trades horizontal scale-out for zero cluster/warehouse setup and interactive latency — +choose it below the single-machine ceiling (100M+ edges fit in-process; a cluster is only +needed once the graph genuinely exceeds one node's memory), and complement PuppyGraph's +zero-ETL warehouse graph with GFQL's GPU analytics. 
The one case we explicitly **do not** +claim is cyclic / multi-way-join patterns (triangles, cliques): they **run**, but Kuzu's +worst-case-optimal joins can beat a dataframe plan there and we have not yet perf-tuned them. + +Decision matrix +--------------- + +.. list-table:: + :header-rows: 1 + :widths: 30 16 18 22 14 + + * - Workload shape + - Size (edges) + - Hardware + - Recommended engine + - Notes + * - Filter / ``WHERE`` / aggregation + - > ~10K + - CPU + - ``polars`` + - wins from ~10K; gap grows with size (up to order-of-magnitude) [F1] + * - Bulk 1-hop frontier expansion + - > ~10K + - CPU + - ``polars`` + - wins from ~10K (2.7x); up to ~38x pandas, ~15x cuDF at 100M [F1] + * - Heavy multi-hop (2-hop+) + - large + - GPU + - ``polars-gpu`` + - fastest until extreme materialization [F3]; GPU-or-error [F4] + * - Full-graph aggregation + - 100M+ + - GPU + - ``polars-gpu`` / ``cudf`` + - GPU work-bound [F2] + * - One very large single materialization + - 80M+ output rows + - GPU + - ``cudf`` + - Polars-GPU can hit memory pressure here [F3] + * - Trivial sub-ms op (bare equality filter) + - any + - CPU + - ``pandas`` + - boolean mask beats Polars plan overhead; immaterial (<1 ms) [F1] + * - Selective / seeded traversal + - any + - CPU + - ``pandas``/``polars`` + **CSR index** + - O(degree), not an engine choice [F5] + +**[F1] CPU crossover is ~10K, not ~1M.** For the common graph-query shapes (traversal, +``WHERE``/``ORDER``, aggregation) CPU Polars beats pandas from ~10K edges up (2.7-18× in our +runs). Pandas only edges out on a trivial sub-millisecond operation (a bare equality mask), +where the absolute difference is immaterial. The real small-size floor is **GPU-only** — +cuDF / Polars-GPU need enough work to amortize kernel launch ([F2]). + +**[F2] GPU is work-bound, not size-bound.** A GPU wins when there is enough work to amortize +its ~3 ms kernel-launch floor: big frontiers, dense joins, full-graph aggregation. 
Tiny or +seeded work finishes faster on CPU. + +**[F3] Polars-GPU memory pressure.** On an extreme single materialization (~85M output rows, +2-hop from 100K seeds on Orkut) raw ``cudf`` leads (6.0 s) and ``polars-gpu`` slips (8.6 s) +as its in-memory GPU executor comes under memory pressure. Prefer ``cudf`` for that regime. + +**[F4] Polars-GPU is GPU-or-error.** If a step cannot run on the GPU it raises; it never silently +falls back to CPU while reporting the result as a GPU run (see *Parity and honesty* below). + +**[F5] Selective traversal is an indexing problem, not an engine choice.** A seeded ``hop`` +from a few nodes is fastest with the opt-in **CSR adjacency index** (``g.gfql_index_all()`` / +``g.create_index(...)``, ``index_policy=``), which turns the O(E) scan into an O(degree) +gather — flat in graph size, and 9–28× faster than Kuzu / Neo4j on selective lookups. It works +on all four engines, but seeded work is so small that **CPU wins**: on LiveJournal 35M a +typical-seed 1-hop is ~0.13 ms on pandas and ~0.16 ms on Polars (numpy ``searchsorted``) vs +~3 ms on cuDF (GPU kernel-launch floor) — the clean inverse of bulk, where the GPU pulls +ahead. So pick the index for selective traversal and a CPU engine to drive it. See +:doc:`index_adjacency` for the full guide. + +Switching engines +----------------- + +The engine is a single keyword on ``g.gfql()`` (and ``g.hop()``). The graph and +the query never change — only ``engine=`` does, and the answer stays identical +(or the call raises ``NotImplementedError`` rather than silently changing it). + +.. 
code-block:: python + + import graphistry + g = graphistry.edges(df, 'src', 'dst') # your existing graph (any frame type) + query = "MATCH (a)-[e]->(b) RETURN b" # any GFQL / Cypher query + + g.gfql(query) # default engine='auto': pandas unless df is cuDF + g.gfql(query, engine='polars') # CPU columnar, no GPU, identical results + g.gfql(query, engine='cudf') # NVIDIA GPU (RAPIDS) + g.gfql(query, engine='polars-gpu') # same fused plan on GPU + +Getting results back as pandas +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The result's ``._nodes`` / ``._edges`` come back in the engine's frame type: a +``polars.DataFrame`` for ``'polars'`` / ``'polars-gpu'``, a ``cudf.DataFrame`` +for ``'cudf'``. When downstream code is pandas-only (matplotlib, scikit-learn, +``.iloc`` / ``groupby().apply()``), convert once with ``.to_pandas()``: + +.. code-block:: python + + out = g.gfql(query, engine='polars') # or 'cudf' / 'polars-gpu' + nodes_pd = out._nodes.to_pandas() # -> pandas for matplotlib / sklearn / ... + nodes_pd.plot.scatter(x='x', y='y') # pandas-only downstream code, unchanged + +Mixing engines +~~~~~~~~~~~~~~~ + +The build frame type and the run engine are independent — GFQL coerces the input +frames to the engine you ask for. A pandas graph runs on ``engine='polars'``, a +Polars graph runs on ``engine='pandas'``, and so on. The only cost is a +**one-time convert** of the input frames at the start of the call; the query then +runs fully on the chosen engine. Note that ``engine='auto'`` (the default) +resolves to ``cudf`` for cuDF input and ``pandas`` for everything else — **it +never selects Polars or Polars-GPU**, so those two are always an explicit opt-in. + +.. tip:: + For selective, seeded traversal, build the CSR adjacency index once with + ``g.gfql_index_all()`` (or ``index_policy=``) — it works on all four engines + and turns the O(E) scan into an O(degree) gather. See :doc:`index_adjacency`. + +.. 
_gfql-offengine-calls: + +Analytics under Polars (``umap`` / ``hypergraph`` / ``compute_cugraph`` …) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +A GFQL ``call()`` that runs a **whole-graph analytic** — ``umap``, ``hypergraph``, +``compute_cugraph`` / ``compute_igraph``, the ``*_layout`` ops, ``collapse`` — has +**no native Polars implementation** (these wrap pandas / cuDF / GPU libraries and +always will). Under ``engine='polars'`` / ``'polars-gpu'`` GFQL runs them as a +**mode-gated, off-engine modality switch** rather than declining outright: + +- ``call_mode='auto'`` **(the default):** the analytic runs off-engine — on + **pandas** for ``polars``, on **cuDF (on device)** for ``polars-gpu`` — and its + result is coerced back to Polars **losslessly** (via Arrow). A one-time + ``RuntimeWarning`` per analytic notes the off-engine run. ``polars-gpu`` is + **GPU-or-error**: it bridges to cuDF and *declines* if the GPU/cuDF stack is + missing (it never silently drops a GPU analytic to host pandas). +- ``call_mode='strict'``: decline with ``NotImplementedError`` instead of + bridging — for benchmark integrity (no hidden modality switch attributed to the + Polars engine) or a hard memory ceiling. + +.. note:: + **Memory on a very large graph.** The bridge materializes a copy of the graph in + the off-engine format — pandas (host) for ``polars``, cuDF (device / unified + memory) for ``polars-gpu``. That transient copy is the *same* allocation you'd + incur running the analytic on ``engine='cudf'`` directly, so GFQL does **not** add + a per-call size cap (a row count is a poor memory proxy, and the real cap belongs + at the RMM / container / deployment layer). For a graph large enough that the copy + is a concern, either set ``call_mode='strict'`` (decline the bridge) or run the + analytic under an RMM device-memory limit / container memory limit, exactly as you + would for any cuDF workload.
+ +This bridge is **deliberately narrower** than traversal / filter / row ops (``hop``, +``WHERE``, ``RETURN`` …), which stay parity-or-``NotImplementedError`` and are +never bridged — a bridge there would hide a missing native implementation and misreport +pandas performance as Polars. Set the mode from Python or the environment (live, +Python override > env > default): + +.. doc-test: skip + +.. code-block:: python + + from graphistry.compute.gfql.lazy import set_call_mode, CALL_MODES # ('auto', 'strict') + + set_call_mode('strict') # decline off-engine analytics (pass None to reset to env/default) + # or: export GFQL_POLARS_CALL_MODE=strict + +cuDF vs Polars-GPU +------------------ + +Both run on an NVIDIA GPU, so which should you use? + +- **cuDF is not deprecated.** It remains a first-class, supported engine and is the right + choice for one very large materialization (footnote F3). +- **They execute differently.** ``cudf`` runs GFQL eagerly — each hop is a separate kernel + launch with a materialized intermediate. ``polars-gpu`` runs the **same fused lazy plan as + the CPU Polars engine**, collected once on the GPU. Fusing the plan is why ``polars-gpu`` + leads on heavy multi-hop and why even **CPU Polars often beats eager cuDF** on bulk work. +- **Frame type.** ``cudf`` operates on ``cudf.DataFrame``; ``polars-gpu`` operates on + ``polars.DataFrame`` (only the lazy ``.collect()`` runs on the GPU). Either way, a graph + built from pandas frames is accepted and coerced for you — only the keyword changes. +- **Install.** ``cudf`` and ``polars-gpu`` both need the RAPIDS GPU stack; ``polars-gpu`` + additionally uses ``cudf_polars``. ``polars`` (CPU) only needs ``pip install polars``. + +.. _gfql-larger-than-memory: + +Larger-than-memory: streaming execution +--------------------------------------- + +The default Polars engines run **in-memory**: fastest and most stable while the +graph and its query intermediates fit in RAM (or device memory).
When a query's +*intermediates* would blow past memory — a wide multi-hop frontier, a large +join, a big aggregation — GFQL has two **opt-in** streaming modes that trade a +little latency for a much larger working set: + +.. list-table:: + :header-rows: 1 + :widths: 22 20 58 + + * - Mode + - Engine + - What it does + * - ``GFQL_POLARS_CPU_STREAMING=1`` + - ``polars`` + - Collects the fused plan with Polars' **streaming engine** — processes in + batches and **spills to disk**, so intermediates can exceed RAM. + * - ``GFQL_POLARS_GPU_EXECUTOR=streaming`` + - ``polars-gpu`` + - Uses the **cudf-polars streaming executor** — the escape hatch for + results **larger than device memory** (the default in-memory executor + would OOM). + +Both are **off by default** on purpose: they add overhead that *regresses* +small/interactive work (~0.86× at 100K edges), and for the in-memory regime this +page measures, the default is faster and more stable. Results are +**parity-identical** to the default — streaming changes *how* the plan runs, not +*what* it returns. + +Set them by environment variable: + +.. code-block:: bash + + # CPU: batched + disk-spill for larger-than-RAM intermediates + export GFQL_POLARS_CPU_STREAMING=1 + + # GPU: streaming executor for larger-than-device-memory results + export GFQL_POLARS_GPU_EXECUTOR=streaming + +...or from Python at runtime — the setting is read **live** (per collect), and a Python +override takes precedence over the environment variable: + +.. doc-test: skip + +.. code-block:: python + + from graphistry.compute.gfql.lazy import ( + set_cpu_streaming, set_gpu_executor, GPU_EXECUTORS, + ) + + set_cpu_streaming(True) # CPU streaming collect (pass None to reset to env/default) + set_gpu_executor('streaming') # one of GPU_EXECUTORS == ('in-memory', 'streaming') + +Then use ``engine='polars'`` / ``engine='polars-gpu'`` exactly as before — no code +change: + +.. doc-test: skip + +.. 
code-block:: python + + import graphistry # streaming is enabled via the env vars or setters above + g = graphistry.edges(edges_df, 'src', 'dst') + result = g.gfql(query, engine='polars') # streaming collect (CPU, disk-spill) + # result = g.gfql(query, engine='polars-gpu') # streaming executor (GPU) + +.. note:: + **What streaming does and does not cover today.** These flags stream the + **query** (collect), which helps when the *input fits but the intermediates or + result do not*. They do **not** yet give out-of-core *input*: ``graphistry`` + currently materializes edge/node frames at ingestion (a passed + ``polars.LazyFrame`` is collected immediately), so the source graph must still + fit in memory. True out-of-core-from-disk (building GFQL directly on a lazy + ``pl.scan_parquet`` source so a graph larger than RAM never fully materializes) + is **work in progress**; see the Friendster (~1.8B edges) discussion on the + GraphFrames benchmark page. + +When **not** to use Polars +-------------------------- + +Honesty matters more than a bigger number: + +- **Trivial sub-millisecond operations** (a bare node-equality filter): pandas' boolean mask + beats Polars' plan overhead, but at <1 ms the difference is immaterial. For traversal / ``WHERE`` / + ``ORDER`` / aggregation, CPU Polars wins from ~10K edges up (footnote F1). The real small-size + caveat is **GPU-only** (cuDF / Polars-GPU need larger work; see footnote F2). +- **A few exotic Cypher features** are not yet native on Polars (e.g. cross-entity same-path + ``WHERE``, some temporal/entity-text forms). They raise an honest ``NotImplementedError`` + pointing at ``engine='pandas'``; GFQL **never** silently bridges Polars to pandas, because + that would misreport pandas performance as Polars (see *Parity and honesty*). +- **One extreme materialization (80M+ output rows):** prefer ``cudf`` over ``polars-gpu`` + (footnote F3).
+- **vs graph databases:** GFQL-Polars beats embedded kuzu on frontier expansion (up to ~87x + on LiveJournal 1-hop in our runs — reproducer ``benchmarks/gfql/index_vs_kuzu_prepared.py``), + and separately beats Neo4j+GDS end-to-end (:doc:`benchmark_filter_pagerank`). The honest + boundary: kuzu's worst-case-optimal joins target **cyclic / multi-way join** patterns + (triangles, cliques) that we have **not** yet benchmarked, and kuzu may lead there. + +Parity and honesty +------------------ + +- **Identical results across engines.** Differential parity — every engine's output must match + the pandas oracle — is a release gate, exercised across forward/reverse/undirected, 1-3 hop, + filters, and aggregations. +- **No silent fallback for traversal / filter / row ops — parity-verified.** For ``hop`` / + ``WHERE`` / ``RETURN`` / aggregation, the Polars engine runs natively or raises + ``NotImplementedError`` — it never quietly converts to pandas, so a *traversal* latency you + measure is real work on the engine you asked for. ``polars-gpu`` is **GPU-or-error**: if any + step of the plan cannot run on the GPU it raises (pointing at ``engine='polars'``) rather than + silently running on CPU and labelling it a GPU result. +- **Whole-graph analytics are the one mode-gated exception.** ``umap`` / ``hypergraph`` / + ``compute_cugraph`` and friends have no Polars kernel; under ``call_mode='auto'`` (default) + they run off-engine and warn once (see + :ref:`Analytics under Polars `). This is *not* silent — it warns — and + ``call_mode='strict'`` restores strict parity-or-decline for benchmark integrity, so a + benchmarked run can guarantee no hidden modality switch. + +Methodology +----------- + +- Host: ``dgx-spark`` (GB10 Grace-Blackwell, unified memory — the F3 memory-pressure + boundary is partly a property of this box), RAPIDS container + ``graphistry/test-rapids-official:26.02-gfql-polars``. 
+- Datasets: `SNAP `_ **com-LiveJournal** (35M edges), + **com-Orkut** (117M edges). The order-of-magnitude filter/lookup figure is from a separate + **LDBC SNB sf1** benchmark, not the table above. +- Measurement: **warm median** after 2 warmups (5 timed runs on Orkut, 8 on LiveJournal); + every reported cell is **guarded** — the result rows are verified identical across engines + before any timing is kept. +- Reproduce: ``benchmarks/gfql/index_bulk_olap_bench.py`` (engine comparison), + ``benchmarks/gfql/pandas_vs_polars.py``, and ``benchmarks/gfql/index_vs_kuzu_prepared.py`` + (vs kuzu). Numbers on this page are rendered from saved runs; the page does not re-run them. + +Install +------- + +.. code-block:: bash + + pip install graphistry # base; pandas engine works out of the box + pip install graphistry polars # adds the CPU 'polars' engine + # 'cudf' and 'polars-gpu' require the NVIDIA RAPIDS stack (GPU); + # 'polars-gpu' additionally uses cudf_polars. + +Then change one keyword — your existing graph and query are unchanged: + +.. doc-test: skip + +.. code-block:: python + + import graphistry + g = graphistry.edges(df, 'src', 'dst') # your existing pandas, Polars, or cuDF graph + g.gfql("MATCH (a)-[e]->(b) RETURN b", engine='polars') # CPU columnar + g.gfql("MATCH (a)-[e]->(b) RETURN b", engine='polars-gpu') # same plan on GPU + +Why opt-in? +----------- + +Polars and Polars-GPU are explicit (``engine='polars'`` / ``'polars-gpu'``; ``auto`` never +picks them). The main reason is robustness, not speed: a few exotic Cypher features still +require ``engine='pandas'`` and **raise** rather than silently bridge, so auto-selecting Polars +would turn queries that work today on pandas into hard errors. (Performance is rarely the +downside — CPU Polars wins common graph queries from ~10K edges; only trivial sub-millisecond +operations favor pandas, immaterially.) Opting in keeps the default behavior unchanged and +guarantees a working result. 
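Because Polars raises rather than bridging, a caller who wants Polars speed with a guaranteed result can make the fallback explicit in their own code. The sketch below is illustrative only: ``gfql_prefer_polars`` and ``_StubGraph`` are hypothetical names, not part of graphistry, and the stub merely imitates the documented behavior that ``engine='polars'`` raises ``NotImplementedError`` for an unsupported feature.

```python
import warnings


def gfql_prefer_polars(g, query):
    """Try engine='polars'; if it declines, warn and rerun on pandas.

    Returns (result, engine_used) so the caller always knows which
    engine produced the result. Hypothetical helper, not a graphistry API.
    """
    try:
        return g.gfql(query, engine="polars"), "polars"
    except NotImplementedError as exc:
        # Keep the engine switch visible instead of bridging silently.
        warnings.warn(f"polars declined the query ({exc}); rerunning with engine='pandas'")
        return g.gfql(query, engine="pandas"), "pandas"


class _StubGraph:
    """Hypothetical stand-in for a graphistry graph, used only for this demo."""

    def gfql(self, query, engine="auto"):
        # Assumption: imitate the parity-or-raise contract for one fake feature.
        if engine == "polars" and "unsupported" in query:
            raise NotImplementedError("not native on polars; use engine='pandas'")
        return f"{engine}:{query}"


result, used = gfql_prefer_polars(_StubGraph(), "MATCH (a)-[e]->(b) RETURN b")
fallback_result, fallback_used = gfql_prefer_polars(_StubGraph(), "unsupported feature")
```

With a real graph, pass your own ``g`` in place of the stub. Returning the engine name alongside the result keeps a pandas timing from being reported as a Polars one.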
+ +See also +-------- + +- :doc:`performance` — GFQL performance overview +- :doc:`benchmark_filter_pagerank` — end-to-end CPU/GPU vs Neo4j+GDS +- :doc:`/api/gfql/index` — GFQL API reference +- :doc:`remote` — run GFQL on a remote GPU diff --git a/docs/source/gfql/index.rst b/docs/source/gfql/index.rst index 8362feb6f7..7ac79d738a 100644 --- a/docs/source/gfql/index.rst +++ b/docs/source/gfql/index.rst @@ -36,7 +36,9 @@ Recommended paths: - New to GFQL: :doc:`overview` -> :doc:`quick` -> :doc:`where` -> :doc:`return` - Running Cypher syntax in GFQL: :doc:`cypher` -> :doc:`quick` -> :doc:`return` -> :doc:`spec/cypher_mapping` -- Performance path (intro -> GPU -> remote GPU): :doc:`about` -> :doc:`performance` -> :doc:`remote` +- Faster on CPU (no GPU): :doc:`engines` -> :doc:`performance` (one keyword, ``engine='polars'``, up to ~38x over pandas) +- Performance path (intro -> engine choice -> GPU -> remote GPU): :doc:`about` -> :doc:`engines` -> :doc:`performance` -> :doc:`remote` +- Fast seeded lookups (start from known nodes, like a DB index): :doc:`index_adjacency` (O(degree), flat in graph size, 9-28x vs Kuzu/Neo4j) - Translating existing Cypher to native GFQL: :doc:`spec/cypher_mapping` - Building agents/integrations: :doc:`spec/language` + :doc:`spec/python_embedding` + :doc:`spec/wire_protocol` @@ -50,6 +52,8 @@ See also: about overview remote + Choosing an Engine + Seeded Traversal Indexes GFQL CPU & GPU Acceleration End-to-End Benchmark translate diff --git a/docs/source/gfql/index_adjacency.rst b/docs/source/gfql/index_adjacency.rst new file mode 100644 index 0000000000..fb7cffe43c --- /dev/null +++ b/docs/source/gfql/index_adjacency.rst @@ -0,0 +1,174 @@ +Seeded Traversal Indexes (CSR Adjacency) +======================================== + +A **seeded** graph query starts from a known set of nodes — "the neighbors of these +50 accounts", "2 hops out from this device" — rather than scanning the whole graph. 
+By default GFQL answers a seeded ``hop`` with an ``O(E)`` pass over every edge. With an +opt-in **CSR adjacency index**, the same hop becomes an ``O(degree)`` gather: its cost +depends on how many edges the *seeds* touch, not on how big the graph is. The result is +**flat in graph size** — and it beats embedded graph databases on selective lookups. + +Nothing changes about the answer. The index is a pay-as-you-go accelerator: a query either +uses a resident index or falls back to the scan, and any feature the index does not cover +also falls back — never a different result. + +When to use it +-------------- + +- **Seeded traversals**: you start from specific node ids (a watchlist, a session, a fraud + ring's known members) and hop out 1–3 steps. +- **Repeated queries** against the same graph: build the index once, amortize it over many + seeded lookups. +- **Interactive / point-lookup latency**: sub-millisecond neighbor expansion. + +It does **not** help a full-graph scan (a property filter over every node, a global +PageRank). For those, choose an *engine* instead — see :doc:`engines`. + +Quick start +----------- + +.. code-block:: python + + import graphistry + from graphistry import n, e_forward, is_in + + g = graphistry.edges(edges_df, "src", "dst").nodes(nodes_df, "id") + + # Build the indexes once (out+in adjacency, plus a node-id accelerator when ids are unique) + g = g.gfql_index_all() + + # Seeded traversal — the index is used automatically (default index_policy='use') + my_seed_ids = ["a", "b"] # your seed node ids + out = g.gfql([n({"id": is_in(my_seed_ids)}), e_forward(), n()]) + +``gfql_index_all()`` is the one-liner. For finer control, build a single kind: + +.. 
code-block:: python + + g = g.create_index("edge_out_adj") # outgoing adjacency (forward hops) + g = g.create_index("edge_in_adj") # incoming adjacency (reverse hops) + g = g.create_index("node_id") # node-id lookup accelerator (unique ids only) + + g.show_indexes() # inspect what's resident + g = g.drop_index() # drop all (or drop_index("edge_out_adj")) + +The index is a **sidecar over edge row positions** — it never reorders your ``.edges`` / +``.nodes`` frames, and it is fingerprint-validated: rebinding ``.edges()`` safely +invalidates a stale index (treated as absent, never a wrong answer). + +Controlling the planner +----------------------- + +``gfql(..., index_policy=...)`` decides whether a resident index is used: + +.. list-table:: + :header-rows: 1 + :widths: 18 82 + + * - ``index_policy`` + - Behavior + * - ``'use'`` *(default)* + - Use a resident index when one covers the query; never build one. Zero overhead if + no index exists. + * - ``'auto'`` + - Build an index on the fly when the planner predicts it pays off (selective seed set). + * - ``'force'`` + - Require the index path (useful for benchmarking / asserting it is engaged). + * - ``'off'`` + - Ignore indexes entirely (the plain ``O(E)`` scan). + +Use ``g.gfql_explain(query, index_policy=...)`` to see whether the index path was taken. + +The indexes are **engine-uniform**: numpy host arrays for pandas / Polars, cupy on-device +for cuDF. They are also exposed as **Cypher DDL** (``CREATE GFQL INDEX FOR edge_out_adj``, +``DROP GFQL INDEX``, ``SHOW GFQL INDEXES`` — the mandatory ``GFQL`` token distinguishes them +from standard property ``CREATE INDEX``) and in the **JSON wire protocol** +(``{"type": "CreateIndex", ...}`` ops plus ``index_policy`` in the request envelope), so a +remote ``gfql_remote`` call can carry the same index intent. + +Performance +----------- + +**Flat in graph size.** A seeded 1-hop stays sub-millisecond as the graph grows 10×, while +the ``O(E)`` scan grows linearly. 
Synthetic power-law graphs, GFQL-pandas, warm median, +every cell guarded so the index path was taken *and* the indexed result equals the scan +result: + +.. list-table:: + :header-rows: 1 + :widths: 40 30 30 + + * - Seeded 1-hop + - 0.8M nodes / 6.4M edges + - 8M nodes / 64M edges + * - **Indexed (O(degree))** + - **0.124 ms** + - **0.122 ms** *(flat)* + * - Scan (O(E)) + - 105 ms + - 1045 ms + +The same holds on real power-law graphs: a typical-seed 1-hop is ~0.13 ms on LiveJournal +(35M edges) and ~0.14 ms on Orkut (117M edges), versus an ``O(E)`` scan of 367 ms → 1208 ms. + +**Beats embedded graph databases on selective lookups.** Same graph (0.8M nodes / 6.4M +edges), matched result counts, warm median. GFQL is CPU-pandas with the index; Kuzu and +Neo4j use their native indexes: + +.. list-table:: + :header-rows: 1 + :widths: 24 22 18 18 18 + + * - Task + - GFQL (indexed) + - Kuzu + - Neo4j + - GFQL speedup + * - 1-hop seeded + - **0.123 ms** + - 1.15 ms + - 1.45 ms + - 9.4× / 11.8× + * - 1–2-hop seeded + - **0.150 ms** + - 4.25 ms + - 2.54 ms + - 28× / 16.9× + +On a fairer, fully-prepared, in-process Kuzu re-run (LiveJournal 35M), GFQL is still +**17×** on a typical seed (0.126 ms vs 2.13 ms) and **6×** on a hub seed (3.76 ms vs +22.6 ms). *(Kuzu's worst-case-optimal joins can win on cyclic / multi-way-join patterns — +triangles, cliques — which these forward-expansion lookups do not exercise; we do not +claim those.)* + +**Selective traversal is CPU's game.** The indexed hop is tiny work, so the GPU's +kernel-launch floor (~3 ms on cuDF) loses to a ~0.13 ms pandas / ~0.16 ms Polars +``searchsorted`` — the clean inverse of *bulk* analytics, where the GPU pulls ahead +(see :doc:`engines`). Pick the index for selective traversal and a **CPU engine** to +drive it. + +Reproduce: ``benchmarks/gfql/index_takeover_bench.py``, +``benchmarks/gfql/index_vs_dbs.py``, ``benchmarks/gfql/index_vs_kuzu_prepared.py``. +Hardware: DGX ``dgx-spark``, GB10 GPU. 
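To make the ``O(degree)`` versus ``O(E)`` distinction concrete, here is a minimal pure-Python sketch of a sorted out-adjacency sidecar. It illustrates the general technique under simplifying assumptions and is not GFQL's implementation: the function names are hypothetical, and the standard-library ``bisect`` stands in for the vectorized ``searchsorted`` mentioned above.

```python
from bisect import bisect_left, bisect_right


def build_out_index(src):
    """One O(E log E) sort of edge row positions by source id.

    The edge lists themselves are never reordered; the index is a sidecar.
    """
    order = sorted(range(len(src)), key=src.__getitem__)
    sorted_src = [src[i] for i in order]
    return sorted_src, order


def indexed_hop_rows(index, seeds):
    """Edge row positions leaving the seeds: O(log E + degree) per seed."""
    sorted_src, order = index
    rows = []
    for seed in seeds:
        lo = bisect_left(sorted_src, seed)   # first edge whose source is seed
        hi = bisect_right(sorted_src, seed)  # one past the last such edge
        rows.extend(order[lo:hi])
    return sorted(rows)


def scan_hop_rows(src, seeds):
    """Baseline O(E) pass over every edge."""
    seed_set = set(seeds)
    return [i for i, s in enumerate(src) if s in seed_set]


src = ["a", "b", "a", "c", "b", "d"]
dst = ["b", "c", "c", "d", "d", "a"]

index = build_out_index(src)
indexed_rows = indexed_hop_rows(index, ["a", "b"])
scanned_rows = scan_hop_rows(src, ["a", "b"])
neighbors = {dst[i] for i in indexed_rows}
```

The indexed lookup touches only the edges that belong to the seeds, so its cost does not grow with the number of unrelated edges, while the scan always visits all of them. Both return the same edge rows, which mirrors the parity-or-fallback guarantee.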
+ +Honesty and cost +---------------- + +- **Build cost** is one ``O(E log E)`` sort, amortized over subsequent queries. + ``index_policy='auto'`` only builds when the planner predicts a selective query will + pay it back. +- **No change to default behavior.** With no index resident and ``index_policy='use'`` + (the default), queries run exactly as before. +- **Parity-or-fallback.** The index accelerates the seeded scan sites it covers (forward / + reverse hop, the Polars hop, the single-hop chain fast path). Any uncovered feature — + edge / source / destination match, ``target_wave_front``, ``min_hops>1``, labeling — + falls back to the scan/join path. The indexed subgraph is verified equal to the scan + subgraph in differential tests across pandas / cuDF / Polars / Polars-GPU. It is an + accelerator, never a source of a different answer. + +See also +-------- + +- :doc:`engines` — choosing pandas / Polars / cuDF / Polars-GPU for non-seeded work. +- :doc:`performance` — the vectorization + GPU design behind GFQL. +- :doc:`benchmark_filter_pagerank` — an end-to-end filter → PageRank → filter comparison vs Neo4j. diff --git a/docs/source/gfql/overview.rst b/docs/source/gfql/overview.rst index 1a12f475e1..c4ce1aaa89 100644 --- a/docs/source/gfql/overview.rst +++ b/docs/source/gfql/overview.rst @@ -24,7 +24,7 @@ GFQL addresses a critical gap in the data community by providing an in-process g Key Features ~~~~~~~~~~~~~ -- **Dataframe-Native Integration**: Works directly with Pandas, cuDF, and Apache Arrow dataframes. +- **Dataframe-Native Integration**: Works directly with Pandas, Polars, cuDF, and Apache Arrow dataframes. - **High Performance**: Optimized for both CPU and GPU execution, capable of processing billions of edges. - **Ease of Use**: Install via `pip` and start querying without the need for external databases. - **Seamless Visualization**: Integrated with PyGraphistry for GPU-accelerated graph visualization. 
@@ -316,9 +316,11 @@ Key advantages of GFQL Let: Leveraging GPU Acceleration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -GFQL is optimized to take advantage of GPU acceleration using `cudf` and RAPIDS. When you use GPU dataframes, GFQL automatically executes queries on the GPU for massive speedups. +GFQL runs the same query on four interchangeable engines, all returning identical results: ``pandas`` (CPU, default), ``polars`` (CPU columnar — up to ~38x over pandas, **no GPU**), ``cudf`` (NVIDIA GPU), and ``polars-gpu`` (NVIDIA GPU). ``engine='auto'`` resolves to ``cudf`` for cuDF input and ``pandas`` otherwise; ``polars`` / ``polars-gpu`` are explicit opt-in (``auto`` never selects them — **so a Polars-frame graph run with the default is coerced to pandas; pass** ``engine='polars'`` **to stay native**). Neither silently bridges: ``polars-gpu`` is GPU-or-error (it raises rather than silently running on CPU), and ``polars`` (CPU) raises ``NotImplementedError`` for the few unsupported Cypher features rather than falling back to pandas. See :doc:`Choosing an Engine ` for the decision matrix and benchmarks. -**Automatic GPU Acceleration** +When you use cuDF (GPU) dataframes with ``engine='auto'``, GFQL executes queries on the GPU for massive speedups. + +**Automatic GPU Acceleration (cuDF)** Example: Run GFQL queries with GPU dataframes. @@ -338,13 +340,15 @@ Example: Run GFQL queries with GPU dataframes. g_result = g_gpu.gfql([ ... ]) # Your GFQL query here print('Number of resulting edges:', len(g_result._edges)) -**Forcing GPU Mode** +**Selecting an Engine Explicitly** -Example: Explicitly set the engine to ensure GPU execution. +Example: set the engine for a CPU columnar speedup or to force a specific GPU engine. .. code-block:: python - g_result = g_gpu.gfql([ ... ], engine='cudf') + g_result = g.gfql([ ... ], engine='polars') # CPU columnar, no GPU + g_result = g_gpu.gfql([ ... ], engine='cudf') # NVIDIA GPU, eager + g_result = g_gpu.gfql([ ... 
], engine='polars-gpu') # NVIDIA GPU, fused plan Run Remotely ~~~~~~~~~~~~~ diff --git a/docs/source/gfql/performance.rst b/docs/source/gfql/performance.rst index 1c7334840b..0043fa5d35 100644 --- a/docs/source/gfql/performance.rst +++ b/docs/source/gfql/performance.rst @@ -1,92 +1,81 @@ .. _gfql-performance: -GFQL Performance: Unleashing Vectorization and GPU Power for Scalable Graph Analytics -====================================================================================== - -GFQL, developed by Graphistry, rethinks graph analytics by harnessing vectorization and GPU acceleration. As datasets grow from thousands to billions of rows, traditional tools struggle to keep up without significant infrastructure investment. GFQL is rewriting the story. Start small with a quick `pip install graphistry` on your CPU system, and scale more smoothly by leveraging the power of vectorization and GPUs to handle historically tricky datasets. - -Built from Real-World Necessity -------------------------------- - -GFQL was born out of the challenges our team faced across many graph customer projects over the last 10 years. Projects often start with manageable datasets, and as they scale up, require tools that can grow without imposing prohibitive costs or complexities. Likewise, traditional graph solutions often require adding additional storage tier infrastructure and systems of record that duplicate a team's existing standard databases and warehouses: Too many projects died from premature distractions and complexities here. - -We have `long recognizing the untapped potential of CPUs and GPUs in the compute tier `_ and the lack of effective libraries to leverage them for graph analytics. GFQL fills this gap. We designed GFQL to integrate seamlessly with the graph and dataframe ecosystem, providing a much easier, unified, and scalable solution while eliminating the need for hazardous storage tier detours. 
- -A New Era of Graph Analytics ----------------------------- - -Graphistry has a history of award-winning open source data visualization and GPU acceleration engines. With GFQL, we bring our lessons learned to graph querying and analysis for real-time insights on datasets both big and small. Unlike traditional graph databases that process one path at a time, GFQL traverses entire collections simultaneously. Similar to best-of-class analytical CPU databases like Clickhouse and Google BigQuery, our vectorized approach maximizes throughput to drastically reduce query time. - -When coupled with GPU acceleration, GFQL's performance reaches Graph 500 levels with even the cheapest cloud GPUs. Modern GPUs execute tens of thousands of threads in parallel, and GFQL is designed to fully saturate this capability. Whether you're traversing graphs with billions of edges or running complex algorithms, GFQL transforms previously impractical tasks into manageable ones. - - -Three Simple Ideas Behind GFQL's Performance ---------------------------------------------- - -At the core of GFQL's performance are three pioneering techniques: - -**Collection-Oriented Algorithms** - -GFQL operates on entire collections of nodes and edges simultaneously, different from older commercial Cypher and Gremlin graph query engines that process one path at a time. The collection-oriented approach, inspired by our research at UC Berkeley and our experience with GPUs, maximizes data throughput and minimizes computational overhead. Small queries stay interactive, and large-scale graph analytics is now more efficient than ever before. - -**Vectorized Columnar Processing** - -GFQL processes data in large, parallel batches using columnar data structures. This method optimizes memory usage and computational efficiency, significantly speeding up data handling compared to traditional row-based systems. 
Natively integrating with cutting-edge technologies like `Apache Arrow `_, this approach ensures high performance even on CPUs, and unusually fast speeds for moving large data across systems. - -**Massive Parallelism with GPUs** - -Designed to saturate the tens of thousands of threads in modern GPUs, GFQL enables rapid processing of complex graph queries. This massive parallelism allows GFQL to handle tasks that are impractical on typical CPU systems, such as real-time traversals that touch hundreds of millions of edges and compute on them. - - -Seamless Scalability from CPUs to GPUs --------------------------------------- - -GFQL allows you to start analyzing graphs on standard CPUs without specialized hardware. As your data grows, you can transition to GPU acceleration without changing your code. GFQL intelligently utilizes available hardware to optimize performance, ensuring efficient resource use whether you're on a single machine or across a cluster. - -By eliminating the need for additional infrastructure, GFQL reduces time and expense, allowing you to focus on extracting insights from your data. This seamless scalability ensures that as your projects evolve, GFQL adapts to meet your needs. - -Optimized for Analytical Workloads ----------------------------------- - -GFQL excels in scenarios requiring deep analytical capabilities. It is designed for: - -- **Graph ETL and Analytics**: Efficiently process and transform large volumes of graph data. -- **Machine Learning and AI**: Accelerate graph-based ML and AI tasks, leveraging GPUs for training and inference. -- **Visualization**: Power high-performance graph visualizations, enabling real-time interaction with complex datasets. - -By focusing on these areas, GFQL meets the demands of modern data projects, from initial exploration to advanced analysis, without the overhead typically associated with large-scale analytics. 
+GFQL Performance: Vectorization and GPU Acceleration +==================================================== + +Engine speedups at a glance +--------------------------- + +GFQL runs the **same query** on four interchangeable engines — ``pandas`` (default), +``polars`` (CPU, columnar), ``cudf`` (NVIDIA GPU), and ``polars-gpu`` (GPU) — and returns +**identical results** on each (differential parity is a release gate). The biggest, easiest +win is one keyword, **no GPU required**: + +.. doc-test: skip + +.. code-block:: python + + g.gfql(query) # engine='pandas' (default) + g.gfql(query, engine='polars') # up to ~38x faster on real graphs, same results + +Warm-median latency, same query, identical result rows (**Orkut**, 117M edges, SNAP): + +.. list-table:: + :header-rows: 1 + :widths: 40 15 15 15 15 + + * - Workload (117M edges) + - ``pandas`` + - ``polars`` + - ``cudf`` + - ``polars-gpu`` + * - 1-hop from 10K seeds + - 2613 ms + - **68 ms** + - 1005 ms + - 63 ms + * - Full out-degree aggregation + - 799 ms + - 205 ms + - 314 ms + - **167 ms** + +There is **no universal winner**: ``polars`` typically takes over from ~10K edges up +(``pandas`` still wins trivial sub-millisecond operations), and the right GPU +engine depends on the workload. See :doc:`engines` for the full decision matrix, the honest +"when *not* to use Polars", the cuDF-vs-Polars-GPU comparison, and the methodology + reproducer +scripts behind these numbers. The end-to-end CPU/GPU-vs-Neo4j benchmark is in +:doc:`benchmark_filter_pagerank`. + +How GFQL is fast +---------------- + +Three design choices explain the numbers above: + +**Collection-oriented execution.** GFQL evaluates whole collections of nodes and edges at +once (set-at-a-time), rather than walking one path at a time like traditional Cypher/Gremlin +engines. A traversal advances by joining edge tables, so the work vectorizes. 
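As a rough illustration of set-at-a-time execution, the sketch below advances a whole frontier in one pass by joining it against the edge table, instead of walking paths one by one. It is a plain-Python toy with hypothetical names (``hop``, ``edges``), not GFQL's code; a dataframe engine would express the same semi-join as a columnar merge.

```python
def hop(frontier, edges):
    """One set-at-a-time hop: keep edges whose source is in the frontier.

    Returns the matched edges and the next frontier (their destinations).
    """
    frontier = set(frontier)
    matched = [(s, d) for s, d in edges if s in frontier]
    next_frontier = {d for _, d in matched}
    return matched, next_frontier


# Toy edge table as (source, destination) pairs.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]

hop1_edges, frontier1 = hop({"a"}, edges)       # all 1-hop edges at once
hop2_edges, frontier2 = hop(frontier1, edges)   # all 2-hop edges at once
```

Each hop is a single join over the whole frontier, so two hops cost two joins regardless of how many individual paths pass through ``b`` and ``c``.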
+ +**Vectorized columnar processing.** Data is processed in columnar batches on top of +`Apache Arrow `_, which keeps the CPU path fast and makes moving +data between systems cheap. The ``polars`` engine additionally builds **one fused lazy plan +and collects once**, which is why it outruns both pandas and eager cuDF on bulk work. + +**Massive parallelism on GPUs.** On an NVIDIA GPU (``cudf`` / ``polars-gpu``), the same +vectorized work saturates tens of thousands of threads — paying off when there is enough +work to amortize kernel-launch cost (large frontiers, dense joins, full-graph aggregation). + +Start on CPU with no special hardware, and move to a GPU engine by changing one keyword when +your workload grows into GPU territory. See :doc:`engines` for exactly when each engine wins. .. note:: Same-path constraints (``where``) can be more expensive on dense graphs. Prefer selective per-step predicates and see :doc:`/gfql/where` for details. -Built on Graphistry's Expertise -------------------------------- - -Graphistry's reputation for leveraging GPUs and vectorization in data analytics is well-established. GFQL embodies this expertise, filling gaps in the graph and dataframe ecosystem by providing tools that maximize GPU utilization and integrate with open-source technologies like Apache Arrow. Our collaboration with `NVIDIA `_, including their investment into our team, ensures that GFQL benefits from optimized kernel methods for top-tier performance. - -Empower Your Data Journey -------------------------- - -With GFQL, you can start quickly, scale more smoothly, and leverage cutting-edge performance. 
It empowers you to: - -- Begin analyzing graphs immediately on your existing hardware -- Grow from CPU to GPU processing without code changes -- Handle datasets ranging from thousands to billions of edges efficiently - -Whether you're analyzing social networks, investigating cybersecurity threats, or exploring intricate datasets, GFQL transforms how you work with graph data, making complex analytics accessible and efficient. - -Join the Graphistry Community ------------------------------ - -We invite you to become part of our community dedicated to advancing graph analytics through innovation in vectorization and GPU computing. Let's keep pushing the boundaries of what's possible! - ---- - Next Steps ---------- -- **Explore GFQL**: Dive deeper into GFQL's capabilities in :ref:`10min-gfql`. -- **Get Started with PyGraphistry**: Follow the :ref:`10min-pygraphistry` to setup and experience the performance firsthand. -- **Learn About Vectorization and GPUs**: Understand the partner ecosystem technologies behind GFQL by exploring `Apache Arrow `_ and `NVIDIA RAPIDS `_. -- **Connect with Us**: Join our :ref:`community` to share insights and collaborate with others pushing the boundaries of graph analytics. +- **Choose an engine**: :doc:`engines` — the full decision matrix, methodology, and reproducers. +- **End-to-end benchmark**: :doc:`benchmark_filter_pagerank` — CPU/GPU vs Neo4j+GDS. +- **Explore GFQL**: :ref:`10min-gfql`. **Get started**: :ref:`10min-pygraphistry`. +- **Ecosystem**: `Apache Arrow `_ and `NVIDIA RAPIDS `_. 
diff --git a/docs/source/gfql/quick.rst b/docs/source/gfql/quick.rst index 2b2d63fd6e..ed469f27b3 100644 --- a/docs/source/gfql/quick.rst +++ b/docs/source/gfql/quick.rst @@ -27,7 +27,7 @@ Basic Usage :meth:`gfql ` sequences multiple matchers for more complex patterns of paths and subgraphs - **query**: Sequence of graph node/edge matchers and optional row-pipeline call steps (for example, `rows()`, `where_rows()`, `return_()`, `order_by()`, `limit()`), or an equivalent GFQL chain object. -- **engine**: Optional execution engine. Engine is typically not set, defaulting to `'auto'`. Use `'cudf'` for GPU acceleration and `'pandas'` for CPU. +- **engine**: Optional execution engine. Engine is typically not set, defaulting to `'auto'`. Use `'polars'` for a CPU columnar speedup (up to ~38x over pandas, no GPU), `'cudf'` or `'polars-gpu'` for NVIDIA GPU acceleration, and `'pandas'` for the default CPU path. See :doc:`Choosing an Engine `. Native GFQL chains are typed Python inputs. Pass the list, dict envelope, or ``Chain`` object itself; strings passed to ``g.gfql(...)`` are interpreted as @@ -400,14 +400,23 @@ Combined Examples n(query="status == 'active'") ]) -GPU Acceleration ----------------- +Engine Selection (CPU and GPU) +------------------------------ -- **Enable GPU mode:** +The same query runs on four interchangeable engines with identical results. Pick one +with ``engine=``. See :doc:`Choosing an Engine ` for the full decision matrix. + +- **CPU columnar speedup (no GPU):** ``'polars'`` — up to ~38x over pandas on real graphs. .. code-block:: python - g.gfql([...], engine='cudf') + g.gfql([...], engine='polars') # keep your existing pandas frames; just the keyword changes + +- **NVIDIA GPU:** ``'cudf'`` (eager) or ``'polars-gpu'`` (fused plan on GPU). + + .. 
code-block:: python + + g.gfql([...], engine='polars-gpu') - **Example with cuDF DataFrames:** diff --git a/docs/source/notebooks/gpu.rst b/docs/source/notebooks/gpu.rst index 52293d5575..eea06fcef4 100644 --- a/docs/source/notebooks/gpu.rst +++ b/docs/source/notebooks/gpu.rst @@ -1,6 +1,11 @@ GPU ========================== +GFQL has two NVIDIA GPU engines: ``engine='cudf'`` (RAPIDS, eager) and +``engine='polars-gpu'`` (the fused lazy Polars plan on GPU). See +:doc:`Choosing a GFQL Engine ` for which to use and how they compare to the +CPU ``pandas`` / ``polars`` engines. + .. toctree:: :maxdepth: 2 :caption: GPU compute with Nvidia RAPIDS