add a DuckDB/Node backend for buckaroo-js-core: stats + windowed search/sort/paging with no Python kernel #930

Description

@paddymul

Problem

buckaroo-js-core is already transport-agnostic — IModel, BuckarooView, and WebSocketModel (packages/buckaroo-js-core/src/server/) let the viewer run over a raw WebSocket instead of a Jupyter comm, and buckaroo/server/ proves it end to end. But every backend that actually answers infinite_request or computes the summary-stat SDType is Python: pandas (buckaroo_widget.py), polars (polars_buckaroo.py), lazy polars, and ibis/xorq (xorq_stat_pipeline.py). There is no backend that runs in a pure Node/Electron host, so any non-Python consumer must ship and supervise a Python process.

There's a downstream proof point. An Electron app embeds buckaroo-js-core@^0.15.0 with @duckdb/node-api and no Python. To do it, it had to work around the missing backend: (a) recompute column stats with DuckDB SUMMARIZE and flatten them to plain string cells through a private adapter, bypassing the native summary-stats / pinned_rows / histogram contract; and (b) fall back to the static DFViewer over a row-capped array (SELECT * FROM (stmt) LIMIT 101), so it gets no server-side search, filter, sort, or paging — sort is ag-grid client-side over the fetched rows. The DuckDB/Node compute exists in the wild but reimplements a thin slice of buckaroo outside buckaroo and leaves DFViewerInfinite / IDatasource / SmartRowCache unused.

Suggested fix

Add a first-class DuckDB/Node backend satisfying buckaroo's two existing contracts, so the JS-core viewer renders the same as it does behind Python. No JS-core changes required — IModel is the seam.

  1. Stats → produce SDType (Dict[col, Dict[stat, val]], col_analysis.py:7-13) from DuckDB SQL. The SQL stat set already exists in customizations/xorq_stats_v2.py (typing, null_count, min/max, distinct_count, histogram, quantiles); port those expressions to DuckDB (SUMMARIZE covers most; targeted queries for histogram bins and exact/approx distinct). Emit the transposed summary_stats_data the viewer expects — row per stat, index=stat name, histogram_bins/histogram_log_bins as number[] per column (gridUtils.ts:387 extractSDFT, resolveDFData.ts).

  2. Rows / search / sort / paging → answer infinite_request/infinite_resp (the PayloadArgs/PayloadResponse protocol, SmartRowCache.ts:17-33) by translating each [start,end) window + sort/sort_direction into ... ORDER BY <col> LIMIT <n> OFFSET <start> and the search term into a WHERE … ILIKE predicate, returning rows as Parquet. This is the Node analog of server/data_loading.py's search_df_str → sort_values → slice → to_parquet. Because DuckDB filters/sorts/pages in SQL, it sidesteps both the live path's full-df re-sort per request (what docs/smart-row-cache-redesign.md targets) and the per-window re-execution reported in #923 (row paging on an aggregate-backed xorq session re-executes the full aggregation per 300-row window).
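The window translation in step 2 could be sketched roughly as below. This is a minimal illustration, not the buckaroo implementation: `WindowRequest`, `BuiltQuery`, `quoteIdent`, and `buildWindowQuery` are hypothetical names, the request fields only mirror the names used in this issue (not a verified copy of PayloadArgs), and the `search` field and the cast-every-column-to-VARCHAR search strategy are assumptions. Parquet encoding of the result is out of scope here.

```typescript
// Hypothetical request shape; field names follow the issue text,
// not a verified copy of PayloadArgs in SmartRowCache.ts.
interface WindowRequest {
  start: number; // inclusive row offset
  end: number; // exclusive row offset
  sort?: string; // column to order by
  sort_direction?: "asc" | "desc";
  search?: string; // assumed field: free-text search term
}

interface BuiltQuery {
  sql: string;
  params: string[]; // values for the positional `?` placeholders
}

// Quote an identifier for SQL, doubling embedded double quotes.
function quoteIdent(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Translate one [start, end) window into a single SQL statement.
// `columns` is the known column list of `baseSql`; it doubles as an
// allow-list for the sort column.
function buildWindowQuery(
  baseSql: string,
  columns: string[],
  req: WindowRequest,
): BuiltQuery {
  const { start, end } = req;
  if (!Number.isInteger(start) || !Number.isInteger(end) || start < 0 || end < start) {
    throw new Error(`invalid window [${start}, ${end})`);
  }

  const params: string[] = [];
  const clauses: string[] = [`SELECT * FROM (${baseSql}) AS src`];

  if (req.search && columns.length > 0) {
    // % and _ inside the term act as wildcards; escaping them is omitted.
    const pattern = `%${req.search}%`;
    const preds = columns.map((c) => {
      params.push(pattern);
      return `CAST(${quoteIdent(c)} AS VARCHAR) ILIKE ?`;
    });
    clauses.push(`WHERE ${preds.join(" OR ")}`);
  }

  if (req.sort !== undefined) {
    if (!columns.includes(req.sort)) {
      throw new Error(`unknown sort column: ${req.sort}`);
    }
    const dir = req.sort_direction === "desc" ? "DESC" : "ASC";
    clauses.push(`ORDER BY ${quoteIdent(req.sort)} ${dir}`);
  }

  // The window is validated as non-negative integers, so inlining is safe.
  clauses.push(`LIMIT ${end - start} OFFSET ${start}`);
  return { sql: clauses.join(" "), params };
}
```

Only the search term travels as a bound parameter; identifiers are quoted and checked against the known column list because they cannot be bound. A real backend would probably also want a deterministic tie-breaker in the ORDER BY so that consecutive windows do not overlap when the sort column has duplicates.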

Two altitudes, both already supported:

  • a Node host drives WebSocketModel — a TypeScript twin of the Tornado DataStreamHandler (separate companion process); or
  • implement IModel in-process and reuse getKeySmartRowCache + getDs unchanged (embedded Electron app).

On the Python side this parallels the DFStatsClass seam (dataflow.py:308) and the XorqDfStatsV2 precedent; a DuckDbDfStatsV2 could give Python users a DuckDB stats backend too, but the headline is the Node host that needs no Python.
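For the stats contract in step 1, the transposition from a per-column SDType-like mapping to "row per stat, index=stat name" might look like the sketch below. The names `StatValue`, `SDTypeLike`, `StatRow`, and `transposeStats` are hypothetical, and the output shape is inferred from the description in this issue rather than checked against extractSDFT or resolveDFData.ts.

```typescript
// Assumed value types: scalars plus number[] for histogram_bins and
// histogram_log_bins.
type StatValue = number | string | boolean | null | number[];

// Column name -> stat name -> value, mirroring Dict[col, Dict[stat, val]].
type SDTypeLike = Record<string, Record<string, StatValue>>;

// One row per stat: `index` holds the stat name and every other key is a
// column name. A data column literally named "index" would collide and
// needs separate handling.
type StatRow = Record<string, StatValue>;

function transposeStats(sd: SDTypeLike): StatRow[] {
  // Collect stat names in first-seen order across all columns.
  const statNames: string[] = [];
  const seen = new Set<string>();
  for (const colStats of Object.values(sd)) {
    for (const stat of Object.keys(colStats)) {
      if (!seen.has(stat)) {
        seen.add(stat);
        statNames.push(stat);
      }
    }
  }

  return statNames.map((stat) => {
    const row: StatRow = { index: stat };
    for (const [col, colStats] of Object.entries(sd)) {
      // A stat that is missing for a column becomes null in that cell.
      row[col] = colStats[stat] ?? null;
    }
    return row;
  });
}
```

Filling in the per-column mapping from DuckDB (SUMMARIZE plus targeted histogram and distinct-count queries) is the part that still needs porting from xorq_stats_v2.py; this helper only covers the final reshaping.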

Open questions

Context

Identified while integrating buckaroo-js-core@^0.15.0 into a downstream Electron app (@duckdb/node-api, no Python). Read against origin/main at the 0.15.1 release (fde5213b). The live row path is SmartRowCache/infinite_request — the rowid redesign in docs/smart-row-cache-redesign.md is built but unwired. Relates to #923, #911, #918.

Metadata

Labels: JS (requires js work to fix), Lightweight-JS-Dataframe, enhancement (New feature or request), spike (Exploratory/parallel implementation; may be built differently elsewhere)
