diff --git a/docs/plans/773-load-expr-handler.md b/docs/plans/773-load-expr-handler.md index d5bc7c299..161ee1eb3 100644 --- a/docs/plans/773-load-expr-handler.md +++ b/docs/plans/773-load-expr-handler.md @@ -73,9 +73,10 @@ staticmethod to module level so the server can call it too). buckaroo[xorq]` hint. Other endpoints unaffected. 7. **Stats latency**: synchronous in v1, matching pandas/polars - `/load`. Async streaming stats (esp. for search) are a - follow-up — biggest payoff is search-on-remote-backend where - each keystroke re-runs the stats pipeline. + `/load`. Async streaming stats (esp. for search) tracked + separately in [`async-stats.md`](./async-stats.md) — biggest + payoff is search-on-remote-backend where each keystroke + re-runs the stats pipeline. 8. **Dispatch shape**: branch inside each existing `mode == "buckaroo"` arm at three sites: diff --git a/docs/plans/async-stats.md b/docs/plans/async-stats.md new file mode 100644 index 000000000..4a5eeb985 --- /dev/null +++ b/docs/plans/async-stats.md @@ -0,0 +1,229 @@ +# Scoped summary stats (raw / cleaned / filtered) and async stats + +## Status + +- **V1 (this doc)** — scoped sds (raw + cleaned + filtered) with declarative + `?`-optional pinned rows. Sync compute, no protocol changes. Ships in two PRs + (JS first, then backend). This is the load-bearing change. +- **V2 (future, separate issue)** — async stats over WebSocket (two-message + protocol, `asyncio.to_thread`, state-change keys, stale-response handling). + Original motivation for this file; see "V2: async stats (deferred)" below. + +## V1 — scoped summary stats + +### Problem + +Today the dataflow computes `merged_sd` against `processed_df`, which is the +df after cleaning AND search filter are applied. When the user types in the +search box, the existing stats are replaced. There is no way to compare "stats +on what I searched" with "stats on what I started with" — the unfiltered view +is gone. + +We want both visible at once, especially histograms, so a user can see how +their search restricts the distribution. + +While we're at it: when auto-cleaning is active, the cleaned view is also +distinct from the raw input — comparing those is useful too. So three scopes, +not two: + +- **raw** — stats on `sampled_df` (the input, post-sampling). +- **cleaned** — stats on the cleaned-but-unfiltered df. Only meaningful when + `cleaning_method != ""`. +- **filtered** — stats on the cleaned-and-filtered df. Only meaningful when + the search/quick_command produces non-empty quick_ops. + +### Data model + +Three sd traits on the dataflow, each with its own observer chain and cache +key, conditionally emitted: + +| Trait | Source df | Emitted when | +|-------|-----------|--------------| +| `summary_sd_raw` | `sampled_df` | always | +| `summary_sd_cleaned` | new `cleaned_df_unfiltered` | `cleaning_method != ""` | +| `summary_sd_filtered` | new `cleaned_df_filtered` | quick_ops non-empty | + +A merge step flattens the active scopes into a single `merged_sd` whose +row-key prefixes encode scope: bare stat names (`mean`, `histogram_bins`) come +from raw; `cleaned_*` and `filtered_*` prefixed keys come from the respective +scopes. Scope keys are absent when the scope isn't emitted. + +This is a **deliberate breaking change** to bare-name semantics. Today, +`merged_sd["col"]["mean"]` is the post-everything mean; after this change it +is the raw mean. Tests touching `dataflow.merged_sd` after applying cleaning +or filter need to be updated to assert on `cleaned_*` / `filtered_*` keys. +Document in CHANGELOG. + +### `handle_ops_and_clean` signature change + +Today: `[cleaned_df, cleaning_sd, generated_code, merged_operations]`. Today's +`cleaned_df` is actually post-cleaning *and* post-search — the name is +misleading. + +New: `[cleaned_df_unfiltered, cleaned_df_filtered_or_None, cleaning_sd, +generated_code, merged_operations]`. + +Implementation: run the df interpreter once on `cleaning_ops` to get +`cleaned_df_unfiltered`. If `quick_ops` is non-empty, run the interpreter +again on `cleaned_df_unfiltered` with `quick_ops` to get `cleaned_df_filtered`. +Otherwise `cleaned_df_filtered = None`. + +When `cleaning_ops` is empty, `cleaned_df_unfiltered` is `sampled_df` by +reference (preserving the existing short-circuit identity invariant; see +`feedback_short_circuit_identity` memory). + +The existing `cleaned_df` property on the dataflow returns "filtered if +present, else unfiltered" so downstream callers continue to see the visible +df they expect. + +### Wire format + +No top-level change to `df_data_dict`. The `all_stats` wide DF inside it +gains rows with prefixed indices: + +```text +mean, min, max, histogram_bins, ... # raw (always present) +cleaned_mean, cleaned_histogram_bins, ... # cleaned (when active) +filtered_mean, filtered_histogram_bins, ... # filtered (when active) +``` + +`df_data_dict` continues to re-sync as a whole trait. **No new WebSocket +message types, no deltas, no state-change keys, no stale-response handling +in V1.** The cost of re-sending unchanged scope rows on state change is +negligible relative to the simplicity of avoiding frontend/backend +divergence in a delta protocol. This is reversible if V2 needs it. + +### Pinned rows config — `?` optional marker + +The `pinned_rows` config gains a `?` prefix marker on `primary_key_val`. An +entry whose `primary_key_val` starts with `?` is **rendered iff the +unprefixed key exists in `all_stats`**: + +```jsonc +pinned_rows: [ + { "primary_key_val": "histogram_bins", ... }, // required: always rendered + { "primary_key_val": "?cleaned_histogram_bins", ... }, // optional: rendered only when key exists + { "primary_key_val": "?filtered_histogram_bins", ... } +] +``` + +The `?` lives only in the config; data keys in `all_stats` are always bare +(`cleaned_histogram_bins`, not `?cleaned_histogram_bins`). + +Default pinned_rows configs ship with `?cleaned_*` / `?filtered_*` entries +so the feature lights up automatically when scopes activate. Custom configs +opt in by adding their own `?`-prefixed entries. + +This same mechanism handles the **categorical↔numerical column-type case**: +if cleaning converts a column from categorical to numerical, the raw sd has +`value_counts` for that column but the cleaned sd has `cleaned_histogram_bins`. +A config listing all four (`?value_counts`, `?cleaned_value_counts`, +`?histogram_bins`, `?cleaned_histogram_bins`) renders the right combination +per column without any branching. + +### Quick op tagging + +Quick (search) ops are not currently tagged in `generate_quick_ops` +(`autocleaning.py:83-104`); the comment in `dataflow.py:146` that says they +are is aspirational. Either: + +- Tag them with `meta:{'quick_command': True}` for symmetry with cleaning + ops (one-line addition), or +- Skip the tagging and rely on source provenance — `produce_cleaning_ops` + and `generate_quick_ops` are the only producers, and `handle_ops_and_clean` + knows which is which at call-site. + +Either works. Pick the lighter one when implementing. + +### Cost + +Worst case is initial load with cleaning active: two stats-pipeline runs +(raw + cleaned) instead of today's one. On xorq this is one extra Snowflake +query batch. **Acceptable because**: + +- Cost is only paid when cleaning is opted into; default (`cleaning_method + = ""`) sees no change. +- xorq query caching (separate workstream) absorbs the redundancy across + calls. +- State changes after the initial load only recompute the filtered scope; + raw and cleaned are cached. + +If this turns out to bite, the future opt-out is "pinned_rows-driven +selective compute" (server scans config for `cleaned_*` / `filtered_*` +references and only computes the referenced scopes). Don't build this in +V1. + +### PR sequencing + +1. **`plans/` → `docs/plans/` move + this doc.** Mechanical, ships first. +2. **JS PR — `?` prefix support in `pinned_rows`.** Strict superset of + today's behavior since no `?` keys exist anywhere yet. Safe to merge + alone — no observable change. Bakes in CI before the backend PR. +3. **Backend PR — dual return from `handle_ops_and_clean`, three sd traits, + conditional scope emission, prefixed row keys in `all_stats`, default + pinned_rows entries with `?cleaned_*` / `?filtered_*`, test updates + for the bare-name semantic change.** + +Either of the latter two can land first; backend without JS leaves the +scoped rows present in data but unrendered (silent), and JS without backend +adds an unused mechanism (silent). Combined, the feature lights up. + +## V2 — async stats (deferred) + +The original motivation for this file. Rough shape preserved here: + +`XorqServerDataflow(expr, skip_main_serial=True)` runs the stats pipeline +synchronously inside the Tornado IO loop. The worst case is search / +state change: every keystroke into the search box re-runs the stats +pipeline against the filtered expression. On a remote backend that's +"user types `foo`" → six Snowflake aggregate queries stacked. Even +sub-second per query is sluggish. + +Once V1 ships and we want responsive xorq filtering: + +- **Two-message protocol per state change.** Message 1 (immediate): + new `df_meta`, `df_data_dict.main = []`, `all_stats` containing only + the unchanged raw + cleaned scope rows (`filtered_*` absent). Message 2 + (when filtered scope finishes computing): full `df_data_dict` + replacement with `filtered_*` rows added. State-change key + round-tripped on both messages, just like `infinite_request` round-trips + `payload_args`. +- **`asyncio.to_thread` (or `IOLoop.run_in_executor`)** for the filtered + scope's compute between the two messages. Bounded concurrency via a + global semaphore; per-session "latest only" coordinator can come later + if pre-flight stale-key checks aren't enough. +- **Cancellation = pre-flight stale check.** At the start of computing + filtered_sd, check whether the state-change key is still current. If + not, drop. No real cancellation needed; computes are short relative to + typing cadence. +- **Session-liveness check** before pushing message 2. + +V2 reuses V1's data model and wire format verbatim — the only changes are +WebSocket protocol additions. V1 ships sync; V2 is purely an optimization +on top. + +## Separate issues to file + +- **Histogram bin-edge sharing across scopes.** Filtered/cleaned + histograms should share the raw scope's bin edges so the histograms are + visually comparable across scopes (otherwise the x-axes differ). Requires + the Histogram `ColAnalysis` to accept an optional `bins_override` input + threaded from upstream. Both `DfStatsV2` and `XorqStatPipeline` need to + support "given bin edges, count rows per bin." +- **Categorical↔numerical column-type handling in histograms.** The `?` + optional mechanism already handles this declaratively (different stat + keys exist per scope per column), but the displayer config + Histogram + analysis behavior across type changes deserves explicit tests and + documentation. +- **`depends_on_order: bool` on `ColAnalysis`.** Sort doesn't need to + recompute order-independent stats; today's stats are all order-independent + (monotonicity testing isn't in the summary set). Adding the flag lets + future order-dependent stats opt in to recompute-on-sort, and lets the + sort path skip stat recompute entirely for the default case. + +## Related + +- #700 (closed) — `XorqStatPipeline` single-round-trip per phase. Sets the + floor for how fast sync stats can ever be; foundational for V2. +- #773 — the load-expr handler this file branched off of. Ships with the + v1-equivalent (sync stats, single scope) per its v1 decision.