fix(planner): defer index search and provider I/O to execute() time #2

Merged: anoop-narang merged 2 commits into main from fix/lazy-exec-at-execute-time on Mar 13, 2026

@anoop-narang (Collaborator)

Summary

  • Root cause: plan_extension was doing all real work (HNSW index search, full provider collect(), fetch_by_keys) during physical planning. With 1M+ rows this causes an O(N) memory spike at plan time, blocks the planner thread, hides I/O from EXPLAIN ANALYZE, and prevents query cancellation.

  • Unfiltered path: usearch_search + fetch_by_keys now run inside USearchExec::execute() via a futures::stream::once async block bridged through RecordBatchStreamAdapter.

  • Filtered path: The provider scan is pre-planned (cheap: creates an in-memory execution plan) in plan_extension using SessionState, then executed lazily at query time using scan_plan.execute(0, task_ctx). Iteration is now a streaming loop (stream.next().await) instead of collect(), giving O(1) memory for the scan phase.

  • No behaviour change for existing tests — all 27 pass unchanged.
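
The plan-time versus execute-time split described above can be illustrated with a minimal, std-only sketch. It is not the PR's code: `LazySearchNode`, `nearest_k`, and the 1-D toy data are hypothetical stand-ins, and `std::iter::once_with` is used here as a synchronous analogue of `futures::stream::once` wrapping an async block. The point it shows is that building the node and even calling `execute()` does no search work until the consumer pulls the first item.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for a physical plan node such as USearchExec.
/// It stores search parameters only; no search result lives here.
pub struct LazySearchNode {
    query: f32,
    k: usize,
    /// Counts how many times the expensive search actually ran.
    pub searches_run: Rc<Cell<usize>>,
}

impl LazySearchNode {
    /// "Plan time": cheap and structural, performs no search and no I/O.
    pub fn plan(query: f32, k: usize) -> Self {
        LazySearchNode { query, k, searches_run: Rc::new(Cell::new(0)) }
    }

    /// "Execute time": returns an iterator whose first `next()` call runs
    /// the search. `once_with` defers the closure until it is polled.
    pub fn execute<'a>(
        &'a self,
        data: &'a [(u64, f32)],
    ) -> impl Iterator<Item = (u64, f32)> + 'a {
        let counter = Rc::clone(&self.searches_run);
        std::iter::once_with(move || {
            counter.set(counter.get() + 1);
            nearest_k(data, self.query, self.k)
        })
        .flatten()
    }
}

/// Brute-force nearest-neighbour search over 1-D points (a toy replacement
/// for an HNSW index search). Returns (key, distance), closest first.
pub fn nearest_k(data: &[(u64, f32)], query: f32, k: usize) -> Vec<(u64, f32)> {
    let mut scored: Vec<(u64, f32)> = data
        .iter()
        .map(|&(key, value)| (key, (value - query).abs()))
        .collect();
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}
```

Because the work sits behind the first pull, a consumer that drops the iterator early never pays for the search. That is the same property that lets the async version report time under the operator in EXPLAIN ANALYZE and respond to cancellation.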

Test plan

  • cargo check — no errors
  • cargo clippy — no errors (two expected too_many_arguments warnings on private async fns)
  • cargo test — all 27 tests pass
  • After merge: EXPLAIN ANALYZE on a vector search query should show real execution time inside USearchExec instead of near-zero cost

Previously plan_extension performed all real work — HNSW index search,
full provider scan with collect(), and fetch_by_keys — during physical
planning. This caused O(N) memory spikes at plan time for large tables,
hid I/O from DataFusion execution metrics, and prevented cancellation.

plan_extension is now purely structural: it validates the registry entry,
compiles filter Exprs to PhysicalExprs, and (for the filtered path)
pre-plans the provider scan. All I/O moves into USearchExec::execute(),
which returns a SendableRecordBatchStream via RecordBatchStreamAdapter +
futures::stream::once.

The filtered path streams the provider scan batch-by-batch (O(1) memory)
instead of collecting the entire table into memory. The provider scan
plan is pre-planned in plan_extension (cheap: creates a MemoryExec-like
plan from the in-memory MemTable) and executed lazily at query time.
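
A rough, std-only sketch of why batch-by-batch iteration keeps scan memory bounded follows. The names `Candidate` and `streaming_filtered_top_k` are hypothetical, the bounded max-heap is an assumed way to retain the best k rows (the commit message does not say how candidates are kept), and a plain `for` loop over an iterator of batches stands in for `while let Some(batch) = stream.next().await`.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Heap entry ordered by distance, so the heap's maximum is always the
/// worst candidate currently retained.
struct Candidate {
    dist: f32,
    key: u64,
}

impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering {
        self.dist.total_cmp(&other.dist).then(self.key.cmp(&other.key))
    }
}
impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Candidate {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Candidate {}

/// Consumes batches one at a time, applies the filter, and keeps only the
/// k closest rows. Peak memory is one batch plus at most k + 1 heap entries,
/// independent of the total row count.
pub fn streaming_filtered_top_k<I, P>(
    batches: I,
    query: f32,
    k: usize,
    keep: P,
) -> Vec<(u64, f32)>
where
    I: Iterator<Item = Vec<(u64, f32)>>,
    P: Fn(u64, f32) -> bool,
{
    let mut heap: BinaryHeap<Candidate> = BinaryHeap::with_capacity(k + 1);
    for batch in batches {
        for (key, value) in batch {
            if !keep(key, value) {
                continue; // row rejected by the filter predicate
            }
            heap.push(Candidate { dist: (value - query).abs(), key });
            if heap.len() > k {
                heap.pop(); // evict the current worst candidate
            }
        }
        // `batch` is dropped here, before the next one is pulled.
    }
    heap.into_sorted_vec()
        .into_iter()
        .map(|c| (c.key, c.dist))
        .collect()
}
```

By contrast, a `collect()`-style scan would hold every batch at once before filtering, which is the O(N) behaviour the commit removes.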

Also: add Debug impl for USearchRegistry; fix udtf.rs to use a local
BatchExec instead of the old 3-arg USearchExec::new.
- cargo fmt: reformat execute() stream chain, usearch_execute signature,
  usearch_search call, and BatchExec::new struct literal
- clippy: extract SearchParams struct to group the 11 search-config fields,
  reducing usearch_execute to 2 args and adaptive_filtered_execute to 4 args
  (eliminates both too_many_arguments warnings without #[allow])
- private_interfaces: make USearchExec::new private (only called within
  planner.rs; SearchParams is module-private)
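
The SearchParams extraction is an instance of the parameter-object refactor, sketched below under loud assumptions: the four fields and the `describe_*` functions are invented for illustration and are not the 11 fields or the functions in planner.rs. Clippy's too_many_arguments lint fires above a default threshold of 7 arguments, so grouping related configuration into one struct shortens signatures without needing `#[allow]`.

```rust
/// Hypothetical parameter object; field names are illustrative only and do
/// not mirror the PR's SearchParams.
#[derive(Debug, Clone, PartialEq)]
pub struct SearchParams {
    pub table: String,
    pub key_col: String,
    pub k: usize,
    pub query: Vec<f32>,
}

/// Before: every setting is its own argument, so the signature grows with
/// each new option.
pub fn describe_before(table: &str, key_col: &str, k: usize, query: &[f32]) -> String {
    format!("top {k} from {table} by {key_col} (dim {})", query.len())
}

/// After: one borrowed struct carries the whole configuration, and new
/// options become new fields rather than new arguments.
pub fn describe_after(params: &SearchParams) -> String {
    describe_before(&params.table, &params.key_col, params.k, &params.query)
}
```

Keeping the struct module-private, as the commit notes, is what forces any constructor taking it to be private as well.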
@anoop-narang anoop-narang merged commit 4d685a8 into main Mar 13, 2026
4 checks passed
@anoop-narang anoop-narang deleted the fix/lazy-exec-at-execute-time branch March 13, 2026 08:53