fix(planner): defer index search and provider I/O to execute() time by anoop-narang · Pull Request #2 · hotdata-dev/datafusion-vector-search-ext

anoop-narang · 2026-03-11T08:09:19Z

Summary

Root cause: plan_extension was doing all real work (HNSW index search, full provider collect(), fetch_by_keys) during physical planning. With 1M+ rows this causes an O(N) memory spike at plan time, blocks the planner thread, hides I/O from EXPLAIN ANALYZE, and prevents query cancellation.
Unfiltered path: usearch_search + fetch_by_keys now run inside USearchExec::execute() via a futures::stream::once async block bridged through RecordBatchStreamAdapter.
Filtered path: The provider scan is pre-planned in plan_extension using SessionState (cheap: it only creates an in-memory execution plan), then executed lazily at query time using scan_plan.execute(0, task_ctx). Iteration is now a streaming loop (stream.next().await) instead of collect(), so the scan phase holds one batch at a time: O(1) memory in the table size.
No behaviour change for existing tests — all 27 pass unchanged.
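
The deferral described in this summary can be illustrated without DataFusion. What follows is a minimal std-only sketch, not the code in this PR: `LazySearchExec`, `plan_search`, and the brute-force scan standing in for the HNSW search are all hypothetical, and `std::iter::once_with(...).flatten()` is used as a synchronous analogue of the `futures::stream::once` async block mentioned above.

```rust
use std::cell::Cell;
use std::rc::Rc;

// Hypothetical row type: (key, embedding). Not the crate's real schema.
type Row = (u64, Vec<f32>);

// Squared L2 distance between two vectors of equal length.
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// Hypothetical stand-in for a physical plan node. It only stores what the
// search needs; constructing it does no index work.
struct LazySearchExec {
    index: Rc<Vec<Row>>,
    query: Vec<f32>,
    k: usize,
    // Counts how many times the (expensive) search actually ran.
    searches_run: Rc<Cell<usize>>,
}

// "Plan time": purely structural, no search and no I/O.
fn plan_search(
    index: Rc<Vec<Row>>,
    query: Vec<f32>,
    k: usize,
    searches_run: Rc<Cell<usize>>,
) -> LazySearchExec {
    LazySearchExec { index, query, k, searches_run }
}

impl LazySearchExec {
    // "Execute time": returns a lazy iterator. The search runs only when the
    // consumer pulls the first item, so dropping the iterator cancels it.
    fn execute(&self) -> impl Iterator<Item = (u64, f32)> + '_ {
        std::iter::once_with(move || {
            self.searches_run.set(self.searches_run.get() + 1);
            // Brute-force scan as a placeholder for the HNSW index search.
            let mut scored: Vec<(u64, f32)> = self
                .index
                .iter()
                .map(|(key, v)| (*key, l2_sq(v, &self.query)))
                .collect();
            scored.sort_by(|a, b| a.1.total_cmp(&b.1));
            scored.truncate(self.k);
            scored
        })
        .flatten()
    }
}
```

In this sketch neither `plan_search` nor the call to `execute()` touches the index; the counter only moves when the returned iterator is first polled, which is the property that makes the work visible to execution-time metrics and cancellable.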

Test plan

cargo check — no errors
cargo clippy — no errors (two expected too_many_arguments warnings on private async fns)
cargo test — all 27 tests pass
After merge: EXPLAIN ANALYZE on a vector search query should show real execution time inside USearchExec instead of near-zero cost

Previously plan_extension performed all real work (HNSW index search, full provider scan with collect(), and fetch_by_keys) during physical planning. This caused O(N) memory spikes at plan time for large tables, hid I/O from DataFusion execution metrics, and prevented cancellation.

plan_extension is now purely structural: it validates the registry entry and compiles filter Exprs to PhysicalExprs. All I/O moves into USearchExec::execute(), which returns a SendableRecordBatchStream via RecordBatchStreamAdapter + futures::stream::once.

The filtered path streams the provider scan batch-by-batch (O(1) memory) instead of collecting the entire table into memory. The provider scan plan is pre-planned in plan_extension (cheap: creates a MemoryExec-like plan from the in-memory MemTable) and executed lazily at query time.

Also: add Debug impl for USearchRegistry; fix udtf.rs to use a local BatchExec instead of the old 3-arg USearchExec::new.
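
To make the batch-by-batch claim concrete, here is a rough std-only sketch of a filtered top-k scan that never holds more than the current batch plus k candidates. It is an illustration under assumptions, not the PR's adaptive_filtered_execute: `Candidate`, `Batch`, `streaming_top_k`, the `keep` predicate, and the use of a bounded max-heap are all hypothetical choices made for this example.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

// Hypothetical candidate row: distance to the query plus the row key.
struct Candidate {
    dist: f32,
    key: u64,
}

// Total order by distance, then key, so the max-heap root is the worst candidate.
impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering {
        self.dist
            .total_cmp(&other.dist)
            .then_with(|| self.key.cmp(&other.key))
    }
}
impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Candidate {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Candidate {}

// Hypothetical batch shape: a small vector of (key, embedding) rows.
type Batch = Vec<(u64, Vec<f32>)>;

// Consumes batches one at a time; memory is one batch plus at most k + 1 candidates.
fn streaming_top_k(
    batches: impl Iterator<Item = Batch>,
    query: &[f32],
    k: usize,
    keep: impl Fn(u64) -> bool, // stands in for the compiled filter predicate
) -> Vec<(u64, f32)> {
    let mut heap: BinaryHeap<Candidate> = BinaryHeap::with_capacity(k + 1);
    for batch in batches {
        for (key, vector) in batch {
            if !keep(key) {
                continue;
            }
            let dist: f32 = vector
                .iter()
                .zip(query)
                .map(|(a, b)| (a - b) * (a - b))
                .sum();
            heap.push(Candidate { dist, key });
            if heap.len() > k {
                heap.pop(); // evict the current worst candidate
            }
        }
        // `batch` is dropped here, before the next one is pulled.
    }
    // Ascending by distance: nearest first.
    heap.into_sorted_vec()
        .into_iter()
        .map(|c| (c.key, c.dist))
        .collect()
}
```

A collect()-based version would instead materialize every batch before filtering, which is where the O(N) spike came from.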

- cargo fmt: reformat execute() stream chain, usearch_execute signature, usearch_search call, and BatchExec::new struct literal
- clippy: extract SearchParams struct to group the 11 search-config fields, reducing usearch_execute to 2 args and adaptive_filtered_execute to 4 args (eliminates both too_many_arguments warnings without #[allow])
- private_interfaces: make USearchExec::new private (only called within planner.rs; SearchParams is module-private)
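
For readers unfamiliar with the clippy fix, the pattern is roughly the following. Clippy's too_many_arguments lint fires by default on functions with more than 7 parameters, and grouping related values into one struct is the usual remedy. The fields and the `describe_execute` helper below are invented for illustration; the real SearchParams has 11 fields that this PR description does not list.

```rust
// Hypothetical subset of search configuration. The field names are
// assumptions for this sketch, not the crate's actual definition.
#[derive(Debug, Clone)]
struct SearchParams {
    table_name: String,
    query_vec: Vec<f32>,
    k: usize,
    has_filters: bool,
}

// Two arguments instead of one per configuration value. The body just
// renders the parameters, to show they travel together through one reference.
fn describe_execute(params: &SearchParams, partition: usize) -> String {
    format!(
        "USearchExec: table={}, k={}, dims={}, filtered={}, partition={}",
        params.table_name,
        params.k,
        params.query_vec.len(),
        params.has_filters,
        partition
    )
}
```

Keeping the struct module-private, as the commit message notes, is what then requires the constructor that takes it to be private as well.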

anoop-narang added 2 commits March 11, 2026 13:38

anoop-narang merged commit 4d685a8 into main Mar 13, 2026
4 checks passed

anoop-narang deleted the fix/lazy-exec-at-execute-time branch March 13, 2026 08:53

anoop-narang merged 2 commits into main from fix/lazy-exec-at-execute-time
