Commit 80059cc
feat: Parquet-native adaptive filter for vector search (#7)
* feat(planner): Parquet-native adaptive filter with two-scan approach
The pre-scan projects only the scalar columns and _key (no vector column)
and uses filter pushdown to estimate selectivity. The low-selectivity path
(selectivity <= 5%) bypasses USearch and SQLite entirely: it streams a full
Parquet scan, evaluates filters inline, computes distances, and maintains a
top-k heap of ScoredRow entries to return results directly.
The high-selectivity path is unchanged: HNSW filtered_search → SQLite fetch.
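
The routing described above can be sketched roughly as follows. This is an illustration, not the planner's actual code: `ExecutionPath`, `choose_path`, and the handling of an empty table are hypothetical, and selectivity is assumed to be the fraction of pre-scanned rows that pass the filter.

```rust
/// Hypothetical names for the three execution paths; the real planner may differ.
#[derive(Debug, PartialEq, Eq)]
pub enum ExecutionPath {
    /// No WHERE clause: plain HNSW search.
    UnfilteredHnsw,
    /// Many rows match: HNSW filtered_search, then SQLite fetch.
    HnswFiltered,
    /// Few rows match: full Parquet scan with an inline top-k heap.
    ParquetNative,
}

/// Pick a path from the pre-scan counts.
/// Assumption: selectivity = matching_rows / total_rows, compared with `<=`.
pub fn choose_path(
    has_filter: bool,
    matching_rows: usize,
    total_rows: usize,
    threshold: f64,
) -> ExecutionPath {
    if !has_filter {
        return ExecutionPath::UnfilteredHnsw;
    }
    // Assumption: an empty table counts as selectivity 0.0.
    let selectivity = if total_rows == 0 {
        0.0
    } else {
        matching_rows as f64 / total_rows as f64
    };
    if selectivity <= threshold {
        ExecutionPath::ParquetNative
    } else {
        ExecutionPath::HnswFiltered
    }
}
```

With a threshold of 1.0, every filtered query takes the Parquet-native path in this sketch, which is how the tests in this commit force that path.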
- Remove old brute-force path and heap_select_top_k helper
- Add parquet_native_execute() for the low-selectivity path
- Add ScoredRow struct with Ord impl for max-heap eviction
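
A minimal sketch of the top-k heap idea, for illustration only. The fields of `ScoredRow` here (a distance and a u64 key) and the `top_k` helper are assumptions; the real struct in planner.rs presumably carries row data and may order differently. The max-heap keeps the current worst candidate at the top, so a closer row evicts it once the heap holds k entries. NaN distances are skipped, in line with the review fix later in this commit.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for ScoredRow: a distance plus a row key.
#[derive(Debug, Clone)]
pub struct ScoredRow {
    pub distance: f32,
    pub key: u64,
}

impl Ord for ScoredRow {
    fn cmp(&self, other: &Self) -> Ordering {
        // Larger distance sorts greater, so BinaryHeap::peek() is the worst row.
        self.distance
            .total_cmp(&other.distance)
            .then_with(|| self.key.cmp(&other.key))
    }
}

impl PartialOrd for ScoredRow {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for ScoredRow {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for ScoredRow {}

/// Keep the k nearest rows from a stream of (key, distance) pairs.
/// Returns rows sorted by ascending distance.
pub fn top_k(rows: impl IntoIterator<Item = (u64, f32)>, k: usize) -> Vec<ScoredRow> {
    if k == 0 {
        return Vec::new();
    }
    let mut heap: BinaryHeap<ScoredRow> = BinaryHeap::new();
    for (key, distance) in rows {
        // Skip NaN distances so they never occupy a heap slot.
        if distance.is_nan() {
            continue;
        }
        let row = ScoredRow { distance, key };
        if heap.len() < k {
            heap.push(row);
        } else if let Some(worst) = heap.peek() {
            if row < *worst {
                // Evict the current worst candidate.
                heap.pop();
                heap.push(row);
            }
        }
    }
    heap.into_sorted_vec()
}
```

The heap never holds more than k rows, so memory stays bounded regardless of how many rows the Parquet scan streams.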
* refactor(planner): expose scan plans as USearchExec children
Report provider_scan and full_scan as children so DataFusion's
physical optimizer can traverse and optimize the scan plans.
with_new_children correctly replaces whichever scans are present.
* test(execution): add parquet-native path coverage
Force the low-selectivity path via brute_force_selectivity_threshold=1.0
with a lookup provider that excludes the vector column (matching the
real Parquet+SQLite deployment). Four new tests:
- WHERE exclusion with distance ordering
- Equality filter with distance ordering
- LIMIT < matching rows (heap eviction)
- WHERE with no matches (empty result)
* docs: rewrite README for split-provider architecture
Update usage examples for the scan_provider + lookup_provider API.
Document the three-path adaptive filtering (unfiltered HNSW, high-sel
HNSW filtered, low-sel Parquet-native). Remove stale DataFusion 51 API
notes and benchmark section.
* fix(planner): address review — schema mapping and NaN handling
- Use name-based column lookup from lookup_provider schema into full
scan schema instead of positional filtering. Prevents silent column
mismatches if the two schemas have different orderings.
- Skip NaN distances before pushing to the top-k heap. Prevents NaN
rows from sinking to the bottom and never being evicted.
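
The name-based lookup can be illustrated with plain string slices. This is a sketch only: `map_columns_by_name` is a hypothetical helper, and the real code works against Arrow/DataFusion schemas rather than `&[&str]`.

```rust
/// For each column in the lookup schema, find its index in the full-scan
/// schema by name. Errors if a column is missing instead of silently
/// picking the wrong position.
pub fn map_columns_by_name(
    lookup_cols: &[&str],
    full_scan_cols: &[&str],
) -> Result<Vec<usize>, String> {
    lookup_cols
        .iter()
        .map(|name| {
            full_scan_cols
                .iter()
                .position(|col| col == name)
                .ok_or_else(|| format!("column `{name}` not found in full scan schema"))
        })
        .collect()
}
```

Because the mapping is keyed on names, it stays correct when the full scan orders its columns differently from the lookup provider, for example when the vector column sits between scalar columns.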
* docs: update stale comments for Parquet-native path
Update planner.rs header to describe the three execution paths
(unfiltered, high-sel HNSW filtered, low-sel Parquet-native).
Update registry.rs brute_force_selectivity_threshold doc to describe
the scan_provider-based path instead of the old brute-force approach.
1 parent 4605f0e commit 80059cc
4 files changed
Lines changed: 520 additions & 410 deletions