Commit 80059cc

feat: Parquet-native adaptive filter for vector search (#7)
* feat(planner): Parquet-native adaptive filter with two-scan approach

  Pre-scan uses projection (scalar + _key only, no vector column) with filter pushdown for selectivity estimation.

  Low-selectivity path (<= 5%) bypasses USearch and SQLite entirely: streams full Parquet scan, evaluates filters inline, computes distances, and maintains a top-k heap (ScoredRow) to return results directly. High-selectivity path unchanged: HNSW filtered_search → SQLite fetch.

  - Remove old brute-force path and heap_select_top_k helper
  - Add parquet_native_execute() for the low-selectivity path
  - Add ScoredRow struct with Ord impl for max-heap eviction

* refactor(planner): expose scan plans as USearchExec children

  Report provider_scan and full_scan as children so DataFusion's physical optimizer can traverse and optimize the scan plans. with_new_children correctly replaces whichever scans are present.

* test(execution): add parquet-native path coverage

  Force the low-selectivity path via brute_force_selectivity_threshold=1.0 with a lookup provider that excludes the vector column (matching the real Parquet+SQLite deployment). Four new tests:

  - WHERE exclusion with distance ordering
  - Equality filter with distance ordering
  - LIMIT < matching rows (heap eviction)
  - WHERE with no matches (empty result)

* docs: rewrite README for split-provider architecture

  Update usage examples for the scan_provider + lookup_provider API. Document the three-path adaptive filtering (unfiltered HNSW, high-sel HNSW filtered, low-sel Parquet-native). Remove stale DataFusion 51 API notes and benchmark section.

* fix(planner): address review: schema mapping and NaN handling

  - Use name-based column lookup from lookup_provider schema into full scan schema instead of positional filtering. Prevents silent column mismatches if the two schemas have different orderings.
  - Skip NaN distances before pushing to the top-k heap. Prevents NaN rows from sinking to the bottom and never being evicted.

* docs: update stale comments for Parquet-native path

  Update planner.rs header to describe the three execution paths (unfiltered, high-sel HNSW filtered, low-sel Parquet-native). Update registry.rs brute_force_selectivity_threshold doc to describe the scan_provider-based path instead of the old brute-force approach.
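
For readers without the diff open, the top-k heap with NaN skipping described in the message can be sketched roughly as follows. This is a standalone illustration, not the code from this commit: the fields of `ScoredRow` (`distance`, `key`), the `f32` distance type, and the `top_k` helper are all assumptions made for the example, and only the standard library is used.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for the commit's `ScoredRow`; the real fields are
/// not shown in the commit message.
#[derive(Debug, Clone)]
struct ScoredRow {
    distance: f32,
    key: u64,
}

impl Ord for ScoredRow {
    // Larger distance compares greater, so a max-heap keeps the worst
    // candidate at the top, ready for eviction. `total_cmp` gives f32 a
    // total order; ties are broken by key.
    fn cmp(&self, other: &Self) -> Ordering {
        self.distance
            .total_cmp(&other.distance)
            .then_with(|| self.key.cmp(&other.key))
    }
}

impl PartialOrd for ScoredRow {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for ScoredRow {
    // Defined through `cmp` so equality stays consistent with the ordering.
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for ScoredRow {}

/// Hypothetical helper: keeps the `k` nearest rows from a stream of
/// `(key, distance)` pairs and returns them nearest-first.
fn top_k(rows: impl IntoIterator<Item = (u64, f32)>, k: usize) -> Vec<ScoredRow> {
    if k == 0 {
        return Vec::new();
    }
    let mut heap: BinaryHeap<ScoredRow> = BinaryHeap::new();
    for (key, distance) in rows {
        // Skip NaN before it reaches the heap, so the ordering never has
        // to place a NaN distance.
        if distance.is_nan() {
            continue;
        }
        let row = ScoredRow { distance, key };
        if heap.len() < k {
            heap.push(row);
            continue;
        }
        // Heap is full: evict the current worst only if the new row beats it.
        let evict = heap.peek().map_or(false, |worst| row < *worst);
        if evict {
            heap.pop();
            heap.push(row);
        }
    }
    // Ascending order, i.e. smallest distance first.
    heap.into_sorted_vec()
}
```

Under these assumptions, calling `top_k` with `k = 2` on a stream that contains a NaN distance returns the two rows with the smallest non-NaN distances in ascending order, and the heap never holds more than `k` rows at a time.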
1 parent 4605f0e commit 80059cc

4 files changed

Lines changed: 520 additions & 410 deletions
