
Commit 4397342

perf: optimize filtered vector search pre-scan and low-selectivity path (#12)
* perf(planner): narrow pre-scan projection to _key + filter columns

  The pre-scan previously projected all columns except the vector column, reading data (e.g. id, url, title, sha, raw) that it never uses. It only needs _key (to collect valid keys) and the columns referenced by the WHERE clause filters.

  Compute referenced columns from filter expressions, build a minimal projection, and compile separate physical filters against the projected schema so column indices match the narrower batch layout.

* perf(planner): use USearch index.get() for low-selectivity path

  Replace the parquet-native full scan with direct vector retrieval from the USearch index. The index stores vectors alongside the HNSW graph, so index.get(key) retrieves them in O(1) per key.

  Previously, the low-selectivity path scanned the entire Parquet file, including the vector column (e.g. 6.95GB for 1.2M rows), just to compute distances for the few rows matching the WHERE clause. Now it retrieves vectors only for the valid_keys collected during the pre-scan, computes distances, maintains a top-k heap, then fetches result rows from the lookup provider.

  This eliminates the full_scan DataSourceExec at runtime for filtered queries. The parquet-native code is retained but unused, pending removal after production validation.
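
The low-selectivity path described above (fetch vectors only for valid_keys, compute distances, keep a bounded top-k heap) can be sketched roughly as follows. This is an illustration under stated assumptions, not the planner code from this commit: a `HashMap` stands in for the USearch index and its per-key lookup, squared L2 is assumed as the distance metric, and the names `Candidate`, `squared_l2`, and `top_k_filtered` are hypothetical.

```rust
use std::cmp::Ordering;
use std::collections::{BinaryHeap, HashMap};

/// Heap entry (hypothetical type). Ordered by distance so that the root of
/// the max-heap is always the worst candidate currently kept.
#[derive(Debug)]
struct Candidate {
    distance: f32,
    key: u64,
}

impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering {
        // total_cmp gives f32 a total order; ties are broken by key.
        self.distance
            .total_cmp(&other.distance)
            .then_with(|| self.key.cmp(&other.key))
    }
}

impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for Candidate {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for Candidate {}

/// Squared Euclidean distance (assumed metric; the real index may use another).
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Returns up to `k` (key, distance) pairs, nearest first, considering only
/// the keys that passed the WHERE-clause pre-scan. `index` is a stand-in for
/// the vector index: one lookup per valid key, no scan over all rows.
fn top_k_filtered(
    index: &HashMap<u64, Vec<f32>>,
    valid_keys: &[u64],
    query: &[f32],
    k: usize,
) -> Vec<(u64, f32)> {
    if k == 0 {
        return Vec::new();
    }
    let mut heap: BinaryHeap<Candidate> = BinaryHeap::with_capacity(k + 1);
    for &key in valid_keys {
        // Keys that are missing from the index are skipped.
        let Some(vector) = index.get(&key) else {
            continue;
        };
        let distance = squared_l2(query, vector);
        if heap.len() < k {
            heap.push(Candidate { distance, key });
        } else if heap.peek().map_or(false, |worst| distance < worst.distance) {
            // The new candidate beats the current worst: replace it.
            heap.pop();
            heap.push(Candidate { distance, key });
        }
    }
    // into_sorted_vec yields ascending order, i.e. nearest first.
    heap.into_sorted_vec()
        .into_iter()
        .map(|c| (c.key, c.distance))
        .collect()
}
```

In this sketch the work is proportional to the number of valid keys rather than to the table size, and the heap never holds more than k entries. The returned keys would then be handed to a row lookup to fetch the remaining columns, which the sketch leaves out.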
1 parent f16ec0c commit 4397342

1 file changed

Lines changed: 262 additions & 313 deletions


0 commit comments
