Commit 4397342
perf: optimize filtered vector search pre-scan and low-selectivity path (#12)
* perf(planner): narrow pre-scan projection to _key + filter columns
The pre-scan previously projected all columns except the vector column,
reading data (e.g. id, url, title, sha, raw) that it never uses. It only
needs _key (to collect valid keys) and the columns referenced by the
WHERE clause filters.
Compute referenced columns from filter expressions, build a minimal
projection, and compile separate physical filters against the projected
schema so column indices match the narrower batch layout.
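A minimal sketch of the idea, not the planner's actual code: the Filter type, the helper names, and the row-list batch layout below are all hypothetical stand-ins. It shows why filters must be bound against the projected schema: a column's position in the narrow batch differs from its position in the full schema.

```python
from dataclasses import dataclass

# Hypothetical filter representation: column <op> literal.
@dataclass(frozen=True)
class Filter:
    column: str
    op: str
    value: object

def referenced_columns(filters):
    """Column names used by the filters, de-duplicated, in first-seen order."""
    seen = []
    for f in filters:
        if f.column not in seen:
            seen.append(f.column)
    return seen

def build_projection(schema, filters, key_column="_key"):
    """Return (indices into the full schema, names of the projected columns)."""
    wanted = [key_column] + [c for c in referenced_columns(filters) if c != key_column]
    return [schema.index(name) for name in wanted], wanted

def compile_filters(filters, projected_schema):
    """Bind each filter to its column position in the *projected* schema."""
    ops = {"==": lambda a, b: a == b, ">": lambda a, b: a > b, "<": lambda a, b: a < b}
    return [(projected_schema.index(f.column), ops[f.op], f.value) for f in filters]

def prescan(rows, schema, filters):
    """Collect the _key of every row that passes all filters."""
    indices, projected = build_projection(schema, filters)
    compiled = compile_filters(filters, projected)
    valid_keys = []
    for row in rows:
        narrow = [row[i] for i in indices]  # only _key + filter columns are read
        if all(op(narrow[pos], value) for pos, op, value in compiled):
            valid_keys.append(narrow[0])  # _key is always first in the projection
    return valid_keys
```

In this sketch, a filter on a column at index 4 of the full schema is evaluated at index 1 of the narrow batch, which is the index mismatch that compiling separate physical filters avoids.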
* perf(planner): use USearch index.get() for low-selectivity path
Replace the parquet-native full scan with direct vector retrieval from
the USearch index. The index stores vectors alongside the HNSW graph,
so index.get(key) retrieves them in O(1) per key.
Previously, the low-selectivity path scanned the entire Parquet file
including the vector column (e.g. 6.95 GB for 1.2M rows) just to
compute distances for the few rows matching the WHERE clause. Now it
retrieves vectors only for valid_keys collected during the pre-scan,
computes distances, maintains a top-k heap, then fetches result rows
from the lookup provider.
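The retrieve, score, and top-k steps above can be sketched as follows. This is an illustration under stated assumptions, not the committed code: a plain dict stands in for the vector index (anything with a get(key) lookup would do), Euclidean distance is assumed as the metric, and the function names are hypothetical.

```python
import heapq
import math

def l2_distance(a, b):
    """Euclidean distance; the metric is an assumption of this sketch."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_by_key(index, valid_keys, query, k):
    """Score only the pre-filtered keys and keep the k nearest.

    `index` is any mapping from key to vector (a stand-in for the real index).
    Returns (distance, key) pairs sorted by ascending distance.
    """
    heap = []  # max-heap via negated distance, never larger than k
    for key in valid_keys:
        vector = index.get(key)
        if vector is None:
            continue  # key absent from the index
        dist = l2_distance(query, vector)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, key))
        elif dist < -heap[0][0]:
            # closer than the current worst of the k best: replace it
            heapq.heapreplace(heap, (-dist, key))
    return sorted((-neg, key) for neg, key in heap)
```

The work here scales with the number of valid keys rather than with the table size, and the heap holds at most k entries. Fetching the full result rows for the surviving keys is left out of the sketch.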
This eliminates the full_scan DataSourceExec at runtime for filtered
queries. The parquet-native code is retained but unused, pending
removal after production validation.
1 parent f16ec0c commit 4397342
1 file changed
Lines changed: 262 additions & 313 deletions