fix(rule): keep index for k-NN that returns metadata, fall back when vector is projected (#25)

A k-NN query that projects the indexed vector column (SELECT *, or
SELECT id, embedding) crashed with a post-optimizer schema-mismatch when an
index was present: the passthrough branch built its output from the index
node's columns (addressing key + non-vector columns), which can't include the
vector, so the rewritten plan's schema differed from the original and
DataFusion's invariant check aborted the query.

The fix is output-aware. The rule now also anchors on a Projection sitting
directly over a passthrough k-NN Sort and drives the rewrite from that outer
projection's columns — the query's real output:

- vector NOT in output (e.g. SELECT id ... ORDER BY l2_distance(emb, ...),
  the common "nearest ids" query) -> every output column is producible from
  the node, so the index is still used.
- vector IN output (SELECT *, SELECT id, embedding) -> the rewrite can't
  produce it, so the rule declines and the query falls back to exact
  brute-force search (correct, like the existing DESC / metric-mismatch
  fallbacks) instead of crashing.

This keeps the metadata-only k-NN path on the index (no regression) while
fixing the crash. A code comment records the rejected alternative (have
USearchExec reconstruct the vector via index.get) and why: it would make the
index a second source of returned vectors that must byte-match the source,
which breaks under F16 quantization.

Regression tests model production (lookup schema excludes the vector column,
which the existing tests' provider included, masking the bug). README documents
the fallback.

Fixes #508
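
The output-aware decision described in the commit message can be sketched in a few lines of plain Rust. This is a minimal illustration, not the code in `rule.rs`: the `Rewrite` enum, the `decide` function, and the column names are hypothetical, and the real rule works on DataFusion plan nodes rather than string slices.

```rust
/// What the rule does with a matched k-NN plan.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum Rewrite {
    /// Replace the inner nodes with the index-backed node.
    UseIndex,
    /// Leave the plan untouched so the exact brute-force sort runs.
    FallBackToExact,
}

/// Decide from the query's real output columns.
///
/// `output`       - columns of the outermost projection (the query's output)
/// `node_columns` - columns the index node can produce
///                  (addressing key, non-vector columns, distance)
/// `vector_col`   - the indexed vector column, which the node cannot produce
fn decide(output: &[&str], node_columns: &[&str], vector_col: &str) -> Rewrite {
    // The rewrite is only schema-preserving if every output column can be
    // produced by the node, which rules out the vector column itself.
    let producible = output
        .iter()
        .all(|c| *c != vector_col && node_columns.contains(c));

    if producible {
        Rewrite::UseIndex
    } else {
        Rewrite::FallBackToExact
    }
}
```

With node columns `["id", "title", "_distance"]` and vector column `"embedding"`, an output of `["id", "_distance"]` keeps the index, while `["id", "embedding"]` declines the rewrite and falls back to exact search.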
README.md: 23 additions & 9 deletions
@@ -4,7 +4,7 @@ A DataFusion extension that integrates [USearch](https://github.com/unum-cloud/u
 
 Queries matching the `ORDER BY distance_fn(col, query) LIMIT k` pattern are **transparently rewritten** by an optimizer rule into a native USearch index call — no query rewrite needed from the caller. `WHERE` clause filters are handled adaptively: high-selectivity filters use USearch's in-graph predicate API; low-selectivity filters bypass HNSW entirely and scan the data directly.
 
-**DataFusion:** 52.2 **USearch:** 2.24
+**DataFusion:** 53 **USearch:** 2.24
 
 ---
 
@@ -230,20 +230,33 @@ tests/
 
 ### Optimizer rewrite
 
-The rule (`rule.rs`) matches two logical plan shapes:
+The rule (`rule.rs`) matches the `Sort(fetch=k)` over a `TableScan`, with an
+optional `Projection` between them and an optional `Filter` directly above the
+scan:
 
 ```
 Sort(fetch=k, ORDER BY dist ASC)
-  Projection([..., distance_fn(col, lit) AS dist, ...])
-    TableScan(name)
-
-Sort(fetch=k, ORDER BY dist ASC)
-  Projection([..., distance_fn(col, lit) AS dist, ...])
-    Filter(predicate)
+  [ Projection([..., distance_fn(col, lit) AS dist, ...]) ]   ← optional
+    [ Filter(predicate) ]                                     ← optional
       TableScan(name)
 ```
 
-Preconditions: sort is `ASC`, distance UDF matches index metric, table is registered, query vector is a literal. When the rule fires, it replaces the inner nodes with a `USearchNode` leaf carrying: table name, vector column, query vector, k, distance type, and absorbed filter predicates. The `Sort` node is preserved above for final ordering.
+DataFusion omits the `Projection` for `SELECT *` (and for any SELECT whose
+columns come straight from the scan), so the `Sort` can sit directly on the
+`TableScan`.
+
+Preconditions: sort is `ASC`, distance UDF matches index metric, table is
+registered, query vector is a literal. When the rule fires, it replaces the inner
+nodes with a `USearchNode` leaf carrying: table name, vector column, query
+vector, k, distance type, and absorbed filter predicates. The `Sort` node is
+preserved above for final ordering.
+
+**Schema preservation:** an optimizer rule must not change the plan's output
+schema. The `USearchNode` produces only what the `lookup_provider` can fetch by
+key (addressing key + non-vector columns) plus `_distance` — it cannot produce
+the indexed vector column. If the matched `Sort`'s output would include the
+vector column (e.g. `SELECT *`), the rule declines and the query falls back to
+exact execution rather than emitting a schema-incompatible plan.
 
 Physical planning (`planner.rs`) translates `USearchNode` into `USearchExec`, a physical plan node that executes the actual search.
 
@@ -305,6 +318,7 @@ Tests cover optimizer rule matching/rejection, end-to-end execution through both
 
 | Limitation | Notes |
 |---|---|
+| Projecting the indexed vector column | A k-NN query whose output includes the vector column itself (e.g. `SELECT *`, or `SELECT id, vector`) falls back to exact execution. The `lookup_provider` does not store the vector column (see [registration](#register-providers-and-set-up-the-sessioncontext)), so the rewrite cannot reproduce it. Project the metadata columns and the distance instead. |
 | Stacked `Filter` nodes | Only one `Filter -> TableScan` layer is absorbed. `Filter -> Filter -> TableScan` falls back to exact execution. DataFusion typically combines multiple WHERE conditions into a single Filter, so this rarely occurs. |
 | Runtime query vectors | The query vector must be a compile-time literal (`ARRAY[0.1, ...]`). Column references or subquery results are not rewritten. Use `vector_search_vector` for explicit ANN queries. |
 | `ef_search` per-query | `expansion_search` is global to the index instance. Per-query adjustment is not supported. |
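
The plan shape that the updated README describes (a `Sort` with a fetch limit over an optional `Projection`, over at most one `Filter`, over a `TableScan`) can be modeled with a toy plan type. The sketch below is an assumption-laden illustration, not the matcher in `rule.rs`: `Plan` and `matches_knn_shape` are hypothetical names, and the other preconditions (distance UDF matching the index metric, registered table, literal query vector) are deliberately left out.

```rust
/// A toy logical plan with just enough structure to express the shape.
/// (Hypothetical type; DataFusion's `LogicalPlan` is far richer.)
enum Plan {
    Sort { fetch: Option<usize>, ascending: bool, input: Box<Plan> },
    Projection { input: Box<Plan> },
    Filter { input: Box<Plan> },
    TableScan,
}

/// Returns true when `plan` is `Sort(fetch=k, ASC)` over an optional
/// `Projection`, over at most one `Filter`, over a `TableScan`.
fn matches_knn_shape(plan: &Plan) -> bool {
    // The root must be an ascending Sort with a fetch limit.
    let Plan::Sort { fetch: Some(_), ascending: true, input } = plan else {
        return false;
    };

    // Step past the optional Projection (`&**` turns &Box<Plan> into &Plan).
    let below = match &**input {
        Plan::Projection { input } => &**input,
        other => other,
    };

    // Step past at most one Filter; a second stacked Filter is left in
    // place and therefore fails the final TableScan check.
    let below = match below {
        Plan::Filter { input } => &**input,
        other => other,
    };

    matches!(below, Plan::TableScan)
}
```

Under this model a `Sort` sitting directly on a `TableScan` matches (the `SELECT *` case, which the rule then declines on schema grounds), while a descending sort or `Filter -> Filter -> TableScan` does not match and would be left for exact execution.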