Commit 67ddf6d

feat: rename vector_usearch to vector_search_vector, return full rows (#22)
* fix(rule): preserve hidden distance for sort

  Keep _distance in an inner projection when ORDER BY uses a vector distance expression that is not part of the final select list.

  This fixes split-provider execution for queries like SELECT id ORDER BY l2_distance(vector, ARRAY[...]) LIMIT k while preserving the final output schema. Add an execution test for the direct ORDER BY shape to cover the production case.

* style(rule): format hidden distance rewrite

* test(rule): cover computed sort projections

* refactor: make usearch_search, attach_distances, provider_key_col_idx pub(crate)

  These helpers are needed by the new vector_search_vector UDTF to reuse the same HNSW search → fetch → attach pattern as the ORDER BY path.

* feat: rename vector_usearch to vector_search_vector, return full rows

  Replace the old vector_usearch UDTF that returned only (key, _distance) with vector_search_vector that returns all table columns plus _distance.

  New signature: vector_search_vector('conn.schema.table', 'column', ARRAY[...], k)

  The UDTF reuses usearch_search, attach_distances, and provider_key_col_idx from the planner module to follow the same HNSW search → fetch_by_keys → attach_distances pattern as the ORDER BY execution path.

* docs: update README for vector_search_vector UDTF

  Update UDTF section to reflect the new vector_search_vector signature and full-row return schema. Update module structure reference.

* fix: use f64 precision for UDTF query vectors, add UDTF tests

  Parse query vectors as f64 to match the optimizer path's precision, avoiding silent accuracy loss for F64-quantized indexes.

  Add 5 tests for vector_search_vector: basic happy path, projection pushdown, bad table ref error, registry miss error, k > dataset size.
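
The precision point in the last commit can be illustrated with a small std-only sketch. It is not the crate's code: `l2sq` and `narrow_to_f32` are hypothetical helpers, and the sketch only assumes that an f64 query vector which passes through f32 on the way to the index no longer matches the stored f64 values exactly.

```rust
/// Squared Euclidean distance over f64 slices (hypothetical helper,
/// not the crate's `l2_distance` UDF).
pub fn l2sq(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Simulate a query vector that was parsed as f32 and widened back to f64.
/// The round trip drops the low mantissa bits of each component.
pub fn narrow_to_f32(v: &[f64]) -> Vec<f64> {
    v.iter().map(|&x| x as f32 as f64).collect()
}
```

With a stored vector equal to the query, the f64 path reports a distance of exactly zero, while the narrowed query reports a small positive distance. The error is tiny, but it is silent, which is what the commit message describes for F64-quantized indexes.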
1 parent bfc0d10 commit 67ddf6d

6 files changed

Lines changed: 260 additions & 123 deletions

README.md

Lines changed: 23 additions & 9 deletions
@@ -144,17 +144,31 @@ See [Adaptive filtering](#adaptive-filtering) for details on how the execution p
 
 ### UDTF path
 
-For runtime query vectors, complex joins, or explicit over-fetch control:
+`vector_search_vector` provides an explicit table function for ANN search, returning all table columns plus `_distance`:
 
 ```sql
-SELECT vs.key, vs._distance, d.title
-FROM vector_usearch('my_table', ARRAY[0.1, 0.2, ...], 20) vs
-JOIN my_table d ON d.id = vs.key
-ORDER BY vs._distance ASC
-LIMIT 10
+SELECT id, title, _distance
+FROM vector_search_vector('conn.schema.table', 'column', ARRAY[0.1, 0.2, ...], 10)
+ORDER BY _distance ASC
 ```
 
-The UDTF always calls `index.search()` directly — no filter absorption. Apply `WHERE` on the outer query to post-filter.
+| Argument | Type | Description |
+|---|---|---|
+| table | string literal | Dot-separated table reference: `'conn.schema.table'` |
+| column | string literal | Vector column with a registered index |
+| query | `ARRAY[...]` literal | Query vector |
+| k | integer | Number of nearest neighbors to return |
+
+The UDTF calls `resolve()` (sync, cache-only) on the registry — the index must already be loaded before the query is planned. It always calls `index.search()` directly — no filter absorption. Apply `WHERE` on the outer query to post-filter.
+
+```sql
+-- With filtering, aggregation, etc.
+SELECT category, COUNT(*) AS cnt, AVG(_distance) AS avg_dist
+FROM vector_search_vector('conn.schema.table', 'embedding', ARRAY[...], 50)
+WHERE category = 'nlp'
+GROUP BY category
+ORDER BY avg_dist
+```
 
 ### Tuning
 
@@ -205,7 +219,7 @@ src/
 rule.rs — USearchRule: optimizer rewrite rule
 planner.rs — USearchExecPlanner, USearchExec: physical execution
 udf.rs — l2_distance, cosine_distance, negative_dot_product scalar UDFs
-udtf.rs — vector_usearch table function
+udtf.rs — vector_search_vector table function
 lookup.rs — PointLookupProvider trait + HashKeyProvider
 keys.rs — DatasetLayout, pack_key/unpack_key key encoding
 
@@ -292,6 +306,6 @@ Tests cover optimizer rule matching/rejection, end-to-end execution through both
 | Limitation | Notes |
 |---|---|
 | Stacked `Filter` nodes | Only one `Filter -> TableScan` layer is absorbed. `Filter -> Filter -> TableScan` falls back to exact execution. DataFusion typically combines multiple WHERE conditions into a single Filter, so this rarely occurs. |
-| Runtime query vectors | The query vector must be a compile-time literal (`ARRAY[0.1, ...]`). Column references or subquery results are not rewritten. Use the UDTF path for runtime vectors. |
+| Runtime query vectors | The query vector must be a compile-time literal (`ARRAY[0.1, ...]`). Column references or subquery results are not rewritten. Use `vector_search_vector` for explicit ANN queries. |
 | `ef_search` per-query | `expansion_search` is global to the index instance. Per-query adjustment is not supported. |
 | No DELETE / compaction | USearch soft-deletes entries but requires a full rebuild to reclaim space. |
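
The README change above documents the first UDTF argument as a dot-separated `'conn.schema.table'` string, and the commit adds a "bad table ref" error test. A minimal sketch of that kind of validation might look like the following. `parse_table_ref` is a hypothetical name, and the rule that exactly three non-empty parts are required is an assumption for illustration; the actual parsing in `udtf.rs` is not shown in this diff and may differ.

```rust
/// Split a `'conn.schema.table'` reference into its three parts.
/// Hypothetical helper: assumes exactly three non-empty, dot-separated
/// components; anything else is reported as an error string.
pub fn parse_table_ref(s: &str) -> Result<(&str, &str, &str), String> {
    let parts: Vec<&str> = s.split('.').collect();
    match parts.as_slice() {
        [conn, schema, table]
            if !conn.is_empty() && !schema.is_empty() && !table.is_empty() =>
        {
            Ok((*conn, *schema, *table))
        }
        _ => Err(format!("expected 'conn.schema.table', got '{}'", s)),
    }
}
```

Under these assumptions, `parse_table_ref("conn.schema.table")` yields the three components, while a bare name such as `"my_table"` (the shape the old `vector_usearch` example used) is rejected.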

src/lib.rs

Lines changed: 5 additions & 6 deletions
@@ -80,7 +80,7 @@ pub use registry::{
 };
 pub use rule::USearchRule;
 pub use udf::{cosine_distance_udf, l2_distance_udf, negative_dot_product_udf};
-pub use udtf::USearchUDTF;
+pub use udtf::VectorSearchVectorUDTF;
 
 #[cfg(feature = "parquet-provider")]
 pub use parquet_provider::ParquetLookupProvider;
@@ -99,7 +99,8 @@ use datafusion::prelude::SessionContext;
 /// - `l2_distance(col, query)` — squared Euclidean distance (L2sq)
 /// - `cosine_distance(col, query)` — cosine distance
 /// - `negative_dot_product(col, query)` — negated inner product
-/// - `vector_usearch(table, query, k)` — explicit ANN table function
+/// - `vector_search_vector('conn.schema.table', 'column', ARRAY[...], k)`
+/// — explicit ANN table function returning full rows + `_distance`
 /// (cache-only for async-backed resolvers; does not trigger async loads)
 /// - [`USearchRule`] — optimizer rewrite rule
 ///
@@ -110,10 +111,8 @@ pub fn register_all(ctx: &SessionContext, registry: Arc<dyn VectorIndexResolver>
     ctx.register_udf(ScalarUDF::new_from_impl(cosine_distance_udf()));
     ctx.register_udf(ScalarUDF::new_from_impl(negative_dot_product_udf()));
     ctx.register_udtf(
-        "vector_usearch",
-        // `vector_usearch()` is synchronous and therefore cache-only for
-        // async-backed resolvers.
-        Arc::new(USearchUDTF::new(registry.clone())),
+        "vector_search_vector",
+        Arc::new(VectorSearchVectorUDTF::new(registry.clone())),
     );
     ctx.add_optimizer_rule(Arc::new(USearchRule::new(registry)));
     Ok(())

src/planner.rs

Lines changed: 3 additions & 3 deletions
@@ -616,7 +616,7 @@ async fn adaptive_filtered_execute(
 
 /// Call `index.search` with the native scalar type appropriate for the column.
 /// Converts the usearch error into a `DataFusionError::Execution`.
-fn usearch_search(
+pub(crate) fn usearch_search(
     index: &usearch::Index,
     query_f64: &[f64],
     k: usize,
@@ -797,7 +797,7 @@ fn compute_raw_distance_f64(v: &[f64], q: &[f64], dist_type: &DistanceType) -> f
 /// Extract the distance from a single row of a vector column.
 ///
 /// Index of the key column in the lookup provider schema.
-fn provider_key_col_idx(registered: &crate::registry::RegisteredTable) -> Result<usize> {
+pub(crate) fn provider_key_col_idx(registered: &crate::registry::RegisteredTable) -> Result<usize> {
     registered
         .lookup_provider
         .schema()
@@ -813,7 +813,7 @@ fn provider_key_col_idx(registered: &crate::registry::RegisteredTable) -> Result
 // ── Distance attachment ───────────────────────────────────────────────────────
 
 /// Append a `_distance: Float32` column to each batch.
-fn attach_distances(
+pub(crate) fn attach_distances(
     batches: Vec<RecordBatch>,
     key_col_idx: usize,
     key_to_dist: &HashMap<u64, f32>,
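
The search → fetch → attach pattern that these three helpers share can be sketched without Arrow or USearch. The version below replaces `RecordBatch` with plain structs; `Row`, `RowWithDistance`, and `attach_distances_sketch` are hypothetical names, and dropping rows whose key has no entry in the map is an assumption, since the body of the real `attach_distances` is not part of this diff.

```rust
use std::collections::HashMap;

/// A fetched table row, standing in for one row of a RecordBatch.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub key: u64,
    pub title: String,
}

/// The same row with the ANN distance appended as an extra column.
#[derive(Debug, Clone, PartialEq)]
pub struct RowWithDistance {
    pub key: u64,
    pub title: String,
    pub distance: f32,
}

/// Join fetched rows with the key -> distance map produced by the index
/// search. Input order is preserved; rows without a distance are skipped
/// (an assumption made for this sketch).
pub fn attach_distances_sketch(
    rows: Vec<Row>,
    key_to_dist: &HashMap<u64, f32>,
) -> Vec<RowWithDistance> {
    rows.into_iter()
        .filter_map(|r| {
            key_to_dist.get(&r.key).map(|&d| RowWithDistance {
                key: r.key,
                title: r.title,
                distance: d,
            })
        })
        .collect()
}
```

Because the sketch keeps the fetch order rather than sorting, a caller that wants nearest-first results still needs an explicit sort, which matches the `ORDER BY _distance ASC` in the README examples.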

src/udf.rs

Lines changed: 1 addition & 2 deletions
@@ -3,8 +3,7 @@
 // Each takes (vector_col: FixedSizeList<Float32>, query: Array/Scalar) and
 // returns a Float32 distance per row.
 //
-// These are identical to the vector_search UDFs but kept in this module so
-// vector_usearch is fully self-contained (no dependency on vector_search).
+// These are kept in this module alongside the UDTF and optimizer rule.
 
 use std::any::Any;
 use std::hash::{Hash, Hasher};
