feat: add lance_vector_search SQL table-valued function #450
Open
wombatu-kun wants to merge 2 commits into lance-format:main
Conversation
`SparkSession.close()` throws checked `IOException` on Spark 4.x but not on 3.x. The base test file is added to every version module's test source set, so the missing `throws` clause broke `make test` on Spark 4.0/4.1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 8e9e9d7 to c72b6df.
This is one of three PRs that supersede #436, splitting it per reviewer feedback.
Summary
Adds a `lance_vector_search` table-valued function exposing Lance ANN/kNN search to Spark SQL.
Relation to the existing KNN path
Vector search is already reachable via the DataFrame API today, by setting `LanceSparkReadOptions.CONFIG_NEAREST` (`"nearest"` option) to a JSON-serialized `org.lance.ipc.Query`:
```java
// DataFrame-API path: serialize an org.lance.ipc.Query to JSON and pass it
// through the "nearest" read option (LanceSparkReadOptions.CONFIG_NEAREST).
spark.read().format("lance")
    .option("nearest", QueryUtils.queryToString(queryBuilder.build()))
    .option("path", uri)
    .load();
```
That path requires (1) a compile-time dependency on `org.lance.ipc.Query`, (2) manual JSON serialization of the query, and (3) a JVM caller — which blocks pure-SQL workflows (Spark Thrift Server / Connect, BI tools, dbt models, notebooks driven by SQL). The TVF wraps the same `CONFIG_NEAREST` mechanism behind a SQL-native interface with named arguments (Spark 3.5+):
```sql
SELECT id, category
FROM lance_vector_search(
table => 'lance.db.items',
column => 'embedding',
query => array(0.1f, 0.2f, ...),
k => 10,
metric => 'cosine')
WHERE category = 'books';
```
Distance metrics (`l2`, `cosine`, `dot`, `hamming`), IVF/PQ/HNSW knobs (`nprobes`, `refine_factor`, `ef`), and `use_index` (brute-force escape hatch) are all parameters of the function.
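A sketch of how those knobs might be passed. The parameter names come from the list above; the table name, query vector, and values are illustrative, and the exact defaults are not specified here:

```sql
-- Brute-force (exact) search: bypass any ANN index via use_index => false.
SELECT id
FROM lance_vector_search(
  table => 'lance.db.items',
  column => 'embedding',
  query => array(0.1f, 0.2f, 0.3f),
  k => 10,
  metric => 'l2',
  use_index => false);

-- Index-assisted search with IVF tuning knobs.
SELECT id
FROM lance_vector_search(
  table => 'lance.db.items',
  column => 'embedding',
  query => array(0.1f, 0.2f, 0.3f),
  k => 10,
  nprobes => 20,
  refine_factor => 5);
```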
Independence
Tests run in brute-force mode (`use_index=false`) so this PR has no dependency on the vector-index DDL feature — the two PRs can be reviewed and merged in any order.
Known limitation in this PR
The virtual `_distance` column produced by Lance is present in the returned Arrow batches but not in the relation schema, so `SELECT _distance` / `ORDER BY _distance` will fail to resolve. This is addressed in a follow-up PR (#451 — draft, depends on this one).
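Concretely, under this limitation a query like the following (illustrative table and vector) would fail analysis rather than return distances:

```sql
-- Fails in this PR: _distance exists in the returned Arrow batches but not
-- in the relation schema, so the analyzer cannot resolve the reference.
SELECT id, _distance
FROM lance_vector_search(
  table => 'lance.db.items',
  column => 'embedding',
  query => array(0.1f, 0.2f, 0.3f),
  k => 10)
ORDER BY _distance;
```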
Test plan
🤖 Generated with Claude Code