feat: add lance_vector_search SQL table-valued function #450
Open
wombatu-kun wants to merge 2 commits into lance-format:main
Conversation
`SparkSession.close()` throws checked `IOException` on Spark 4.x but not on 3.x. The base test file is added to every version module's test source set, so the missing `throws` clause broke `make test` on Spark 4.0/4.1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 8e9e9d7 to c72b6df.
This is one of three PRs that supersede #436, splitting it per reviewer feedback.
Summary
Adds a `lance_vector_search` table-valued function exposing Lance ANN/kNN search to Spark SQL.
Relation to the existing KNN path
Vector search is already reachable via the DataFrame API today, by setting `LanceSparkReadOptions.CONFIG_NEAREST` (`"nearest"` option) to a JSON-serialized `org.lance.ipc.Query`:
```java
// DataFrame-API path: serialize an org.lance.ipc.Query to JSON and pass it
// through the "nearest" read option (LanceSparkReadOptions.CONFIG_NEAREST).
spark.read().format("lance")
    .option("nearest", QueryUtils.queryToString(queryBuilder.build()))
    .option("path", uri)
    .load();
```
That path requires (1) a compile-time dependency on `org.lance.ipc.Query`, (2) manual JSON serialization of the query, and (3) a JVM caller — which blocks pure-SQL workflows (Spark Thrift Server / Connect, BI tools, dbt models, notebooks driven by SQL). The TVF wraps the same `CONFIG_NEAREST` mechanism behind a SQL-native interface with named arguments (Spark 3.5+):
```sql
SELECT id, category
FROM lance_vector_search(
table => 'lance.db.items',
column => 'embedding',
query => array(0.1f, 0.2f, ...),
k => 10,
metric => 'cosine')
WHERE category = 'books';
```
Distance metrics (`l2`, `cosine`, `dot`, `hamming`), IVF/PQ/HNSW knobs (`nprobes`, `refine_factor`, `ef`), and `use_index` (brute-force escape hatch) are all parameters of the function.
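A sketch of how those knobs might be passed. The parameter names come from the list above; the table name, query vector, and values are illustrative, and the exact defaults are not specified here:

```sql
-- Brute-force (exact) search: bypass any ANN index via use_index => false.
SELECT id
FROM lance_vector_search(
  table => 'lance.db.items',
  column => 'embedding',
  query => array(0.1f, 0.2f, 0.3f),
  k => 10,
  metric => 'l2',
  use_index => false);

-- Index-assisted search with IVF tuning knobs.
SELECT id
FROM lance_vector_search(
  table => 'lance.db.items',
  column => 'embedding',
  query => array(0.1f, 0.2f, 0.3f),
  k => 10,
  nprobes => 20,
  refine_factor => 5);
```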
Independence
Tests run in brute-force mode (`use_index=false`) so this PR has no dependency on the vector-index DDL feature — the two PRs can be reviewed and merged in any order.
Known limitation in this PR
The virtual `_distance` column produced by Lance is present in the returned Arrow batches but not in the relation schema, so `SELECT _distance` / `ORDER BY _distance` will fail to resolve. This is addressed in a follow-up PR (#451 — draft, depends on this one).
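Concretely, under this limitation a query like the following (illustrative table and vector) would fail analysis rather than return distances:

```sql
-- Fails in this PR: _distance exists in the returned Arrow batches but not
-- in the relation schema, so the analyzer cannot resolve the reference.
SELECT id, _distance
FROM lance_vector_search(
  table => 'lance.db.items',
  column => 'embedding',
  query => array(0.1f, 0.2f, 0.3f),
  k => 10)
ORDER BY _distance;
```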
Test plan
🤖 Generated with Claude Code