
feat: add lance_vector_search SQL table-valued function #450

Open

wombatu-kun wants to merge 2 commits into lance-format:main from wombatu-kun:vector-search-tvf

Conversation

@wombatu-kun
Contributor

This is one of three PRs that supersede #436, splitting it per reviewer feedback.

Summary

Adds a `lance_vector_search` table-valued function exposing Lance ANN/kNN search to Spark SQL.

Relation to the existing KNN path

Vector search is already reachable via the DataFrame API today, by setting `LanceSparkReadOptions.CONFIG_NEAREST` (`"nearest"` option) to a JSON-serialized `org.lance.ipc.Query`:

```java
// Serialize an org.lance.ipc.Query to JSON and pass it via the "nearest" read option.
spark.read().format("lance")
    .option("nearest", QueryUtils.queryToString(queryBuilder.build()))
    .option("path", uri)
    .load();
```

That path requires (1) a compile-time dependency on `org.lance.ipc.Query`, (2) manual JSON serialization of the query, and (3) a JVM caller — which blocks pure-SQL workflows (Spark Thrift Server / Connect, BI tools, dbt models, notebooks driven by SQL). The TVF wraps the same `CONFIG_NEAREST` mechanism behind a SQL-native interface with named arguments (Spark 3.5+):

```sql
SELECT id, category
FROM lance_vector_search(
       table  => 'lance.db.items',
       column => 'embedding',
       query  => array(0.1f, 0.2f, ...),
       k      => 10,
       metric => 'cosine')
WHERE category = 'books';
```

Distance metrics (`l2`, `cosine`, `dot`, `hamming`), IVF/PQ/HNSW knobs (`nprobes`, `refine_factor`, `ef`), and `use_index` (brute-force escape hatch) are all parameters of the function.
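
For illustration, a tuned ANN call might look like the sketch below. The parameter names follow the list above; the table, vector values, and knob settings are hypothetical.

```sql
-- Sketch: ANN search with index tuning knobs (values illustrative).
SELECT id
FROM lance_vector_search(
       table         => 'lance.db.items',
       column        => 'embedding',
       query         => array(0.1f, 0.2f, 0.3f, 0.4f),
       k             => 20,
       metric        => 'cosine',
       nprobes       => 32,      -- IVF partitions probed per query
       refine_factor => 4,       -- re-rank extra candidates with exact distances
       use_index     => true);
```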

Independence

Tests run in brute-force mode (`use_index=false`) so this PR has no dependency on the vector-index DDL feature — the two PRs can be reviewed and merged in any order.
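
In the spirit of those tests, an exact (brute-force) call needs no index at all. A sketch, with table and vector values made up:

```sql
-- Sketch: exact kNN, bypassing any vector index.
SELECT id
FROM lance_vector_search(
       table     => 'lance.db.items',
       column    => 'embedding',
       query     => array(1.0f, 0.0f, 0.0f, 0.0f),
       k         => 5,
       use_index => false);
```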

Known limitation in this PR

The virtual `_distance` column produced by Lance is present in the returned Arrow batches but not in the relation schema, so `SELECT _distance` / `ORDER BY _distance` will fail to resolve. This is addressed in a follow-up PR (#451 — draft, depends on this one).
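
Concretely, a query of this shape fails to resolve `_distance` in this PR (illustrative):

```sql
-- Fails in this PR: _distance is absent from the relation schema.
SELECT id, _distance
FROM lance_vector_search(
       table  => 'lance.db.items',
       column => 'embedding',
       query  => array(0.1f, 0.2f, 0.3f, 0.4f),
       k      => 10)
ORDER BY _distance;
```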

Test plan

  • `make test SPARK_VERSION=3.5 -Dtest=LanceVectorSearchTest` — 5 integration tests pass
  • `make test SPARK_VERSION=3.5 -Dtest=VectorSearchArgParsingTest` — 6 unit tests pass
  • `make lint` — checkstyle + spotless clean
  • `make test-all` — please run on CI to verify all version modules

🤖 Generated with Claude Code

The second commit fixes a cross-version test compile failure: `SparkSession.close()` throws a checked `IOException` on Spark 4.x but not on 3.x. The base test file is added to every version module's test source set, so the missing `throws` clause broke `make test` on Spark 4.0/4.1.
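
A minimal sketch of the shape of that fix; the class and method names are hypothetical, and the point is the added `throws IOException`:

```java
import java.io.IOException;
import org.apache.spark.sql.SparkSession;

// Hypothetical shared test base compiled into every version module.
public abstract class LanceSparkTestBase {
  protected SparkSession spark;

  // Spark 4.x declares close() with a checked IOException, Spark 3.x
  // does not; declaring it here compiles against both.
  protected void stopSpark() throws IOException {
    if (spark != null) {
      spark.close();
    }
  }
}
```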
