Commit 3dea26a

feat: SEARCH, VECTOR_SEARCH and HYBRID_SEARCH table functions (#582)
Adds Spark SQL table functions for Lance search:

- Registers `VECTOR_SEARCH`, `SEARCH`, and `HYBRID_SEARCH` through the Spark SQL extension.
- Executes vector and full-text search through Lance namespace `queryTable`; hybrid search fuses namespace query results in Spark.
- Removes the legacy `nearest` read-option scan path and fails fast with a `VECTOR_SEARCH` migration message.
- Adds search docs, focused Spark tests, Docker integration coverage, and a targeted GitHub Actions workflow for directory namespace search tests.

Validated with focused Spark tests, Spark 4 compile, Spotless, Docker search CI, full Spark CI, and `git diff --check`.
1 parent bd9a009 commit 3dea26a

49 files changed

Lines changed: 3477 additions & 511 deletions

Some content is hidden

.github/workflows/spark-search.yml

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name: Spark Search Docker
+
+on:
+  pull_request:
+    types:
+      - opened
+      - synchronize
+      - ready_for_review
+      - reopened
+    paths:
+      - ".github/workflows/spark-search.yml"
+      - "Makefile"
+      - "docker/**"
+      - "integration-tests/**"
+      - "lance-spark-base_2.12/src/main/java/org/lance/spark/search/**"
+      - "lance-spark-base_2.12/src/main/scala/org/lance/spark/search/**"
+      - "lance-spark-base_2.12/src/test/java/org/lance/spark/search/**"
+      - "lance-spark-*/src/main/scala/org/lance/spark/extensions/**"
+      - "pom.xml"
+      - "*/pom.xml"
+  workflow_dispatch:
+    inputs:
+      spark-version:
+        description: "Spark version to test"
+        required: true
+        default: "3.5"
+      scala-version:
+        description: "Scala version to test"
+        required: true
+        default: "2.13"
+      backends:
+        description: "Comma-separated test backends: local or local,rest-dir"
+        required: true
+        default: "local,rest-dir"
+      rest-uri:
+        description: "Optional REST namespace URI. If omitted, tests start a local REST directory namespace."
+        required: false
+        default: ""
+      rest-database:
+        description: "Optional database header value for an external REST namespace"
+        required: false
+        default: ""
+      docker-run-args:
+        description: "Extra docker run args for docker-test"
+        required: false
+        default: ""
+
+permissions:
+  contents: read
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
+  cancel-in-progress: true
+
+env:
+  SPARK_VERSION: ${{ github.event.inputs['spark-version'] || '3.5' }}
+  SCALA_VERSION: ${{ github.event.inputs['scala-version'] || '2.13' }}
+  SEARCH_TEST_BACKENDS: ${{ github.event_name == 'workflow_dispatch' && github.event.inputs.backends || 'local,rest-dir' }}
+  SEARCH_PYTEST_CMD: >-
+    pytest /home/lance/tests/test_lance_spark.py::TestDQLSearchTableFunctions
+    -v --timeout=180
+
+jobs:
+  search-docker-test:
+    name: Search Docker Test
+    runs-on: ubuntu-24.04
+    timeout-minutes: 90
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ github.event.pull_request.head.sha || github.sha }}
+      - name: Set up Java
+        uses: actions/setup-java@v4
+        with:
+          distribution: temurin
+          java-version: 17
+          cache: "maven"
+      - name: Resolve Docker build args
+        id: docker-args
+        run: |
+          make print-docker-build-args SPARK_VERSION=${SPARK_VERSION} SCALA_VERSION=${SCALA_VERSION} >> $GITHUB_OUTPUT
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+      - name: Build test-base image (cached)
+        uses: docker/build-push-action@v6
+        with:
+          context: docker
+          file: docker/Dockerfile.test-base
+          load: true
+          tags: lance-spark-test-base:${{ env.SPARK_VERSION }}_${{ env.SCALA_VERSION }}
+          build-args: |
+            SPARK_DOWNLOAD_VERSION=${{ steps.docker-args.outputs.spark-download-version }}
+            SPARK_MAJOR_VERSION=${{ env.SPARK_VERSION }}
+            SCALA_VERSION=${{ env.SCALA_VERSION }}
+            PY4J_VERSION=${{ steps.docker-args.outputs.py4j-version }}
+            SPARK_SCALA_SUFFIX=${{ steps.docker-args.outputs.spark-scala-suffix }}
+          cache-from: type=gha,scope=search-test-base-${{ env.SPARK_VERSION }}_${{ env.SCALA_VERSION }}
+          cache-to: type=gha,mode=max,scope=search-test-base-${{ env.SPARK_VERSION }}_${{ env.SCALA_VERSION }}
+      - name: Build bundle
+        run: make bundle SPARK_VERSION=${SPARK_VERSION} SCALA_VERSION=${SCALA_VERSION}
+      - name: Build test image
+        run: |
+          make docker-build-test \
+            SPARK_VERSION=${SPARK_VERSION} \
+            SCALA_VERSION=${SCALA_VERSION} \
+            LANCE_NAMESPACE_IMPL_VERSION=${{ steps.docker-args.outputs.lance-namespace-impl-version }}
+      - name: Run directory namespace search tests
+        if: ${{ contains(env.SEARCH_TEST_BACKENDS, 'local') }}
+        run: |
+          make docker-test \
+            SPARK_VERSION=${SPARK_VERSION} \
+            SCALA_VERSION=${SCALA_VERSION} \
+            TEST_BACKENDS=local \
+            PYTEST_CMD="${SEARCH_PYTEST_CMD}"
+      - name: Resolve REST namespace URI
+        id: rest
+        if: ${{ contains(env.SEARCH_TEST_BACKENDS, 'rest-dir') }}
+        env:
+          INPUT_REST_URI: ${{ github.event.inputs['rest-uri'] }}
+          INPUT_DOCKER_RUN_ARGS: ${{ github.event.inputs['docker-run-args'] }}
+        run: |
+          rest_uri="${INPUT_REST_URI}"
+          docker_run_args="${INPUT_DOCKER_RUN_ARGS}"
+          start_rest_dir="false"
+          rest_dir_root=""
+          rest_dir_port=""
+
+          if [ -z "${rest_uri}" ]; then
+            rest_dir_port="10024"
+            rest_dir_root="/home/lance/rest-data"
+            rest_uri="http://127.0.0.1:${rest_dir_port}"
+            start_rest_dir="true"
+          fi
+
+          echo "uri=${rest_uri}" >> "$GITHUB_OUTPUT"
+          echo "start_rest_dir=${start_rest_dir}" >> "$GITHUB_OUTPUT"
+          echo "rest_dir_root=${rest_dir_root}" >> "$GITHUB_OUTPUT"
+          echo "rest_dir_port=${rest_dir_port}" >> "$GITHUB_OUTPUT"
+          {
+            echo "docker_run_args<<EOF"
+            echo "${docker_run_args}"
+            echo "EOF"
+          } >> "$GITHUB_OUTPUT"
+      - name: Run REST directory namespace search tests
+        if: ${{ contains(env.SEARCH_TEST_BACKENDS, 'rest-dir') }}
+        env:
+          LANCE_SPARK_REST_URI: ${{ steps.rest.outputs.uri }}
+          LANCE_SPARK_REST_API_KEY: ${{ secrets.LANCE_SPARK_REST_API_KEY }}
+          LANCE_SPARK_REST_DATABASE: ${{ github.event.inputs['rest-database'] }}
+          LANCE_SPARK_START_REST_DIR: ${{ steps.rest.outputs.start_rest_dir }}
+          LANCE_SPARK_REST_DIR_ROOT: ${{ steps.rest.outputs.rest_dir_root }}
+          LANCE_SPARK_REST_DIR_PORT: ${{ steps.rest.outputs.rest_dir_port }}
+          DOCKER_RUN_ARGS: ${{ steps.rest.outputs.docker_run_args }}
+        run: |
+          make docker-test \
+            SPARK_VERSION=${SPARK_VERSION} \
+            SCALA_VERSION=${SCALA_VERSION} \
+            TEST_BACKENDS=rest-dir \
+            LANCE_SPARK_REST_URI="${LANCE_SPARK_REST_URI}" \
+            LANCE_SPARK_REST_API_KEY="${LANCE_SPARK_REST_API_KEY}" \
+            LANCE_SPARK_REST_DATABASE="${LANCE_SPARK_REST_DATABASE}" \
+            LANCE_SPARK_START_REST_DIR="${LANCE_SPARK_START_REST_DIR}" \
+            LANCE_SPARK_REST_DIR_ROOT="${LANCE_SPARK_REST_DIR_ROOT}" \
+            LANCE_SPARK_REST_DIR_PORT="${LANCE_SPARK_REST_DIR_PORT}" \
+            DOCKER_RUN_ARGS="${DOCKER_RUN_ARGS}" \
+            PYTEST_CMD="${SEARCH_PYTEST_CMD}"
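
Reviewer note (not part of the commit): the backend gating and the "Resolve REST namespace URI" step in this workflow can be read as the Python sketch below. The names `backend_enabled`, `resolve_rest`, and `RestSettings` are hypothetical and exist only for illustration. The sketch assumes that the GitHub Actions `contains()` function, when given a string, performs a case-insensitive substring match, and it copies the default port, root, and URI from the shell step above.

```python
from dataclasses import dataclass


@dataclass
class RestSettings:
    """Values the workflow step writes to GITHUB_OUTPUT (illustrative container)."""
    uri: str
    start_rest_dir: bool
    rest_dir_root: str
    rest_dir_port: str


def backend_enabled(backends: str, name: str) -> bool:
    # Assumption: contains(string, item) is a case-insensitive substring test,
    # so the comma-separated list is not split into exact tokens.
    return name.lower() in backends.lower()


def resolve_rest(input_rest_uri: str = "") -> RestSettings:
    # An empty rest-uri input means "start a local REST directory namespace".
    if not input_rest_uri:
        port = "10024"
        return RestSettings(
            uri=f"http://127.0.0.1:{port}",
            start_rest_dir=True,
            rest_dir_root="/home/lance/rest-data",
            rest_dir_port=port,
        )
    # An explicit URI is passed through and nothing is started locally.
    return RestSettings(input_rest_uri, False, "", "")
```

Because the match is substring-based, a backend list of `rest-dir` alone skips the `local` step, while any value that merely contains the text `local` would enable it.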

CONTRIBUTING.md

Lines changed: 32 additions & 0 deletions
@@ -58,6 +58,38 @@ To auto-format the code, run:
 make format
 ```
 
+## Docker Integration Tests
+
+Build the Spark bundle and Docker integration-test image before running Docker tests:
+
+```shell
+make bundle SPARK_VERSION=3.5 SCALA_VERSION=2.13
+make docker-build-test SPARK_VERSION=3.5 SCALA_VERSION=2.13
+make docker-test SPARK_VERSION=3.5 SCALA_VERSION=2.13
+```
+
+Use `PYTEST_CMD` to run a targeted pytest path in the Docker image. For example, run only the SQL search table-function tests against the directory namespace:
+
+```shell
+make docker-test SPARK_VERSION=3.5 SCALA_VERSION=2.13 \
+  TEST_BACKENDS=local \
+  PYTEST_CMD="pytest /home/lance/tests/test_lance_spark.py::TestDQLSearchTableFunctions -v --timeout=180"
+```
+
+To also validate a REST namespace backed by a directory namespace, let the Docker test container start the OSS Lance REST adapter:
+
+```shell
+make docker-test SPARK_VERSION=3.5 SCALA_VERSION=2.13 \
+  TEST_BACKENDS=local,rest-dir \
+  LANCE_SPARK_START_REST_DIR=true \
+  LANCE_SPARK_REST_URI=http://127.0.0.1:10024 \
+  PYTEST_CMD="pytest /home/lance/tests/test_lance_spark.py::TestDQLSearchTableFunctions -v --timeout=180"
+```
+
+To run against an already-running compatible REST namespace server instead, omit `LANCE_SPARK_START_REST_DIR` and pass that server's URI with `LANCE_SPARK_REST_URI`.
+
+The `Spark Search Docker` GitHub Actions workflow runs the same targeted Docker tests. Pull requests run directory namespace and REST-directory namespace coverage automatically. Use workflow dispatch with `rest-uri` only when validating against an external REST namespace server.
+
 ## Documentation
 
 ### Setup

Makefile

Lines changed: 10 additions & 1 deletion
@@ -42,6 +42,7 @@ endif
 DOCKER_CACHE_FROM ?=
 DOCKER_CACHE_TO ?=
 LANCE_NAMESPACE_IMPL_VERSION ?= $(shell sed -n 's:.*<lance-namespace-impl.version>\(.*\)</lance-namespace-impl.version>.*:\1:p' pom.xml | head -n 1)
+PYTEST_CMD ?= pytest /home/lance/tests/ -v --timeout=180
 
 DOCKER_COMPOSE := $(shell \
 	if docker compose version >/dev/null 2>&1; then \
@@ -190,6 +191,12 @@ docker-test:
 		$(if $(LANCEDB_API_KEY),-e LANCEDB_API_KEY=$(LANCEDB_API_KEY)) \
 		$(if $(LANCEDB_HOST_OVERRIDE),-e LANCEDB_HOST_OVERRIDE=$(LANCEDB_HOST_OVERRIDE)) \
 		$(if $(LANCEDB_REGION),-e LANCEDB_REGION=$(LANCEDB_REGION)) \
+		$(if $(LANCE_SPARK_REST_URI),-e LANCE_SPARK_REST_URI=$(LANCE_SPARK_REST_URI)) \
+		$(if $(LANCE_SPARK_REST_API_KEY),-e LANCE_SPARK_REST_API_KEY=$(LANCE_SPARK_REST_API_KEY)) \
+		$(if $(LANCE_SPARK_REST_DATABASE),-e LANCE_SPARK_REST_DATABASE=$(LANCE_SPARK_REST_DATABASE)) \
+		$(if $(LANCE_SPARK_START_REST_DIR),-e LANCE_SPARK_START_REST_DIR=$(LANCE_SPARK_START_REST_DIR)) \
+		$(if $(LANCE_SPARK_REST_DIR_ROOT),-e LANCE_SPARK_REST_DIR_ROOT=$(LANCE_SPARK_REST_DIR_ROOT)) \
+		$(if $(LANCE_SPARK_REST_DIR_PORT),-e LANCE_SPARK_REST_DIR_PORT=$(LANCE_SPARK_REST_DIR_PORT)) \
 		$(if $(TEST_BACKENDS),-e TEST_BACKENDS=$(TEST_BACKENDS)) \
 		$(if $(LANCE_FTS_FORMAT_VERSION),-e LANCE_FTS_FORMAT_VERSION=$(LANCE_FTS_FORMAT_VERSION)) \
 		$(if $(AWS_REGION),-e AWS_REGION=$(AWS_REGION)) \
@@ -203,8 +210,9 @@ docker-test:
 		$(if $(AWS_SESSION_TOKEN),-e AWS_SESSION_TOKEN=$(AWS_SESSION_TOKEN)) \
 		$(if $(AWS_PROFILE),-e AWS_PROFILE=$(AWS_PROFILE)) \
 		$(if $(AWS_PROFILE),-v $(HOME)/.aws:/root/.aws:ro) \
+		$(DOCKER_RUN_ARGS) \
 		lance-spark-test:$(SPARK_VERSION)_$(SCALA_VERSION) \
-		"pytest /home/lance/tests/ -v --timeout=180"
+		"$(PYTEST_CMD)"
 
 # =============================================================================
 # Benchmark
@@ -295,6 +303,7 @@ help:
 	@echo " docker-build-test-base - Build test base image (system deps + Spark)"
 	@echo " docker-build-test - Build test image (base + bundle JAR)"
 	@echo " docker-test - Run integration tests in lance-spark-test container"
+	@echo " Override PYTEST_CMD to run a targeted pytest command"
 	@echo ""
 	@echo "Benchmark:"
 	@echo " benchmark-build - Build benchmark jar (shared by TPC-DS and TPC-H)"

docker/Dockerfile.test

Lines changed: 1 addition & 0 deletions
@@ -35,6 +35,7 @@ RUN mkdir -p /home/lance/warehouse /home/lance/spark-events /home/lance/data
 # Copy tests
 RUN mkdir -p /home/lance/tests
 COPY integration-tests/ /home/lance/tests/
+RUN javac -cp "${SPARK_HOME}/jars/*" /home/lance/tests/LanceRestDirNamespaceServer.java
 
 WORKDIR ${SPARK_HOME}
 COPY docker/entrypoint.sh .

docs/src/config.md

Lines changed: 3 additions & 0 deletions
@@ -49,6 +49,9 @@ Lance provides SQL extensions that add additional functionality beyond standard
 
 The following features require the Lance Spark SQL extension to be enabled:
 
+- [VECTOR_SEARCH](operations/dql/vector-search.md) - Run vector similarity search through Lance namespace execution
+- [SEARCH](operations/dql/search.md) - Run full-text search through Lance namespace execution
+- [HYBRID_SEARCH](operations/dql/hybrid-search.md) - Combine vector and full-text search with reciprocal rank fusion
 - [ADD COLUMNS with backfill](operations/dml/add-columns.md) - Add new columns and backfill existing rows with data
 - [UPDATE COLUMNS with backfill](operations/dml/update-columns.md) - Update existing columns using data from a source
 - [OPTIMIZE](operations/ddl/optimize.md) - Compact table fragments for improved query performance

docs/src/index.md

Lines changed: 2 additions & 1 deletion
@@ -17,6 +17,7 @@ Specifically, you can use the Apache Spark Connector for Lance to:
 * **Read & Write Lance Datasets**: Seamlessly read and write datasets stored in the Lance format using Spark.
 * **Distributed, Parallel Scans**: Leverage Spark's distributed computing capabilities to perform parallel scans on Lance datasets.
 * **Column and Filter Pushdown**: Optimize query performance by pushing down column selections and filters to the data source.
+* **SQL Search Table Functions**: Run [vector](operations/dql/vector-search.md), [full-text](operations/dql/search.md), and [hybrid](operations/dql/hybrid-search.md) search through Lance namespace execution.
 
 ## Quick Start
 
@@ -28,4 +29,4 @@ make docker-build
 make docker-up
 ```
 
-And then open the notebook at `http://localhost:8888`.
+And then open the notebook at `http://localhost:8888`.

docs/src/operations/ddl/create-index.md

Lines changed: 2 additions & 0 deletions
@@ -147,6 +147,8 @@ Create an FTS index on a text column:
 );
 ```
 
+Query the indexed column with the [SEARCH](../dql/search.md) table function.
+
 ## Output
 
 The `CREATE INDEX` command returns the following information about the operation:

docs/src/operations/dql/.pages

Lines changed: 3 additions & 0 deletions
@@ -1,3 +1,6 @@
 title: DQL
 nav:
   - select.md
+  - vector-search.md
+  - search.md
+  - hybrid-search.md
Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
+# HYBRID_SEARCH
+
+Run vector search and full-text search together from Spark SQL, then rerank the combined results with reciprocal rank fusion.
+
+!!! warning "Spark Extension Required"
+    `HYBRID_SEARCH` requires the Lance Spark SQL extension to be enabled. See [Spark SQL Extensions](../../config.md#spark-sql-extensions) for configuration details.
+
+!!! note "Namespace Tables Required"
+    `HYBRID_SEARCH` resolves the `table` argument through a Spark catalog and executes both side queries through the Lance namespace `queryTable` API. Use a Lance namespace catalog table such as `lance.default.documents`, not a raw Lance dataset path.
+
+!!! note "Named Arguments"
+    Named arguments require Spark 3.5 or later. On Spark 3.4, use the positional form.
+
+## Basic Usage
+
+`HYBRID_SEARCH` returns the selected table columns plus `_distance`, `_score`, and `_relevance_score`. Rows that only match one side have null for the other side's metric.
+
+=== "SQL"
+    ```sql
+    SELECT id, body, _distance, _score, _relevance_score
+    FROM HYBRID_SEARCH(
+        table => 'lance.default.documents',
+        query_vector => array(0.12, 0.34, 0.56, 0.78),
+        query => 'vector database',
+        vector_column => 'embedding',
+        search_columns => array('body'),
+        columns => array('id', 'body'),
+        num_results => 10,
+        candidates => 50,
+        rrf_k => 60.0
+    )
+    ORDER BY _relevance_score DESC;
+    ```
+
+## Positional Form
+
+Use positional arguments for simple calls and Spark 3.4 compatibility.
+
+=== "SQL"
+    ```sql
+    SELECT *
+    FROM HYBRID_SEARCH('lance.default.documents', array(0.12, 0.34, 0.56), 'lance', 5);
+    ```
+
+## Arguments
+
+| Argument | Type | Required | Description |
+|----------|------|----------|-------------|
+| `table` | String | Yes | Catalog table name to search. |
+| `query_vector` | Array numeric literal | Yes | Query vector. |
+| `query` or `search_query` | String | Yes | Full-text query string. |
+| `vector_column` | String | No | Vector column name. Lance defaults to `vector` when omitted. |
+| `search_columns` | Array string literal | No | Text columns to search. When omitted, Lance uses the indexed columns configured for the FTS index. |
+| `num_results`, `limit`, or `k` | Integer | No | Number of final reranked results. Defaults to `10`. |
+| `candidates`, `num_candidates`, or `candidate_count` | Integer | No | Number of rows to fetch from each side before reranking. Defaults to `num_results + offset`. Values below `num_results + offset` are raised to that minimum. |
+| `rrf_k` | Float | No | Reciprocal rank fusion constant. Defaults to `60.0`. |
+| `columns` | Array string literal | No | Output table columns. `_distance`, `_score`, and `_relevance_score` are always included. Use `array('*')` or omit this argument for all table columns. |
+| `filter` | String | No | SQL filter expression evaluated by Lance on both side queries. |
+| `offset` | Integer | No | Number of reranked results to skip after fusion. Defaults to `0`. |
+| `version` | Long | No | Lance table version to search. |
+| `distance_type` | String | No | Distance metric such as `l2`, `cosine`, or `dot`. |
+| `nprobes`, `ef`, `refine_factor` | Integer | No | Vector index search tuning parameters. |
+| `lower_bound`, `upper_bound` | Float | No | Distance bounds. |
+| `bypass_vector_index`, `fast_search`, `prefilter`, `with_row_id` | Boolean | No | Lance query options. `with_row_id` adds `_rowid` to the output. |
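
Reviewer note (not part of the commit): the `candidates` rule in the table above can be sketched in a few lines of Python. The function name `effective_candidates` is hypothetical; only the default (`num_results + offset`) and the clamping behavior are taken from the argument descriptions, and the actual implementation in the connector may be structured differently.

```python
from typing import Optional


def effective_candidates(
    num_results: int = 10,
    offset: int = 0,
    candidates: Optional[int] = None,
) -> int:
    """Rows to fetch from each side before reranking, per the documented rule."""
    minimum = num_results + offset
    if candidates is None:
        # Omitted argument: default to exactly what the final page needs.
        return minimum
    # Explicit values below the minimum are raised to it.
    return max(candidates, minimum)
```

For instance, `effective_candidates(num_results=10, offset=5, candidates=3)` yields `15`, while `candidates=50` is kept as `50`.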
+
+## Reranking
+
+Hybrid search performs reciprocal rank fusion in Spark:
+
+```text
+_relevance_score = sum(1.0 / (rank + rrf_k))
+```
+
+Ranks are zero-based in each side's result set. `candidates` controls how many rows are fetched from each side before reranking.
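
Reviewer note (not part of the commit): a minimal Python sketch of the fusion formula above, using zero-based ranks as documented. The function name `rrf_fuse` and the use of plain row-id lists are assumptions for illustration; the sketch also assumes each id appears at most once per side, and its tie-breaking (stable sort in first-seen order) is not claimed to match the Spark implementation.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def rrf_fuse(
    vector_ids: Sequence[Hashable],
    fts_ids: Sequence[Hashable],
    rrf_k: float = 60.0,
    num_results: int = 10,
    offset: int = 0,
) -> List[Tuple[Hashable, float]]:
    """Fuse two ranked id lists with reciprocal rank fusion."""
    scores = defaultdict(float)
    for side in (vector_ids, fts_ids):
        # enumerate() starts at 0, matching the documented zero-based ranks.
        for rank, row_id in enumerate(side):
            scores[row_id] += 1.0 / (rank + rrf_k)
    # Highest relevance first; ties keep first-seen order (sort is stable).
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # offset is applied after fusion, then the page is cut to num_results.
    return ranked[offset:offset + num_results]


fused = rrf_fuse(["a", "b", "c"], ["b", "d"])
```

With these inputs, `b` appears on both sides and accumulates `1/61 + 1/60`, so it ranks ahead of `a` (`1/60`), `d` (`1/61`), and `c` (`1/62`).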
+
+## Output
+
+The result includes the requested table columns plus nullable `_distance` and `_score` float columns and a non-null `_relevance_score` float column. If `with_row_id => true`, or if `_rowid` is listed in `columns`, the result also includes Lance row ids.
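
Reviewer note (not part of the commit): the nullable-metric behavior described above can be pictured with the Python sketch below. The function `build_output_rows` and the dictionary-based row shape are hypothetical; the sketch only illustrates that a row matched by one side carries a null (`None`) for the other side's metric while `_relevance_score` is always set.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def build_output_rows(
    fused: Sequence[Tuple[Hashable, float]],
    distances: Dict[Hashable, float],
    scores: Dict[Hashable, float],
) -> List[Dict[str, Optional[object]]]:
    """Attach side metrics to fused (id, relevance) pairs."""
    rows = []
    for row_id, relevance in fused:
        rows.append({
            "id": row_id,
            # dict.get returns None when this side did not return the row.
            "_distance": distances.get(row_id),
            "_score": scores.get(row_id),
            "_relevance_score": relevance,
        })
    return rows


rows = build_output_rows(
    fused=[("b", 0.033), ("a", 0.016)],
    distances={"a": 0.1, "b": 0.2},
    scores={"b": 4.5},
)
```

Here row `a` was returned only by the vector side, so its `_score` is `None`, while row `b` carries both metrics.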
+
+## Execution
+
+Spark plans `HYBRID_SEARCH` as a DataSource V2 batch read with one input partition. The partition reader issues one vector `queryTable` request and one full-text `queryTable` request through the Lance namespace API, merges the two result sets in Spark with reciprocal rank fusion, and returns the final rows. With a REST namespace the two side searches can be handled by the REST server, while the final fusion currently happens in the Spark task.
+
+## Validation
+
+The Docker integration suite covers `HYBRID_SEARCH` against the directory namespace and a REST namespace backed by a directory namespace. The `Spark Search Docker` GitHub Actions workflow runs both backends for pull requests.
