
Commit fd225b9

Vova Kolmakov and claude committed
feat: support vector indexes via ALTER TABLE ... CREATE INDEX
Extends the existing `ALTER TABLE … CREATE INDEX … USING method (...)` statement to accept vector index methods (`ivf_flat`, `ivf_pq`, `ivf_hnsw_pq`, `ivf_hnsw_sq`) alongside the existing scalar methods (`btree`, `fts`). No new SQL statement is introduced: the grammar rule `LanceSqlExtensions.g4#createIndex` is unchanged; only the `method` parameter accepts new values.

Vector index training currently runs single-shot on the driver (`AddIndexExec.runVectorIndex`) because Lance's distributed vector-index path requires pre-computed IVF centroids; per-fragment tasks cannot train a global codebook on their own. A follow-up can precompute centroids in a Spark job and re-enable the per-fragment build via `IvfBuildParams.Builder.setCentroids`.

`DistanceTypes` is shared infrastructure for parsing user-facing metric strings (`l2` / `cosine` / `dot` / `hamming`) into the `DistanceType` enum from lance-core.

Index correctness is verified through the existing DataFrame API path (`option("nearest", QueryUtils.queryToString(query))`), so this PR has no dependency on the SQL TVF.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 4a14047 commit fd225b9

9 files changed

Lines changed: 915 additions & 5 deletions


docs/src/operations/ddl/create-index.md

Lines changed: 4 additions & 0 deletions
@@ -5,6 +5,10 @@ Creates a scalar index on a Lance table to accelerate queries.
 !!! warning "Spark Extension Required"
     This feature requires the Lance Spark SQL extension to be enabled. See [Spark SQL Extensions](../../config.md#spark-sql-extensions) for configuration details.
 
+!!! tip "Looking for vector (ANN) indexes?"
+    See [CREATE VECTOR INDEX](create-vector-index.md) for `ivf_flat` / `ivf_pq` / `ivf_hnsw_pq` /
+    `ivf_hnsw_sq` and the companion [lance_vector_search](../dql/vector-search.md) query TVF.
+
 ## Overview
 
 The `CREATE INDEX` command builds an index on one or more columns of a Lance table. Indexing can improve the performance of queries that filter on the indexed columns. This operation is performed in a distributed manner, building indexes for each data fragment in parallel.
Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
+# CREATE VECTOR INDEX
+
+Creates an Approximate-Nearest-Neighbour (ANN) index on a Lance **vector column** to accelerate
+similarity search.
+
+!!! warning "Spark Extension Required"
+    This feature requires the Lance Spark SQL extension to be enabled. See
+    [Spark SQL Extensions](../../config.md#spark-sql-extensions) for configuration details.
+
+!!! tip "Also see"
+    - [CREATE INDEX](create-index.md): scalar `btree` / `fts` indexes.
+    - [lance_vector_search](../dql/vector-search.md): the kNN query that uses these indexes.
+
+## Overview
+
+A vector index lets Lance answer *"find the `k` rows whose `embedding` is closest to this query
+vector"* without scanning every row. Four index families are supported, each balancing recall,
+latency, and storage differently.
+
+Index creation reuses the existing `ALTER TABLE ... CREATE INDEX` syntax; the only thing that
+changes is the `USING` method and the arguments under `WITH (...)`.
+
+## Basic Usage
+
+=== "SQL"
+    ```sql
+    ALTER TABLE lance.db.items CREATE INDEX emb_idx USING ivf_pq (embedding)
+    WITH (num_partitions = 256, num_sub_vectors = 16, metric = 'cosine');
+    ```
+
+The column must be a **vector column** (see [CREATE TABLE → Vector Columns](create-table.md)): an
+`ARRAY<FLOAT>` or `ARRAY<DOUBLE>` with the `arrow.fixed-size-list.size` property set to the
+vector dimension. Asking for a vector index on a scalar column fails fast with a clear error.
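
The vector-column requirement described above can be sketched as a small standalone check. This is a minimal illustration, not the `VectorUtils.isVectorField` implementation referenced in this commit: the class `VectorColumnCheckSketch`, the method `requireVectorDimension`, and the use of a plain string for the Spark type are all hypothetical, chosen to keep the example free of Spark dependencies.

```java
import java.util.Map;

/** Hypothetical illustration only; the commit's actual check lives in VectorUtils.isVectorField. */
final class VectorColumnCheckSketch {
  // Field-level metadata key named in the documentation above.
  static final String SIZE_KEY = "arrow.fixed-size-list.size";

  /**
   * Returns the vector dimension if the column qualifies, otherwise throws.
   * sparkType is a plain string here (an assumption that avoids depending on Spark types).
   */
  static int requireVectorDimension(String column, String sparkType, Map<String, String> metadata) {
    boolean elementTypeOk =
        "ARRAY<FLOAT>".equalsIgnoreCase(sparkType) || "ARRAY<DOUBLE>".equalsIgnoreCase(sparkType);
    String rawSize = metadata.get(SIZE_KEY);
    if (!elementTypeOk || rawSize == null) {
      throw new IllegalArgumentException(
          "Column '" + column + "' is not a vector column: expected ARRAY<FLOAT|DOUBLE> with '"
              + SIZE_KEY + "' set to the vector dimension.");
    }
    final int dim;
    try {
      dim = Integer.parseInt(rawSize.trim());
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException(
          "Column '" + column + "' has a non-numeric '" + SIZE_KEY + "': " + rawSize, e);
    }
    if (dim <= 0) {
      throw new IllegalArgumentException(
          "Column '" + column + "' has a non-positive vector dimension: " + dim);
    }
    return dim;
  }
}
```

Returning the parsed dimension, rather than a boolean, lets a caller reuse it for later checks such as the `num_sub_vectors` divisibility rule.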
+
+## Index Methods
+
+| Method        | Storage vs. recall                                   | Typical use case                               |
+|---------------|------------------------------------------------------|------------------------------------------------|
+| `ivf_flat`    | Full precision. Largest index, best recall.          | Small/medium corpora, recall-first workloads.  |
+| `ivf_pq`      | PQ-compressed. Smallest index, good recall.          | Large corpora where storage is the constraint. |
+| `ivf_hnsw_pq` | HNSW graph + PQ codes. Fast + compressed.            | Latency-sensitive search on large corpora.     |
+| `ivf_hnsw_sq` | HNSW graph + scalar quantisation. Fast, medium size. | Balanced latency/size on medium corpora.       |
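
How a method name selects the scalar or vector build path can be sketched as follows. The method names come from this commit; the class `IndexMethodSketch`, its `Kind` enum, and `classify` are hypothetical stand-ins for `IndexUtils.buildIndexType` and `IndexUtils.isVectorMethod`, and case-insensitive matching for every path is an assumption based on the `method.toLowerCase` call in the diff.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration of scalar vs. vector method dispatch; not the project's code. */
final class IndexMethodSketch {
  enum Kind { SCALAR, VECTOR }

  // Method names listed in the commit message.
  private static final Set<String> SCALAR_METHODS = Set.of("btree", "fts");
  private static final Set<String> VECTOR_METHODS =
      Set.of("ivf_flat", "ivf_pq", "ivf_hnsw_pq", "ivf_hnsw_sq");

  /** Classifies a USING method name, rejecting anything that is not listed above. */
  static Kind classify(String method) {
    // Locale.ROOT keeps lowercasing independent of the JVM's default locale.
    String key = method.toLowerCase(Locale.ROOT);
    if (VECTOR_METHODS.contains(key)) {
      return Kind.VECTOR;
    }
    if (SCALAR_METHODS.contains(key)) {
      return Kind.SCALAR;
    }
    throw new UnsupportedOperationException("Unsupported index method: " + method);
  }
}
```

For instance, `IndexMethodSketch.classify("IVF_PQ")` yields `Kind.VECTOR`, while an unknown name such as `"hash"` throws `UnsupportedOperationException`.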
+
+## Distance Metrics
+
+| Metric    | Alias                 | Notes                                                      |
+|-----------|-----------------------|------------------------------------------------------------|
+| `l2`      | `euclidean`           | Default. Euclidean distance.                               |
+| `cosine`  |                       | Cosine distance (implemented as L2 on normalised vectors). |
+| `dot`     | `inner_product`, `ip` | Negative inner product: a larger inner product ⇒ *closer*. |
+| `hamming` |                       | Hamming distance over binary-encoded vectors.              |
+
+The metric is stored in the index. Queries that don't specify a metric use the index-stored
+default, but they can override it in the TVF call if needed.
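
The mapping from user-facing metric strings to a metric value can be sketched as a lookup table. This is not the `DistanceTypes` class from this commit, whose source is not shown here: `MetricParserSketch` and its local `Metric` enum are hypothetical, the names and aliases are copied from the table above, and the trimming, case-insensitivity, and null-to-`l2` fallback are assumptions.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for metric-string parsing; not the actual DistanceTypes class. */
final class MetricParserSketch {
  /** Local enum for illustration; the real code targets lance-core's DistanceType. */
  enum Metric { L2, COSINE, DOT, HAMMING }

  // Canonical names and aliases from the Distance Metrics table.
  private static final Map<String, Metric> BY_NAME = Map.of(
      "l2", Metric.L2,
      "euclidean", Metric.L2,
      "cosine", Metric.COSINE,
      "dot", Metric.DOT,
      "inner_product", Metric.DOT,
      "ip", Metric.DOT,
      "hamming", Metric.HAMMING);

  /** Parses a metric string; null or blank input falls back to L2, the documented default. */
  static Metric parse(String raw) {
    if (raw == null || raw.trim().isEmpty()) {
      return Metric.L2;
    }
    // Assumption: matching ignores case and surrounding whitespace.
    Metric metric = BY_NAME.get(raw.trim().toLowerCase(Locale.ROOT));
    if (metric == null) {
      throw new IllegalArgumentException("Unsupported metric: '" + raw + "'");
    }
    return metric;
  }
}
```

Keeping aliases in one table means a new alias is a one-line change, and an unknown string fails with an explicit message instead of silently defaulting.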
+
+## Options
+
+All options go under the `WITH (...)` clause. Every knob is optional; unspecified values fall
+back to the Lance defaults, which are chosen to work well for most workloads.
+
+### Shared IVF options
+
+| Option           | Type    | Default | Meaning                                |
+|------------------|---------|---------|----------------------------------------|
+| `metric`         | String  | `l2`    | See metric table above.                |
+| `num_partitions` | Integer | `256`   | IVF centroid count (k-means training). |
+| `sample_rate`    | Integer | `256`   | Training sample ratio.                 |
+| `max_iterations` | Integer | `50`    | Max k-means / PQ training iterations.  |
+
+### Additional `ivf_pq`, `ivf_hnsw_pq` options
+
+| Option            | Type    | Default    | Meaning                                               |
+|-------------------|---------|------------|-------------------------------------------------------|
+| `num_sub_vectors` | Integer | `dim / 16` | Number of PQ subquantisers. Must divide `dim` evenly. |
+| `num_bits`        | Integer | `8`        | Bits per PQ code (4 or 8).                            |
+
+### Additional `ivf_hnsw_sq` options
+
+| Option     | Type    | Default | Meaning                      |
+|------------|---------|---------|------------------------------|
+| `num_bits` | Integer | `8`     | Bits of scalar quantisation. |
+
+### Additional HNSW options (`ivf_hnsw_*`)
+
+| Option            | Type    | Default | Meaning                         |
+|-------------------|---------|---------|---------------------------------|
+| `m`               | Integer | `20`    | HNSW graph degree.              |
+| `ef_construction` | Integer | `300`   | Build-time candidate list size. |
+
+## Supported Vector Data Types
+
+| Spark type               | Metadata                                       | Spark support                |
+|--------------------------|------------------------------------------------|------------------------------|
+| `ARRAY<FLOAT>`           | `'<col>.arrow.fixed-size-list.size' = '<dim>'` | All (3.4, 3.5, 4.0, 4.1).    |
+| `ARRAY<DOUBLE>`          | `'<col>.arrow.fixed-size-list.size' = '<dim>'` | All (3.4, 3.5, 4.0, 4.1).    |
+| `ARRAY<FLOAT>` (float16) | add `'<col>.arrow.float16' = 'true'`           | Spark 4.0+ only (Arrow 18+). |
+
+## Examples
+
+### IVF-PQ with cosine similarity
+```sql
+ALTER TABLE lance.db.items CREATE INDEX emb_idx USING ivf_pq (embedding)
+WITH (num_partitions = 256, num_sub_vectors = 16, metric = 'cosine');
+```
+
+### IVF-Flat on a small corpus (highest recall, largest footprint)
+```sql
+ALTER TABLE lance.db.small CREATE INDEX emb_idx USING ivf_flat (embedding)
+WITH (num_partitions = 32, metric = 'l2');
+```
+
+### IVF-HNSW-PQ for latency-sensitive search
+```sql
+ALTER TABLE lance.db.items CREATE INDEX emb_idx USING ivf_hnsw_pq (embedding)
+WITH (num_partitions = 256, num_sub_vectors = 16, m = 32, ef_construction = 200);
+```
+
+### IVF-HNSW-SQ with scalar quantisation
+```sql
+ALTER TABLE lance.db.items CREATE INDEX emb_idx USING ivf_hnsw_sq (embedding)
+WITH (num_partitions = 128, num_bits = 8, m = 16);
+```
+
+### Rebuild (replace) an existing index
+
+Running `CREATE INDEX` with the same name **replaces** the previous index atomically. There is no
+separate "rebuild" statement.
+
+## Output
+
+| Column              | Type   | Description                            |
+|---------------------|--------|----------------------------------------|
+| `fragments_indexed` | Long   | Number of fragments that were indexed. |
+| `index_name`        | String | Name of the created index.             |
+
+## How It Works
+
+1. **Driver-side validation**: the column is verified to be a vector column (correct dtype +
+   `arrow.fixed-size-list.size` metadata). Misconfigured columns fail fast.
+2. **Single-shot training + encoding**: the driver opens the Lance dataset and lets Lance's
+   native `createIndex` train and build the requested IVF / PQ / HNSW / SQ index in one call.
+3. **No per-fragment fan-out yet**: Lance's distributed vector-index path requires pre-computed
+   IVF centroids, so per-fragment tasks cannot train a global codebook on their own. A
+   follow-up can precompute centroids in a Spark job and re-enable the per-fragment build.
+4. **Transactional commit**: the new index is committed in a single Lance transaction; existing
+   readers continue to see the previous version until they refresh.
+
+## Notes and Limitations
+
+- **Column arity**: vector indexes take exactly **one vector column**. You cannot create a
+  composite vector index across two columns.
+- **Dimension**: the `arrow.fixed-size-list.size` metadata must be set on the column. Without it
+  the DDL refuses to build the index.
+- **`num_sub_vectors`** must divide the vector dimension. A dimension of 128 is compatible with
+  `num_sub_vectors ∈ {1, 2, 4, 8, 16, 32, 64, 128}`.
+- **Float16**: requires Spark 4.0+ (Arrow 18+). On earlier Spark versions, use `ARRAY<FLOAT>`
+  without the `arrow.float16` property.
+- **Interaction with OPTIMIZE/VACUUM**: both commands preserve vector indexes. Running
+  `OPTIMIZE` after large writes is recommended so the ANN search sees a consolidated layout.
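
The `num_sub_vectors` rule in the notes above can be checked with simple integer arithmetic. The sketch below is illustrative only: `SubVectorCheckSketch` and its methods are hypothetical names, and rejecting a zero default (which `dim / 16` produces for dimensions below 16) is an assumption, since the source does not say how Lance handles that case.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the num_sub_vectors divisibility rule; not the project's code. */
final class SubVectorCheckSketch {
  /** Default from the options table: dim / 16, using integer division. */
  static int defaultNumSubVectors(int dim) {
    return dim / 16;
  }

  /** Throws unless numSubVectors is positive and divides dim evenly. */
  static void validate(int dim, int numSubVectors) {
    if (dim <= 0) {
      throw new IllegalArgumentException("Vector dimension must be positive, got " + dim);
    }
    if (numSubVectors <= 0 || dim % numSubVectors != 0) {
      throw new IllegalArgumentException(
          "num_sub_vectors = " + numSubVectors + " must be a positive divisor of dim = " + dim);
    }
  }

  /** Lists every value of num_sub_vectors that divides dim evenly, in ascending order. */
  static List<Integer> validChoices(int dim) {
    List<Integer> choices = new ArrayList<>();
    for (int candidate = 1; candidate <= dim; candidate++) {
      if (dim % candidate == 0) {
        choices.add(candidate);
      }
    }
    return choices;
  }
}
```

For a 128-dimensional column, `validChoices(128)` returns the eight values listed in the note, and `validate(128, 24)` throws because 24 does not divide 128.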
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.lance.spark.update;
+
+public class LanceVectorIndexTest extends BaseLanceVectorIndexTest {}
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.lance.spark.update;
+
+public class LanceVectorIndexTest extends BaseLanceVectorIndexTest {}

lance-spark-base_2.12/src/main/scala/org/apache/spark/sql/execution/datasources/v2/AddIndexExec.scala

Lines changed: 81 additions & 5 deletions
@@ -32,7 +32,7 @@ import org.lance.index.scalar.{BTreeIndexParams, ScalarIndexParams}
 import org.lance.operation.{CreateIndex => AddIndexOperation}
 import org.lance.spark.{BaseLanceNamespaceSparkCatalog, LanceDataset, LanceRuntime, LanceSparkReadOptions}
 import org.lance.spark.arrow.LanceArrowWriter
-import org.lance.spark.utils.{CloseableUtil, Utils}
+import org.lance.spark.utils.{CloseableUtil, Utils, VectorUtils}
 
 import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
 import java.util.{Collections, Optional, UUID}
@@ -64,6 +64,22 @@ case class AddIndexExec(
       case _ => throw new UnsupportedOperationException("AddIndex only supports LanceDataset")
     }
 
+    if (IndexUtils.isVectorMethod(method)) {
+      val schema = lanceDataset.schema()
+      columns.foreach { colName =>
+        val field = schema.fields.find(_.name == colName).getOrElse(
+          throw new IllegalArgumentException(
+            s"Column '$colName' does not exist in table ${ident.toString}"))
+        if (!VectorUtils.isVectorField(field)) {
+          throw new IllegalArgumentException(
+            s"Column '$colName' is not a vector column: vector index method '$method' requires " +
+              s"an ARRAY<FLOAT|DOUBLE> column with metadata key " +
+              s"'${VectorUtils.ARROW_FIXED_SIZE_LIST_SIZE_KEY}' set to the vector dimension.")
+        }
+      }
+      return runVectorIndex(lanceDataset)
+    }
+
     val readOptions = lanceDataset.readOptions()
 
     // Get all fragment id list from dataset
@@ -139,6 +155,49 @@ case class AddIndexExec(
         UTF8String.fromString(indexName))))
   }
 
+  /**
+   * Single-shot, driver-side build for vector (ANN) indexes.
+   *
+   * Unlike scalar indexes, Lance's distributed vector-index path requires pre-computed IVF
+   * centroids: per-fragment tasks cannot train a global codebook on their own and the native
+   * code rejects the call with
+   * "Build Distributed Vector Index: missing precomputed IVF centroids".
+   *
+   * For the first cut, we sidestep that by letting Lance's native `createIndex` do the whole
+   * training + commit in one call on the driver. This forgoes the Spark-level fan-out but is the
+   * only working path today; a follow-up can precompute centroids in a Spark job and re-enable
+   * the per-fragment build through `IvfBuildParams.Builder.setCentroids`.
+   */
+  private def runVectorIndex(lanceDataset: LanceDataset): Seq[InternalRow] = {
+    val readOptions = lanceDataset.readOptions()
+    val argsJson = IndexUtils.toJson(args)
+    val indexType = IndexUtils.buildIndexType(method)
+    val params = IndexParams
+      .builder()
+      .setVectorIndexParams(VectorIndexParamsBuilder.build(method, argsJson))
+      .build()
+
+    val dataset = Utils.openDatasetBuilder(readOptions).build()
+    try {
+      val fragmentCount = dataset.getFragments.size().toLong
+      if (fragmentCount == 0L) {
+        return Seq(new GenericInternalRow(Array[Any](0L, UTF8String.fromString(indexName))))
+      }
+      val indexOptions = IndexOptions
+        .builder(columns.asJava, indexType, params)
+        .replace(true)
+        .withIndexName(indexName)
+        .withIndexUUID(UUID.randomUUID().toString)
+        .build()
+      dataset.createIndex(indexOptions)
+      Seq(new GenericInternalRow(Array[Any](
+        fragmentCount,
+        UTF8String.fromString(indexName))))
+    } finally {
+      dataset.close()
+    }
+  }
+
   private def createIndexJob(
       lanceDataset: LanceDataset,
       readOptions: LanceSparkReadOptions,
@@ -294,9 +353,15 @@ case class FragmentIndexTask(
   def execute(): String = {
     val readOptions = decode[LanceSparkReadOptions](encodedReadOptions)
     val indexType = IndexUtils.buildIndexType(method)
-    val params = IndexParams.builder()
-      .setScalarIndexParams(ScalarIndexParams.create(method, argsJson))
-      .build()
+    val params = if (VectorIndexParamsBuilder.isVectorMethod(method)) {
+      IndexParams.builder()
+        .setVectorIndexParams(VectorIndexParamsBuilder.build(method, argsJson))
+        .build()
+    } else {
+      IndexParams.builder()
+        .setScalarIndexParams(ScalarIndexParams.create(method, argsJson))
+        .build()
+    }
 
     val indexOptions = IndexOptions
       .builder(java.util.Arrays.asList(columns: _*), indexType, params)
@@ -528,13 +593,24 @@ object IndexUtils {
    * @throws UnsupportedOperationException if the method is not supported
    */
   def buildIndexType(method: String): IndexType = {
-    method match {
+    method.toLowerCase match {
       case "btree" => IndexType.BTREE
       case "fts" => IndexType.INVERTED
+      case "ivf_flat" => IndexType.IVF_FLAT
+      case "ivf_pq" => IndexType.IVF_PQ
+      case "ivf_hnsw_pq" => IndexType.IVF_HNSW_PQ
+      case "ivf_hnsw_sq" => IndexType.IVF_HNSW_SQ
       case other => throw new UnsupportedOperationException(s"Unsupported index method: $other")
     }
   }
 
+  /**
+   * True iff the given method is one of the ANN vector index kinds. Kept here so callers outside
+   * this file can branch cleanly between the scalar and vector parameter-building paths.
+   */
+  def isVectorMethod(method: String): Boolean =
+    VectorIndexParamsBuilder.isVectorMethod(method)
+
   def toJson(args: Seq[LanceNamedArgument]): String = {
     if (args.isEmpty) {
       "{}"
