You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
feat: add distributed zonemap index build with configurable segments (#516)
## Summary
- Add zonemap as a new index type in `CREATE INDEX` DDL with distributed
build support
- Batch fragments into configurable segments via `num_segments` option
(defaults to `spark.default.parallelism`)
- Each segment is built in parallel on Spark executors and committed as
a logical index on the driver
- Zonemap indexes currently support single column only
## What Changed
- `AddIndexExec.scala`: Zonemap-specific path with
`ZonemapIndexJob`/`ZonemapIndexTask` and `commitIndexSegments`
- `create-index.md`: Document zonemap index type, options, and usage
- Tests: unit tests for segment creation/validation and integration test
## Notes
- Rebased cleanly onto current `main`
- Depends on lance-core `7.0.0-beta.10` or newer which includes zonemap
segment support
- Supersedes PR #473 and closed PR #466
## Test plan
- [x] CI passes (lint, unit tests, integration tests across all
Spark/Scala versions)
- [x] Zonemap index creation with default segment count
- [x] Zonemap index creation with explicit `num_segments`
- [x] Repeated zonemap index creation replaces existing segments
- [x] Query correctness after zonemap index creation
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Beinan Wang <beinanwang@microsoft.com>
Copy file name to clipboardExpand all lines: docs/src/operations/ddl/create-index.md
+35-5Lines changed: 35 additions & 5 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -7,7 +7,7 @@ Creates a scalar index on a Lance table to accelerate queries.
7
7
8
8
## Overview
9
9
10
-
The `CREATE INDEX` command builds an index on one or more columns of a Lance table. Indexing can improve the performance of queries that filter on the indexed columns. This operation is performed in a distributed manner, building indexes for each data fragment in parallel.
10
+
The `CREATE INDEX` command builds an index on one or more columns of a Lance table. Indexing can improve the performance of queries that filter on the indexed columns. Depending on the index method, Lance Spark either uses a fragment-parallel build path or a driver-coordinated commit flow after parallel executor builds.
11
11
12
12
## Basic Usage
13
13
@@ -24,13 +24,23 @@ The following index methods are supported:
|`rows_per_zone`| Long | The approximate number of rows per zonemap zone. |
42
+
|`num_segments`| Integer | Target number of index segments (upper bound; clamped to fragment count when larger). Each segment covers a batch of fragments. Defaults to `min(fragment_count, spark.default.parallelism)`. |
43
+
34
44
### BTree Options
35
45
36
46
For the `btree` method, the following options are supported:
@@ -92,6 +102,15 @@ Create a composite index on multiple columns.
92
102
ALTER TABLE lance.db.logs CREATE INDEX idx_ts_level USING btree (timestamp, level);
93
103
```
94
104
105
+
### Lightweight Fragment Pruning
106
+
107
+
Create a zonemap index when you want lightweight min/max-based fragment pruning:
108
+
109
+
=== "SQL"
110
+
```sql
111
+
ALTER TABLE lance.db.users CREATE INDEX idx_id_zonemap USING zonemap (id);
112
+
```
113
+
95
114
### Indexing with Options
96
115
97
116
Create an index and specify the `zone_size` for the B-tree:
@@ -101,6 +120,15 @@ Create an index and specify the `zone_size` for the B-tree:
101
120
ALTER TABLE lance.db.users CREATE INDEX idx_id_zoned USING btree (id) WITH (zone_size = 2048);
102
121
```
103
122
123
+
### Zonemap with Options
124
+
125
+
Create a zonemap index and specify the approximate number of rows per zone:
126
+
127
+
=== "SQL"
128
+
```sql
129
+
ALTER TABLE lance.db.users CREATE INDEX idx_id_zonemap USING zonemap (id) WITH (rows_per_zone = 2048);
130
+
```
131
+
104
132
### Full-Text Search Index
105
133
106
134
Create an FTS index on a text column:
@@ -133,17 +161,19 @@ The `CREATE INDEX` command returns the following information about the operation
133
161
Consider creating an index when:
134
162
135
163
- You frequently filter a large table on a specific column.
164
+
- You want lightweight fragment pruning based on per-zone min/max statistics.
136
165
- Your queries involve point lookups or small range scans.
137
166
138
167
## How It Works
139
168
140
169
The `CREATE INDEX` command operates as follows:
141
170
142
-
1.**Distributed Index Building**: For each fragment in the Lance dataset, a separate task is launched to build an index on the specified column(s).
143
-
2.**Metadata Merging**: Once all per-fragment indexes are built, their metadata is collected and merged.
171
+
1.**Index Build Execution**: Lance Spark chooses an execution path based on the index method. Methods such as `btree`, `fts`, and `zonemap` can build physical index segments in parallel across fragments. `zonemap` publishes those segments directly as one logical index. Range-mode `btree` uses Spark repartitioning and sorted preprocessed data.
172
+
2.**Metadata Finalization**: Lance Spark merges or commits the resulting index metadata on the driver so the new logical index becomes visible atomically.
144
173
3.**Transactional Commit**: A new table version is committed with the new index information. The operation is atomic and ensures that concurrent reads are not affected.
145
174
146
175
## Notes and Limitations
147
176
148
-
-**Index Methods**: The `btree` and `fts` methods are supported for scalar index creation.
149
-
-**Index Replacement**: If you create an index with the same name as an existing one, the old index will be replaced by the new one. This is because the underlying implementation uses `replace(true)`.
177
+
-**Index Methods**: The `zonemap`, `btree`, and `fts` methods are supported for scalar index creation.
178
+
-**Zonemap Column Count**: Zonemap indexes currently support a single column only. The generic `CREATE INDEX` grammar accepts a column list, but Lance rejects multi-column zonemap creation.
179
+
-**Index Replacement**: If you create an index with the same name as an existing one, the old index will be replaced by the new one.
0 commit comments