
Commit 3358e12

Add IVF index for vec0 virtual table
Add inverted file (IVF) index type: partitions vectors into clusters via k-means, quantizes to int8, and scans only the nearest nprobe partitions at query time. Includes shadow table management, insert/delete, KNN integration, compile flag (SQLITE_VEC_ENABLE_IVF), fuzz targets, and tests. Removes superseded ivf-benchmarks/ directory.
1 parent 43982c1 commit 3358e12

22 files changed

Lines changed: 5237 additions & 28 deletions

IVF_PLAN.md

Lines changed: 264 additions & 0 deletions
@@ -0,0 +1,264 @@
# IVF Index for sqlite-vec

## Overview

IVF (Inverted File Index) is an approximate nearest neighbor index for
sqlite-vec's `vec0` virtual table. It partitions vectors into clusters via
k-means, then at query time only scans the nearest clusters instead of all
vectors. Combined with scalar or binary quantization, this gives 5-20x query
speedups over brute-force with tunable recall.

## SQL API

### Table Creation

```sql
CREATE VIRTUAL TABLE vec_items USING vec0(
  id INTEGER PRIMARY KEY,
  embedding float[768] distance_metric=cosine
    INDEXED BY ivf(nlist=128, nprobe=16)
);

-- With quantization (4x smaller cells, rescore for recall)
CREATE VIRTUAL TABLE vec_items USING vec0(
  id INTEGER PRIMARY KEY,
  embedding float[768] distance_metric=cosine
    INDEXED BY ivf(nlist=128, nprobe=16, quantizer=int8, oversample=4)
);
```

### Parameters

| Parameter | Values | Default | Description |
|-----------|--------|---------|-------------|
| `nlist` | 1-65536, or 0 | 128 | Number of k-means clusters. Rule of thumb: `sqrt(N)` |
| `nprobe` | 1-nlist | 10 | Clusters to search at query time. More = better recall, slower |
| `quantizer` | `none`, `int8`, `binary` | `none` | How vectors are stored in cells |
| `oversample` | >= 1 | 1 | Re-rank `oversample * k` candidates with full-precision distance |
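
The `sqrt(N)` rule of thumb for `nlist` can be sketched as a small helper. `suggest_nlist` is a hypothetical name used only for illustration and is not part of sqlite-vec; it rounds the square root of the row count and clamps the result to the 1-65536 range listed in the table.

```python
import math


def suggest_nlist(n_vectors: int, lo: int = 1, hi: int = 65536) -> int:
    """Hypothetical helper: round sqrt(N) and clamp to the documented nlist range."""
    if n_vectors < 1:
        return lo
    return max(lo, min(hi, round(math.sqrt(n_vectors))))


for n in (10_000, 100_000, 1_000_000):
    print(n, suggest_nlist(n))
```

For 1M rows this gives 1000. Treat the result as a starting point for a parameter sweep rather than a tuned value.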

### Inserting Vectors

```sql
-- Works immediately, even before training
INSERT INTO vec_items(id, embedding) VALUES (1, :vector);
```

Before centroids exist, vectors go to an "unassigned" partition and queries
fall back to a brute-force scan. After training, new inserts are assigned to
the nearest centroid.

### Training (Computing Centroids)

```sql
-- Run built-in k-means on all vectors
INSERT INTO vec_items(id) VALUES ('compute-centroids');
```

This loads all vectors into memory, runs k-means++ with Lloyd's algorithm,
creates quantized centroids, and redistributes all vectors into cluster cells.
It's a blocking operation, so run it once after bulk insert.

### Manual Centroid Import

```sql
-- Import externally-computed centroids
INSERT INTO vec_items(id, embedding) VALUES ('set-centroid:0', :centroid_0);
INSERT INTO vec_items(id, embedding) VALUES ('set-centroid:1', :centroid_1);

-- Assign vectors to imported centroids
INSERT INTO vec_items(id) VALUES ('assign-vectors');
```

### Runtime Parameter Tuning

```sql
-- Change nprobe without rebuilding the index
INSERT INTO vec_items(id) VALUES ('nprobe=32');
```

### KNN Queries

```sql
-- Same syntax as standard vec0
SELECT id, distance
FROM vec_items
WHERE embedding MATCH :query AND k = 10;
```

### Other Commands

```sql
-- Remove centroids, move all vectors back to unassigned
INSERT INTO vec_items(id) VALUES ('clear-centroids');
```

## How It Works

### Architecture

```
User vector (float32)
  → quantize to int8/binary (if quantizer != none)
  → find nearest centroid (quantized distance)
  → store quantized vector in cell blob
  → store full vector in KV table (if quantizer != none)
  → query:
      1. quantize query vector
      2. find top nprobe centroids by quantized distance
      3. scan cell blobs: quantized distance (fast, small I/O)
      4. if oversample > 1: re-score top N*k with full vectors
      5. return top k
```
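
The query path above can be sketched in a few lines of Python. This is a simplified model, not the C implementation: it skips quantization and re-scoring (steps 1 and 4), uses plain L2 distance, and keeps cells in a dictionary. The names `ivf_search`, `cells`, and `l2` are illustrative assumptions.

```python
import heapq
import math
import random


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ivf_search(query, centroids, cells, k, nprobe):
    """cells maps a centroid index to a list of (rowid, vector) pairs."""
    # Step 2: pick the nprobe centroids closest to the query.
    probe = heapq.nsmallest(nprobe, range(len(centroids)),
                            key=lambda c: l2(query, centroids[c]))
    # Step 3: scan only the vectors stored under the probed centroids.
    candidates = [(l2(query, vec), rowid)
                  for c in probe for rowid, vec in cells.get(c, [])]
    # Step 5: return the k closest candidates as (distance, rowid) pairs.
    return heapq.nsmallest(k, candidates)


# Toy data: four fixed centroids, 200 random 2-D vectors assigned to the nearest.
random.seed(0)
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
cells = {c: [] for c in range(len(centroids))}
vectors = {}
for rowid in range(200):
    v = (random.uniform(0, 10), random.uniform(0, 10))
    vectors[rowid] = v
    nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
    cells[nearest].append((rowid, v))

query = (1.0, 1.0)
approx = ivf_search(query, centroids, cells, k=5, nprobe=1)
exact = ivf_search(query, centroids, cells, k=5, nprobe=len(centroids))
print(approx)
print(exact)
```

With `nprobe` equal to the number of centroids the search degenerates to a full scan, which is why recall approaches the flat baseline as `nprobe` grows.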

### Shadow Tables

For a table `vec_items` with vector column index 0:

| Table | Schema | Purpose |
|-------|--------|---------|
| `vec_items_ivf_centroids00` | `centroid_id PK, centroid BLOB` | K-means centroids (quantized) |
| `vec_items_ivf_cells00` | `centroid_id, n_vectors, validity BLOB, rowids BLOB, vectors BLOB` | Packed vector cells, 64 vectors max per row. Multiple rows per centroid. Index on centroid_id. |
| `vec_items_ivf_rowid_map00` | `rowid PK, cell_id, slot` | Maps vector rowid → cell location for O(1) delete |
| `vec_items_ivf_vectors00` | `rowid PK, vector BLOB` | Full-precision vectors (only when quantizer != none) |

### Cell Storage

Cells use packed blob storage identical to vec0's chunk layout:
- **validity**: bitmap (1 bit per slot) marking live vectors
- **rowids**: packed i64 array
- **vectors**: packed array of quantized vectors

Cells are capped at 64 vectors (~200KB at 768-dim float32, ~48KB for int8,
~6KB for binary). When a cell fills, a new row is created for the same
centroid. This avoids SQLite overflow page traversal, which was a 110x
performance bottleneck with unbounded cells.
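
A minimal sketch of the three packed blobs and the size arithmetic behind the figures above. The `Cell` class, the least-significant-bit-first bitmap order, and the little-endian rowid encoding are assumptions made for illustration; the actual byte layout is defined by the C code.

```python
import struct

CELL_CAPACITY = 64  # slots per cell row, as described above


def vectors_blob_size(dim, bits_per_dim, capacity=CELL_CAPACITY):
    """Bytes needed for the packed vectors blob of one full cell."""
    return capacity * dim * bits_per_dim // 8


class Cell:
    """Toy model of one cell row: validity bitmap, rowids, vectors."""

    def __init__(self, vector_size):
        self.vector_size = vector_size
        self.validity = bytearray(CELL_CAPACITY // 8)   # 1 bit per slot
        self.rowids = bytearray(CELL_CAPACITY * 8)      # packed i64
        self.vectors = bytearray(CELL_CAPACITY * vector_size)

    def insert(self, rowid, vector_bytes):
        if len(vector_bytes) != self.vector_size:
            raise ValueError("vector has the wrong byte length")
        for slot in range(CELL_CAPACITY):
            if not self.validity[slot // 8] & (1 << (slot % 8)):
                self.validity[slot // 8] |= 1 << (slot % 8)
                struct.pack_into("<q", self.rowids, slot * 8, rowid)
                start = slot * self.vector_size
                self.vectors[start:start + self.vector_size] = vector_bytes
                return slot
        return None  # full: the caller would create a new row for this centroid

    def delete(self, slot):
        # Clearing the validity bit is enough; the slot can be reused later.
        self.validity[slot // 8] &= ~(1 << (slot % 8)) & 0xFF

    def live_rowids(self):
        return [struct.unpack_from("<q", self.rowids, s * 8)[0]
                for s in range(CELL_CAPACITY)
                if self.validity[s // 8] & (1 << (s % 8))]


print(vectors_blob_size(768, 32))  # float32: 196608 bytes
print(vectors_blob_size(768, 8))   # int8:     49152 bytes
print(vectors_blob_size(768, 1))   # binary:    6144 bytes

cell = Cell(vector_size=4)
for rowid in (100, 101, 102):
    cell.insert(rowid, bytes(4))
cell.delete(1)               # clear the slot that held rowid 101
print(cell.live_rowids())    # [100, 102]
```

Because a delete only clears a bit, the rowid map's `(cell_id, slot)` pair is all that is needed to remove a vector without scanning the cell.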

### Quantization

**int8**: Each float32 dimension is clamped to [-1, 1] and scaled to the int8
range [-127, 127]. 4x storage reduction. Distance computed via int8 L2.

**binary**: Sign-bit quantization, where each bit is 1 if the float is positive.
32x storage reduction. Distance computed via hamming distance.
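
Both quantizers are simple enough to sketch directly. The function names below are illustrative, and two details are assumptions not stated in this plan: the rounding mode for int8 (Python's `round`) and the bit order for binary packing (least significant bit first).

```python
import math


def quantize_int8(vec):
    """Clamp each value to [-1, 1] and scale to the int8 range [-127, 127]."""
    return [round(max(-1.0, min(1.0, x)) * 127) for x in vec]


def quantize_binary(vec):
    """Pack sign bits: bit i is 1 when vec[i] is strictly positive."""
    out = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def l2_int8(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


print(quantize_int8([0.0, 0.25, 1.0, -2.0]))     # [0, 32, 127, -127]
print(quantize_binary([0.5, -0.5, 0.1, -0.1]))   # b'\x05'
print(hamming(b"\x05", b"\x03"))                 # 2
```

Values outside [-1, 1] saturate, so int8 quantization as described suits normalized embeddings best.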

**Oversample re-ranking**: When `oversample > 1`, the quantized scan collects
`oversample * k` candidates, then looks up each candidate's full-precision
vector from the KV table and re-computes exact distance. This recovers nearly
all recall lost from quantization. At oversample=4 with int8, recall is within
about 0.002 of non-quantized IVF in the benchmarks below.
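
Re-ranking is a two-stage top-k: shortlist by approximate distance, then re-score the shortlist exactly. A minimal sketch, with `rerank` and its arguments as illustrative names and a plain dictionary standing in for the full-vector KV table:

```python
import heapq
import math


def rerank(query, quantized_hits, full_vectors, k, oversample):
    """quantized_hits: (approx_distance, rowid) pairs from the quantized scan."""
    # Stage 1: keep oversample * k candidates by approximate distance.
    shortlist = heapq.nsmallest(oversample * k, quantized_hits)
    # Stage 2: re-score the shortlist with full-precision vectors.
    rescored = [(math.dist(query, full_vectors[rowid]), rowid)
                for _, rowid in shortlist]
    return heapq.nsmallest(k, rescored)


full_vectors = {1: (0.1, 0.0), 2: (0.2, 0.0), 3: (0.3, 0.0), 4: (0.9, 0.0)}
# Coarse approximate distances that rank rowid 2 ahead of the true nearest, 1.
hits = [(0.0, 2), (0.0, 3), (0.1, 1), (0.5, 4)]
query = (0.0, 0.0)

print(rerank(query, hits, full_vectors, k=1, oversample=1))  # picks rowid 2
print(rerank(query, hits, full_vectors, k=1, oversample=3))  # picks rowid 1
```

The cost is `oversample * k` extra point lookups per query, which is why small factors such as 4 are attractive.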

### K-Means

Uses Lloyd's algorithm with k-means++ initialization:
1. K-means++ picks initial centroids weighted by distance
2. Lloyd's iterations: assign vectors to nearest centroid, recompute centroids as cluster means
3. Empty cluster handling: reassign to farthest point
4. K-means runs in float32; centroids are quantized before storage

Training data: recommend 16× nlist vectors. At nlist=1000, that's 16k
vectors; k-means takes ~140s on 768-dim data.
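
The training steps can be sketched in pure Python. This is a small illustration of k-means++ seeding followed by Lloyd's iterations, not a port of `sqlite-vec-ivf-kmeans.c`. The empty-cluster rule here (reseed at the point farthest from its nearest centroid) is one reading of step 3 and should be taken as an assumption, as should the fixed iteration count.

```python
import math
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """Pick k seeds, each weighted by squared distance to the nearest seed so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:  # every point coincides with a seed
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def lloyd(points, centroids, iterations=10):
    dim = len(points[0])
    for _ in range(iterations):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = tuple(sum(m[d] for m in members) / len(members)
                                     for d in range(dim))
            else:
                # Assumed empty-cluster rule: reseed at the farthest point.
                centroids[c] = max(
                    points,
                    key=lambda p: min(sq_dist(p, x) for x in centroids))
    return centroids


rng = random.Random(42)
points = ([(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
          + [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(50)])
centroids = sorted(lloyd(points, kmeans_pp_init(points, 2, rng)))
print(centroids)
```

On this toy input the two centroids settle near (0, 0) and (10, 10). Each iteration costs `N * nlist` distance computations, which is where the training time goes at larger sizes.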
159+
160+
## Performance
161+
162+
### 100k vectors (COHERE 768-dim cosine)
163+
164+
```
165+
name qry(ms) recall
166+
───────────────────────────────────────────────
167+
ivf(q=int8,os=4),p=8 5.3ms 0.934 ← 6x faster than flat
168+
ivf(q=int8,os=4),p=16 5.4ms 0.968
169+
ivf(q=none),p=8 5.3ms 0.934
170+
ivf(q=binary,os=10),p=16 1.3ms 0.832 ← 26x faster than flat
171+
ivf(q=int8,os=4),p=32 7.4ms 0.990
172+
ivf(q=none),p=32 15.5ms 0.992
173+
int8(os=4) 18.7ms 0.996
174+
bit(os=8) 18.7ms 0.884
175+
flat 33.7ms 1.000
176+
```
177+
178+
### 1M vectors (COHERE 768-dim cosine)
179+
180+
```
181+
name insert train MB qry(ms) recall
182+
──────────────────────────────────────────────────────────────────────
183+
ivf(q=int8,os=4),p=8 163s 142s 4725 16.3ms 0.892
184+
ivf(q=binary,os=10),p=16 118s 144s 4073 17.7ms 0.830
185+
ivf(q=int8,os=4),p=16 163s 142s 4725 24.3ms 0.950
186+
ivf(q=int8,os=4),p=32 163s 142s 4725 41.6ms 0.980
187+
ivf(q=none),p=8 497s 144s 3101 52.1ms 0.890
188+
ivf(q=none),p=16 497s 144s 3101 56.6ms 0.950
189+
bit(os=8) 18s - 3048 83.5ms 0.918
190+
ivf(q=none),p=32 497s 144s 3101 103.9ms 0.980
191+
int8(os=4) 19s - 3689 169.1ms 0.994
192+
flat 20s - 2955 338.0ms 1.000
193+
```
194+
195+
**Best config at 1M: `ivf(quantizer=int8, oversample=4, nprobe=16)`**
196+
24ms query, 0.95 recall, 14x faster than flat, 7x faster than int8 baseline.
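
The speedup figures follow from the 1M table. A quick check of the arithmetic, with the latencies copied from the rows for `flat`, `int8(os=4)`, and `ivf(q=int8,os=4),p=16` (the `speedup` helper is just for illustration):

```python
def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


flat_ms, int8_ms, ivf_ms = 338.0, 169.1, 24.3  # query latencies at 1M vectors

print(f"vs flat: {speedup(flat_ms, ivf_ms):.1f}x")   # 13.9x
print(f"vs int8: {speedup(int8_ms, ivf_ms):.1f}x")   # 7.0x
```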

### Scaling Characteristics

| Metric | 100k | 1M | Scaling |
|--------|------|-----|---------|
| Flat query | 34ms | 338ms | 10x (linear) |
| IVF int8 p=16 | 5.4ms | 24.3ms | 4.5x (sublinear) |
| IVF insert rate | ~10k/s | ~6k/s | Slight degradation |
| Training (nlist=1000) | 13s | 142s | ~11x |

## Implementation

### File Structure

```
sqlite-vec-ivf-kmeans.c    K-means++ algorithm (pure C, no SQLite deps)
sqlite-vec-ivf.c           All IVF logic: parser, shadow tables, insert,
                           delete, query, centroid commands, quantization
sqlite-vec.c               ~50 lines of additions: struct fields, #includes,
                           dispatch hooks in parse/create/insert/delete/filter
```

Both IVF files are `#include`d into `sqlite-vec.c`. No Makefile changes needed.

### Key Design Decisions

1. **Fixed-size cells (64 vectors)** instead of one blob per centroid. Avoids
   SQLite overflow page traversal, which caused a 110x insert slowdown.

2. **Multiple cell rows per centroid** with an index on centroid_id. When a
   cell fills, a new row is created. Query scans all rows for probed centroids
   via `WHERE centroid_id IN (...)`.

3. **Always store full vectors** when quantizer != none (in `_ivf_vectors` KV
   table). Enables oversample re-ranking and point queries returning original
   precision.

4. **K-means in float32, quantize after**. Simpler than quantized k-means,
   and assignment accuracy doesn't suffer much since nprobe compensates.

5. **NEON SIMD for cosine distance**. Added `cosine_float_neon()` with 4-wide
   FMA for dot product + magnitudes. Benefits all vec0 queries, not just IVF.

6. **Runtime nprobe tuning**. `INSERT INTO t(id) VALUES ('nprobe=N')` changes
   the probe count without rebuilding, which enables fast parameter sweeps.
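
Decision 2 can be illustrated with Python's built-in `sqlite3` module. The sketch uses an ordinary table named `cells` with the column list from the shadow-table section; the table name, index name, and `read_cells` helper are stand-ins for illustration, not the extension's actual statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cells(
        centroid_id INTEGER, n_vectors INTEGER,
        validity BLOB, rowids BLOB, vectors BLOB)
""")
conn.execute("CREATE INDEX cells_centroid ON cells(centroid_id)")

# Centroid 0 has overflowed into a second row; centroids 1 and 2 have one each.
rows = [(0, 64), (0, 12), (1, 40), (2, 7)]
conn.executemany(
    "INSERT INTO cells VALUES (?, ?, x'', x'', x'')", rows)


def read_cells(conn, probed_centroids):
    """Fetch every cell row for the probed centroids in a single statement."""
    placeholders = ",".join("?" * len(probed_centroids))
    sql = ("SELECT centroid_id, n_vectors, validity, rowids, vectors "
           f"FROM cells WHERE centroid_id IN ({placeholders})")
    return conn.execute(sql, list(probed_centroids)).fetchall()


batch = read_cells(conn, [0, 2])
print([(cid, n) for cid, n, *_ in batch])
```

Probing centroids 0 and 2 returns three rows here, because centroid 0 spans two cell rows. One statement per query keeps the number of SQLite round trips independent of `nprobe`.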

### Optimization History

| Optimization | Impact |
|-------------|--------|
| Fixed-size cells (64 max) | 110x insert speedup |
| Skip chunk writes for IVF | 2x DB size reduction |
| NEON cosine distance | 2x query speedup + 13% recall improvement (correct metric) |
| Cached prepared statements | Eliminated per-insert prepare/finalize |
| Batched cell reads (IN clause) | Fewer SQLite queries per KNN |
| int8 quantization | 2.5x query speedup at same recall |
| Binary quantization | 32x less cell I/O |
| Oversample re-ranking | Recovers quantization recall loss |

## Remaining Work

See `ivf-benchmarks/TODO.md` for the full list. Key items:

- **Cache centroids in memory**: each insert re-reads all centroids from SQLite
- **Runtime oversample**: same pattern as nprobe runtime command
- **SIMD k-means**: training uses scalar distance, could be 4x faster
- **Top-k heap**: replace qsort with min-heap for large nprobe
- **IVF-PQ**: product quantization for better compression/recall tradeoff
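
The top-k heap item can be sketched as follows. This version keeps the k smallest distances in a bounded heap of negated distances, which is the usual way to get max-heap behavior from Python's `heapq`; it is an illustration of the idea, not the planned C change, and `top_k` is a hypothetical name.

```python
import heapq


def top_k(distances, k):
    """Return the k smallest (distance, rowid) pairs without sorting everything."""
    if k <= 0:
        return []
    heap = []  # holds (-distance, rowid); the worst kept candidate is heap[0]
    for dist, rowid in distances:
        if len(heap) < k:
            heapq.heappush(heap, (-dist, rowid))
        elif dist < -heap[0][0]:
            # Better than the current worst: replace it in O(log k).
            heapq.heapreplace(heap, (-dist, rowid))
    return sorted((-neg, rowid) for neg, rowid in heap)


scan = [(0.5, 1), (0.1, 2), (0.9, 3), (0.3, 4)]
print(top_k(scan, 2))  # [(0.1, 2), (0.3, 4)]
```

For n scanned candidates this costs O(n log k) instead of O(n log n), which matters when a large `nprobe` makes n much bigger than k.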

Makefile

Lines changed: 15 additions & 0 deletions
@@ -206,6 +206,21 @@ test-loadable-watch:
 test-unit:
 	$(CC) -DSQLITE_CORE -DSQLITE_VEC_TEST -DSQLITE_VEC_ENABLE_RESCORE tests/test-unit.c sqlite-vec.c vendor/sqlite3.c -I./ -Ivendor -o $(prefix)/test-unit && $(prefix)/test-unit
 
+# Standalone sqlite3 CLI with vec0 compiled in. Useful for benchmarking,
+# profiling (has debug symbols), and scripting without .load_extension.
+# make cli
+# dist/sqlite3 :memory: "SELECT vec_version()"
+# dist/sqlite3 < script.sql
+cli: sqlite-vec.h $(prefix)
+	$(CC) -O2 -g \
+		-DSQLITE_CORE \
+		-DSQLITE_EXTRA_INIT=core_init \
+		-DSQLITE_THREADSAFE=0 \
+		-Ivendor/ -I./ \
+		$(CFLAGS) \
+		vendor/sqlite3.c vendor/shell.c sqlite-vec.c examples/sqlite3-cli/core_init.c \
+		-ldl -lm -o $(prefix)/sqlite3
+
 fuzz-build:
 	$(MAKE) -C tests/fuzz all
 

benchmarks-ann/Makefile

Lines changed: 15 additions & 15 deletions
@@ -8,27 +8,20 @@ BASELINES = \
 	"brute-int8:type=baseline,variant=int8" \
 	"brute-bit:type=baseline,variant=bit"
 
-# --- Index-specific configs ---
-# Each index branch should add its own configs here. Example:
-#
-# DISKANN_CONFIGS = \
-# "diskann-R48-binary:type=diskann,R=48,L=128,quantizer=binary" \
-# "diskann-R72-int8:type=diskann,R=72,L=128,quantizer=int8"
-#
-# IVF_CONFIGS = \
-# "ivf-n128-p16:type=ivf,nlist=128,nprobe=16"
-#
-# ANNOY_CONFIGS = \
-# "annoy-t50:type=annoy,n_trees=50"
+# --- IVF configs ---
+IVF_CONFIGS = \
+	"ivf-n32-p8:type=ivf,nlist=32,nprobe=8" \
+	"ivf-n128-p16:type=ivf,nlist=128,nprobe=16" \
+	"ivf-n512-p32:type=ivf,nlist=512,nprobe=32"
 
 RESCORE_CONFIGS = \
 	"rescore-bit-os8:type=rescore,quantizer=bit,oversample=8" \
 	"rescore-bit-os16:type=rescore,quantizer=bit,oversample=16" \
 	"rescore-int8-os8:type=rescore,quantizer=int8,oversample=8"
 
-ALL_CONFIGS = $(BASELINES) $(RESCORE_CONFIGS)
+ALL_CONFIGS = $(BASELINES) $(RESCORE_CONFIGS) $(IVF_CONFIGS)
 
-.PHONY: seed ground-truth bench-smoke bench-rescore bench-10k bench-50k bench-100k bench-all \
+.PHONY: seed ground-truth bench-smoke bench-rescore bench-ivf bench-10k bench-50k bench-100k bench-all \
 	report clean
 
 # --- Data preparation ---
@@ -43,7 +36,8 @@ ground-truth: seed
 # --- Quick smoke test ---
 bench-smoke: seed
 	$(BENCH) --subset-size 5000 -k 10 -n 20 -o runs/smoke \
-	$(BASELINES)
+	"brute-float:type=baseline,variant=float" \
+	"ivf-quick:type=ivf,nlist=16,nprobe=4"
 
 bench-rescore: seed
 	$(BENCH) --subset-size 10000 -k 10 -o runs/rescore \
@@ -62,6 +56,12 @@ bench-100k: seed
 
 bench-all: bench-10k bench-50k bench-100k
 
+# --- IVF across sizes ---
+bench-ivf: seed
+	$(BENCH) --subset-size 10000 -k 10 -o runs/ivf $(BASELINES) $(IVF_CONFIGS)
+	$(BENCH) --subset-size 50000 -k 10 -o runs/ivf $(BASELINES) $(IVF_CONFIGS)
+	$(BENCH) --subset-size 100000 -k 10 -o runs/ivf $(BASELINES) $(IVF_CONFIGS)
+
 # --- Report ---
 report:
 	@echo "Use: sqlite3 runs/<dir>/results.db 'SELECT * FROM bench_results ORDER BY recall DESC'"

benchmarks-ann/bench.py

Lines changed: 42 additions & 0 deletions
@@ -173,6 +173,48 @@ def _rescore_describe(params):
     }
 
 
+# ============================================================================
+# IVF implementation
+# ============================================================================
+
+
+def _ivf_create_table_sql(params):
+    return (
+        f"CREATE VIRTUAL TABLE vec_items USING vec0("
+        f" id integer primary key,"
+        f" embedding float[768] distance_metric=cosine"
+        f" indexed by ivf("
+        f" nlist={params['nlist']},"
+        f" nprobe={params['nprobe']}"
+        f" )"
+        f")"
+    )
+
+
+def _ivf_post_insert_hook(conn, params):
+    print(" Training k-means centroids...", flush=True)
+    t0 = time.perf_counter()
+    conn.execute("INSERT INTO vec_items(id) VALUES ('compute-centroids')")
+    conn.commit()
+    elapsed = time.perf_counter() - t0
+    print(f" Training done in {elapsed:.1f}s", flush=True)
+    return elapsed
+
+
+def _ivf_describe(params):
+    return f"ivf nlist={params['nlist']:<4} nprobe={params['nprobe']}"
+
+
+INDEX_REGISTRY["ivf"] = {
+    "defaults": {"nlist": 128, "nprobe": 16},
+    "create_table_sql": _ivf_create_table_sql,
+    "insert_sql": None,
+    "post_insert_hook": _ivf_post_insert_hook,
+    "run_query": None,
+    "describe": _ivf_describe,
+}
+
+
 # ============================================================================
 # Config parsing
 # ============================================================================
