| 1 | +# IVF Index for sqlite-vec |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +IVF (Inverted File Index) is an approximate nearest neighbor index for |
| 6 | +sqlite-vec's `vec0` virtual table. It partitions vectors into clusters via |
| 7 | +k-means, then at query time only scans the nearest clusters instead of all |
| 8 | +vectors. Combined with scalar or binary quantization, this gives 5-20x query |
| 9 | +speedups over brute-force with tunable recall. |
| 10 | + |
| 11 | +## SQL API |
| 12 | + |
| 13 | +### Table Creation |
| 14 | + |
| 15 | +```sql |
| 16 | +CREATE VIRTUAL TABLE vec_items USING vec0( |
| 17 | + id INTEGER PRIMARY KEY, |
| 18 | + embedding float[768] distance_metric=cosine |
| 19 | + INDEXED BY ivf(nlist=128, nprobe=16) |
| 20 | +); |
| 21 | + |
| 22 | +-- Alternative definition with int8 quantization (4x smaller cells, re-scored for recall) |
| 23 | +CREATE VIRTUAL TABLE vec_items USING vec0( |
| 24 | + id INTEGER PRIMARY KEY, |
| 25 | + embedding float[768] distance_metric=cosine |
| 26 | + INDEXED BY ivf(nlist=128, nprobe=16, quantizer=int8, oversample=4) |
| 27 | +); |
| 28 | +``` |
| 29 | + |
| 30 | +### Parameters |
| 31 | + |
| 32 | +| Parameter | Values | Default | Description | |
| 33 | +|-----------|--------|---------|-------------| |
| 34 | +| `nlist` | 1-65536, or 0 | 128 | Number of k-means clusters. Rule of thumb: `sqrt(N)` | |
| 35 | +| `nprobe` | 1-nlist | 10 | Clusters to search at query time. More = better recall, slower | |
| 36 | +| `quantizer` | `none`, `int8`, `binary` | `none` | How vectors are stored in cells | |
| 37 | +| `oversample` | >= 1 | 1 | Re-rank `oversample * k` candidates with full-precision distance | |
| 38 | + |
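As a quick illustration of the `sqrt(N)` rule of thumb for `nlist`, the sketch below rounds the square root of the vector count and clamps it to the documented 1-65536 range. The helper name `suggest_nlist` is hypothetical and not part of sqlite-vec.

```python
import math

def suggest_nlist(n_vectors: int, lo: int = 1, hi: int = 65536) -> int:
    """Hypothetical helper: round sqrt(N), then clamp to the documented nlist range."""
    return max(lo, min(hi, round(math.sqrt(n_vectors))))

print(suggest_nlist(100_000))    # 316
print(suggest_nlist(1_000_000))  # 1000
```

For 1M vectors this lands on the `nlist=1000` setting that appears in the scaling table.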
| 39 | +### Inserting Vectors |
| 40 | + |
| 41 | +```sql |
| 42 | +-- Works immediately, even before training |
| 43 | +INSERT INTO vec_items(id, embedding) VALUES (1, :vector); |
| 44 | +``` |
| 45 | + |
| 46 | +Before centroids exist, vectors go to an "unassigned" partition and queries fall |
| 47 | +back to a brute-force scan. After training, new inserts go to the nearest centroid. |
| 48 | + |
| 49 | +### Training (Computing Centroids) |
| 50 | + |
| 51 | +```sql |
| 52 | +-- Run built-in k-means on all vectors |
| 53 | +INSERT INTO vec_items(id) VALUES ('compute-centroids'); |
| 54 | +``` |
| 55 | + |
| 56 | +This loads all vectors into memory, runs k-means++ with Lloyd's algorithm, |
| 57 | +creates quantized centroids, and redistributes all vectors into cluster cells. |
| 58 | +It's a blocking operation — run it once after bulk insert. |
| 59 | + |
| 60 | +### Manual Centroid Import |
| 61 | + |
| 62 | +```sql |
| 63 | +-- Import externally-computed centroids |
| 64 | +INSERT INTO vec_items(id, embedding) VALUES ('set-centroid:0', :centroid_0); |
| 65 | +INSERT INTO vec_items(id, embedding) VALUES ('set-centroid:1', :centroid_1); |
| 66 | + |
| 67 | +-- Assign vectors to imported centroids |
| 68 | +INSERT INTO vec_items(id) VALUES ('assign-vectors'); |
| 69 | +``` |
| 70 | + |
| 71 | +### Runtime Parameter Tuning |
| 72 | + |
| 73 | +```sql |
| 74 | +-- Change nprobe without rebuilding the index |
| 75 | +INSERT INTO vec_items(id) VALUES ('nprobe=32'); |
| 76 | +``` |
| 77 | + |
| 78 | +### KNN Queries |
| 79 | + |
| 80 | +```sql |
| 81 | +-- Same syntax as standard vec0 |
| 82 | +SELECT id, distance |
| 83 | +FROM vec_items |
| 84 | +WHERE embedding MATCH :query AND k = 10; |
| 85 | +``` |
| 86 | + |
| 87 | +### Other Commands |
| 88 | + |
| 89 | +```sql |
| 90 | +-- Remove centroids, move all vectors back to unassigned |
| 91 | +INSERT INTO vec_items(id) VALUES ('clear-centroids'); |
| 92 | +``` |
| 93 | + |
| 94 | +## How It Works |
| 95 | + |
| 96 | +### Architecture |
| 97 | + |
| 98 | +``` |
| 99 | +User vector (float32) |
| 100 | + → quantize to int8/binary (if quantizer != none) |
| 101 | + → find nearest centroid (quantized distance) |
| 102 | + → store quantized vector in cell blob |
| 103 | + → store full vector in KV table (if quantizer != none) |
| 104 | + → query: |
| 105 | + 1. quantize query vector |
| 106 | + 2. find top nprobe centroids by quantized distance |
| 107 | + 3. scan cell blobs: quantized distance (fast, small I/O) |
| 108 | + 4. if oversample > 1: re-score top oversample*k with full vectors |
| 109 | + 5. return top k |
| 110 | +``` |
| 111 | + |
| 112 | +### Shadow Tables |
| 113 | + |
| 114 | +For a table `vec_items` with vector column index 0: |
| 115 | + |
| 116 | +| Table | Schema | Purpose | |
| 117 | +|-------|--------|---------| |
| 118 | +| `vec_items_ivf_centroids00` | `centroid_id PK, centroid BLOB` | K-means centroids (quantized) | |
| 119 | +| `vec_items_ivf_cells00` | `centroid_id, n_vectors, validity BLOB, rowids BLOB, vectors BLOB` | Packed vector cells, 64 vectors max per row. Multiple rows per centroid. Index on centroid_id. | |
| 120 | +| `vec_items_ivf_rowid_map00` | `rowid PK, cell_id, slot` | Maps vector rowid → cell location for O(1) delete | |
| 121 | +| `vec_items_ivf_vectors00` | `rowid PK, vector BLOB` | Full-precision vectors (only when quantizer != none) | |
| 122 | + |
| 123 | +### Cell Storage |
| 124 | + |
| 125 | +Cells use packed blob storage identical to vec0's chunk layout: |
| 126 | +- **validity**: bitmap (1 bit per slot) marking live vectors |
| 127 | +- **rowids**: packed i64 array |
| 128 | +- **vectors**: packed array of quantized vectors |
| 129 | + |
| 130 | +Cells are capped at 64 vectors (~200KB at 768-dim float32, ~48KB for int8, |
| 131 | +~6KB for binary). When a cell fills, a new row is created for the same |
| 132 | +centroid. This avoids SQLite overflow-page traversal, which made inserts |
| 133 | +110x slower when cells were unbounded. |
| 134 | + |
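To make the cell sizes and the packed layout concrete, here is a toy in-memory model of one cell. The bit order in the validity bitmap, the little-endian rowid encoding, and the append-only slot policy are assumptions made for illustration; the actual C layout may differ.

```python
import struct

CELL_CAPACITY = 64  # vectors per cell row, as described above

def cell_vector_bytes(dim: int, quantizer: str) -> int:
    """Size in bytes of a full cell's vectors blob for a given storage format."""
    bytes_per_vector = {"none": dim * 4, "int8": dim, "binary": dim // 8}[quantizer]
    return CELL_CAPACITY * bytes_per_vector

class Cell:
    """Toy cell: validity bitmap, packed i64 rowids, packed vectors."""

    def __init__(self, bytes_per_vector: int):
        self.bpv = bytes_per_vector
        self.validity = bytearray(CELL_CAPACITY // 8)
        self.rowids = bytearray(CELL_CAPACITY * 8)
        self.vectors = bytearray(CELL_CAPACITY * bytes_per_vector)
        self.n_vectors = 0

    def insert(self, rowid: int, vec: bytes) -> int:
        """Append into the next slot; return the slot, or -1 if the cell is full."""
        if len(vec) != self.bpv:
            raise ValueError("vector has the wrong byte length for this cell")
        if self.n_vectors == CELL_CAPACITY:
            return -1  # the caller would start a new cell row for the same centroid
        slot = self.n_vectors
        self.validity[slot // 8] |= 1 << (slot % 8)
        struct.pack_into("<q", self.rowids, slot * 8, rowid)
        self.vectors[slot * self.bpv:(slot + 1) * self.bpv] = vec
        self.n_vectors += 1
        return slot

    def delete(self, slot: int) -> None:
        """Clear the validity bit; the slot's bytes are left in place."""
        self.validity[slot // 8] &= ~(1 << (slot % 8)) & 0xFF

    def live_rowids(self):
        return [struct.unpack_from("<q", self.rowids, s * 8)[0]
                for s in range(self.n_vectors)
                if (self.validity[s // 8] >> (s % 8)) & 1]

for q in ("none", "int8", "binary"):
    print(q, cell_vector_bytes(768, q))  # 196608, 49152, 6144 bytes
```

The three printed sizes correspond to the ~200KB, ~48KB and ~6KB figures quoted for 768-dim vectors. Because a delete only flips one validity bit at a known `(cell_id, slot)`, it does not require rewriting the vectors blob.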
| 135 | +### Quantization |
| 136 | + |
| 137 | +**int8**: Each float32 dimension clamped to [-1,1] and scaled to int8 |
| 138 | +[-127,127]. 4x storage reduction. Distance computed via int8 L2. |
| 139 | + |
| 140 | +**binary**: Sign-bit quantization — each bit is 1 if the float is positive. |
| 141 | +32x storage reduction. Distance computed via hamming distance. |
| 142 | + |
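The two quantizers can be sketched as follows. This is an illustrative Python version, not the C implementation: rounding to nearest for int8 and least-significant-bit-first packing for binary are assumptions, and all function names are hypothetical.

```python
def quantize_int8(vec):
    """Clamp each component to [-1, 1], then scale to an integer in [-127, 127]."""
    return [round(max(-1.0, min(1.0, x)) * 127) for x in vec]

def quantize_binary(vec):
    """Pack sign bits (1 if x > 0) into bytes, 8 dimensions per byte."""
    out = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

def l2_int8(a, b):
    """L2 distance between two int8 codes."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two binary codes."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

v = [0.25, -2.0, 1.0, 0.0, 0.5, -1.0, 2.0, 3.0]
print(quantize_int8(v))
print(quantize_binary(v).hex())
print(hamming(quantize_binary(v), bytes(1)))
```

Values outside [-1, 1] saturate at ±127, so int8 works best on normalized embeddings. The binary code keeps only the sign of each dimension.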
| 143 | +**Oversample re-ranking**: When `oversample > 1`, the quantized scan collects |
| 144 | +`oversample * k` candidates, then looks up each candidate's full-precision |
| 145 | +vector from the KV table and re-computes exact distance. This recovers nearly |
| 146 | +all recall lost from quantization. At oversample=4 with int8, recall is within |
| 147 | +0.002 of non-quantized IVF at the same nprobe in the benchmarks below. |
| 148 | + |
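A small worked example shows why re-ranking helps. The sketch below assumes int8 codes are compared with squared L2 and uses a plain dict as a stand-in for the full-precision KV table; `knn_rerank` is a hypothetical name.

```python
def quantize_int8(vec):
    return [round(max(-1.0, min(1.0, x)) * 127) for x in vec]

def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_rerank(query, full_vectors, k, oversample):
    """full_vectors: dict rowid -> float vector (stands in for the KV table)."""
    q8 = quantize_int8(query)
    # Pass 1: rank stored vectors by distance between quantized codes.
    coarse = sorted(full_vectors, key=lambda r: sq_l2(q8, quantize_int8(full_vectors[r])))
    shortlist = coarse[: oversample * k]
    # Pass 2: re-score only the shortlist with full-precision distance.
    return sorted(shortlist, key=lambda r: sq_l2(query, full_vectors[r]))[:k]

# Rowid 2 is the true nearest neighbour of 0.1, but after int8 rounding
# rowid 1 shares the query's code (13) while rowid 2 rounds to 12.
full = {1: (0.104,), 2: (0.0965,), 3: (0.9,)}
print(knn_rerank((0.1,), full, k=1, oversample=1))  # [1]
print(knn_rerank((0.1,), full, k=1, oversample=2))  # [2]
```

With `oversample=1` the quantized ranking is returned as-is and misses the true neighbour. With `oversample=2` both close candidates reach the exact pass and the order is corrected.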
| 149 | +### K-Means |
| 150 | + |
| 151 | +Uses Lloyd's algorithm with k-means++ initialization: |
| 152 | +1. K-means++ picks initial centroids weighted by distance |
| 153 | +2. Lloyd's iterations: assign vectors to nearest centroid, recompute centroids as cluster means |
| 154 | +3. Empty cluster handling: reassign to farthest point |
| 155 | +4. K-means runs in float32; centroids are quantized before storage |
| 156 | + |
| 157 | +Training data: 16× nlist vectors are recommended. At nlist=1000, that's 16k |
| 158 | +vectors. In the 1M-vector benchmark below, k-means takes ~140s on 768-dim data. |
| 159 | + |
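The training loop can be sketched in pure Python as below. It is a simplified stand-in for the C implementation under a few assumptions: seeding weights use squared distance (the standard k-means++ choice), an empty cluster is reseeded at the point farthest from its nearest centroid, and the loop runs a fixed number of iterations. Final centroid quantization is omitted.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10, seed=0):
    rng = random.Random(seed)
    # k-means++ seeding: first centroid uniform, the rest weighted by squared
    # distance to the nearest centroid chosen so far.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
            else:
                # Empty cluster: reseed at the point farthest from any centroid.
                centroids[c] = max(points, key=lambda p: min(sq_dist(p, x) for x in centroids))
    return centroids

blob_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
blob_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
print(sorted(kmeans(blob_a + blob_b, k=2)))
```

Lloyd's algorithm only finds a local optimum, so the result depends on the seeding. The distance-weighted seeding makes widely separated initial centroids much more likely than uniform sampling would.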
| 160 | +## Performance |
| 161 | + |
| 162 | +### 100k vectors (COHERE 768-dim cosine) |
| 163 | + |
| 164 | +``` |
| 165 | + name qry(ms) recall |
| 166 | +─────────────────────────────────────────────── |
| 167 | + ivf(q=int8,os=4),p=8 5.3ms 0.934 ← 6x faster than flat |
| 168 | + ivf(q=int8,os=4),p=16 5.4ms 0.968 |
| 169 | + ivf(q=none),p=8 5.3ms 0.934 |
| 170 | + ivf(q=binary,os=10),p=16 1.3ms 0.832 ← 26x faster than flat |
| 171 | + ivf(q=int8,os=4),p=32 7.4ms 0.990 |
| 172 | + ivf(q=none),p=32 15.5ms 0.992 |
| 173 | + int8(os=4) 18.7ms 0.996 |
| 174 | + bit(os=8) 18.7ms 0.884 |
| 175 | + flat 33.7ms 1.000 |
| 176 | +``` |
| 177 | + |
| 178 | +### 1M vectors (COHERE 768-dim cosine) |
| 179 | + |
| 180 | +``` |
| 181 | + name insert train MB qry(ms) recall |
| 182 | +────────────────────────────────────────────────────────────────────── |
| 183 | + ivf(q=int8,os=4),p=8 163s 142s 4725 16.3ms 0.892 |
| 184 | + ivf(q=binary,os=10),p=16 118s 144s 4073 17.7ms 0.830 |
| 185 | + ivf(q=int8,os=4),p=16 163s 142s 4725 24.3ms 0.950 |
| 186 | + ivf(q=int8,os=4),p=32 163s 142s 4725 41.6ms 0.980 |
| 187 | + ivf(q=none),p=8 497s 144s 3101 52.1ms 0.890 |
| 188 | + ivf(q=none),p=16 497s 144s 3101 56.6ms 0.950 |
| 189 | + bit(os=8) 18s - 3048 83.5ms 0.918 |
| 190 | + ivf(q=none),p=32 497s 144s 3101 103.9ms 0.980 |
| 191 | + int8(os=4) 19s - 3689 169.1ms 0.994 |
| 192 | + flat 20s - 2955 338.0ms 1.000 |
| 193 | +``` |
| 194 | + |
| 195 | +**Best config at 1M: `ivf(quantizer=int8, oversample=4, nprobe=16)`** — |
| 196 | +24ms query, 0.95 recall, 14x faster than flat, 7x faster than int8 baseline. |
| 197 | + |
| 198 | +### Scaling Characteristics |
| 199 | + |
| 200 | +| Metric | 100k | 1M | Scaling | |
| 201 | +|--------|------|-----|---------| |
| 202 | +| Flat query | 34ms | 338ms | 10x (linear) | |
| 203 | +| IVF int8 p=16 | 5.4ms | 24.3ms | 4.5x (sublinear) | |
| 204 | +| IVF insert rate | ~10k/s | ~6k/s | Slight degradation | |
| 205 | +| Training (nlist=1000) | 13s | 142s | ~11x | |
| 206 | + |
| 207 | +## Implementation |
| 208 | + |
| 209 | +### File Structure |
| 210 | + |
| 211 | +``` |
| 212 | +sqlite-vec-ivf-kmeans.c K-means++ algorithm (pure C, no SQLite deps) |
| 213 | +sqlite-vec-ivf.c All IVF logic: parser, shadow tables, insert, |
| 214 | + delete, query, centroid commands, quantization |
| 215 | +sqlite-vec.c ~50 lines of additions: struct fields, #includes, |
| 216 | + dispatch hooks in parse/create/insert/delete/filter |
| 217 | +``` |
| 218 | + |
| 219 | +Both IVF files are `#include`d into `sqlite-vec.c`. No Makefile changes needed. |
| 220 | + |
| 221 | +### Key Design Decisions |
| 222 | + |
| 223 | +1. **Fixed-size cells (64 vectors)** instead of one blob per centroid. Avoids |
| 224 | + SQLite overflow page traversal which caused 110x insert slowdown. |
| 225 | + |
| 226 | +2. **Multiple cell rows per centroid** with an index on centroid_id. When a |
| 227 | + cell fills, a new row is created. Query scans all rows for probed centroids |
| 228 | + via `WHERE centroid_id IN (...)`. |
| 229 | + |
| 230 | +3. **Always store full vectors** when quantizer != none (in `_ivf_vectors` KV |
| 231 | + table). Enables oversample re-ranking and point queries returning original |
| 232 | + precision. |
| 233 | + |
| 234 | +4. **K-means in float32, quantize after**. Simpler than quantized k-means, |
| 235 | + and assignment accuracy doesn't suffer much since nprobe compensates. |
| 236 | + |
| 237 | +5. **NEON SIMD for cosine distance**. Added `cosine_float_neon()` with 4-wide |
| 238 | + FMA for dot product + magnitudes. Benefits all vec0 queries, not just IVF. |
| 239 | + |
| 240 | +6. **Runtime nprobe tuning**. `INSERT INTO t(id) VALUES ('nprobe=N')` changes |
| 241 | + the probe count without rebuilding — enables fast parameter sweeps. |
| 242 | + |
| 243 | +### Optimization History |
| 244 | + |
| 245 | +| Optimization | Impact | |
| 246 | +|-------------|--------| |
| 247 | +| Fixed-size cells (64 max) | 110x insert speedup | |
| 248 | +| Skip chunk writes for IVF | 2x DB size reduction | |
| 249 | +| NEON cosine distance | 2x query speedup + 13% recall improvement (correct metric) | |
| 250 | +| Cached prepared statements | Eliminated per-insert prepare/finalize | |
| 251 | +| Batched cell reads (IN clause) | Fewer SQLite queries per KNN | |
| 252 | +| int8 quantization | 2.5x query speedup at same recall | |
| 253 | +| Binary quantization | 32x less cell I/O | |
| 254 | +| Oversample re-ranking | Recovers quantization recall loss | |
| 255 | + |
| 256 | +## Remaining Work |
| 257 | + |
| 258 | +See `ivf-benchmarks/TODO.md` for the full list. Key items: |
| 259 | + |
| 260 | +- **Cache centroids in memory** — each insert re-reads all centroids from SQLite |
| 261 | +- **Runtime oversample** — same pattern as nprobe runtime command |
| 262 | +- **SIMD k-means** — training uses scalar distance, could be 4x faster |
| 263 | +- **Top-k heap** — replace qsort with min-heap for large nprobe |
| 264 | +- **IVF-PQ** — product quantization for better compression/recall tradeoff |
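As a sketch of the top-k heap item, the snippet below keeps the `k` smallest distances with a bounded heap instead of sorting every candidate. It is written in Python for illustration rather than in the project's C, and it builds a max-heap by negating distances on top of `heapq`, which is one possible approach and not necessarily the one planned.

```python
import heapq

def top_k_smallest(candidates, k):
    """Keep the k smallest (distance, rowid) pairs using a bounded heap."""
    if k <= 0:
        return []
    heap = []  # entries are (-distance, rowid); heap[0] is the worst kept candidate
    for dist, rowid in candidates:
        if len(heap) < k:
            heapq.heappush(heap, (-dist, rowid))
        elif -heap[0][0] > dist:
            # The new candidate beats the current worst: swap it in.
            heapq.heapreplace(heap, (-dist, rowid))
    return sorted((-d, r) for d, r in heap)

print(top_k_smallest([(0.9, 1), (0.1, 2), (0.5, 3), (0.3, 4)], 2))
# [(0.1, 2), (0.3, 4)]
```

This costs O(n log k) comparisons instead of O(n log n) for a full sort, which matters most when a large `nprobe` produces many candidates.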