asg017
diff --git a/IVF_PLAN.md b/IVF_PLAN.md
Lines changed: 264 additions & 0 deletions
diff --git a/Makefile b/Makefile
Lines changed: 15 additions & 0 deletions
diff --git a/benchmarks-ann/Makefile b/benchmarks-ann/Makefile
Lines changed: 15 additions & 15 deletions
diff --git a/benchmarks-ann/bench.py b/benchmarks-ann/bench.py
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,264 @@
+# IVF Index for sqlite-vec
+
+## Overview
+
+IVF (Inverted File Index) is an approximate nearest neighbor index for
+sqlite-vec's `vec0` virtual table. It partitions vectors into clusters via
+k-means, then at query time only scans the nearest clusters instead of all
+vectors. Combined with scalar or binary quantization, this gives 5-20x query
+speedups over brute-force with tunable recall.
+
+## SQL API
+
+### Table Creation
+
+```sql
+CREATE VIRTUAL TABLE vec_items USING vec0(
+  id INTEGER PRIMARY KEY,
+  embedding float[768] distance_metric=cosine
+    INDEXED BY ivf(nlist=128, nprobe=16)
+);
+
+-- With quantization (4x smaller cells, rescore for recall)
+CREATE VIRTUAL TABLE vec_items USING vec0(
+  id INTEGER PRIMARY KEY,
+  embedding float[768] distance_metric=cosine
+    INDEXED BY ivf(nlist=128, nprobe=16, quantizer=int8, oversample=4)
+);
+```
+
+### Parameters
+
+| Parameter | Values | Default | Description |
+|-----------|--------|---------|-------------|
+| `nlist` | 1-65536, or 0 | 128 | Number of k-means clusters. Rule of thumb: `sqrt(N)` |
+| `nprobe` | 1-nlist | 10 | Clusters to search at query time. More = better recall, slower |
+| `quantizer` | `none`, `int8`, `binary` | `none` | How vectors are stored in cells |
+| `oversample` | >= 1 | 1 | Re-rank `oversample * k` candidates with full-precision distance |
+
+### Inserting Vectors
+
+```sql
+-- Works immediately, even before training
+INSERT INTO vec_items(id, embedding) VALUES (1, :vector);
+```
+
+Before centroids exist, vectors go to an "unassigned" partition and queries
+fall back to a brute-force scan. After training, new inserts are assigned to
+the nearest centroid.
+
+### Training (Computing Centroids)
+
+```sql
+-- Run built-in k-means on all vectors
+INSERT INTO vec_items(id) VALUES ('compute-centroids');
+```
+
+This loads all vectors into memory, runs k-means++ with Lloyd's algorithm,
+creates quantized centroids, and redistributes all vectors into cluster cells.
+It's a blocking operation — run it once after bulk insert.
+
+### Manual Centroid Import
+
+```sql
+-- Import externally-computed centroids
+INSERT INTO vec_items(id, embedding) VALUES ('set-centroid:0', :centroid_0);
+INSERT INTO vec_items(id, embedding) VALUES ('set-centroid:1', :centroid_1);
+
+-- Assign vectors to imported centroids
+INSERT INTO vec_items(id) VALUES ('assign-vectors');
+```
+
+### Runtime Parameter Tuning
+
+```sql
+-- Change nprobe without rebuilding the index
+INSERT INTO vec_items(id) VALUES ('nprobe=32');
+```
+
+### KNN Queries
+
+```sql
+-- Same syntax as standard vec0
+SELECT id, distance
+FROM vec_items
+WHERE embedding MATCH :query AND k = 10;
+```
+
+### Other Commands
+
+```sql
+-- Remove centroids, move all vectors back to unassigned
+INSERT INTO vec_items(id) VALUES ('clear-centroids');
+```
+
+## How It Works
+
+### Architecture
+
+```
+User vector (float32)
+  → quantize to int8/binary (if quantizer != none)
+  → find nearest centroid (quantized distance)
+  → store quantized vector in cell blob
+  → store full vector in KV table (if quantizer != none)
+  → query:
+      1. quantize query vector
+      2. find top nprobe centroids by quantized distance
+      3. scan cell blobs: quantized distance (fast, small I/O)
+      4. if oversample > 1: re-score top oversample*k candidates with full vectors
+      5. return top k
+```
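The query path above can be sketched in plain Python. This is an illustrative model only, not the extension's C code: `ivf_query`, the `cells` dictionary, and the use of squared L2 distance are assumptions made for the example, and quantization and re-scoring are left out.

```python
import heapq


def l2_sq(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ivf_query(query, centroids, cells, k, nprobe):
    """Return the k nearest (distance, rowid) pairs from the nprobe closest cells.

    centroids: list of vectors, indexed by centroid id (hypothetical layout)
    cells:     dict mapping centroid id -> list of (rowid, vector)
    """
    # Rank centroids by distance to the query and keep the nprobe nearest.
    probed = heapq.nsmallest(
        nprobe, range(len(centroids)), key=lambda c: l2_sq(query, centroids[c])
    )
    # Scan only the vectors stored in the probed cells.
    candidates = (
        (l2_sq(query, vec), rowid)
        for c in probed
        for rowid, vec in cells.get(c, [])
    )
    # Keep the k best candidates.
    return heapq.nsmallest(k, candidates)


centroids = [[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]]
cells = {
    0: [(1, [0.1, 0.2]), (2, [1.0, -1.0])],
    1: [(3, [9.0, 11.0])],
    2: [(4, [-9.5, 10.5])],
}
result = ivf_query([0.0, 0.1], centroids, cells, k=2, nprobe=1)
print([rowid for _, rowid in result])
```

With `nprobe=1` only the cell of the nearest centroid is scanned, so vectors in other cells can never be returned. Raising `nprobe` widens the scan, which is the recall/latency trade-off described in the parameter table.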
+
+### Shadow Tables
+
+For a table `vec_items` with vector column index 0:
+
+| Table | Schema | Purpose |
+|-------|--------|---------|
+| `vec_items_ivf_centroids00` | `centroid_id PK, centroid BLOB` | K-means centroids (quantized) |
+| `vec_items_ivf_cells00` | `centroid_id, n_vectors, validity BLOB, rowids BLOB, vectors BLOB` | Packed vector cells, 64 vectors max per row. Multiple rows per centroid. Index on centroid_id. |
+| `vec_items_ivf_rowid_map00` | `rowid PK, cell_id, slot` | Maps vector rowid → cell location for O(1) delete |
+| `vec_items_ivf_vectors00` | `rowid PK, vector BLOB` | Full-precision vectors (only when quantizer != none) |
+
+### Cell Storage
+
+Cells use packed blob storage identical to vec0's chunk layout:
+- **validity**: bitmap (1 bit per slot) marking live vectors
+- **rowids**: packed i64 array
+- **vectors**: packed array of quantized vectors
+
+Cells are capped at 64 vectors (~200KB at 768-dim float32, ~48KB for int8,
+~6KB for binary). When a cell fills, a new row is created for the same
+centroid. This avoids SQLite overflow-page traversal, which caused a 110x
+insert slowdown with unbounded cells.
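The cell sizes quoted above follow from simple arithmetic. The sketch below reproduces them; the helper name `cell_vector_bytes` is hypothetical, and the figures cover only the blobs named in the list above, not any SQLite row overhead.

```python
CELL_CAPACITY = 64  # vectors per cell row, as stated above
DIMS = 768


def cell_vector_bytes(dims, bits_per_dim, capacity=CELL_CAPACITY):
    """Bytes used by the packed `vectors` blob of one full cell."""
    return capacity * dims * bits_per_dim // 8


sizes = {
    "float32": cell_vector_bytes(DIMS, 32),
    "int8": cell_vector_bytes(DIMS, 8),
    "binary": cell_vector_bytes(DIMS, 1),
}
validity_bytes = CELL_CAPACITY // 8  # bitmap, 1 bit per slot
rowid_bytes = CELL_CAPACITY * 8      # packed i64 array

for name, n in sizes.items():
    print(f"{name:8s} {n:7d} bytes ({n / 1024:.0f} KiB)")
```

A full float32 cell is 196,608 bytes (192 KiB), which the text rounds to ~200KB. The validity bitmap and rowid array add 8 and 512 bytes respectively, so the vectors blob dominates the row size.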
+
+### Quantization
+
+**int8**: Each float32 dimension clamped to [-1,1] and scaled to int8
+[-127,127]. 4x storage reduction. Distance computed via int8 L2.
+
+**binary**: Sign-bit quantization — each bit is 1 if the float is positive.
+32x storage reduction. Distance computed via hamming distance.
+
+**Oversample re-ranking**: When `oversample > 1`, the quantized scan collects
+`oversample * k` candidates, then looks up each candidate's full-precision
+vector from the KV table and recomputes the exact distance. This recovers
+nearly all recall lost to quantization. At oversample=4 with int8, recall
+stays within 0.002 of non-quantized IVF at the same nprobe in the benchmarks
+below.
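A minimal Python sketch of the two quantizers and the re-ranking step described above. The rounding mode, the bit order within each byte, and the use of squared L2 for the exact distance are assumptions for illustration; the C implementation may differ on all three.

```python
def quantize_int8(vec):
    """Clamp each component to [-1, 1] and scale to an integer in [-127, 127]."""
    return [round(max(-1.0, min(1.0, x)) * 127) for x in vec]


def quantize_binary(vec):
    """Pack sign bits into bytes: bit i is 1 when vec[i] is positive."""
    out = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            out[i // 8] |= 1 << (i % 8)  # assumed least-significant-bit-first
    return bytes(out)


def hamming(a, b):
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def rescore(query, candidate_rowids, full_vectors, k):
    """Re-rank candidates by exact squared L2 on full-precision vectors."""
    scored = sorted(
        (sum((q - v) ** 2 for q, v in zip(query, full_vectors[r])), r)
        for r in candidate_rowids
    )
    return [r for _, r in scored[:k]]


print(quantize_int8([1.5, -2.0, 0.0, 1.0]))
print(quantize_binary([0.3, -0.1, 0.0, 2.0, -5.0, 1.0, 1.0, -1.0, 0.5]).hex())
```

In this model the quantized scan would hand `oversample * k` rowids to `rescore`, which reads the full vectors (the role of the KV table) and keeps the best `k`.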
+
+### K-Means
+
+Uses Lloyd's algorithm with k-means++ initialization:
+1. K-means++ picks initial centroids weighted by distance
+2. Lloyd's iterations: assign vectors to nearest centroid, recompute centroids as cluster means
+3. Empty cluster handling: reassign to farthest point
+4. K-means runs in float32; centroids are quantized before storage
+
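+Training data: at least 16× nlist vectors is recommended. At nlist=1000,
+that's 16k vectors. For reference, training with nlist=1000 on 768-dim data
+takes ~13s at 100k vectors and ~142s at 1M (see Scaling Characteristics).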
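The training steps listed above can be sketched in pure Python. This is a toy model under stated assumptions: the function names are hypothetical, the iteration count is fixed, the input must contain at least `k` distinct points, and the empty-cluster rule shown is one reading of "reassign to farthest point".

```python
import random


def dist_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centroids):
    """Index of the centroid closest to p."""
    return min(range(len(centroids)), key=lambda c: dist_sq(p, centroids[c]))


def kmeans_pp_init(points, k, rng):
    """k-means++: first centroid uniform, the rest weighted by squared distance."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        weights = [min(dist_sq(p, c) for c in centroids) for p in points]
        centroids.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centroids


def lloyd(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        assign = [nearest(p, centroids) for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
            else:
                # Empty cluster (assumed rule): move the centroid to the point
                # farthest from the centroid it is currently assigned to.
                far = max(range(len(points)),
                          key=lambda i: dist_sq(points[i], centroids[assign[i]]))
                centroids[c] = list(points[far])
    return centroids


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids = lloyd(points, kmeans_pp_init(points, 2, random.Random(0)))
labels = [nearest(p, centroids) for p in points]
print(labels)
```

The sketch runs entirely in floating point, matching step 4 above: centroids would be quantized only after training finishes.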
+
+## Performance
+
+### 100k vectors (COHERE 768-dim cosine)
+
+```
+                          name  qry(ms)  recall
+───────────────────────────────────────────────
+          ivf(q=int8,os=4),p=8    5.3ms  0.934  ← 6x faster than flat
+         ivf(q=int8,os=4),p=16    5.4ms  0.968
+               ivf(q=none),p=8    5.3ms  0.934
+      ivf(q=binary,os=10),p=16    1.3ms  0.832  ← 26x faster than flat
+         ivf(q=int8,os=4),p=32    7.4ms  0.990
+              ivf(q=none),p=32   15.5ms  0.992
+                    int8(os=4)   18.7ms  0.996
+                     bit(os=8)   18.7ms  0.884
+                          flat   33.7ms  1.000
+```
+
+### 1M vectors (COHERE 768-dim cosine)
+
+```
+                            name  insert  train    MB  qry(ms)  recall
+──────────────────────────────────────────────────────────────────────
+            ivf(q=int8,os=4),p=8   163s   142s  4725   16.3ms  0.892
+        ivf(q=binary,os=10),p=16   118s   144s  4073   17.7ms  0.830
+           ivf(q=int8,os=4),p=16   163s   142s  4725   24.3ms  0.950
+           ivf(q=int8,os=4),p=32   163s   142s  4725   41.6ms  0.980
+                 ivf(q=none),p=8   497s   144s  3101   52.1ms  0.890
+                 ivf(q=none),p=16  497s   144s  3101   56.6ms  0.950
+                       bit(os=8)    18s      -  3048   83.5ms  0.918
+                 ivf(q=none),p=32  497s   144s  3101  103.9ms  0.980
+                      int8(os=4)    19s      -  3689  169.1ms  0.994
+                            flat    20s      -  2955  338.0ms  1.000
+```
+
+**Best config at 1M: `ivf(quantizer=int8, oversample=4, nprobe=16)`** —
+24ms query, 0.95 recall, 14x faster than flat, 7x faster than int8 baseline.
+
+### Scaling Characteristics
+
+| Metric | 100k | 1M | Scaling |
+|--------|------|-----|---------|
+| Flat query | 34ms | 338ms | 10x (linear) |
+| IVF int8 p=16 | 5.4ms | 24.3ms | 4.5x (sublinear) |
+| IVF insert rate | ~10k/s | ~6k/s | Slight degradation |
+| Training (nlist=1000) | 13s | 142s | ~11x |
+
+## Implementation
+
+### File Structure
+
+```
+sqlite-vec-ivf-kmeans.c    K-means++ algorithm (pure C, no SQLite deps)
+sqlite-vec-ivf.c           All IVF logic: parser, shadow tables, insert,
+                           delete, query, centroid commands, quantization
+sqlite-vec.c               ~50 lines of additions: struct fields, #includes,
+                           dispatch hooks in parse/create/insert/delete/filter
+```
+
+Both IVF files are `#include`d into `sqlite-vec.c`, so building them needs no
+Makefile changes.
+
+### Key Design Decisions
+
+1. **Fixed-size cells (64 vectors)** instead of one blob per centroid. Avoids
+   SQLite overflow page traversal which caused 110x insert slowdown.
+
+2. **Multiple cell rows per centroid** with an index on centroid_id. When a
+   cell fills, a new row is created. Query scans all rows for probed centroids
+   via `WHERE centroid_id IN (...)`.
+
+3. **Always store full vectors** when quantizer != none (in `_ivf_vectors` KV
+   table). Enables oversample re-ranking and point queries returning original
+   precision.
+
+4. **K-means in float32, quantize after**. Simpler than quantized k-means,
+   and assignment accuracy doesn't suffer much since nprobe compensates.
+
+5. **NEON SIMD for cosine distance**. Added `cosine_float_neon()` with 4-wide
+   FMA for dot product + magnitudes. Benefits all vec0 queries, not just IVF.
+
+6. **Runtime nprobe tuning**. `INSERT INTO t(id) VALUES ('nprobe=N')` changes
+   the probe count without rebuilding — enables fast parameter sweeps.
+
+### Optimization History
+
+| Optimization | Impact |
+|-------------|--------|
+| Fixed-size cells (64 max) | 110x insert speedup |
+| Skip chunk writes for IVF | 2x DB size reduction |
+| NEON cosine distance | 2x query speedup + 13% recall improvement (correct metric) |
+| Cached prepared statements | Eliminated per-insert prepare/finalize |
+| Batched cell reads (IN clause) | Fewer SQLite queries per KNN |
+| int8 quantization | 2.5x query speedup at same recall |
+| Binary quantization | 32x less cell I/O |
+| Oversample re-ranking | Recovers quantization recall loss |
+
+## Remaining Work
+
+See `ivf-benchmarks/TODO.md` for the full list. Key items:
+
+- **Cache centroids in memory** — each insert re-reads all centroids from SQLite
+- **Runtime oversample** — same pattern as nprobe runtime command
+- **SIMD k-means** — training uses scalar distance, could be 4x faster
+- **Top-k heap** — replace qsort with min-heap for large nprobe
+- **IVF-PQ** — product quantization for better compression/recall tradeoff
@@ -206,6 +206,21 @@ test-loadable-watch:
 test-unit:
 	$(CC) -DSQLITE_CORE -DSQLITE_VEC_TEST -DSQLITE_VEC_ENABLE_RESCORE tests/test-unit.c sqlite-vec.c vendor/sqlite3.c -I./ -Ivendor -o $(prefix)/test-unit && $(prefix)/test-unit
 
+# Standalone sqlite3 CLI with vec0 compiled in. Useful for benchmarking,
+# profiling (has debug symbols), and scripting without .load.
+#   make cli
+#   dist/sqlite3 :memory: "SELECT vec_version()"
+#   dist/sqlite3 < script.sql
+cli: sqlite-vec.h $(prefix)
+	$(CC) -O2 -g \
+	-DSQLITE_CORE \
+	-DSQLITE_EXTRA_INIT=core_init \
+	-DSQLITE_THREADSAFE=0 \
+	-Ivendor/ -I./ \
+	$(CFLAGS) \
+	vendor/sqlite3.c vendor/shell.c sqlite-vec.c examples/sqlite3-cli/core_init.c \
+	-ldl -lm -o $(prefix)/sqlite3
+
 fuzz-build:
 	$(MAKE) -C tests/fuzz all
 
 
@@ -8,27 +8,20 @@ BASELINES = \
 	"brute-int8:type=baseline,variant=int8" \
 	"brute-bit:type=baseline,variant=bit"
 
-# --- Index-specific configs ---
-# Each index branch should add its own configs here. Example:
-#
-# DISKANN_CONFIGS = \
-# 	"diskann-R48-binary:type=diskann,R=48,L=128,quantizer=binary" \
-# 	"diskann-R72-int8:type=diskann,R=72,L=128,quantizer=int8"
-#
-# IVF_CONFIGS = \
-# 	"ivf-n128-p16:type=ivf,nlist=128,nprobe=16"
-#
-# ANNOY_CONFIGS = \
-# 	"annoy-t50:type=annoy,n_trees=50"
+# --- IVF configs ---
+IVF_CONFIGS = \
+	"ivf-n32-p8:type=ivf,nlist=32,nprobe=8" \
+	"ivf-n128-p16:type=ivf,nlist=128,nprobe=16" \
+	"ivf-n512-p32:type=ivf,nlist=512,nprobe=32"
 
 RESCORE_CONFIGS = \
 	"rescore-bit-os8:type=rescore,quantizer=bit,oversample=8" \
 	"rescore-bit-os16:type=rescore,quantizer=bit,oversample=16" \
 	"rescore-int8-os8:type=rescore,quantizer=int8,oversample=8"
 
-ALL_CONFIGS = $(BASELINES) $(RESCORE_CONFIGS)
+ALL_CONFIGS = $(BASELINES) $(RESCORE_CONFIGS) $(IVF_CONFIGS)
 
-.PHONY: seed ground-truth bench-smoke bench-rescore bench-10k bench-50k bench-100k bench-all \
+.PHONY: seed ground-truth bench-smoke bench-rescore bench-ivf bench-10k bench-50k bench-100k bench-all \
         report clean
 
 # --- Data preparation ---
@@ -43,7 +36,8 @@ ground-truth: seed
 # --- Quick smoke test ---
 bench-smoke: seed
 	$(BENCH) --subset-size 5000 -k 10 -n 20 -o runs/smoke \
-		$(BASELINES)
+		"brute-float:type=baseline,variant=float" \
+		"ivf-quick:type=ivf,nlist=16,nprobe=4"
 
 bench-rescore: seed
 	$(BENCH) --subset-size 10000 -k 10 -o runs/rescore \
@@ -62,6 +56,12 @@ bench-100k: seed
 
 bench-all: bench-10k bench-50k bench-100k
 
+# --- IVF across sizes ---
+bench-ivf: seed
+	$(BENCH) --subset-size 10000 -k 10 -o runs/ivf $(BASELINES) $(IVF_CONFIGS)
+	$(BENCH) --subset-size 50000 -k 10 -o runs/ivf $(BASELINES) $(IVF_CONFIGS)
+	$(BENCH) --subset-size 100000 -k 10 -o runs/ivf $(BASELINES) $(IVF_CONFIGS)
+
 # --- Report ---
 report:
 	@echo "Use: sqlite3 runs/<dir>/results.db 'SELECT * FROM bench_results ORDER BY recall DESC'"
 
@@ -173,6 +173,48 @@ def _rescore_describe(params):
 }
 
 
+# ============================================================================
+# IVF implementation
+# ============================================================================
+
+
+def _ivf_create_table_sql(params):
+    return (
+        f"CREATE VIRTUAL TABLE vec_items USING vec0("
+        f"  id integer primary key,"
+        f"  embedding float[768] distance_metric=cosine"
+        f"    indexed by ivf("
+        f"      nlist={params['nlist']},"
+        f"      nprobe={params['nprobe']}"
+        f"    )"
+        f")"
+    )
+
+
+def _ivf_post_insert_hook(conn, params):
+    print("  Training k-means centroids...", flush=True)
+    t0 = time.perf_counter()
+    conn.execute("INSERT INTO vec_items(id) VALUES ('compute-centroids')")
+    conn.commit()
+    elapsed = time.perf_counter() - t0
+    print(f"  Training done in {elapsed:.1f}s", flush=True)
+    return elapsed
+
+
+def _ivf_describe(params):
+    return f"ivf  nlist={params['nlist']:<4} nprobe={params['nprobe']}"
+
+
+INDEX_REGISTRY["ivf"] = {
+    "defaults": {"nlist": 128, "nprobe": 16},
+    "create_table_sql": _ivf_create_table_sql,
+    "insert_sql": None,
+    "post_insert_hook": _ivf_post_insert_hook,
+    "run_query": None,
+    "describe": _ivf_describe,
+}
+
+
 # ============================================================================
 # Config parsing
 # ============================================================================