
Commit 0de765f

Add ANN search support for vec0 virtual table (#273)
Add approximate nearest neighbor infrastructure to vec0: shared distance dispatch (vec0_distance_full), flat index type with parser, NEON-optimized cosine/Hamming for float32/int8, amalgamation script, and benchmark suite (benchmarks-ann/) with ground-truth generation and profiling tools. Remove unused vec_npy_each/vec_static_blobs code, fix missing stdint.h include.
1 parent e9f598a commit 0de765f

27 files changed: +7011 -6950 lines changed
Makefile

Lines changed: 13 additions & 1 deletion
@@ -42,6 +42,11 @@ ifndef OMIT_SIMD
 ifeq ($(shell uname -sm),Darwin arm64)
 CFLAGS += -mcpu=apple-m1 -DSQLITE_VEC_ENABLE_NEON
 endif
+ifeq ($(shell uname -s),Linux)
+ifneq ($(filter avx,$(shell grep -o 'avx[^ ]*' /proc/cpuinfo 2>/dev/null | head -1)),)
+CFLAGS += -mavx -DSQLITE_VEC_ENABLE_AVX
+endif
+endif
 endif
 
 ifdef USE_BREW_SQLITE
@@ -155,6 +160,13 @@ clean:
 	rm -rf dist
 
 
+TARGET_AMALGAMATION=$(prefix)/sqlite-vec.c
+
+amalgamation: $(TARGET_AMALGAMATION)
+
+$(TARGET_AMALGAMATION): sqlite-vec.c $(wildcard sqlite-vec-*.c) scripts/amalgamate.py $(prefix)
+	python3 scripts/amalgamate.py sqlite-vec.c > $@
+
 FORMAT_FILES=sqlite-vec.h sqlite-vec.c
 format: $(FORMAT_FILES)
 	clang-format -i $(FORMAT_FILES)
@@ -174,7 +186,7 @@ evidence-of:
 test:
 	sqlite3 :memory: '.read test.sql'
 
-.PHONY: version loadable static test clean gh-release evidence-of install uninstall
+.PHONY: version loadable static test clean gh-release evidence-of install uninstall amalgamation
 
 publish-release:
 	./scripts/publish-release.sh

TODO.md

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+# TODO: `ann` base branch + consolidated benchmarks
+
+## 1. Create `ann` branch with shared code
+
+### 1.1 Branch setup
+- [x] `git checkout -B ann origin/main`
+- [x] Cherry-pick `624f998` (vec0_distance_full shared distance dispatch)
+- [x] Cherry-pick stdint.h fix for test header
+- [ ] Pull NEON cosine optimization from ivf-yolo3 into shared code
+  - Currently only in ivf branch but is general-purpose (benefits all distance calcs)
+  - Lives in `distance_cosine_float()`: ~57 lines of ARM NEON vectorized cosine
+
+### 1.2 Benchmark infrastructure (`benchmarks-ann/`)
+- [x] Seed data pipeline (`seed/Makefile`, `seed/build_base_db.py`)
+- [x] Ground truth generator (`ground_truth.py`)
+- [x] Results schema (`schema.sql`)
+- [x] Benchmark runner with `INDEX_REGISTRY` extension point (`bench.py`)
+  - Baseline configs (float, int8-rescore, bit-rescore) implemented
+  - Index branches register their types via `INDEX_REGISTRY` dict
+- [x] Makefile with baseline targets
+- [x] README
+
+### 1.3 Rebase feature branches onto `ann`
+- [x] Rebase `diskann-yolo2` onto `ann` (1 commit: DiskANN implementation)
+- [x] Rebase `ivf-yolo3` onto `ann` (1 commit: IVF implementation)
+- [x] Rebase `annoy-yolo2` onto `ann` (2 commits: Annoy implementation + schema fix)
+- [x] Verify each branch has only its index-specific commits remaining
+- [ ] Force-push all 4 branches to origin
+
+---
+
+## 2. Per-branch: register index type in benchmarks
+
+Each index branch should add to `benchmarks-ann/` when rebased onto `ann`:
+
+### 2.1 Register in `bench.py`
+
+Add an `INDEX_REGISTRY` entry. Each entry provides:
+- `defaults`: default param values
+- `create_table_sql(params)`: CREATE VIRTUAL TABLE with INDEXED BY clause
+- `insert_sql(params)`: custom insert SQL, or None for default
+- `post_insert_hook(conn, params)`: training/building step, returns time
+- `run_query(conn, params, query, k)`: custom query, or None for default MATCH
+- `describe(params)`: one-line description for report output
+
+### 2.2 Add configs to `Makefile`
+
+Append index-specific config variables and targets. Example pattern:
+
+```makefile
+DISKANN_CONFIGS = \
+  "diskann-R48-binary:type=diskann,R=48,L=128,quantizer=binary" \
+  ...
+
+ALL_CONFIGS += $(DISKANN_CONFIGS)
+
+bench-diskann: seed
+	$(BENCH) --subset-size 10000 -k 10 -o runs/diskann $(BASELINES) $(DISKANN_CONFIGS)
+	...
+```
+
+### 2.3 Migrate existing benchmark results/docs
+
+- Move useful results docs (RESULTS.md, etc.) into `benchmarks-ann/results/`
+- Delete redundant per-branch benchmark directories once consolidated infra is proven
+
+---
+
+## 3. Future improvements
+
+- [ ] Reporting script (`report.py`): query results.db, produce markdown comparison tables
+- [ ] Profiling targets in Makefile (lift from ivf-yolo3's Instruments/perf wrappers)
+- [ ] Pre-computed ground truth integration (use GT DB files instead of on-the-fly brute-force)
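
Note on section 2.1 of TODO.md: the shape of an `INDEX_REGISTRY` entry can be sketched roughly as below. This is an illustration only, not code from `bench.py`. The entry name `example`, the `dims` parameter, the table name `vec_items`, and the choice to store `None` directly for `insert_sql` and `run_query` are all assumptions, and the branch-specific INDEXED BY clause is deliberately left out because the commit does not show its syntax.

```python
import time

# Hypothetical registry; the real one lives in bench.py and may differ.
INDEX_REGISTRY = {}


def example_create_table_sql(params):
    # Placeholder DDL. A real index branch would append its own
    # INDEXED BY clause here; that syntax is not shown in this commit.
    return (
        "CREATE VIRTUAL TABLE vec_items USING vec0("
        f"embedding float[{params['dims']}])"
    )


def example_post_insert_hook(conn, params):
    # A real index would train or build here. This stub does no work
    # and only returns the elapsed time in seconds.
    start = time.perf_counter()
    return time.perf_counter() - start


def example_describe(params):
    # One-line description for report output.
    return f"example index (dims={params['dims']})"


INDEX_REGISTRY["example"] = {
    "defaults": {"dims": 768},
    "create_table_sql": example_create_table_sql,
    "insert_sql": None,  # assumed to mean "use the default insert"
    "post_insert_hook": example_post_insert_hook,
    "run_query": None,  # assumed to mean "use the default MATCH query"
    "describe": example_describe,
}
```

Keeping every entry to the same six keys would let the runner treat all index types uniformly and fall back to defaults wherever a value is `None`.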

benchmarks-ann/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+*.db
+runs/

benchmarks-ann/Makefile

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+BENCH = python bench.py
+BASE_DB = seed/base.db
+EXT = ../dist/vec0
+
+# --- Baseline (brute-force) configs ---
+BASELINES = \
+	"brute-float:type=baseline,variant=float" \
+	"brute-int8:type=baseline,variant=int8" \
+	"brute-bit:type=baseline,variant=bit"
+
+# --- Index-specific configs ---
+# Each index branch should add its own configs here. Example:
+#
+# DISKANN_CONFIGS = \
+# "diskann-R48-binary:type=diskann,R=48,L=128,quantizer=binary" \
+# "diskann-R72-int8:type=diskann,R=72,L=128,quantizer=int8"
+#
+# IVF_CONFIGS = \
+# "ivf-n128-p16:type=ivf,nlist=128,nprobe=16"
+#
+# ANNOY_CONFIGS = \
+# "annoy-t50:type=annoy,n_trees=50"
+
+ALL_CONFIGS = $(BASELINES)
+
+.PHONY: seed ground-truth bench-smoke bench-10k bench-50k bench-100k bench-all \
+	report clean
+
+# --- Data preparation ---
+seed:
+	$(MAKE) -C seed
+
+ground-truth: seed
+	python ground_truth.py --subset-size 10000
+	python ground_truth.py --subset-size 50000
+	python ground_truth.py --subset-size 100000
+
+# --- Quick smoke test ---
+bench-smoke: seed
+	$(BENCH) --subset-size 5000 -k 10 -n 20 -o runs/smoke \
+		$(BASELINES)
+
+# --- Standard sizes ---
+bench-10k: seed
+	$(BENCH) --subset-size 10000 -k 10 -o runs/10k $(ALL_CONFIGS)
+
+bench-50k: seed
+	$(BENCH) --subset-size 50000 -k 10 -o runs/50k $(ALL_CONFIGS)
+
+bench-100k: seed
+	$(BENCH) --subset-size 100000 -k 10 -o runs/100k $(ALL_CONFIGS)
+
+bench-all: bench-10k bench-50k bench-100k
+
+# --- Report ---
+report:
+	@echo "Use: sqlite3 runs/<dir>/results.db 'SELECT * FROM bench_results ORDER BY recall DESC'"
+
+# --- Cleanup ---
+clean:
+	rm -rf runs/
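
Note on the config strings in this Makefile: each quoted argument passed to `bench.py` follows a `name:type=<index_type>,key=val` pattern. A minimal sketch of how such a string could be split is shown below. The function name `parse_config` and the decision to keep every value as a string are assumptions; the actual parsing in `bench.py` is not part of this excerpt and may differ.

```python
def parse_config(spec):
    """Split 'name:type=<index_type>,key=val,...' into (name, index_type, params).

    Hypothetical helper; all values are returned as strings.
    """
    name, sep, rest = spec.partition(":")
    if not sep or not name:
        raise ValueError(f"expected 'name:type=...', got {spec!r}")

    params = {}
    for pair in rest.split(","):
        key, eq, value = pair.partition("=")
        if not eq or not key:
            raise ValueError(f"malformed key=val pair {pair!r} in {spec!r}")
        params[key] = value

    # 'type' selects the index type; the remaining keys are its parameters.
    index_type = params.pop("type", None)
    if index_type is None:
        raise ValueError(f"missing 'type' key in {spec!r}")
    return name, index_type, params


if __name__ == "__main__":
    print(parse_config("brute-float:type=baseline,variant=float"))
```

Leaving values as strings defers type conversion to whichever index type consumes them, so a numeric-looking key such as `R=48` would still need an explicit `int()` at the point of use.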

benchmarks-ann/README.md

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
+# KNN Benchmarks for sqlite-vec
+
+Benchmarking infrastructure for vec0 KNN configurations. Includes brute-force
+baselines (float, int8, bit); index-specific branches add their own types
+via the `INDEX_REGISTRY` in `bench.py`.
+
+## Prerequisites
+
+- Built `dist/vec0` extension (run `make` from repo root)
+- Python 3.10+
+- `uv` (for seed data prep): `pip install uv`
+
+## Quick start
+
+```bash
+# 1. Download dataset and build seed DB (~3 GB download, ~5 min)
+make seed
+
+# 2. Run a quick smoke test (5k vectors, ~1 min)
+make bench-smoke
+
+# 3. Run full benchmark at 10k
+make bench-10k
+```
+
+## Usage
+
+### Direct invocation
+
+```bash
+python bench.py --subset-size 10000 \
+  "brute-float:type=baseline,variant=float" \
+  "brute-int8:type=baseline,variant=int8" \
+  "brute-bit:type=baseline,variant=bit"
+```
+
+### Config format
+
+`name:type=<index_type>,key=val,key=val`
+
+| Index type | Keys | Branch |
+|-----------|------|--------|
+| `baseline` | `variant` (float/int8/bit), `oversample` | this branch |
+
+Index branches register additional types in `INDEX_REGISTRY`. See the
+docstring in `bench.py` for the extension API.
+
+### Make targets
+
+| Target | Description |
+|--------|-------------|
+| `make seed` | Download COHERE 1M dataset |
+| `make ground-truth` | Pre-compute ground truth for 10k/50k/100k |
+| `make bench-smoke` | Quick 5k baseline test |
+| `make bench-10k` | All configs at 10k vectors |
+| `make bench-50k` | All configs at 50k vectors |
+| `make bench-100k` | All configs at 100k vectors |
+| `make bench-all` | 10k + 50k + 100k |
+
+## Adding an index type
+
+In your index branch, add an entry to `INDEX_REGISTRY` in `bench.py` and
+append your configs to `ALL_CONFIGS` in the `Makefile`. See the existing
+`baseline` entry and the comments in both files for the pattern.
+
+## Results
+
+Results are stored in `runs/<dir>/results.db` using the schema in `schema.sql`.
+
+```bash
+sqlite3 runs/10k/results.db "
+  SELECT config_name, recall, mean_ms, qps
+  FROM bench_results
+  ORDER BY recall DESC
+"
+```
+
+## Dataset
+
+[Zilliz COHERE Medium 1M](https://zilliz.com/learn/datasets-for-vector-database-benchmarks):
+768 dimensions, cosine distance, 1M train vectors + 10k query vectors with precomputed neighbors.
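
Note on the `recall` column queried in this README: the excerpt does not show how `bench.py` computes it. A common definition, assumed here, is the fraction of the true top-k neighbors that appear in the returned top-k, averaged over all queries. The function names below are hypothetical.

```python
def recall_at_k(result_ids, truth_ids, k):
    """Fraction of the true top-k neighbor ids found in the returned top-k ids."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(truth_ids[:k])
    if not truth:
        return 0.0
    hits = len(truth.intersection(result_ids[:k]))
    return hits / len(truth)


def mean_recall(all_results, all_truth, k):
    """Average recall@k over paired per-query result and ground-truth lists."""
    scores = [recall_at_k(r, t, k) for r, t in zip(all_results, all_truth)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Two of the four true neighbors were returned.
    print(recall_at_k([1, 2, 3, 4], [1, 2, 5, 6], k=4))
```

Under this definition a brute-force float baseline would score 1.0 against ground truth computed with the same metric, which makes it a convenient sanity check before comparing approximate indexes.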
