Commit 8544081

Add comprehensive ANN benchmarking suite (#279)
Extend benchmarks-ann/ with results database (SQLite with per-query detail and continuous writes), dataset subfolder organization, --subset-size and --warmup options. Supports systematic comparison across flat, rescore, IVF, and DiskANN index types.
1 parent a248ecd commit 8544081

26 files changed

Lines changed: 2126 additions & 291 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -31,3 +31,6 @@ poetry.lock
 
 memstat.c
 memstat.*
+
+
+.DS_Store

benchmarks-ann/.gitignore

Lines changed: 6 additions & 0 deletions
@@ -1,2 +1,8 @@
 *.db
+*.db-shm
+*.db-wal
+*.parquet
 runs/
+
+viewer/
+searcher/

benchmarks-ann/Makefile

Lines changed: 14 additions & 14 deletions
@@ -1,5 +1,5 @@
 BENCH = python bench.py
-BASE_DB = seed/base.db
+BASE_DB = cohere1m/base.db
 EXT = ../dist/vec0
 
 # --- Baseline (brute-force) configs ---
@@ -33,7 +33,7 @@ ALL_CONFIGS = $(BASELINES) $(RESCORE_CONFIGS) $(IVF_CONFIGS) $(DISKANN_CONFIGS)
 
 # --- Data preparation ---
 seed:
-	$(MAKE) -C seed
+	$(MAKE) -C cohere1m
 
 ground-truth: seed
 	python ground_truth.py --subset-size 10000
@@ -42,43 +42,43 @@ ground-truth: seed
 
 # --- Quick smoke test ---
 bench-smoke: seed
-	$(BENCH) --subset-size 5000 -k 10 -n 20 -o runs/smoke \
+	$(BENCH) --subset-size 5000 -k 10 -n 20 --dataset cohere1m -o runs \
 		"brute-float:type=baseline,variant=float" \
 		"ivf-quick:type=ivf,nlist=16,nprobe=4" \
 		"diskann-quick:type=diskann,R=48,L=64,quantizer=binary"
 
 bench-rescore: seed
-	$(BENCH) --subset-size 10000 -k 10 -o runs/rescore \
+	$(BENCH) --subset-size 10000 -k 10 --dataset cohere1m -o runs \
 		$(RESCORE_CONFIGS)
 
 
 # --- Standard sizes ---
 bench-10k: seed
-	$(BENCH) --subset-size 10000 -k 10 -o runs/10k $(ALL_CONFIGS)
+	$(BENCH) --subset-size 10000 -k 10 --dataset cohere1m -o runs $(ALL_CONFIGS)
 
 bench-50k: seed
-	$(BENCH) --subset-size 50000 -k 10 -o runs/50k $(ALL_CONFIGS)
+	$(BENCH) --subset-size 50000 -k 10 --dataset cohere1m -o runs $(ALL_CONFIGS)
 
 bench-100k: seed
-	$(BENCH) --subset-size 100000 -k 10 -o runs/100k $(ALL_CONFIGS)
+	$(BENCH) --subset-size 100000 -k 10 --dataset cohere1m -o runs $(ALL_CONFIGS)
 
 bench-all: bench-10k bench-50k bench-100k
 
 # --- IVF across sizes ---
 bench-ivf: seed
-	$(BENCH) --subset-size 10000 -k 10 -o runs/ivf $(BASELINES) $(IVF_CONFIGS)
-	$(BENCH) --subset-size 50000 -k 10 -o runs/ivf $(BASELINES) $(IVF_CONFIGS)
-	$(BENCH) --subset-size 100000 -k 10 -o runs/ivf $(BASELINES) $(IVF_CONFIGS)
+	$(BENCH) --subset-size 10000 -k 10 --dataset cohere1m -o runs $(BASELINES) $(IVF_CONFIGS)
+	$(BENCH) --subset-size 50000 -k 10 --dataset cohere1m -o runs $(BASELINES) $(IVF_CONFIGS)
+	$(BENCH) --subset-size 100000 -k 10 --dataset cohere1m -o runs $(BASELINES) $(IVF_CONFIGS)
 
 # --- DiskANN across sizes ---
 bench-diskann: seed
-	$(BENCH) --subset-size 10000 -k 10 -o runs/diskann $(BASELINES) $(DISKANN_CONFIGS)
-	$(BENCH) --subset-size 50000 -k 10 -o runs/diskann $(BASELINES) $(DISKANN_CONFIGS)
-	$(BENCH) --subset-size 100000 -k 10 -o runs/diskann $(BASELINES) $(DISKANN_CONFIGS)
+	$(BENCH) --subset-size 10000 -k 10 --dataset cohere1m -o runs $(BASELINES) $(DISKANN_CONFIGS)
+	$(BENCH) --subset-size 50000 -k 10 --dataset cohere1m -o runs $(BASELINES) $(DISKANN_CONFIGS)
+	$(BENCH) --subset-size 100000 -k 10 --dataset cohere1m -o runs $(BASELINES) $(DISKANN_CONFIGS)
 
 # --- Report ---
 report:
-	@echo "Use: sqlite3 runs/<dir>/results.db 'SELECT * FROM bench_results ORDER BY recall DESC'"
+	@echo "Use: sqlite3 runs/cohere1m/<size>/results.db 'SELECT run_id, config_name, status, recall FROM runs JOIN run_results USING(run_id)'"
 
 # --- Cleanup ---
 clean:

benchmarks-ann/README.md

Lines changed: 68 additions & 38 deletions
@@ -1,81 +1,111 @@
 # KNN Benchmarks for sqlite-vec
 
 Benchmarking infrastructure for vec0 KNN configurations. Includes brute-force
-baselines (float, int8, bit); index-specific branches add their own types
-via the `INDEX_REGISTRY` in `bench.py`.
+baselines (float, int8, bit), rescore, IVF, and DiskANN index types.
+
+## Datasets
+
+Each dataset is a subdirectory containing a `Makefile` and `build_base_db.py`
+that produce a `base.db`. The benchmark runner auto-discovers any subdirectory
+with a `base.db` file.
+
+```
+cohere1m/ # Cohere 768d cosine, 1M vectors
+  Makefile # downloads parquets from Zilliz, builds base.db
+  build_base_db.py
+  base.db # (generated)
+
+cohere10m/ # Cohere 768d cosine, 10M vectors (10 train shards)
+  Makefile # make -j12 download to fetch all shards in parallel
+  build_base_db.py
+  base.db # (generated)
+```
+
+Every `base.db` has the same schema:
+
+| Table | Columns | Description |
+|-------|---------|-------------|
+| `train` | `id INTEGER PRIMARY KEY, vector BLOB` | Indexed vectors (f32 blobs) |
+| `query_vectors` | `id INTEGER PRIMARY KEY, vector BLOB` | Query vectors for KNN evaluation |
+| `neighbors` | `query_vector_id INTEGER, rank INTEGER, neighbors_id TEXT` | Ground-truth nearest neighbors |
+
+To add a new dataset, create a directory with a `Makefile` that builds `base.db`
+with the tables above. It will be available via `--dataset <dirname>` automatically.
+
+### Building datasets
+
+```bash
+# Cohere 1M
+cd cohere1m && make download && make && cd ..
+
+# Cohere 10M (parallel download recommended — 10 train shards + test + neighbors)
+cd cohere10m && make -j12 download && make && cd ..
+```
 
 ## Prerequisites
 
-- Built `dist/vec0` extension (run `make` from repo root)
+- Built `dist/vec0` extension (run `make loadable` from repo root)
 - Python 3.10+
-- `uv` (for seed data prep): `pip install uv`
+- `uv`
 
 ## Quick start
 
 ```bash
-# 1. Download dataset and build seed DB (~3 GB download, ~5 min)
-make seed
+# 1. Build a dataset
+cd cohere1m && make && cd ..
 
-# 2. Run a quick smoke test (5k vectors, ~1 min)
+# 2. Quick smoke test (5k vectors)
 make bench-smoke
 
-# 3. Run full benchmark at 10k
+# 3. Full benchmark at 10k
 make bench-10k
 ```
 
 ## Usage
 
-### Direct invocation
-
 ```bash
-python bench.py --subset-size 10000 \
+uv run python bench.py --subset-size 10000 -k 10 -n 50 --dataset cohere1m \
   "brute-float:type=baseline,variant=float" \
-  "brute-int8:type=baseline,variant=int8" \
-  "brute-bit:type=baseline,variant=bit"
+  "rescore-bit-os8:type=rescore,quantizer=bit,oversample=8"
 ```
 
 ### Config format
 
 `name:type=<index_type>,key=val,key=val`
 
-| Index type | Keys | Branch |
-|-----------|------|--------|
-| `baseline` | `variant` (float/int8/bit), `oversample` | this branch |
-
-Index branches register additional types in `INDEX_REGISTRY`. See the
-docstring in `bench.py` for the extension API.
+| Index type | Keys |
+|-----------|------|
+| `baseline` | `variant` (float/int8/bit), `oversample` |
+| `rescore` | `quantizer` (bit/int8), `oversample` |
+| `ivf` | `nlist`, `nprobe` |
+| `diskann` | `R`, `L`, `quantizer` (binary/int8), `buffer_threshold` |
 
 ### Make targets
 
 | Target | Description |
 |--------|-------------|
-| `make seed` | Download COHERE 1M dataset |
-| `make ground-truth` | Pre-compute ground truth for 10k/50k/100k |
-| `make bench-smoke` | Quick 5k baseline test |
+| `make seed` | Download and build default dataset |
+| `make bench-smoke` | Quick 5k test (3 configs) |
 | `make bench-10k` | All configs at 10k vectors |
 | `make bench-50k` | All configs at 50k vectors |
 | `make bench-100k` | All configs at 100k vectors |
 | `make bench-all` | 10k + 50k + 100k |
+| `make bench-ivf` | Baselines + IVF across 10k/50k/100k |
+| `make bench-diskann` | Baselines + DiskANN across 10k/50k/100k |
 
-## Adding an index type
-
-In your index branch, add an entry to `INDEX_REGISTRY` in `bench.py` and
-append your configs to `ALL_CONFIGS` in the `Makefile`. See the existing
-`baseline` entry and the comments in both files for the pattern.
-
-## Results
+## Results DB
 
-Results are stored in `runs/<dir>/results.db` using the schema in `schema.sql`.
+Each run writes to `runs/<dataset>/<subset_size>/results.db` (SQLite, WAL mode).
+Progress is written continuously — query from another terminal to monitor:
 
 ```bash
-sqlite3 runs/10k/results.db "
-SELECT config_name, recall, mean_ms, qps
-FROM bench_results
-ORDER BY recall DESC
-"
+sqlite3 runs/cohere1m/10000/results.db "SELECT run_id, config_name, status FROM runs"
 ```
 
-## Dataset
+See `results_schema.sql` for the full schema (tables: `runs`, `run_results`,
+`insert_batches`, `queries`).
+
+## Adding an index type
 
-[Zilliz COHERE Medium 1M](https://zilliz.com/learn/datasets-for-vector-database-benchmarks):
-768 dimensions, cosine distance, 1M train vectors + 10k query vectors with precomputed neighbors.
+Add an entry to `INDEX_REGISTRY` in `bench.py` and append configs to
+`ALL_CONFIGS` in the `Makefile`. See existing entries for the pattern.
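
Reviewer note: the README in this diff documents the config string format (`name:type=<index_type>,key=val,key=val`) and the new results location (`runs/<dataset>/<subset_size>/results.db`), but the parsing code in `bench.py` is not part of the text shown here. The sketch below is one possible reading of those two conventions, not the commit's implementation. `BenchConfig`, `parse_config`, and `results_db_path` are hypothetical names introduced only for illustration, and parameter values are kept as strings because the diff does not show how `bench.py` converts them.

```python
from dataclasses import dataclass, field


@dataclass
class BenchConfig:
    """One parsed config string (hypothetical container, not taken from bench.py)."""
    name: str
    index_type: str
    params: dict[str, str] = field(default_factory=dict)


def parse_config(spec: str) -> BenchConfig:
    """Parse 'name:type=<index_type>,key=val,...' as documented in the README."""
    # Everything before the first ':' is the config name.
    name, sep, rest = spec.partition(":")
    if not sep or not name:
        raise ValueError(f"expected 'name:type=...', got {spec!r}")

    # The remainder is a comma-separated list of key=val pairs.
    params: dict[str, str] = {}
    for pair in rest.split(","):
        key, eq, value = pair.partition("=")
        if not eq or not key:
            raise ValueError(f"malformed key=val pair {pair!r} in {spec!r}")
        params[key.strip()] = value.strip()

    # 'type' selects the index type; the other keys are index-specific.
    index_type = params.pop("type", None)
    if index_type is None:
        raise ValueError(f"missing 'type=' in {spec!r}")
    return BenchConfig(name.strip(), index_type, params)


def results_db_path(dataset: str, subset_size: int, out_dir: str = "runs") -> str:
    """Build the documented runs/<dataset>/<subset_size>/results.db path."""
    return f"{out_dir}/{dataset}/{subset_size}/results.db"


if __name__ == "__main__":
    cfg = parse_config("diskann-quick:type=diskann,R=48,L=64,quantizer=binary")
    print(cfg.name, cfg.index_type, cfg.params)
    print(results_db_path("cohere1m", 10000))
```

The example string is the `diskann-quick` config from the `bench-smoke` target. Validating keys against the per-type table (`nlist`/`nprobe` for `ivf`, and so on) is left out on purpose, since the accepted keys are defined by `INDEX_REGISTRY`, which this page does not show.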
