|
1 | 1 | # KNN Benchmarks for sqlite-vec |
2 | 2 |
|
3 | 3 | Benchmarking infrastructure for vec0 KNN configurations. Includes brute-force |
4 | | -baselines (float, int8, bit); index-specific branches add their own types |
5 | | -via the `INDEX_REGISTRY` in `bench.py`. |
| 4 | +baselines (float, int8, bit), rescore, IVF, and DiskANN index types. |
| 5 | + |
| 6 | +## Datasets |
| 7 | + |
| 8 | +Each dataset is a subdirectory containing a `Makefile` and `build_base_db.py` |
| 9 | +that produce a `base.db`. The benchmark runner auto-discovers any subdirectory |
| 10 | +with a `base.db` file. |
| 11 | + |
| 12 | +``` |
| 13 | +cohere1m/ # Cohere 768d cosine, 1M vectors |
| 14 | + Makefile # downloads parquets from Zilliz, builds base.db |
| 15 | + build_base_db.py |
| 16 | + base.db # (generated) |
| 17 | +
|
| 18 | +cohere10m/ # Cohere 768d cosine, 10M vectors (10 train shards) |
| 19 | + Makefile # make -j12 download to fetch all shards in parallel |
| 20 | + build_base_db.py |
| 21 | + base.db # (generated) |
| 22 | +``` |
| 23 | + |
| 24 | +Every `base.db` has the same schema: |
| 25 | + |
| 26 | +| Table | Columns | Description | |
| 27 | +|-------|---------|-------------| |
| 28 | +| `train` | `id INTEGER PRIMARY KEY, vector BLOB` | Indexed vectors (f32 blobs) | |
| 29 | +| `query_vectors` | `id INTEGER PRIMARY KEY, vector BLOB` | Query vectors for KNN evaluation | |
| 30 | +| `neighbors` | `query_vector_id INTEGER, rank INTEGER, neighbors_id TEXT` | Ground-truth nearest neighbors | |
| 31 | + |
| 32 | +To add a new dataset, create a directory with a `Makefile` that builds `base.db` |
| 33 | +with the tables above. It will be available via `--dataset <dirname>` automatically. |
| 34 | + |
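As a rough illustration of the schema above, a new dataset's `build_base_db.py` could be sketched as follows. The function names (`pack_f32`, `build_base_db`) are hypothetical and not taken from the repository. Two further assumptions: the byte order of the f32 blobs is little-endian, and `neighbors_id` holds a single train id as text.

```python
import sqlite3
import struct


def pack_f32(values):
    """Serialize floats as a float32 blob (little-endian is an assumption)."""
    return struct.pack(f"<{len(values)}f", *values)


def build_base_db(path, train, queries, neighbors):
    """Create a database with the three tables described in the README.

    train, queries: iterables of (id, list_of_floats)
    neighbors: iterable of (query_vector_id, rank, neighbors_id_text);
               the text format of neighbors_id is an assumption here.
    """
    db = sqlite3.connect(path)
    db.executescript(
        """
        CREATE TABLE train (id INTEGER PRIMARY KEY, vector BLOB);
        CREATE TABLE query_vectors (id INTEGER PRIMARY KEY, vector BLOB);
        CREATE TABLE neighbors (
            query_vector_id INTEGER, rank INTEGER, neighbors_id TEXT
        );
        """
    )
    db.executemany(
        "INSERT INTO train VALUES (?, ?)",
        [(i, pack_f32(v)) for i, v in train],
    )
    db.executemany(
        "INSERT INTO query_vectors VALUES (?, ?)",
        [(i, pack_f32(v)) for i, v in queries],
    )
    db.executemany("INSERT INTO neighbors VALUES (?, ?, ?)", list(neighbors))
    db.commit()
    return db
```

The existing `cohere1m/build_base_db.py` remains the authoritative reference for the exact blob and neighbor encoding.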
| 35 | +### Building datasets |
| 36 | + |
| 37 | +```bash |
| 38 | +# Cohere 1M |
| 39 | +cd cohere1m && make download && make && cd .. |
| 40 | + |
| 41 | +# Cohere 10M (parallel download recommended — 10 train shards + test + neighbors) |
| 42 | +cd cohere10m && make -j12 download && make && cd .. |
| 43 | +``` |
6 | 44 |
|
7 | 45 | ## Prerequisites |
8 | 46 |
|
9 | | -- Built `dist/vec0` extension (run `make` from repo root) |
| 47 | +- Built `dist/vec0` extension (run `make loadable` from repo root) |
10 | 48 | - Python 3.10+ |
11 | | -- `uv` (for seed data prep): `pip install uv` |
| 49 | +- `uv` |
12 | 50 |
|
13 | 51 | ## Quick start |
14 | 52 |
|
15 | 53 | ```bash |
16 | | -# 1. Download dataset and build seed DB (~3 GB download, ~5 min) |
17 | | -make seed |
| 54 | +# 1. Build a dataset |
| 55 | +cd cohere1m && make && cd .. |
18 | 56 |
|
19 | | -# 2. Run a quick smoke test (5k vectors, ~1 min) |
| 57 | +# 2. Quick smoke test (5k vectors) |
20 | 58 | make bench-smoke |
21 | 59 |
|
22 | | -# 3. Run full benchmark at 10k |
| 60 | +# 3. Full benchmark at 10k |
23 | 61 | make bench-10k |
24 | 62 | ``` |
25 | 63 |
|
26 | 64 | ## Usage |
27 | 65 |
|
28 | | -### Direct invocation |
29 | | - |
30 | 66 | ```bash |
31 | | -python bench.py --subset-size 10000 \ |
| 67 | +uv run python bench.py --subset-size 10000 -k 10 -n 50 --dataset cohere1m \ |
32 | 68 | "brute-float:type=baseline,variant=float" \ |
33 | | - "brute-int8:type=baseline,variant=int8" \ |
34 | | - "brute-bit:type=baseline,variant=bit" |
| 69 | + "rescore-bit-os8:type=rescore,quantizer=bit,oversample=8" |
35 | 70 | ``` |
36 | 71 |
|
37 | 72 | ### Config format |
38 | 73 |
|
39 | 74 | `name:type=<index_type>,key=val,key=val` |
40 | 75 |
|
41 | | -| Index type | Keys | Branch | |
42 | | -|-----------|------|--------| |
43 | | -| `baseline` | `variant` (float/int8/bit), `oversample` | this branch | |
44 | | - |
45 | | -Index branches register additional types in `INDEX_REGISTRY`. See the |
46 | | -docstring in `bench.py` for the extension API. |
| 76 | +| Index type | Keys | |
| 77 | +|-----------|------| |
| 78 | +| `baseline` | `variant` (float/int8/bit), `oversample` | |
| 79 | +| `rescore` | `quantizer` (bit/int8), `oversample` | |
| 80 | +| `ivf` | `nlist`, `nprobe` | |
| 81 | +| `diskann` | `R`, `L`, `quantizer` (binary/int8), `buffer_threshold` | |
47 | 82 |
|
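To make the `name:type=<index_type>,key=val,key=val` format concrete, here is a minimal parsing sketch. `parse_config` is a hypothetical helper written for illustration; the actual parsing in `bench.py` may differ, for example in how it converts values to numbers.

```python
def parse_config(spec):
    """Split 'name:type=<index_type>,key=val,...' into (name, index_type, params).

    Values are kept as strings; type conversion is left to the caller.
    """
    name, sep, rest = spec.partition(":")
    if not sep or not name:
        raise ValueError(f"expected 'name:type=...', got {spec!r}")

    params = {}
    for pair in rest.split(","):
        key, eq, value = pair.partition("=")
        if not eq or not key.strip():
            raise ValueError(f"malformed key=val pair {pair!r} in {spec!r}")
        params[key.strip()] = value.strip()

    if "type" not in params:
        raise ValueError(f"missing 'type' key in {spec!r}")
    index_type = params.pop("type")
    return name, index_type, params
```

For example, `parse_config("rescore-bit-os8:type=rescore,quantizer=bit,oversample=8")` yields the name `rescore-bit-os8`, the index type `rescore`, and the remaining keys `quantizer` and `oversample` as a dict of strings.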
48 | 83 | ### Make targets |
49 | 84 |
|
50 | 85 | | Target | Description | |
51 | 86 | |--------|-------------| |
52 | | -| `make seed` | Download COHERE 1M dataset | |
53 | | -| `make ground-truth` | Pre-compute ground truth for 10k/50k/100k | |
54 | | -| `make bench-smoke` | Quick 5k baseline test | |
| 87 | +| `make seed` | Download and build default dataset | |
| 88 | +| `make bench-smoke` | Quick 5k test (3 configs) | |
55 | 89 | | `make bench-10k` | All configs at 10k vectors | |
56 | 90 | | `make bench-50k` | All configs at 50k vectors | |
57 | 91 | | `make bench-100k` | All configs at 100k vectors | |
58 | 92 | | `make bench-all` | 10k + 50k + 100k | |
| 93 | +| `make bench-ivf` | Baselines + IVF across 10k/50k/100k | |
| 94 | +| `make bench-diskann` | Baselines + DiskANN across 10k/50k/100k | |
59 | 95 |
|
60 | | -## Adding an index type |
61 | | - |
62 | | -In your index branch, add an entry to `INDEX_REGISTRY` in `bench.py` and |
63 | | -append your configs to `ALL_CONFIGS` in the `Makefile`. See the existing |
64 | | -`baseline` entry and the comments in both files for the pattern. |
65 | | - |
66 | | -## Results |
| 96 | +## Results DB |
67 | 97 |
|
68 | | -Results are stored in `runs/<dir>/results.db` using the schema in `schema.sql`. |
| 98 | +Each run writes to `runs/<dataset>/<subset_size>/results.db` (SQLite, WAL mode). |
| 99 | +Progress is written continuously — query from another terminal to monitor: |
69 | 100 |
|
70 | 101 | ```bash |
71 | | -sqlite3 runs/10k/results.db " |
72 | | - SELECT config_name, recall, mean_ms, qps |
73 | | - FROM bench_results |
74 | | - ORDER BY recall DESC |
75 | | -" |
| 102 | +sqlite3 runs/cohere1m/10000/results.db "SELECT run_id, config_name, status FROM runs" |
76 | 103 | ``` |
77 | 104 |
|
78 | | -## Dataset |
| 105 | +See `results_schema.sql` for the full schema (tables: `runs`, `run_results`, |
| 106 | +`insert_batches`, `queries`). |
| 107 | + |
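The same monitoring query can be run from Python. This is a sketch that assumes only the `runs` columns shown in the `sqlite3` example above (`run_id`, `config_name`, `status`). `run_statuses` is a hypothetical helper, not part of `bench.py`.

```python
import sqlite3


def run_statuses(con):
    """Return (run_id, config_name, status) rows from the runs table."""
    return con.execute(
        "SELECT run_id, config_name, status FROM runs ORDER BY run_id"
    ).fetchall()


# Usage while a benchmark is running (path follows the layout described above):
#   con = sqlite3.connect("file:runs/cohere1m/10000/results.db?mode=ro", uri=True)
#   for run_id, config_name, status in run_statuses(con):
#       print(run_id, config_name, status)
#   con.close()
```

Opening the file with `mode=ro` keeps the monitoring connection read-only, so it cannot write to the database the benchmark is filling.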
| 108 | +## Adding an index type |
79 | 109 |
|
80 | | -[Zilliz COHERE Medium 1M](https://zilliz.com/learn/datasets-for-vector-database-benchmarks): |
81 | | -768 dimensions, cosine distance, 1M train vectors + 10k query vectors with precomputed neighbors. |
| 110 | +Add an entry to `INDEX_REGISTRY` in `bench.py` and append configs to |
| 111 | +`ALL_CONFIGS` in the `Makefile`. See existing entries for the pattern. |