|
1 | 1 | # vector-search-bench |
2 | 2 |
|
3 | | -Brute-force cosine-similarity benchmark for Vortex on public VectorDBBench |
4 | | -embedding corpora. |
| 3 | +On-disk cosine-similarity scan benchmark for Vortex on public VectorDBBench |
| 4 | +embedding corpora. The benchmark writes one `.vortex` file per train shard per |
| 5 | +flavor and then issues filtered scans against the resulting files, so the |
| 6 | +numbers reflect realistic out-of-memory workloads — not in-memory `ArrayRef` |
| 7 | +manipulation. |
5 | 8 |
|
6 | | -The current benchmark pipeline supports source embedding columns with `f32` |
7 | | -or `f64` elements. The lower-level `list_to_vector_ext` conversion helper can |
8 | | -rewrap `f16` lists as `Vector` extension arrays, but `vector-search-bench` |
9 | | -itself does not yet support `f16` query extraction or the hand-rolled parquet |
10 | | -baseline. |
11 | | - |
12 | | -## What it measures |
13 | | - |
14 | | -For each `(dataset, format)` pair, the benchmark records: |
15 | | - |
16 | | -1. **`nbytes`** — in-memory footprint of the variant's array tree, in bytes. |
17 | | - Reporting the in-memory `.nbytes()` instead of an on-disk file size is |
18 | | - deliberate: the Vortex default write path runs BtrBlocks on every tree |
19 | | - regardless of whether it's already compressed, so "on-disk size" would |
20 | | - collapse `vortex-uncompressed` and `vortex-default` to the same bytes |
21 | | - even though their in-memory trees are different. The `nbytes()` |
22 | | - number is consistent with what the *compute* measurements actually |
23 | | - operate on. |
24 | | - - The `handrolled` baseline reports the canonical parquet file size |
25 | | - on disk — that's the only encoded representation it has. |
26 | | -2. **Compress time** — wall time to build the variant tree from the |
27 | | - materialized uncompressed source. ~0 for `vortex-uncompressed` (identity), |
28 | | - meaningful for the two compressed variants. |
29 | | -3. **Decompress time** — wall time to execute the variant tree all the way |
30 | | - back into a canonical `FixedSizeListArray<f32>` with a materialized f32 |
31 | | - element buffer. For `vortex-uncompressed` this is a no-op; for |
32 | | - `vortex-default` it includes ALP-RD bit-unpacking; for |
33 | | - `vortex-turboquant` it includes the inverse SORF rotation and |
34 | | - dictionary lookup. |
35 | | -4. **Cosine-similarity time** — `CosineSimilarity(data, const_query)` |
36 | | - executed to a materialized f32 array. |
37 | | -5. **Cosine-filter time** — `Binary(Gt, [CosineSimilarity, threshold])` |
38 | | - executed to a `BoolArray`. |
39 | | -6. **Recall@10** (TurboQuant only) — the fraction of the exact top-10 |
40 | | - nearest neighbours that TurboQuant recovers, using the uncompressed |
41 | | - Vortex scan as local ground truth. |
42 | | - |
43 | | -Before any timing starts, the benchmark runs a **correctness verification |
44 | | -pass**: cosine scores for a single query are computed against every |
45 | | -variant and compared to the uncompressed baseline. Lossless variants must |
46 | | -match within `1e-4` max-abs-diff; TurboQuant must stay within `0.2`. A |
47 | | -mismatch bails the run — you cannot publish throughput numbers for a |
48 | | -variant that returns wrong answers. |
49 | | - |
50 | | -## Formats |
51 | | - |
52 | | -- `handrolled` — Hand-rolled Rust scalar cosine loop over a flat |
53 | | - `Vec<f32>` that was decoded from the canonical parquet file via |
54 | | - `parquet-rs` / `arrow-rs`. The **decompress** phase does the parquet |
55 | | - read, downcasts to `Float32Array`, and memcpies into a plain `Vec<f32>`. |
56 | | - The **compute** phase is a plain scalar loop over `&[f32]` — no Arrow |
57 | | - compute kernels, no scalar-function dispatch, no SIMD annotations. |
58 | | - |
59 | | - This is a **compute-cost floor**, not a realistic parquet-on-DBMS |
60 | | - baseline. It answers the question "what's the minimum cost you could |
61 | | - get away with if you wrote a vector-search scan by hand with no query |
62 | | - engine?" Real parquet users would pay substantially more (DuckDB |
63 | | - `list_cosine_similarity`, DataFusion with a vector UDF, etc.) — |
64 | | - adding those as additional baselines is a natural v2 direction. |
65 | | -- `vortex-uncompressed` — Raw `Vector<dim, f32>` extension array, no |
66 | | - encoding-level compression applied. |
67 | | -- `vortex-default` — `BtrBlocksCompressor::default()` applied to the FSL |
68 | | - storage child. On float vectors this typically finds ~15% lossless |
69 | | - savings via ALP-RD (mantissa/exponent split + bitpacking). |
70 | | -- `vortex-turboquant` — The full |
71 | | - `L2Denorm(SorfTransform(FSL(Dict(codes, centroids))), norms)` pipeline. |
72 | | - Lossy; recall@10 is reported alongside throughput. At the default 8-bit |
73 | | - config this typically gives ~3× storage reduction at >90% top-10 |
74 | | - recall. |
75 | | - |
76 | | -## Datasets |
77 | | - |
78 | | -The smallest built-in dataset is **Cohere-100K** (`cohere-small`): 100K |
79 | | -rows × 768 dims, cosine metric, ~150 MB zstd-parquet. It's the smallest |
80 | | -VectorDBBench-supplied corpus that still exercises every encoding path. |
81 | | -Larger variants (`cohere-medium`, `openai-small`, `openai-medium`, |
82 | | -`bioasq-medium`, `glove-medium`) are wired up for local / on-demand |
83 | | -experiments; see `vortex-bench/src/vector_dataset.rs` for the full list. |
84 | | - |
85 | | -The upstream URL for Cohere-100K is |
86 | | -`https://assets.zilliz.com/benchmark/cohere_small_100k/train.parquet`. |
87 | | -The public Zilliz bucket is anonymous-readable so the code can hit it |
88 | | -directly. |
89 | | - |
90 | | -## Running locally |
| 9 | +## Quick start |
91 | 10 |
|
92 | 11 | ```bash |
93 | 12 | cargo run -p vector-search-bench --release -- \ |
94 | | - --datasets cohere-small \ |
95 | | - --formats handrolled,vortex-uncompressed,vortex-default,vortex-turboquant \ |
96 | | - --iterations 5 \ |
97 | | - -d table |
| 13 | + --dataset cohere-small-100k \ |
| 14 | + --flavors vortex-uncompressed,vortex-turboquant,handrolled \ |
| 15 | + --iterations 3 \ |
| 16 | + --threshold 0.8 |
98 | 17 | ``` |
99 | 18 |
|
100 | | -The first run downloads the parquet file into |
101 | | -`vortex-bench/data/cohere-small/cohere-small.parquet` and caches it |
102 | | -idempotently for subsequent runs. |
| 19 | +The first run downloads the parquet shards into |
| 20 | +`vortex-bench/data/vector-search/<dataset>/<layout>/train/...`, ingests them |
| 21 | +into per-flavor `.vortex` files in sibling directories, samples a query row |
| 22 | +from `test.parquet`, and runs the timed scan loop. |
103 | 23 |
|
104 | | -## CI note: dataset mirror |
 | 24 | +A dataset that publishes more than one layout (e.g. `cohere-large-10m` |
| 25 | +hosts both `partitioned` and `partitioned-shuffled`) requires `--layout` to |
| 26 | +disambiguate. |
105 | 27 |
|
106 | | -CI runs after every develop-branch merge. Hitting `assets.zilliz.com` |
107 | | -from every merge would create recurring egress traffic on a third-party |
108 | | -bucket — the same courtesy reason `RPlace` / `AirQuality` are excluded |
109 | | -from CI in `compress-bench`. |
| 28 | +## What it measures |
110 | 29 |
|
111 | | -Before enabling the `vector-search-bench` entry in `.github/workflows/bench.yml` |
112 | | -on a fork, either: |
| 30 | +Per `(dataset, flavor)`: |
| 31 | + |
| 32 | +| Metric | What it is | |
| 33 | +|---------------------|---------------------------------------------------------| |
| 34 | +| compress wall | Sum of per-shard write time (parquet → `.vortex`). | |
| 35 | +| input bytes | Sum of input parquet shard sizes. | |
| 36 | +| output bytes | Sum of output `.vortex` shard sizes. | |
| 37 | +| compression ratio | input bytes / output bytes. | |
| 38 | +| scan wall (best) | Best-of-N wall-clock for the per-iteration scan. | |
| 39 | +| scan wall (median) | Median wall-clock for the per-iteration scan. | |
| 40 | +| matches | Rows that survived `cosine(emb, query) > threshold`. | |
| 41 | +| rows scanned | Total rows in the `.vortex` files (sanity check). | |
| 42 | +| rows / sec | rows scanned / scan wall (best). | |
| 43 | +| recall@K (mean/p05) | Only emitted when `--recall` is passed (lossy flavors). | |
| 44 | + |
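The derived rows in the table above (compression ratio, best and median scan wall, rows / sec) follow mechanically from the raw measurements. The sketch below shows one way to compute them. `ScanSummary` and `summarize` are hypothetical names for illustration, not the benchmark's own types, and taking the mean of the two middle samples for an even iteration count is an assumption.

```rust
use std::time::Duration;

/// Hypothetical per-(dataset, flavor) summary; field names are illustrative only.
#[derive(Debug, PartialEq)]
pub struct ScanSummary {
    pub compression_ratio: f64,
    pub best: Duration,
    pub median: Duration,
    pub rows_per_sec: f64,
}

/// Derives the summary metrics from raw byte counts and per-iteration scan times.
/// Returns `None` when a ratio would be undefined (no samples, zero output bytes,
/// or a zero-length best scan).
pub fn summarize(
    input_bytes: u64,
    output_bytes: u64,
    rows_scanned: u64,
    scan_walls: &[Duration],
) -> Option<ScanSummary> {
    if scan_walls.is_empty() || output_bytes == 0 {
        return None;
    }
    let mut sorted = scan_walls.to_vec();
    sorted.sort();

    let best = sorted[0];
    if best.is_zero() {
        return None;
    }

    // Assumption: for an even number of iterations, average the two middle samples.
    let mid = sorted.len() / 2;
    let median = if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2
    };

    Some(ScanSummary {
        // compression ratio = input bytes / output bytes
        compression_ratio: input_bytes as f64 / output_bytes as f64,
        best,
        median,
        // rows / sec = rows scanned / scan wall (best)
        rows_per_sec: rows_scanned as f64 / best.as_secs_f64(),
    })
}
```

Because throughput is derived from the best run rather than the median, it is an optimistic figure; the median row shows how stable the scan is across iterations.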
| 45 | +## Flavors |
| 46 | + |
| 47 | +- **`vortex-uncompressed`** — `BtrBlocksCompressorBuilder::empty()`. Vortex |
| 48 | + framing with no compression schemes registered, so the `emb` column lands |
| 49 | + as canonical `FixedSizeList<f32>` on disk. Lossless ceiling on the size |
| 50 | + axis. |
| 51 | +- **`vortex-turboquant`** — `BtrBlocksCompressorBuilder::empty().with_turboquant()`. |
| 52 | + Only the TurboQuant scheme is registered, so the `emb` column ends up |
| 53 | + wrapped as `L2Denorm(SorfTransform(FixedSizeList(Dict)))`. Lossy; significant |
| 54 | + size win. |
| 55 | +- **`handrolled`** — Sequential parquet scan + 4-way unrolled scalar cosine |
| 56 | + loop over a flat `Vec<f32>` (decoded via `parquet-rs` / `arrow-rs`). This |
| 57 | + is a *compute-cost floor*, not a realistic parquet-on-DBMS baseline. Real |
| 58 | + parquet users would pay substantially more (DuckDB |
| 59 | + `list_cosine_similarity`, DataFusion with a vector UDF, etc.) — adding |
| 60 | + those as additional baselines is a natural future direction. |
| 61 | + |
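To make the `handrolled` flavor concrete, here is a minimal sketch of a 4-way unrolled scalar cosine loop over a flat `f32` buffer, followed by the `cosine(emb, query) > threshold` filter. The function names, the zero-norm convention, and the tail handling are assumptions for illustration; the benchmark's actual loop may differ.

```rust
/// Cosine similarity with four independent accumulator lanes (4-way unrolled).
/// Assumption: a zero-norm input yields 0.0 rather than NaN.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");

    let chunks_a = a.chunks_exact(4);
    let chunks_b = b.chunks_exact(4);
    // Elements left over when the dimension is not a multiple of 4.
    let (tail_a, tail_b) = (chunks_a.remainder(), chunks_b.remainder());

    let mut dot = [0.0f32; 4];
    let mut norm_a = [0.0f32; 4];
    let mut norm_b = [0.0f32; 4];
    for (x, y) in chunks_a.zip(chunks_b) {
        for lane in 0..4 {
            dot[lane] += x[lane] * y[lane];
            norm_a[lane] += x[lane] * x[lane];
            norm_b[lane] += y[lane] * y[lane];
        }
    }

    // Fold the lanes, then add the scalar tail.
    let mut d: f32 = dot.iter().sum();
    let mut na: f32 = norm_a.iter().sum();
    let mut nb: f32 = norm_b.iter().sum();
    for (x, y) in tail_a.iter().zip(tail_b) {
        d += x * y;
        na += x * x;
        nb += y * y;
    }

    let denom = (na * nb).sqrt();
    if denom == 0.0 { 0.0 } else { d / denom }
}

/// Counts rows of a flat row-major buffer whose cosine to `query` exceeds `threshold`.
pub fn count_matches(flat: &[f32], dim: usize, query: &[f32], threshold: f32) -> usize {
    assert!(dim > 0 && query.len() == dim);
    flat.chunks_exact(dim)
        .filter(|row| cosine_similarity(row, query) > threshold)
        .count()
}
```

Separate accumulator lanes break the dependency chain between successive additions, which is the usual reason for unrolling a reduction by hand. They also change the floating-point summation order, so scores can differ from a naive loop in the last few bits.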
| 62 | +The benchmark always operates in `f32`. The ingest pipeline casts `f64` |
| 63 | +sources (e.g. OpenAI corpora) to `f32` once at write time, so all downstream |
| 64 | +code is uniformly `f32`. |
113 | 65 |
|
114 | | -1. **Mirror the file into an internal bucket** and swap the URL in |
115 | | - `vortex-bench/src/vector_dataset.rs::VectorDataset::parquet_url`, or |
116 | | -2. **Accept the upstream egress cost** and leave the URL as-is. |
| 66 | +## Datasets |
117 | 67 |
|
118 | | -The mirror step is a one-off `aws s3 cp` and is documented here rather |
119 | | -than automated in the build because the destination bucket is |
120 | | -organization-specific. |
| 68 | +All 16 published VectorDBBench corpora are wired into the catalog, with |
| 69 | +explicit declarations of which train-split layouts upstream actually hosts. |
 | 70 | +See `vortex-bench/src/vector_dataset/catalog.rs` for the full table. The |
 | 71 | +CLI lists the available choices when run with `--help`. |
| 72 | + |
| 73 | +| Dataset | dim | rows | layouts | |
| 74 | +|--------------------|------|------|---------------------------------------------| |
| 75 | +| cohere-small-100k | 768 | 100K | single, single-shuffled | |
| 76 | +| cohere-medium-1m | 768 | 1M | single, single-shuffled | |
| 77 | +| cohere-large-10m | 768 | 10M | partitioned (10), partitioned-shuffled (10) | |
| 78 | +| openai-small-50k | 1536 | 50K | single, single-shuffled | |
| 79 | +| openai-medium-500k | 1536 | 500K | single, single-shuffled | |
| 80 | +| openai-large-5m | 1536 | 5M | partitioned (10), partitioned-shuffled (10) | |
| 81 | +| bioasq-medium-1m | 1024 | 1M | single-shuffled | |
| 82 | +| bioasq-large-10m | 1024 | 10M | partitioned-shuffled (10) | |
| 83 | +| glove-{small,medium}, gist-{small,medium} | varies | varies | single only | |
| 84 | +| sift-small-500k | 128 | 500K | single | |
| 85 | +| sift-medium-5m | 128 | 5M | single | |
| 86 | +| sift-large-50m | 128 | 50M | partitioned (50) | |
| 87 | +| laion-large-100m | 768 | 100M | partitioned (100) | |
| 88 | + |
| 89 | +## Recall@K |
| 90 | + |
| 91 | +Pass `--recall --recall-k 10 --recall-queries 100` to measure recall against |
| 92 | +`neighbors.parquet`. The lossless `vortex-uncompressed` flavor is skipped |
| 93 | +because its recall is 1.0 by construction; only `vortex-turboquant` is |
| 94 | +measured. Datasets that don't host `neighbors.parquet` (sift, glove, gist) |
| 95 | +bail out when `--recall` is set. |
| 96 | + |
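As a rough illustration of the reported `recall@K (mean/p05)` pair, the sketch below computes per-query recall as the overlap between the ground-truth top-K ids and the ids a lossy flavor returns, then aggregates across queries. The function names are hypothetical, and the nearest-rank definition of the 5th percentile is an assumption; the benchmark may use a different one.

```rust
use std::collections::HashSet;

/// Fraction of the ground-truth top-`k` ids that also appear in the found top-`k`.
/// Returns 0.0 when there is no ground truth to compare against.
pub fn recall_at_k(truth: &[u64], found: &[u64], k: usize) -> f64 {
    let truth_k: HashSet<u64> = truth.iter().take(k).copied().collect();
    if truth_k.is_empty() {
        return 0.0;
    }
    let found_k: HashSet<u64> = found.iter().take(k).copied().collect();
    truth_k.intersection(&found_k).count() as f64 / truth_k.len() as f64
}

/// Mean and 5th percentile of per-query recalls.
/// Assumption: p05 uses the nearest-rank method (rank = ceil(0.05 * n), at least 1).
pub fn mean_and_p05(recalls: &[f64]) -> Option<(f64, f64)> {
    if recalls.is_empty() {
        return None;
    }
    let mut sorted = recalls.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    let mean = sorted.iter().sum::<f64>() / sorted.len() as f64;
    // Integer ceiling of 5% of n, avoiding floating-point rounding at the boundary.
    let rank = ((sorted.len() * 5 + 99) / 100).max(1);
    Some((mean, sorted[rank - 1]))
}
```

Reporting p05 next to the mean exposes the worst-served queries: a flavor can have a high mean recall while a small fraction of queries recover far fewer of their true neighbours.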
| 97 | +## Future work |
| 98 | + |
| 99 | +1. Native `f64` flavor — drop the prepare-time downcast for OpenAI datasets. |
| 100 | +2. `--decompress-only` mode — project + drain, no filter — for pure decode |
| 101 | + timing. |
| 102 | +3. Filtered scans via `scalar_labels` (already projected through the ingest |
| 103 | + pipeline; the `neighbors_int_*p.parquet` and `neighbors_labels_*.parquet` |
| 104 | + ground-truth files exist for verification). |
| 105 | +4. DuckDB / DataFusion parquet baselines — real engines, not just hand-rolled. |
| 106 | +5. MSE-vs-ground-truth correctness mode (catches "right top-K, wrong scores"). |
| 107 | +6. Promote the cosine-filter expression helpers from `expression.rs` into |
| 108 | + `vortex-tensor::vector_search` if a second caller materializes. |