| 1 | +# vector-search-bench |
| 2 | + |
| 3 | +Brute-force cosine-similarity benchmark for Vortex on public VectorDBBench |
| 4 | +embedding corpora. |
| 5 | + |
| 6 | +The current benchmark pipeline supports source embedding columns with `f32` |
| 7 | +or `f64` elements. The lower-level `list_to_vector_ext` conversion helper can |
| 8 | +rewrap `f16` lists as `Vector` extension arrays, but `vector-search-bench` |
| 9 | +itself does not yet support `f16` in query extraction or in the hand-rolled |
| 10 | +parquet baseline. |
| 11 | + |
| 12 | +## What it measures |
| 13 | + |
| 14 | +For each `(dataset, format)` pair, the benchmark records: |
| 15 | + |
| 16 | +1. **`nbytes`** — in-memory footprint of the variant's array tree, in bytes. |
| 17 | + Reporting the in-memory `.nbytes()` instead of an on-disk file size is |
| 18 | + deliberate: the Vortex default write path runs BtrBlocks on every tree |
| 19 | + regardless of whether it's already compressed, so "on-disk size" would |
| 20 | + collapse `vortex-uncompressed` and `vortex-default` to the same bytes |
| 21 | + even though their in-memory trees are different. The `nbytes()` |
| 22 | + number is consistent with what the *compute* measurements actually |
| 23 | + operate on. |
| 24 | + - The `handrolled` baseline reports the canonical parquet file size |
| 25 | + on disk — that's the only encoded representation it has. |
| 26 | +2. **Compress time** — wall time to build the variant tree from the |
| 27 | + materialized uncompressed source. ~0 for `vortex-uncompressed` (identity), |
| 28 | + meaningful for the two compressed variants. |
| 29 | +3. **Decompress time** — wall time to execute the variant tree all the way |
| 30 | + back into a canonical `FixedSizeListArray<f32>` with a materialized f32 |
| 31 | + element buffer. For `vortex-uncompressed` this is a no-op; for |
| 32 | + `vortex-default` it includes ALP-RD bit-unpacking; for |
| 33 | + `vortex-turboquant` it includes the inverse SORF rotation and |
| 34 | + dictionary lookup. |
| 35 | +4. **Cosine-similarity time** — `CosineSimilarity(data, const_query)` |
| 36 | + executed to a materialized f32 array. |
| 37 | +5. **Cosine-filter time** — `Binary(Gt, [CosineSimilarity, threshold])` |
| 38 | + executed to a `BoolArray`. |
| 39 | +6. **Recall@10** (TurboQuant only) — the fraction of the exact top-10 |
| 40 | + nearest neighbours that TurboQuant recovers, using the uncompressed |
| 41 | + Vortex scan as local ground truth. |
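As an illustration of item 6, here is a minimal sketch of a recall@k computation over two score arrays. The function names `top_k` and `recall_at_k` and the tie-breaking rule (lower row index wins) are assumptions made for this example, not the benchmark's actual code.

```rust
use std::collections::HashSet;

/// Indices of the `k` highest scores. Ties are broken by the lower row
/// index (an assumption of this sketch).
fn top_k(scores: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    idx.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]).then(a.cmp(&b)));
    idx.truncate(k);
    idx
}

/// Fraction of the exact top-k rows that also appear in the approximate
/// top-k. `exact` would come from the uncompressed scan and `approx` from
/// the lossy variant, both scored against the same query.
fn recall_at_k(exact: &[f32], approx: &[f32], k: usize) -> f64 {
    let truth: HashSet<usize> = top_k(exact, k).into_iter().collect();
    if truth.is_empty() {
        return 0.0;
    }
    let hits = top_k(approx, k)
        .into_iter()
        .filter(|i| truth.contains(i))
        .count();
    hits as f64 / truth.len() as f64
}
```

With `k = 10`, a result of `0.9` means the lossy scan recovered nine of the ten true nearest neighbours for that query.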
| 42 | + |
| 43 | +Before any timing starts, the benchmark runs a **correctness verification |
| 44 | +pass**: cosine scores for a single query are computed against every |
| 45 | +variant and compared to the uncompressed baseline. Lossless variants must |
| 46 | +match within `1e-4` max-abs-diff; TurboQuant must stay within `0.2`. A |
| 47 | +mismatch bails the run — you cannot publish throughput numbers for a |
| 48 | +variant that returns wrong answers. |
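The verification rule above can be sketched as follows. The tolerances `1e-4` and `0.2` come from the text; the function names, the inclusive comparison, and the decision to treat `NaN` scores as a failure are assumptions of this example.

```rust
/// Largest absolute elementwise difference between two score arrays.
/// A NaN difference propagates to the result so it cannot be masked.
fn max_abs_diff(baseline: &[f32], candidate: &[f32]) -> f32 {
    assert_eq!(
        baseline.len(),
        candidate.len(),
        "score arrays must have equal length"
    );
    baseline
        .iter()
        .zip(candidate)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0_f32, |max, d| if d.is_nan() || d > max { d } else { max })
}

/// Checks a variant's scores against the uncompressed baseline.
/// `lossy` selects the looser TurboQuant tolerance.
fn verify(baseline: &[f32], candidate: &[f32], lossy: bool) -> Result<(), String> {
    let tolerance: f32 = if lossy { 0.2 } else { 1e-4 };
    let diff = max_abs_diff(baseline, candidate);
    // `NaN <= tolerance` is false, so NaN scores fail the check.
    if diff <= tolerance {
        Ok(())
    } else {
        Err(format!("max-abs-diff {diff} exceeds tolerance {tolerance}"))
    }
}
```

Returning an `Err` here corresponds to bailing the run before any timing is recorded.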
| 49 | + |
| 50 | +## Formats |
| 51 | + |
| 52 | +- `handrolled` — Hand-rolled Rust scalar cosine loop over a flat |
| 53 | + `Vec<f32>` that was decoded from the canonical parquet file via |
| 54 | + `parquet-rs` / `arrow-rs`. The **decompress** phase does the parquet |
| 55 | + read, downcasts to `Float32Array`, and memcpies into a plain `Vec<f32>`. |
| 56 | + The **compute** phase is a plain scalar loop over `&[f32]` — no Arrow |
| 57 | + compute kernels, no scalar-function dispatch, no SIMD annotations. |
| 58 | + |
| 59 | + This is a **compute-cost floor**, not a realistic parquet-on-DBMS |
| 60 | + baseline. It answers the question "what's the minimum cost you could |
| 61 | + get away with if you wrote a vector-search scan by hand with no query |
| 62 | + engine?" Real parquet users would pay substantially more (DuckDB |
| 63 | + `list_cosine_similarity`, DataFusion with a vector UDF, etc.) — |
| 64 | + adding those as additional baselines is a natural v2 direction. |
| 65 | +- `vortex-uncompressed` — Raw `Vector<dim, f32>` extension array, no |
| 66 | + encoding-level compression applied. |
| 67 | +- `vortex-default` — `BtrBlocksCompressor::default()` applied to the FSL |
| 68 | + storage child. On float vectors this typically finds ~15% lossless |
| 69 | + savings via ALP-RD (mantissa/exponent split + bitpacking). |
| 70 | +- `vortex-turboquant` — The full |
| 71 | + `L2Denorm(SorfTransform(FSL(Dict(codes, centroids))), norms)` pipeline. |
| 72 | + Lossy; recall@10 is reported alongside throughput. At the default 8-bit |
| 73 | + config this typically gives ~3× storage reduction at >90% top-10 |
| 74 | + recall. |
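For orientation, the kind of scalar loop the `handrolled` format describes might look like the sketch below: a flat `&[f32]` buffer scanned row by row, followed by a threshold filter. The function names and the convention of scoring zero-norm rows as `0.0` are assumptions of this example, not the benchmark's actual implementation.

```rust
/// Cosine similarity of every `dim`-wide row in the flat buffer `data`
/// against `query`, using a plain scalar loop.
fn cosine_scan(data: &[f32], dim: usize, query: &[f32]) -> Vec<f32> {
    assert_eq!(query.len(), dim, "query must have `dim` elements");
    assert!(dim > 0 && data.len() % dim == 0, "data must hold whole rows");
    let query_norm = query.iter().map(|x| x * x).sum::<f32>().sqrt();
    data.chunks_exact(dim)
        .map(|row| {
            let mut dot = 0.0_f32;
            let mut row_sq = 0.0_f32;
            for (a, b) in row.iter().zip(query) {
                dot += a * b;
                row_sq += a * a;
            }
            let denom = row_sq.sqrt() * query_norm;
            // Assumed convention: a zero-norm row or query scores 0.0.
            if denom == 0.0 { 0.0 } else { dot / denom }
        })
        .collect()
}

/// One boolean per row: true where the score is strictly above `threshold`.
fn cosine_filter(scores: &[f32], threshold: f32) -> Vec<bool> {
    scores.iter().map(|&s| s > threshold).collect()
}
```

The two functions mirror the cosine-similarity and cosine-filter measurements: the first materializes an `f32` score per row, the second reduces those scores to a boolean mask.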
| 75 | + |
| 76 | +## Datasets |
| 77 | + |
| 78 | +The smallest built-in dataset is **Cohere-100K** (`cohere-small`): 100K |
| 79 | +rows × 768 dims, cosine metric, ~150 MB zstd-parquet. It's the smallest |
| 80 | +VectorDBBench-supplied corpus that still exercises every encoding path. |
| 81 | +Larger variants (`cohere-medium`, `openai-small`, `openai-medium`, |
| 82 | +`bioasq-medium`, `glove-medium`) are wired up for local / on-demand |
| 83 | +experiments; see `vortex-bench/src/vector_dataset.rs` for the full list. |
| 84 | + |
| 85 | +The upstream URL for Cohere-100K is |
| 86 | +`https://assets.zilliz.com/benchmark/cohere_small_100k/train.parquet`. |
| 87 | +The public Zilliz bucket allows anonymous reads, so the benchmark can |
| 88 | +download from it directly without credentials. |
| 89 | + |
| 90 | +## Running locally |
| 91 | + |
| 92 | +```bash |
| 93 | +cargo run -p vector-search-bench --release -- \ |
| 94 | + --datasets cohere-small \ |
| 95 | + --formats handrolled,vortex-uncompressed,vortex-default,vortex-turboquant \ |
| 96 | + --iterations 5 \ |
| 97 | + -d table |
| 98 | +``` |
| 99 | + |
| 100 | +The first run downloads the parquet file into |
| 101 | +`vortex-bench/data/cohere-small/cohere-small.parquet`; subsequent runs |
| 102 | +reuse that cached copy instead of downloading it again. |
| 103 | + |
| 104 | +## CI note: dataset mirror |
| 105 | + |
| 106 | +CI runs after every develop-branch merge. Hitting `assets.zilliz.com` |
| 107 | +from every merge would create recurring egress traffic on a third-party |
| 108 | +bucket — the same courtesy reason `RPlace` / `AirQuality` are excluded |
| 109 | +from CI in `compress-bench`. |
| 110 | + |
| 111 | +Before enabling the `vector-search-bench` entry in `.github/workflows/bench.yml` |
| 112 | +on a fork, either: |
| 113 | + |
| 114 | +1. **Mirror the file into an internal bucket** and swap the URL in |
| 115 | + `vortex-bench/src/vector_dataset.rs::VectorDataset::parquet_url`, or |
| 116 | +2. **Accept the upstream egress cost** and leave the URL as-is. |
| 117 | + |
| 118 | +The mirror step is a one-off `aws s3 cp` and is documented here rather |
| 119 | +than automated in the build because the destination bucket is |
| 120 | +organization-specific. |