
Commit 207c05b

connortsui20 and claude committed
vector search benchmarks
Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
1 parent f747201 commit 207c05b

18 files changed

Lines changed: 3908 additions & 10 deletions

.github/workflows/bench-pr.yml

Lines changed: 2 additions & 0 deletions
@@ -33,6 +33,8 @@ jobs:
 build_args: "--features lance"
 - id: compress-bench
 name: Compression
+- id: vector-search-bench
+name: Vector Similarity Search
 steps:
 - uses: runs-on/action@v2
 if: github.event.pull_request.head.repo.fork == false

.github/workflows/bench.yml

Lines changed: 4 additions & 0 deletions
@@ -49,6 +49,10 @@ jobs:
 name: Compression
 build_args: "--features lance"
 formats: "parquet,lance,vortex"
+- id: vector-search-bench
+name: Vector Similarity Search
+build_args: ""
+formats: "handrolled,vortex-uncompressed,vortex-default,vortex-turboquant"
 steps:
 - uses: runs-on/action@v2
 if: github.repository == 'vortex-data/vortex'

Cargo.lock

Lines changed: 22 additions & 0 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -59,6 +59,7 @@ members = [
 "benchmarks/datafusion-bench",
 "benchmarks/duckdb-bench",
 "benchmarks/random-access-bench",
+"benchmarks/vector-search-bench",
 ]
 exclude = ["java/testfiles", "wasm-test"]
 resolver = "2"

benchmarks/datafusion-bench/src/lib.rs

Lines changed: 1 addition & 1 deletion
@@ -114,7 +114,7 @@ pub fn format_to_df_format(format: Format) -> Arc<dyn FileFormat> {
 Format::Csv => Arc::new(CsvFormat::default()) as _,
 Format::Arrow => Arc::new(ArrowFormat),
 Format::Parquet => Arc::new(ParquetFormat::new()),
-Format::OnDiskVortex | Format::VortexCompact => {
+Format::OnDiskVortex | Format::VortexCompact | Format::VortexLossy => {
 Arc::new(VortexFormat::new(SESSION.clone()))
 }
 Format::OnDiskDuckDB | Format::Lance => {
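The one-line change above adds `Format::VortexLossy` to an existing or-pattern arm. A minimal sketch of that pattern follows, using a simplified stand-in enum and a hypothetical `reader_name` function (neither is the real `vortex-bench` type or API). It assumes the match has no wildcard arm, in which case Rust refuses to compile until the new variant is covered somewhere.

```rust
/// Simplified stand-in for the benchmark `Format` enum (assumption: the real
/// enum has more variants and lives in another crate).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Format {
    Parquet,
    OnDiskVortex,
    VortexCompact,
    VortexLossy,
}

/// Hypothetical dispatch function: every Vortex-backed variant shares one arm.
fn reader_name(format: Format) -> &'static str {
    match format {
        Format::Parquet => "parquet",
        // Or-pattern: the lossy variant reuses the same reader as the
        // other Vortex variants, so it is appended to this arm.
        Format::OnDiskVortex | Format::VortexCompact | Format::VortexLossy => "vortex",
    }
}
```

Grouping the variants in one arm keeps the reader selection in a single place, which appears to be the intent of the diff: the lossy variant is read through the same format implementation as the lossless ones.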
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+[package]
+name = "vector-search-bench"
+description = "Vector similarity search benchmarks for Vortex on public embedding datasets"
+authors.workspace = true
+categories.workspace = true
+edition.workspace = true
+homepage.workspace = true
+include.workspace = true
+keywords.workspace = true
+license.workspace = true
+readme.workspace = true
+repository.workspace = true
+rust-version.workspace = true
+version.workspace = true
+publish = false
+
+[dependencies]
+anyhow = { workspace = true }
+arrow-array = { workspace = true }
+arrow-buffer = { workspace = true }
+arrow-schema = { workspace = true }
+clap = { workspace = true, features = ["derive"] }
+indicatif = { workspace = true }
+parquet = { workspace = true }
+tabled = { workspace = true, features = ["std"] }
+tokio = { workspace = true, features = ["full"] }
+tracing = { workspace = true }
+vortex = { workspace = true }
+vortex-bench = { workspace = true }
+vortex-btrblocks = { workspace = true }
+vortex-tensor = { workspace = true }
+
+[dev-dependencies]
+tempfile = { workspace = true }
+
+[lints]
+workspace = true
Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
+# vector-search-bench
+
+Brute-force cosine-similarity benchmark for Vortex on public VectorDBBench
+embedding corpora.
+
+The current benchmark pipeline supports source embedding columns with `f32`
+or `f64` elements. The lower-level `list_to_vector_ext` conversion helper can
+rewrap `f16` lists as `Vector` extension arrays, but `vector-search-bench`
+itself does not yet support `f16` in query extraction or in the hand-rolled
+parquet baseline.
+
+## What it measures
+
+For each `(dataset, format)` pair, the benchmark records:
+
+1. **`nbytes`**: in-memory footprint of the variant's array tree, in bytes.
+Reporting the in-memory `.nbytes()` instead of an on-disk file size is
+deliberate: the Vortex default write path runs BtrBlocks on every tree
+regardless of whether it's already compressed, so "on-disk size" would
+collapse `vortex-uncompressed` and `vortex-default` to the same bytes
+even though their in-memory trees are different. The `nbytes()`
+number is consistent with what the *compute* measurements actually
+operate on.
+- The `handrolled` baseline reports the canonical parquet file size
+on disk, since that's the only encoded representation it has.
+2. **Compress time**: wall time to build the variant tree from the
+materialized uncompressed source. ~0 for `vortex-uncompressed` (identity),
+meaningful for the two compressed variants.
+3. **Decompress time**: wall time to execute the variant tree all the way
+back into a canonical `FixedSizeListArray<f32>` with a materialized f32
+element buffer. For `vortex-uncompressed` this is a no-op; for
+`vortex-default` it includes ALP-RD bit-unpacking; for
+`vortex-turboquant` it includes the inverse SORF rotation and
+dictionary lookup.
+4. **Cosine-similarity time**: `CosineSimilarity(data, const_query)`
+executed to a materialized f32 array.
+5. **Cosine-filter time**: `Binary(Gt, [CosineSimilarity, threshold])`
+executed to a `BoolArray`.
+6. **Recall@10** (TurboQuant only): the fraction of the exact top-10
+nearest neighbours that TurboQuant recovers, using the uncompressed
+Vortex scan as local ground truth.
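The Recall@10 definition in item 6 can be sketched as follows. This is an illustrative sketch rather than the benchmark's own code: `top_k_indices` and `recall_at_k` are hypothetical names, and the tie-breaking rule (lower index wins) and the empty-input convention are assumptions.

```rust
use std::collections::HashSet;

/// Indices of the `k` highest scores, best first.
/// Assumption: ties are broken in favour of the lower row index.
fn top_k_indices(scores: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    // `total_cmp` is a total order over f32, so NaN scores cannot panic the sort.
    idx.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]).then(a.cmp(&b)));
    idx.truncate(k);
    idx
}

/// Fraction of the exact top-k rows that also appear in the approximate top-k.
/// `exact` plays the role of the uncompressed scan, `approx` the lossy variant.
fn recall_at_k(exact: &[f32], approx: &[f32], k: usize) -> f64 {
    assert_eq!(exact.len(), approx.len(), "score vectors must cover the same rows");
    let truth: HashSet<usize> = top_k_indices(exact, k).into_iter().collect();
    if truth.is_empty() {
        // Assumption: with nothing to recover, recall is reported as perfect.
        return 1.0;
    }
    let hits = top_k_indices(approx, k)
        .into_iter()
        .filter(|i| truth.contains(i))
        .count();
    hits as f64 / truth.len() as f64
}
```

Calling `recall_at_k(&exact_scores, &turboquant_scores, 10)` per query and averaging over queries would give a dataset-level figure; whether the benchmark averages this way is not stated in the README.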
+
+Before any timing starts, the benchmark runs a **correctness verification
+pass**: cosine scores for a single query are computed against every
+variant and compared to the uncompressed baseline. Lossless variants must
+match within `1e-4` max-abs-diff; TurboQuant must stay within `0.2`. A
+mismatch aborts the run: you cannot publish throughput numbers for a
+variant that returns wrong answers.
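The verification pass described above amounts to a max-abs-diff comparison against per-variant tolerances. A minimal sketch follows, using the two thresholds quoted in the README; the function names, the `lossy` flag, and the decision to treat NaN as a failure are assumptions, not the benchmark's actual implementation.

```rust
/// Tolerances quoted in the README.
const LOSSLESS_TOLERANCE: f32 = 1e-4;
const TURBOQUANT_TOLERANCE: f32 = 0.2;

/// Largest absolute elementwise difference between two score vectors.
/// A NaN difference propagates to the result instead of being ignored.
fn max_abs_diff(baseline: &[f32], candidate: &[f32]) -> f32 {
    assert_eq!(baseline.len(), candidate.len(), "score vectors must have equal length");
    baseline
        .iter()
        .zip(candidate)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0_f32, |worst, d| if d.is_nan() || d > worst { d } else { worst })
}

/// Hypothetical check: returns an error message when a variant drifts too far
/// from the uncompressed baseline, so the caller can abort before timing.
fn verify_scores(name: &str, baseline: &[f32], candidate: &[f32], lossy: bool) -> Result<(), String> {
    let tolerance = if lossy { TURBOQUANT_TOLERANCE } else { LOSSLESS_TOLERANCE };
    let diff = max_abs_diff(baseline, candidate);
    // A NaN diff fails this comparison, so NaN scores are rejected too.
    if diff <= tolerance {
        Ok(())
    } else {
        Err(format!("{name}: max-abs-diff {diff} exceeds tolerance {tolerance}"))
    }
}
```

Folding with a NaN-propagating maximum rather than `f32::max` matters here: `f32::max` silently drops NaN operands, which would let a variant producing NaN scores pass the check.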
+
+## Formats
+
+- `handrolled`: Hand-rolled Rust scalar cosine loop over a flat
+`Vec<f32>` that was decoded from the canonical parquet file via
+`parquet-rs` / `arrow-rs`. The **decompress** phase does the parquet
+read, downcasts to `Float32Array`, and memcpies into a plain `Vec<f32>`.
+The **compute** phase is a plain scalar loop over `&[f32]`: no Arrow
+compute kernels, no scalar-function dispatch, no SIMD annotations.
+
+This is a **compute-cost floor**, not a realistic parquet-on-DBMS
+baseline. It answers the question "what's the minimum cost you could
+get away with if you wrote a vector-search scan by hand with no query
+engine?" Real parquet users would pay substantially more (DuckDB
+`list_cosine_similarity`, DataFusion with a vector UDF, etc.), so
+adding those as additional baselines is a natural v2 direction.
+- `vortex-uncompressed`: Raw `Vector<dim, f32>` extension array, no
+encoding-level compression applied.
+- `vortex-default`: `BtrBlocksCompressor::default()` applied to the FSL
+storage child. On float vectors this typically finds ~15% lossless
+savings via ALP-RD (mantissa/exponent split + bitpacking).
+- `vortex-turboquant`: The full
+`L2Denorm(SorfTransform(FSL(Dict(codes, centroids))), norms)` pipeline.
+Lossy; recall@10 is reported alongside throughput. At the default 8-bit
+config this typically gives ~3× storage reduction at >90% top-10
+recall.
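To make the `handrolled` compute phase concrete, here is a minimal sketch of a scalar cosine scan over a flat `&[f32]` buffer, plus the threshold filter it feeds. The function names are hypothetical and this is not the benchmark's code; in particular, returning `0.0` for a zero-norm vector is an assumed convention.

```rust
/// Cosine similarity of two equal-length slices using a plain scalar loop.
/// Assumption: a zero-norm input yields 0.0 rather than NaN.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let (mut dot, mut norm_a, mut norm_b) = (0.0_f32, 0.0_f32, 0.0_f32);
    for (x, y) in a.iter().zip(b) {
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    let denom = (norm_a * norm_b).sqrt();
    if denom == 0.0 { 0.0 } else { dot / denom }
}

/// Scores every `dim`-wide row of the flat buffer against one query vector.
fn cosine_scan(data: &[f32], dim: usize, query: &[f32]) -> Vec<f32> {
    assert!(dim > 0 && data.len() % dim == 0, "buffer must hold whole rows");
    assert_eq!(query.len(), dim, "query must match the row width");
    data.chunks_exact(dim).map(|row| cosine(row, query)).collect()
}

/// Boolean mask of rows whose score is strictly greater than `threshold`.
fn cosine_filter(scores: &[f32], threshold: f32) -> Vec<bool> {
    scores.iter().map(|&s| s > threshold).collect()
}
```

For a buffer holding the rows `[1, 0]`, `[0, 1]` and `[1, 1]` with query `[1, 0]`, the scan scores the first row at 1, the second at 0, and the third at roughly 0.707, so a threshold of 0.5 keeps the first and third rows.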
+
+## Datasets
+
+The smallest built-in dataset is **Cohere-100K** (`cohere-small`): 100K
+rows × 768 dims, cosine metric, ~150 MB zstd-parquet. It's the smallest
+VectorDBBench-supplied corpus that still exercises every encoding path.
+Larger variants (`cohere-medium`, `openai-small`, `openai-medium`,
+`bioasq-medium`, `glove-medium`) are wired up for local / on-demand
+experiments; see `vortex-bench/src/vector_dataset.rs` for the full list.
+
+The upstream URL for Cohere-100K is
+`https://assets.zilliz.com/benchmark/cohere_small_100k/train.parquet`.
+The public Zilliz bucket is anonymous-readable so the code can hit it
+directly.
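As a back-of-envelope companion to the sizes quoted in this README, the sketch below computes the raw in-memory size of Cohere-100K, assuming its elements are stored as 4-byte `f32` values (the README does not state the element type of this dataset explicitly). Both helper names are hypothetical.

```rust
/// Raw size in bytes of `rows` vectors with `dim` elements of `elem_bytes` each.
/// Hypothetical helper for illustration; not part of the benchmark crate.
fn raw_vector_bytes(rows: u64, dim: u64, elem_bytes: u64) -> u64 {
    rows * dim * elem_bytes
}

/// Size remaining after removing `percent_saved` percent, in integer arithmetic.
fn after_savings(bytes: u64, percent_saved: u64) -> u64 {
    assert!(percent_saved <= 100, "savings cannot exceed 100 percent");
    bytes * (100 - percent_saved) / 100
}
```

Under that assumption, `raw_vector_bytes(100_000, 768, 4)` gives 307,200,000 bytes, about 307 MB uncompressed. Applying the README's "typically ~15%" lossless figure with `after_savings(raw, 15)` gives roughly 261 MB, and the "~3×" TurboQuant figure would put that variant near 102 MB. These are rough expectations derived from the quoted ratios, not measured results.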
+
+## Running locally
+
+```bash
+cargo run -p vector-search-bench --release -- \
+--datasets cohere-small \
+--formats handrolled,vortex-uncompressed,vortex-default,vortex-turboquant \
+--iterations 5 \
+-d table
+```
+
+The first run downloads the parquet file into
+`vortex-bench/data/cohere-small/cohere-small.parquet` and caches it
+idempotently for subsequent runs.
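One common way to get the idempotent caching described above is a check-then-fetch helper that writes to a temporary file and renames it into place. The sketch below uses only the standard library and takes the download as a closure, so no HTTP client is assumed. `ensure_cached` is a hypothetical name, and the real download logic in `vortex-bench` may differ.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Ensures `dest` exists, calling `fetch` only when it does not.
/// Returns Ok(true) if the file was fetched, Ok(false) if the cache was reused.
fn ensure_cached<F>(dest: &Path, fetch: F) -> io::Result<bool>
where
    F: FnOnce() -> io::Result<Vec<u8>>,
{
    if dest.exists() {
        // Cache hit: repeated runs are no-ops.
        return Ok(false);
    }
    if let Some(parent) = dest.parent() {
        fs::create_dir_all(parent)?;
    }
    let bytes = fetch()?;
    // Write to a sibling temp file and rename, so an interrupted run never
    // leaves a truncated file at the final path.
    let tmp = dest.with_extension("partial");
    fs::write(&tmp, &bytes)?;
    fs::rename(&tmp, dest)?;
    Ok(true)
}
```

The closure keeps the sketch independent of any particular network stack; a caller would pass whatever performs the actual GET against the dataset URL.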
+
+## CI note: dataset mirror
+
+CI runs after every develop-branch merge. Hitting `assets.zilliz.com`
+from every merge would create recurring egress traffic on a third-party
+bucket, the same courtesy reason `RPlace` / `AirQuality` are excluded
+from CI in `compress-bench`.
+
+Before enabling the `vector-search-bench` entry in `.github/workflows/bench.yml`
+on a fork, either:
+
+1. **Mirror the file into an internal bucket** and swap the URL in
+`vortex-bench/src/vector_dataset.rs::VectorDataset::parquet_url`, or
+2. **Accept the upstream egress cost** and leave the URL as-is.
+
+The mirror step is a one-off `aws s3 cp` and is documented here rather
+than automated in the build because the destination bucket is
+organization-specific.
