Commit b6ea352

claude authored and connortsui20 committed
add recall@K vs neighbors.parquet + README rewrite
src/recall.rs
  per-flavor recall driver: samples N test rows, runs brute-force top-K cosine over every shard via a bounded BinaryHeap, compares against the neighbors.parquet ground truth, reports mean + p05 recall

src/main.rs
  --recall / --recall-k / --recall-queries / --recall-seed flags; bails when the dataset has no neighbors hosted; skips lossless flavors (trivially 1.0)

src/display.rs
  extra recall@K (mean) and (p05) rows, only emitted when --recall produced results

tests/recall_smoke.rs
  8-row standard-basis dataset where train row i is basis e_i and neighbors_id[i] = i. Lossless flavor must hit recall@1 = 1.0.

README is fully rewritten to reflect the new on-disk file-scan benchmark, the layout / partitioned model, the f32-only pipeline, and the future-work backlog.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
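The commit message describes the recall driver as a brute-force top-K cosine search that keeps candidates in a bounded `BinaryHeap`. The sketch below illustrates that general technique only; it is not the code in `src/recall.rs`. The names `Scored`, `cosine`, `top_k_cosine`, and `recall_at_k`, the flat row-major `&[f32]` layout, and the tie-breaking rule (lower row index wins) are all assumptions made for illustration.

```rust
use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;

/// Hypothetical (score, row) pair with a total order so it can live in a heap.
#[derive(Debug, Clone, Copy)]
struct Scored {
    score: f32,
    row: usize,
}

impl Ord for Scored {
    fn cmp(&self, other: &Self) -> Ordering {
        // Higher score is "greater"; on ties the lower row index is "greater".
        self.score
            .total_cmp(&other.score)
            .then_with(|| other.row.cmp(&self.row))
    }
}
impl PartialOrd for Scored {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Scored {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Scored {}

/// Plain scalar cosine similarity; returns 0.0 when either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let (mut dot, mut na, mut nb) = (0.0f32, 0.0f32, 0.0f32);
    for (x, y) in a.iter().zip(b) {
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    let denom = (na * nb).sqrt();
    if denom == 0.0 { 0.0 } else { dot / denom }
}

/// Row indices of the `k` most similar rows, best first.
/// `train` is assumed to be row-major with `dim` floats per row.
fn top_k_cosine(train: &[f32], dim: usize, query: &[f32], k: usize) -> Vec<usize> {
    assert!(dim > 0, "dim must be positive");
    // `Reverse` turns the max-heap into a min-heap: the root is the worst kept row.
    let mut heap: BinaryHeap<Reverse<Scored>> = BinaryHeap::with_capacity(k + 1);
    for (row, v) in train.chunks_exact(dim).enumerate() {
        heap.push(Reverse(Scored { score: cosine(v, query), row }));
        if heap.len() > k {
            heap.pop(); // evict the current worst so the heap never exceeds k
        }
    }
    // Ascending order of `Reverse<Scored>` is descending order of score.
    heap.into_sorted_vec().into_iter().map(|Reverse(s)| s.row).collect()
}

/// Fraction of ground-truth ids that appear in `found`.
fn recall_at_k(found: &[usize], truth: &[usize]) -> f32 {
    if truth.is_empty() {
        return 0.0;
    }
    let hits = truth.iter().filter(|id| found.contains(id)).count();
    hits as f32 / truth.len() as f32
}
```

Bounding the heap at `k` keeps memory at O(k) per query regardless of how many shards are scanned. On the standard-basis setup described for `tests/recall_smoke.rs` (train row i is e_i, ground truth for query i is i), this sketch returns row i as the top-1 hit for every query, so its recall@1 is 1.0.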
1 parent 4640234 commit b6ea352

7 files changed

Lines changed: 777 additions & 114 deletions

Lines changed: 95 additions & 107 deletions
Original file line number | Diff line number | Diff line change
@@ -1,120 +1,108 @@
11
# vector-search-bench
22

3-
Brute-force cosine-similarity benchmark for Vortex on public VectorDBBench
4-
embedding corpora.
3+
On-disk cosine-similarity scan benchmark for Vortex on public VectorDBBench
4+
embedding corpora. The benchmark writes one `.vortex` file per train shard per
5+
flavor and then issues filtered scans against the resulting files, so the
6+
numbers reflect realistic out-of-memory workloads — not in-memory `ArrayRef`
7+
manipulation.
58

6-
The current benchmark pipeline supports source embedding columns with `f32`
7-
or `f64` elements. The lower-level `list_to_vector_ext` conversion helper can
8-
rewrap `f16` lists as `Vector` extension arrays, but `vector-search-bench`
9-
itself does not yet support `f16` query extraction or the hand-rolled parquet
10-
baseline.
11-
12-
## What it measures
13-
14-
For each `(dataset, format)` pair, the benchmark records:
15-
16-
1. **`nbytes`** — in-memory footprint of the variant's array tree, in bytes.
17-
Reporting the in-memory `.nbytes()` instead of an on-disk file size is
18-
deliberate: the Vortex default write path runs BtrBlocks on every tree
19-
regardless of whether it's already compressed, so "on-disk size" would
20-
collapse `vortex-uncompressed` and `vortex-default` to the same bytes
21-
even though their in-memory trees are different. The `nbytes()`
22-
number is consistent with what the *compute* measurements actually
23-
operate on.
24-
- The `handrolled` baseline reports the canonical parquet file size
25-
on disk — that's the only encoded representation it has.
26-
2. **Compress time** — wall time to build the variant tree from the
27-
materialized uncompressed source. ~0 for `vortex-uncompressed` (identity),
28-
meaningful for the two compressed variants.
29-
3. **Decompress time** — wall time to execute the variant tree all the way
30-
back into a canonical `FixedSizeListArray<f32>` with a materialized f32
31-
element buffer. For `vortex-uncompressed` this is a no-op; for
32-
`vortex-default` it includes ALP-RD bit-unpacking; for
33-
`vortex-turboquant` it includes the inverse SORF rotation and
34-
dictionary lookup.
35-
4. **Cosine-similarity time** — `CosineSimilarity(data, const_query)`
36-
executed to a materialized f32 array.
37-
5. **Cosine-filter time** — `Binary(Gt, [CosineSimilarity, threshold])`
38-
executed to a `BoolArray`.
39-
6. **Recall@10** (TurboQuant only) — the fraction of the exact top-10
40-
nearest neighbours that TurboQuant recovers, using the uncompressed
41-
Vortex scan as local ground truth.
42-
43-
Before any timing starts, the benchmark runs a **correctness verification
44-
pass**: cosine scores for a single query are computed against every
45-
variant and compared to the uncompressed baseline. Lossless variants must
46-
match within `1e-4` max-abs-diff; TurboQuant must stay within `0.2`. A
47-
mismatch bails the run — you cannot publish throughput numbers for a
48-
variant that returns wrong answers.
49-
50-
## Formats
51-
52-
- `handrolled` — Hand-rolled Rust scalar cosine loop over a flat
53-
`Vec<f32>` that was decoded from the canonical parquet file via
54-
`parquet-rs` / `arrow-rs`. The **decompress** phase does the parquet
55-
read, downcasts to `Float32Array`, and memcpies into a plain `Vec<f32>`.
56-
The **compute** phase is a plain scalar loop over `&[f32]` — no Arrow
57-
compute kernels, no scalar-function dispatch, no SIMD annotations.
58-
59-
This is a **compute-cost floor**, not a realistic parquet-on-DBMS
60-
baseline. It answers the question "what's the minimum cost you could
61-
get away with if you wrote a vector-search scan by hand with no query
62-
engine?" Real parquet users would pay substantially more (DuckDB
63-
`list_cosine_similarity`, DataFusion with a vector UDF, etc.) —
64-
adding those as additional baselines is a natural v2 direction.
65-
- `vortex-uncompressed` — Raw `Vector<dim, f32>` extension array, no
66-
encoding-level compression applied.
67-
- `vortex-default` — `BtrBlocksCompressor::default()` applied to the FSL
68-
storage child. On float vectors this typically finds ~15% lossless
69-
savings via ALP-RD (mantissa/exponent split + bitpacking).
70-
- `vortex-turboquant` — The full
71-
`L2Denorm(SorfTransform(FSL(Dict(codes, centroids))), norms)` pipeline.
72-
Lossy; recall@10 is reported alongside throughput. At the default 8-bit
73-
config this typically gives ~3× storage reduction at >90% top-10
74-
recall.
75-
76-
## Datasets
77-
78-
The smallest built-in dataset is **Cohere-100K** (`cohere-small`): 100K
79-
rows × 768 dims, cosine metric, ~150 MB zstd-parquet. It's the smallest
80-
VectorDBBench-supplied corpus that still exercises every encoding path.
81-
Larger variants (`cohere-medium`, `openai-small`, `openai-medium`,
82-
`bioasq-medium`, `glove-medium`) are wired up for local / on-demand
83-
experiments; see `vortex-bench/src/vector_dataset.rs` for the full list.
84-
85-
The upstream URL for Cohere-100K is
86-
`https://assets.zilliz.com/benchmark/cohere_small_100k/train.parquet`.
87-
The public Zilliz bucket is anonymous-readable so the code can hit it
88-
directly.
89-
90-
## Running locally
9+
## Quick start
9110

9211
```bash
9312
cargo run -p vector-search-bench --release -- \
94-
--datasets cohere-small \
95-
--formats handrolled,vortex-uncompressed,vortex-default,vortex-turboquant \
96-
--iterations 5 \
97-
-d table
13+
--dataset cohere-small-100k \
14+
--flavors vortex-uncompressed,vortex-turboquant,handrolled \
15+
--iterations 3 \
16+
--threshold 0.8
9817
```
9918

100-
The first run downloads the parquet file into
101-
`vortex-bench/data/cohere-small/cohere-small.parquet` and caches it
102-
idempotently for subsequent runs.
19+
The first run downloads the parquet shards into
20+
`vortex-bench/data/vector-search/<dataset>/<layout>/train/...`, ingests them
21+
into per-flavor `.vortex` files in sibling directories, samples a query row
22+
from `test.parquet`, and runs the timed scan loop.
10323

104-
## CI note: dataset mirror
24+
A dataset that publishes more than one layout (e.g. `cohere-large-10m`
25+
hosts both `partitioned` and `partitioned-shuffled`) requires `--layout` to
26+
disambiguate.
10527

106-
CI runs after every develop-branch merge. Hitting `assets.zilliz.com`
107-
from every merge would create recurring egress traffic on a third-party
108-
bucket — the same courtesy reason `RPlace` / `AirQuality` are excluded
109-
from CI in `compress-bench`.
28+
## What it measures
11029

111-
Before enabling the `vector-search-bench` entry in `.github/workflows/bench.yml`
112-
on a fork, either:
30+
Per `(dataset, flavor)`:
31+
32+
| Metric | What it is |
33+
|---------------------|---------------------------------------------------------|
34+
| compress wall | Sum of per-shard write time (parquet → `.vortex`). |
35+
| input bytes | Sum of input parquet shard sizes. |
36+
| output bytes | Sum of output `.vortex` shard sizes. |
37+
| compression ratio | input bytes / output bytes. |
38+
| scan wall (best) | Best-of-N wall-clock for the per-iteration scan. |
39+
| scan wall (median) | Median wall-clock for the per-iteration scan. |
40+
| matches | Rows that survived `cosine(emb, query) > threshold`. |
41+
| rows scanned | Total rows in the `.vortex` files (sanity check). |
42+
| rows / sec | rows scanned / scan wall (best). |
43+
| recall@K (mean/p05) | Only emitted when `--recall` is passed (lossy flavors). |
44+
45+
## Flavors
46+
47+
- **`vortex-uncompressed`** — `BtrBlocksCompressorBuilder::empty()`. Vortex
48+
framing with no compression schemes registered, so the `emb` column lands
49+
as canonical `FixedSizeList<f32>` on disk. Lossless ceiling on the size
50+
axis.
51+
- **`vortex-turboquant`** — `BtrBlocksCompressorBuilder::empty().with_turboquant()`.
52+
Only the TurboQuant scheme is registered, so the `emb` column ends up
53+
wrapped as `L2Denorm(SorfTransform(FixedSizeList(Dict)))`. Lossy; significant
54+
size win.
55+
- **`handrolled`** — Sequential parquet scan + 4-way unrolled scalar cosine
56+
loop over a flat `Vec<f32>` (decoded via `parquet-rs` / `arrow-rs`). This
57+
is a *compute-cost floor*, not a realistic parquet-on-DBMS baseline. Real
58+
parquet users would pay substantially more (DuckDB
59+
`list_cosine_similarity`, DataFusion with a vector UDF, etc.) — adding
60+
those as additional baselines is a natural future direction.
61+
62+
The benchmark always operates in `f32`. The ingest pipeline casts `f64`
63+
sources (e.g. OpenAI corpora) to `f32` once at write time, so all downstream
64+
code is uniformly `f32`.
11365

114-
1. **Mirror the file into an internal bucket** and swap the URL in
115-
`vortex-bench/src/vector_dataset.rs::VectorDataset::parquet_url`, or
116-
2. **Accept the upstream egress cost** and leave the URL as-is.
66+
## Datasets
11767

118-
The mirror step is a one-off `aws s3 cp` and is documented here rather
119-
than automated in the build because the destination bucket is
120-
organization-specific.
68+
All 16 published VectorDBBench corpora are wired into the catalog, with
69+
explicit declarations of which train-split layouts upstream actually hosts.
70+
See `vortex-bench/src/vector_dataset/catalog.rs` for the full table. The CLI
71+
lists the available choices when run with `--help`.
72+
73+
| Dataset | dim | rows | layouts |
74+
|--------------------|------|------|---------------------------------------------|
75+
| cohere-small-100k | 768 | 100K | single, single-shuffled |
76+
| cohere-medium-1m | 768 | 1M | single, single-shuffled |
77+
| cohere-large-10m | 768 | 10M | partitioned (10), partitioned-shuffled (10) |
78+
| openai-small-50k | 1536 | 50K | single, single-shuffled |
79+
| openai-medium-500k | 1536 | 500K | single, single-shuffled |
80+
| openai-large-5m | 1536 | 5M | partitioned (10), partitioned-shuffled (10) |
81+
| bioasq-medium-1m | 1024 | 1M | single-shuffled |
82+
| bioasq-large-10m | 1024 | 10M | partitioned-shuffled (10) |
83+
| glove-{small,medium}, gist-{small,medium} | varies | varies | single only |
84+
| sift-small-500k | 128 | 500K | single |
85+
| sift-medium-5m | 128 | 5M | single |
86+
| sift-large-50m | 128 | 50M | partitioned (50) |
87+
| laion-large-100m | 768 | 100M | partitioned (100) |
88+
89+
## Recall@K
90+
91+
Pass `--recall --recall-k 10 --recall-queries 100` to measure recall against
92+
`neighbors.parquet`. The lossless `vortex-uncompressed` flavor is skipped
93+
because its recall is 1.0 by construction; only `vortex-turboquant` is
94+
measured. Datasets that don't host `neighbors.parquet` (sift, glove, gist)
95+
bail out when `--recall` is set.
96+
97+
## Future work
98+
99+
1. Native `f64` flavor — drop the prepare-time downcast for OpenAI datasets.
100+
2. `--decompress-only` mode — project + drain, no filter — for pure decode
101+
timing.
102+
3. Filtered scans via `scalar_labels` (already projected through the ingest
103+
pipeline; the `neighbors_int_*p.parquet` and `neighbors_labels_*.parquet`
104+
ground-truth files exist for verification).
105+
4. DuckDB / DataFusion parquet baselines — real engines, not just hand-rolled.
106+
5. MSE-vs-ground-truth correctness mode (catches "right top-K, wrong scores").
107+
6. Promote the cosine-filter expression helpers from `expression.rs` into
108+
`vortex-tensor::vector_search` if a second caller materializes.
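
The rewritten README describes the `handrolled` flavor as a 4-way unrolled scalar cosine loop over a flat `Vec<f32>`, and the `matches` metric as the rows that survive `cosine(emb, query) > threshold`. A minimal sketch of both ideas follows. It is an illustration under assumptions, not the benchmark's actual loop: `cosine_unrolled4` and `count_matches` are hypothetical names, and the row-major layout with `dim` floats per row is assumed.

```rust
/// Cosine similarity using four independent accumulator lanes.
/// Keeping the lanes separate breaks the dependency chain between iterations.
fn cosine_unrolled4(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must share a dimension");
    let mut dot = [0.0f32; 4];
    let mut na = [0.0f32; 4];
    let mut nb = [0.0f32; 4];

    let mut ca = a.chunks_exact(4);
    let mut cb = b.chunks_exact(4);
    for (x, y) in ca.by_ref().zip(cb.by_ref()) {
        for lane in 0..4 {
            dot[lane] += x[lane] * y[lane];
            na[lane] += x[lane] * x[lane];
            nb[lane] += y[lane] * y[lane];
        }
    }

    // Fold the lanes, then handle the 0..=3 leftover elements.
    let mut d: f32 = dot.iter().sum();
    let mut sa: f32 = na.iter().sum();
    let mut sb: f32 = nb.iter().sum();
    for (x, y) in ca.remainder().iter().zip(cb.remainder()) {
        d += x * y;
        sa += x * x;
        sb += y * y;
    }

    let denom = (sa * sb).sqrt();
    if denom == 0.0 { 0.0 } else { d / denom }
}

/// Number of rows in a flat row-major buffer whose cosine similarity
/// to `query` is strictly greater than `threshold`.
fn count_matches(train: &[f32], dim: usize, query: &[f32], threshold: f32) -> usize {
    assert!(dim > 0, "dim must be positive");
    train
        .chunks_exact(dim)
        .filter(|row| cosine_unrolled4(row, query) > threshold)
        .count()
}
```

Because the four lanes are summed in a different order than a single accumulator would use, results can differ from a naive loop in the last few bits, which matters if scores are compared against a tight tolerance.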

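The metrics table in the README defines compression ratio as input bytes / output bytes and rows / sec as rows scanned / scan wall (best). The arithmetic can be sketched as below. `ScanSummary`, `summarize_scan`, and `compression_ratio` are hypothetical names, and taking the upper-middle element as the median for an even number of iterations is an assumption, since the README does not say how the median is chosen.

```rust
use std::time::Duration;

/// Hypothetical summary of one flavor's timed scan iterations.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ScanSummary {
    best: Duration,
    median: Duration,
    rows_per_sec: f64,
}

/// Reduce per-iteration wall times to best, median, and throughput.
/// Returns `None` when no iterations were recorded.
fn summarize_scan(walls: &[Duration], rows_scanned: u64) -> Option<ScanSummary> {
    if walls.is_empty() {
        return None;
    }
    let mut sorted = walls.to_vec();
    sorted.sort();
    let best = sorted[0];
    // Assumption: upper-middle element for even-length inputs.
    let median = sorted[sorted.len() / 2];
    // Throughput is based on the best iteration; a zero-length best
    // duration yields infinity rather than panicking.
    let rows_per_sec = rows_scanned as f64 / best.as_secs_f64();
    Some(ScanSummary { best, median, rows_per_sec })
}

/// input bytes / output bytes. Values above 1.0 mean the output is
/// smaller than the input parquet; a zero-byte output is not finite.
fn compression_ratio(input_bytes: u64, output_bytes: u64) -> f64 {
    input_bytes as f64 / output_bytes as f64
}
```

Reporting the best iteration alongside the median separates the achievable throughput from the typical one, which helps when early iterations pay for a cold page cache.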
benchmarks/vector-search-bench/src/display.rs

Lines changed: 33 additions & 0 deletions
@@ -30,13 +30,16 @@ use tabled::settings::Style;
3030
use crate::compression::VortexCompression;
3131
use crate::handrolled::HandrolledTiming;
3232
use crate::prepare::CompressionResult;
33+
use crate::recall::RecallResult;
3334
use crate::scan::ScanTiming;
3435

3536
/// Final column-per-flavor row set for one dataset.
3637
pub struct DatasetReport<'a> {
3738
pub dataset_name: &'a str,
3839
pub vortex_results: &'a [(VortexCompression, &'a CompressionResult, &'a ScanTiming)],
3940
pub handrolled: Option<&'a HandrolledTiming>,
41+
/// Per-flavor recall results when `--recall` was requested. Empty otherwise.
42+
pub recall: &'a [RecallResult],
4043
}
4144

4245
/// Render the full report into the given writer as a tabled table.
@@ -115,6 +118,36 @@ pub fn render(report: &DatasetReport<'_>, writer: &mut dyn Write) -> Result<()>
115118
|h| format_throughput_rows(h.rows_scanned, h.best_of),
116119
));
117120

121+
if !report.recall.is_empty() {
122+
let k = report.recall[0].k;
123+
rows.push(make_row(
124+
&format!("recall@{k} (mean)"),
125+
report,
126+
|flavor, _, _| {
127+
report
128+
.recall
129+
.iter()
130+
.find(|r| r.flavor == flavor)
131+
.map(|r| format!("{:.3}", r.mean_recall))
132+
.unwrap_or_else(|| "—".to_owned())
133+
},
134+
|_| "—".to_owned(),
135+
));
136+
rows.push(make_row(
137+
&format!("recall@{k} (p05)"),
138+
report,
139+
|flavor, _, _| {
140+
report
141+
.recall
142+
.iter()
143+
.find(|r| r.flavor == flavor)
144+
.map(|r| format!("{:.3}", r.p05_recall))
145+
.unwrap_or_else(|| "—".to_owned())
146+
},
147+
|_| "—".to_owned(),
148+
));
149+
}
150+
118151
writeln!(writer, "## {}", report.dataset_name)?;
119152
let mut builder = tabled::builder::Builder::new();
120153
builder.push_record(headers);
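
The two recall rows in the hunk above print `mean_recall` and `p05_recall` with three decimals. How `src/recall.rs` derives those values is not shown in this diff, so the following is only a sketch of one plausible reduction from per-query recall values. `mean_and_p05` is a hypothetical name, and the nearest-rank definition of the 5th percentile is an assumption.

```rust
/// Reduce per-query recall values to (mean, p05).
/// p05 uses the nearest-rank method: the smallest sample such that at
/// least 5% of all samples are less than or equal to it.
/// Returns `None` for an empty input.
fn mean_and_p05(per_query: &[f32]) -> Option<(f32, f32)> {
    if per_query.is_empty() {
        return None;
    }
    let mean = per_query.iter().sum::<f32>() / per_query.len() as f32;

    let mut sorted = per_query.to_vec();
    sorted.sort_by(f32::total_cmp);
    // 1-based nearest rank, clamped so tiny inputs still pick the minimum.
    let rank = ((0.05 * sorted.len() as f64).ceil() as usize).max(1);
    Some((mean, sorted[rank - 1]))
}
```

The low percentile complements the mean: a lossy flavor can average high recall while a small set of queries does much worse, and p05 surfaces that tail.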

benchmarks/vector-search-bench/src/lib.rs

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ pub mod ingest;
1313
pub mod paths;
1414
pub mod prepare;
1515
pub mod query;
16+
pub mod recall;
1617
pub mod scan;
1718
pub mod scan_util;
1819
pub mod session;

benchmarks/vector-search-bench/src/main.rs

Lines changed: 67 additions & 0 deletions
@@ -25,6 +25,9 @@ use vector_search_bench::handrolled::run_handrolled_scan;
2525
use vector_search_bench::prepare::CompressionResult;
2626
use vector_search_bench::prepare::prepare_all;
2727
use vector_search_bench::query::sample_query;
28+
use vector_search_bench::recall::RecallConfig;
29+
use vector_search_bench::recall::RecallResult;
30+
use vector_search_bench::recall::measure_recall;
2831
use vector_search_bench::scan::ScanConfig;
2932
use vector_search_bench::scan::ScanTiming;
3033
use vector_search_bench::scan::run_scan;
@@ -68,6 +71,25 @@ struct Args {
6871
#[arg(long, default_value_t = 42)]
6972
query_seed: u64,
7073

74+
/// Measure Recall@K for lossy flavors against `neighbors.parquet`. Bails if the
75+
/// dataset doesn't host neighbors.
76+
#[arg(long, default_value_t = false)]
77+
recall: bool,
78+
79+
/// Number of query rows sampled when computing Recall@K. Distinct from --query-seed
80+
/// so the recall sampler can pick a different seeded set.
81+
#[arg(long, default_value_t = 100, value_parser = parse_positive_usize)]
82+
recall_queries: usize,
83+
84+
/// K in Recall@K. Defaults to 10 (matches VectorDBBench convention).
85+
#[arg(long, default_value_t = 10, value_parser = parse_positive_usize)]
86+
recall_k: usize,
87+
88+
/// Seed for the recall query sampler. Distinct from --query-seed so the throughput
89+
/// scan and the recall pass can pick non-correlated query sets.
90+
#[arg(long, default_value_t = 1234)]
91+
recall_seed: u64,
92+
7193
/// Optional path to write the rendered table to instead of stdout.
7294
#[arg(long)]
7395
output_path: Option<PathBuf>,
@@ -178,6 +200,50 @@ async fn main() -> Result<()> {
178200
})
179201
.transpose()?;
180202

203+
let recall_results = if args.recall {
204+
let neighbors_path = paths.neighbors.as_ref().with_context(|| {
205+
format!(
206+
"--recall requested but dataset {} does not host neighbors.parquet",
207+
dataset.name()
208+
)
209+
})?;
210+
let recall_config = RecallConfig {
211+
k: args.recall_k,
212+
num_queries: args.recall_queries,
213+
query_seed: args.recall_seed,
214+
};
215+
let mut out: Vec<RecallResult> = Vec::with_capacity(prepared.len());
216+
for prep in &prepared {
217+
// Lossless flavors are trivially 1.0; only TurboQuant needs measurement.
218+
if prep.flavor == VortexCompression::Uncompressed {
219+
tracing::info!(
220+
"skipping recall for lossless flavor {} (trivially 1.0)",
221+
prep.flavor.label()
222+
);
223+
continue;
224+
}
225+
let r = measure_recall(
226+
prep,
227+
&paths.test,
228+
neighbors_path,
229+
dataset.element_ptype(),
230+
&recall_config,
231+
)
232+
.await?;
233+
tracing::info!(
234+
"recall@{} for {}: mean={:.4}, p05={:.4}",
235+
r.k,
236+
r.flavor.label(),
237+
r.mean_recall,
238+
r.p05_recall,
239+
);
240+
out.push(r);
241+
}
242+
out
243+
} else {
244+
Vec::new()
245+
};
246+
181247
let pairs: Vec<(VortexCompression, &CompressionResult, &ScanTiming)> = prepared
182248
.iter()
183249
.zip(scan_timings.iter())
@@ -187,6 +253,7 @@ async fn main() -> Result<()> {
187253
dataset_name: dataset.name(),
188254
vortex_results: &pairs,
189255
handrolled: handrolled_timing.as_ref(),
256+
recall: &recall_results,
190257
};
191258

192259
if let Some(path) = args.output_path {
