Commit bff43dc

Vector datasets catalog and downloader (#7446)
## Summary

Tracking issue: #7297

We will want to add vector benchmarking soon (see #7399 for a draft).

This adds a simple catalog for the vector datasets hosted at `https://assets.zilliz.com/benchmark` for [VectorDBBench](https://github.com/zilliztech/vectordbbench). The catalog describes the shape of each dataset (whether it is partitioned, whether it is randomly shuffled, whether it has neighbor lists for top k, etc.) and also handles downloading everything.

I had to verify that all of this was correct by looking at the S3 buckets themselves:

```sh
aws s3 ls s3://assets.zilliz.com/benchmark/ --region us-west-2 --no-sign-request
```

<details>

```sh
for d in bioasq_large_10m bioasq_medium_1m cohere_large_10m cohere_medium_1m \
         cohere_small_100k gist_medium_1m gist_small_100k glove_medium_1m \
         glove_small_100k laion_large_100m \
         openai_large_5m openai_medium_500k openai_small_50k \
         sift_large_50m sift_medium_5m sift_small_500k; do
  echo "=== $d ==="
  aws s3 ls s3://assets.zilliz.com/benchmark/$d/ --region us-west-2 --no-sign-request
done
```

</details>

This script from the main VectorDBBench repo helped too: https://github.com/zilliztech/VectorDBBench/blob/main/vectordb_bench/backend/dataset.py

---

Things that are not implemented that I would like to add:

- Is the dataset pre-normalized for cosine similarity? This is not obvious to me without actually working with the datasets, so I will do this later.
- Some datasets have scalar labels for all vectors, which help mimic a similarity search combined with a filter on some other column. Some of them also have neighbor lists for these specific filtered queries. That is something we'll probably want to add in the future.

## Testing

N/A

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
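To make the idea of a "shape" catalog concrete, here is a minimal sketch of what one entry and its URL construction might look like. It is not the code added in this commit: the type names (`DatasetEntry`, `TrainLayout`), the methods, and the file naming patterns (`train.parquet`, `shuffle_train.parquet`, `train-NN-of-MM.parquet`, `neighbors.parquet`) are all assumptions made for illustration. Only the base URL and the dataset directory names come from the description above.

```rust
/// Base URL of the hosted datasets, as given in the commit description.
const BASE_URL: &str = "https://assets.zilliz.com/benchmark";

/// How the training vectors are laid out (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrainLayout {
    /// All training vectors live in a single file.
    Single,
    /// Training vectors are split across this many partition files.
    Partitioned(u32),
}

/// One catalog entry describing the shape of a dataset (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DatasetEntry {
    /// Directory name under the base URL, e.g. `cohere_medium_1m`.
    pub dir: &'static str,
    /// Single file or partitioned.
    pub layout: TrainLayout,
    /// Whether a randomly shuffled copy of the training data should be used.
    pub shuffled: bool,
    /// Whether a ground-truth neighbor list for top k is available.
    pub has_neighbors: bool,
}

impl DatasetEntry {
    /// Builds the download URLs for the training files.
    /// The file name patterns are assumed, not verified against the bucket.
    pub fn train_urls(&self) -> Vec<String> {
        let prefix = if self.shuffled { "shuffle_train" } else { "train" };
        match self.layout {
            TrainLayout::Single => {
                vec![format!("{}/{}/{}.parquet", BASE_URL, self.dir, prefix)]
            }
            TrainLayout::Partitioned(n) => (0..n)
                .map(|i| {
                    format!(
                        "{}/{}/{}-{:02}-of-{:02}.parquet",
                        BASE_URL, self.dir, prefix, i, n
                    )
                })
                .collect(),
        }
    }

    /// Returns the URL of the neighbor list, if the entry declares one.
    pub fn neighbors_url(&self) -> Option<String> {
        self.has_neighbors
            .then(|| format!("{}/{}/neighbors.parquet", BASE_URL, self.dir))
    }
}
```

With a design like this, a downloader only has to iterate over the URLs an entry produces, and questions such as "is this dataset partitioned?" are answered by the catalog rather than by probing the bucket at run time. Any layout values plugged into such an entry would still need to be checked against the bucket listing, as described above.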
1 parent 9406303 commit bff43dc

6 files changed

Lines changed: 970 additions & 0 deletions

vortex-bench/src/lib.rs

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@ pub mod statpopgen;
 pub mod tpcds;
 pub mod tpch;
 pub mod utils;
+pub mod vector_dataset;
 
 pub use benchmark::Benchmark;
 pub use benchmark::TableSpec;
