Commit bff43dc
Vector datasets catalog and downloader (#7446)
## Summary
Tracking issue: #7297
We will want to add vector benchmarking soon (see
#7399 for a draft).
This adds a simple catalog for the vector datasets hosted at
`https://assets.zilliz.com/benchmark` for
[VectorDBBench](https://github.com/zilliztech/vectordbbench). The catalog
describes the shape of each dataset (whether it is partitioned, whether it is
randomly shuffled, whether it has neighbor lists for top k, etc.) and
also handles downloading everything.
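
To make the idea of a catalog entry concrete, here is a minimal sketch in Python. It is not the code added by this commit: the `DatasetEntry` class, its field names, and the flag values are all hypothetical, and it assumes each dataset lives in a directory named after it under the base URL quoted above.

```python
from dataclasses import dataclass

# Base URL quoted in the description above.
BASE_URL = "https://assets.zilliz.com/benchmark"


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical catalog entry; the field names are illustrative only."""

    name: str            # bucket directory name, e.g. "cohere_small_100k"
    partitioned: bool    # is the data split across several files?
    shuffled: bool       # is a randomly shuffled variant available?
    has_neighbors: bool  # is a top-k neighbors list published?

    def directory_url(self) -> str:
        # Assumption: one directory per dataset under the base URL.
        return f"{BASE_URL}/{self.name}/"


# Placeholder flag values, NOT verified against the bucket.
CATALOG = {
    entry.name: entry
    for entry in [
        DatasetEntry("cohere_small_100k", partitioned=False, shuffled=True, has_neighbors=True),
        DatasetEntry("laion_large_100m", partitioned=True, shuffled=False, has_neighbors=True),
    ]
}

print(CATALOG["cohere_small_100k"].directory_url())
```

Keeping the shape flags next to the dataset name means a downloader can decide which files to fetch from the entry alone, without listing the bucket first.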
I verified that all of this was correct by inspecting the S3
bucket directly:
```sh
aws s3 ls s3://assets.zilliz.com/benchmark/ --region us-west-2 --no-sign-request
```
<details>
```sh
for d in bioasq_large_10m bioasq_medium_1m cohere_large_10m cohere_medium_1m \
cohere_small_100k gist_medium_1m gist_small_100k glove_medium_1m \
glove_small_100k laion_large_100m \
openai_large_5m openai_medium_500k openai_small_50k \
sift_large_50m sift_medium_5m sift_small_500k; do
echo "=== $d ==="
aws s3 ls s3://assets.zilliz.com/benchmark/$d/ --region us-west-2 --no-sign-request
done
```
</details>
And this script from the main repo helped too:
https://github.com/zilliztech/VectorDBBench/blob/main/vectordb_bench/backend/dataset.py
---
Things that are not implemented that I would like to add:
- Is the dataset pre-normalized for cosine similarity? This is not so
obvious to me without actually working with the datasets, so I will do
this later.
- Some datasets have scalar labels for all vectors, which help mimic a
similarity search combined with a filter on some other column. Some of them
also have neighbor lists for these specific filtered queries. This is
something we'll probably want to add in the future.
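
For the first item, one possible way to test for pre-normalization is to check that every vector has an L2 norm close to 1, since for unit vectors cosine similarity reduces to a plain dot product. The sketch below is an assumption about how such a check could look, not something this commit implements; `is_unit_normalized` and its tolerance are hypothetical.

```python
import math


def is_unit_normalized(vectors, tol=1e-3):
    """Return True if every vector has an L2 norm within `tol` of 1.

    Hypothetical helper: `vectors` is any iterable of float sequences.
    An empty input returns True.
    """
    return all(
        abs(math.sqrt(sum(x * x for x in v)) - 1.0) <= tol
        for v in vectors
    )


# Small hand-made examples (not taken from any real dataset).
print(is_unit_normalized([[0.6, 0.8], [1.0, 0.0]]))
print(is_unit_normalized([[3.0, 4.0]]))
```

In practice a sample of rows from each dataset would likely be enough to decide, and the tolerance would need to account for the precision the vectors are stored in.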
## Testing
N/A
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
1 parent 9406303 commit bff43dc
6 files changed
Lines changed: 970 additions & 0 deletions