Commit f5857b5

update v1.2.0-alpha.5
1 parent 7aaa047 commit f5857b5

3 files changed

Lines changed: 106 additions & 34 deletions

File tree

README.md

Lines changed: 4 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,8 @@
1-
# enVector with ANN (GAS) in VectorDBBench
1+
# enVector in VectorDBBench
22

3-
The guide on how to use enVector with ANN index in VectorDBBench is available in [README_ENVECTOR.md](README_ENVECTOR.md).
3+
**Quick start:** The guide on how to use **enVector** in VectorDBBench is available in:
4+
5+
👉 [README_ENVECTOR.md](README_ENVECTOR.md).
46

57
The following is the original content of the VectorDBBench README:
68

README_ENVECTOR.md

Lines changed: 90 additions & 18 deletions
@@ -1,6 +1,6 @@
1-
# enVector with ANN (GAS) in VectorDBBench
1+
# enVector in VectorDBBench
22

3-
This guide demonstrates how to use enVector with an ANN index in VectorDBBench.
3+
This guide demonstrates how to use enVector in VectorDBBench.
44

55
Basic usage of enVector with VectorDBBench follows the standard procedure for [VectorDBBench](https://github.com/zilliztech/VectorDBBench).
66

@@ -18,7 +18,7 @@ Basic usage of enVector with VectorDBBench follows the standard procedure for [V
1818
│ ├── test.npy
1919
│ └── train.pkl
2020
├── README_ENVECTOR.md
21-
── scripts
21+
── scripts
2222
├── run_benchmark.sh # benchmark script
2323
├── envector_pubmed_config.yml # benchmark config file
2424
└── prepare_dataset.py # download and prepare ground truth neighbors for dataset
@@ -35,8 +35,8 @@ source .venv/bin/activate
3535
# 2. Install VectorDBBench
3636
pip install -e .
3737

38-
# 3. Install es2
39-
pip install es2==1.2.0a4
38+
# 3. Install pyenvector
39+
pip install pyenvector==1.2.0a5
4040
```
4141

4242
### Prepare dataset
@@ -48,8 +48,8 @@ Prepare the following artifacts for the ANN benchmark with `scripts/prepare_data
4848
- download centroids and tree metadata for the GAS index corresponding to the embedding model
4949

5050
For the ANN benchmark, we provide two datasets via HuggingFace:
51-
- PUBMED768D400K: [cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m](https://huggingface.co/datasets/cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m)
52-
- BLOOMBERG768D368K: [cryptolab-playground/Bloomberg-Financial-News-embedding-gemma-300m](https://huggingface.co/datasets/cryptolab-playground/Bloomberg-Financial-News-embedding-gemma-300m)
51+
- `PUBMED768D400K`: [cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m](https://huggingface.co/datasets/cryptolab-playground/pubmed-arxiv-abstract-embedding-gemma-300m)
52+
- `BLOOMBERG768D368K`: [cryptolab-playground/Bloomberg-Financial-News-embedding-gemma-300m](https://huggingface.co/datasets/cryptolab-playground/Bloomberg-Financial-News-embedding-gemma-300m)
5353

5454
Also, we provide centroids and tree metadata for the corresponding embedding model used in the ANN benchmark:
5555
- GAS Centroids: [cryptolab-playground/gas-centroids](https://huggingface.co/datasets/cryptolab-playground/gas-centroids)
@@ -63,7 +63,7 @@ python ./scripts/prepare_dataset.py \
6363
-e embeddinggemma-300m
6464
```
6565

66-
Then, you can find the following generated files:
66+
The generated files are laid out as follows:
6767

6868
```bash
6969
.
@@ -91,25 +91,38 @@ cd envector-deployment/docker-compose
9191
```
9292

9393
We provide four enVector Docker Images:
94-
- `cryptolabinc/es2e:v1.2.0-alpha.4`
95-
- `cryptolabinc/es2b:v1.2.0-alpha.4`
96-
- `cryptolabinc/es2o:v1.2.0-alpha.4`
97-
- `cryptolabinc/es2c:v1.2.0-alpha.4`
94+
- `cryptolabinc/es2e:v1.2.0-alpha.5`
95+
- `cryptolabinc/es2b:v1.2.0-alpha.5`
96+
- `cryptolabinc/es2o:v1.2.0-alpha.5`
97+
- `cryptolabinc/es2c:v1.2.0-alpha.5`
9898

9999
### Set Environment Variables
100100

101101
```bash
102102
# Set environment variables
103-
export DATASET_LOCAL_DIR="./dataset"
104-
export NUM_PER_BATCH=4096
103+
export DATASET_LOCAL_DIR="./dataset" # dataset directory. default: /tmp/vectordb_bench/dataset
104+
export NUM_PER_BATCH=4096 # default batch size for enVector
105105
```
106106

107-
## Run Benchmark
107+
## Run Our ANN Benchmark
108+
109+
We provide an enVector-customized ANN index, called "GAS", designed to perform efficient IVF-FLAT-based ANN search over an encrypted index.
110+
We evaluated enVector on the two benchmark datasets that we provide:
111+
- `PUBMED768D400K`
112+
- `BLOOMBERG768D368K`
113+
114+
Run the provided shell script (`./scripts/run_benchmark.sh`) as follows:
115+
116+
```bash
117+
./scripts/run_benchmark.sh --type flat # FLAT
118+
./scripts/run_benchmark.sh --type ivf # IVF-FLAT with enVector-customized ANN (GAS)
119+
```
120+
121+
For more details, refer to `run_benchmark.sh` or `envector_{benchmark}_config.yml` in the `scripts` directory for benchmarks of enVector with ANN (GAS), or use the following command:
108122

109-
Refer to `./scripts/run_benchmark.sh` or `./scripts/envector_benchmark_config.yml` for benchmarks with enVector with ANN (VCT), or use the following command:
110123

111124
```bash
112-
export NUM_PER_BATCH=500000 # set to the database size for efficiency with IVF_FLAT
125+
export NUM_PER_BATCH=500000 # set to the database size when using IVF_FLAT
113126
python -m vectordb_bench.cli.vectordbbench envectorivfflat \
114127
--uri "localhost:50050" \
115128
--eval-mode mm \
@@ -123,10 +136,69 @@ python -m vectordb_bench.cli.vectordbbench envectorivfflat \
123136
--custom-dataset-file-count 1 \
124137
--custom-dataset-with-gt \
125138
--skip-custom-dataset-use-shuffled \
139+
--k 10 \
126140
--train-centroids True \
127141
--is-vct True \
128142
--centroids-path "./centroids/embeddinggemma-300m/centroids.npy" \
129143
--vct-path "./centroids/embeddinggemma-300m/tree_info.pkl" \
130144
--nlist 32768 \
131145
--nprobe 6
132-
```
146+
```
147+
148+
Note that `NUM_PER_BATCH` is set to the database size when using an IVF-based index for enVector.
149+
150+
## Run VectorDBBench Benchmark
151+
152+
Use the following commands to run enVector with VectorDBBench's built-in benchmarks.
153+
154+
```bash
155+
# flat
156+
python -m vectordb_bench.cli.vectordbbench envectorflat \
157+
--uri "localhost:50050" \
158+
--case-type "Performance1536D500K" \
159+
--db-label "Performance1536D500K-FLAT"
160+
161+
# ivf: IVF-FLAT with random centroids
162+
export NUM_PER_BATCH=500000 # set to the database size when using IVF-FLAT
163+
python -m vectordb_bench.cli.vectordbbench envectorivfflat \
164+
--uri "localhost:50050" \
165+
--case-type "Performance1536D500K" \
166+
--db-label "Performance1536D500K-IVF-FLAT" \
167+
--nlist 250 \
168+
--nprobe 6
169+
170+
# ivf-trained: IVF-FLAT with trained centroids via k-means (centroids built by sklearn, etc.)
171+
export NUM_PER_BATCH=500000 # set to the database size when using IVF-FLAT
172+
python -m vectordb_bench.cli.vectordbbench envectorivfflat \
173+
--uri "localhost:50050" \
174+
--case-type "Performance1536D500K" \
175+
--db-label "Performance1536D500K-IVF-FLAT" \
176+
--train-centroids True \
177+
--centroids-path "./centroids/kmeans_centroids.npy" \
178+
--nlist 250 \
179+
--nprobe 6
180+
```
181+
182+
Note that the benchmarks provided by VectorDBBench, including Performance1536D500K, use an **unknown** embedding model (described only as one of OpenAI's), so our GAS approach for ANN cannot be applied to them.
183+
184+
### CLI Options
185+
186+
enVector Types for VectorDBBench
187+
- `envectorflat`: FLAT as index type for enVector
188+
- `envectorivfflat`: IVF_FLAT as index type for enVector
189+
190+
Common Options for enVector
191+
- `--uri`: enVector server URI
192+
- `--eval-mode`: FHE evaluation mode on server. Use `mm` for enhanced performance.
193+
194+
ANN Options for enVector
195+
- `--nlist`: Number of coarse clusters for IVF_FLAT
196+
- `--nprobe`: Number of clusters to scan during search for IVF_FLAT
197+
- `--train-centroids`: whether to use trained centroids for IVF_FLAT
198+
- `--centroids-path`: path to the trained centroids
199+
- `--is-vct`: whether to use VCT approach for IVF_GAS
200+
- `--vct-path`: path to the trained VCT metadata for IVF_GAS
201+
202+
Benchmark Options:
203+
These options follow the conventions of VectorDBBench;
204+
see the details in [VectorDBBench Options](https://github.com/zilliztech/VectorDBBench?tab=readme-ov-file#custom-dataset-for-performance-case).
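
Review note on `--nlist` and `--nprobe`: the sketch below is a toy, plaintext illustration of what these two options control in an IVF-FLAT index. It is not enVector's implementation: there is no encryption, the L2 metric is an assumption chosen for simplicity, and every function name (`build_ivf`, `ivf_search`, `flat_search`, `make_data`) is hypothetical. Centroids are sampled at random, loosely mirroring the "IVF-FLAT with random centroids" variant in the README.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Put each vector id into the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]  # one list per cluster, i.e. nlist lists
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe, k):
    """Rank only the vectors stored in the nprobe clusters closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [vid for c in order[:nprobe] for vid in lists[c]]
    # Ties are broken by id so results are deterministic.
    return sorted(candidates, key=lambda vid: (l2(query, vectors[vid]), vid))[:k]


def flat_search(query, vectors, k):
    """Exhaustive (FLAT) search over every vector."""
    return sorted(range(len(vectors)), key=lambda vid: (l2(query, vectors[vid]), vid))[:k]


def make_data(n, dim, seed=0):
    """Generate n reproducible pseudo-random vectors (illustration data only)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    vectors = make_data(n=200, dim=8)
    nlist = 10
    centroids = random.Random(1).sample(vectors, nlist)  # random centroids
    lists = build_ivf(vectors, centroids)
    query = make_data(n=1, dim=8, seed=2)[0]

    exact = flat_search(query, vectors, k=10)
    approx = ivf_search(query, vectors, centroids, lists, nprobe=3, k=10)
    recall = len(set(exact) & set(approx)) / len(exact)
    print(f"recall@10 with nprobe=3 of nlist={nlist}: {recall:.2f}")
```

Raising `nprobe` toward `nlist` scans more clusters, so recall approaches that of the FLAT search at the cost of more distance computations; with `nprobe == nlist` the two searches return the same result.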

vectordb_bench/backend/clients/envector/envector.py

Lines changed: 12 additions & 14 deletions
@@ -7,8 +7,8 @@
77
from pathlib import Path
88
from typing import Any
99

10-
import es2
1110
import numpy as np
11+
import pyenvector as ev
1212

1313
from vectordb_bench.backend.filter import Filter, FilterOp
1414

@@ -51,32 +51,32 @@ def __init__(
5151
self._vector_index_name = "vector_idx"
5252
self._scalar_id_index_name = "id_sort_idx"
5353
self._scalar_labels_index_name = "labels_idx"
54-
self.col: es2.Index | None = None
54+
self.col: ev.Index | None = None
5555

5656
self.is_vct: bool = False
5757
self.vct_params: dict[str, Any] = {}
5858

59-
es2.init(
59+
ev.init(
6060
address=self.db_config.get("uri"),
6161
key_path=self.db_config.get("key_path"),
6262
key_id=self.db_config.get("key_id"),
6363
eval_mode=self.case_config.eval_mode,
6464
)
6565
if drop_old:
6666
log.info(f"{self.name} client drop_old index: {self.collection_name}")
67-
if self.collection_name in es2.get_index_list():
68-
es2.drop_index(self.collection_name)
67+
if self.collection_name in ev.get_index_list():
68+
ev.drop_index(self.collection_name)
6969

7070
# Create the collection
7171
log.info(f"{self.name} create index: {self.collection_name}")
7272

7373
index_kwargs = dict(kwargs)
7474
self._ensure_index(dim, index_kwargs)
7575

76-
es2.disconnect()
76+
ev.disconnect()
7777

7878
def _ensure_index(self, dim: int, index_kwargs: dict[str, Any]):
79-
if self.collection_name in es2.get_index_list():
79+
if self.collection_name in ev.get_index_list():
8080
log.info(f"{self.name} index {self.collection_name} already exists, skip creating")
8181
self.is_vct = self.case_config.index_param().get("is_vct", False)
8282
log.debug(f"IS_VCT: {self.is_vct}")
@@ -94,7 +94,7 @@ def _create_index(self, dim: int, index_kwargs: dict[str, Any]):
9494
if index_type == "IVF_FLAT":
9595
self._adjust_batch_size()
9696

97-
es2.create_index(
97+
ev.create_index(
9898
index_name=self.collection_name,
9999
dim=dim,
100100
key_path=self.db_config.get("key_path"),
@@ -146,24 +146,24 @@ def init(self):
146146
>>> self.insert_embeddings()
147147
>>> self.search_embedding()
148148
"""
149-
es2.init(
149+
ev.init(
150150
address=self.db_config.get("uri"),
151151
key_path=self.db_config.get("key_path"),
152152
key_id=self.db_config.get("key_id"),
153153
eval_mode=self.case_config.eval_mode,
154154
)
155155
try:
156-
self.col = es2.Index(self.collection_name)
156+
self.col = ev.Index(self.collection_name)
157157
if self.is_vct:
158-
log.debug(f"VCT: {self.col.index_config.index_param.index_params['virtual_cluster']}")
158+
log.debug(f"VCT: {self.col.index_config.index_param.index_params.get('virtual_cluster')}")
159159
is_vct = self.case_config.index_param().get("is_vct", False)
160160
assert self.is_vct == is_vct, "is_vct mismatch"
161161
vct_path = self.case_config.index_param().get("vct_path", None)
162162
self.col._load_virtual_cluster_from_pkl(vct_path)
163163
yield
164164
finally:
165165
self.col = None
166-
es2.disconnect()
166+
ev.disconnect()
167167

168168
def create_index(self):
169169
pass
@@ -194,8 +194,6 @@ def insert_embeddings(
194194
assert self.col is not None
195195
assert len(embeddings) == len(metadata)
196196

197-
log.debug(f"IS_VCT: {self.is_vct}")
198-
199197
insert_count = 0
200198
try:
201199
for batch_start_offset in range(0, len(embeddings), self.batch_size):
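
Review note on the connection lifecycle: in the hunks above, every `ev.init(...)` is paired with an `ev.disconnect()`, and the `init` method puts the disconnect in a `finally` block. The sketch below shows that pattern in isolation. `FakeEnvector` is a hypothetical stand-in that only records calls; it borrows the keyword names visible in the diff (`address`, `key_path`, `key_id`, `eval_mode`) but does not reproduce the behavior of the real `pyenvector` package. `session` and `drop_if_exists` are likewise illustrative helpers, not part of this commit.

```python
from contextlib import contextmanager


class FakeEnvector:
    """Hypothetical stand-in for the `pyenvector` module; it only records calls."""

    def __init__(self):
        self.calls = []
        self.indexes = {"bench_idx"}

    def init(self, address, key_path=None, key_id=None, eval_mode=None):
        self.calls.append("init")

    def get_index_list(self):
        return sorted(self.indexes)

    def drop_index(self, name):
        self.indexes.discard(name)
        self.calls.append(f"drop_index:{name}")

    def disconnect(self):
        self.calls.append("disconnect")


@contextmanager
def session(ev, uri, **kwargs):
    """Connect, hand control to the caller, and always disconnect afterwards."""
    ev.init(address=uri, **kwargs)
    try:
        yield ev
    finally:
        ev.disconnect()  # runs even if the body raises


def drop_if_exists(ev, name):
    """Mirror the drop_old branch: drop the index only when it is listed."""
    if name in ev.get_index_list():
        ev.drop_index(name)
        return True
    return False


if __name__ == "__main__":
    ev = FakeEnvector()
    with session(ev, "localhost:50050", eval_mode="mm") as conn:
        drop_if_exists(conn, "bench_idx")
    print(ev.calls)
```

The same defensive idea appears in the logging change from `index_params['virtual_cluster']` to `index_params.get('virtual_cluster')`: a missing key now yields `None` in the debug message instead of raising `KeyError`.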
