Commit 99416bc

feat(vector): Enhance vector indexing with Product Quantization (PQ) support
- Introduced new quantization configurations for vector indexing, including INT8, BINARY, and various PQ settings.
- Updated benchmark scripts to evaluate the accuracy of different quantization methods.
- Added functionality to create vector indices with PQ parameters, ensuring metadata reflects these settings.
- Implemented approximate search capabilities using PQ, allowing for efficient nearest neighbor searches.
- Enhanced database interaction by allowing immediate graph rebuilding without waiting for search triggers.
- Updated tests to cover new PQ features, ensuring proper functionality and performance.
- Adjusted memory settings for JVM based on dataset size to optimize performance.
- Improved documentation and markdown summaries to reflect changes in vector indexing and benchmarks.
1 parent d1d2fe1 commit 99416bc

14 files changed

Lines changed: 726 additions & 148 deletions

bindings/python/docs/api/database.md

Lines changed: 15 additions & 1 deletion
@@ -554,7 +554,14 @@ db.create_vector_index(
554554
dimensions: int,
555555
distance_function: str = "cosine",
556556
max_connections: int = 32,
557-
beam_width: int = 256
557+
beam_width: int = 256,
558+
quantization: str | None = None,
559+
store_vectors_in_graph: bool = False,
560+
add_hierarchy: bool | None = None,
561+
pq_subspaces: int | None = None,
562+
pq_clusters: int | None = None,
563+
pq_center_globally: bool | None = None,
564+
pq_training_limit: int | None = None,
558565
) -> VectorIndex
559566
```
560567

@@ -568,6 +575,13 @@ Create a vector index for similarity search (JVector implementation). Existing r
568575
- `distance_function` (str): `"cosine"`, `"euclidean"`, or `"inner_product"`
569576
- `max_connections` (int): Max connections per node (default: 32). Maps to `maxConnections` in HNSW (JVector).
570577
- `beam_width` (int): Beam width for search/construction (default: 256). Maps to `beamWidth` in HNSW (JVector).
578+
- `quantization` (str | None): `"INT8"`, `"BINARY"`, or `"PRODUCT"` for PQ (default: None).
579+
- `store_vectors_in_graph` (bool): Persist vectors inline in graph file (faster reopen/search, larger graph).
580+
- `add_hierarchy` (bool | None): Force enabling/disabling HNSW hierarchy; None uses engine default.
581+
- `pq_subspaces` (int | None): PQ subspaces (M). Requires `quantization="PRODUCT"`.
582+
- `pq_clusters` (int | None): PQ clusters per subspace (K). Requires `quantization="PRODUCT"`.
583+
- `pq_center_globally` (bool | None): PQ global centering flag. Requires `quantization="PRODUCT"`.
584+
- `pq_training_limit` (int | None): PQ training sample cap. Requires `quantization="PRODUCT"`.
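The parameter list above says the four `pq_*` options require `quantization="PRODUCT"`. The diff does not show whether the binding enforces this itself, so the following is only a minimal sketch of a caller-side check. The helper name `build_index_kwargs` and the two constants are hypothetical and not part of the ArcadeDB API.

```python
from typing import Any, Dict, Optional

# Option names and quantization values taken from the parameter list above.
PQ_OPTIONS = ("pq_subspaces", "pq_clusters", "pq_center_globally", "pq_training_limit")
QUANTIZATIONS = (None, "INT8", "BINARY", "PRODUCT")


def build_index_kwargs(quantization: Optional[str] = None, **pq_options: Any) -> Dict[str, Any]:
    """Hypothetical helper: validate PQ options before calling create_vector_index."""
    if quantization not in QUANTIZATIONS:
        raise ValueError(f"unsupported quantization: {quantization!r}")
    unknown = set(pq_options) - set(PQ_OPTIONS)
    if unknown:
        raise TypeError(f"unknown options: {sorted(unknown)}")
    # Options left at None are dropped so the engine default applies.
    selected = {name: value for name, value in pq_options.items() if value is not None}
    if selected and quantization != "PRODUCT":
        raise ValueError(f'{sorted(selected)} require quantization="PRODUCT"')
    return {"quantization": quantization, **selected}
```

The returned dict could then be passed on as keyword arguments, for example `db.create_vector_index(..., **build_index_kwargs("PRODUCT", pq_subspaces=2, pq_clusters=256))`.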
571585

572586
**Returns:**
573587

bindings/python/docs/guide/vectors.md

Lines changed: 9 additions & 3 deletions
@@ -27,6 +27,11 @@ with arcadedb.create_database("./vector_demo") as db:
2727
distance_function="cosine", # default: cosine
2828
max_connections=32, # Corresponds to M in HNSW
2929
beam_width=256, # Corresponds to efConstruction in HNSW
30+
# quantization="PRODUCT", # enable PQ
31+
# pq_subspaces=2, # M
32+
# pq_clusters=256, # K
33+
# pq_center_globally=True,
34+
# pq_training_limit=128_000,
3035
)
3136

3237
with db.transaction():
@@ -45,7 +50,7 @@ with arcadedb.create_database("./vector_demo") as db:
4550
## API Essentials
4651

4752
- Vector property type must be `ARRAY_OF_FLOATS`.
48-
- `create_vector_index(vertex_type, vector_property, dimensions, distance_function="cosine", max_connections=32, beam_width=256, quantization=None)`
53+
- `create_vector_index(vertex_type, vector_property, dimensions, distance_function="cosine", max_connections=32, beam_width=256, quantization=None, store_vectors_in_graph=False, add_hierarchy=None, pq_subspaces=None, pq_clusters=None, pq_center_globally=None, pq_training_limit=None)`
4954
- `find_nearest(query_vector, k=10, overquery_factor=16, allowed_rids=None)`
5055
- `overquery_factor` multiplies `k` during search to improve recall.
5156
- `allowed_rids` filters candidates server-side (useful for metadata-prefilter).
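The guide only states that `overquery_factor` multiplies `k` during search. One common way such a factor is used is to fetch `k * overquery_factor` candidates with a cheap approximate distance and then re-rank them exactly. The sketch below illustrates that pattern in plain Python. The re-ranking step and the function names are assumptions for illustration, not a description of ArcadeDB or JVector internals.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Exact cosine distance (1 - cosine similarity) for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def overquery_search(
    query: Vector,
    vectors: List[Vector],
    approx_distance: Callable[[Vector, Vector], float],
    k: int = 10,
    overquery_factor: int = 16,
) -> List[int]:
    """Return indices of the top-k vectors after over-fetching and exact re-ranking."""
    # Over-fetch: take k * overquery_factor candidates by approximate distance.
    candidate_count = min(len(vectors), k * overquery_factor)
    by_approx = sorted(range(len(vectors)), key=lambda i: approx_distance(query, vectors[i]))
    candidates = by_approx[:candidate_count]
    # Re-rank only the candidates with the exact distance, then keep k.
    reranked = sorted(candidates, key=lambda i: cosine_distance(query, vectors[i]))
    return reranked[:k]
```

With a crude approximate distance, a factor of 1 can return the wrong neighbor while a larger factor gives the exact re-rank a chance to recover it, at the cost of scoring more candidates.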
@@ -122,8 +127,9 @@ results = index.find_nearest(query_vec, k=5, allowed_rids=rids)
122127

123128
## Quantization (Experimental)
124129

125-
- `quantization` accepts `"INT8"` or `"BINARY"`.
126-
- Current ArcadeDB builds have instability with INT8/BINARY on larger inserts (see tests); keep the default (`None`) unless benchmarking small datasets.
130+
- `quantization` accepts `"INT8"`, `"BINARY"`, or `"PRODUCT"` (PQ).
131+
- PQ tunables (require `quantization="PRODUCT"`): `pq_subspaces` (M), `pq_clusters` (K), `pq_center_globally`, `pq_training_limit`.
132+
- INT8/BINARY have shown instability on larger inserts; PQ is the recommended quantization path for recall/latency trade-offs.
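To get a feel for the `pq_subspaces` (M) and `pq_clusters` (K) tunables, the textbook product-quantization arithmetic can be sketched as follows: each vector becomes M centroid indices of `ceil(log2(K))` bits each, and the codebook holds K centroids per subspace. This is the generic PQ calculation only. The actual on-disk layout used by ArcadeDB/JVector may differ, and the function names are hypothetical.

```python
import math


def pq_code_bytes(subspaces: int, clusters: int) -> int:
    """Bytes needed for one PQ code: M indices of ceil(log2(K)) bits each."""
    if subspaces < 1 or clusters < 2:
        raise ValueError("need subspaces >= 1 and clusters >= 2")
    bits_per_index = (clusters - 1).bit_length()  # integer form of ceil(log2(K))
    return math.ceil(subspaces * bits_per_index / 8)


def pq_codebook_bytes(dimensions: int, clusters: int, float_bytes: int = 4) -> int:
    """Codebook size: M subspaces * K centroids * (d / M) floats = K * d floats."""
    return clusters * dimensions * float_bytes


def raw_vector_bytes(dimensions: int, float_bytes: int = 4) -> int:
    """Size of one uncompressed float vector, for comparison."""
    return dimensions * float_bytes
```

For example, with M=2 and K=256 each code takes 2 bytes, whereas a float32 vector takes 4 bytes per dimension. Raising M or K generally trades a larger code for lower quantization error.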
127133

128134
## SQL Helpers (Optional)
129135

bindings/python/examples/benchmark-vector/README.md

Lines changed: 42 additions & 36 deletions
@@ -37,36 +37,34 @@
3737
##### `quantization=none`
3838

3939
```bash
40-
5.6G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=10000_seed=42/VectorData_0.1.65536.v0.bucket
41-
10M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=10000_seed=42/VectorData_0_2748779662794320.4.262144.v0.lsmvecidx
42-
59M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=10000_seed=42/VectorData_0_2748779662794320_vecgraph.5.262144.v0.vecgraph
40+
5.6G VectorData_0.1.65536.v0.bucket
41+
10M VectorData_0_2748779662794320.4.262144.v0.lsmvecidx
42+
59M VectorData_0_2748779662794320_vecgraph.5.262144.v0.vecgraph
4343
```
4444

4545
##### `quantization=int8`
4646

4747
```bash
48-
5.6G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=10000_seed=42/VectorData_0.1.65536.v0.bucket
49-
999M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=10000_seed=42/VectorData_0_2748780028226180.4.262144.v0.lsmvecidx
50-
59M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=10000_seed=42/VectorData_0_2748780028226180_vecgraph.5.262144.v0.vecgraph
48+
5.6G VectorData_0.1.65536.v0.bucket
49+
999M VectorData_0_2748780028226180.4.262144.v0.lsmvecidx
50+
59M VectorData_0_2748780028226180_vecgraph.5.262144.v0.vecgraph
5151
```
5252

53-
54-
5553
##### `quantization=PQ`
5654

5755
```bash
58-
5.6G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=product_store=off_hier=on_batch=10000_seed=42/VectorData_0.1.65536.v0.bucket
59-
11M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=product_store=off_hier=on_batch=10000_seed=42/VectorData_0_2748780503246723.4.262144.v0.lsmvecidx
60-
247M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=product_store=off_hier=on_batch=10000_seed=42/VectorData_0_2748780503246723.4.262144.v0.lsmvecidx.vecpq
61-
59M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=product_store=off_hier=on_batch=10000_seed=42/VectorData_0_2748780503246723_vecgraph.5.262144.v0.vecgraph
56+
5.6G VectorData_0.1.65536.v0.bucket
57+
11M VectorData_0_2748780503246723.4.262144.v0.lsmvecidx
58+
247M VectorData_0_2748780503246723.4.262144.v0.lsmvecidx.vecpq
59+
59M VectorData_0_2748780503246723_vecgraph.5.262144.v0.vecgraph
6260
```
6361

6462
##### `quantization=binary`
6563

6664
```bash
67-
5.6G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=binary_store=off_hier=on_batch=10000_seed=42/VectorData_0.1.65536.v0.bucket
68-
140M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=binary_store=off_hier=on_batch=10000_seed=42/VectorData_0_2748779801023527.4.262144.v0.lsmvecidx
69-
59M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=binary_store=off_hier=on_batch=10000_seed=42/VectorData_0_2748779801023527_vecgraph.5.262144.v0.vecgraph
65+
5.6G VectorData_0.1.65536.v0.bucket
66+
140M VectorData_0_2748779801023527.4.262144.v0.lsmvecidx
67+
59M VectorData_0_2748779801023527_vecgraph.5.262144.v0.vecgraph
7068
```
7169

7270
#### Findings
@@ -83,7 +81,19 @@
8381
- `BINARY` index is moderate (~140 MB).
8482
- **Reopen:** Recall and timings after reopen track pre-close numbers; PQ remains lower recall, `NONE`/`INT8` remain high.
8583
- **Recommendation:** For quality, prefer `NONE` or `INT8`; use PQ only if you need the lowest query latency and can accept lower recall, and consider tuning PQ (M/K) to recover recall. Avoid `BINARY` here given the large recall drop.
86-
- All four of them saved the vectors in the db like `db.schema.get_or_create_property("VectorData", "vector", "ARRAY_OF_FLOATS")`. Maybe we should do this differently when quantization is enabled?
84+
- All four of them saved the vectors in the db like `db.schema.get_or_create_property("VectorData", "vector", "ARRAY_OF_FLOATS")`.
85+
- Maybe we should do this differently when quantization is enabled?
86+
87+
#### MSMARCO-1M (1000 queries, Recall@50) with heap size capped at 4GB
88+
89+
- For the 1M dataset, I've been using an 8GB heap so far. This time I capped it at 4GB to see how that affects performance.
90+
91+
| quantization | ingest_s | ingest_rss_mb | create_index_s | create_index_rss_mb | build_graph_now_s | build_graph_now_rss_mb | search_s | recall@50_before_close | search_after_reopen_s | search_after_reopen_rss_mb | recall@50_after_reopen | peak_rss_mb | db_size_mb | total_duration |
92+
| :----------- | -------: | ------------: | -------------: | ------------------: | ----------------: | ---------------------: | -------: | ---------------------: | --------------------: | -------------------------: | ---------------------: | ----------: | ---------: | :------------- |
93+
| NONE | 62.175 | 4004.79 | 16.29 | 122.816 | 6773.81 | 153.227 | 10.238 | 0.9013 | 9.87 | 3.176 | 0.9013 | 4701.89 | 5750.44 | 1h 54m |
94+
| INT8 | 62.319 | 3974.97 | 21.722 | 95.438 | 3313.5 | 108.125 | 35.718 | 0.9083 | 33.318 | 3.188 | 0.9066 | 4603.35 | 6738.94 | 58m |
95+
| PRODUCT | 61.694 | 3974.34 | 16.053 | 116.758 | 6923.63 | 199.758 | 1.657 | 0.8608 | 1.383 | 5.992 | 0.8612 | 4723.02 | 5996.45 | 1h 56m |
96+
| BINARY | 62.904 | 4014.12 | 27.06 | 87.609 | 4267.63 | 89.707 | 21.314 | 0.2923 | 14.683 | 14.125 | 0.2923 | 4619.88 | 5879.94 | 1h 13m |
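
The recall@50 columns above compare the returned neighbors with exact ground truth. The benchmark script's own implementation is not shown in this diff, so the following is a minimal sketch of the usual definition: the fraction of the true top-k that appears in the returned top-k, averaged over queries. The function names are hypothetical.

```python
from typing import Hashable, List, Sequence


def recall_at_k(retrieved: Sequence[Hashable], ground_truth: Sequence[Hashable], k: int) -> float:
    """Fraction of the true top-k neighbors present in the retrieved top-k."""
    hits = len(set(retrieved[:k]) & set(ground_truth[:k]))
    return hits / k


def mean_recall_at_k(
    all_retrieved: List[Sequence[Hashable]],
    all_ground_truth: List[Sequence[Hashable]],
    k: int,
) -> float:
    """Average recall@k over a batch of queries (e.g. the 1000 benchmark queries)."""
    scores = [recall_at_k(r, t, k) for r, t in zip(all_retrieved, all_ground_truth)]
    return sum(scores) / len(scores)
```

Under this definition, a value such as 0.9013 means that on average about 45 of the 50 true neighbors were returned per query.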
8797

8898
### Commit/Date: main @ 91a86e3 (Thu Jan 15 10:32:50 2026 -0500)
8999

@@ -126,41 +136,37 @@
126136
##### store_vectors_in_graph=False and quantization=INT8
127137

128138
```bash
129-
du -sh arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=100000_seed=42/* | sort -h
130-
320K arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=100000_seed=42/dictionary.0.327680.v0.dict
131-
59M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=100000_seed=42/VectorData_0_2689535159251959_vecgraph.5.262144.v0.vecgraph
132-
999M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=100000_seed=42/VectorData_0_2689535159251959.4.262144.v0.lsmvecidx
133-
5.6G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=100000_seed=42/VectorData_0.1.65536.v0.bucket
139+
320K dictionary.0.327680.v0.dict
140+
59M VectorData_0_2689535159251959_vecgraph.5.262144.v0.vecgraph
141+
999M VectorData_0_2689535159251959.4.262144.v0.lsmvecidx
142+
5.6G VectorData_0.1.65536.v0.bucket
134143
```
135144

136145
##### store_vectors_in_graph=True and quantization=INT8
137146

138147
```bash
139-
du -sh arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=on_hier=on_batch=100000_seed=42/* | sort -h
140-
320K arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=on_hier=on_batch=100000_seed=42/dictionary.0.327680.v0.dict
141-
999M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=on_hier=on_batch=100000_seed=42/VectorData_0_2689534677234566.4.262144.v0.lsmvecidx
142-
3.9G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=on_hier=on_batch=100000_seed=42/VectorData_0_2689534677234566_vecgraph.5.262144.v0.vecgraph
143-
5.6G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=on_hier=on_batch=100000_seed=42/VectorData_0.1.65536.v0.bucket
148+
320K dictionary.0.327680.v0.dict
149+
999M VectorData_0_2689534677234566.4.262144.v0.lsmvecidx
150+
3.9G VectorData_0_2689534677234566_vecgraph.5.262144.v0.vecgraph
151+
5.6G VectorData_0.1.65536.v0.bucket
144152
```
145153

146154
##### store_vectors_in_graph=False and quantization=None
147155

148156
```bash
149-
du -sh arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=100000_seed=42/* | sort -h
150-
320K arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=100000_seed=42/dictionary.0.327680.v0.dict
151-
11M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=100000_seed=42/VectorData_0_2689535353426837.4.262144.v0.lsmvecidx
152-
59M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=100000_seed=42/VectorData_0_2689535353426837_vecgraph.5.262144.v0.vecgraph
153-
5.6G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=100000_seed=42/VectorData_0.1.65536.v0.bucket
157+
320K dictionary.0.327680.v0.dict
158+
11M VectorData_0_2689535353426837.4.262144.v0.lsmvecidx
159+
59M VectorData_0_2689535353426837_vecgraph.5.262144.v0.vecgraph
160+
5.6G VectorData_0.1.65536.v0.bucket
154161
```
155162

156163
##### store_vectors_in_graph=True and quantization=None
157164

158165
```bash
159-
du -sh arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=on_hier=on_batch=100000_seed=42/* | sort -h
160-
320K arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=on_hier=on_batch=100000_seed=42/dictionary.0.327680.v0.dict
161-
11M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=on_hier=on_batch=100000_seed=42/VectorData_0_2689535105029551.4.262144.v0.lsmvecidx
162-
3.9G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=on_hier=on_batch=100000_seed=42/VectorData_0_2689535105029551_vecgraph.5.262144.v0.vecgraph
163-
5.6G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=on_hier=on_batch=100000_seed=42/VectorData_0_1.65536.v0.bucket
166+
320K dictionary.0.327680.v0.dict
167+
11M VectorData_0_2689535105029551.4.262144.v0.lsmvecidx
168+
3.9G VectorData_0_2689535105029551_vecgraph.5.262144.v0.vecgraph
169+
5.6G VectorData_0_1.65536.v0.bucket
164170
```
165171

166172
#### Findings
