Commit 19c885a

committed
docs: Enhance README with detailed benchmarking findings and disk usage analysis
1 parent cb89b84 commit 19c885a

1 file changed

Lines changed: 69 additions & 3 deletions

File tree

  • bindings/python/examples/benchmark-vector

bindings/python/examples/benchmark-vector/README.md

Lines changed: 69 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -5,9 +5,75 @@
55
- Data prepared with [convert-msmacro-parquet-to-shards.py](./convert-msmacro-parquet-to-shards.py). [Download Cohere MSMARCO v2.1](https://huggingface.co/datasets/Cohere/msmarco-v2.1-embed-english-v3) parquet shards, normalize to float32, write flat f32 shards, and build exact GT for 1K queries (top-50).
66
- Benchmarks here use the 1M subset. For production/RAG we should target 10M+ vectors; GT and shards are already computed—ask if you want the bundle to rerun.
77

8-
### Commit/Date: main @ d8098d7 (Wed Jan 14 15:20:25 2026 -0500)
8+
## Benchmark Setup
9+
10+
- **Hardware:** AMD Ryzen 9 7950X (16 cores), 128GB DDR5 (4x32GB), Samsung SSD 970 EVO Plus 2TB.
11+
- Take the durations with a grain of salt, since other processes were running on the machine. RSS and DB size are more stable. Each task was allocated 4 threads, but the number of tasks running in parallel was not constant, so effective CPU usage may vary.
12+
- Unless stated otherwise, `MAX_CONNECTIONS` is fixed at 12, `BEAM_WIDTHS` at 64, and `OVERQUERY_FACTORS` at 1.
13+
14+
### Commit/Date: main @ da5e70d (Thu Jan 15 09:44:44 2026 -0500)
15+
16+
- This commit fixes the store_vectors_in_graph issue.
17+
- The two runs below were executed with `add_hierarchy=True`.
18+
19+
#### MSMARCO-1M (1000 queries, Recall@50)
20+
21+
| quantization | store_vectors_in_graph | ingest_rss_mb | build_rss_mb | warmup_s | warmup_rss_mb | search_s | search_rss_mb | recall@50_before_close | warmup_after_reopen_s | search_after_reopen_s | search_after_reopen_rss_mb | recall@50_after_reopen | peak_rss_mb | db_size_mb | total_duration |
22+
| :----------- | :--------------------- | ------------: | -----------: | -------: | ------------: | -------: | ------------: | ---------------------: | --------------------: | --------------------: | -------------------------: | ---------------------: | ----------: | ---------: | :------------- |
23+
| NONE | True | 8608.92 | 77.746 | 6001.45 | 157.16 | 6.26 | 13.098 | 0.9198 | 6.91 | 20.835 | 11.656 | 0.9197 | 9348.51 | 9650.44 | 1h 41m |
24+
| NONE | False | 8650.15 | 40.016 | 5974.64 | 139.84 | 9.871 | 4.977 | 0.916 | 7.873 | 6.724 | 11.609 | 0.9157 | 9329.95 | 5750.44 | 1h 41m |
25+
| INT8 | True | 8392.88 | 420.625 | 3416.91 | 95.156 | 13.134 | 10.457 | 0.9149 | 4.313 | 67.3 | 10.047 | 0.9072 | 9408.62 | 10638.9 | 59m |
26+
| INT8 | False | 8284.59 | 518.801 | 3404.88 | 112.254 | 9.147 | 13.809 | 0.9146 | 4.457 | 9.165 | 31.578 | 0.9146 | 9437.81 | 6738.94 | 58m |
27+
28+
#### Disk Usage Breakdown
29+
30+
##### store_vectors_in_graph=False and quantization=INT8
31+
32+
```bash
33+
du -sh arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=100000_seed=42/* | sort -h
34+
320K arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=100000_seed=42/dictionary.0.327680.v0.dict
35+
59M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=100000_seed=42/VectorData_0_2689535159251959_vecgraph.5.262144.v0.vecgraph
36+
999M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=100000_seed=42/VectorData_0_2689535159251959.4.262144.v0.lsmvecidx
37+
5.6G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=100000_seed=42/VectorData_0.1.65536.v0.bucket
38+
```
939

10-
**Hardware:** AMD Ryzen 9 7950X (16 cores), 128GB DDR5 (4x32GB), Samsung SSD 970 EVO Plus 2TB.
40+
##### store_vectors_in_graph=True and quantization=INT8
41+
42+
```bash
43+
du -sh arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=on_hier=on_batch=100000_seed=42/* | sort -h
44+
320K arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=on_hier=on_batch=100000_seed=42/dictionary.0.327680.v0.dict
45+
999M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=on_hier=on_batch=100000_seed=42/VectorData_0_2689534677234566.4.262144.v0.lsmvecidx
46+
3.9G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=on_hier=on_batch=100000_seed=42/VectorData_0_2689534677234566_vecgraph.5.262144.v0.vecgraph
47+
5.6G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=on_hier=on_batch=100000_seed=42/VectorData_0.1.65536.v0.bucket
48+
```
49+
50+
##### store_vectors_in_graph=False and quantization=None
51+
52+
```bash
53+
du -sh arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=100000_seed=42/* | sort -h
54+
320K arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=100000_seed=42/dictionary.0.327680.v0.dict
55+
11M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=100000_seed=42/VectorData_0_2689535353426837.4.262144.v0.lsmvecidx
56+
59M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=100000_seed=42/VectorData_0_2689535353426837_vecgraph.5.262144.v0.vecgraph
57+
5.6G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=100000_seed=42/VectorData_0.1.65536.v0.bucket
58+
```
59+
60+
##### store_vectors_in_graph=True and quantization=None
61+
62+
```bash
63+
du -sh arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=on_hier=on_batch=100000_seed=42/* | sort -h
64+
320K arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=on_hier=on_batch=100000_seed=42/dictionary.0.327680.v0.dict
65+
11M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=on_hier=on_batch=100000_seed=42/VectorData_0_2689535105029551.4.262144.v0.lsmvecidx
66+
3.9G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=on_hier=on_batch=100000_seed=42/VectorData_0_2689535105029551_vecgraph.5.262144.v0.vecgraph
67+
5.6G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=on_hier=on_batch=100000_seed=42/VectorData_0.1.65536.v0.bucket
68+
```
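
The same per-file breakdown can be collected from Python, which is handy when comparing many run directories. This is a minimal sketch, not part of the benchmark scripts: `size_by_extension` and `format_report` are hypothetical helper names. It sums apparent file sizes (`st_size`), so the totals can differ slightly from `du`, which reports allocated disk blocks.

```python
from collections import defaultdict
from pathlib import Path


def size_by_extension(run_dir):
    """Return {extension: total apparent size in bytes} for files directly in run_dir."""
    totals = defaultdict(int)
    for path in Path(run_dir).iterdir():
        if path.is_file():
            # Path.suffix is the final extension, e.g. ".vecgraph" or ".bucket".
            totals[path.suffix] += path.stat().st_size
    return dict(totals)


def format_report(totals):
    """Render the totals smallest-first in MiB, similar to `du ... | sort -h`."""
    lines = []
    for ext, size in sorted(totals.items(), key=lambda item: item[1]):
        lines.append(f"{size / 2**20:10.1f} MiB  {ext or '(no extension)'}")
    return "\n".join(lines)
```

To use it, pass one of the run directories under `arcadedb_runs/` to `size_by_extension` and print the result of `format_report`.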
69+
70+
#### Findings
71+
72+
- **Disk Usage:** With `storeVectorsInGraph=False`, the vector graph file (`*.vecgraph`) shrinks from ~3.9GB to ~59MB. Avoiding vector duplication reduces the total DB size by ~37–40% (5.7–6.7GB vs 9.6–10.6GB). In INT8 mode (`quant=int8`), the `*.lsmvecidx` file adds ~1GB compared to `quant=none` (~11MB), because the quantized index structure itself consumes space. The bucket size (~5.6GB) stays constant across all runs, since it stores the original f32 vectors.
73+
- **Search Performance:** Initial search times are comparable, but `storeVectorsInGraph=True` results in significantly slower search after reopening the database (`search_after_reopen_s`: 20.8 vs 6.7 for NONE, 67.3 vs 9.2 for INT8).
74+
- **Conclusion:** `storeVectorsInGraph=True` adds no tangible benefits; it increases disk usage and degrades search performance after a restart. Keeping it disabled (`False`) is recommended.
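
The percentages and slowdowns behind these findings can be recomputed from the table above. The sketch below copies only the `db_size_mb` and `search_after_reopen_s` columns; the helper names are illustrative and not part of the benchmark code.

```python
# Values copied from the MSMARCO-1M table, keyed by
# (quantization, store_vectors_in_graph).
RUNS = {
    ("NONE", True): {"db_size_mb": 9650.44, "search_after_reopen_s": 20.835},
    ("NONE", False): {"db_size_mb": 5750.44, "search_after_reopen_s": 6.724},
    ("INT8", True): {"db_size_mb": 10638.9, "search_after_reopen_s": 67.3},
    ("INT8", False): {"db_size_mb": 6738.94, "search_after_reopen_s": 9.165},
}


def size_reduction_pct(quant):
    """Percent of DB size saved by switching store_vectors_in_graph off."""
    on = RUNS[(quant, True)]["db_size_mb"]
    off = RUNS[(quant, False)]["db_size_mb"]
    return 100.0 * (on - off) / on


def reopen_slowdown(quant):
    """How many times slower post-reopen search is with store_vectors_in_graph on."""
    on = RUNS[(quant, True)]["search_after_reopen_s"]
    off = RUNS[(quant, False)]["search_after_reopen_s"]
    return on / off


for quant in ("NONE", "INT8"):
    print(
        f"{quant}: DB size -{size_reduction_pct(quant):.1f}%, "
        f"search after reopen {reopen_slowdown(quant):.1f}x slower with store=True"
    )
```

With these inputs the reduction is about 40.4% for NONE and 36.7% for INT8, and search after reopen is roughly 3.1x and 7.3x slower respectively. These are single runs on a shared machine, so the ratios are indicative only.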
75+
76+
### Commit/Date: main @ d8098d7 (Wed Jan 14 15:20:25 2026 -0500)
1177

1278
#### MSMARCO-1M (1000 queries, Recall@50)
1379

@@ -18,7 +84,7 @@
1884
| NONE | True | False | 67 | 8699 | 6561.28 | 147 | 16 | 10 | 0.9085 | 4 | 13 | 58 | 0.9049 | 9352 | 9645 | 1h 52m |
1985
| NONE | False | False | 66 | 8707 | 6590.55 | 171 | 13 | 16 | 0.8994 | 3 | 13 | 23 | 0.8994 | 9380 | 9645 | 1h 51m |
2086

21-
##### Findings
87+
#### Findings
2288

2389
- Memory: JVM heap capped at 8GB, yet RSS (Resident Set Size) peaks at 9.3–9.5GB in all runs; forcing 4GB causes OOM. Even a 1M dataset pushes memory use beyond the heap, suggesting off-heap/native graph build and mmap traffic dominate.
2490
- Storage: Each run writes ~1.0GB `*.lsmvecidx` + ~5.6GB bucket + ~3.9GB `*.vecgraph`; vectors are effectively stored twice (bucket + graph) because `store_vectors_in_graph=False` is ignored—LSMVectorIndexGraphFile still serializes inline vectors. This doubles disk and keeps RSS high when mapping the graph file.
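
A back-of-the-envelope estimate of the raw vector payload helps sanity-check the file sizes above. This sketch assumes 1024-dimensional embeddings, which is the published output size of Cohere embed-english-v3 but is not stated in this README, and it ignores any per-record or page overhead.

```python
NUM_VECTORS = 1_000_000  # MSMARCO-1M subset
DIM = 1024               # ASSUMPTION: embedding dimensionality, not stated in this README
MIB = 2**20


def raw_payload_mib(num_vectors, dim, bytes_per_component):
    """Size of the bare vector data in MiB, with no index or page overhead."""
    return num_vectors * dim * bytes_per_component / MIB


f32_mib = raw_payload_mib(NUM_VECTORS, DIM, 4)   # full-precision float32
int8_mib = raw_payload_mib(NUM_VECTORS, DIM, 1)  # one byte per component

print(f"float32 payload: {f32_mib:.0f} MiB ({f32_mib / 1024:.2f} GiB)")
print(f"int8 payload:    {int8_mib:.0f} MiB")
```

Under that assumption the float32 payload is about 3906 MiB (3.81 GiB), which is close to the ~3.9GB `*.vecgraph` and would be consistent with full-precision vectors being serialized inline in the graph file. The int8 payload comes to about 977 MiB, in the same range as the ~1.0GB `*.lsmvecidx`, though the README does not confirm what that file contains.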
