|
5 | 5 | - Data prepared with [convert-msmacro-parquet-to-shards.py](./convert-msmacro-parquet-to-shards.py). [Download Cohere MSMARCO v2.1](https://huggingface.co/datasets/Cohere/msmarco-v2.1-embed-english-v3) parquet shards, normalize to float32, write flat f32 shards, and build exact GT for 1K queries (top-50). |
6 | 6 | - Benchmarks here use the 1M subset. For production/RAG we should target 10M+ vectors; GT and shards are already computed—ask if you want the bundle to rerun. |
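
The preparation steps above (normalize, write flat f32 shards, build exact ground truth) can be sketched as follows. This is a minimal illustration, not the contents of the conversion script: the helper names, the headerless row-major float32 layout, and the use of inner product on unit vectors are all assumptions, and it uses only the standard library to stay self-contained.

```python
import heapq
import math
from array import array


def l2_normalize(vec):
    """Return vec scaled to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def write_flat_f32(path, vectors):
    """Write vectors back-to-back as raw float32 values with no header (assumed layout)."""
    with open(path, "wb") as f:
        for vec in vectors:
            array("f", vec).tofile(f)


def read_flat_f32(path, dim):
    """Read a headerless float32 file into a list of dim-sized rows."""
    flat = array("f")
    with open(path, "rb") as f:
        flat.frombytes(f.read())
    return [list(flat[i:i + dim]) for i in range(0, len(flat), dim)]


def exact_top_k(query, base, k):
    """Brute-force top-k ids by inner product (equal to cosine similarity for unit vectors)."""
    scores = ((sum(q * b for q, b in zip(query, vec)), i) for i, vec in enumerate(base))
    return [i for _, i in heapq.nlargest(k, scores)]
```

Calling `exact_top_k(query, base, 50)` for each of the 1K queries would give a top-50 ground-truth list per query. A brute-force scan is exact by construction, which is why it serves as the reference for recall.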
7 | 7 |
|
8 | | -### Commit/Date: main @ d8098d7 (Wed Jan 14 15:20:25 2026 -0500) |
| 8 | +## Benchmark Setup |
| 9 | + |
| 10 | +- **Hardware:** AMD Ryzen 9 7950X (16 cores), 128GB DDR5 (4x32GB), Samsung SSD 970 EVO Plus 2TB. |
| 11 | +- Treat durations as approximate: other processes were running on the machine, so timings are noisy. RSS and DB size are more stable. Each task was allocated 4 threads, but the number of tasks running in parallel varied, so effective CPU usage may differ between runs.
| 12 | +- Unless otherwise noted, `MAX_CONNECTIONS` is fixed at 12, `BEAM_WIDTHS` at 64, and `OVERQUERY_FACTORS` at 1.
| 13 | + |
| 14 | +### Commit/Date: main @ da5e70d (Thu Jan 15 09:44:44 2026 -0500) |
| 15 | + |
| 16 | +- This commit fixes the `store_vectors_in_graph` issue, where `store_vectors_in_graph=False` was ignored and vectors were still serialized into the graph file.
| 17 | +- All runs below used `add_hierarchy=True`.
| 18 | + |
| 19 | +#### MSMARCO-1M (1000 queries, Recall@50) |
| 20 | + |
| 21 | +| quantization | store_vectors_in_graph | ingest_rss_mb | build_rss_mb | warmup_s | warmup_rss_mb | search_s | search_rss_mb | recall@50_before_close | warmup_after_reopen_s | search_after_reopen_s | search_after_reopen_rss_mb | recall@50_after_reopen | peak_rss_mb | db_size_mb | total_duration | |
| 22 | +| :----------- | :--------------------- | ------------: | -----------: | -------: | ------------: | -------: | ------------: | ---------------------: | --------------------: | --------------------: | -------------------------: | ---------------------: | ----------: | ---------: | :------------- | |
| 23 | +| NONE | True | 8608.92 | 77.746 | 6001.45 | 157.16 | 6.26 | 13.098 | 0.9198 | 6.91 | 20.835 | 11.656 | 0.9197 | 9348.51 | 9650.44 | 1h 41m | |
| 24 | +| NONE | False | 8650.15 | 40.016 | 5974.64 | 139.84 | 9.871 | 4.977 | 0.916 | 7.873 | 6.724 | 11.609 | 0.9157 | 9329.95 | 5750.44 | 1h 41m | |
| 25 | +| INT8 | True | 8392.88 | 420.625 | 3416.91 | 95.156 | 13.134 | 10.457 | 0.9149 | 4.313 | 67.3 | 10.047 | 0.9072 | 9408.62 | 10638.9 | 59m | |
| 26 | +| INT8 | False | 8284.59 | 518.801 | 3404.88 | 112.254 | 9.147 | 13.809 | 0.9146 | 4.457 | 9.165 | 31.578 | 0.9146 | 9437.81 | 6738.94 | 58m | |
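
For reference, the recall columns can be read with the usual definition of Recall@k in mind: the fraction of each query's exact top-k neighbors that the index returns, averaged over queries. The sketch below assumes that definition; the benchmark's own implementation is not shown here, and the function name is hypothetical.

```python
def recall_at_k(retrieved, ground_truth, k=50):
    """Mean fraction of each query's true top-k ids found in its retrieved top-k.

    Assumes every ground-truth list is non-empty.
    """
    if len(retrieved) != len(ground_truth):
        raise ValueError("retrieved and ground_truth must cover the same queries")
    total = 0.0
    for got, truth in zip(retrieved, ground_truth):
        truth_k = set(truth[:k])
        total += len(truth_k.intersection(got[:k])) / len(truth_k)
    return total / len(retrieved)
```

Under this definition, a value of 0.92 means that on average 46 of the 50 exact neighbors were returned per query.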
| 27 | + |
| 28 | +#### Disk Usage Breakdown |
| 29 | + |
| 30 | +##### store_vectors_in_graph=False and quantization=INT8 |
| 31 | + |
| 32 | +```bash |
| 33 | +du -sh arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=100000_seed=42/* | sort -h |
| 34 | +320K arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=100000_seed=42/dictionary.0.327680.v0.dict |
| 35 | +59M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=100000_seed=42/VectorData_0_2689535159251959_vecgraph.5.262144.v0.vecgraph |
| 36 | +999M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=100000_seed=42/VectorData_0_2689535159251959.4.262144.v0.lsmvecidx |
| 37 | +5.6G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=off_hier=on_batch=100000_seed=42/VectorData_0.1.65536.v0.bucket |
| 38 | +``` |
9 | 39 |
|
10 | | -**Hardware:** AMD Ryzen 9 7950X (16 cores), 128GB DDR5 (4x32GB), Samsung SSD 970 EVO Plus 2TB. |
| 40 | +##### store_vectors_in_graph=True and quantization=INT8 |
| 41 | + |
| 42 | +```bash |
| 43 | +du -sh arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=on_hier=on_batch=100000_seed=42/* | sort -h |
| 44 | +320K arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=on_hier=on_batch=100000_seed=42/dictionary.0.327680.v0.dict |
| 45 | +999M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=on_hier=on_batch=100000_seed=42/VectorData_0_2689534677234566.4.262144.v0.lsmvecidx |
| 46 | +3.9G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=on_hier=on_batch=100000_seed=42/VectorData_0_2689534677234566_vecgraph.5.262144.v0.vecgraph |
| 47 | +5.6G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=int8_store=on_hier=on_batch=100000_seed=42/VectorData_0.1.65536.v0.bucket |
| 48 | +``` |
| 49 | + |
| 50 | +##### store_vectors_in_graph=False and quantization=None |
| 51 | + |
| 52 | +```bash |
| 53 | +du -sh arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=100000_seed=42/* | sort -h |
| 54 | +320K arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=100000_seed=42/dictionary.0.327680.v0.dict |
| 55 | +11M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=100000_seed=42/VectorData_0_2689535353426837.4.262144.v0.lsmvecidx |
| 56 | +59M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=100000_seed=42/VectorData_0_2689535353426837_vecgraph.5.262144.v0.vecgraph |
| 57 | +5.6G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=off_hier=on_batch=100000_seed=42/VectorData_0.1.65536.v0.bucket |
| 58 | +``` |
| 59 | + |
| 60 | +##### store_vectors_in_graph=True and quantization=None |
| 61 | + |
| 62 | +```bash |
| 63 | +du -sh arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=on_hier=on_batch=100000_seed=42/* | sort -h |
| 64 | +320K arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=on_hier=on_batch=100000_seed=42/dictionary.0.327680.v0.dict |
| 65 | +11M arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=on_hier=on_batch=100000_seed=42/VectorData_0_2689535105029551.4.262144.v0.lsmvecidx |
| 66 | +3.9G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=on_hier=on_batch=100000_seed=42/VectorData_0_2689535105029551_vecgraph.5.262144.v0.vecgraph |
| 67 | +5.6G arcadedb_runs/dataset=MSMARCO-1M_label=1000000_maxconn=12_beam=64_oq=1_quant=none_store=on_hier=on_batch=100000_seed=42/VectorData_0.1.65536.v0.bucket
| 68 | +``` |
| 69 | + |
| 70 | +#### Findings |
| 71 | + |
| 72 | +- **Disk Usage:** With `storeVectorsInGraph=False`, the vector graph file (`*.vecgraph`) shrinks from ~3.9GB to ~59MB because vectors are no longer duplicated in the graph. Total DB size drops by ~37-40% (5.7GB vs 9.6GB for `quant=none`, 6.7GB vs 10.6GB for `quant=int8`). INT8 mode adds ~1GB for the `*.lsmvecidx` file (~999MB vs ~11MB with `quant=none`), since the quantized index itself takes space. The bucket stays at ~5.6GB in all runs because it stores the original f32 vectors.
| 73 | +- **Search Performance:** Initial search times are comparable, but with `storeVectorsInGraph=True` search after reopening the database is much slower: `search_after_reopen_s` is 20.8 vs 6.7 for NONE and 67.3 vs 9.2 for INT8. In the INT8 run with `storeVectorsInGraph=True`, recall@50 also drops from 0.9149 to 0.9072 after reopen.
| 74 | +- **Conclusion:** `storeVectorsInGraph=True` showed no tangible benefit in these runs; it increases disk usage and degrades search performance after a restart. Keeping it disabled (`False`) is recommended.
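
The disk savings quoted above can be rechecked from the `db_size_mb` column of the table. The snippet below is only a worked calculation over those four values; the dictionary and function names are illustrative.

```python
# db_size_mb values copied from the MSMARCO-1M table for commit da5e70d,
# keyed by (quantization, store_vectors_in_graph).
DB_SIZE_MB = {
    ("NONE", True): 9650.44,
    ("NONE", False): 5750.44,
    ("INT8", True): 10638.9,
    ("INT8", False): 6738.94,
}


def savings(quantization):
    """Return (MB saved, percent saved) when vectors are not stored in the graph."""
    with_graph = DB_SIZE_MB[(quantization, True)]
    without_graph = DB_SIZE_MB[(quantization, False)]
    saved = with_graph - without_graph
    return saved, 100.0 * saved / with_graph


for quant in ("NONE", "INT8"):
    saved_mb, pct = savings(quant)
    print(f"{quant}: {saved_mb:.2f} MB saved ({pct:.1f}%)")
```

Both configurations save about 3.9GB, which matches the size of the `*.vecgraph` file when vectors are stored inline; the percentage differs only because the INT8 database is larger to begin with.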
| 75 | + |
| 76 | +### Commit/Date: main @ d8098d7 (Wed Jan 14 15:20:25 2026 -0500) |
11 | 77 |
|
12 | 78 | #### MSMARCO-1M (1000 queries, Recall@50) |
13 | 79 |
|
|
18 | 84 | | NONE | True | False | 67 | 8699 | 6561.28 | 147 | 16 | 10 | 0.9085 | 4 | 13 | 58 | 0.9049 | 9352 | 9645 | 1h 52m | |
19 | 85 | | NONE | False | False | 66 | 8707 | 6590.55 | 171 | 13 | 16 | 0.8994 | 3 | 13 | 23 | 0.8994 | 9380 | 9645 | 1h 51m | |
20 | 86 |
|
21 | | -##### Findings |
| 87 | +#### Findings |
22 | 88 |
|
23 | 89 | - Memory: JVM heap capped at 8GB, yet RSS (Resident Set Size) peaks at 9.3–9.5GB in all runs; forcing 4GB causes OOM. Even the 1M dataset pushes memory use beyond the heap, suggesting that off-heap/native graph build and mmap traffic dominate.
24 | 90 | - Storage: Each run writes ~1.0GB `*.lsmvecidx` + ~5.6GB bucket + ~3.9GB `*.vecgraph`; vectors are effectively stored twice (bucket + graph) because `store_vectors_in_graph=False` is ignored—LSMVectorIndexGraphFile still serializes inline vectors. This doubles disk and keeps RSS high when mapping the graph file. |
|