|
2 | 2 |
|
3 | 3 | ## MSMARCO Dataset |
4 | 4 |
|
5 | | -- Data prepared with |
6 | | - [convert-msmacro-parquet-to-shards.py](./convert-msmacro-parquet-to-shards.py). |
7 | | - [Download Cohere MSMARCO |
8 | | - v2.1](https://huggingface.co/datasets/Cohere/msmarco-v2.1-embed-english-v3) parquet |
9 | | - shards, normalize to float32, write flat f32 shards, and build exact GT for 1K queries |
10 | | - (top-50). |
11 | | -- Benchmarks here use the 1M subset. For production/RAG we should target 10M+ vectors; |
12 | | - GT and shards are already computed—ask if you want the bundle to rerun. |
| 5 | +- Data prepared with [convert-msmacro-parquet-to-shards.py](./convert-msmacro-parquet-to-shards.py). [Download Cohere MSMARCO v2.1](https://huggingface.co/datasets/Cohere/msmarco-v2.1-embed-english-v3) parquet shards, normalize to float32, write flat f32 shards, and build exact GT for 1K queries (top-50). |
| 6 | +- Benchmarks here use the 1M subset. For production/RAG we should target 10M+ vectors; GT and shards are already computed—ask if you want the bundle to rerun. |
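The preparation steps above (cast to float32, write flat f32 shards, build exact ground truth) can be sketched roughly as follows. This is a toy, stdlib-only illustration and not the contents of `convert-msmacro-parquet-to-shards.py`: the function names, the headerless shard layout, and the use of dot product as the similarity are all assumptions, and a real 1M-vector run would want a vectorized library rather than pure Python loops.

```python
import heapq
from array import array


def write_f32_shard(path, vectors):
    """Write vectors back to back as raw float32 values (assumed layout: no header)."""
    with open(path, "wb") as f:
        for vec in vectors:
            # array("f", ...) casts each component to 32-bit float
            array("f", vec).tofile(f)


def read_f32_shard(path, dim):
    """Read a flat float32 shard and split it into vectors of length dim."""
    flat = array("f")
    with open(path, "rb") as f:
        flat.frombytes(f.read())
    return [flat[i:i + dim] for i in range(0, len(flat), dim)]


def exact_top_k(query, vectors, k):
    """Brute-force ground truth: score every vector, keep the k best ids.

    Dot product is an assumption here; the source does not name the metric.
    """
    scored = (
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    return [idx for _, idx in heapq.nlargest(k, scored)]
```

With the run described above, `k` would be 50 and `exact_top_k` would be called once per query for the 1K queries; the resulting id lists are what recall is later measured against.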
13 | 7 |
|
14 | 8 | ### Commit/Date: main @ d8098d7 (Wed Jan 14 15:20:25 2026 -0500) |
15 | 9 |
|
|
24 | 18 |
|
25 | 19 | ##### Findings |
26 | 20 |
|
27 | | -- Memory: JVM heap capped at 8GB, yet RSS (Resident Set Size) peaks 9.3–9.5GB in all |
28 | | - runs; forcing 4GB causes OOM. Even a 1M dataset pushes outside heap, suggesting |
29 | | - off-heap/native graph build and mmap traffic dominate. |
30 | | -- Storage: Each run writes ~1.0GB `*.lsmvecidx` + ~5.6GB bucket + ~3.9GB `*.vecgraph`; |
31 | | - vectors are effectively stored twice (bucket + graph) because |
32 | | - `store_vectors_in_graph=False` is ignored—LSMVectorIndexGraphFile still serializes |
33 | | - inline vectors. This doubles disk and keeps RSS high when mapping the graph file. |
34 | | -- Lazy build + rebuild: Graph is built only after the first search, so the first query |
35 | | - does all construction (long warmup). Post-ingest mutations set `graphState=MUTABLE`, |
36 | | - and the search path currently rebuilds on the very next query since it only checks |
37 | | - `mutationsSinceSerialize>0`; the configured threshold (GlobalConfiguration default |
38 | | - 100) is bypassed. Pure queries do not increment the counter, so 1,000 searches alone |
39 | | - never trigger rebuilds. |
40 | | -- Persistence: Close/reopen shows no rebuild because the Jan 14, 2026 engine fix now |
41 | | - persists and reloads the graph successfully. The reopen warmup is mostly graph load, |
42 | | - not rebuild. |
43 | | -- Hierarchy: `add_hierarchy` raises build time modestly (~+9m: 2h00 vs. 1h51) but |
44 | | - improves recall (0.9101 vs. 0.8994) and cuts search time materially (6s vs. 13–16s |
45 | | - across 1K queries); likely fewer hops during graph search. |
46 | | -- Quantization (int8): Ingest time drops sharply (1h07 vs. ~1h51–2h00) with comparable |
47 | | - recall (0.9072 vs. 0.8994 baseline). However RSS does not improve and db size |
48 | | - increases (10.6GB vs. 9.6GB), likely because vectors are duplicated in the graph |
49 | | - and/or stored as float alongside the int8 quantized form. |
50 | | -- JVector knobs: `MAX_CONNECTIONS=12` and `BEAM_WIDTH=64` held constant; higher will |
51 | | - improve recall at higher build cost. JVector lacks `efSearch`, so overquery (>k then |
52 | | - rerank) is the lever; overquery factor was 1 here to simplify results. |
53 | | -- Next steps: The ArcadeDB team is looking into the vector duplication and |
54 | | - rebuild-threshold issues. Once fixes land, we will rerun on the 1M set for |
55 | | - verification and then move to the 10M benchmark. |
| 21 | +- Memory: JVM heap capped at 8GB, yet RSS (Resident Set Size) peaks at 9.3–9.5GB in all runs; capping the heap at 4GB causes OOM. Even the 1M dataset pushes memory use beyond the heap, suggesting that off-heap/native graph build and mmap traffic dominate. |
| 22 | +- Storage: Each run writes ~1.0GB `*.lsmvecidx` + ~5.6GB bucket + ~3.9GB `*.vecgraph`; vectors are effectively stored twice (bucket + graph) because `store_vectors_in_graph=False` is ignored—LSMVectorIndexGraphFile still serializes inline vectors. This doubles disk and keeps RSS high when mapping the graph file. |
| 23 | +- Lazy build + rebuild: Graph is built only after the first search, so the first query does all construction (long warmup). Post-ingest mutations set `graphState=MUTABLE`, and the search path currently rebuilds on the very next query since it only checks `mutationsSinceSerialize>0`; the configured threshold (GlobalConfiguration default 100) is bypassed. Pure queries do not increment the counter, so 1,000 searches alone never trigger rebuilds. |
| 24 | +- Persistence: Close/reopen shows no rebuild because the Jan 14, 2026 engine fix now persists and reloads the graph successfully. The reopen warmup is mostly graph load, not rebuild. |
| 25 | +- Hierarchy: `add_hierarchy` raises build time modestly (~+9m: 2h00 vs. 1h51) but improves recall (0.9101 vs. 0.8994) and cuts search time materially (6s vs. 13–16s across 1K queries); likely fewer hops during graph search. |
| 26 | +- Quantization (int8): Ingest time drops sharply (1h07 vs. ~1h51–2h00) with comparable recall (0.9072 vs. 0.8994 baseline). However, RSS does not improve and db size increases (10.6GB vs. 9.6GB), likely because vectors are duplicated in the graph and/or stored as float alongside the int8 quantized form. |
| 27 | +- JVector knobs: `MAX_CONNECTIONS=12` and `BEAM_WIDTH=64` were held constant; higher values should improve recall at a higher build cost. JVector lacks `efSearch`, so overquery (fetch more than k candidates, then rerank) is the lever; the overquery factor was 1 here to simplify results. |
| 28 | +- Next steps: The ArcadeDB team is looking into the vector duplication and rebuild-threshold issues. Once fixes land, we will rerun on the 1M set for verification and then move to the 10M benchmark. |
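The rebuild behavior described in the "Lazy build + rebuild" finding can be modeled with a small sketch. This is not ArcadeDB code: `IndexState`, `mutate`, and `search` are hypothetical names, the `>=` comparison against the threshold is an assumption about the intended semantics, and resetting the counter after a rebuild is likewise assumed. Only the default of 100 and the `mutationsSinceSerialize>0` check come from the findings.

```python
from dataclasses import dataclass

DEFAULT_REBUILD_THRESHOLD = 100  # default quoted in the findings


@dataclass
class IndexState:
    mutations_since_serialize: int = 0
    rebuilds: int = 0


def mutate(state):
    """A post-ingest write: bumps the mutation counter."""
    state.mutations_since_serialize += 1


def search(state, threshold=None):
    """Run one query, rebuilding first if the rebuild condition holds.

    threshold=None models the observed behavior (rebuild on any pending
    mutation); passing a number models the configured threshold being honored.
    Searches themselves never touch the mutation counter.
    """
    limit = 1 if threshold is None else threshold
    if state.mutations_since_serialize >= limit:
        state.rebuilds += 1
        state.mutations_since_serialize = 0  # assumed: rebuild resets the counter
```

Under this model, a single mutation followed by `search(state)` triggers a rebuild, while `search(state, threshold=DEFAULT_REBUILD_THRESHOLD)` would defer it until 100 mutations have accumulated. In both variants, any number of searches with no mutations leaves `rebuilds` unchanged.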