Commit 736762a

docs: Refactor and enhance clarity of benchmarking findings in README
1 parent b1e217a commit 736762a

1 file changed

Lines changed: 10 additions & 37 deletions

bindings/python/examples/benchmark-vector/README.md

@@ -2,14 +2,8 @@

 ## MSMARCO Dataset

-- Data prepared with
-[convert-msmacro-parquet-to-shards.py](./convert-msmacro-parquet-to-shards.py).
-[Download Cohere MSMARCO
-v2.1](https://huggingface.co/datasets/Cohere/msmarco-v2.1-embed-english-v3) parquet
-shards, normalize to float32, write flat f32 shards, and build exact GT for 1K queries
-(top-50).
-- Benchmarks here use the 1M subset. For production/RAG we should target 10M+ vectors;
-GT and shards are already computed—ask if you want the bundle to rerun.
+- Data prepared with [convert-msmacro-parquet-to-shards.py](./convert-msmacro-parquet-to-shards.py). [Download Cohere MSMARCO v2.1](https://huggingface.co/datasets/Cohere/msmarco-v2.1-embed-english-v3) parquet shards, normalize to float32, write flat f32 shards, and build exact GT for 1K queries (top-50).
+- Benchmarks here use the 1M subset. For production/RAG we should target 10M+ vectors; GT and shards are already computed—ask if you want the bundle to rerun.

 ### Commit/Date: main @ d8098d7 (Wed Jan 14 15:20:25 2026 -0500)
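
Note on the data-preparation bullet in the hunk above: the steps it names (normalize to float32, write flat f32 shards, build exact top-k ground truth) can be sketched with the standard library alone. This is not the contents of `convert-msmacro-parquet-to-shards.py`, which is not shown in this commit. The function names are hypothetical, and two assumptions are made: "normalize" is taken to mean L2 normalization, and similarity is taken to be the inner product.

```python
from array import array
import heapq
import math


def l2_normalize(vec):
    """Return vec as a float32 array scaled to unit L2 norm (assumed meaning of 'normalize')."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return array("f", vec)
    return array("f", (x / norm for x in vec))


def write_flat_f32(vectors, path):
    """Write vectors back to back as raw float32 values, with no header."""
    with open(path, "wb") as fh:
        for vec in vectors:
            array("f", vec).tofile(fh)


def read_flat_f32(path, dim):
    """Read a flat float32 shard back into a list of dim-sized arrays."""
    data = array("f")
    with open(path, "rb") as fh:
        data.frombytes(fh.read())
    return [data[i:i + dim] for i in range(0, len(data), dim)]


def exact_top_k(base, query, k):
    """Brute-force ground truth: ids of the k base vectors with the highest inner product."""
    scored = (
        (sum(a * b for a, b in zip(vec, query)), -idx)  # -idx breaks ties toward lower ids
        for idx, vec in enumerate(base)
    )
    return [-neg_idx for _, neg_idx in heapq.nlargest(k, scored)]


def recall_at_k(found, truth, k):
    """Fraction of the true top-k ids that also appear in the returned top-k."""
    return len(set(found[:k]) & set(truth[:k])) / k
```

A brute-force scan like this is only practical for a small query set, which fits the 1K queries the bullet mentions. At 1M base vectors a vectorized or batched implementation would be needed.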

@@ -24,32 +18,11 @@

 ##### Findings

-- Memory: JVM heap capped at 8GB, yet RSS (Resident Set Size) peaks 9.3–9.5GB in all
-runs; forcing 4GB causes OOM. Even a 1M dataset pushes outside heap, suggesting
-off-heap/native graph build and mmap traffic dominate.
-- Storage: Each run writes ~1.0GB `*.lsmvecidx` + ~5.6GB bucket + ~3.9GB `*.vecgraph`;
-vectors are effectively stored twice (bucket + graph) because
-`store_vectors_in_graph=False` is ignored—LSMVectorIndexGraphFile still serializes
-inline vectors. This doubles disk and keeps RSS high when mapping the graph file.
-- Lazy build + rebuild: Graph is built only after the first search, so the first query
-does all construction (long warmup). Post-ingest mutations set `graphState=MUTABLE`,
-and the search path currently rebuilds on the very next query since it only checks
-`mutationsSinceSerialize>0`; the configured threshold (GlobalConfiguration default
-100) is bypassed. Pure queries do not increment the counter, so 1,000 searches alone
-never trigger rebuilds.
-- Persistence: Close/reopen shows no rebuild because the Jan 14, 2026 engine fix now
-persists and reloads the graph successfully. The reopen warmup is mostly graph load,
-not rebuild.
-- Hierarchy: `add_hierarchy` raises build time modestly (~+9m: 2h00 vs. 1h51) but
-improves recall (0.9101 vs. 0.8994) and cuts search time materially (6s vs. 13–16s
-across 1K queries); likely fewer hops during graph search.
-- Quantization (int8): Ingest time drops sharply (1h07 vs. ~1h51–2h00) with comparable
-recall (0.9072 vs. 0.8994 baseline). However RSS does not improve and db size
-increases (10.6GB vs. 9.6GB), likely because vectors are duplicated in the graph
-and/or stored as float alongside the int8 quantized form.
-- JVector knobs: `MAX_CONNECTIONS=12` and `BEAM_WIDTH=64` held constant; higher will
-improve recall at higher build cost. JVector lacks `efSearch`, so overquery (>k then
-rerank) is the lever; overquery factor was 1 here to simplify results.
-- Next steps: The ArcadeDB team is looking into the vector duplication and
-rebuild-threshold issues. Once fixes land, we will rerun on the 1M set for
-verification and then move to the 10M benchmark.
+- Memory: JVM heap capped at 8GB, yet RSS (Resident Set Size) peaks 9.3–9.5GB in all runs; forcing 4GB causes OOM. Even a 1M dataset pushes outside heap, suggesting off-heap/native graph build and mmap traffic dominate.
+- Storage: Each run writes ~1.0GB `*.lsmvecidx` + ~5.6GB bucket + ~3.9GB `*.vecgraph`; vectors are effectively stored twice (bucket + graph) because `store_vectors_in_graph=False` is ignored—LSMVectorIndexGraphFile still serializes inline vectors. This doubles disk and keeps RSS high when mapping the graph file.
+- Lazy build + rebuild: Graph is built only after the first search, so the first query does all construction (long warmup). Post-ingest mutations set `graphState=MUTABLE`, and the search path currently rebuilds on the very next query since it only checks `mutationsSinceSerialize>0`; the configured threshold (GlobalConfiguration default 100) is bypassed. Pure queries do not increment the counter, so 1,000 searches alone never trigger rebuilds.
+- Persistence: Close/reopen shows no rebuild because the Jan 14, 2026 engine fix now persists and reloads the graph successfully. The reopen warmup is mostly graph load, not rebuild.
+- Hierarchy: `add_hierarchy` raises build time modestly (~+9m: 2h00 vs. 1h51) but improves recall (0.9101 vs. 0.8994) and cuts search time materially (6s vs. 13–16s across 1K queries); likely fewer hops during graph search.
+- Quantization (int8): Ingest time drops sharply (1h07 vs. ~1h51–2h00) with comparable recall (0.9072 vs. 0.8994 baseline). However RSS does not improve and db size increases (10.6GB vs. 9.6GB), likely because vectors are duplicated in the graph and/or stored as float alongside the int8 quantized form.
+- JVector knobs: `MAX_CONNECTIONS=12` and `BEAM_WIDTH=64` held constant; higher will improve recall at higher build cost. JVector lacks `efSearch`, so overquery (>k then rerank) is the lever; overquery factor was 1 here to simplify results.
+- Next steps: The ArcadeDB team is looking into the vector duplication and rebuild-threshold issues. Once fixes land, we will rerun on the 1M set for verification and then move to the 10M benchmark.
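
Note on the "Lazy build + rebuild" finding: the toy model below contrasts the behavior the README describes (rebuild on the next search whenever the mutation counter is above zero) with a threshold-honoring check. It is a Python illustration only, not ArcadeDB engine code. The class and field names are hypothetical, and two details are assumptions: that the threshold comparison is `>=`, and that a rebuild resets the counter.

```python
from dataclasses import dataclass

# The README cites a GlobalConfiguration default of 100 for the rebuild threshold.
DEFAULT_REBUILD_THRESHOLD = 100


@dataclass
class GraphRebuildModel:
    """Toy model of the rebuild decision taken on the search path."""

    honor_threshold: bool = True
    threshold: int = DEFAULT_REBUILD_THRESHOLD
    mutations_since_serialize: int = 0
    rebuilds: int = 0

    def mutate(self, count=1):
        # Only inserts, updates, and deletes advance the counter.
        self.mutations_since_serialize += count

    def search(self):
        # Described behavior: any pending mutation triggers a rebuild (limit of 1).
        # Threshold-honoring behavior: wait until the configured number of mutations.
        limit = self.threshold if self.honor_threshold else 1
        if self.mutations_since_serialize >= limit:
            self.rebuilds += 1
            self.mutations_since_serialize = 0  # assumption: a rebuild resets the counter


described = GraphRebuildModel(honor_threshold=False)
described.mutate()
for _ in range(1000):
    described.search()  # one rebuild on the first search, none afterwards

with_threshold = GraphRebuildModel()
with_threshold.mutate(99)
with_threshold.search()  # below the threshold, so no rebuild
```

In this model a single post-ingest mutation costs a full rebuild under the described check, while repeated searches without mutations never do, which matches the finding that 1,000 pure queries trigger no rebuilds.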

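Note on the "JVector knobs" finding: overquery means asking the approximate index for more than k candidates and then reranking them with exact scores. The sketch below shows the idea with plain callables. `approx_search` and `exact_score` are hypothetical stand-ins, not JVector or ArcadeDB APIs, and the toy data exists only to make the effect visible.

```python
def overquery_search(approx_search, exact_score, query, k, overquery_factor=1):
    """Fetch k * overquery_factor approximate candidates, rerank exactly, keep the top k."""
    if overquery_factor < 1:
        raise ValueError("overquery_factor must be >= 1")
    candidates = approx_search(query, k * overquery_factor)
    reranked = sorted(candidates, key=lambda cid: exact_score(query, cid), reverse=True)
    return reranked[:k]


# Toy data: id -> 2-d vector.
VECTORS = {0: (1.0, 0.0), 1: (0.9, 0.9), 2: (0.8, 0.1), 3: (0.1, 0.2)}


def toy_exact_score(query, cid):
    # Exact inner product over both coordinates.
    return sum(q * v for q, v in zip(query, VECTORS[cid]))


def toy_approx_search(query, n):
    # Deliberately lossy stand-in for an approximate index: it scores on the
    # first coordinate only, so its ranking can miss the true nearest neighbor.
    ranked = sorted(VECTORS, key=lambda cid: query[0] * VECTORS[cid][0], reverse=True)
    return ranked[:n]


query = (1.0, 1.0)
top1_plain = overquery_search(toy_approx_search, toy_exact_score, query, k=1)
top1_overquery = overquery_search(
    toy_approx_search, toy_exact_score, query, k=1, overquery_factor=2
)
```

With a factor of 1 the result is whatever the approximate ranking puts first. With a factor of 2 the exact rerank can recover the true best match, at the cost of scoring twice as many candidates.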