docs: Refactor and enhance clarity of benchmarking findings in README

tae898 · tae898 · commit 736762a52096 · 2026-01-15T11:12:12.000+01:00
diff --git a/bindings/python/examples/benchmark-vector/README.md b/bindings/python/examples/benchmark-vector/README.md
@@ -2,14 +2,8 @@
 
 ## MSMARCO Dataset
 
-- Data prepared with
-  [convert-msmacro-parquet-to-shards.py](./convert-msmacro-parquet-to-shards.py).
-  [Download Cohere MSMARCO
-  v2.1](https://huggingface.co/datasets/Cohere/msmarco-v2.1-embed-english-v3) parquet
-  shards, normalize to float32, write flat f32 shards, and build exact GT for 1K queries
-  (top-50).
-- Benchmarks here use the 1M subset. For production/RAG we should target 10M+ vectors;
-  GT and shards are already computed—ask if you want the bundle to rerun.
+- Data prepared with [convert-msmacro-parquet-to-shards.py](./convert-msmacro-parquet-to-shards.py): [download Cohere MSMARCO v2.1](https://huggingface.co/datasets/Cohere/msmarco-v2.1-embed-english-v3) parquet shards, normalize to float32, write flat f32 shards, and build exact ground truth (GT) for 1K queries (top-50).
+- Benchmarks here use the 1M subset. For production/RAG we should target 10M+ vectors; GT and shards are already computed—ask if you want the bundle to rerun.
 
 ### Commit/Date: main @ d8098d7 (Wed Jan 14 15:20:25 2026 -0500)
 
@@ -24,32 +18,11 @@
 
 ##### Findings
 
-- Memory: JVM heap capped at 8GB, yet RSS (Resident Set Size) peaks 9.3–9.5GB in all
-  runs; forcing 4GB causes OOM. Even a 1M dataset pushes outside heap, suggesting
-  off-heap/native graph build and mmap traffic dominate.
-- Storage: Each run writes ~1.0GB `*.lsmvecidx` + ~5.6GB bucket + ~3.9GB `*.vecgraph`;
-  vectors are effectively stored twice (bucket + graph) because
-  `store_vectors_in_graph=False` is ignored—LSMVectorIndexGraphFile still serializes
-  inline vectors. This doubles disk and keeps RSS high when mapping the graph file.
-- Lazy build + rebuild: Graph is built only after the first search, so the first query
-  does all construction (long warmup). Post-ingest mutations set `graphState=MUTABLE`,
-  and the search path currently rebuilds on the very next query since it only checks
-  `mutationsSinceSerialize>0`; the configured threshold (GlobalConfiguration default
-  100) is bypassed. Pure queries do not increment the counter, so 1,000 searches alone
-  never trigger rebuilds.
-- Persistence: Close/reopen shows no rebuild because the Jan 14, 2026 engine fix now
-  persists and reloads the graph successfully. The reopen warmup is mostly graph load,
-  not rebuild.
-- Hierarchy: `add_hierarchy` raises build time modestly (~+9m: 2h00 vs. 1h51) but
-  improves recall (0.9101 vs. 0.8994) and cuts search time materially (6s vs. 13–16s
-  across 1K queries); likely fewer hops during graph search.
-- Quantization (int8): Ingest time drops sharply (1h07 vs. ~1h51–2h00) with comparable
-  recall (0.9072 vs. 0.8994 baseline). However RSS does not improve and db size
-  increases (10.6GB vs. 9.6GB), likely because vectors are duplicated in the graph
-  and/or stored as float alongside the int8 quantized form.
-- JVector knobs: `MAX_CONNECTIONS=12` and `BEAM_WIDTH=64` held constant; higher will
-  improve recall at higher build cost. JVector lacks `efSearch`, so overquery (>k then
-  rerank) is the lever; overquery factor was 1 here to simplify results.
-- Next steps: The ArcadeDB team is looking into the vector duplication and
-  rebuild-threshold issues. Once fixes land, we will rerun on the 1M set for
-  verification and then move to the 10M benchmark.
+- Memory: JVM heap is capped at 8GB, yet RSS (Resident Set Size) peaks at 9.3–9.5GB in all runs; forcing a 4GB heap causes OOM. Even the 1M dataset pushes memory use beyond the heap, suggesting that off-heap/native graph build and mmap traffic dominate.
+- Storage: Each run writes ~1.0GB `*.lsmvecidx` + ~5.6GB bucket + ~3.9GB `*.vecgraph`; vectors are effectively stored twice (bucket + graph) because `store_vectors_in_graph=False` is ignored—LSMVectorIndexGraphFile still serializes inline vectors. This doubles disk and keeps RSS high when mapping the graph file.
+- Lazy build + rebuild: Graph is built only after the first search, so the first query does all construction (long warmup). Post-ingest mutations set `graphState=MUTABLE`, and the search path currently rebuilds on the very next query since it only checks `mutationsSinceSerialize>0`; the configured threshold (GlobalConfiguration default 100) is bypassed. Pure queries do not increment the counter, so 1,000 searches alone never trigger rebuilds.
+- Persistence: Close/reopen shows no rebuild because the Jan 14, 2026 engine fix now persists and reloads the graph successfully. The reopen warmup is mostly graph load, not rebuild.
+- Hierarchy: `add_hierarchy` raises build time modestly (~+9m: 2h00 vs. 1h51) but improves recall (0.9101 vs. 0.8994) and cuts search time materially (6s vs. 13–16s across 1K queries); likely fewer hops during graph search.
+- Quantization (int8): Ingest time drops sharply (1h07 vs. ~1h51–2h00) with comparable recall (0.9072 vs. 0.8994 baseline). However, RSS does not improve and db size increases (10.6GB vs. 9.6GB), likely because vectors are duplicated in the graph and/or stored as float alongside the int8 quantized form.
+- JVector knobs: `MAX_CONNECTIONS=12` and `BEAM_WIDTH=64` were held constant; higher values should improve recall at higher build cost. JVector lacks `efSearch`, so overquery (fetch more than k candidates, then rerank) is the lever; the overquery factor was 1 here to simplify results.
+- Next steps: The ArcadeDB team is looking into the vector duplication and rebuild-threshold issues. Once fixes land, we will rerun on the 1M set for verification and then move to the 10M benchmark.
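
The "Lazy build + rebuild" finding is easier to reason about with a toy model. The sketch below is not ArcadeDB code: `GraphRebuildModel`, `mutate`, `search`, and `honor_threshold` are hypothetical names, and resetting the counter after a rebuild is an assumption. It only contrasts the check described in the finding (`mutationsSinceSerialize>0`) with a check that would respect the configured threshold (reported default: 100).

```python
from dataclasses import dataclass


@dataclass
class GraphRebuildModel:
    """Toy model of the rebuild trigger described in the findings (hypothetical names)."""

    threshold: int = 100            # configured threshold; default value taken from the findings
    honor_threshold: bool = False   # False models the reported behaviour (threshold bypassed)
    mutations_since_serialize: int = 0
    rebuilds: int = 0

    def mutate(self, count: int = 1) -> None:
        # Inserts/updates/deletes after ingest increment the counter.
        self.mutations_since_serialize += count

    def search(self) -> None:
        # Pure queries never increment the counter; they only check it.
        limit = self.threshold if self.honor_threshold else 1
        if self.mutations_since_serialize >= limit:
            self.rebuilds += 1
            # Assumption: a rebuild serializes the graph and resets the counter.
            self.mutations_since_serialize = 0


def count_rebuilds(honor_threshold: bool, mutations: int, searches: int) -> int:
    """Apply `mutations` writes, then run `searches` queries; return the rebuild count."""
    model = GraphRebuildModel(honor_threshold=honor_threshold)
    model.mutate(mutations)
    for _ in range(searches):
        model.search()
    return model.rebuilds


if __name__ == "__main__":
    print("reported check, 1 mutation:", count_rebuilds(False, 1, 1000))
    print("threshold check, 1 mutation:", count_rebuilds(True, 1, 1000))
    print("threshold check, 100 mutations:", count_rebuilds(True, 100, 1000))
```

In this model a single post-ingest mutation triggers a rebuild on the next query under the reported check, whereas a threshold-aware check would defer the rebuild until 100 mutations accumulate. Searches without mutations trigger no rebuilds in either mode.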