|
1 | | -# Vector Search Benchmark: ArcadeDB (JVector) vs FAISS |
2 | | - |
3 | | -This benchmark compares the performance of **ArcadeDB's Vector Index** (based on |
4 | | -JVector + LSM Tree) against **FAISS** (Facebook AI Similarity Search) using standard ANN |
5 | | -datasets. |
6 | | - |
7 | | -## 1. Algorithms Tested |
8 | | - |
9 | | -We evaluated the following vector index implementations: |
10 | | - |
11 | | -- **ArcadeDB (JVector + LSM)**: ArcadeDB uses JVector for graph-based vector indexing, |
12 | | - integrated with an LSM-tree architecture to provide transactional, persistent, and |
13 | | - database-like capabilities. JVector combines the best of **HNSW** (Hierarchical |
14 | | - Navigable Small World) and **DiskANN** algorithms to offer high performance on |
15 | | - disk-based indexes. |
16 | | -- **FAISS**: We tested four popular index types: |
17 | | - - `HNSW` (Hierarchical Navigable Small World) - Graph-based. |
18 | | - - `HNSW_PQ` (HNSW with Product Quantization) - Graph + Compressed. |
19 | | - - `IVF_FLAT` (Inverted File with Flat vectors) - Cluster-based, uncompressed. |
20 | | - - `IVF_PQ` (Inverted File with Product Quantization) - Cluster-based + Compressed. |
21 | | - |
22 | | -## 2. Datasets |
23 | | - |
24 | | -We used two widely recognized datasets from `ann-benchmarks`: |
25 | | - |
26 | | -1. **SIFT-128-Euclidean** |
27 | | - |
28 | | - - **Vectors**: 1,000,000 |
29 | | - - **Dimensions**: 128 |
30 | | - - **Metric**: Euclidean Distance |
31 | | - - **Difficulty**: Moderate. |
32 | | - |
33 | | -2. **GloVe-100-Angular** |
34 | | - - **Vectors**: 1,183,514 (~1.2 million) |
35 | | - - **Dimensions**: 100 |
36 | | - - **Metric**: Cosine Similarity |
37 | | - - **Difficulty**: Hard. As seen in the results, all algorithms achieve lower |
38 | | - recall values compared to SIFT for the same parameters. |
39 | | - |
40 | | -## 3. Hardware Environment |
41 | | - |
42 | | -All benchmarks were executed on the following hardware: |
43 | | - |
44 | | -- **CPU**: AMD Ryzen 9 7950X 16-Core Processor |
45 | | -- **RAM**: 128 GB DDR5 (4×32 GB) at 3600 MT/s (Corsair) |
46 | | -- **Disk**: Samsung SSD 970 EVO Plus 2TB |
47 | | -- **GPU**: None (All benchmarks ran on CPU) |
48 | | - |
49 | | -## 4. Benchmark Results |
50 | | - |
51 | | -The following figures visualize the trade-off between **Recall@10** and **Latency |
52 | | -(ms)**. |
53 | | - |
54 | | -- **X-Axis (Recall)**: Higher is better (Right). |
55 | | -- **Y-Axis (Latency)**: Lower is better (Down). |
56 | | -- **Goal**: The ideal performance is in the **bottom-right corner** (High Recall, Low |
57 | | - Latency). |
58 | | - |
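To make the X-axis concrete, here is a minimal sketch of Recall@k under the common set-overlap definition: the fraction of the true top-k neighbors that appear in the returned top-k. The function name and the sample IDs are illustrative assumptions; the benchmark scripts may compute recall differently (for example, with a distance threshold).

```python
def recall_at_k(retrieved_ids, ground_truth_ids, k=10):
    """Fraction of the true top-k neighbors present in the returned top-k.

    Assumed definition (set overlap); not taken from the benchmark code.
    """
    retrieved = set(retrieved_ids[:k])
    truth = set(ground_truth_ids[:k])
    return len(retrieved & truth) / k


# Hypothetical IDs: 2 of the 4 true neighbors were found.
example_recall = recall_at_k([1, 2, 3, 4], [1, 2, 5, 6], k=4)
```

The value plotted for a configuration would then be this quantity averaged over all query vectors.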
59 | | -Each dot represents a specific configuration (parameter set) for an algorithm. We use |
60 | | -scatter plots because connecting dots with lines implies a continuum that doesn't |
61 | | -strictly exist across different discrete parameter combinations (e.g., |
62 | | -`max_connections`, `ef_construction`, `nprobe`). |
63 | | - |
64 | | -For detailed parameter values and raw metrics, please refer to the markdown files in the |
65 | | -[`./results/`](./results/) directory. |
66 | | - |
67 | | -### Note on Legend Metrics |
68 | | - |
69 | | -The legend in the figures displays **Peak Memory** and **Avg Build** time. These metrics |
70 | | -should be interpreted with the following context: |
71 | | - |
72 | | -- **Peak Memory**: This represents the **global maximum RSS** (Resident Set Size) |
73 | | - observed during the entire benchmark run for that algorithm. Since the script |
74 | | - iterates through multiple parameter configurations (some heavier than others) in a |
75 | | - single run, this value reflects the high-water mark of the most resource-intensive |
76 | | - configuration, not necessarily the specific memory usage for every data point shown. |
77 | | -- **Avg Build**: This is the **arithmetic mean** of the build times across all |
78 | | - configurations tested for that algorithm. As build time varies significantly with |
79 | | - parameters (e.g., `max_connections`, `ef_construction`), this serves as a general |
80 | | - ballpark figure rather than a precise measurement for each specific point. |
81 | | - |
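The two legend values described above can be sketched as a simple aggregation over per-configuration measurements. The record layout (`rss_samples_mb`, `build_s`) and the numbers are hypothetical and only illustrate how one maximum and one mean end up summarizing many data points.

```python
from statistics import mean


def summarize_legend(runs):
    """Collapse per-configuration measurements into the two legend values."""
    # Peak Memory: high-water mark of RSS across every sample of every config.
    peak_rss_mb = max(s for run in runs for s in run["rss_samples_mb"])
    # Avg Build: arithmetic mean of the build times of all configurations.
    avg_build_s = mean(run["build_s"] for run in runs)
    return peak_rss_mb, avg_build_s


# Hypothetical measurements for three parameter configurations.
runs = [
    {"config": "light", "rss_samples_mb": [4000, 9000], "build_s": 100.0},
    {"config": "medium", "rss_samples_mb": [9000, 14000], "build_s": 300.0},
    {"config": "heavy", "rss_samples_mb": [14000, 21000], "build_s": 500.0},
]
peak_rss_mb, avg_build_s = summarize_legend(runs)
```

In this toy data the legend would report 21000 MB and 300 s, even though the lightest configuration never exceeded 9000 MB and built in 100 s.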
82 | | -### SIFT-128-Euclidean Results |
83 | | - |
84 | | - |
85 | | -_(PDF version: [figures/plot_sift-128-euclidean.pdf](figures/plot_sift-128-euclidean.pdf))_ |
86 | | - |
87 | | -### GloVe-100-Angular Results |
88 | | - |
89 | | - |
90 | | -_(PDF version: [figures/plot_glove-100-angular.pdf](figures/plot_glove-100-angular.pdf))_ |
91 | | - |
92 | | -## 5. ArcadeDB Configuration |
93 | | - |
94 | | -For ArcadeDB, we selected the following default configuration, which offers a balanced |
95 | | -trade-off between build time, memory usage, and search performance: |
96 | | - |
97 | | -```python |
98 | | -max_connections = 32 |
99 | | -beam_width = 200 |
100 | | -overquery_factor = 16 |
101 | | -``` |
102 | | - |
103 | | -**Note on Quantization**: No quantization (PQ/SQ) was used for the ArcadeDB JVector |
104 | | -benchmarks. Quantization support is still a work in progress in the core Java |
105 | | -engine. |
106 | | - |
107 | | -**Note on `overquery_factor`**: Unlike FAISS or standard HNSW implementations, JVector |
108 | | -does not use an `ef` (or `efSearch`) parameter during search. Instead, we implemented an |
109 | | -**"overquery"** mechanism. This retrieves `k * overquery_factor` candidates from the |
110 | | -index, sorts them by exact similarity, and returns the top `k`. This allows trading off |
111 | | -latency for higher recall. |
112 | | - |
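The overquery mechanism described above can be sketched as follows. This is an illustration under stated assumptions, not the ArcadeDB implementation: `approx_search` is a hypothetical stand-in for the index lookup, vectors are plain Python lists, and cosine similarity is used as the exact score.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def overquery_search(approx_search, vectors, query, k, overquery_factor):
    """Fetch k * overquery_factor candidates, re-rank exactly, keep the top k."""
    candidate_ids = approx_search(query, k * overquery_factor)
    scored = [(cosine_similarity(query, vectors[i]), i) for i in candidate_ids]
    return [i for _, i in heapq.nlargest(k, scored)]


# Toy data: ids 0 and 1 are the true nearest neighbors of the query.
vectors = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.5, 0.5]}
query = [1.0, 0.0]


def fake_approx_search(query, n):
    # Hypothetical imperfect index: returns candidates in a poor order.
    return [2, 3, 1, 0][:n]


without_overquery = overquery_search(fake_approx_search, vectors, query, 2, 1)
with_overquery = overquery_search(fake_approx_search, vectors, query, 2, 2)
```

With `overquery_factor=1` the toy index misses both true neighbors; doubling the candidate pool lets the exact re-ranking recover them, at the cost of scoring twice as many vectors.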
113 | | -**Note on Build Time (Lazy Indexing)**: JVector employs lazy indexing, meaning the |
114 | | -initial index object creation is nearly instantaneous. To capture the true cost of |
115 | | -building the graph, our benchmark includes a "warmup" phase that triggers the actual |
116 | | -indexing process. The reported **Build Time** for ArcadeDB is calculated as: `Index |
117 | | -Creation Time + Warmup Time`. |
118 | | - |
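The build-time formula above can be sketched as two timed phases whose durations are added. The callables `create_index` and `warmup` are hypothetical placeholders for whatever the benchmark script actually invokes; the fake index below only mimics lazy construction.

```python
import time


def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def measure_build_time(create_index, warmup):
    """Build Time = Index Creation Time + Warmup Time."""
    index, creation_s = timed(create_index)
    _, warmup_s = timed(lambda: warmup(index))
    return index, {
        "creation_s": creation_s,
        "warmup_s": warmup_s,
        "build_s": creation_s + warmup_s,
    }


# Hypothetical lazy index: creation is cheap, the first use does the real work.
def create_fake_index():
    return {"graph_built": False}


def warmup_fake_index(index):
    index["graph_built"] = True  # stands in for the deferred graph construction


index, timings = measure_build_time(create_fake_index, warmup_fake_index)
```

Timing only the creation call would report a near-zero build cost for a lazily built index, which is why the warmup phase is included in the total.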
119 | | -**Note on Memory Usage**: The ArcadeDB benchmark ran with the JVM heap fixed at 16 GB |
120 | | -(`ARCADEDB_JVM_ARGS='-Xmx16g -Xms16g'`). However, the process's Resident Set Size (RSS) |
121 | | -exceeded this limit significantly, reaching as high as **41 GB** in some test cases. |
122 | | -Because `-Xmx` caps only the Java heap, this gap points to substantial off-heap memory |
123 | | -usage or other overheads, which we have not yet investigated. |
124 | | - |
125 | | -On the **GloVe-100-Angular** dataset (~1.2M vectors), this configuration achieved: |
126 | | - |
127 | | -- **Recall@10**: 0.8538 |
128 | | -- **Average Latency**: 36 ms |
129 | | -- **Build Time**: ~530 seconds |
130 | | - |
131 | | -We consider this a reasonable result for a persistent, disk-based vector store when |
132 | | -compared with purely in-memory libraries. |
133 | | - |
134 | | -## 6. Persistence & Stability Observations |
135 | | - |
136 | | -We explicitly tested the persistence of the vector index by closing and reopening the |
137 | | -ArcadeDB database during the benchmark. |
138 | | - |
139 | | -1. **Persistence Verified**: The index correctly persists to disk. We observed that |
140 | | - **query latency remained consistent** before and after reopening the database, |
141 | | - confirming that the index structure is preserved and loaded efficiently without |
142 | | - needing a rebuild. |
143 | | -2. **Recall Stability**: |
144 | | - - **GloVe-100-Angular**: Recall values remained identical before and after the |
145 | | - database restart, as expected. |
146 | | - - **SIFT-128-Euclidean**: We observed a discrepancy in recall values before and |
147 | | - after the restart. While usually small, the difference can sometimes be as high |
148 | | - as **0.1**. The cause of this non-determinism for the Euclidean metric is |
149 | | - currently unknown. However, since our primary production use cases rely on |
150 | | - Cosine similarity (Angular), we have decided to deprioritize investigating this |
151 | | - specific issue for now. |