Commit 730a31e

feat: enhance vector index performance and update benchmark documentation
- Increased max_connections from 16 to 32 in vector index creation for improved performance.
- Expanded benchmark documentation to include detailed comparisons of ArcadeDB's JVector and FAISS, including algorithms tested and key findings.
- Added dataset handling improvements in benchmark scripts to ensure datasets are stored in a dedicated directory.
- Introduced new plots for GloVe-100 and SIFT-128 benchmark results, including PNG and PDF formats.
1 parent 5c9d794 commit 730a31e

9 files changed

Lines changed: 321 additions & 39 deletions

bindings/python/docs/api/vector.md

Lines changed: 1 addition & 1 deletion
@@ -266,7 +266,7 @@ index = db.create_vector_index(
     vector_property="embedding",
     dimensions=384,
     distance_function="cosine",
-    max_connections=16,
+    max_connections=32,
     beam_width=200  # Higher for better recall
 )

Lines changed: 68 additions & 36 deletions
@@ -1,53 +1,85 @@
11
# Vector Search Benchmark: ArcadeDB (JVector) vs FAISS
22

3-
## TL;DR
4-
* **Speed:** ArcadeDB (JVector) is surprisingly fast, often matching or beating in-memory FAISS.
5-
* **Recall:** ArcadeDB has much lower recall than FAISS (likely due to lazy loading issues).
6-
* **Persistence:** Works correctly, but has a significant "warmup" latency on the first query.
7-
* **Verdict:** Promising performance, but recall issues need addressing for high-precision production use.
3+
This benchmark compares the performance of **ArcadeDB's Vector Index** (based on JVector + LSM Tree) against **FAISS** (Facebook AI Similarity Search) using standard ANN datasets.
84

9-
This project benchmarks the performance and accuracy of **ArcadeDB's embedded Python bindings (JVector)** against **FAISS (HNSW)**, a standard in-memory vector search library.
5+
## 1. Algorithms Tested
106

11-
The goal is to evaluate ArcadeDB's suitability for vector search tasks, specifically focusing on **Recall@k**, **Latency**, and **Persistence**.
7+
We evaluated the following vector index implementations:
128

13-
## Key Findings
9+
* **ArcadeDB (JVector + LSM)**: ArcadeDB uses JVector for graph-based vector indexing, integrated with an LSM-tree architecture to provide transactional, persistent storage. JVector combines ideas from the **HNSW** (Hierarchical Navigable Small World) and **DiskANN** algorithms to perform well on disk-based indexes.
10+
* **FAISS**: We tested four popular index types:
11+
* `HNSW` (Hierarchical Navigable Small World) - Graph-based.
12+
* `HNSW_PQ` (HNSW with Product Quantization) - Graph + Compressed.
13+
* `IVF_FLAT` (Inverted File with Flat vectors) - Partition-based (coarse quantizer, uncompressed vectors).
14+
* `IVF_PQ` (Inverted File with Product Quantization) - Compressed.
1415

15-
### 1. Performance (Speed)
16-
**JVector is surprisingly fast.**
17-
Despite being a persistent database solution, ArcadeDB's JVector implementation demonstrates search latencies that are often comparable to, and in some cases faster than, FAISS (which operates entirely in memory). This is a strong indicator of the efficiency of the underlying JVector implementation.
16+
## 2. Datasets
1817

19-
### 2. Recall & Accuracy
20-
**JVector's recall is currently lower than FAISS.**
21-
While FAISS consistently achieves high recall (>0.99) with appropriate parameter tuning, JVector struggles to match this level of accuracy, particularly as $k$ increases (e.g., $k=50$).
22-
* **Note:** Discussions with ArcadeDB authors suggest this discrepancy might be due to **"lazy loading"** mechanisms within the database, where the graph is not fully traversed or loaded during the search, leading to missed candidates.
23-
* **Implication:** For production use cases requiring strict high-precision recall, this is currently a limiting factor.
18+
We used two widely recognized datasets from `ann-benchmarks`:
2419

25-
### 3. Persistence & Warmup
26-
**Persistence is robust, but "Warmup" is significant.**
27-
* **Robustness:** We verified that the vector index is not corrupted or lost after closing and reopening the database. Recall metrics remained consistent before and after a restart.
28-
* **Warmup Time:** We observed a significant latency spike (warmup time) during the first query after opening the database.
29-
* **Hypothesis:** This suggests that the persistent vector index might be undergoing a lazy load or partial rebuild process upon the first access, rather than being fully ready immediately after the database opens.
20+
1. **SIFT-128-Euclidean**
21+
* **Vectors**: 1,000,000
22+
* **Dimensions**: 128
23+
* **Metric**: Euclidean Distance
24+
* **Difficulty**: Moderate.
3025

31-
### 4. Note on Qdrant
32-
**Qdrant was excluded from the final report.**
33-
Initial benchmarks included `qdrant-client`, but it was excluded due to anomalous results (unexpectedly slow performance paired with consistently perfect recall). This likely indicates a configuration or parameter issue in the test setup rather than a fundamental issue with Qdrant itself.
26+
2. **GloVe-100-Angular**
27+
* **Vectors**: ~1.2 Million (1,183,514)
28+
* **Dimensions**: 100
29+
* **Metric**: Cosine Similarity
30+
* **Difficulty**: Hard. As the results show, every algorithm achieves lower recall on GloVe than on SIFT with the same parameters.
3431

35-
## Datasets
32+
## 3. Hardware Environment
3633

37-
The benchmark utilizes standard ANN datasets:
38-
* **SIFT-128-Euclidean**: 1M vectors, 128 dimensions (Metric: Euclidean)
39-
* **GloVe-100-Angular**: 1.2M vectors, 100 dimensions (Metric: Cosine)
34+
All benchmarks were executed on the following hardware:
4035

41-
## Results
36+
* **CPU**: AMD Ryzen 9 7950X 16-Core Processor
37+
* **RAM**: 128 GB DDR5 (4×32 GB) at 3600 MT/s (Corsair)
38+
* **Disk**: Samsung SSD 970 EVO Plus 2TB
39+
* **GPU**: None (All benchmarks ran on CPU)
4240

43-
Detailed markdown reports are generated for each dataset
41+
## 4. Benchmark Results
4442

45-
### JVector
43+
The following figures visualize the trade-off between **Recall@10** and **Latency (ms)**.
4644

47-
* [SIFT-128 Results](benchmark_results_sift-128-euclidean.md)
48-
* [GloVe-100 Results](benchmark_results_glove-100-angular.md)
45+
* **X-Axis (Recall)**: Higher is better (Right).
46+
* **Y-Axis (Latency)**: Lower is better (Down).
47+
* **Goal**: The ideal performance is in the **bottom-right corner** (High Recall, Low Latency).
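
The document does not spell out how Recall@10 (the X-axis quantity) is computed. A common definition, assumed here and not confirmed against the benchmark scripts, is the fraction of the true 10 nearest neighbours that appear among the 10 returned results, averaged over all queries. A minimal sketch, where `recall_at_k` and the toy ID lists are hypothetical:

```python
def recall_at_k(retrieved_ids, ground_truth_ids, k=10):
    """Mean fraction of the true top-k neighbours found in the returned top-k."""
    if not ground_truth_ids:
        raise ValueError("need at least one query")
    hits = 0
    for retrieved, truth in zip(retrieved_ids, ground_truth_ids):
        # Count IDs present in both the returned top-k and the true top-k.
        hits += len(set(retrieved[:k]) & set(truth[:k]))
    return hits / (k * len(ground_truth_ids))


# Two toy queries with k=3: the first finds 2 of 3 true neighbours, the second 1 of 3.
retrieved = [[1, 2, 3], [4, 5, 6]]
truth = [[1, 2, 9], [4, 7, 8]]
print(recall_at_k(retrieved, truth, k=3))  # 0.5
```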
4948

50-
### FAISS (HNSW)
49+
Each dot represents a specific configuration (parameter set) for an algorithm. We use scatter plots because connecting dots with lines implies a continuum that doesn't strictly exist across different discrete parameter combinations (e.g., `max_connections`, `ef_construction`, `nprobe`).
5150

52-
* [SIFT-128 Results](benchmark_faiss_sift-128-euclidean.md)
53-
* [GloVe-100 Results](benchmark_faiss_glove-100-angular.md)
51+
For detailed parameter values and raw metrics, please refer to the markdown files in the [`./results/`](./results/) directory.
52+
53+
### Note on Legend Metrics
54+
55+
The legend in the figures displays **Peak Memory** and **Avg Build** time. These metrics should be interpreted with the following context:
56+
57+
* **Peak Memory**: This represents the **global maximum RSS** (Resident Set Size) observed during the entire benchmark run for that algorithm. Since the script iterates through multiple parameter configurations (some heavier than others) in a single run, this value reflects the high-water mark of the most resource-intensive configuration, not necessarily the specific memory usage for every data point shown.
58+
* **Avg Build**: This is the **arithmetic mean** of the build times across all configurations tested for that algorithm. As build time varies significantly with parameters (e.g., `max_connections`, `ef_construction`), this serves as a general ballpark figure rather than a precise measurement for each specific point.
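
To make the two legend metrics concrete, here is a small sketch of how they aggregate over configurations. The `runs` records and their numbers are invented for illustration and are not benchmark results:

```python
from statistics import mean

# Hypothetical per-configuration measurements for one algorithm.
runs = [
    {"max_connections": 16, "build_s": 300.0, "rss_mb_samples": [900, 1400, 1200]},
    {"max_connections": 32, "build_s": 500.0, "rss_mb_samples": [950, 2100, 1800]},
    {"max_connections": 64, "build_s": 700.0, "rss_mb_samples": [1000, 3200, 2500]},
]

# Peak Memory: one global maximum over every RSS sample of the whole run.
peak_memory_mb = max(s for run in runs for s in run["rss_mb_samples"])

# Avg Build: arithmetic mean of the build times of all configurations.
avg_build_s = mean(run["build_s"] for run in runs)

print(f"Peak Memory: {peak_memory_mb} MB, Avg Build: {avg_build_s:.1f}s")
```

In this toy case the legend would show 3200 MB and 500.0s for all three points, although only the heaviest configuration reached that memory and no single configuration need match the mean build time.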
59+
60+
### SIFT-128-Euclidean Results
61+
62+
![SIFT Results](figures/plot_sift-128-euclidean.png)
63+
*(PDF version: [figures/plot_sift-128-euclidean.pdf](figures/plot_sift-128-euclidean.pdf))*
64+
65+
### GloVe-100-Angular Results
66+
67+
![GloVe Results](figures/plot_glove-100-angular.png)
68+
*(PDF version: [figures/plot_glove-100-angular.pdf](figures/plot_glove-100-angular.pdf))*
69+
70+
## 5. ArcadeDB Configuration
71+
72+
For ArcadeDB, we selected the following default configuration, which offers a balanced trade-off between build time, memory usage, and search performance:
73+
74+
```python
75+
max_connections = 32
76+
beam_width = 200
77+
overquery_factor = 16
78+
```
79+
**Note on `overquery_factor`**: Unlike FAISS or standard HNSW implementations, JVector does not use an `ef` (or `efSearch`) parameter during search. Instead, we implemented an **"overquery"** mechanism: the search retrieves `k * overquery_factor` candidates from the index, sorts them by exact similarity, and returns the top `k`. A larger factor trades higher latency for higher recall.
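
The overquery idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not ArcadeDB's or JVector's code: `overquery_search`, `cosine`, and `coarse_search` are hypothetical names, and `coarse_search` merely imitates an approximate index by ranking on the first 8 dimensions.

```python
import math
import random


def cosine(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def overquery_search(approx_search, vectors, query, k, overquery_factor=16):
    # Step 1: ask the approximate index for more candidates than needed.
    candidate_ids = approx_search(query, k * overquery_factor)
    # Step 2: re-rank only those candidates by exact similarity.
    reranked = sorted(candidate_ids, key=lambda i: cosine(vectors[i], query), reverse=True)
    # Step 3: keep the k best.
    return reranked[:k]


random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(32)] for _ in range(1000)]
query = [random.gauss(0, 1) for _ in range(32)]


def coarse_search(q, n):
    # Hypothetical stand-in for the index: ranks by the first 8 dimensions only.
    ids = sorted(range(len(vectors)), key=lambda i: cosine(vectors[i][:8], q[:8]), reverse=True)
    return ids[:n]


top_k = overquery_search(coarse_search, vectors, query, k=10, overquery_factor=16)
print(top_k)
```

The larger the candidate pool, the more likely it contains the true neighbours. The cost is that more vectors have to be scored exactly per query.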
80+
On the **GloVe-100-Angular** dataset (~1.2M vectors), this configuration achieved:
81+
* **Recall@10**: 0.8538
82+
* **Average Latency**: 36 ms
83+
* **Build Time**: ~530 seconds
84+
85+
We consider this a reasonable result for a persistent, disk-based vector store when compared with purely in-memory libraries.

bindings/python/examples/benchmark-vector/benchmark_vector_params-faiss.py

Lines changed: 3 additions & 1 deletion
@@ -82,7 +82,9 @@ def download_dataset(url, path):
 
 
 def load_ann_data(dataset, count, num_queries, k_values):
-    path = f"{dataset}.hdf5"
+    if not os.path.exists("datasets"):
+        os.makedirs("datasets")
+    path = os.path.join("datasets", f"{dataset}.hdf5")
     download_dataset(DATASETS[dataset], path)
 
     with h5py.File(path, "r") as f:

bindings/python/examples/benchmark-vector/benchmark_vector_params.py

Lines changed: 3 additions & 1 deletion
@@ -114,7 +114,9 @@ def load_ann_benchmark_data(dataset_name, count, num_queries, k_values):
     if not url:
         raise ValueError(f"Unknown dataset: {dataset_name}")
 
-    filename = f"{dataset_name}.hdf5"
+    if not os.path.exists("datasets"):
+        os.makedirs("datasets")
+    filename = os.path.join("datasets", f"{dataset_name}.hdf5")
     download_dataset(url, filename)
 
     print(f" Loading {dataset_name} from {filename}...")
Binary file not shown.
119 KB
Binary file not shown.
119 KB
Lines changed: 246 additions & 0 deletions
@@ -0,0 +1,246 @@
1+
import glob
2+
import os
3+
import re
4+
5+
import matplotlib.pyplot as plt
6+
import pandas as pd
7+
8+
# Set larger font sizes for all plot elements
9+
plt.rcParams.update(
10+
{
11+
"font.size": 14,
12+
"axes.titlesize": 18,
13+
"axes.labelsize": 16,
14+
"xtick.labelsize": 14,
15+
"ytick.labelsize": 14,
16+
"legend.fontsize": 12,
17+
"figure.titlesize": 20,
18+
}
19+
)
20+
21+
RESULTS_DIR = "results"
22+
LOGS_DIR = "benchmark_logs"
23+
FIGURES_DIR = "figures"
24+
25+
if not os.path.exists(FIGURES_DIR):
26+
os.makedirs(FIGURES_DIR)
27+
28+
DATASETS = {
29+
"sift-128-euclidean": "SIFT-128-Euclidean",
30+
"glove-100-angular": "GloVe-100-Angular",
31+
}
32+
33+
ALGORITHMS = {
34+
"jvector": "JVector",
35+
"faiss_hnsw": "FAISS HNSW Flat",
36+
"faiss_ivf_flat": "FAISS IVF Flat",
37+
"faiss_ivf_pq": "FAISS IVF PQ",
38+
"faiss_hnsw_pq": "FAISS HNSW PQ",
39+
}
40+
41+
42+
def parse_markdown_table(file_path):
43+
with open(file_path, "r") as f:
44+
lines = f.readlines()
45+
46+
# Find the table
47+
table_lines = [line.strip() for line in lines if line.strip().startswith("|")]
48+
49+
if not table_lines:
50+
return pd.DataFrame()
51+
52+
# Parse header
53+
header = [c.strip() for c in table_lines[0].strip("|").split("|")]
54+
55+
# Parse rows
56+
data = []
57+
for line in table_lines[2:]: # Skip header and separator
58+
values = [c.strip() for c in line.strip("|").split("|")]
59+
if len(values) != len(header):
60+
continue
61+
row = dict(zip(header, values))
62+
data.append(row)
63+
64+
df = pd.DataFrame(data)
65+
66+
# Convert numeric columns
67+
for col in df.columns:
68+
try:
69+
df[col] = pd.to_numeric(df[col])
70+
except ValueError:
71+
pass
72+
73+
return df
74+
75+
76+
def get_peak_memory(log_file):
77+
if not os.path.exists(log_file):
78+
return None
79+
try:
80+
df = pd.read_csv(log_file)
81+
return df["RSS_MB"].max()
82+
except Exception as e:
83+
print(f"Error reading log file {log_file}: {e}")
84+
return None
85+
86+
87+
def plot_dataset(dataset_key, dataset_name):
88+
plt.figure(figsize=(12, 8))
89+
90+
# Find all result files for this dataset
91+
# Pattern: benchmark_{algo}_{dataset}_{params}.md
92+
# But the filenames are like: benchmark_jvector_sift-128-euclidean_size_full.md
93+
# benchmark_faiss_sift-128-euclidean_hnsw_full.md
94+
95+
# We need to map filename patterns to algorithms
96+
97+
files = glob.glob(os.path.join(RESULTS_DIR, f"*{dataset_key}*.md"))
98+
99+
colors = plt.cm.tab10.colors
100+
101+
plot_data = []
102+
103+
for file_path in files:
104+
filename = os.path.basename(file_path)
105+
106+
# Determine algorithm
107+
algo_name = "Unknown"
108+
is_jvector = False
109+
110+
if "jvector" in filename:
111+
algo_name = "JVector"
112+
is_jvector = True
113+
log_pattern = "jvector-*-full_*_memory.log" # Simplified pattern matching
114+
# Need to be more specific to match dataset
115+
if "euclidean" in dataset_key:
116+
log_pattern = "jvector-euclidean-full_*_memory.log"
117+
else:
118+
log_pattern = "jvector-angular-full_*_memory.log"
119+
120+
elif "faiss" in filename:
121+
dataset_type = "euclidean" if "euclidean" in dataset_key else "angular"
122+
123+
if "hnsw_pq" in filename:
124+
algo_name = "FAISS HNSW PQ"
125+
log_pattern = f"faiss-{dataset_type}-hnsw_pq-full_*_memory.log"
126+
elif "ivf_pq" in filename:
127+
algo_name = "FAISS IVF PQ"
128+
log_pattern = f"faiss-{dataset_type}-ivf_pq-full_*_memory.log"
129+
elif "ivf_flat" in filename:
130+
algo_name = "FAISS IVF Flat"
131+
log_pattern = f"faiss-{dataset_type}-ivf_flat-full_*_memory.log"
132+
elif "hnsw" in filename: # Check this last as hnsw_pq contains hnsw
133+
algo_name = "FAISS HNSW Flat"
134+
log_pattern = f"faiss-{dataset_type}-hnsw-full_*_memory.log"
135+
136+
df = parse_markdown_table(file_path)
137+
if df.empty:
138+
continue
139+
140+
# Get Peak Memory
141+
log_files = glob.glob(os.path.join(LOGS_DIR, log_pattern))
142+
peak_mem = "N/A"
143+
if log_files:
144+
# Pick the most recent one or just the first one
145+
log_file = sorted(log_files)[-1]
146+
mem = get_peak_memory(log_file)
147+
if mem:
148+
peak_mem = f"{mem:.0f} MB"
149+
150+
# Calculate Build Time
151+
# For JVector: Build + Warmup (Before)
152+
# For FAISS: Build
153+
# We take the mean build time if there are multiple rows, or just the first one
154+
# since build time is per index. Actually, build time is constant for the same
155+
# build parameters. But here we might have different build parameters in the
156+
# same file (e.g. JVector has max_connections, beam_width). So Build Time
157+
# varies. We can't put a single Build Time in the legend if it varies. Let's
158+
# check if it varies significantly. For FAISS HNSW, Build Time depends on M and
159+
# efConstruction. So it varies. So we can't put it in the legend easily unless
160+
# we average it or show a range. Or maybe just "Avg Build: ...".
161+
162+
if is_jvector:
163+
df["Total Build"] = 0.0
164+
if "Build (s)" in df.columns:
165+
df["Total Build"] += df["Build (s)"]
166+
if "Warmup (s) (Before)" in df.columns:
167+
df["Total Build"] += df["Warmup (s) (Before)"]
168+
else:
169+
df["Total Build"] = 0.0
170+
if "Build (s)" in df.columns:
171+
df["Total Build"] += df["Build (s)"]
172+
173+
avg_build = df["Total Build"].mean()
174+
build_str = f"{avg_build:.1f}s"
175+
176+
# Prepare data for plotting
177+
# We want the Pareto frontier (best recall for given latency or best latency for
178+
# given recall)
179+
# But simply plotting all points is also fine to see the spread.
180+
# Usually benchmarks plot the line connecting the best points.
181+
182+
# Sort by Recall
183+
df = df.sort_values(by="Recall (Before)")
184+
185+
recall = df["Recall (Before)"]
186+
latency = df["Latency (ms) (Before)"]
187+
188+
# Filter out very bad points if necessary, but let's plot all first.
189+
190+
label = f"{algo_name} (Mem: {peak_mem}, Build: ~{build_str})"
191+
192+
print(f" {algo_name}: Peak Mem={peak_mem}, Avg Build={build_str}")
193+
194+
plot_data.append(
195+
{
196+
"recall": recall,
197+
"latency": latency,
198+
"label": label,
199+
"avg_build": avg_build,
200+
}
201+
)
202+
203+
# Sort by avg_build descending
204+
plot_data.sort(key=lambda x: x["avg_build"], reverse=True)
205+
206+
for data in plot_data:
207+
# Plot points
208+
plt.plot(
209+
data["recall"],
210+
data["latency"],
211+
"o",
212+
label=data["label"],
213+
markersize=8,
214+
alpha=0.7,
215+
)
216+
217+
# Add dashed lines for Recall
218+
for r in [0.80, 0.85, 0.90, 0.95]:
219+
plt.axvline(x=r, color="gray", linestyle="--", alpha=0.5)
220+
plt.text(
221+
r, plt.ylim()[1] * 0.01, f"{r}", rotation=90, verticalalignment="bottom"
222+
)
223+
224+
plt.title(f"Recall vs Latency - {dataset_name} Dataset")
225+
plt.xlabel("Recall@10 (Higher is Better)")
226+
plt.ylabel("Latency (ms) (Lower is Better)")
227+
plt.grid(True, which="both", ls="-", alpha=0.2)
228+
plt.legend()
229+
plt.yscale("log") # Latency often spans orders of magnitude
230+
231+
# Save plot
232+
output_png = os.path.join(FIGURES_DIR, f"plot_{dataset_key}.png")
233+
output_pdf = os.path.join(FIGURES_DIR, f"plot_{dataset_key}.pdf")
234+
plt.savefig(output_png)
235+
plt.savefig(output_pdf)
236+
print(f"Saved plots to {output_png} and {output_pdf}")
237+
238+
239+
def main():
240+
for key, name in DATASETS.items():
241+
print(f"Processing {name} dataset...")
242+
plot_dataset(key, name)
243+
244+
245+
if __name__ == "__main__":
246+
main()
