humemai
diff --git a/bindings/python/docs/api/vector.md b/bindings/python/docs/api/vector.md
Lines changed: 1 addition & 1 deletion
diff --git a/bindings/python/examples/benchmark-vector/README.md b/bindings/python/examples/benchmark-vector/README.md
Lines changed: 68 additions & 36 deletions
diff --git a/bindings/python/examples/benchmark-vector/benchmark_vector_params-faiss.py b/bindings/python/examples/benchmark-vector/benchmark_vector_params-faiss.py
Lines changed: 3 additions & 1 deletion
diff --git a/bindings/python/examples/benchmark-vector/benchmark_vector_params.py b/bindings/python/examples/benchmark-vector/benchmark_vector_params.py
Lines changed: 3 additions & 1 deletion
diff --git a/bindings/python/examples/benchmark-vector/figures/plot_glove-100-angular.pdf b/bindings/python/examples/benchmark-vector/figures/plot_glove-100-angular.pdf
23.9 KB
diff --git a/bindings/python/examples/benchmark-vector/figures/plot_glove-100-angular.png b/bindings/python/examples/benchmark-vector/figures/plot_glove-100-angular.png
119 KB
diff --git a/bindings/python/examples/benchmark-vector/figures/plot_sift-128-euclidean.pdf b/bindings/python/examples/benchmark-vector/figures/plot_sift-128-euclidean.pdf
24.1 KB
diff --git a/bindings/python/examples/benchmark-vector/figures/plot_sift-128-euclidean.png b/bindings/python/examples/benchmark-vector/figures/plot_sift-128-euclidean.png
119 KB
diff --git a/bindings/python/examples/benchmark-vector/plot_benchmark_results.py b/bindings/python/examples/benchmark-vector/plot_benchmark_results.py
Lines changed: 246 additions & 0 deletions
@@ -266,7 +266,7 @@ index = db.create_vector_index(
     vector_property="embedding",
     dimensions=384,
     distance_function="cosine",
-    max_connections=16,
+    max_connections=32,
     beam_width=200  # Higher for better recall
 )
 
 
@@ -1,53 +1,85 @@
 # Vector Search Benchmark: ArcadeDB (JVector) vs FAISS
 
-## TL;DR
-*   **Speed:** ArcadeDB (JVector) is surprisingly fast, often matching or beating in-memory FAISS.
-*   **Recall:** ArcadeDB has much lower recall than FAISS (likely due to lazy loading issues).
-*   **Persistence:** Works correctly, but has a significant "warmup" latency on the first query.
-*   **Verdict:** Promising performance, but recall issues need addressing for high-precision production use.
+This benchmark compares the performance of **ArcadeDB's Vector Index** (based on JVector + LSM Tree) against **FAISS** (Facebook AI Similarity Search) using standard ANN datasets.
 
-This project benchmarks the performance and accuracy of **ArcadeDB's embedded Python bindings (JVector)** against **FAISS (HNSW)**, a standard in-memory vector search library.
+## 1. Algorithms Tested
 
-The goal is to evaluate ArcadeDB's suitability for vector search tasks, specifically focusing on **Recall@k**, **Latency**, and **Persistence**.
+We evaluated the following vector index implementations:
 
-## Key Findings
+*   **ArcadeDB (JVector + LSM)**: ArcadeDB uses JVector for graph-based vector indexing, integrated with an LSM-tree architecture to provide transactional, persistent storage. JVector combines ideas from the **HNSW** (Hierarchical Navigable Small World) and **DiskANN** algorithms to perform well on disk-based indexes.
+*   **FAISS**: We tested four popular index types:
+    *   `HNSW` (Hierarchical Navigable Small World) - Graph-based.
+    *   `HNSW_PQ` (HNSW with Product Quantization) - Graph + Compressed.
+    *   `IVF_FLAT` (Inverted File with Flat vectors) - Cluster-based (coarse quantizer), uncompressed vectors.
+    *   `IVF_PQ` (Inverted File with Product Quantization) - Cluster-based + Compressed.
 
-### 1. Performance (Speed)
-**JVector is surprisingly fast.**
-Despite being a persistent database solution, ArcadeDB's JVector implementation demonstrates search latencies that are often comparable to, and in some cases faster than, FAISS (which operates entirely in memory). This is a strong indicator of the efficiency of the underlying JVector implementation.
+## 2. Datasets
 
-### 2. Recall & Accuracy
-**JVector's recall is currently lower than FAISS.**
-While FAISS consistently achieves high recall (>0.99) with appropriate parameter tuning, JVector struggles to match this level of accuracy, particularly as $k$ increases (e.g., $k=50$).
-*   **Note:** Discussions with ArcadeDB authors suggest this discrepancy might be due to **"lazy loading"** mechanisms within the database, where the graph is not fully traversed or loaded during the search, leading to missed candidates.
-*   **Implication:** For production use cases requiring strict high-precision recall, this is currently a limiting factor.
+We used two widely recognized datasets from `ann-benchmarks`:
 
-### 3. Persistence & Warmup
-**Persistence is robust, but "Warmup" is significant.**
-*   **Robustness:** We verified that the vector index is not corrupted or lost after closing and reopening the database. Recall metrics remained consistent before and after a restart.
-*   **Warmup Time:** We observed a significant latency spike (warmup time) during the first query after opening the database.
-*   **Hypothesis:** This suggests that the persistent vector index might be undergoing a lazy load or partial rebuild process upon the first access, rather than being fully ready immediately after the database opens.
+1.  **SIFT-128-Euclidean**
+    *   **Vectors**: 1,000,000
+    *   **Dimensions**: 128
+    *   **Metric**: Euclidean Distance
+    *   **Difficulty**: Moderate.
 
-### 4. Note on Qdrant
-**Qdrant was excluded from the final report.**
-Initial benchmarks included `qdrant-client`, but it was excluded due to anomalous results (unexpectedly slow performance paired with consistently perfect recall). This likely indicates a configuration or parameter issue in the test setup rather than a fundamental issue with Qdrant itself.
+2.  **GloVe-100-Angular**
+    *   **Vectors**: ~1.2 Million (1,183,514)
+    *   **Dimensions**: 100
+    *   **Metric**: Cosine Similarity
+    *   **Difficulty**: Hard. As seen in the results, all algorithms achieve lower recall values compared to SIFT for the same parameters.
 
-## Datasets
+## 3. Hardware Environment
 
-The benchmark utilizes standard ANN datasets:
-*   **SIFT-128-Euclidean**: 1M vectors, 128 dimensions (Metric: Euclidean)
-*   **GloVe-100-Angular**: 1.2M vectors, 100 dimensions (Metric: Cosine)
+All benchmarks were executed on the following hardware:
 
-## Results
+*   **CPU**: AMD Ryzen 9 7950X 16-Core Processor
+*   **RAM**: 128 GB DDR5 (4×32 GB) at 3600 MT/s (Corsair)
+*   **Disk**: Samsung SSD 970 EVO Plus 2TB
+*   **GPU**: None (All benchmarks ran on CPU)
 
-Detailed markdown reports are generated for each dataset
+## 4. Benchmark Results
 
-### JVector
+The following figures visualize the trade-off between **Recall@10** and **Latency (ms)**.
 
-*   [SIFT-128 Results](benchmark_results_sift-128-euclidean.md)
-*   [GloVe-100 Results](benchmark_results_glove-100-angular.md)
+*   **X-Axis (Recall)**: Higher is better (Right).
+*   **Y-Axis (Latency)**: Lower is better (Down).
+*   **Goal**: The ideal performance is in the **bottom-right corner** (High Recall, Low Latency).
 
-### FAISS (HNSW)
+Each dot represents a specific configuration (parameter set) for an algorithm. We use scatter plots because connecting dots with lines implies a continuum that doesn't strictly exist across different discrete parameter combinations (e.g., `max_connections`, `ef_construction`, `nprobe`).
 
-*   [SIFT-128 Results](benchmark_faiss_sift-128-euclidean.md)
-*   [GloVe-100 Results](benchmark_faiss_glove-100-angular.md)
+For detailed parameter values and raw metrics, please refer to the markdown files in the [`./results/`](./results/) directory.
+
+### Note on Legend Metrics
+
+The legend in the figures displays **Peak Memory** and **Avg Build** time. These metrics should be interpreted with the following context:
+
+*   **Peak Memory**: This represents the **global maximum RSS** (Resident Set Size) observed during the entire benchmark run for that algorithm. Since the script iterates through multiple parameter configurations (some heavier than others) in a single run, this value reflects the high-water mark of the most resource-intensive configuration, not necessarily the specific memory usage for every data point shown.
+*   **Avg Build**: This is the **arithmetic mean** of the build times across all configurations tested for that algorithm. As build time varies significantly with parameters (e.g., `max_connections`, `ef_construction`), this serves as a general ballpark figure rather than a precise measurement for each specific point.
+
+### SIFT-128-Euclidean Results
+
+![SIFT Results](figures/plot_sift-128-euclidean.png)
+*(PDF version: [figures/plot_sift-128-euclidean.pdf](figures/plot_sift-128-euclidean.pdf))*
+
+### GloVe-100-Angular Results
+
+![GloVe Results](figures/plot_glove-100-angular.png)
+*(PDF version: [figures/plot_glove-100-angular.pdf](figures/plot_glove-100-angular.pdf))*
+
+## 5. ArcadeDB Configuration
+
+For ArcadeDB, we selected the following default configuration, which offers a balanced trade-off between build time, memory usage, and search performance:
+
+```python
+max_connections = 32
+beam_width = 200
+overquery_factor = 16
+```
+**Note on `overquery_factor`**: Unlike FAISS or standard HNSW implementations, JVector does not use an `ef` (or `efSearch`) parameter during search. Instead, we implemented an **"overquery"** mechanism. This retrieves `k * overquery_factor` candidates from the index, sorts them by exact similarity, and returns the top `k`. This allows trading off latency for higher recall.
+On the **GloVe-100-Angular** dataset (~1.2M vectors), this configuration achieved:
+*   **Recall@10**: 0.8538
+*   **Average Latency**: 36 ms
+*   **Build Time**: ~530 seconds
+
+We consider this a reasonable result for a persistent, disk-based vector store, given that the FAISS indexes it is compared against are purely in-memory.
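
The overquery step described in the README note above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not ArcadeDB's or JVector's actual code: `cosine_similarity`, `overquery_search`, and `make_coarse_search` are hypothetical names, and the coarse search that scores vectors on only their first few coordinates is just a stand-in for an approximate index.

```python
import math


def cosine_similarity(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def overquery_search(vectors, query, k, overquery_factor, approx_search):
    """Fetch k * overquery_factor candidates, re-rank them exactly, keep the top k."""
    n_candidates = min(k * overquery_factor, len(vectors))
    candidate_ids = approx_search(query, n_candidates)
    # Re-rank the candidates by exact similarity to the query.
    reranked = sorted(
        candidate_ids,
        key=lambda i: cosine_similarity(vectors[i], query),
        reverse=True,
    )
    return reranked[:k]


def make_coarse_search(vectors, dims):
    """Hypothetical approximate index: ranks using only the first `dims` coordinates."""

    def search(query, n):
        ids = sorted(
            range(len(vectors)),
            key=lambda i: cosine_similarity(vectors[i][:dims], query[:dims]),
            reverse=True,
        )
        return ids[:n]

    return search
```

A larger `overquery_factor` gives the exact re-ranking more candidates to choose from, which raises recall at the cost of more similarity computations per query. When `k * overquery_factor` reaches the collection size, the result matches a brute-force search.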
@@ -82,7 +82,9 @@ def download_dataset(url, path):
 
 
 def load_ann_data(dataset, count, num_queries, k_values):
-    path = f"{dataset}.hdf5"
+    if not os.path.exists("datasets"):
+        os.makedirs("datasets")
+    path = os.path.join("datasets", f"{dataset}.hdf5")
     download_dataset(DATASETS[dataset], path)
 
     with h5py.File(path, "r") as f:
 
@@ -114,7 +114,9 @@ def load_ann_benchmark_data(dataset_name, count, num_queries, k_values):
     if not url:
         raise ValueError(f"Unknown dataset: {dataset_name}")
 
-    filename = f"{dataset_name}.hdf5"
+    if not os.path.exists("datasets"):
+        os.makedirs("datasets")
+    filename = os.path.join("datasets", f"{dataset_name}.hdf5")
     download_dataset(url, filename)
 
     print(f"  Loading {dataset_name} from {filename}...")
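
Both loader hunks above add the same check-then-create pattern for the `datasets` directory. As a possible simplification, `os.makedirs` accepts `exist_ok=True`, which makes the explicit `os.path.exists` check unnecessary. The helper name `dataset_path` and its `root` parameter are hypothetical and not part of this change:

```python
import os


def dataset_path(dataset, root="datasets"):
    """Return the HDF5 path for a dataset, creating the directory if needed."""
    # exist_ok=True turns this into a no-op when the directory already exists.
    os.makedirs(root, exist_ok=True)
    return os.path.join(root, f"{dataset}.hdf5")
```

Calling the helper repeatedly is safe, so both benchmark scripts could share it instead of duplicating the check.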
 
@@ -0,0 +1,246 @@
+import glob
+import os
+import re
+
+import matplotlib.pyplot as plt
+import pandas as pd
+
+# Set larger font sizes for all plot elements
+plt.rcParams.update(
+    {
+        "font.size": 14,
+        "axes.titlesize": 18,
+        "axes.labelsize": 16,
+        "xtick.labelsize": 14,
+        "ytick.labelsize": 14,
+        "legend.fontsize": 12,
+        "figure.titlesize": 20,
+    }
+)
+
+RESULTS_DIR = "results"
+LOGS_DIR = "benchmark_logs"
+FIGURES_DIR = "figures"
+
+if not os.path.exists(FIGURES_DIR):
+    os.makedirs(FIGURES_DIR)
+
+DATASETS = {
+    "sift-128-euclidean": "SIFT-128-Euclidean",
+    "glove-100-angular": "GloVe-100-Angular",
+}
+
+ALGORITHMS = {
+    "jvector": "JVector",
+    "faiss_hnsw": "FAISS HNSW Flat",
+    "faiss_ivf_flat": "FAISS IVF Flat",
+    "faiss_ivf_pq": "FAISS IVF PQ",
+    "faiss_hnsw_pq": "FAISS HNSW PQ",
+}
+
+
+def parse_markdown_table(file_path):
+    with open(file_path, "r") as f:
+        lines = f.readlines()
+
+    # Find the table
+    table_lines = [line.strip() for line in lines if line.strip().startswith("|")]
+
+    if not table_lines:
+        return pd.DataFrame()
+
+    # Parse header
+    header = [c.strip() for c in table_lines[0].strip("|").split("|")]
+
+    # Parse rows
+    data = []
+    for line in table_lines[2:]:  # Skip header and separator
+        values = [c.strip() for c in line.strip("|").split("|")]
+        if len(values) != len(header):
+            continue
+        row = dict(zip(header, values))
+        data.append(row)
+
+    df = pd.DataFrame(data)
+
+    # Convert numeric columns
+    for col in df.columns:
+        try:
+            df[col] = pd.to_numeric(df[col])
+        except ValueError:
+            pass
+
+    return df
+
+
+def get_peak_memory(log_file):
+    if not os.path.exists(log_file):
+        return None
+    try:
+        df = pd.read_csv(log_file)
+        return df["RSS_MB"].max()
+    except Exception as e:
+        print(f"Error reading log file {log_file}: {e}")
+        return None
+
+
+def plot_dataset(dataset_key, dataset_name):
+    plt.figure(figsize=(12, 8))
+
+    # Find all result files for this dataset
+    # Pattern: benchmark_{algo}_{dataset}_{params}.md
+    # But the filenames are like: benchmark_jvector_sift-128-euclidean_size_full.md
+    # benchmark_faiss_sift-128-euclidean_hnsw_full.md
+
+    # We need to map filename patterns to algorithms
+
+    files = glob.glob(os.path.join(RESULTS_DIR, f"*{dataset_key}*.md"))
+
+    colors = plt.cm.tab10.colors
+
+    plot_data = []
+
+    for file_path in files:
+        filename = os.path.basename(file_path)
+
+        # Determine algorithm
+        algo_name = "Unknown"
+        is_jvector = False
+
+        if "jvector" in filename:
+            algo_name = "JVector"
+            is_jvector = True
+            # Memory logs are named after the distance metric, so derive the
+            # log pattern from the dataset key.
+            if "euclidean" in dataset_key:
+                log_pattern = "jvector-euclidean-full_*_memory.log"
+            else:
+                log_pattern = "jvector-angular-full_*_memory.log"
+
+        elif "faiss" in filename:
+            dataset_type = "euclidean" if "euclidean" in dataset_key else "angular"
+
+            if "hnsw_pq" in filename:
+                algo_name = "FAISS HNSW PQ"
+                log_pattern = f"faiss-{dataset_type}-hnsw_pq-full_*_memory.log"
+            elif "ivf_pq" in filename:
+                algo_name = "FAISS IVF PQ"
+                log_pattern = f"faiss-{dataset_type}-ivf_pq-full_*_memory.log"
+            elif "ivf_flat" in filename:
+                algo_name = "FAISS IVF Flat"
+                log_pattern = f"faiss-{dataset_type}-ivf_flat-full_*_memory.log"
+            elif "hnsw" in filename:  # Check this last as hnsw_pq contains hnsw
+                algo_name = "FAISS HNSW Flat"
+                log_pattern = f"faiss-{dataset_type}-hnsw-full_*_memory.log"
+
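+        df = pd.DataFrame() if algo_name == "Unknown" else parse_markdown_table(file_path)
+        if df.empty:  # also skips unrecognized files, for which log_pattern is unset
+            continue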
+
+        # Get Peak Memory
+        log_files = glob.glob(os.path.join(LOGS_DIR, log_pattern))
+        peak_mem = "N/A"
+        if log_files:
+            # Pick the most recent one or just the first one
+            log_file = sorted(log_files)[-1]
+            mem = get_peak_memory(log_file)
+            if mem:
+                peak_mem = f"{mem:.0f} MB"
+
+        # Calculate Build Time
+        # For JVector: Build + Warmup (Before)
+        # For FAISS: Build
+        # We take the mean build time if there are multiple rows, or just the first one
+        # since build time is per index Actually, build time is constant for the same
+        # build parameters. But here we might have different build parameters in the
+        # same file (e.g. JVector has max_connections, beam_width). So Build Time
+        # varies. We can't put a single Build Time in the legend if it varies. Let's
+        # check if it varies significantly. For FAISS HNSW, Build Time depends on M and
+        # efConstruction. So it varies. So we can't put it in the legend easily unless
+        # we average it or show a range. Or maybe just "Avg Build: ...".
+
+        if is_jvector:
+            df["Total Build"] = 0.0
+            if "Build (s)" in df.columns:
+                df["Total Build"] += df["Build (s)"]
+            if "Warmup (s) (Before)" in df.columns:
+                df["Total Build"] += df["Warmup (s) (Before)"]
+        else:
+            df["Total Build"] = 0.0
+            if "Build (s)" in df.columns:
+                df["Total Build"] += df["Build (s)"]
+
+        avg_build = df["Total Build"].mean()
+        build_str = f"{avg_build:.1f}s"
+
+        # Prepare data for plotting
+        # We want the Pareto frontier (best recall for given latency or best latency for
+        # given recall)
+        # But simply plotting all points is also fine to see the spread.
+        # Usually benchmarks plot the line connecting the best points.
+
+        # Sort by Recall
+        df = df.sort_values(by="Recall (Before)")
+
+        recall = df["Recall (Before)"]
+        latency = df["Latency (ms) (Before)"]
+
+        # Filter out very bad points if necessary, but let's plot all first.
+
+        label = f"{algo_name} (Mem: {peak_mem}, Build: ~{build_str})"
+
+        print(f"  {algo_name}: Peak Mem={peak_mem}, Avg Build={build_str}")
+
+        plot_data.append(
+            {
+                "recall": recall,
+                "latency": latency,
+                "label": label,
+                "avg_build": avg_build,
+            }
+        )
+
+    # Sort by avg_build descending
+    plot_data.sort(key=lambda x: x["avg_build"], reverse=True)
+
+    for data in plot_data:
+        # Plot points
+        plt.plot(
+            data["recall"],
+            data["latency"],
+            "o",
+            label=data["label"],
+            markersize=8,
+            alpha=0.7,
+        )
+
+    # Add dashed lines for Recall
+    for r in [0.80, 0.85, 0.90, 0.95]:
+        plt.axvline(x=r, color="gray", linestyle="--", alpha=0.5)
+        plt.text(
+            r, plt.ylim()[1] * 0.01, f"{r}", rotation=90, verticalalignment="bottom"
+        )
+
+    plt.title(f"Recall vs Latency - {dataset_name} Dataset")
+    plt.xlabel("Recall@10 (Higher is Better)")
+    plt.ylabel("Latency (ms) (Lower is Better)")
+    plt.grid(True, which="both", ls="-", alpha=0.2)
+    plt.legend()
+    plt.yscale("log")  # Latency often spans orders of magnitude
+
+    # Save plot
+    output_png = os.path.join(FIGURES_DIR, f"plot_{dataset_key}.png")
+    output_pdf = os.path.join(FIGURES_DIR, f"plot_{dataset_key}.pdf")
+    plt.savefig(output_png)
+    plt.savefig(output_pdf)
+    print(f"Saved plots to {output_png} and {output_pdf}")
+
+
+def main():
+    for key, name in DATASETS.items():
+        print(f"Processing {name} dataset...")
+        plot_dataset(key, name)
+
+
+if __name__ == "__main__":
+    main()
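
The comments in `plot_dataset` mention the Pareto frontier, although the script plots every point. If a frontier were wanted, it could be computed along these lines. This is a sketch only: `pareto_frontier` is a hypothetical helper that is not part of this change, and the sample points are made up for illustration rather than taken from the benchmark.

```python
def pareto_frontier(points):
    """Return the (recall, latency) points that no other point dominates.

    A point is dominated when another point has recall at least as high and
    latency at least as low, and is strictly better in one of the two.
    """
    frontier = []
    best_latency = float("inf")
    # Walk from the highest recall down; among equal recalls, lowest latency first.
    for recall, latency in sorted(points, key=lambda p: (-p[0], p[1])):
        # Keep the point only if it is faster than every point with higher
        # or equal recall seen so far.
        if latency < best_latency:
            frontier.append((recall, latency))
            best_latency = latency
    return frontier[::-1]  # ascending recall, ready for plotting


# Made-up (recall, latency in ms) pairs, for illustration only.
sample = [(0.80, 5.0), (0.85, 4.0), (0.90, 12.0), (0.95, 36.0), (0.90, 20.0)]
frontier = pareto_frontier(sample)
```

Here `(0.80, 5.0)` is dropped because `(0.85, 4.0)` has both higher recall and lower latency, and `(0.90, 20.0)` is dropped in favor of `(0.90, 12.0)`.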