
Commit 3ae9a11

feat: Add benchmarking scripts for ArcadeDB and FAISS with memory monitoring
- Introduced `run_arcadedb_sweep.sh` for benchmarking ArcadeDB with parameter sweeps.
- Added `run_faiss_sweep.sh` for benchmarking FAISS with various index configurations.
- Removed `run_with_memory_monitor.sh` as it is no longer needed.
- Created `summarize_arcadedb_msmarco.py` to generate markdown summaries from ArcadeDB benchmark results.
- Created `summarize_faiss_msmarco.py` to generate markdown summaries from FAISS benchmark results.
- Enhanced `Database` class to support `add_hierarchy` parameter for vector indexing.
- Updated JVM configuration to limit threads for JVector graph construction.
- Added tests for `add_hierarchy` parameter in vector index creation.
- Updated documentation to reflect new features and changes.
1 parent 02d5292 commit 3ae9a11

38 files changed

Lines changed: 2327 additions & 4128 deletions

bindings/python/.gitignore

Lines changed: 3 additions & 0 deletions
@@ -39,3 +39,6 @@ README.old.md
3939

4040
# Jupyter Notebooks for internal testing
4141
notebooks/
42+
43+
# local built jars
44+
local-jars/

bindings/python/Dockerfile.build

Lines changed: 21 additions & 3 deletions
@@ -13,6 +13,9 @@ ARG PACKAGE_NAME=arcadedb-embedded
1313
ARG PACKAGE_DESCRIPTION="ArcadeDB embedded multi-model database with bundled JRE - no Java installation required"
1414
ARG ARCADEDB_TAG
1515
ARG TARGET_PLATFORM=linux-x64
16+
# When set to 1, prefer jars provided in bindings/python/local-jars/lib from the build context.
17+
# If no local jars are present, the build fails fast to avoid silently falling back.
18+
ARG USE_LOCAL_JARS=0
1619

1720
# Stage 1: Use prebuilt ArcadeDB image to obtain compiled JARs
1821
# JARs are filtered based on jar_exclusions.txt in later stages
@@ -24,12 +27,27 @@ FROM arcadedata/arcadedb:${ARCADEDB_TAG} AS java-builder
2427
FROM eclipse-temurin:25-jdk-jammy AS jre-builder
2528

2629
ARG TARGET_PLATFORM
30+
ARG USE_LOCAL_JARS
2731

2832
WORKDIR /build
2933

30-
# Copy JARs from ArcadeDB image
31-
RUN mkdir -p /build/jars
32-
COPY --from=java-builder /home/arcadedb/lib /build/jars/
34+
# Stash upstream jars from the ArcadeDB image
35+
RUN mkdir -p /build/upstream-jars /build/jars /build/local-jars
36+
COPY --from=java-builder /home/arcadedb/lib /build/upstream-jars/
37+
38+
# Optionally bring in locally built jars from the repo (bindings/python/local-jars/lib)
39+
COPY bindings/python/local-jars/lib/ /build/local-jars/
40+
41+
# Select jar source: local jars when USE_LOCAL_JARS=1 (fail fast if none are present); otherwise use the upstream image jars
42+
RUN if [ "$USE_LOCAL_JARS" = "1" ]; then \
43+
if [ -d /build/local-jars ] && [ "$(ls -1 /build/local-jars | wc -l)" -gt 0 ]; then \
44+
echo "📦 Using local jars from build context" && cp /build/local-jars/* /build/jars/; \
45+
else \
46+
echo "❌ USE_LOCAL_JARS=1 but no jars found in bindings/python/local-jars/lib" && exit 1; \
47+
fi; \
48+
else \
49+
echo "📦 Using ArcadeDB image jars" && cp /build/upstream-jars/* /build/jars/; \
50+
fi
3351

3452
# Copy JAR exclusion list
3553
COPY bindings/python/jar_exclusions.txt /build/jar_exclusions.txt
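
The jar-selection step in this Dockerfile has three outcomes: local jars, upstream jars, or a hard failure. As a rough illustration only, the same decision can be written as a small Python function. The function name `select_jar_source` and its arguments are hypothetical and do not exist in the repository; the sketch mirrors the shell `if`/`else` above, including the check for any directory entry rather than only `*.jar` files.

```python
def select_jar_source(use_local_jars, local_entries):
    """Mirror the Dockerfile's jar-selection branch (illustrative only).

    use_local_jars: value of the USE_LOCAL_JARS build arg, as a string.
    local_entries:  names found in /build/local-jars (any files, not only jars).
    """
    if use_local_jars == "1":
        if local_entries:
            # Local jars were requested and something is staged: use them.
            return "local"
        # Requested but nothing staged: fail instead of silently falling back.
        raise RuntimeError(
            "USE_LOCAL_JARS=1 but no jars found in bindings/python/local-jars/lib"
        )
    # Default path: jars copied from the ArcadeDB image.
    return "upstream"
```

Note that staged local jars are ignored unless `USE_LOCAL_JARS` is exactly `1`, which matches the string comparison in the `RUN` step.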

bindings/python/build.sh

Lines changed: 60 additions & 6 deletions
@@ -1,6 +1,15 @@
11
#!/bin/bash
22
# ArcadeDB Python Package Build Script
3-
# Builds arcadedb-embedded package with bundled JRE (no Java installation required)
3+
# Builds arcadedb-embedded with a bundled JRE (no Java install needed).
4+
# JAR sourcing is explicit: provide a JAR directory to embed those artifacts;
5+
# otherwise JARs are pulled from the arcadedata/arcadedb image.
6+
#
7+
# Quick local-jar workflow (no host Java install required):
8+
# 1) Build ArcadeDB JARs in Docker:
9+
# docker run --rm -v "$PWD":/src -w /src maven:3.9-eclipse-temurin-25 \
10+
# sh -c "git config --global --add safe.directory /src && ./mvnw -DskipTests -pl package -am package"
11+
# 2) Point the build at your JAR directory:
12+
# cd bindings/python && ./build.sh linux/amd64 3.12 package/target/arcadedb-*/lib
413

514
set -euo pipefail
615

@@ -15,6 +24,7 @@ NC='\033[0m' # No Color
1524
# Parse command line arguments
1625
PLATFORM="${1:-}"
1726
PYTHON_VERSION="${2:-3.12}"
27+
JAR_LIB_DIR="${3:-}"
1828

1929
print_header() {
2030
echo -e "${BLUE}╔════════════════════════════════════════════════════════════╗${NC}"
@@ -24,7 +34,7 @@ print_header() {
2434
}
2535

2636
print_usage() {
27-
echo "Usage: $0 [PLATFORM] [PYTHON_VERSION]"
37+
echo "Usage: $0 [PLATFORM] [PYTHON_VERSION] [JAR_LIB_DIR]"
2838
echo ""
2939
echo "Builds arcadedb-embedded package with bundled JRE"
3040
echo "No external Java installation required!"
@@ -42,15 +52,20 @@ print_usage() {
4252
echo " Python version for wheel (default: 3.12)"
4353
echo " Examples: 3.10, 3.11, 3.12, 3.13, 3.14"
4454
echo ""
55+
echo "JAR_LIB_DIR (optional):"
56+
echo " Directory containing ArcadeDB JARs to embed"
57+
echo " If omitted, JARs are pulled from arcadedata/arcadedb:<version>"
58+
echo ""
4559
echo "Build Methods:"
4660
echo " Native: macOS and Windows build natively on their platforms"
4761
echo " Docker: Linux uses Docker for manylinux compliance"
4862
echo ""
4963
echo "Examples:"
50-
echo " $0 # Build for current platform with Python 3.12"
51-
echo " $0 linux/amd64 # Build for Linux x86_64 with Python 3.12 (via Docker)"
52-
echo " $0 linux/amd64 3.11 # Build for Linux x86_64 with Python 3.11 (via Docker)"
53-
echo " $0 darwin/arm64 3.12 # Build for macOS ARM64 with Python 3.12 (native)"
64+
echo " $0 # Build for current platform with Python 3.12"
65+
echo " $0 linux/amd64 # Build for Linux x86_64 with Python 3.12 (Docker)"
66+
echo " $0 linux/amd64 3.11 # Build for Linux x86_64 with Python 3.11 (Docker)"
67+
echo " $0 linux/amd64 3.12 /path/to/jars # Build using JARs from /path/to/jars"
68+
echo " $0 darwin/arm64 3.12 # Build for macOS ARM64 with Python 3.12 (native)"
5469
echo ""
5570
echo "Package features:"
5671
echo " ✅ Bundled platform-specific JRE (no Java required)"
@@ -116,6 +131,42 @@ DOCKER_TAG=$(python3 "$SCRIPT_DIR/extract_version.py" --format=docker)
116131
echo -e "${CYAN}📌 Docker tag: ${YELLOW}${DOCKER_TAG}${NC}"
117132
echo ""
118133

134+
# Select jar source: explicit directory when provided; otherwise pull from ArcadeDB image
135+
LOCAL_JARS_DIR="$SCRIPT_DIR/local-jars/lib"
136+
USE_LOCAL_JARS_ARG=""
137+
JAR_SOURCE_DESC="ArcadeDB image"
138+
mkdir -p "$LOCAL_JARS_DIR"
139+
140+
if [[ -n "$JAR_LIB_DIR" ]]; then
141+
if [[ ! -d "$JAR_LIB_DIR" ]]; then
142+
echo -e "${RED}❌ JAR_LIB_DIR not found: ${JAR_LIB_DIR}${NC}"
143+
exit 1
144+
fi
145+
146+
if ! find "$JAR_LIB_DIR" -maxdepth 1 -name "*.jar" -print -quit | grep -q .; then
147+
echo -e "${RED}❌ No JAR files found in: ${JAR_LIB_DIR}${NC}"
148+
exit 1
149+
fi
150+
151+
JAR_LIB_DIR_REAL=$(realpath "$JAR_LIB_DIR")
152+
LOCAL_JARS_REAL=$(realpath "$LOCAL_JARS_DIR")
153+
154+
if [[ "$JAR_LIB_DIR_REAL" == "$LOCAL_JARS_REAL" ]]; then
155+
echo -e "${CYAN}📦 Using pre-staged JARs in: ${YELLOW}${JAR_LIB_DIR}${NC}"
156+
JAR_SOURCE_DESC="${JAR_LIB_DIR} (pre-staged)"
157+
else
158+
echo -e "${CYAN}📦 Using provided JAR directory: ${YELLOW}${JAR_LIB_DIR}${NC}"
159+
rm -rf "$LOCAL_JARS_DIR"
160+
mkdir -p "$LOCAL_JARS_DIR"
161+
cp -a "$JAR_LIB_DIR"/*.jar "$LOCAL_JARS_DIR"/
162+
JAR_SOURCE_DESC="${JAR_LIB_DIR} (staged into local-jars)"
163+
fi
164+
165+
USE_LOCAL_JARS_ARG="--build-arg USE_LOCAL_JARS=1"
166+
else
167+
echo -e "${CYAN}📦 Using JARs from ArcadeDB image (no JAR_LIB_DIR provided)${NC}"
168+
fi
169+
119170
# Determine build method: native or Docker
120171
# Use native build if we're already on the target platform
121172
CURRENT_OS="$(uname -s)"
@@ -162,6 +213,7 @@ echo -e "${CYAN}📋 Build Configuration:${NC}"
162213
echo -e " Package: ${YELLOW}arcadedb-embedded${NC}"
163214
echo -e " Platform: ${YELLOW}${PLATFORM}${NC}"
164215
echo -e " Python Version: ${YELLOW}${PYTHON_VERSION}${NC}"
216+
echo -e " JAR Source: ${YELLOW}${JAR_SOURCE_DESC}${NC}"
165217
echo -e " JRE: ${YELLOW}Bundled (end users need no Java)${NC}"
166218
echo -e " Build Method: ${YELLOW}${BUILD_METHOD}${NC}"
167219
echo ""
@@ -219,6 +271,7 @@ else
219271
--build-arg PACKAGE_DESCRIPTION="$DESCRIPTION" \
220272
--build-arg ARCADEDB_TAG="$DOCKER_TAG" \
221273
--build-arg TARGET_PLATFORM="$TARGET_PLATFORM" \
274+
$USE_LOCAL_JARS_ARG \
222275
$BUILD_VERSION_ARG \
223276
--target export \
224277
-t arcadedb-python-package-export \
@@ -234,6 +287,7 @@ else
234287
--build-arg PACKAGE_DESCRIPTION="$DESCRIPTION" \
235288
--build-arg ARCADEDB_TAG="$DOCKER_TAG" \
236289
--build-arg TARGET_PLATFORM="$TARGET_PLATFORM" \
290+
$USE_LOCAL_JARS_ARG \
237291
$BUILD_VERSION_ARG \
238292
--target tester \
239293
-t arcadedb-python-package \

bindings/python/build_and_install.sh

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
1+
#!/usr/bin/env bash
2+
# End-to-end helper to rebuild ArcadeDB JARs (via Docker Maven),
3+
# stage them for the Python wheel, build the wheel, and install it with uv.
4+
# Usage: run from any directory: ./bindings/python/build_and_install.sh
5+
6+
set -euo pipefail
7+
8+
log() {
9+
echo "[build] $*" >&2
10+
}
11+
12+
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
13+
REPO_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"
14+
PY_BINDINGS_DIR="${REPO_ROOT}/bindings/python"
15+
LOCAL_JARS_DIR="${PY_BINDINGS_DIR}/local-jars/lib"
16+
17+
log "Repo root: ${REPO_ROOT}"
18+
19+
# 1) Build ArcadeDB JARs in Docker (Java 25)
20+
log "Building ArcadeDB JARs via Docker Maven..."
21+
docker run --rm \
22+
-v "${REPO_ROOT}":/src \
23+
-w /src \
24+
maven:3.9-eclipse-temurin-25 \
25+
sh -c "git config --global --add safe.directory /src && ./mvnw -DskipTests -pl package -am package"
26+
27+
# 2) Copy freshly built JARs into local-jars for the Python build
28+
log "Staging JARs into bindings/python/local-jars/lib..."
29+
mkdir -p "${LOCAL_JARS_DIR}"
30+
HEADLESS_LIB_DIR="${REPO_ROOT}/package/target/arcadedb-26.1.1-SNAPSHOT-headless.dir/arcadedb-26.1.1-SNAPSHOT/lib"
31+
if [[ ! -d "${HEADLESS_LIB_DIR}" ]]; then
32+
echo "❌ Headless package lib directory not found at ${HEADLESS_LIB_DIR}" >&2
33+
exit 1
34+
fi
35+
cp "${HEADLESS_LIB_DIR}"/*.jar "${LOCAL_JARS_DIR}/"
36+
log "Copied $(ls -1 "${LOCAL_JARS_DIR}" | wc -l) JARs into local stash"
37+
38+
# 3) Build the Python wheel (defaults to linux/amd64 + Python 3.12) using staged JARs
39+
log "Building Python wheel with local JARs..."
40+
cd "${PY_BINDINGS_DIR}"
41+
./build.sh linux/amd64 3.12 "${LOCAL_JARS_DIR}"
42+
43+
# 4) Install the wheel with uv (force reinstall)
44+
log "Installing wheel via uv pip..."
45+
WHEEL_PATH=$(ls -1 dist/arcadedb_embedded-*-linux_x86_64.whl | sort | tail -n 1)
46+
if [[ -z "${WHEEL_PATH}" ]]; then
47+
echo "❌ No wheel found in dist/. Did the build succeed?" >&2
48+
exit 1
49+
fi
50+
uv pip install --force-reinstall "${WHEEL_PATH}"
51+
log "Installed ${WHEEL_PATH}"
52+
53+
log "All done."

bindings/python/examples/.gitignore

Lines changed: 0 additions & 3 deletions
@@ -10,6 +10,3 @@ data/
1010
# Benchmark logs
1111
benchmark_logs/
1212
benchmark_work/
13-
14-
datasets/
15-
benchmark-vector/sweep_results*
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
datasets/
2+
benchmark-vector/sweep_results*
3+
arcadedb_runs/
Lines changed: 0 additions & 151 deletions
@@ -1,151 +0,0 @@
1-
# Vector Search Benchmark: ArcadeDB (JVector) vs FAISS
2-
3-
This benchmark compares the performance of **ArcadeDB's Vector Index** (based on
4-
JVector + LSM Tree) against **FAISS** (Facebook AI Similarity Search) using standard ANN
5-
datasets.
6-
7-
## 1. Algorithms Tested
8-
9-
We evaluated the following vector index implementations:
10-
11-
- **ArcadeDB (JVector + LSM)**: ArcadeDB uses JVector for graph-based vector indexing,
12-
integrated with an LSM-tree architecture to provide transactional, persistent, and
13-
database-like capabilities. JVector combines the best of **HNSW** (Hierarchical
14-
Navigable Small World) and **DiskANN** algorithms to offer high performance on
15-
disk-based indexes.
16-
- **FAISS**: We tested four popular index types:
17-
- `HNSW` (Hierarchical Navigable Small World) - Graph-based.
18-
- `HNSW_PQ` (HNSW with Product Quantization) - Graph + Compressed.
19-
- `IVF_FLAT` (Inverted File with Flat vectors) - Quantization-based.
20-
- `IVF_PQ` (Inverted File with Product Quantization) - Compressed.
21-
22-
## 2. Datasets
23-
24-
We used two widely recognized datasets from `ann-benchmarks`:
25-
26-
1. **SIFT-128-Euclidean**
27-
28-
- **Vectors**: 1,000,000
29-
- **Dimensions**: 128
30-
- **Metric**: Euclidean Distance
31-
- **Difficulty**: Moderate.
32-
33-
2. **GloVe-100-Angular**
34-
- **Vectors**: ~1.2 Million (1,183,514)
35-
- **Dimensions**: 100
36-
- **Metric**: Cosine Similarity
37-
- **Difficulty**: Hard. As seen in the results, all algorithms achieve lower
38-
recall values compared to SIFT for the same parameters.
39-
40-
## 3. Hardware Environment
41-
42-
All benchmarks were executed on the following hardware:
43-
44-
- **CPU**: AMD Ryzen 9 7950X 16-Core Processor
45-
- **RAM**: 128 GB DDR5 (4×32 GB) at 3600 MT/s (Corsair)
46-
- **Disk**: Samsung SSD 970 EVO Plus 2TB
47-
- **GPU**: None (All benchmarks ran on CPU)
48-
49-
## 4. Benchmark Results
50-
51-
The following figures visualize the trade-off between **Recall@10** and **Latency
52-
(ms)**.
53-
54-
- **X-Axis (Recall)**: Higher is better (Right).
55-
- **Y-Axis (Latency)**: Lower is better (Down).
56-
- **Goal**: The ideal performance is in the **bottom-right corner** (High Recall, Low
57-
Latency).
58-
59-
Each dot represents a specific configuration (parameter set) for an algorithm. We use
60-
scatter plots because connecting dots with lines implies a continuum that doesn't
61-
strictly exist across different discrete parameter combinations (e.g.,
62-
`max_connections`, `ef_construction`, `nprobe`).
63-
64-
For detailed parameter values and raw metrics, please refer to the markdown files in the
65-
[`./results/`](./results/) directory.
66-
67-
### Note on Legend Metrics
68-
69-
The legend in the figures displays **Peak Memory** and **Avg Build** time. These metrics
70-
should be interpreted with the following context:
71-
72-
- **Peak Memory**: This represents the **global maximum RSS** (Resident Set Size)
73-
observed during the entire benchmark run for that algorithm. Since the script
74-
iterates through multiple parameter configurations (some heavier than others) in a
75-
single run, this value reflects the high-water mark of the most resource-intensive
76-
configuration, not necessarily the specific memory usage for every data point shown.
77-
- **Avg Build**: This is the **arithmetic mean** of the build times across all
78-
configurations tested for that algorithm. As build time varies significantly with
79-
parameters (e.g., `max_connections`, `ef_construction`), this serves as a general
80-
ballpark figure rather than a precise measurement for each specific point.
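
The two legend values described above reduce to a maximum and an arithmetic mean over one benchmark run. The sketch below assumes the run yields a list of RSS samples and one build time per configuration; the function name and the dictionary keys are hypothetical.

```python
from statistics import mean


def legend_metrics(rss_samples_mb, build_times_s):
    """Aggregate one algorithm's run into the two legend values."""
    return {
        # High-water mark across every configuration in the run.
        "peak_memory_mb": max(rss_samples_mb),
        # Arithmetic mean over all tested configurations.
        "avg_build_s": mean(build_times_s),
    }
```

Because both values are aggregated per run, neither can be attributed to an individual point in the scatter plot.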
81-
82-
### SIFT-128-Euclidean Results
83-
84-
![SIFT Results](figures/plot_sift-128-euclidean.png)
85-
_(PDF version: [figures/plot_sift-128-euclidean.pdf](figures/plot_sift-128-euclidean.pdf))_
86-
87-
### GloVe-100-Angular Results
88-
89-
![GloVe Results](figures/plot_glove-100-angular.png)
90-
_(PDF version: [figures/plot_glove-100-angular.pdf](figures/plot_glove-100-angular.pdf))_
91-
92-
## 5. ArcadeDB Configuration
93-
94-
For ArcadeDB, we selected the following default configuration which offers a balanced
95-
trade-off between build time, memory usage, and search performance:
96-
97-
```python
98-
max_connections = 32
99-
beam_width = 200
100-
overquery_factor = 16
101-
```
102-
103-
**Note on Quantization**: No quantization (PQ/SQ) was used for the ArcadeDB JVector
104-
benchmarks. Quantization support is currently a Work In Progress (WIP) in the core Java
105-
engine.
106-
107-
**Note on `overquery_factor`**: Unlike FAISS or standard HNSW implementations, JVector
108-
does not use an `ef` (or `efSearch`) parameter during search. Instead, we implemented an
109-
**"overquery"** mechanism. This retrieves `k * overquery_factor` candidates from the
110-
index, sorts them by exact similarity, and returns the top `k`. This allows trading off
111-
latency for higher recall.
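
The overquery mechanism described above can be sketched in a few lines. This is an illustration under stated assumptions, not the ArcadeDB implementation: `approx_search` is a hypothetical callable standing in for the approximate index lookup, `vectors` is an in-memory list indexed by id, and cosine similarity is used as the exact metric.

```python
import math


def cosine_similarity(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def overquery_search(approx_search, vectors, query, k, overquery_factor):
    """Fetch k * overquery_factor candidates, rerank exactly, keep the top k."""
    # Hypothetical approximate lookup: returns candidate ids in any order.
    candidate_ids = approx_search(query, k * overquery_factor)
    # Rerank candidates by exact similarity to the query, best first.
    ranked = sorted(
        candidate_ids,
        key=lambda i: cosine_similarity(vectors[i], query),
        reverse=True,
    )
    return ranked[:k]
```

A larger `overquery_factor` widens the candidate pool, so more true neighbours survive the rerank, at the cost of more exact similarity computations per query.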
112-
113-
**Note on Build Time (Lazy Indexing)**: JVector employs lazy indexing, meaning the
114-
initial index object creation is nearly instantaneous. To capture the true cost of
115-
building the graph, our benchmark includes a "warmup" phase that triggers the actual
116-
indexing process. The reported **Build Time** for ArcadeDB is calculated as: `Index
117-
Creation Time + Warmup Time`.
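
The build-time formula above amounts to timing two phases and summing them. A minimal sketch follows, assuming `create_index` and `warmup` are zero-argument callables supplied by the benchmark. Both names are placeholders and do not correspond to functions in the ArcadeDB bindings.

```python
import time


def timed(fn):
    """Run fn once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def measure_build_time(create_index, warmup):
    """Build Time = Index Creation Time + Warmup Time."""
    creation_s = timed(create_index)  # near-instant with lazy indexing
    warmup_s = timed(warmup)          # triggers the actual graph construction
    return {
        "creation_s": creation_s,
        "warmup_s": warmup_s,
        "build_s": creation_s + warmup_s,
    }
```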
118-
119-
**Note on Memory Usage**: The ArcadeDB benchmark was executed with a JVM heap limit of
120-
`ARCADEDB_JVM_ARGS='-Xmx16g -Xms16g'`. However, we observed that the actual Resident Set Size
121-
(RSS) memory consumption exceeded this limit significantly, reaching as high as **41GB**
122-
in some test cases. This discrepancy suggests significant off-heap memory usage or other
123-
overheads that require further investigation in the future.
124-
125-
On the **GloVe-100-Angular** dataset (~1.2M vectors), this configuration achieved:
126-
127-
- **Recall@10**: 0.8538
128-
- **Average Latency**: 36ms
129-
- **Build Time**: ~530 seconds
130-
131-
We consider this "quite decent" for a persistent, disk-based vector store compared to
132-
purely in-memory libraries.
133-
134-
## 6. Persistence & Stability Observations
135-
136-
We explicitly tested the persistence of the vector index by closing and reopening the
137-
ArcadeDB database during the benchmark.
138-
139-
1. **Persistence Verified**: The index correctly persists to disk. We observed that
140-
**query latency remained consistent** before and after reopening the database,
141-
confirming that the index structure is preserved and loaded efficiently without
142-
needing a rebuild.
143-
2. **Recall Stability**:
144-
- **GloVe-100-Angular**: Recall values remained identical before and after the
145-
database restart, as expected.
146-
- **SIFT-128-Euclidean**: We observed a discrepancy in recall values before and
147-
after the restart. While usually small, the difference can sometimes be as high
148-
as **0.1**. The cause of this non-determinism for the Euclidean metric is
149-
currently unknown. However, since our primary production use cases rely on
150-
Cosine similarity (Angular), we have decided to deprioritize investigating this
151-
specific issue for now.
