humemai
diff --git a/bindings/python/.gitignore b/bindings/python/.gitignore
Lines changed: 3 additions & 0 deletions
diff --git a/bindings/python/Dockerfile.build b/bindings/python/Dockerfile.build
Lines changed: 21 additions & 3 deletions
diff --git a/bindings/python/build.sh b/bindings/python/build.sh
Lines changed: 60 additions & 6 deletions
diff --git a/bindings/python/build_and_install_locally.sh b/bindings/python/build_and_install_locally.sh
Lines changed: 53 additions & 0 deletions
diff --git a/bindings/python/examples/.gitignore b/bindings/python/examples/.gitignore
Lines changed: 0 additions & 3 deletions
diff --git a/bindings/python/examples/benchmark-vector/.gitignore b/bindings/python/examples/benchmark-vector/.gitignore
Lines changed: 3 additions & 0 deletions
diff --git a/bindings/python/examples/benchmark-vector/README.md b/bindings/python/examples/benchmark-vector/README.md
Lines changed: 0 additions & 151 deletions
@@ -39,3 +39,6 @@ README.old.md
 
 # Jupyter Notebooks for internal testing
 notebooks/
+
+# local built jars
+local-jars/
@@ -13,6 +13,9 @@ ARG PACKAGE_NAME=arcadedb-embedded
 ARG PACKAGE_DESCRIPTION="ArcadeDB embedded multi-model database with bundled JRE - no Java installation required"
 ARG ARCADEDB_TAG
 ARG TARGET_PLATFORM=linux-x64
+# When set to 1, prefer jars provided in bindings/python/local-jars/lib from the build context.
+# If no local jars are present, the build fails fast to avoid silently falling back.
+ARG USE_LOCAL_JARS=0
 
 # Stage 1: Use prebuilt ArcadeDB image to obtain compiled JARs
 # JARs are filtered based on jar_exclusions.txt in later stages
@@ -24,12 +27,27 @@ FROM arcadedata/arcadedb:${ARCADEDB_TAG} AS java-builder
 FROM eclipse-temurin:25-jdk-jammy AS jre-builder
 
 ARG TARGET_PLATFORM
+ARG USE_LOCAL_JARS
 
 WORKDIR /build
 
-# Copy JARs from ArcadeDB image
-RUN mkdir -p /build/jars
-COPY --from=java-builder /home/arcadedb/lib /build/jars/
+# Stash upstream jars from the ArcadeDB image
+RUN mkdir -p /build/upstream-jars /build/jars /build/local-jars
+COPY --from=java-builder /home/arcadedb/lib /build/upstream-jars/
+
+# Optionally bring in locally built jars from the repo (bindings/python/local-jars/lib)
+COPY bindings/python/local-jars/lib/ /build/local-jars/
+
+# Select jar source: local jars when USE_LOCAL_JARS=1 (fail if none are staged); otherwise use the upstream image jars
+RUN if [ "$USE_LOCAL_JARS" = "1" ]; then \
+            if [ -d /build/local-jars ] && [ "$(ls -1 /build/local-jars | wc -l)" -gt 0 ]; then \
+                echo "📦 Using local jars from build context" && cp /build/local-jars/* /build/jars/; \
+            else \
+                echo "❌ USE_LOCAL_JARS=1 but no jars found in bindings/python/local-jars/lib" && exit 1; \
+            fi; \
+        else \
+            echo "📦 Using ArcadeDB image jars" && cp /build/upstream-jars/* /build/jars/; \
+        fi
 
 # Copy JAR exclusion list
 COPY bindings/python/jar_exclusions.txt /build/jar_exclusions.txt
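The jar-selection `RUN` step above can be exercised outside Docker. The following is a minimal sketch of the same decision as a shell function, not the code the Dockerfile runs: `select_jars` and its four directory arguments are hypothetical names introduced for illustration, and it checks for a non-empty directory with `ls -A` rather than the `ls -1 | wc -l` count used in the Dockerfile.

```shell
#!/usr/bin/env bash
# Sketch of the Dockerfile's jar-selection logic as a standalone function.
# select_jars and its arguments are hypothetical, for illustration only.
set -eu

select_jars() {
    local use_local="$1" local_dir="$2" upstream_dir="$3" dest_dir="$4"
    mkdir -p "$dest_dir"
    if [ "$use_local" = "1" ]; then
        # Fail fast when local jars were requested but none are staged,
        # instead of silently falling back to the upstream jars.
        if [ -d "$local_dir" ] && [ -n "$(ls -A "$local_dir")" ]; then
            cp "$local_dir"/* "$dest_dir"/
            echo "local"
        else
            echo "USE_LOCAL_JARS=1 but no jars found in $local_dir" >&2
            return 1
        fi
    else
        cp "$upstream_dir"/* "$dest_dir"/
        echo "upstream"
    fi
}
```

Calling `select_jars 1 ./local-jars/lib ./upstream ./jars` with an empty `./local-jars/lib` returns non-zero, which mirrors the `exit 1` branch of the `RUN` step.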
 
@@ -1,6 +1,15 @@
 #!/bin/bash
 # ArcadeDB Python Package Build Script
-# Builds arcadedb-embedded package with bundled JRE (no Java installation required)
+# Builds arcadedb-embedded with a bundled JRE (no Java install needed).
+# JAR sourcing is explicit: provide a JAR directory to embed those artifacts;
+# otherwise JARs are pulled from the arcadedata/arcadedb image.
+#
+# Quick local-jar workflow (no host Java install required):
+#   1) Build ArcadeDB JARs in Docker:
+#        docker run --rm -v "$PWD":/src -w /src maven:3.9-eclipse-temurin-25 \
+#          sh -c "git config --global --add safe.directory /src && ./mvnw -DskipTests -pl package -am package"
+#   2) Point the build at your JAR directory:
+#        cd bindings/python && ./build.sh linux/amd64 3.12 ../../package/target/arcadedb-*-headless.dir/arcadedb-*/lib
 
 set -euo pipefail
 
@@ -15,6 +24,7 @@ NC='\033[0m' # No Color
 # Parse command line arguments
 PLATFORM="${1:-}"
 PYTHON_VERSION="${2:-3.12}"
+JAR_LIB_DIR="${3:-}"
 
 print_header() {
     echo -e "${BLUE}╔════════════════════════════════════════════════════════════╗${NC}"
@@ -24,7 +34,7 @@ print_header() {
 }
 
 print_usage() {
-    echo "Usage: $0 [PLATFORM] [PYTHON_VERSION]"
+    echo "Usage: $0 [PLATFORM] [PYTHON_VERSION] [JAR_LIB_DIR]"
     echo ""
     echo "Builds arcadedb-embedded package with bundled JRE"
     echo "No external Java installation required!"
@@ -42,15 +52,20 @@ print_usage() {
     echo "  Python version for wheel (default: 3.12)"
     echo "  Examples: 3.10, 3.11, 3.12, 3.13, 3.14"
     echo ""
+    echo "JAR_LIB_DIR (optional):"
+    echo "  Directory containing ArcadeDB JARs to embed"
+    echo "  If omitted, JARs are pulled from arcadedata/arcadedb:<version>"
+    echo ""
     echo "Build Methods:"
     echo "  Native: macOS and Windows build natively on their platforms"
     echo "  Docker: Linux uses Docker for manylinux compliance"
     echo ""
     echo "Examples:"
-    echo "  $0                    # Build for current platform with Python 3.12"
-    echo "  $0 linux/amd64        # Build for Linux x86_64 with Python 3.12 (via Docker)"
-    echo "  $0 linux/amd64 3.11   # Build for Linux x86_64 with Python 3.11 (via Docker)"
-    echo "  $0 darwin/arm64 3.12  # Build for macOS ARM64 with Python 3.12 (native)"
+    echo "  $0                                    # Build for current platform with Python 3.12"
+    echo "  $0 linux/amd64                        # Build for Linux x86_64 with Python 3.12 (Docker)"
+    echo "  $0 linux/amd64 3.11                   # Build for Linux x86_64 with Python 3.11 (Docker)"
+    echo "  $0 linux/amd64 3.12 /path/to/jars     # Build using JARs from /path/to/jars"
+    echo "  $0 darwin/arm64 3.12                  # Build for macOS ARM64 with Python 3.12 (native)"
     echo ""
     echo "Package features:"
     echo "  ✅ Bundled platform-specific JRE (no Java required)"
@@ -116,6 +131,42 @@ DOCKER_TAG=$(python3 "$SCRIPT_DIR/extract_version.py" --format=docker)
 echo -e "${CYAN}📌 Docker tag: ${YELLOW}${DOCKER_TAG}${NC}"
 echo ""
 
+# Select jar source: explicit directory when provided; otherwise pull from ArcadeDB image
+LOCAL_JARS_DIR="$SCRIPT_DIR/local-jars/lib"
+USE_LOCAL_JARS_ARG=""
+JAR_SOURCE_DESC="ArcadeDB image"
+mkdir -p "$LOCAL_JARS_DIR"
+
+if [[ -n "$JAR_LIB_DIR" ]]; then
+    if [[ ! -d "$JAR_LIB_DIR" ]]; then
+        echo -e "${RED}❌ JAR_LIB_DIR not found: ${JAR_LIB_DIR}${NC}"
+        exit 1
+    fi
+
+    if ! find "$JAR_LIB_DIR" -maxdepth 1 -name "*.jar" -print -quit | grep -q .; then
+        echo -e "${RED}❌ No JAR files found in: ${JAR_LIB_DIR}${NC}"
+        exit 1
+    fi
+
+    JAR_LIB_DIR_REAL=$(realpath "$JAR_LIB_DIR")
+    LOCAL_JARS_REAL=$(realpath "$LOCAL_JARS_DIR")
+
+    if [[ "$JAR_LIB_DIR_REAL" == "$LOCAL_JARS_REAL" ]]; then
+        echo -e "${CYAN}📦 Using pre-staged JARs in: ${YELLOW}${JAR_LIB_DIR}${NC}"
+        JAR_SOURCE_DESC="${JAR_LIB_DIR} (pre-staged)"
+    else
+        echo -e "${CYAN}📦 Using provided JAR directory: ${YELLOW}${JAR_LIB_DIR}${NC}"
+        rm -rf "$LOCAL_JARS_DIR"
+        mkdir -p "$LOCAL_JARS_DIR"
+        cp -a "$JAR_LIB_DIR"/*.jar "$LOCAL_JARS_DIR"/
+        JAR_SOURCE_DESC="${JAR_LIB_DIR} (staged into local-jars)"
+    fi
+
+    USE_LOCAL_JARS_ARG="--build-arg USE_LOCAL_JARS=1"
+else
+    echo -e "${CYAN}📦 Using JARs from ArcadeDB image (no JAR_LIB_DIR provided)${NC}"
+fi
+
 # Determine build method: native or Docker
 # Use native build if we're already on the target platform
 CURRENT_OS="$(uname -s)"
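The `JAR_LIB_DIR` handling in this hunk validates the directory, checks for at least one top-level `*.jar`, and restages only when the source differs from `local-jars/lib`. The following is a minimal sketch of that flow as a reusable function. `stage_jars` and `resolve_dir` are hypothetical helper names, and `resolve_dir` uses `cd ... && pwd -P` as an assumed substitute for `realpath` on hosts where that command may be missing.

```shell
#!/usr/bin/env bash
# Sketch of the JAR_LIB_DIR validation and staging flow from build.sh.
# stage_jars and resolve_dir are hypothetical helper names.
set -eu

# Resolve a directory to its physical path without relying on realpath.
resolve_dir() {
    (cd "$1" && pwd -P)
}

stage_jars() {
    local src="$1" dest="$2"
    if [ ! -d "$src" ]; then
        echo "JAR_LIB_DIR not found: $src" >&2
        return 1
    fi
    # Passes only if at least one top-level *.jar exists in src.
    if ! find "$src" -maxdepth 1 -name '*.jar' -print -quit | grep -q .; then
        echo "No JAR files found in: $src" >&2
        return 1
    fi
    mkdir -p "$dest"
    if [ "$(resolve_dir "$src")" = "$(resolve_dir "$dest")" ]; then
        # Source already is the staging directory: nothing to copy.
        echo "pre-staged"
    else
        # Replace any previously staged jars with the provided ones.
        rm -rf "$dest"
        mkdir -p "$dest"
        cp -a "$src"/*.jar "$dest"/
        echo "staged"
    fi
}
```

Comparing resolved paths before the `rm -rf` matters: without it, passing the staging directory itself as `JAR_LIB_DIR` would delete the jars that are about to be copied.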
@@ -162,6 +213,7 @@ echo -e "${CYAN}📋 Build Configuration:${NC}"
 echo -e "   Package: ${YELLOW}arcadedb-embedded${NC}"
 echo -e "   Platform: ${YELLOW}${PLATFORM}${NC}"
 echo -e "   Python Version: ${YELLOW}${PYTHON_VERSION}${NC}"
+echo -e "   JAR Source: ${YELLOW}${JAR_SOURCE_DESC}${NC}"
 echo -e "   JRE: ${YELLOW}Bundled (end users need no Java)${NC}"
 echo -e "   Build Method: ${YELLOW}${BUILD_METHOD}${NC}"
 echo ""
@@ -219,6 +271,7 @@ else
         --build-arg PACKAGE_DESCRIPTION="$DESCRIPTION" \
         --build-arg ARCADEDB_TAG="$DOCKER_TAG" \
         --build-arg TARGET_PLATFORM="$TARGET_PLATFORM" \
+        $USE_LOCAL_JARS_ARG \
         $BUILD_VERSION_ARG \
         --target export \
         -t arcadedb-python-package-export \
@@ -234,6 +287,7 @@ else
         --build-arg PACKAGE_DESCRIPTION="$DESCRIPTION" \
         --build-arg ARCADEDB_TAG="$DOCKER_TAG" \
         --build-arg TARGET_PLATFORM="$TARGET_PLATFORM" \
+        $USE_LOCAL_JARS_ARG \
         $BUILD_VERSION_ARG \
         --target tester \
         -t arcadedb-python-package \
 
@@ -0,0 +1,53 @@
+#!/usr/bin/env bash
+# End-to-end helper to rebuild ArcadeDB JARs (via Docker Maven),
+# stage them for the Python wheel, build the wheel, and install it with uv.
+# Usage: run from any directory: ./bindings/python/build_and_install_locally.sh
+
+set -euo pipefail
+
+log() {
+    echo "[build] $*" >&2
+}
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"
+PY_BINDINGS_DIR="${REPO_ROOT}/bindings/python"
+LOCAL_JARS_DIR="${PY_BINDINGS_DIR}/local-jars/lib"
+
+log "Repo root: ${REPO_ROOT}"
+
+# 1) Build ArcadeDB JARs in Docker (Java 25)
+log "Building ArcadeDB JARs via Docker Maven..."
+docker run --rm \
+    -v "${REPO_ROOT}":/src \
+    -w /src \
+    maven:3.9-eclipse-temurin-25 \
+    sh -c "git config --global --add safe.directory /src && ./mvnw -DskipTests -pl package -am package"
+
+# 2) Copy freshly built JARs into local-jars for the Python build
+log "Staging JARs into bindings/python/local-jars/lib..."
+mkdir -p "${LOCAL_JARS_DIR}"
+HEADLESS_LIB_DIR="${REPO_ROOT}/package/target/arcadedb-26.1.1-SNAPSHOT-headless.dir/arcadedb-26.1.1-SNAPSHOT/lib"
+if [[ ! -d "${HEADLESS_LIB_DIR}" ]]; then
+    echo "❌ Headless package lib directory not found at ${HEADLESS_LIB_DIR}" >&2
+    exit 1
+fi
+cp "${HEADLESS_LIB_DIR}"/*.jar "${LOCAL_JARS_DIR}/"
+log "Copied $(ls -1 "${LOCAL_JARS_DIR}" | wc -l) JARs into local stash"
+
+# 3) Build the Python wheel (defaults to linux/amd64 + Python 3.12) using staged JARs
+log "Building Python wheel with local JARs..."
+cd "${PY_BINDINGS_DIR}"
+./build.sh linux/amd64 3.12 "${LOCAL_JARS_DIR}"
+
+# 4) Install the wheel with uv (force reinstall)
+log "Installing wheel via uv pip..."
+WHEEL_PATH=$(ls -1 dist/arcadedb_embedded-*-linux_x86_64.whl | sort | tail -n 1)
+if [[ -z "${WHEEL_PATH}" ]]; then
+    echo "❌ No wheel found in dist/. Did the build succeed?" >&2
+    exit 1
+fi
+uv pip install --force-reinstall "${WHEEL_PATH}"
+log "Installed ${WHEEL_PATH}"
+
+log "All done."
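One note on step 4 of the script above. Under `set -euo pipefail`, a failing `ls` inside `$(...)` makes the `WHEEL_PATH=` assignment fail, so the script exits before the `-z` check can print its own message. The following is a minimal sketch of an alternative that selects the wheel with a glob loop instead. `latest_wheel` is a hypothetical helper name, and the filename pattern is copied from the script.

```shell
#!/usr/bin/env bash
# Sketch: pick the lexicographically last matching wheel without piping ls,
# so a missing wheel reaches the explicit check instead of aborting early.
# latest_wheel is a hypothetical helper name.
set -eu

latest_wheel() {
    local dist_dir="$1" last="" candidate
    # Glob expansion is sorted, so the final match is the same file that
    # `sort | tail -n 1` would select.
    for candidate in "$dist_dir"/arcadedb_embedded-*-linux_x86_64.whl; do
        # An unmatched glob stays literal, so skip paths that do not exist.
        [ -e "$candidate" ] || continue
        last="$candidate"
    done
    if [ -z "$last" ]; then
        echo "No wheel found in $dist_dir. Did the build succeed?" >&2
        return 1
    fi
    printf '%s\n' "$last"
}
```

A caller could then write `WHEEL_PATH=$(latest_wheel dist) || exit 1` and keep the rest of the install step unchanged.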
@@ -10,6 +10,3 @@ data/
 # Benchmark logs
 benchmark_logs/
 benchmark_work/
-
-datasets/
-benchmark-vector/sweep_results*
@@ -0,0 +1,3 @@
+datasets/
+benchmark-vector/sweep_results*
+arcadedb_runs/
@@ -1,151 +0,0 @@
-# Vector Search Benchmark: ArcadeDB (JVector) vs FAISS
-
-This benchmark compares the performance of **ArcadeDB's Vector Index** (based on
-JVector + LSM Tree) against **FAISS** (Facebook AI Similarity Search) using standard ANN
-datasets.
-
-## 1. Algorithms Tested
-
-We evaluated the following vector index implementations:
-
-- **ArcadeDB (JVector + LSM)**: ArcadeDB uses JVector for graph-based vector indexing,
-  integrated with an LSM-tree architecture to provide transactional, persistent, and
-  database-like capabilities. JVector combines the best of **HNSW** (Hierarchical
-  Navigable Small World) and **DiskANN** algorithms to offer high performance on
-  disk-based indexes.
-- **FAISS**: We tested four popular index types:
-  - `HNSW` (Hierarchical Navigable Small World) - Graph-based.
-  - `HNSW_PQ` (HNSW with Product Quantization) - Graph + Compressed.
-  - `IVF_FLAT` (Inverted File with Flat vectors) - Quantization-based.
-  - `IVF_PQ` (Inverted File with Product Quantization) - Compressed.
-
-## 2. Datasets
-
-We used two widely recognized datasets from `ann-benchmarks`:
-
-1.  **SIFT-128-Euclidean**
-
-    - **Vectors**: 1,000,000
-    - **Dimensions**: 128
-    - **Metric**: Euclidean Distance
-    - **Difficulty**: Moderate.
-
-2.  **GloVe-100-Angular**
-    - **Vectors**: ~1.2 Million (1,183,514)
-    - **Dimensions**: 100
-    - **Metric**: Cosine Similarity
-    - **Difficulty**: Hard. As seen in the results, all algorithms achieve lower
-      recall values compared to SIFT for the same parameters.
-
-## 3. Hardware Environment
-
-All benchmarks were executed on the following hardware:
-
-- **CPU**: AMD Ryzen 9 7950X 16-Core Processor
-- **RAM**: 128 GB DDR5 (4×32 GB) at 3600 MT/s (Corsair)
-- **Disk**: Samsung SSD 970 EVO Plus 2TB
-- **GPU**: None (All benchmarks ran on CPU)
-
-## 4. Benchmark Results
-
-The following figures visualize the trade-off between **Recall@10** and **Latency
-(ms)**.
-
-- **X-Axis (Recall)**: Higher is better (Right).
-- **Y-Axis (Latency)**: Lower is better (Down).
-- **Goal**: The ideal performance is in the **bottom-right corner** (High Recall, Low
-  Latency).
-
-Each dot represents a specific configuration (parameter set) for an algorithm. We use
-scatter plots because connecting dots with lines implies a continuum that doesn't
-strictly exist across different discrete parameter combinations (e.g.,
-`max_connections`, `ef_construction`, `nprobe`).
-
-For detailed parameter values and raw metrics, please refer to the markdown files in the
-[`./results/`](./results/) directory.
-
-### Note on Legend Metrics
-
-The legend in the figures displays **Peak Memory** and **Avg Build** time. These metrics
-should be interpreted with the following context:
-
-- **Peak Memory**: This represents the **global maximum RSS** (Resident Set Size)
-  observed during the entire benchmark run for that algorithm. Since the script
-  iterates through multiple parameter configurations (some heavier than others) in a
-  single run, this value reflects the high-water mark of the most resource-intensive
-  configuration, not necessarily the specific memory usage for every data point shown.
-- **Avg Build**: This is the **arithmetic mean** of the build times across all
-  configurations tested for that algorithm. As build time varies significantly with
-  parameters (e.g., `max_connections`, `ef_construction`), this serves as a general
-  ballpark figure rather than a precise measurement for each specific point.
-
-### SIFT-128-Euclidean Results
-
-![SIFT Results](figures/plot_sift-128-euclidean.png)
-_(PDF version: [figures/plot_sift-128-euclidean.pdf](figures/plot_sift-128-euclidean.pdf))_
-
-### GloVe-100-Angular Results
-
-![GloVe Results](figures/plot_glove-100-angular.png)
-_(PDF version: [figures/plot_glove-100-angular.pdf](figures/plot_glove-100-angular.pdf))_
-
-## 5. ArcadeDB Configuration
-
-For ArcadeDB, we selected the following default configuration which offers a balanced
-trade-off between build time, memory usage, and search performance:
-
-```python
-max_connections = 32
-beam_width = 200
-overquery_factor = 16
-```
-
-**Note on Quantization**: No quantization (PQ/SQ) was used for the ArcadeDB JVector
-benchmarks. Quantization support is currently a Work In Progress (WIP) in the core Java
-engine.
-
-**Note on `overquery_factor`**: Unlike FAISS or standard HNSW implementations, JVector
-does not use an `ef` (or `efSearch`) parameter during search. Instead, we implemented an
-**"overquery"** mechanism. This retrieves `k * overquery_factor` candidates from the
-index, sorts them by exact similarity, and returns the top `k`. This allows trading off
-latency for higher recall.
-
-**Note on Build Time (Lazy Indexing)**: JVector employs lazy indexing, meaning the
-initial index object creation is nearly instantaneous. To capture the true cost of
-building the graph, our benchmark includes a "warmup" phase that triggers the actual
-indexing process. The reported **Build Time** for ArcadeDB is calculated as: `Index
-Creation Time + Warmup Time`.
-
-**Note on Memory Usage**: The ArcadeDB benchmark was executed with a JVM heap limit of
-`ARCADEDB_JVM_ARGS='-Xmx16g -Xms16g'`. However, we observed that the actual Resident Set Size
-(RSS) memory consumption exceeded this limit significantly, reaching as high as **41GB**
-in some test cases. This discrepancy suggests significant off-heap memory usage or other
-overheads that require further investigation in the future.
-
-On the **GloVe-100-Angular** dataset (~1.2M vectors), this configuration achieved:
-
-- **Recall@10**: 0.8538
-- **Average Latency**: 36ms
-- **Build Time**: ~530 seconds
-
-We consider this "quite decent" for a persistent, disk-based vector store compared to
-purely in-memory libraries.
-
-## 6. Persistence & Stability Observations
-
-We explicitly tested the persistence of the vector index by closing and reopening the
-ArcadeDB database during the benchmark.
-
-1.  **Persistence Verified**: The index correctly persists to disk. We observed that
-    **query latency remained consistent** before and after reopening the database,
-    confirming that the index structure is preserved and loaded efficiently without
-    needing a rebuild.
-2.  **Recall Stability**:
-    - **GloVe-100-Angular**: Recall values remained identical before and after the
-      database restart, as expected.
-    - **SIFT-128-Euclidean**: We observed a discrepancy in recall values before and
-      after the restart. While usually small, the difference can sometimes be as high
-      as **0.1**. The cause of this non-determinism for the Euclidean metric is
-      currently unknown. However, since our primary production use cases rely on
-      Cosine similarity (Angular), we have decided to deprioritize investigating this
-      specific issue for now.