fix: update JVM memory configuration references from ARCADEDB_JVM_MAX_HEAP to ARCADEDB_JVM_ARGS in documentation and examples

tae898 · tae898 · commit 2a08aa739f69 · 2026-01-01T20:30:37.000+01:00
diff --git a/.github/workflows/test-python-bindings.yml b/.github/workflows/test-python-bindings.yml
@@ -406,6 +406,10 @@ jobs:
             echo "The arcadedb-embedded package has been successfully built and tested." >> $GITHUB_STEP_SUMMARY
             echo "" >> $GITHUB_STEP_SUMMARY
             echo "**Package**: arcadedb-embedded" >> $GITHUB_STEP_SUMMARY
+            echo "" >> $GITHUB_STEP_SUMMARY
+            echo "ℹ️ **Note**: Some platform/Python combinations are excluded from testing:" >> $GITHUB_STEP_SUMMARY
+            echo "- Windows ARM64 + Python 3.10, 3.14 (no GitHub-hosted runners available)" >> $GITHUB_STEP_SUMMARY
+            echo "- macOS x86_64 + Python 3.13, 3.14 (no suitable dependencies available)" >> $GITHUB_STEP_SUMMARY
           else
             echo "❌ **Some platforms failed testing**" >> $GITHUB_STEP_SUMMARY
             echo "" >> $GITHUB_STEP_SUMMARY
diff --git a/.github/workflows/test-python-examples.yml b/.github/workflows/test-python-examples.yml
@@ -238,7 +238,7 @@ jobs:
         shell: bash
         env:
           # Increase JVM heap for large CSV imports (example 04)
-          ARCADEDB_JVM_MAX_HEAP: "8g"
+          ARCADEDB_JVM_ARGS: "-Xmx8g -Xms8g"
         run: |
           cd bindings/python/examples
 
@@ -450,6 +450,10 @@ jobs:
             echo "All examples ran successfully across all 6 platforms." >> $GITHUB_STEP_SUMMARY
             echo "" >> $GITHUB_STEP_SUMMARY
             echo "**Platforms tested**: linux/amd64, linux/arm64, darwin/amd64, darwin/arm64, windows/amd64, windows/arm64" >> $GITHUB_STEP_SUMMARY
+            echo "" >> $GITHUB_STEP_SUMMARY
+            echo "ℹ️ **Note**: Some platform/Python combinations are excluded from testing:" >> $GITHUB_STEP_SUMMARY
+            echo "- Windows ARM64 + Python 3.10, 3.14 (no GitHub-hosted runners available)" >> $GITHUB_STEP_SUMMARY
+            echo "- macOS x86_64 + Python 3.13, 3.14 (no suitable dependencies available)" >> $GITHUB_STEP_SUMMARY
           else
             echo "❌ **Some platforms failed example testing**" >> $GITHUB_STEP_SUMMARY
             echo "" >> $GITHUB_STEP_SUMMARY
diff --git a/bindings/python/docs/development/troubleshooting.md b/bindings/python/docs/development/troubleshooting.md
@@ -235,31 +235,136 @@ rm ./mydb/.lock
 
 ---
 
-### Memory Issues
+### Memory Configuration
 
-**Problem**: Out of memory errors
+#### JVM Memory Configuration
 
-**Solutions**:
+Configure JVM memory via the `ARCADEDB_JVM_ARGS` environment variable **before** importing `arcadedb_embedded`:
 
-1. **Increase JVM Heap**:
-   ```python
-   import jpype
+**Basic Configuration:**
 
-   # Set before first import
-   jpype.startJVM("-Xmx4g")  # 4GB heap
+```bash
+# Default: 4GB heap
+python script.py
 
-   import arcadedb_embedded as arcadedb
+# Production: 8GB heap with matching initial size
+export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"
+python script.py
+
+# One-liner
+ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g" python script.py
+```
+
+**Common JVM Options:**
+
+| Option | Description | Example |
+|--------|-------------|----------|
+| `-Xmx<size>` | Maximum heap memory | `-Xmx8g` (8 gigabytes) |
+| `-Xms<size>` | Initial heap size (recommended: same as `-Xmx`) | `-Xms8g` |
+| `-XX:MaxDirectMemorySize=<size>` | Limit off-heap direct buffers | `-XX:MaxDirectMemorySize=8g` |
+| `-Darcadedb.vectorIndex.locationCacheSize=<count>` | Max vector locations to cache (default: -1 = unlimited) | `-Darcadedb.vectorIndex.locationCacheSize=100000` |
+| `-Darcadedb.vectorIndex.graphBuildCacheSize=<count>` | Max vectors cached during HNSW build (default: 10000) | `-Darcadedb.vectorIndex.graphBuildCacheSize=3000` |
+| `-Darcadedb.vectorIndex.mutationsBeforeRebuild=<count>` | Mutations before graph rebuild (default: 100) | `-Darcadedb.vectorIndex.mutationsBeforeRebuild=200` |
+
+**Vector Index Memory Tuning:**
+
+For applications that use vector indexes, bound the caches to control memory usage:
+
+```bash
+# Conservative: bounded caches for large vector datasets
+export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g -XX:MaxDirectMemorySize=8g \
+  -Darcadedb.vectorIndex.locationCacheSize=100000 \
+  -Darcadedb.vectorIndex.graphBuildCacheSize=3000"
+python vector_app.py
+```
+
+**Cache Size Guidelines:**
+
+- `locationCacheSize`: Number of vector locations (each ~56 bytes)
+  - 100000 entries ≈ 5.6 MB
+  - -1 = unlimited (backward compatible, may consume unbounded memory)
+  - Recommended: 100000 for datasets with 1M+ vectors
+
+- `graphBuildCacheSize`: Number of vectors during HNSW build
+  - Memory ≈ cacheSize × (dimensions × 4 + 64) bytes
+  - For 768-dim: 10000 entries ≈ 30 MB
+  - Lower values reduce build-time memory spikes
+  - Recommended: 3000-5000 for high-dimensional vectors
+
+**Memory Planning:**
+
+```text
+Total Process Memory = JVM Heap + Off-Heap Components
+
+Off-Heap Components:
+- Direct buffers (MaxDirectMemorySize)
+- Metaspace (class definitions)
+- Page cache
+- Thread stacks
+- Vector index caches (if bounded)
+
+Rule of thumb: Plan for 1.5-2× your heap size in actual RAM
+```
+
+**Example Configurations:**
+
+```bash
+# Small datasets (<1M records, <100K vectors)
+ARCADEDB_JVM_ARGS="-Xmx2g -Xms2g"
+
+# Medium datasets (1M-10M records, 100K-1M vectors)
+ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g -XX:MaxDirectMemorySize=8g"
+
+# Large datasets (10M+ records, 1M+ vectors) with bounded caches
+ARCADEDB_JVM_ARGS="-Xmx16g -Xms16g -XX:MaxDirectMemorySize=16g \
+  -Darcadedb.vectorIndex.locationCacheSize=100000 \
+  -Darcadedb.vectorIndex.graphBuildCacheSize=5000"
+
+# High-dimensional vectors (e.g., 1536-dim embeddings)
+ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g -XX:MaxDirectMemorySize=8g \
+  -Darcadedb.vectorIndex.locationCacheSize=50000 \
+  -Darcadedb.vectorIndex.graphBuildCacheSize=2000"
+```
+
+!!! warning "Configuration Timing"
+    `ARCADEDB_JVM_ARGS` must be set **before** the first `import arcadedb_embedded`. The
+    JVM can only be configured once per Python process.
+
+!!! tip "Related: ARCADEDB_JVM_ERROR_FILE"
+    Set the JVM crash log location:
+    ```bash
+    export ARCADEDB_JVM_ERROR_FILE="/var/log/arcade/errors.log"
+    ```
+
+#### Out of Memory Errors
+
+**Problem**: `OutOfMemoryError` or heap space errors
+
+**Solutions**:
+
+1. **Increase Heap via Environment Variable** (Recommended):
+   ```bash
+   export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"
+   python script.py
+   ```
+
+2. **Bound Vector Caches** (for vector workloads):
+   ```bash
+   export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g \
+     -Darcadedb.vectorIndex.locationCacheSize=100000 \
+     -Darcadedb.vectorIndex.graphBuildCacheSize=3000"
+   python script.py
    ```
 
-2. **Use Batch Processing**:
+3. **Use Batch Processing**:
    ```python
    batch_size = 1000
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        process_batch(batch)
    ```
 
-3. **Close ResultSets**:
+4. **Close ResultSets**:
    ```python
    result = db.query("sql", "SELECT FROM LargeTable")
    try:
diff --git a/bindings/python/docs/examples/04_csv_import_documents.md b/bindings/python/docs/examples/04_csv_import_documents.md
@@ -476,7 +476,7 @@ python 04_csv_import_documents.py --size small
 python 04_csv_import_documents.py --size large
 
 # With custom JVM heap for large datasets
-ARCADEDB_JVM_MAX_HEAP="8g" python 04_csv_import_documents.py --size large
+ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g" python 04_csv_import_documents.py --size large
 ```
 
 **Command-line options:**
diff --git a/bindings/python/docs/examples/05_csv_import_graph.md b/bindings/python/docs/examples/05_csv_import_graph.md
@@ -408,7 +408,6 @@ python 05_csv_import_graph.py --size small --method java --no-async --export
 ### JVM Settings
 
 ```bash
-export ARCADEDB_JVM_MAX_HEAP="8g"
-export ARCADEDB_JVM_ARGS="-Xms8g"
+export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"
 ```
 
diff --git a/bindings/python/docs/examples/06_vector_search_recommendations.md b/bindings/python/docs/examples/06_vector_search_recommendations.md
@@ -84,7 +84,7 @@ python 06_vector_search_recommendations.py --help
 
 **Recommendations:**
 - **Setup:** Use fresh copy or import from JSONL to avoid conflicts
-- **Memory:** 8GB JVM heap for large dataset (`ARCADEDB_JVM_MAX_HEAP="8g"`)
+- **Memory:** 8GB JVM heap for large dataset (`ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"`)
 - **Embeddings:** Cached automatically, use `--force-embed` to regenerate
 - **Models:** Both models included for comparison
 
diff --git a/bindings/python/docs/getting-started/installation.md b/bindings/python/docs/getting-started/installation.md
@@ -162,6 +162,37 @@ pip install --force-reinstall arcadedb-embedded \
   --extra-index-url https://pypi.org/simple/
 ```
 
+## JVM Configuration
+
+The bundled JVM can be configured via the `ARCADEDB_JVM_ARGS` environment variable **before** importing `arcadedb_embedded`:
+
+```bash
+# Default (4GB heap)
+python your_script.py
+
+# Custom memory for large datasets
+export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"
+python your_script.py
+```
+
+**Common Options:**
+
+JVM arguments use two flag types:
+
+- **`-X` and `-XX` flags**: JVM runtime options (heap, off-heap limits, GC, etc.)
+  - `-Xmx<size>`: Maximum heap memory (e.g., `-Xmx8g` for 8GB)
+  - `-Xms<size>`: Initial heap size (recommended: same as `-Xmx`)
+  - `-XX:MaxDirectMemorySize=<size>`: Limit off-heap buffers
+
+- **`-D` flags**: System properties for ArcadeDB configuration
+  - `-Darcadedb.vectorIndex.locationCacheSize=<count>`: Vector location cache limit
+  - `-Darcadedb.vectorIndex.graphBuildCacheSize=<count>`: HNSW build cache limit
+
+!!! warning "Set Before Import"
+    `ARCADEDB_JVM_ARGS` must be set **before** the first `import arcadedb_embedded` in your Python process. The JVM can only be configured once.
+
+For detailed configuration and memory tuning, see [Troubleshooting - Memory Configuration](../development/troubleshooting.md#memory-configuration).
+
 ## Next Steps
 
 - [Quick Start Guide](quickstart.md) - Get started in 5 minutes
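The "set before import" rule added above can also be followed from inside a Python entry point rather than the shell. The sketch below is a minimal illustration, assuming the binding reads `ARCADEDB_JVM_ARGS` from the process environment when it is first imported (the docs in this commit state the timing requirement, but `jvm.py` is not shown here). The helper name `configure_jvm_args` is hypothetical and not part of `arcadedb_embedded`.

```python
import os


def configure_jvm_args(heap="8g", extra=()):
    """Build a JVM argument string and publish it via ARCADEDB_JVM_ARGS.

    Hypothetical helper for illustration; not part of arcadedb_embedded.
    """
    args = [f"-Xmx{heap}", f"-Xms{heap}", *extra]
    # setdefault keeps any value the user already exported in the shell.
    return os.environ.setdefault("ARCADEDB_JVM_ARGS", " ".join(args))


effective = configure_jvm_args("8g")
print(f"ARCADEDB_JVM_ARGS={effective}")

# Import the package only after the variable is set, so the JVM starts
# with these settings (assumption: the variable is read at import time).
# import arcadedb_embedded as arcadedb
```

Using `setdefault` rather than plain assignment means a value exported in the shell still takes precedence, which matches the shell-first workflow the docs describe.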
diff --git a/bindings/python/examples/04_csv_import_documents.py b/bindings/python/examples/04_csv_import_documents.py
@@ -61,14 +61,14 @@
 4. Run with custom batch size:
    python 04_csv_import_documents.py --batch-size 10000
 5. Run with custom JVM heap, parallel threads, and batch size:
-   ARCADEDB_JVM_MAX_HEAP="8g" python 04_csv_import_documents.py --dataset movielens-large --parallel 8 --batch-size 10000
+   ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g" python 04_csv_import_documents.py --dataset movielens-large --parallel 8 --batch-size 10000
 
 The script will automatically download the dataset if it doesn't exist.
 
 Memory Requirements:
 - Small dataset (~100K ratings): 4GB heap (default) is sufficient
 - Large dataset (~33M ratings): 4GB heap (default) should work, 8GB for safety
-- Very large datasets (100M+ records): Set ARCADEDB_JVM_MAX_HEAP="8g" or higher
+- Very large datasets (100M+ records): Set ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g" or higher
 - Must be set BEFORE running the script (before JVM starts)
 
 Dataset Options:
@@ -1001,16 +1001,22 @@ def check_dataset_exists(data_dir):
 print()
 
 # Check JVM heap configuration for large imports
-jvm_heap = os.environ.get("ARCADEDB_JVM_MAX_HEAP")
-if jvm_heap:
-    print(f"💡 JVM Max Heap: {jvm_heap}")
+jvm_args = os.environ.get("ARCADEDB_JVM_ARGS")
+if jvm_args and "-Xmx" in jvm_args:
+    import re
+
+    match = re.search(r"-Xmx(\S+)", jvm_args)
+    heap_size = match.group(1) if match else "unknown"
+    print(f"💡 JVM Max Heap: {heap_size}")
 else:
     print("💡 JVM Max Heap: 4g (default)")
     print("   ℹ️  Using default JVM heap (4g)")
     if args.dataset == "movielens-large":
         print("   💡 For large datasets, you can increase it:")
-        print('      export ARCADEDB_JVM_MAX_HEAP="8g"  # or run with:')
-        print('      ARCADEDB_JVM_MAX_HEAP="8g" python 04_csv_import_documents.py')
+        print('      export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"  # or run with:')
+        print(
+            '      ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g" python 04_csv_import_documents.py'
+        )
 print()
 
 # -----------------------------------------------------------------------------
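The `-Xmx` extraction in this hunk is repeated in `07_stackoverflow_multimodel.py` and `importer.py`. One way to share it is a small helper like the sketch below. The name `max_heap_from_args` is hypothetical, and two choices here go beyond the diff: the pattern only accepts a numeric size with an optional unit suffix, and the last `-Xmx` occurrence is reported on the assumption that later flags override earlier ones. The `4g` fallback is only the label the scripts already print; whether the binding still applies that default when other arguments are supplied is not shown in this diff.

```python
import re

# Match "-Xmx<size>" only as a standalone argument (start of string or
# after whitespace), with an optional k/m/g/t unit suffix.
_XMX_PATTERN = re.compile(r"(?:^|\s)-Xmx(\d+[kKmMgGtT]?)(?=\s|$)")


def max_heap_from_args(jvm_args, default="4g"):
    """Return the -Xmx size found in a JVM argument string, else the default.

    Hypothetical helper for illustration; not part of arcadedb_embedded.
    """
    if not jvm_args:
        return default
    sizes = _XMX_PATTERN.findall(jvm_args)
    # Assumption: when -Xmx is given more than once, the last one wins.
    return sizes[-1] if sizes else default


print(max_heap_from_args("-Xmx8g -Xms8g"))
print(max_heap_from_args("-Xms8g"))
```

A call site would then reduce to `heap_size = max_heap_from_args(os.environ.get("ARCADEDB_JVM_ARGS"))`.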
diff --git a/bindings/python/examples/06_vector_search_recommendations.py b/bindings/python/examples/06_vector_search_recommendations.py
@@ -20,7 +20,7 @@
   - Best for real-time recommendations
 
 For the large dataset (20M ratings), use these environment variables:
-  ARCADEDB_JVM_MAX_HEAP="8g" ARCADEDB_JVM_ARGS="-Xms8g"
+  ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"
 
 KNOWN ISSUES: ArcadeDB Bugs and Limitations
 --------------------------------------------
diff --git a/bindings/python/examples/07_stackoverflow_multimodel.py b/bindings/python/examples/07_stackoverflow_multimodel.py
@@ -12,10 +12,10 @@
 to build a comprehensive knowledge graph with semantic search capabilities.
 
 Dataset Options (disk size → recommended JVM heap):
-- stackoverflow-tiny: ~34 MB → 2 GB (ARCADEDB_JVM_MAX_HEAP='2g' ARCADEDB_JVM_ARGS='-Xms2g')
-- stackoverflow-small: ~642 MB → 8 GB (ARCADEDB_JVM_MAX_HEAP='4g' ARCADEDB_JVM_ARGS='-Xms8g')
-- stackoverflow-medium: ~2.9 GB → 32 GB (ARCADEDB_JVM_MAX_HEAP='32g' ARCADEDB_JVM_ARGS='-Xms32g')
-- stackoverflow-large: ~323 GB → 64+ GB (ARCADEDB_JVM_MAX_HEAP='64g' ARCADEDB_JVM_ARGS='-Xms64g')
+- stackoverflow-tiny: ~34 MB → 2 GB (ARCADEDB_JVM_ARGS='-Xmx2g -Xms2g')
+- stackoverflow-small: ~642 MB → 8 GB (ARCADEDB_JVM_ARGS='-Xmx8g -Xms8g')
+- stackoverflow-medium: ~2.9 GB → 32 GB (ARCADEDB_JVM_ARGS='-Xmx32g -Xms32g')
+- stackoverflow-large: ~323 GB → 64+ GB (ARCADEDB_JVM_ARGS='-Xmx64g -Xms64g')
 
 Usage:
     # Phase 1 only (import + index)
@@ -6401,14 +6401,18 @@ def main():
         sys.exit(1)
 
     # Check JVM heap configuration
-    jvm_heap = os.environ.get("ARCADEDB_JVM_MAX_HEAP")
-    if jvm_heap:
-        print(f"💡 JVM Max Heap: {jvm_heap}")
+    jvm_args = os.environ.get("ARCADEDB_JVM_ARGS")
+    if jvm_args and "-Xmx" in jvm_args:
+        import re
+
+        match = re.search(r"-Xmx(\S+)", jvm_args)
+        heap_size = match.group(1) if match else "unknown"
+        print(f"💡 JVM Max Heap: {heap_size}")
     else:
         print("💡 JVM Max Heap: 4g (default)")
         if args.dataset in ["stackoverflow-medium", "stackoverflow-large"]:
             print("   ⚠️  Consider increasing heap for large datasets:")
-            print('      export ARCADEDB_JVM_MAX_HEAP="8g"')
+            print('      export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"')
     print()
 
     # Schema analysis mode
diff --git a/bindings/python/examples/benchmark-vector/README.md b/bindings/python/examples/benchmark-vector/README.md
@@ -117,7 +117,7 @@ indexing process. The reported **Build Time** for ArcadeDB is calculated as: `In
 Creation Time + Warmup Time`.
 
 **Note on Memory Usage**: The ArcadeDB benchmark was executed with a JVM heap limit of
-`ARCADEDB_JVM_MAX_HEAP='16g'`. However, we observed that the actual Resident Set Size
+`ARCADEDB_JVM_ARGS='-Xmx16g -Xms16g'`. However, we observed that the actual Resident Set Size
 (RSS) memory consumption exceeded this limit significantly, reaching as high as **41GB**
 in some test cases. This discrepancy suggests significant off-heap memory usage or other
 overheads that require further investigation in the future.
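For sizing the two vector caches, the guidelines this commit adds to `troubleshooting.md` (about 56 bytes per location entry, and `cacheSize × (dimensions × 4 + 64)` bytes for the HNSW build cache) can be turned into a quick estimate. The sketch below only restates that arithmetic. The function names are hypothetical, and the result covers just these two caches, so it does not explain the off-heap RSS growth noted above.

```python
def location_cache_bytes(entries, bytes_per_entry=56):
    """Estimated size of the vector location cache (~56 bytes per entry)."""
    return entries * bytes_per_entry


def graph_build_cache_bytes(cache_size, dimensions):
    """Estimated HNSW build cache size: 4-byte floats plus 64 bytes overhead."""
    return cache_size * (dimensions * 4 + 64)


def to_mb(num_bytes):
    """Convert bytes to decimal megabytes."""
    return num_bytes / 1_000_000


print(f"locationCacheSize=100000: {to_mb(location_cache_bytes(100_000)):.1f} MB")
print(f"graphBuildCacheSize=10000 @ 768-dim: {to_mb(graph_build_cache_bytes(10_000, 768)):.1f} MB")
print(f"graphBuildCacheSize=2000 @ 1536-dim: {to_mb(graph_build_cache_bytes(2_000, 1536)):.1f} MB")
```

For the 768-dimension case the formula gives about 31 MB in decimal units (roughly 30 MiB), in line with the "≈ 30 MB" figure in the guidelines.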
diff --git a/bindings/python/examples/benchmark-vector/run_with_memory_monitor.sh b/bindings/python/examples/benchmark-vector/run_with_memory_monitor.sh
@@ -6,7 +6,7 @@
 #   ./run_with_memory_monitor.sh <log_prefix> <python_command>
 #
 # Example:
-#   ./run_with_memory_monitor.sh vector_large "ARCADEDB_JVM_MAX_HEAP='8g' ARCADEDB_JVM_ARGS='-Xms8g' python 06_vector_search_recommendations.py --source-db my_test_databases/movielens_graph_large_db --db-path my_test_databases/movielens_graph_large_db_vectors"
+#   ./run_with_memory_monitor.sh vector_large "ARCADEDB_JVM_ARGS='-Xmx8g -Xms8g' python 06_vector_search_recommendations.py --source-db my_test_databases/movielens_graph_large_db --db-path my_test_databases/movielens_graph_large_db_vectors"
 #
 
 if [ $# -lt 2 ]; then
diff --git a/bindings/python/examples/run_with_memory_monitor.sh b/bindings/python/examples/run_with_memory_monitor.sh
@@ -6,7 +6,7 @@
 #   ./run_with_memory_monitor.sh <log_prefix> <python_command>
 #
 # Example:
-#   ./run_with_memory_monitor.sh vector_large "ARCADEDB_JVM_MAX_HEAP='8g' ARCADEDB_JVM_ARGS='-Xms8g' python 06_vector_search_recommendations.py --source-db my_test_databases/movielens_graph_large_db --db-path my_test_databases/movielens_graph_large_db_vectors"
+#   ./run_with_memory_monitor.sh vector_large "ARCADEDB_JVM_ARGS='-Xmx8g -Xms8g' python 06_vector_search_recommendations.py --source-db my_test_databases/movielens_graph_large_db --db-path my_test_databases/movielens_graph_large_db_vectors"
 #
 
 if [ $# -lt 2 ]; then
diff --git a/bindings/python/src/arcadedb_embedded/importer.py b/bindings/python/src/arcadedb_embedded/importer.py
@@ -510,17 +510,22 @@ def _import_using_java(
                     "outofmemoryerror",
                 ]
             ):
-                current_heap = os.environ.get("ARCADEDB_JVM_MAX_HEAP")
-                if current_heap:
-                    heap_msg = f"Current JVM heap: {current_heap}\n"
+                current_args = os.environ.get("ARCADEDB_JVM_ARGS")
+                if current_args and "-Xmx" in current_args:
+                    # Extract heap size from args
+                    import re
+
+                    match = re.search(r"-Xmx(\S+)", current_args)
+                    heap_size = match.group(1) if match else "unknown"
+                    heap_msg = f"Current JVM heap: {heap_size}\n"
                 else:
                     heap_msg = "Current JVM heap: 4g (default)\n"
 
                 raise ArcadeDBError(
                     f"Import failed ({format_type} -> {import_type}): Out of memory.\n"
                     f"{heap_msg}"
                     f"💡 Try increasing heap size with environment variable:\n"
-                    f'   export ARCADEDB_JVM_MAX_HEAP="8g"\n'
+                    f'   export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"\n'
                     f"   Note: Must be set BEFORE running Python (before JVM starts)\n"
                     f"Original error: {e}"
                 ) from e
diff --git a/bindings/python/src/arcadedb_embedded/jvm.py b/bindings/python/src/arcadedb_embedded/jvm.py