Commit 2a08aa7

fix: update JVM memory configuration references from ARCADEDB_JVM_MAX_HEAP to ARCADEDB_JVM_ARGS in documentation and examples
1 parent 9c86d85 commit 2a08aa7

15 files changed

Lines changed: 220 additions & 78 deletions

.github/workflows/test-python-bindings.yml

Lines changed: 4 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -406,6 +406,10 @@ jobs:
406406
echo "The arcadedb-embedded package has been successfully built and tested." >> $GITHUB_STEP_SUMMARY
407407
echo "" >> $GITHUB_STEP_SUMMARY
408408
echo "**Package**: arcadedb-embedded" >> $GITHUB_STEP_SUMMARY
409+
echo "" >> $GITHUB_STEP_SUMMARY
410+
echo "ℹ️ **Note**: Some platform/Python combinations are excluded from testing:" >> $GITHUB_STEP_SUMMARY
411+
echo "- Windows ARM64 + Python 3.10, 3.14 (no GitHub-hosted runners available)" >> $GITHUB_STEP_SUMMARY
412+
echo "- macOS x86_64 + Python 3.13, 3.14 (no suitable dependencies available)" >> $GITHUB_STEP_SUMMARY
409413
else
410414
echo "❌ **Some platforms failed testing**" >> $GITHUB_STEP_SUMMARY
411415
echo "" >> $GITHUB_STEP_SUMMARY

.github/workflows/test-python-examples.yml

Lines changed: 5 additions & 1 deletion
@@ -238,7 +238,7 @@ jobs:
238238
shell: bash
239239
env:
240240
# Increase JVM heap for large CSV imports (example 04)
241-
ARCADEDB_JVM_MAX_HEAP: "8g"
241+
ARCADEDB_JVM_ARGS: "-Xmx8g -Xms8g"
242242
run: |
243243
cd bindings/python/examples
244244
@@ -450,6 +450,10 @@ jobs:
450450
echo "All examples ran successfully across all 6 platforms." >> $GITHUB_STEP_SUMMARY
451451
echo "" >> $GITHUB_STEP_SUMMARY
452452
echo "**Platforms tested**: linux/amd64, linux/arm64, darwin/amd64, darwin/arm64, windows/amd64, windows/arm64" >> $GITHUB_STEP_SUMMARY
453+
echo "" >> $GITHUB_STEP_SUMMARY
454+
echo "ℹ️ **Note**: Some platform/Python combinations are excluded from testing:" >> $GITHUB_STEP_SUMMARY
455+
echo "- Windows ARM64 + Python 3.10, 3.14 (no GitHub-hosted runners available)" >> $GITHUB_STEP_SUMMARY
456+
echo "- macOS x86_64 + Python 3.13, 3.14 (no suitable dependencies available)" >> $GITHUB_STEP_SUMMARY
453457
else
454458
echo "❌ **Some platforms failed example testing**" >> $GITHUB_STEP_SUMMARY
455459
echo "" >> $GITHUB_STEP_SUMMARY

bindings/python/docs/development/troubleshooting.md

Lines changed: 116 additions & 11 deletions
@@ -235,31 +235,136 @@ rm ./mydb/.lock
235235

236236
---
237237

238-
### Memory Issues
238+
### Memory Configuration
239239

240-
**Problem**: Out of memory errors
240+
#### JVM Memory Configuration
241241

242-
**Solutions**:
242+
Configure JVM memory via the `ARCADEDB_JVM_ARGS` environment variable **before** importing `arcadedb_embedded`:
243243

244-
1. **Increase JVM Heap**:
245-
```python
246-
import jpype
244+
**Basic Configuration:**
247245

248-
# Set before first import
249-
jpype.startJVM("-Xmx4g") # 4GB heap
246+
```bash
247+
# Default: 4GB heap
248+
python script.py
250249

251-
import arcadedb_embedded as arcadedb
250+
# Production: 8GB heap with matching initial size
251+
export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"
252+
python script.py
253+
254+
# One-liner
255+
ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g" python script.py
256+
```
257+
258+
**Common JVM Options:**
259+
260+
| Option | Description | Example |
261+
|--------|-------------|----------|
262+
| `-Xmx<size>` | Maximum heap memory | `-Xmx8g` (8 gigabytes) |
263+
| `-Xms<size>` | Initial heap size (recommended: same as `-Xmx`) | `-Xms8g` |
264+
| `-XX:MaxDirectMemorySize=<size>` | Limit off-heap direct buffers | `-XX:MaxDirectMemorySize=8g` |
265+
| `-Darcadedb.vectorIndex.locationCacheSize=<count>` | Max vector locations to cache (default: -1 = unlimited) | `-Darcadedb.vectorIndex.locationCacheSize=100000` |
266+
| `-Darcadedb.vectorIndex.graphBuildCacheSize=<count>` | Max vectors cached during HNSW build (default: 10000) | `-Darcadedb.vectorIndex.graphBuildCacheSize=3000` |
267+
| `-Darcadedb.vectorIndex.mutationsBeforeRebuild=<count>` | Mutations before graph rebuild (default: 100) | `-Darcadedb.vectorIndex.mutationsBeforeRebuild=200` |
268+
269+
**Vector Index Memory Tuning:**
270+
271+
For applications using vector indexes, control memory usage:
272+
273+
```bash
274+
# Conservative: bounded caches for large vector datasets
275+
export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g -XX:MaxDirectMemorySize=8g \
276+
-Darcadedb.vectorIndex.locationCacheSize=100000 \
277+
-Darcadedb.vectorIndex.graphBuildCacheSize=3000"
278+
python vector_app.py
279+
```
280+
281+
**Cache Size Guidelines:**
282+
283+
- `locationCacheSize`: Number of vector locations (each ~56 bytes)
284+
- 100000 entries ≈ 5.6 MB
285+
- -1 = unlimited (backward compatible, may consume unbounded memory)
286+
- Recommended: 100000 for datasets with 1M+ vectors
287+
288+
- `graphBuildCacheSize`: Number of vectors during HNSW build
289+
- Memory ≈ cacheSize × (dimensions × 4 + 64) bytes
290+
- For 768-dim: 10000 entries ≈ 30 MB
291+
- Lower values reduce build-time memory spikes
292+
- Recommended: 3000-5000 for high-dimensional vectors
293+
294+
**Memory Planning:**
295+
296+
```text
297+
Total Process Memory = JVM Heap + Off-Heap Components
298+
299+
Off-Heap Components:
300+
- Direct buffers (MaxDirectMemorySize)
301+
- Metaspace (class definitions)
302+
- Page cache
303+
- Thread stacks
304+
- Vector index caches (if bounded)
305+
306+
Rule of thumb: Plan for 1.5-2× your heap size in actual RAM
307+
```
308+
309+
**Example Configurations:**
310+
311+
```bash
312+
# Small datasets (<1M records, <100K vectors)
313+
ARCADEDB_JVM_ARGS="-Xmx2g -Xms2g"
314+
315+
# Medium datasets (1M-10M records, 100K-1M vectors)
316+
ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g -XX:MaxDirectMemorySize=8g"
317+
318+
# Large datasets (10M+ records, 1M+ vectors) with bounded caches
319+
ARCADEDB_JVM_ARGS="-Xmx16g -Xms16g -XX:MaxDirectMemorySize=16g \
320+
-Darcadedb.vectorIndex.locationCacheSize=100000 \
321+
-Darcadedb.vectorIndex.graphBuildCacheSize=5000"
322+
323+
# High-dimensional vectors (e.g., 1536-dim embeddings)
324+
ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g -XX:MaxDirectMemorySize=8g \
325+
-Darcadedb.vectorIndex.locationCacheSize=50000 \
326+
-Darcadedb.vectorIndex.graphBuildCacheSize=2000"
327+
```
328+
329+
!!! warning "Configuration Timing"
330+
`ARCADEDB_JVM_ARGS` must be set **before** the first `import arcadedb_embedded`. The
331+
JVM can only be configured once per Python process.
332+
333+
!!! tip "Alternative: ARCADEDB_JVM_ERROR_FILE"
334+
Set crash log location:
335+
```bash
336+
export ARCADEDB_JVM_ERROR_FILE="/var/log/arcade/errors.log"
337+
```
338+
339+
#### Out of Memory Errors
340+
341+
**Problem**: `OutOfMemoryError` or heap space errors
342+
343+
**Solutions**:
344+
345+
1. **Increase Heap via Environment Variable** (Recommended):
346+
```bash
347+
export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"
348+
python script.py
349+
```
350+
351+
2. **Bound Vector Caches** (for vector workloads):
352+
```bash
353+
export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g \
354+
-Darcadedb.vectorIndex.locationCacheSize=100000 \
355+
-Darcadedb.vectorIndex.graphBuildCacheSize=3000"
356+
python script.py
252357
```
253358

254-
2. **Use Batch Processing**:
359+
3. **Use Batch Processing**:
255360
```python
256361
batch_size = 1000
257362
for i in range(0, len(data), batch_size):
258363
batch = data[i:i + batch_size]
259364
process_batch(batch)
260365
```
261366

262-
3. **Close ResultSets**:
367+
4. **Close ResultSets**:
263368
```python
264369
result = db.query("sql", "SELECT FROM LargeTable")
265370
try:
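
Editor's note on the cache size guidelines added above: the arithmetic can be checked with a minimal sketch. The function names are hypothetical and are not part of `arcadedb_embedded`; the constants (56 bytes per location, `dimensions × 4 + 64` bytes per cached vector, 1.5-2× heap for total RAM) are taken from the documentation text in this hunk.

```python
# Hypothetical helpers that reproduce the estimates quoted in troubleshooting.md.
# They are illustrative only and do not query ArcadeDB or the JVM.

def location_cache_bytes(entries: int, bytes_per_entry: int = 56) -> int:
    """Approximate memory for locationCacheSize entries (~56 bytes each)."""
    return entries * bytes_per_entry


def graph_build_cache_bytes(entries: int, dimensions: int) -> int:
    """Approximate memory for graphBuildCacheSize: entries x (dims x 4 + 64)."""
    return entries * (dimensions * 4 + 64)


def planned_ram_range_gb(heap_gb: float) -> tuple:
    """Rule of thumb from the docs: plan for 1.5-2x the heap size in RAM."""
    return (1.5 * heap_gb, 2.0 * heap_gb)


if __name__ == "__main__":
    print(location_cache_bytes(100_000) / 1e6, "MB for 100000 locations")
    print(graph_build_cache_bytes(10_000, 768) / 1e6, "MB for 10000 768-dim vectors")
    print(planned_ram_range_gb(8), "GB of RAM for an 8 GB heap")
```

With these formulas, 10000 cached 768-dimensional vectors come to roughly 31 MB, which matches the "≈ 30 MB" figure in the documentation.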

bindings/python/docs/examples/04_csv_import_documents.md

Lines changed: 1 addition & 1 deletion
@@ -476,7 +476,7 @@ python 04_csv_import_documents.py --size small
476476
python 04_csv_import_documents.py --size large
477477

478478
# With custom JVM heap for large datasets
479-
ARCADEDB_JVM_MAX_HEAP="8g" python 04_csv_import_documents.py --size large
479+
ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g" python 04_csv_import_documents.py --size large
480480
```
481481

482482
**Command-line options:**

bindings/python/docs/examples/05_csv_import_graph.md

Lines changed: 1 addition & 1 deletion
@@ -408,7 +408,7 @@ python 05_csv_import_graph.py --size small --method java --no-async --export
408408
### JVM Settings
409409

410410
```bash
411-
export ARCADEDB_JVM_MAX_HEAP="8g"
411+
export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"
412412
export ARCADEDB_JVM_ARGS="-Xms8g"
413413
```
414414

bindings/python/docs/examples/06_vector_search_recommendations.md

Lines changed: 1 addition & 1 deletion
@@ -84,7 +84,7 @@ python 06_vector_search_recommendations.py --help
8484

8585
**Recommendations:**
8686
- **Setup:** Use fresh copy or import from JSONL to avoid conflicts
87-
- **Memory:** 8GB JVM heap for large dataset (`ARCADEDB_JVM_MAX_HEAP="8g"`)
87+
- **Memory:** 8GB JVM heap for large dataset (`ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"`)
8888
- **Embeddings:** Cached automatically, use `--force-embed` to regenerate
8989
- **Models:** Both models included for comparison
9090

bindings/python/docs/getting-started/installation.md

Lines changed: 31 additions & 0 deletions
@@ -162,6 +162,37 @@ pip install --force-reinstall arcadedb-embedded \
162162
--extra-index-url https://pypi.org/simple/
163163
```
164164

165+
## JVM Configuration
166+
167+
The bundled JVM can be configured via the `ARCADEDB_JVM_ARGS` environment variable **before** importing `arcadedb_embedded`:
168+
169+
```bash
170+
# Default (4GB heap)
171+
python your_script.py
172+
173+
# Custom memory for large datasets
174+
export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"
175+
python your_script.py
176+
```
177+
178+
**Common Options:**
179+
180+
JVM arguments use two flag types:
181+
182+
- **`-X` flags**: JVM runtime options (heap, GC, etc.)
183+
- `-Xmx<size>`: Maximum heap memory (e.g., `-Xmx8g` for 8GB)
184+
- `-Xms<size>`: Initial heap size (recommended: same as `-Xmx`)
185+
- `-XX:MaxDirectMemorySize=<size>`: Limit off-heap buffers
186+
187+
- **`-D` flags**: System properties for ArcadeDB configuration
188+
- `-Darcadedb.vectorIndex.locationCacheSize=<count>`: Vector location cache limit
189+
- `-Darcadedb.vectorIndex.graphBuildCacheSize=<count>`: HNSW build cache limit
190+
191+
!!! warning "Set Before Import"
192+
`ARCADEDB_JVM_ARGS` must be set **before** the first `import arcadedb_embedded` in your Python process. The JVM can only be configured once.
193+
194+
For detailed configuration and memory tuning, see [Troubleshooting - Memory Configuration](../development/troubleshooting.md#memory-configuration).
195+
165196
## Next Steps
166197

167198
- [Quick Start Guide](quickstart.md) - Get started in 5 minutes
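
Editor's note on the "Set Before Import" warning above: when exporting the variable in a shell is inconvenient, one possible approach is to set it from Python before the first import. This is a sketch under an assumption the commit does not confirm, namely that the package reads `ARCADEDB_JVM_ARGS` from the process environment at import time. `configure_jvm` is a hypothetical helper, not part of the package.

```python
import os


def configure_jvm(heap: str = "8g", extra_args: tuple = ()) -> str:
    """Set ARCADEDB_JVM_ARGS unless it is already present in the environment.

    Returns the value that is in effect afterwards. A value exported by the
    caller's shell is left untouched, so the shell setting takes precedence.
    """
    args = " ".join([f"-Xmx{heap}", f"-Xms{heap}", *extra_args])
    return os.environ.setdefault("ARCADEDB_JVM_ARGS", args)


# Call this before the first import of the package.
effective = configure_jvm("8g")
print(f"ARCADEDB_JVM_ARGS={effective}")

# Assumption: the import below would pick up the variable set above.
# import arcadedb_embedded as arcadedb
```

Using `setdefault` keeps an explicitly exported value in control, which fits the one-liner usage shown in the documentation.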

bindings/python/examples/04_csv_import_documents.py

Lines changed: 13 additions & 7 deletions
@@ -61,14 +61,14 @@
6161
4. Run with custom batch size:
6262
python 04_csv_import_documents.py --batch-size 10000
6363
5. Run with custom JVM heap, parallel threads, and batch size:
64-
ARCADEDB_JVM_MAX_HEAP="8g" python 04_csv_import_documents.py --dataset movielens-large --parallel 8 --batch-size 10000
64+
ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g" python 04_csv_import_documents.py --dataset movielens-large --parallel 8 --batch-size 10000
6565
6666
The script will automatically download the dataset if it doesn't exist.
6767
6868
Memory Requirements:
6969
- Small dataset (~100K ratings): 4GB heap (default) is sufficient
7070
- Large dataset (~33M ratings): 4GB heap (default) should work, 8GB for safety
71-
- Very large datasets (100M+ records): Set ARCADEDB_JVM_MAX_HEAP="8g" or higher
71+
- Very large datasets (100M+ records): Set ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g" or higher
7272
- Must be set BEFORE running the script (before JVM starts)
7373
7474
Dataset Options:
@@ -1001,16 +1001,22 @@ def check_dataset_exists(data_dir):
10011001
print()
10021002

10031003
# Check JVM heap configuration for large imports
1004-
jvm_heap = os.environ.get("ARCADEDB_JVM_MAX_HEAP")
1005-
if jvm_heap:
1006-
print(f"💡 JVM Max Heap: {jvm_heap}")
1004+
jvm_args = os.environ.get("ARCADEDB_JVM_ARGS")
1005+
if jvm_args and "-Xmx" in jvm_args:
1006+
import re
1007+
1008+
match = re.search(r"-Xmx(\S+)", jvm_args)
1009+
heap_size = match.group(1) if match else "unknown"
1010+
print(f"💡 JVM Max Heap: {heap_size}")
10071011
else:
10081012
print("💡 JVM Max Heap: 4g (default)")
10091013
print(" ℹ️ Using default JVM heap (4g)")
10101014
if args.dataset == "movielens-large":
10111015
print(" 💡 For large datasets, you can increase it:")
1012-
print(' export ARCADEDB_JVM_MAX_HEAP="8g" # or run with:')
1013-
print(' ARCADEDB_JVM_MAX_HEAP="8g" python 04_csv_import_documents.py')
1016+
print(' export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g" # or run with:')
1017+
print(
1018+
' ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g" python 04_csv_import_documents.py'
1019+
)
10141020
print()
10151021

10161022
# -----------------------------------------------------------------------------
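
Editor's note on the heap check added in this hunk (and repeated in `07_stackoverflow_multimodel.py`): the same logic could be factored into one helper so that both examples share it. The sketch below is hypothetical and not part of the commit. It uses the same `-Xmx(\S+)` pattern as the diff and falls back to the 4g default stated in the documentation.

```python
import os
import re

DEFAULT_HEAP = "4g"  # default heap size stated in the docs touched by this commit


def max_heap_from_args(jvm_args):
    """Return the value following -Xmx in a JVM argument string.

    Falls back to DEFAULT_HEAP when the string is empty, None, or has no -Xmx.
    """
    if not jvm_args:
        return DEFAULT_HEAP
    match = re.search(r"-Xmx(\S+)", jvm_args)
    return match.group(1) if match else DEFAULT_HEAP


if __name__ == "__main__":
    heap = max_heap_from_args(os.environ.get("ARCADEDB_JVM_ARGS"))
    print(f"JVM Max Heap: {heap}")
```

Unlike the code in the diff, this variant reports the default rather than "unknown" when `-Xmx` has no value after it; that is a design choice of the sketch, not the behavior of the commit.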

bindings/python/examples/06_vector_search_recommendations.py

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@
2020
- Best for real-time recommendations
2121
2222
For the large dataset (20M ratings), use these environment variables:
23-
ARCADEDB_JVM_MAX_HEAP="8g" ARCADEDB_JVM_ARGS="-Xms8g"
23+
ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"
2424
2525
KNOWN ISSUES: ArcadeDB Bugs and Limitations
2626
--------------------------------------------

bindings/python/examples/07_stackoverflow_multimodel.py

Lines changed: 12 additions & 8 deletions
@@ -12,10 +12,10 @@
1212
to build a comprehensive knowledge graph with semantic search capabilities.
1313
1414
Dataset Options (disk size → recommended JVM heap):
15-
- stackoverflow-tiny: ~34 MB → 2 GB (ARCADEDB_JVM_MAX_HEAP='2g' ARCADEDB_JVM_ARGS='-Xms2g')
16-
- stackoverflow-small: ~642 MB → 8 GB (ARCADEDB_JVM_MAX_HEAP='4g' ARCADEDB_JVM_ARGS='-Xms8g')
17-
- stackoverflow-medium: ~2.9 GB → 32 GB (ARCADEDB_JVM_MAX_HEAP='32g' ARCADEDB_JVM_ARGS='-Xms32g')
18-
- stackoverflow-large: ~323 GB → 64+ GB (ARCADEDB_JVM_MAX_HEAP='64g' ARCADEDB_JVM_ARGS='-Xms64g')
15+
- stackoverflow-tiny: ~34 MB → 2 GB (ARCADEDB_JVM_ARGS='-Xmx2g -Xms2g')
16+
- stackoverflow-small: ~642 MB → 8 GB (ARCADEDB_JVM_ARGS='-Xmx8g -Xms8g')
17+
- stackoverflow-medium: ~2.9 GB → 32 GB (ARCADEDB_JVM_ARGS='-Xmx32g -Xms32g')
18+
- stackoverflow-large: ~323 GB → 64+ GB (ARCADEDB_JVM_ARGS='-Xmx64g -Xms64g')
1919
2020
Usage:
2121
# Phase 1 only (import + index)
@@ -6401,14 +6401,18 @@ def main():
64016401
sys.exit(1)
64026402

64036403
# Check JVM heap configuration
6404-
jvm_heap = os.environ.get("ARCADEDB_JVM_MAX_HEAP")
6405-
if jvm_heap:
6406-
print(f"💡 JVM Max Heap: {jvm_heap}")
6404+
jvm_args = os.environ.get("ARCADEDB_JVM_ARGS")
6405+
if jvm_args and "-Xmx" in jvm_args:
6406+
import re
6407+
6408+
match = re.search(r"-Xmx(\S+)", jvm_args)
6409+
heap_size = match.group(1) if match else "unknown"
6410+
print(f"💡 JVM Max Heap: {heap_size}")
64076411
else:
64086412
print("💡 JVM Max Heap: 4g (default)")
64096413
if args.dataset in ["stackoverflow-medium", "stackoverflow-large"]:
64106414
print(" ⚠️ Consider increasing heap for large datasets:")
6411-
print(' export ARCADEDB_JVM_MAX_HEAP="8g"')
6415+
print(' export ARCADEDB_JVM_ARGS="-Xmx8g -Xms8g"')
64126416
print()
64136417

64146418
# Schema analysis mode
