Commit 1139c05

Enhance Python bindings for ArcadeDB with performance improvements and new features
- Updated benchmarking results in `results.md` to reflect current performance metrics and improvements.
- Added `get_java_database` method to the `Database` class for internal integrations.
- Refactored the `export_database` function in `exporter.py` to improve parameter naming and error handling.
- Introduced `iter_dicts` method in `ResultSet` for more efficient iteration over results.
- Enhanced `Document` and `Result` classes with caching for property names and improved type conversion handling.
- Improved error handling and type checking in various methods across the codebase.
1 parent 95fb0be commit 1139c05

17 files changed

Lines changed: 5235 additions & 11074 deletions

bindings/python/docs/api/record-wrappers.md

Lines changed: 4 additions & 0 deletions
@@ -136,6 +136,10 @@ with db.transaction():
 Convert the document to a Python dictionary of its properties (metadata like RID/type
 is not included). Use `get_rid()` for the record ID if needed.

+**Performance note:** `to_dict()` eagerly converts the full document into Python
+data. For large scans or repeated wrapper access, prefer `get()` when you only need
+specific fields.
+
 ```python
 doc_dict = doc.to_dict()
 print(doc_dict)

bindings/python/docs/api/results.md

Lines changed: 16 additions & 1 deletion
@@ -276,6 +276,11 @@ for person in people:
 - Passing data to other libraries
 - Debugging/inspection

+**Performance note:** `to_dict()` eagerly converts the current row to Python data.
+That is convenient for small projections and interop, but repeated `to_dict()` calls
+across a large result set will allocate Python objects for every returned property.
+For large scans, prefer iterating and reading only the fields you need with `get()`.
+
 ---

 ### `to_json() -> str`
@@ -307,6 +312,10 @@ for result in result_set:

 ### Converting to Lists and Dicts

+`ResultSet.to_list()` and `Result.to_dict()` are eager materializers. They are the
+right choice when you explicitly want Python-native data, but they are not the
+lowest-overhead path for large result sets.
+
 ```python
 # List of dictionaries (most common)
 result_set = db.query("sql", "SELECT FROM User")
@@ -352,7 +361,13 @@ print(df.head())
 For memory efficiency with large datasets:

 ```python
-# Process in batches
+# Stream one row at a time when you only need a few fields
+result_set = db.query("sql", "SELECT name, email FROM LargeTable")
+
+for result in result_set:
+    process_row(result.get("name"), result.get("email"))
+
+# Or process in batches when you do need dict materialization
 result_set = db.query("sql", "SELECT FROM LargeTable")

 batch = []

bindings/python/docs/development/troubleshooting.md

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ Common issues, solutions, and debugging techniques for ArcadeDB Python bindings.
 (no external Java install is needed):

 ```bash
-uv pip uninstall -y arcadedb-embedded
+uv pip uninstall arcadedb-embedded
 uv pip install --no-cache-dir arcadedb-embedded
 ```

bindings/python/docs/guide/core/queries.md

Lines changed: 32 additions & 0 deletions
@@ -372,6 +372,10 @@ records share the same exact-match value, such as `customerId`, `status`, or `co

 ### ResultSet Methods

+Use `first()` or direct iteration when you want the lowest-overhead path.
+`to_list()` eagerly materializes the full result set into Python dictionaries, so it
+is best reserved for smaller results or explicit interop steps.
+
 ```python
 # first() - get first result
 result = db.query("sql", "SELECT FROM Person ORDER BY name")
@@ -393,6 +397,34 @@ else:
     print("No results found")
 ```

+### Performance and Materialization
+
+Use lazy access inside the hot path, and materialize only when you explicitly need
+Python-native containers.
+
+- Use `first()` when you only need one row.
+- Use direct iteration plus `get()` for large scans and request-time processing.
+- Use `to_list()` when you need to keep all rows in Python, hand them to another
+library, or serialize them as a batch.
+- Use `iter_chunks()` when you need batch processing without loading everything at
+once.
+- Use wrapper `to_dict()` only when you truly want the full document in Python.
+
+```python
+# Lowest-overhead path for large scans
+result = db.query("sql", "SELECT name, score FROM Item WHERE score > ?", 100)
+for row in result:
+    handle(row.get("name"), row.get("score"))
+
+# Materialize only at the boundary where Python-native data is needed
+result = db.query("sql", "SELECT name, score FROM Item WHERE score > ?", 100)
+payload = result.to_list()
+
+# For wrappers, prefer field access over full dict conversion in large loops
+for doc in db.query("sql", "SELECT FROM Person"):
+    process(doc.get("name"), doc.get("city"))
+```
+
 ## OpenCypher

 OpenCypher provides expressive graph pattern matching.
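
Reviewer sketch (not part of the commit): the guidance added to `queries.md` above can be illustrated without a database. The snippet below is a minimal model under stated assumptions: `FakeRow`, `make_rows`, and `chunked` are hypothetical stand-ins invented for this note, not ArcadeDB classes or methods, and the conversion counter only approximates the per-property work that eager materialization implies.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


class FakeRow:
    """Hypothetical stand-in for a result row (not the ArcadeDB Result class)."""

    conversions = 0  # class-wide count of simulated per-property conversions

    def __init__(self, raw: Dict[str, Any]) -> None:
        self._raw = raw

    def get(self, name: str) -> Any:
        # Lazy path: only the requested property is converted.
        FakeRow.conversions += 1
        return self._raw.get(name)

    def to_dict(self) -> Dict[str, Any]:
        # Eager path: every property of the row is converted.
        FakeRow.conversions += len(self._raw)
        return dict(self._raw)


def make_rows(n: int) -> Iterator[FakeRow]:
    """Yield rows one at a time, similar to iterating a streaming result set."""
    for i in range(n):
        yield FakeRow({"name": f"item{i}", "score": i, "city": "x", "notes": "y"})


def chunked(rows: Iterable[FakeRow], size: int) -> Iterator[List[FakeRow]]:
    """Hypothetical helper: group an iterable into lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Lazy field access: one conversion per row.
FakeRow.conversions = 0
lazy_total = sum(row.get("score") for row in make_rows(1000))
lazy_conversions = FakeRow.conversions

# Eager materialization: one conversion per property per row.
FakeRow.conversions = 0
eager_total = sum(row.to_dict()["score"] for row in make_rows(1000))
eager_conversions = FakeRow.conversions

# Batch processing without holding every row at once.
batch_sizes = [len(batch) for batch in chunked(make_rows(10), 4)]
```

In this model both paths compute the same total, but the eager path performs four times as many conversions because each stand-in row carries four properties. The real ratio in the bindings depends on row width and on how type conversion is implemented, which this sketch does not measure.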

bindings/python/examples/12_vector_search.py

Lines changed: 8 additions & 14 deletions
@@ -604,14 +604,13 @@ def _extract_result_id(rec) -> int | None:
 "sql",
 sql,
 ).first()
-latencies_ms.append((time.perf_counter() - start) * 1000)
-
 neighbors = row.get("res") if row else []
 result_ids: List[int] = []
 for rec in neighbors:
     rid = _extract_result_id(rec)
     if rid is not None:
         result_ids.append(rid)
+latencies_ms.append((time.perf_counter() - start) * 1000)

 gt_list = gt_full.get(qid)
 if not gt_list:
@@ -656,16 +655,14 @@ def search_faiss(
 hnsw.efSearch = int(ef_search)

 for q_idx, qid in enumerate(qids):
+    start = time.perf_counter()
     qvec = np.ascontiguousarray(
         queries[q_idx : q_idx + 1].astype("float32", copy=True)
     )
     faiss.normalize_L2(qvec)
-
-    start = time.perf_counter()
     _dist, ids = index.search(qvec, int(k))
-    latencies_ms.append((time.perf_counter() - start) * 1000)
-
     result_ids = [int(doc_id) for doc_id in ids[0].tolist() if int(doc_id) >= 0]
+    latencies_ms.append((time.perf_counter() - start) * 1000)

     gt_list = gt_full.get(qid)
     if not gt_list:
@@ -732,13 +729,13 @@ def search_lancedb(
 except Exception:
     pass
 rows = search.to_list()
-latencies_ms.append((time.perf_counter() - start) * 1000)

 result_ids: List[int] = []
 for row in rows:
     rid = row.get("id") if isinstance(row, dict) else None
     if rid is not None:
         result_ids.append(int(rid))
+latencies_ms.append((time.perf_counter() - start) * 1000)

 gt_list = gt_full.get(qid)
 if not gt_list:
@@ -787,9 +784,8 @@ def search_bruteforce(
 else:
     candidate_idx = np.argpartition(scores, -topk)[-topk:]
     ranked_idx = candidate_idx[np.argsort(scores[candidate_idx])[::-1]]
-latencies_ms.append((time.perf_counter() - start) * 1000)
-
 result_ids = [int(doc_id) for doc_id in ranked_idx.tolist()]
+latencies_ms.append((time.perf_counter() - start) * 1000)

 gt_list = gt_full.get(qid)
 if not gt_list:
@@ -841,9 +837,8 @@ def search_pgvector(
 (vector_to_pg_literal(queries[q_idx]), int(k)),
 )
 rows = cur.fetchall()
-latencies_ms.append((time.perf_counter() - start) * 1000)
-
 result_ids = [int(row[0]) for row in rows]
+latencies_ms.append((time.perf_counter() - start) * 1000)
 gt_list = gt_full.get(qid)
 if not gt_list:
     continue
@@ -886,14 +881,13 @@ def search_qdrant(
 with_payload=False,
 with_vectors=False,
 )
-latencies_ms.append((time.perf_counter() - start) * 1000)
-
 points = getattr(response, "points", response)
 result_ids: List[int] = []
 for point in points:
     point_id = getattr(point, "id", None)
     if point_id is not None:
         result_ids.append(int(point_id))
+latencies_ms.append((time.perf_counter() - start) * 1000)

 gt_list = gt_full.get(qid)
 if not gt_list:
@@ -1077,10 +1071,10 @@ def is_transient_milvus_search_error(exc: Exception) -> bool:

 if rows is None:
     raise RuntimeError("Milvus search returned no result rows")
-latencies_ms.append((time.perf_counter() - start) * 1000)

 hits = rows[0] if rows else []
 result_ids = [int(getattr(hit, "id", -1)) for hit in hits]
+latencies_ms.append((time.perf_counter() - start) * 1000)

 gt_list = gt_full.get(qid)
 if not gt_list:
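
Reviewer sketch (not part of the commit): every hunk in `12_vector_search.py` moves the `latencies_ms.append(...)` call so the timed window also covers extracting result IDs, and in the FAISS case the query-vector preparation as well. Assuming that is the intent, the pattern looks roughly like the following. `timed_search` and `fake_search` are hypothetical names used only for illustration and do not appear in the example script.

```python
import time
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Row = Dict[str, Optional[int]]


def timed_search(
    search: Callable[[int], Sequence[Row]],
    query_ids: Sequence[int],
) -> Tuple[List[List[int]], List[float]]:
    """Run `search` per query and time the call plus ID extraction together."""
    all_ids: List[List[int]] = []
    latencies_ms: List[float] = []
    for qid in query_ids:
        start = time.perf_counter()
        rows = search(qid)
        result_ids = [int(row["id"]) for row in rows if row.get("id") is not None]
        # Stop the clock only after the IDs are usable, so every backend is
        # charged for the same amount of client-side work.
        latencies_ms.append((time.perf_counter() - start) * 1000)
        all_ids.append(result_ids)
    return all_ids, latencies_ms


def fake_search(qid: int) -> List[Row]:
    """Hypothetical backend returning two hits and one row without an ID."""
    return [{"id": qid}, {"id": qid + 1}, {"id": None}]


ids, latencies = timed_search(fake_search, [10, 20])
```

Timing to the same point for every backend makes the per-query numbers comparable, at the cost of including Python-side overhead that is not attributable to the index itself.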

bindings/python/examples/benchmark_results/summary_10_graph_olap_all_datasets.md

Lines changed: 0 additions & 94 deletions
This file was deleted.

bindings/python/examples/scripts/run_11_vector_index_build_matrix.sh

Lines changed: 4 additions & 4 deletions
@@ -15,16 +15,16 @@ source "$HELPERS_SH"
 # Large 10,000 32GB 16
 # X-Large 25,000 64GB 32

-DATASET="stackoverflow-medium"
-BATCH_SIZE=5000
-MEM_LIMIT="8g"
+DATASET="stackoverflow-large"
+BATCH_SIZE=10000
+MEM_LIMIT="32g"
 THREADS=8
 RUNS=1
 SEED_START=0
 SERVER_FRACTION="0.8"
 MAX_CONNECTIONS=16
 BEAM_WIDTH=100
-QUANTIZATION="INT8"
+QUANTIZATION="NONE"
 STORE_VECTORS_IN_GRAPH=false
 ADD_HIERARCHY=true
 JVM_HEAP_FRACTION="0.80"

bindings/python/examples/scripts/run_12_vector_search_matrix.sh

Lines changed: 2 additions & 2 deletions
@@ -15,8 +15,8 @@ source "$HELPERS_SH"
 # Large 16GB 16
 # X-Large 32GB 32

-DATASET="stackoverflow-large"
-MEM_LIMIT="16g"
+DATASET="stackoverflow-medium"
+MEM_LIMIT="4g"
 THREADS=8
 RUNS=1
 SEED_START=0
