Commit 8d962cc

Refactor vector index scripts and enhance ArcadeDB integration
- Updated `run_11_vector_index_build_matrix.sh` to use a smaller dataset and reduced resource limits for testing.
- Modified `run_12_vector_search_matrix.sh` to replace overquery factors with ef_search values for improved clarity and consistency.
- Adjusted `summarize_12_vector_search_matrix.sh` to reflect changes in search parameters and updated documentation accordingly.
- Enhanced `arcadedb_embedded/core.py` and `arcadedb_embedded/vector.py` to improve handling of PRODUCT quantization and ef_search parameters.
- Updated tests in `test_vector.py` to validate new ef_search functionality and ensure proper error handling for invalid inputs.
- Increased test coverage by adding scenarios for PRODUCT quantization requirements and ef_search validation.
1 parent 70173a7 commit 8d962cc

31 files changed

Lines changed: 479 additions & 537 deletions

.github/workflows/test-python-examples.yml

Lines changed: 1 addition & 1 deletion
@@ -367,7 +367,7 @@ jobs:
    echo ""
    continue
    fi
-   example_args="--backend arcadedb_sql --dataset stackoverflow-tiny --db-path $db_path --overquery-factors 1 --k 10 --query-limit 100 --query-runs 1 --query-order fixed --threads 1 --mem-limit 2g --run-label ci12_arcadedb_sql"
+   example_args="--backend arcadedb_sql --dataset stackoverflow-tiny --db-path $db_path --ef-search-values 100 --k 10 --query-limit 100 --query-runs 1 --query-order fixed --threads 1 --mem-limit 2g --run-label ci12_arcadedb_sql"
    example_name="$example (vector search, arcadedb_sql backend, minimal)"
    timeout_duration=1200
    example_jvm_args=""
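
The hunk above swaps `--overquery-factors 1` for `--ef-search-values 100`. As a rough illustration of how such a flag might be consumed, here is a minimal sketch. The comma-separated format, the `parse_ef_search_values` helper, and the default of `[100]` are assumptions for illustration; the example script's real parser is not shown in this commit view.

```python
import argparse


def parse_ef_search_values(text):
    """Parse a string such as "50,100,200" into positive ints (assumed format)."""
    values = [int(part) for part in text.split(",") if part.strip()]
    if not values or any(value <= 0 for value in values):
        raise argparse.ArgumentTypeError("ef_search values must be positive integers")
    return values


def build_parser():
    # Hypothetical parser: only the two flags relevant to the sketch are defined.
    parser = argparse.ArgumentParser(description="vector search matrix (sketch)")
    parser.add_argument("--ef-search-values", type=parse_ef_search_values, default=[100])
    parser.add_argument("--k", type=int, default=10)
    return parser


# Mirrors the flags used by the CI invocation in the hunk above.
args = build_parser().parse_args(["--ef-search-values", "100", "--k", "10"])
print(args.ef_search_values, args.k)
```

Each parsed value would then drive one search run, so a sweep such as `--ef-search-values 32,100,200` yields three runs under this assumed format.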

bindings/python/README.md

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@

    Native Python bindings for ArcadeDB - the multi-model database that supports Graph, Document, Key/Value, Search Engine, Time Series, and Vector models.

-   **Status**: ✅ Production Ready | **Tests**: 279 Passed Across 27 Test Files | **Platforms**: 4 Supported
+   **Status**: ✅ Production Ready | **Tests**: 282 Passed | **Platforms**: 4 Supported

    ---

@@ -92,7 +92,7 @@ Import: `import arcadedb_embedded as arcadedb`

    ## 🧪 Testing

-   **Status**: 279 passed across 27 test files
+   **Status**: 282 passed

    ```bash
    # Run all tests

bindings/python/docs/api/database.md

Lines changed: 13 additions & 5 deletions
@@ -666,17 +666,25 @@ Create a vector index for similarity search (JVector implementation). Existing r
    - `vector_property` (str): Property storing vector arrays
    - `dimensions` (int): Vector dimensionality
    - `distance_function` (str): `"cosine"`, `"euclidean"`, or `"inner_product"`
-   - `max_connections` (int): Max connections per node (default: 16). Maps to `maxConnections` in HNSW (JVector).
-   - `beam_width` (int): Beam width for search/construction (default: 100). Maps to `beamWidth` in HNSW (JVector).
-   - `quantization` (str | None): `"INT8"` (recommended), `"BINARY"`, `"PRODUCT"` for PQ, or `None` for full precision (default: `"INT8"`).
-   Prefer `"INT8"` for current production usage in these bindings; `"PRODUCT"`/PQ is currently not recommended for production workloads.
+   - `max_connections` (int): Max connections per node (default: 16). Maps to
+   `maxConnections` in HNSW (JVector).
+   - `beam_width` (int): Beam width for search/construction (default: 100). Maps to
+   `beamWidth` in HNSW (JVector).
+   - `quantization` (str | None): `"INT8"` (recommended), `"BINARY"`, `"PRODUCT"` for PQ,
+   or `None` for full precision (default: `"INT8"`). Prefer `"INT8"` for current
+   production usage in these bindings; `"PRODUCT"`/PQ is currently not recommended for
+   production workloads. In current ArcadeDB engine builds, `"PRODUCT"` also requires
+   enough indexed vectors per bucket for PQ training. For tiny corpora, set `pq_clusters`
+   explicitly to a small value or prefer another quantization mode.
    - `location_cache_size` (int | None): Override location cache size (default: `None`, uses engine default).
    - `graph_build_cache_size` (int | None): Override graph build cache size (default: `None`, uses engine default).
    - `mutations_before_rebuild` (int | None): Override rebuild threshold (default: `None`, uses engine default).
    - `store_vectors_in_graph` (bool): Persist vectors inline in graph file (faster reopen/search, larger graph).
    - `add_hierarchy` (bool | None): Force enabling/disabling HNSW hierarchy (default: `True`).
    - `pq_subspaces` (int | None): PQ subspaces (M). Requires `quantization="PRODUCT"`.
-   - `pq_clusters` (int | None): PQ clusters per subspace (K). Requires `quantization="PRODUCT"`.
+   - `pq_clusters` (int | None): PQ clusters per subspace (K). Requires
+   `quantization="PRODUCT"`. In current ArcadeDB engine builds, this should not exceed
+   the number of indexed vectors available for PQ training in a bucket.
    - `pq_center_globally` (bool | None): PQ global centering flag. Requires `quantization="PRODUCT"`.
    - `pq_training_limit` (int | None): PQ training sample cap. Requires `quantization="PRODUCT"`.
    - `build_graph_now` (bool): If `True` (default), eagerly builds/loads the vector graph immediately after index creation. Set to `False` to defer graph preparation to first query.
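
The `pq_clusters` guidance above (K should not exceed the vectors available for PQ training in a bucket) can be sketched as a small pre-flight check. The helper names, the even-distribution assumption, and the requested value of 256 are all hypothetical. The sketch only enforces the documented upper bound; it does not know any stricter minimum the engine may apply.

```python
def clamp_pq_clusters(requested, vectors_in_bucket):
    """Cap PQ clusters per subspace (K) at the vectors available in one bucket."""
    if requested < 1:
        raise ValueError("pq_clusters must be at least 1")
    return min(requested, vectors_in_bucket)


def index_options(num_vectors, num_buckets, requested_clusters=256):
    """Build keyword options for a vector index on a small corpus (hypothetical helper).

    Assumes vectors are spread evenly across buckets, which may not hold in practice.
    """
    smallest_bucket = num_vectors // num_buckets
    if smallest_bucket < 1:
        # Too few vectors for PQ training: fall back to the recommended mode.
        return {"quantization": "INT8"}
    return {
        "quantization": "PRODUCT",
        "pq_clusters": clamp_pq_clusters(requested_clusters, smallest_bucket),
    }


print(index_options(num_vectors=40, num_buckets=8))
```

The returned dictionary uses only the `quantization` and `pq_clusters` parameter names documented above, so it could be merged into the other index arguments.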

bindings/python/docs/api/schema.md

Lines changed: 6 additions & 3 deletions
@@ -282,10 +282,13 @@ schema.create_index("Article", ["content"], index_type="FULL_TEXT")

    **Vector (JVector) Parameters:**

-   - **max_connections**: Max connections per node (default: 16; typical 8-32). Maps to JVector `maxConnections`.
-   - **beam_width**: Beam width for build/search (default: 100; typical 64-200). Maps to JVector `beamWidth`.
+   - **max_connections**: Max connections per node (default: 16; typical 8-32). Maps to
+   JVector `maxConnections`.
+   - **beam_width**: Beam width for build/search (default: 100; typical 64-200). Maps to
+   JVector `beamWidth`.
    - **dimensions**: Vector size (must match your embeddings).
-   - **overquery_factor**: Search-time candidate multiplier (default: 4; typical 2-8). Higher improves recall with slower search.
+   - **ef_search**: Query-time exact-search beam width override via `find_nearest(...,
+   ef_search=...)`. Leave unset to use ArcadeDB's default/adaptive behavior.

    ## Type Inspection

bindings/python/docs/api/vector.md

Lines changed: 15 additions & 10 deletions
@@ -151,11 +151,15 @@ db.create_vector_index(
    - Maps to `beamWidth` in JVector
    - Higher = better recall, slower search
    - Typical range: 50-500
-   - `quantization` (str | None): `"INT8"` (recommended), `"BINARY"`, `"PRODUCT"`, or `None` (default: `"INT8"`)
+   - `quantization` (str | None): `"INT8"` (recommended), `"BINARY"`, `"PRODUCT"`, or
+   `None` (default: `"INT8"`)
+   - In current ArcadeDB engine builds, `"PRODUCT"` also requires enough indexed
+   vectors per bucket for PQ training. For tiny corpora, set `pq_clusters` explicitly
+   to a small value or prefer `"INT8"`, `"BINARY"`, or `None`.
    - Prefer `"INT8"` for current production usage in these bindings.
    - `"PRODUCT"`/PQ is available but currently not recommended for production workloads.
-   - `build_graph_now` (bool): If `True` (default), eagerly prepares the vector graph during index creation.
-   Set to `False` to defer graph preparation until first query.
+   - `build_graph_now` (bool): If `True` (default), eagerly prepares the vector graph
+   during index creation. Set to `False` to defer graph preparation until first query.

    **Returns:**

@@ -192,7 +196,7 @@ print(f"Created vector index: {index}")

    ---

-   ### `VectorIndex.find_nearest(query_vector, k=10, overquery_factor=4, allowed_rids=None)`
+   ### `VectorIndex.find_nearest(query_vector, k=10, ef_search=None, allowed_rids=None)`

    Find k-nearest neighbors to the query vector.

@@ -207,8 +211,8 @@ to `find_nearest` may perform lazy graph preparation and therefore take longer.
    - NumPy array: `np.array([0.1, 0.2, ...])`
    - Any array-like iterable
    - `k` (int): Number of neighbors to return (default: 10)
-   - `overquery_factor` (int): Multiplier for search-time over-querying (implicit efSearch)
-   (default: 4)
+   - `ef_search` (int | None): Optional exact-search beam width override. `None` uses
+   ArcadeDB's default/adaptive search behavior.
    - `allowed_rids` (List[str]): Optional list of RID strings (e.g. `["#1:0", "#2:5"]`) to
    restrict search (default: `None`)
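
The commit message mentions error handling for invalid `ef_search` inputs, but the validation code itself is not part of this view. The following is a minimal sketch of what an `int | None` check could look like; `normalize_ef_search` and its exact rules (positive integers only, booleans rejected) are assumptions, not the bindings' actual behavior.

```python
def normalize_ef_search(ef_search):
    """Return None or a positive int; reject anything else (assumed rules)."""
    if ef_search is None:
        # None means: let the engine pick its default/adaptive behavior.
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(ef_search, bool) or not isinstance(ef_search, int):
        raise TypeError("ef_search must be an int or None")
    if ef_search <= 0:
        raise ValueError("ef_search must be a positive integer")
    return ef_search


for candidate in (None, 100, 0, "100"):
    try:
        print(repr(candidate), "->", normalize_ef_search(candidate))
    except (TypeError, ValueError) as exc:
        print(repr(candidate), "-> rejected:", exc)
```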

@@ -514,11 +518,12 @@ db.close()
    - **Medium (16)**: Balanced (default)
    - **Higher (32)**: Better recall, more memory, slower build

-   **overquery_factor (search size):**
+   **ef_search (exact search beam width):**

-   - **Lower (2)**: Faster search, lower recall
-   - **Medium (4)**: Balanced (default)
-   - **Higher (8)**: Better recall, slower search
+   - **Unset (`None`)**: Use ArcadeDB's default/adaptive behavior
+   - **Lower (32)**: Faster search, lower recall
+   - **Medium (100)**: Balanced explicit override
+   - **Higher (200)**: Better recall, slower search

    **beam_width:**
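
The tuning tiers above could be captured as named presets that translate into keyword arguments for `find_nearest`. This is a sketch only: the preset names and the `find_nearest_kwargs` helper are hypothetical, while the values 32, 100, and 200 and the `k` and `ef_search` parameter names come from the documentation in this diff.

```python
# Preset names are illustrative; the numeric values follow the tuning tiers above.
EF_SEARCH_PRESETS = {
    "default": None,      # leave unset: engine default/adaptive behavior
    "fast": 32,           # faster search, lower recall
    "balanced": 100,      # balanced explicit override
    "high_recall": 200,   # better recall, slower search
}


def find_nearest_kwargs(preset="default", k=10):
    """Build keyword arguments for a find_nearest call from a named preset."""
    ef_search = EF_SEARCH_PRESETS[preset]
    kwargs = {"k": k}
    if ef_search is not None:
        # Only pass ef_search when overriding, so None keeps the engine default.
        kwargs["ef_search"] = ef_search
    return kwargs


print(find_nearest_kwargs("fast"))
print(find_nearest_kwargs("default", k=5))
```

With a real index object, the result would be unpacked into the call, for example `index.find_nearest(query_vector, **find_nearest_kwargs("high_recall"))`.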

bindings/python/docs/development/build-architecture.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ This document describes the build architecture for creating platform-specific Py

    **All supported platforms:**

-   - ✅ Current suite: 279 passed
+   - ✅ Current suite: 282 passed
    - ✅ 31.7M JARs (83 files, identical across platforms)
    - ✅ All native runners (no QEMU emulation)
    - ✅ Reproducible builds (pinned runner versions)

bindings/python/docs/development/ci-setup.md

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ All 4 platforms passing the bindings suite and example workflows:

    | Platforms | Wheel Size | JRE Size | Tests |
    |-----------|-----------|----------|-------|
-   | linux/amd64, linux/arm64, darwin/arm64, windows/amd64 | ~70-75M | ~60M | 279 passed ✅ |
+   | linux/amd64, linux/arm64, darwin/arm64, windows/amd64 | ~70-75M | ~60M | 282 passed ✅ |

    **All platforms include:**

bindings/python/docs/development/testing.md

Lines changed: 2 additions & 2 deletions
@@ -3,9 +3,9 @@
    Comprehensive testing documentation for ArcadeDB Python bindings.

    !!! success "Test Coverage"
-   **27 test files** in the current bindings suite
+   Current bindings suite

-   - **Current package**: 279 passed
+   - **Current package**: 282 passed
    - All ArcadeDB features working (SQL, OpenCypher, Studio)

    ## Quick Navigation

bindings/python/docs/development/testing/overview.md

Lines changed: 2 additions & 2 deletions
@@ -5,7 +5,7 @@ The ArcadeDB Python bindings have a comprehensive test suite covering all major
    ## Quick Statistics

    !!! success "Test Results"
-   - **Current package**: ✅ 279 passed across 27 test files
+   - **Current package**: ✅ 282 passed
    - Environment-specific skips may vary depending on optional components

    ## What's Tested
@@ -130,7 +130,7 @@ pytest -m "not slow"
    When the current bindings test suite passes, you should see a clean all-green summary.

    ```
-   ======================== 279 passed ========================
+   ======================== 282 passed ========================
    ```

bindings/python/docs/development/testing/test-schema.md

Lines changed: 2 additions & 1 deletion
@@ -408,7 +408,8 @@ with arcadedb.create_database("./test_db") as db:
    3. **Set constraints** - Use `set_mandatory()`, `set_not_null()`
    4. **Index frequently queried** - Properties used in WHERE clauses
    5. **Use appropriate types** - Match property types to data
-   6. **Configure JVector** - Tune `max_connections`, `beam_width`, `overquery_factor`, `dimensions` for vectors
+   6. **Configure JVector** - Tune `max_connections`, `beam_width`, `dimensions`, and
+   optional query-time `ef_search` overrides for vectors

    ## See Also
