Commit 5ecc0e2

Refactor vector index tests and add document vector search functionality
- Updated test_list_vector_indexes_empty to clarify the absence of vector indexes.
- Removed tests related to HNSW indexes to streamline focus on vector indexes.
- Enhanced test_get_vector_index_non_existent to improve clarity on non-existent index retrieval.
- Modified test_get_vector_index_wrong_type to reflect the distinction between regular and vector indexes.
- Introduced test_document_vector_search to validate vector search capabilities on Document types.
- Added test_document_vector_search_sql to verify KNN search functionality using SQL on Document types with vector indexes.
1 parent 1addedd commit 5ecc0e2

33 files changed

Lines changed: 722 additions & 2624 deletions

.github/workflows/test-python-examples.yml

Lines changed: 23 additions & 8 deletions
@@ -179,6 +179,16 @@ jobs:
             pip install numpy requests sentence-transformers
           fi

+      - name: Download datasets
+        shell: bash
+        run: |
+          pip install tqdm py7zr lxml
+          cd bindings/python/examples
+          echo "📥 Downloading MovieLens Small dataset..."
+          python3 download_data.py movielens-small
+          echo "📥 Downloading Stack Overflow Small dataset..."
+          python3 download_data.py stackoverflow-small
+
       - name: Install timeout command (macOS only)
         if: matrix.platform == 'darwin/amd64' || matrix.platform == 'darwin/arm64'
         shell: bash
@@ -215,8 +225,8 @@ jobs:
           results_file="example-results.txt"
           > $results_file

-          # Find all Python example files (exclude download_data.py as it's a utility)
-          examples=$(ls [0-9]*.py 2>/dev/null | sort)
+          # Find all Python example files (exclude download_data.py and 08_server_mode_rest_api.py)
+          examples=$(ls [0-9]*.py 2>/dev/null | grep -v "08_server_mode_rest_api.py" | sort)

           if [ -z "$examples" ]; then
             echo "❌ No example files found!"
@@ -248,20 +258,25 @@ jobs:
             # Set example-specific parameters and timeout
             case "$example" in
               "04_csv_import_documents.py")
-                example_args="--dataset movielens-small --parallel 4 --batch-size 5000 --export"
+                example_args="--dataset movielens-small --export"
                 example_name="$example (movielens-small dataset with export)"
                 timeout_duration=900 # 15 minutes
                 ;;
               "05_csv_import_graph.py")
-                example_args="--dataset movielens-small --parallel 1 --no-async --export --import-jsonl ./exports/movielens_small_db.jsonl.tgz"
-                example_name="$example (movielens-small dataset, sync mode, import from JSONL)"
+                example_args="--dataset movielens-small --method java --import-jsonl ./exports/movielens_small_db.jsonl.tgz --export"
+                example_name="$example (movielens-small dataset, embedded java method, import/export)"
                 timeout_duration=900 # 15 minutes
                 ;;
               "06_vector_search_recommendations.py")
-                example_args="--source-db my_test_databases/movielens_graph_small_db --db-path my_test_databases/movielens_graph_small_db_vectors --import-jsonl exports/movielens_graph_small_db.jsonl.tgz"
+                example_args="--import-jsonl ./exports/movielens_graph_small_db.jsonl.tgz"
                 example_name="$example (vector search, import from JSONL)"
                 timeout_duration=900 # 15 minutes
                 ;;
+              "07_stackoverflow_multimodel.py")
+                example_args="--dataset stackoverflow-small"
+                example_name="$example (stackoverflow-small dataset)"
+                timeout_duration=1800 # 30 minutes
+                ;;
               *)
                 example_args=""
                 example_name="$example"
@@ -368,12 +383,12 @@ jobs:
           echo "" >> $GITHUB_STEP_SUMMARY
           echo "- **01_simple_document_store.py** - Document CRUD operations with comprehensive data types" >> $GITHUB_STEP_SUMMARY
           echo "- **02_social_network_graph.py** - Graph modeling with vertices, edges, and traversal" >> $GITHUB_STEP_SUMMARY
-          echo "- **03_vector_search.py** - Vector embeddings and semantic similarity search (experimental)" >> $GITHUB_STEP_SUMMARY
+          echo "- **03_vector_search.py** - Vector embeddings and semantic similarity search" >> $GITHUB_STEP_SUMMARY
           echo "- **04_csv_import_documents.py** - CSV import with automatic dataset download and type inference" >> $GITHUB_STEP_SUMMARY
           echo " - Tested with \`--dataset movielens-small --parallel 4 --batch-size 5000 --export\`" >> $GITHUB_STEP_SUMMARY
           echo "- **05_csv_import_graph.py** - Graph creation from document store with benchmarking" >> $GITHUB_STEP_SUMMARY
           echo " - Tested with \`--dataset movielens-small --parallel 1 --no-async --export --import-jsonl\`" >> $GITHUB_STEP_SUMMARY
-          echo "- **06_vector_search_recommendations.py** - HNSW vector indexing for movie recommendations" >> $GITHUB_STEP_SUMMARY
+          echo "- **06_vector_search_recommendations.py** - JVector vector indexing for movie recommendations" >> $GITHUB_STEP_SUMMARY
           echo " - Tested with \`--import-jsonl exports/movielens_graph_small_db.jsonl.tgz\`" >> $GITHUB_STEP_SUMMARY
           echo "" >> $GITHUB_STEP_SUMMARY

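The selection and per-example dispatch in the workflow hunks above can be sketched in Python for readers who prefer it to shell. This is only an illustration: `select_examples` and `settings_for` are hypothetical helper names, the exclusion is an exact filename match rather than the regex match `grep -v` performs, and the fallback timeout is left as `None` because the diff does not show the default value.

```python
from fnmatch import fnmatch

# Files skipped by the workflow's `grep -v` filter.
EXCLUDED = {"08_server_mode_rest_api.py"}

# Arguments and timeouts (seconds) copied from the `case` branches in the diff.
EXAMPLE_SETTINGS = {
    "04_csv_import_documents.py": ("--dataset movielens-small --export", 900),
    "05_csv_import_graph.py": (
        "--dataset movielens-small --method java "
        "--import-jsonl ./exports/movielens_small_db.jsonl.tgz --export",
        900,
    ),
    "06_vector_search_recommendations.py": (
        "--import-jsonl ./exports/movielens_graph_small_db.jsonl.tgz",
        900,
    ),
    "07_stackoverflow_multimodel.py": ("--dataset stackoverflow-small", 1800),
}


def select_examples(filenames):
    """Roughly mimic `ls [0-9]*.py | grep -v 08_server_mode_rest_api.py | sort`."""
    picked = [
        name for name in filenames
        if fnmatch(name, "[0-9]*.py") and name not in EXCLUDED
    ]
    return sorted(picked)


def settings_for(example):
    """Return (args, timeout); timeout is None where the diff shows no value."""
    return EXAMPLE_SETTINGS.get(example, ("", None))


selected = select_examples(
    ["download_data.py", "08_server_mode_rest_api.py",
     "07_stackoverflow_multimodel.py", "01_simple_document_store.py"]
)
```

Because `download_data.py` does not start with a digit, the glob alone already excludes it; only the server-mode example needs the explicit filter.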

bindings/python/README.md

Lines changed: 2 additions & 2 deletions
@@ -89,7 +89,7 @@ with arcadedb.create_database("/tmp/mydb") as db:
 - 🔍 **Multiple query languages**: SQL, Cypher, Gremlin, MongoDB
 - **High performance**: Direct JVM integration via JPype
 - 🔒 **ACID transactions**: Full transaction support
-- 🎯 **Vector storage**: Store and query vector embeddings with HNSW indexing
+- 🎯 **Vector storage**: Store and query vector embeddings with JVector indexing
 - 📥 **Data import**: Built-in CSV, JSON, Neo4j importers

 ---
@@ -221,7 +221,7 @@ arcadedb_embedded/
 ├── server.py # ArcadeDBServer for HTTP mode
 ├── results.py # ResultSet and Result wrappers
 ├── transactions.py # TransactionContext manager
-├── vector.py # Vector search and HNSW indexing
+├── vector.py # Vector search and JVector indexing
 ├── importer.py # Data import (CSV, JSON, Neo4j)
 ├── exceptions.py # ArcadeDBError exception
 └── jvm.py # JVM lifecycle management

bindings/python/docs/api/database.md

Lines changed: 5 additions & 4 deletions
@@ -386,13 +386,15 @@ db.create_vector_index(
     vector_property: str,
     dimensions: int,
     distance_function: str = "cosine",
-    max_connections: int = 16,
-    beam_width: int = 200
+    max_connections: int = 32,
+    beam_width: int = 256
 ) -> VectorIndex
 ```

 Create a vector index for similarity search (default JVector implementation).

+**Note:** The index is built lazily. Construction happens upon the first query, not at creation time.
+
 **Parameters:**

 - `vertex_type` (str): Vertex type containing vectors
@@ -427,7 +429,6 @@ with db.transaction():
         vertex.set("id", f"doc_{i}")
         vertex.set("embedding", arcadedb.to_java_float_array(embedding))
         vertex.save()
-        index.add_vertex(vertex)

 # Search
 query_vector = np.random.rand(384)
@@ -698,6 +699,6 @@ else:
 ## See Also

 - [Graph Operations](../guide/graphs.md): Working with vertices and edges
-- [Vector Search](../guide/vectors.md): Similarity search with HNSW indexes
+- [Vector Search](../guide/vectors.md): Similarity search with JVector indexes
 - [Server Mode](../guide/server.md): HTTP API and Studio UI
 - [Quick Start](../getting-started/quickstart.md): Getting started guide
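
The note added to `database.md` says the index is built lazily, on the first query rather than at creation time. The general pattern can be sketched in plain Python as follows. `LazyIndex` is a hypothetical class written for illustration with brute-force search; it is not ArcadeDB's implementation or API, and the choice to rebuild after new data is an assumption of the sketch.

```python
import math


class LazyIndex:
    """Hypothetical illustration of lazy index construction (not ArcadeDB code)."""

    def __init__(self):
        self._records = []    # (record_id, vector) pairs saved so far
        self._built = None    # search structure, created on the first query
        self.build_count = 0  # how many times construction actually ran

    def add(self, record_id, vector):
        self._records.append((record_id, list(vector)))
        self._built = None    # assumption: new data invalidates the old build

    def _ensure_built(self):
        if self._built is None:
            # Stand-in for expensive graph construction: snapshot the data.
            self._built = list(self._records)
            self.build_count += 1

    def find_nearest(self, query, k=10):
        self._ensure_built()
        scored = [(rid, math.dist(query, vec)) for rid, vec in self._built]
        scored.sort(key=lambda pair: pair[1])
        return scored[:k]


index = LazyIndex()
index.add("a", [0.0, 0.0])
index.add("b", [1.0, 1.0])
builds_before_query = index.build_count       # nothing has been built yet
first = index.find_nearest([0.1, 0.0], k=1)   # this call pays the build cost
second = index.find_nearest([0.9, 1.0], k=1)  # this call reuses the build
```

The practical consequence matches the note: the first query includes the construction cost, so latency measurements should discard or separately report that warm-up call.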

bindings/python/docs/api/vector.md

Lines changed: 33 additions & 95 deletions
@@ -1,6 +1,8 @@
 # Vector API

-Vector search capabilities in ArcadeDB use HNSW (Hierarchical Navigable Small World) indexing for fast approximate nearest neighbor search. Perfect for semantic search, recommendation systems, and similarity-based queries.
+Vector search capabilities in ArcadeDB use JVector (a graph-based index combining HNSW
+and DiskANN concepts) for fast approximate nearest neighbor search. Perfect for semantic
+search, recommendation systems, and similarity-based queries.

 ## Overview


@@ -13,7 +15,7 @@ ArcadeDB's vector support enables:

 **Key Features:**

-- HNSW indexing for O(log N) search performance
+- Graph-based indexing for O(log N) search performance
 - Multiple distance metrics (cosine, euclidean, inner product)
 - Native NumPy integration (optional)
 - Configurable precision/performance trade-offs
@@ -24,7 +26,8 @@ Utility functions for converting between Python and Java vector representations:

 ### `to_java_float_array(vector)`

-Convert a Python array-like object to a Java float array compatible with ArcadeDB's vector indexing.
+Convert a Python array-like object to a Java float array compatible with ArcadeDB's
+vector indexing.

 **Parameters:**

@@ -98,7 +101,7 @@ print(type(py_list)) # <class 'list'>

 ## VectorIndex Class

-Wrapper for ArcadeDB's HNSW vector index, providing similarity search capabilities.
+Wrapper for ArcadeDB's vector index, providing similarity search capabilities.

 ### Creation via Database

@@ -171,10 +174,13 @@ print(f"Created vector index: {index}")

 ---

-### `VectorIndex.find_nearest(query_vector, k=10, overquery_factor=16, use_numpy=True, allowed_rids=None)`
+### `VectorIndex.find_nearest(query_vector, k=10, overquery_factor=16, allowed_rids=None)`

 Find k-nearest neighbors to the query vector.

+**Note:** The first call to `find_nearest` triggers the index construction if it hasn't
+been built yet. This "warm up" query may take longer than subsequent queries.
+
 **Parameters:**

 - `query_vector`: Query vector as:
@@ -184,14 +190,13 @@ Find k-nearest neighbors to the query vector.
 - `k` (int): Number of neighbors to return (default: 10)
 - `overquery_factor` (int): Multiplier for search-time over-querying (implicit efSearch)
   (default: 16)
-- `use_numpy` (bool): Return vectors as NumPy if available (default: `True`)
 - `allowed_rids` (List[str]): Optional list of RID strings (e.g. `["#1:0", "#2:5"]`) to
   restrict search (default: `None`)

 **Returns:**

-- `List[Tuple[vertex, float]]`: List of `(vertex, distance)` tuples
-  - `vertex`: Matched vertex object (MutableVertex)
+- `List[Tuple[record, float]]`: List of `(record, distance)` tuples
+  - `record`: Matched ArcadeDB record object (Vertex, Document, or Edge)
   - `distance`: Similarity score (float)
     - Lower = more similar
     - Range depends on distance function
@@ -210,9 +215,9 @@ neighbors = index.find_nearest(query_vector, k=5)
 allowed_rids = ["#10:5", "#10:8", "#10:12"]
 filtered_neighbors = index.find_nearest(query_vector, k=5, allowed_rids=allowed_rids)

-for vertex, distance in neighbors:
-    doc_id = vertex.get("id")
-    text = vertex.get("text")
+for record, distance in neighbors:
+    doc_id = record.get("id")
+    text = record.get("text")
     print(f"Distance: {distance:.4f} | ID: {doc_id}")
     print(f" Text: {text[:100]}...")
 ```
@@ -225,76 +230,12 @@ for vertex, distance in neighbors:
 | euclidean | [0, ∞) | ✓ (0 = identical) |
 | inner_product | (-∞, ∞) | ✗ (higher = more similar) |

----
-
-### `VectorIndex.add_vertex(vertex)`
-
-Add a single vertex to the index.
-
-**Parameters:**
-
-- `vertex`: Vertex object with vector property set
-
-**Raises:**
-
-- `ArcadeDBError`: If vertex cannot be added
-
-**Example:**
-
-```python
-# Add during vertex creation
-with db.transaction():
-    doc = db.new_vertex("Document")
-    doc.set("id", "doc_001")
-    doc.set("text", "Introduction to vector search")
-    doc.set("embedding", to_java_float_array(embedding))
-    doc.save()
-
-    # Add to index
-    index.add_vertex(doc)
-```
-
-**Important:**
-
 - Vertex must have the vector property populated
 - Vector dimensionality must match index dimensions
 - Call within a transaction for consistency

 ---

-### `VectorIndex.remove_vertex(vertex_id)`
-
-Remove a vertex from the index.
-
-**Parameters:**
-
-- `vertex_id`: ID of the vertex to remove (typically string or int)
-
-**Raises:**
-
-- `ArcadeDBError`: If removal fails
-
-**Example:**
-
-```python
-# Remove by ID
-vertex_id = "doc_001"
-index.remove_vertex(vertex_id)
-```
-
-**Note:** This removes from the vector index only, not from the database. To fully delete:
-
-```python
-with db.transaction():
-    # Remove from index
-    index.remove_vertex(doc_id)
-
-    # Delete from database
-    db.command("sql", f"DELETE FROM Document WHERE id = '{doc_id}'")
-```
-
----
-
 ## Complete Examples

 ### Semantic Search with Sentence Transformers
300241
### Semantic Search with Sentence Transformers
@@ -355,9 +296,6 @@ with db.transaction():
         vertex.set("embedding", to_java_float_array(embedding))
         vertex.save()

-        # Add to vector index
-        index.add_vertex(vertex)
-
 print(f"Indexed {len(documents)} documents")

 # Search
@@ -424,7 +362,7 @@ with db.transaction():
         v.set("price", prod["price"])
         v.set("features", to_java_float_array(prod["features"]))
         v.save()
-        index.add_vertex(v)
+        # Note: LSM vector index automatically indexes new records

 # Hybrid search: vector similarity + filters
 query_features = np.random.rand(128)
@@ -502,7 +440,7 @@ with db.transaction():
         v.set("embedding", to_java_float_array(embedding))
         v.save()

-        index.add_vertex(v)
+        # Note: LSM vector index automatically indexes new records

 # Search for similar images
 query_image = "query.jpg"
@@ -521,25 +459,25 @@ db.close()

 ## Performance Tuning

-### HNSW Parameters
+### Vector Index Parameters

-**M (connections per node):**
+**max_connections (connections per node):**

-- **Lower (8-12)**: Faster build, less memory, lower recall
-- **Medium (16-24)**: Balanced (recommended)
-- **Higher (32-48)**: Better recall, more memory, slower build
+- **Lower (<32)**: Faster build, less memory, lower recall
+- **Medium (32)**: Balanced (recommended)
+- **Higher (>32)**: Better recall, more memory, slower build

-**ef (search size):**
+**overquery_factor (search size):**

-- **Lower (50-100)**: Faster search, lower recall
-- **Medium (128-200)**: Balanced (recommended)
-- **Higher (200-400)**: Better recall, slower search
+- **Lower (<16)**: Faster search, lower recall
+- **Medium (16)**: Balanced (recommended)
+- **Higher (>16)**: Better recall, slower search

-**ef_construction:**
+**beam_width:**

-- **Lower (100-150)**: Faster build, lower quality
-- **Medium (128-256)**: Balanced
-- **Higher (300-500)**: Better quality, slower build
+- **Lower (<256)**: Faster build, lower quality
+- **Medium (256)**: Balanced
+- **Higher (>256)**: Better quality, slower build

 ### Distance Functions

@@ -592,7 +530,7 @@ try:
         v = db.new_vertex("Doc")
         v.set("emb", to_java_float_array(np.random.rand(512))) # Wrong size!
         v.save()
-        index.add_vertex(v) # Will fail
+        # Indexing happens automatically and may fail asynchronously or on next access

 except ArcadeDBError as e:
     print(f"Error: {e}")
@@ -604,7 +542,7 @@ try:
         v.set("id", "doc1")
         # Forgot to set embedding!
         v.save()
-        index.add_vertex(v) # Will fail
+        # Indexing happens automatically

 except ArcadeDBError as e:
     print(f"Error: {e}")
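
The distance table in this file notes that Euclidean distance treats lower values as more similar while inner product treats higher values as more similar. A small plain-Python sketch of why the sort direction matters follows; `rank` is a hypothetical helper written for this illustration and is not part of the ArcadeDB bindings.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def rank(query, vectors, metric):
    """Return record ids ordered from most to least similar."""
    if metric == "euclidean":
        # Lower distance means more similar, so sort ascending.
        return sorted(vectors, key=lambda rid: euclidean(query, vectors[rid]))
    if metric == "inner_product":
        # Higher score means more similar, so sort descending.
        return sorted(
            vectors,
            key=lambda rid: inner_product(query, vectors[rid]),
            reverse=True,
        )
    raise ValueError(f"unsupported metric: {metric}")


vectors = {"same": [1.0, 0.0], "longer": [5.0, 0.0]}
query = [1.0, 0.0]

by_euclidean = rank(query, vectors, "euclidean")
by_inner_product = rank(query, vectors, "inner_product")
```

The two metrics disagree here because inner product rewards vector magnitude: `"longer"` points in the same direction as the query but scores higher than the identical vector, which is one reason embeddings are often normalized before using it.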

bindings/python/docs/development/architecture.md

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ arcadedb_embedded/
 - Type inference

 **`vector.py`**
-- `VectorIndex`: HNSW vector indexing
+- `VectorIndex`: JVector-based Vector indexing
 - NumPy ↔ Java array conversion
 - Nearest neighbor search
 - Distance metrics

bindings/python/docs/development/testing/overview.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ The test suite covers:
 - **Concurrency patterns** - File locking, thread safety, multi-process
 - **Graph operations** - Vertices, edges, traversals
 - **Query languages** - SQL, Cypher, Gremlin
-- **Vector search** - HNSW indexes, similarity search
+- **Vector search** - JVector-based Vector indexes, similarity search
 - **Data import** - CSV, JSON, Neo4j exports
 - **Unicode support** - International characters, emoji
 - **Schema introspection** - Querying database metadata
