Commit 582b5e1

Refactor vector index creation to use SQL LSM_VECTOR
- Updated documentation and examples to reflect the preferred SQL method for creating vector indexes.
- Removed Python object API calls for vector index creation in favor of SQL commands.
- Ensured that SQL index creation builds the vector graph immediately by default.
- Added tests to verify immediate queryability of vector indexes created via SQL.
- Updated test suite to reflect the new implementation, increasing the number of passed tests from 282 to 290.
1 parent b58503a commit 582b5e1

34 files changed

Lines changed: 424 additions & 357 deletions

bindings/python/README.md

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@

 Native Python bindings for ArcadeDB - the multi-model database that supports Graph, Document, Key/Value, Search Engine, Time Series, and Vector models.

-**Status**: ✅ Production Ready | **Tests**: 282 Passed | **Platforms**: 4 Supported
+**Status**: ✅ Production Ready | **Tests**: 290 Passed | **Platforms**: 4 Supported

 ---

@@ -92,7 +92,7 @@ Import: `import arcadedb_embedded as arcadedb`

 ## 🧪 Testing

-**Status**: 282 passed
+**Status**: 290 passed

 ```bash
 # Run all tests

bindings/python/docs/api/database.md

Lines changed: 11 additions & 7 deletions
@@ -658,7 +658,14 @@ db.create_vector_index(
 ) -> VectorIndex
 ```

-Create a vector index for similarity search (JVector implementation). Existing records are indexed automatically when the index is created. By default, graph preparation is performed immediately (`build_graph_now=True`).
+Create a vector index for similarity search (JVector implementation). Existing records
+are indexed automatically when the index is created. By default, graph preparation is
+performed immediately (`build_graph_now=True`).
+
+For normal application code and documentation examples, prefer SQL `CREATE INDEX ...
+LSM_VECTOR METADATA {...}` because it is cleaner and aligns with the SQL-first workflow.
+Keep `create_vector_index()` for Python-driven setup, tests, or manual control when you
+specifically need that surface.

 **Parameters:**

@@ -703,7 +710,7 @@ db.command("sql", "CREATE VERTEX TYPE Document")
 db.command("sql", "CREATE PROPERTY Document.embedding ARRAY_OF_FLOATS")
 db.command("sql", "CREATE PROPERTY Document.id STRING")

-# Create vector index
+# Secondary/manual option: create vector index from Python
 index = db.create_vector_index("Document", "embedding", dimensions=384)

 # Add vectors
@@ -714,17 +721,14 @@ with db.transaction():
     vertex.set("embedding", arcadedb.to_java_float_array(embedding))
     vertex.save()

-# Search
+# Preferred query path: SQL search
 query_vector = np.random.rand(384)
-results = index.find_nearest(query_vector, k=5)
-
-# Preferred when you want richer query composition
 qvec_literal = "[" + ", ".join(str(float(x)) for x in query_vector.tolist()) + "]"
 rows = db.query(
     "sql",
     (
         "SELECT id, distance, (1 - distance) AS score "
-        "FROM (SELECT expand(`vector.neighbors`('Document[embedding]', "
+        "FROM (SELECT expand(vectorNeighbors('Document[embedding]', "
         f"{qvec_literal}, 5))) ORDER BY distance"
     ),
 ).to_list()
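
The `qvec_literal` expression in the hunk above could be factored into a small helper so the literal is validated before it is spliced into the query text. The following is a minimal sketch and not part of this commit: `to_sql_vector_literal` and `neighbors_query` are hypothetical names, and the query shape only mirrors the `vectorNeighbors` example shown in the diff.

```python
import math
from typing import Iterable


def to_sql_vector_literal(values: Iterable[float]) -> str:
    """Render numbers as a bracketed literal such as "[0.5, 1.0]"."""
    floats = [float(v) for v in values]
    if not floats:
        raise ValueError("query vector must not be empty")
    if not all(math.isfinite(v) for v in floats):
        raise ValueError("query vector must contain only finite numbers")
    return "[" + ", ".join(str(v) for v in floats) + "]"


def neighbors_query(index_ref: str, values: Iterable[float], k: int) -> str:
    """Build the SELECT text from the docs example (hypothetical helper)."""
    # The index reference is quoted in the SQL text, so reject embedded quotes.
    if "'" in index_ref:
        raise ValueError("index reference must not contain single quotes")
    if k < 1:
        raise ValueError("k must be a positive integer")
    literal = to_sql_vector_literal(values)
    return (
        "SELECT id, distance, (1 - distance) AS score "
        f"FROM (SELECT expand(vectorNeighbors('{index_ref}', {literal}, {int(k)}))) "
        "ORDER BY distance"
    )


if __name__ == "__main__":
    print(neighbors_query("Document[embedding]", [0.5, 1, 2.25], 5))
```

The returned string would be passed as the second argument to `db.query("sql", ...)` exactly as in the diff; the helper only assembles text and does not touch the database.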

bindings/python/docs/api/schema.md

Lines changed: 3 additions & 2 deletions
@@ -287,8 +287,9 @@ schema.create_index("Article", ["content"], index_type="FULL_TEXT")
 - **beam_width**: Beam width for build/search (default: 100; typical 64-200). Maps to
   JVector `beamWidth`.
 - **dimensions**: Vector size (must match your embeddings).
-- **ef_search**: Query-time exact-search beam width override via `find_nearest(...,
-  ef_search=...)`. Leave unset to use ArcadeDB's default/adaptive behavior.
+- **ef_search**: Query-time exact-search beam width override via SQL
+  `vectorNeighbors(..., k, ef_search)`. Leave unset to use ArcadeDB's default/adaptive
+  behavior.

 ## Type Inspection

bindings/python/docs/api/transactions.md

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ It is important to distinguish between operations that require explicit transact
 | **Data Write** | `db.command("sql", "INSERT...")`, `db.command("sql", "UPDATE...")`, `db.command("sql", "DELETE...")`, `db.command("opencypher", "CREATE ...")` | **Required** (Wrap in `with db.transaction():`) |
 | **Bulk Operations** | `db.command("sql", "IMPORT DATABASE...")`, `db.import_documents(...)`, `db.graph_batch(...)` | **Auto-transactional / auto-managed** (Built-in transaction management) |
 | **Data Read** | `db.query()`, `db.command("sql", "SELECT...")`, `db.lookup_by_rid()` | **Optional** (Can run outside transaction for better performance) |
-| **Vector Operations** | `db.create_vector_index()` | **Auto-transactional** (Do NOT wrap) |
+| **Vector Operations** | `CREATE INDEX ... LSM_VECTOR` | **Auto-transactional** (Do NOT wrap) |

 ### Key Distinction: `db.query()` vs `db.command()`

bindings/python/docs/api/vector.md

Lines changed: 100 additions & 35 deletions
@@ -106,14 +106,42 @@ print(type(py_list)) # <class 'list'>

 Wrapper for ArcadeDB's vector index, providing similarity search capabilities.

-Creation and configuration fit well in the Python object API. For search, prefer SQL
-or Cypher when you need filtering, projection, self-exclusion, or custom score
-shaping. The Python search methods below are convenience helpers for simple
-embedded-mode workflows.
+For creation, prefer SQL `CREATE INDEX ... LSM_VECTOR METADATA {...}`. For search,
+prefer SQL or Cypher when you need filtering, projection, self-exclusion, or custom
+score shaping. The Python methods below are convenience helpers for simple
+embedded-mode workflows and for advanced/manual control. They are not the primary
+application-facing workflow this documentation recommends.
+
+For SQL snippets in this documentation, use `vectorNeighbors(...)` by default. The
+engine also exposes an equivalent dotted canonical function name, but the alias is more
+ergonomic in SQL because it does not require backticks.
+
+### Preferred Creation: SQL
+
+Use SQL as the default creation surface:
+
+```python
+db.command(
+    "sql",
+    """
+    CREATE INDEX ON Document (embedding)
+    LSM_VECTOR
+    METADATA {
+        "dimensions": 384,
+        "similarity": "COSINE"
+    }
+    """
+)
+```
+
+SQL builds the vector graph immediately by default. Add `"buildGraphNow": false` only
+if you intentionally want lazy preparation.

 ### Creation via Database

-Vector indexes are created using the `Database.create_vector_index()` method:
+`Database.create_vector_index()` still exists, but it should be treated as a secondary,
+Python-driven helper for manual setup, tests, and API completeness rather than the
+primary documented workflow.

 **Signature:**

@@ -177,8 +205,6 @@ db.create_vector_index(

 ```python
 import arcadedb_embedded as arcadedb
-from arcadedb_embedded import to_java_float_array
-import numpy as np

 # Create database and schema
 db = arcadedb.create_database("./vector_db")
@@ -189,7 +215,7 @@ db.command("sql", "CREATE PROPERTY Document.text STRING")
 db.command("sql", "CREATE PROPERTY Document.embedding ARRAY_OF_FLOATS")
 db.command("sql", "CREATE INDEX ON Document (id) UNIQUE")

-# Create vector index
+# Secondary option: create vector index from Python
 index = db.create_vector_index(
     vertex_type="Document",
     vector_property="embedding",
@@ -209,9 +235,14 @@ print(f"Created vector index: {index}")

 Find k-nearest neighbors to the query vector.

+Treat this as a helper/manual API. For normal application queries, prefer SQL
+`vectorNeighbors` so search composes naturally with filtering, projection, and record
+exclusion.
+
 **Note:** With default settings (`build_graph_now=True` in `create_vector_index`), graph
-preparation runs during index creation. If you set `build_graph_now=False`, the first call
-to `find_nearest` may perform lazy graph preparation and therefore take longer.
+preparation runs during index creation. In the preferred SQL path, this eager behavior is
+also the default. If you explicitly disable eager preparation, the first call to
+`find_nearest` may perform lazy graph preparation and therefore take longer.

 **Parameters:**

@@ -264,7 +295,7 @@ rows = db.query(
     "sql",
     (
         "SELECT id, distance, (1 - distance) AS score "
-        "FROM (SELECT expand(`vector.neighbors`('Document[embedding]', "
+        "FROM (SELECT expand(vectorNeighbors('Document[embedding]', "
         f"{qvec_literal}, 10))) WHERE id <> ? ORDER BY distance LIMIT 5"
     ),
     "doc-42",
@@ -292,6 +323,9 @@ Find nearest neighbors by reusing the vector stored on an existing record.
 This is the Python wrapper for the common "search from an existing record" workflow,
 using the index's configured `id_property` to look up the source vector first.

+Treat this as a convenience helper. If you need the recommended query surface, use SQL
+or Cypher instead.
+
 **Parameters:**

 - `key`: Value of the configured ID property
@@ -350,14 +384,20 @@ print(meta["dimensions"], meta["similarity_function"], meta["id_property"])

 Force an immediate rebuild/preparation of the vector graph.

+This is a maintenance API, not part of the normal SQL-first creation workflow.
+
 Use this when you want to control when rebuild cost is paid, for example:

 - after bulk inserts,
 - after bulk deletes/removals,
 - before opening traffic after large vector mutations.

-This is especially useful if you created the index with `build_graph_now=False` and want
-to avoid rebuild work on the first query.
+This is especially useful if you created the index with `build_graph_now=False` or with
+SQL metadata `"buildGraphNow": false` and want to avoid rebuild work on the first query.
+
+When you create an `LSM_VECTOR` index through SQL, ArcadeDB now builds the graph
+immediately by default. Use `build_graph_now()` only for explicit maintenance or when
+you intentionally deferred graph preparation.

 **Returns:**

@@ -395,16 +435,23 @@ db.command("sql", "CREATE PROPERTY Document.content STRING")
 db.command("sql", "CREATE PROPERTY Document.embedding ARRAY_OF_FLOATS")
 db.command("sql", "CREATE INDEX ON Document (id) UNIQUE")

-# Create vector index (384 dimensions for all-MiniLM-L6-v2)
-index = db.create_vector_index(
-    vertex_type="Document",
-    vector_property="embedding",
-    dimensions=384,
-    distance_function="cosine",
-    max_connections=16,
-    beam_width=100 # Default beam width
+# Preferred: create vector index in SQL
+db.command(
+    "sql",
+    """
+    CREATE INDEX ON Document (embedding)
+    LSM_VECTOR
+    METADATA {
+        "dimensions": 384,
+        "similarity": "COSINE",
+        "maxConnections": 16,
+        "beamWidth": 100
+    }
+    """
 )

+index = db.schema.get_vector_index("Document", "embedding")
+
 # Sample documents
 documents = [
     {"id": "doc1", "title": "Python Tutorial",
@@ -471,13 +518,21 @@ db.command("sql", "CREATE PROPERTY Product.price DECIMAL")
 db.command("sql", "CREATE PROPERTY Product.features ARRAY_OF_FLOATS")
 db.command("sql", "CREATE INDEX ON Product (category) NOTUNIQUE")

-# Create vector index
-index = db.create_vector_index(
-    vertex_type="Product",
-    vector_property="features",
-    dimensions=128
+# Create vector index in SQL
+db.command(
+    "sql",
+    """
+    CREATE INDEX ON Product (features)
+    LSM_VECTOR
+    METADATA {
+        "dimensions": 128,
+        "similarity": "COSINE"
+    }
+    """
 )

+index = db.schema.get_vector_index("Product", "features")
+
 # Add products with feature vectors
 products = [
     {"id": "p1", "name": "Laptop", "category": "Electronics",
@@ -552,16 +607,23 @@ db.command("sql", "CREATE PROPERTY Image.filename STRING")
 db.command("sql", "CREATE PROPERTY Image.path STRING")
 db.command("sql", "CREATE PROPERTY Image.embedding ARRAY_OF_FLOATS")

-# Create index
-index = db.create_vector_index(
-    vertex_type="Image",
-    vector_property="embedding",
-    dimensions=512,
-    distance_function="cosine",
-    max_connections=24, # Higher for image search
-    beam_width=200
+# Create index in SQL
+db.command(
+    "sql",
+    """
+    CREATE INDEX ON Image (embedding)
+    LSM_VECTOR
+    METADATA {
+        "dimensions": 512,
+        "similarity": "COSINE",
+        "maxConnections": 24,
+        "beamWidth": 200
+    }
+    """
 )

+index = db.schema.get_vector_index("Image", "embedding")
+
 # Index images
 image_files = ["img1.jpg", "img2.jpg", "img3.jpg"]

@@ -662,7 +724,10 @@ import numpy as np

 try:
     # Dimension mismatch
-    index = db.create_vector_index("Doc", "emb", dimensions=384)
+    db.command(
+        "sql",
+        'CREATE INDEX ON Doc (emb) LSM_VECTOR METADATA {"dimensions": 384}',
+    )

     v = db.new_vertex("Doc")
     v.set("emb", to_java_float_array(np.random.rand(512))) # Wrong size!
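
The vector.md hunks above repeat the same `CREATE INDEX ... LSM_VECTOR METADATA {...}` text with different metadata. As a rough sketch that is not part of this commit, the statement could be assembled from keyword arguments; `lsm_vector_index_sql` is a hypothetical helper, the identifier check is a conservative assumption, and the metadata keys used here are only the ones that appear in the diff (`dimensions`, `similarity`, `maxConnections`, `beamWidth`, `buildGraphNow`).

```python
import json


def lsm_vector_index_sql(type_name: str, property_name: str,
                         dimensions: int, **metadata) -> str:
    """Build a single-line CREATE INDEX ... LSM_VECTOR statement (hypothetical helper)."""
    # Type and property names are spliced into the SQL text unquoted,
    # so accept only plain identifiers.
    for name in (type_name, property_name):
        if not name.isidentifier():
            raise ValueError(f"unexpected identifier: {name!r}")
    if dimensions < 1:
        raise ValueError("dimensions must be a positive integer")
    # json.dumps renders Python False as JSON false, matching "buildGraphNow": false.
    payload = {"dimensions": int(dimensions), **metadata}
    return (
        f"CREATE INDEX ON {type_name} ({property_name}) "
        f"LSM_VECTOR METADATA {json.dumps(payload)}"
    )


if __name__ == "__main__":
    print(lsm_vector_index_sql("Image", "embedding", 512,
                               similarity="COSINE", maxConnections=24, beamWidth=200))
```

The result would be passed to `db.command("sql", ...)` as in the hunks above. Keeping the metadata as keyword arguments makes an opt-out such as `buildGraphNow=False` explicit at the call site.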

bindings/python/docs/development/build-architecture.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ This document describes the build architecture for creating platform-specific Py

 **All supported platforms:**

-- ✅ Current suite: 282 passed
+- ✅ Current suite: 290 passed
 - ✅ 31.7M JARs (83 files, identical across platforms)
 - ✅ All native runners (no QEMU emulation)
 - ✅ Reproducible builds (pinned runner versions)

bindings/python/docs/development/ci-setup.md

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ All 4 platforms passing the bindings suite and example workflows:

 | Platforms | Wheel Size | JRE Size | Tests |
 |-----------|-----------|----------|-------|
-| linux/amd64, linux/arm64, darwin/arm64, windows/amd64 | ~70-75M | ~60M | 282 passed ✅ |
+| linux/amd64, linux/arm64, darwin/arm64, windows/amd64 | ~70-75M | ~60M | 290 passed ✅ |

 **All platforms include:**

bindings/python/docs/development/contributing.md

Lines changed: 3 additions & 3 deletions
@@ -620,10 +620,10 @@ mkdocs build
 git add src/ tests/ docs/

 # Commit with clear message
-git commit -m "Add vector search distance function parameter
+git commit -m "Refine vector search docs and tests

-- Added distance_function parameter to create_vector_index()
-- Supports cosine, euclidean, and inner_product
+- Clarified SQL-first vector index workflow
+- Updated vector docs and tests
 - Added tests for all distance functions
 - Updated API documentation

bindings/python/docs/development/testing.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ Comprehensive testing documentation for ArcadeDB Python bindings.
 !!! success "Test Coverage"
     Current bindings suite

-    - **Current package**: 282 passed
+    - **Current package**: 290 passed
     - All ArcadeDB features working (SQL, OpenCypher, Studio)

 ## Quick Navigation

bindings/python/docs/development/testing/overview.md

Lines changed: 2 additions & 2 deletions
@@ -5,7 +5,7 @@ The ArcadeDB Python bindings have a comprehensive test suite covering all major
 ## Quick Statistics

 !!! success "Test Results"
-    - **Current package**: ✅ 282 passed
+    - **Current package**: ✅ 290 passed
     - Environment-specific skips may vary depending on optional components

 ## What's Tested
@@ -130,7 +130,7 @@ pytest -m "not slow"
 When the current bindings test suite passes, you should see a clean all-green summary.

 ```
-======================== 282 passed ========================
+======================== 290 passed ========================
 ```
