Commit 08e45e0

Refactor vector index creation and querying in Python bindings
- Updated vector index creation to use SQL commands for LSM_VECTOR with metadata.
- Enhanced vector search functionality to support key-based lookups and improved error handling.
- Added metadata retrieval for vector indices, including dimensions, similarity function, and id property.
- Introduced helper methods for querying vectors by key and retrieving index metadata.
- Improved example scripts to reflect changes in vector index creation and querying.
- Updated tests to cover new functionalities, including key-based searches and metadata validation.
- Adjusted example datasets and configurations for better performance and clarity.
1 parent bde69a4 commit 08e45e0

17 files changed

Lines changed: 889 additions & 186 deletions

bindings/python/docs/api/database.md

Lines changed: 11 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -717,6 +717,17 @@ with db.transaction():
717717
# Search
718718
query_vector = np.random.rand(384)
719719
results = index.find_nearest(query_vector, k=5)
720+
721+
# Preferred when you want richer query composition
722+
qvec_literal = "[" + ", ".join(str(float(x)) for x in query_vector.tolist()) + "]"
723+
rows = db.query(
724+
"sql",
725+
(
726+
"SELECT id, distance, (1 - distance) AS score "
727+
"FROM (SELECT expand(`vector.neighbors`('Document[embedding]', "
728+
f"{qvec_literal}, 5))) ORDER BY distance"
729+
),
730+
).to_list()
720731
```
721732

722733
See [Vector Search Guide](../guide/vectors.md) for details.
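
Note on the `qvec_literal` expression added in this hunk: the same inline expression recurs in several of the changed files. A minimal sketch of factoring it into a helper is shown below. `to_vector_literal` is a hypothetical name and is not part of the bindings, and the rejection of NaN/infinity is an added assumption about what is usable inside the SQL literal.

```python
import math
from typing import Iterable


def to_vector_literal(values: Iterable[float]) -> str:
    """Render numbers as an array literal such as "[0.9, 0.1, 0.0]".

    Hypothetical helper (not part of the ArcadeDB bindings); it mirrors the
    inline expression used in the docs in this commit.
    """
    parts = []
    for value in values:
        number = float(value)  # also converts NumPy scalars to plain floats
        if not math.isfinite(number):
            # Assumption: "nan" / "inf" cannot be used inside the SQL literal.
            raise ValueError(f"non-finite vector component: {number!r}")
        parts.append(str(number))
    if not parts:
        raise ValueError("vector must not be empty")
    return "[" + ", ".join(parts) + "]"


literal = to_vector_literal([0.9, 0.1, 0])
print(literal)
```

Because the vector is interpolated into the query text rather than bound as a parameter, routing it through `float()` keeps arbitrary strings out of the SQL.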

bindings/python/docs/api/vector.md

Lines changed: 85 additions & 0 deletions
@@ -106,6 +106,11 @@ print(type(py_list)) # <class 'list'>
106106

107107
Wrapper for ArcadeDB's vector index, providing similarity search capabilities.
108108

109+
Creation and configuration fit well in the Python object API. For search, prefer SQL
110+
or Cypher when you need filtering, projection, self-exclusion, or custom score
111+
shaping. The Python search methods below are convenience helpers for simple
112+
embedded-mode workflows.
113+
109114
### Creation via Database
110115

111116
Vector indexes are created using the `Database.create_vector_index()` method:
@@ -117,6 +122,7 @@ db.create_vector_index(
117122
vertex_type: str,
118123
vector_property: str,
119124
dimensions: int,
125+
id_property: str | None = None,
120126
distance_function: str = "cosine",
121127
max_connections: int = 16,
122128
beam_width: int = 100,
@@ -139,6 +145,8 @@ db.create_vector_index(
139145
- `vertex_type` (str): Vertex type containing vectors
140146
- `vector_property` (str): Property name storing vector arrays
141147
- `dimensions` (int): Vector dimensionality (must match your embeddings)
148+
- `id_property` (str | None): Optional property used for key-based lookup with
149+
`find_nearest_by_key()`. Defaults to the engine default (`"id"`) when omitted.
142150
- `distance_function` (str): Distance metric (default: `"cosine"`)
143151
- `"cosine"`: Cosine distance (1 - cosine similarity)
144152
- `"euclidean"`: Euclidean distance (L2 norm)
@@ -186,6 +194,7 @@ index = db.create_vector_index(
186194
vertex_type="Document",
187195
vector_property="embedding",
188196
dimensions=384, # Match your embedding model
197+
id_property="id",
189198
distance_function="cosine",
190199
max_connections=16,
191200
beam_width=100
@@ -247,6 +256,21 @@ for record, distance in neighbors:
247256
print(f" Text: {text[:100]}...")
248257
```
249258

259+
Preferred for richer query behavior:
260+
261+
```python
262+
qvec_literal = "[" + ", ".join(str(float(x)) for x in query_vector.tolist()) + "]"
263+
rows = db.query(
264+
"sql",
265+
(
266+
"SELECT id, distance, (1 - distance) AS score "
267+
"FROM (SELECT expand(`vector.neighbors`('Document[embedding]', "
268+
f"{qvec_literal}, 10))) WHERE id <> ? ORDER BY distance LIMIT 5"
269+
),
270+
"doc-42",
271+
).to_list()
272+
```
273+
250274
**Distance Interpretation:**
251275

252276
| Function | Range | Similarity direction |
@@ -261,6 +285,67 @@ for record, distance in neighbors:
261285

262286
---
263287

288+
### `VectorIndex.find_nearest_by_key(key, k=10, ef_search=None, allowed_rids=None)`
289+
290+
Find nearest neighbors by reusing the vector stored on an existing record.
291+
292+
This is the Python wrapper for the common "search from an existing record" workflow,
293+
using the index's configured `id_property` to look up the source vector first.
294+
295+
**Parameters:**
296+
297+
- `key`: Value of the configured ID property
298+
- `k` (int): Number of neighbors to return (default: 10)
299+
- `ef_search` (int | None): Optional exact-search beam width override
300+
- `allowed_rids` (List[str] | None): Optional RID whitelist to restrict search
301+
302+
**Returns:**
303+
304+
- `List[Tuple[record, float]]`: Same shape as `find_nearest()`
305+
306+
**Example:**
307+
308+
```python
309+
neighbors = index.find_nearest_by_key("doc-42", k=5)
310+
311+
for record, distance in neighbors:
312+
print(record.get("id"), distance)
313+
```
314+
315+
The helper keeps current nearest-neighbor semantics, so the source record may also be
316+
returned. If you want to exclude it, do that in SQL/Cypher with a `WHERE` clause.
317+
318+
---
319+
320+
### `VectorIndex.get_metadata()`
321+
322+
Return stable vector index metadata as a Python dictionary.
323+
324+
**Returns:**
325+
326+
- `dict` with keys such as:
327+
- `index_name`
328+
- `bucket_index_name`
329+
- `type_name`
330+
- `vector_property`
331+
- `dimensions`
332+
- `similarity_function`
333+
- `id_property`
334+
- `quantization`
335+
- `max_connections`
336+
- `beam_width`
337+
- `store_vectors_in_graph`
338+
- `build_state`
339+
340+
**Example:**
341+
342+
```python
343+
meta = index.get_metadata()
344+
print(meta["dimensions"], meta["similarity_function"], meta["id_property"])
345+
```
346+
347+
---
348+
264349
### `VectorIndex.build_graph_now()`
265350

266351
Force an immediate rebuild/preparation of the vector graph.
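
Note on `find_nearest_by_key()`: the added docs say the helper first looks up the source vector by key and then runs an ordinary nearest-neighbor search, so the source record may come back as its own nearest neighbor. The sketch below models that behavior with a brute-force scan over an in-memory dictionary. It is not the bindings' implementation (the real index is approximate and engine-backed), `nearest_by_key` and `exclude_self` are hypothetical names, and the distance follows the "1 - cosine similarity" definition from the parameter docs.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity, as defined in the parameter docs."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def nearest_by_key(
    vectors: Dict[str, Sequence[float]],
    key: str,
    k: int,
    exclude_self: bool = False,
) -> List[Tuple[str, float]]:
    """Brute-force model: look up the vector stored under `key`, then rank all records."""
    source = vectors[key]  # step 1: fetch the stored vector by its key
    ranked = sorted(
        ((other, cosine_distance(source, vec)) for other, vec in vectors.items()),
        key=lambda pair: pair[1],
    )
    if exclude_self:
        # Hypothetical option: drop the source record before truncating to k.
        ranked = [pair for pair in ranked if pair[0] != key]
    return ranked[:k]


docs = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}
with_self = nearest_by_key(docs, "doc-a", k=2)
without_self = nearest_by_key(docs, "doc-a", k=2, exclude_self=True)
print([name for name, _ in with_self])
print([name for name, _ in without_self])
```

In this model the source record takes one of the `k` slots at distance 0, which is why the SQL examples in this commit request more candidates than they need and filter the source out with `WHERE`.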

bindings/python/docs/examples/03_vector_search.md

Lines changed: 24 additions & 9 deletions
@@ -78,24 +78,39 @@ index = db.create_vector_index(
7878

7979
### 5. Semantic Search
8080

81-
Find the k most similar documents to a query embedding:
81+
Find the k most similar documents to a query embedding with SQL nearest-neighbor
82+
queries. This keeps search in the query layer, where filtering and score shaping are
83+
easy to express.
8284

8385
```python
8486
query_embedding = create_mock_embedding(category, "query")
8587
qvec_literal = "[" + ", ".join(str(float(x)) for x in query_embedding.tolist()) + "]"
8688
rows = db.query(
87-
"sql",
88-
f"SELECT vectorNeighbors('Article[embedding]', {qvec_literal}, 5) as res",
89+
"sql",
90+
(
91+
"SELECT title, category, distance, (1 - distance) AS score "
92+
"FROM (SELECT expand(`vector.neighbors`('Article[embedding]', "
93+
f"{qvec_literal}, 5))) ORDER BY distance"
94+
),
8995
).to_list()
9096

91-
for hit in rows[0].get("res", []):
92-
vertex = hit.get("record")
93-
distance = hit.get("distance")
94-
if vertex is not None:
95-
print(f"{vertex.get('title')}: {distance:.4f}")
97+
for hit in rows:
98+
print(f"{hit.get('title')}: {hit.get('distance'):.4f}")
9699
```
97100

98-
The `find_nearest()` method returns (vertex, distance) pairs sorted by distance.
101+
The example also shows a filtered query in the same category:
102+
103+
```python
104+
filtered_rows = db.query(
105+
"sql",
106+
(
107+
"SELECT title, category, distance, (1 - distance) AS score "
108+
"FROM (SELECT expand(`vector.neighbors`('Article[embedding]', "
109+
f"{qvec_literal}, 50))) WHERE category = ? ORDER BY distance LIMIT 5"
110+
),
111+
category,
112+
).to_list()
113+
```
99114

100115
## Example Output
101116
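
Note on the filtered query in this hunk: it requests 50 neighbors but returns at most 5 rows. Reading the query as "take the nearest candidates, then apply `WHERE`", a small candidate count can leave fewer matches than the `LIMIT`. The sketch below models that reading in plain Python with made-up data. `filtered_search` and the sample articles are hypothetical, and whether the engine evaluates the filter this way internally is an assumption.

```python
from typing import Dict, List


def filtered_search(
    hits: List[Dict[str, object]],
    candidates: int,
    category: str,
    limit: int,
) -> List[Dict[str, object]]:
    """Take the nearest `candidates` rows, then filter by category, then apply the limit."""
    nearest = sorted(hits, key=lambda hit: hit["distance"])[:candidates]
    matching = [hit for hit in nearest if hit["category"] == category]
    return matching[:limit]


# Made-up data: 40 articles, every fourth one in the "tech" category.
articles = [
    {
        "title": f"Article {i}",
        "category": "tech" if i % 4 == 0 else "other",
        "distance": i / 100,
    }
    for i in range(40)
]

narrow = filtered_search(articles, candidates=5, category="tech", limit=5)
wide = filtered_search(articles, candidates=50, category="tech", limit=5)
print(len(narrow), len(wide))
```

With only 5 candidates the filter leaves 2 rows. With 50 candidates, capped at the 40 available, it fills all 5, which matches the over-fetch pattern used in the query above.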

bindings/python/docs/examples/vectors.md

Lines changed: 15 additions & 1 deletion
@@ -115,7 +115,7 @@ with arcadedb.open_database("./vector_demo") as db:
115115
print(f"{record.get('name')}: {record.get('distance'):.4f}")
116116
```
117117

118-
#### SQL nearest-neighbor (preferred for DSL-first code):
118+
#### SQL nearest-neighbor (preferred for query-first code):
119119

120120
```python
121121
import arcadedb_embedded as arcadedb
@@ -135,6 +135,20 @@ with arcadedb.open_database("./vector_demo") as db:
135135
print(f"{record.get('name')}: {distance:.4f}")
136136
```
137137

138+
#### SQL filtered search with score shaping:
139+
140+
```python
141+
rows = db.query(
142+
"sql",
143+
(
144+
"SELECT name, description, distance, (1 - distance) AS score "
145+
"FROM (SELECT expand(`vector.neighbors`('Product[embedding]', "
146+
f"{qvec_literal}, 20))) WHERE name <> ? ORDER BY distance LIMIT 5"
147+
),
148+
"Laptop",
149+
).to_list()
150+
```
151+
138152
## Vector Functions
139153

140154
ArcadeDB provides several vector functions:

bindings/python/docs/guide/vectors.md

Lines changed: 110 additions & 4 deletions
@@ -39,16 +39,30 @@ with arcadedb.create_database("./vector_demo") as db:
3939
to_java_float_array(embedding),
4040
)
4141

42-
results = index.find_nearest([0.9, 0.1, 0.0], k=2)
43-
for vertex, score in results:
44-
print(vertex.get("text"), score)
42+
rows = db.query(
43+
"sql",
44+
"SELECT vectorNeighbors('Doc[embedding]', [0.9, 0.1, 0.0], 2) as res",
45+
).to_list()
46+
for hit in rows[0].get("res", []):
47+
record = hit.get("record")
48+
if record is not None:
49+
print(record.get("text"), hit.get("distance"))
4550
```
4651

4752
## API Essentials
4853

54+
Preferred split:
55+
56+
- Use Python object API for vector index creation and configuration.
57+
- Prefer SQL or Cypher for vector retrieval/search, because search composes naturally
58+
with filters, projections, and graph traversal.
59+
- Treat `find_nearest()` and `find_nearest_by_key()` as convenience wrappers for
60+
simple embedded-mode workflows.
61+
4962
- Vector property type must be `ARRAY_OF_FLOATS`.
5063
- `create_vector_index(vertex_type, vector_property, dimensions,
51-
distance_function="cosine", max_connections=16, beam_width=100, quantization="INT8",
64+
id_property=None, distance_function="cosine", max_connections=16,
65+
beam_width=100, quantization="INT8",
5266
location_cache_size=None, graph_build_cache_size=None, mutations_before_rebuild=None,
5367
store_vectors_in_graph=False, add_hierarchy=True, pq_subspaces=None, pq_clusters=None,
5468
pq_center_globally=None, pq_training_limit=None, build_graph_now=True)`
@@ -60,6 +74,11 @@ with arcadedb.create_database("./vector_demo") as db:
6074
- `ef_search` optionally overrides the exact-search beam width.
6175
- Leave it as `None` to use ArcadeDB's default/adaptive behavior.
6276
- `allowed_rids` filters candidates server-side (useful for metadata-prefilter).
77+
- `find_nearest_by_key(key, k=10, ef_search=None, allowed_rids=None)`
78+
- Looks up the source vector by the index `id_property` and then runs the same
79+
Python search path as `find_nearest()`.
80+
- `get_metadata()` returns stable index metadata such as dimensions, similarity
81+
function, configured `id_property`, quantization, and cache/build settings.
6382

6483
## Distance Functions (scoring behavior)
6584

@@ -161,6 +180,93 @@ rids = [row.get_rid() for row in db.query("sql", "SELECT @rid FROM Doc WHERE top
161180
results = index.find_nearest(query_vec, k=5, allowed_rids=rids)
162181
```
163182

183+
## Preferred Search Surface: SQL / Cypher
184+
185+
For new code, prefer query APIs for search.
186+
187+
### SQL filtered vector search with score shaping
188+
189+
```python
190+
qvec_literal = "[" + ", ".join(str(float(x)) for x in query_vec) + "]"
191+
192+
rows = db.query(
193+
"sql",
194+
(
195+
"SELECT title, category, distance, (1 - distance) AS score "
196+
"FROM (SELECT expand(`vector.neighbors`('Article[embedding]', "
197+
f"{qvec_literal}, 50))) WHERE category = ? ORDER BY distance LIMIT 5"
198+
),
199+
"category_42",
200+
).to_list()
201+
```
202+
203+
### SQL self-exclusion
204+
205+
```python
206+
rows = db.query(
207+
"sql",
208+
(
209+
"SELECT title, distance, (1 - distance) AS score "
210+
"FROM (SELECT expand(`vector.neighbors`('Movie[embedding]', "
211+
f"{qvec_literal}, 20))) WHERE title <> ? ORDER BY distance LIMIT 10"
212+
),
213+
movie_title,
214+
).to_list()
215+
```
216+
217+
### Cypher search with score shaping
218+
219+
```python
220+
rows = db.query(
221+
"opencypher",
222+
(
223+
"CALL vector.neighbors('Doc[embedding]', $vec, $k) "
224+
"YIELD name, distance RETURN name, (1 - distance) AS score ORDER BY score DESC"
225+
),
226+
{"vec": query_vec, "k": 5},
227+
).to_list()
228+
```
229+
230+
## Search from an Existing Record
231+
232+
```python
233+
with arcadedb.create_database("./vector_demo") as db:
234+
db.command("sql", "CREATE VERTEX TYPE Doc")
235+
db.command("sql", "CREATE PROPERTY Doc.slug STRING")
236+
db.command("sql", "CREATE PROPERTY Doc.embedding ARRAY_OF_FLOATS")
237+
238+
index = db.create_vector_index(
239+
vertex_type="Doc",
240+
vector_property="embedding",
241+
dimensions=3,
242+
id_property="slug",
243+
)
244+
245+
with db.transaction():
246+
db.command(
247+
"sql",
248+
"INSERT INTO Doc SET slug = ?, embedding = ?",
249+
"doc-a",
250+
to_java_float_array([1.0, 0.0, 0.0]),
251+
)
252+
db.command(
253+
"sql",
254+
"INSERT INTO Doc SET slug = ?, embedding = ?",
255+
"doc-b",
256+
to_java_float_array([0.9, 0.1, 0.0]),
257+
)
258+
259+
neighbors = index.find_nearest_by_key("doc-a", k=2)
260+
metadata = index.get_metadata()
261+
262+
print(metadata["dimensions"], metadata["id_property"])
263+
for record, distance in neighbors:
264+
print(record.get("slug"), distance)
265+
```
266+
267+
Use this helper when you want a small embedded-mode shortcut. For richer filtering,
268+
projection, self-exclusion, or score shaping, prefer SQL/Cypher queries.
269+
164270
## Quantization
165271

166272
- `quantization` accepts `"INT8"`, `"BINARY"`, `"PRODUCT"` (PQ), or `None` (full precision).
