|
1 | 1 | # Vector Search Guide |
2 | 2 |
|
3 | | -Vector search enables semantic similarity search using embeddings from machine learning models. This guide covers strategies, best practices, and patterns for implementing vector search with ArcadeDB. |
| 3 | +Vector search enables semantic similarity search using embeddings from machine learning |
| 4 | +models. This guide covers strategies, best practices, and patterns for implementing |
| 5 | +vector search with ArcadeDB. |
4 | 6 |
|
5 | 7 | ## Overview |
6 | 8 |
|
7 | | -Vector search transforms your data into high-dimensional vectors (embeddings) and finds similar items using distance metrics. Perfect for: |
| 9 | +Vector search transforms your data into high-dimensional vectors (embeddings) and finds |
| 10 | +similar items using distance metrics. Perfect for: |
8 | 11 |
|
9 | 12 | - **Semantic Search**: Find documents by meaning, not just keywords |
10 | 13 | - **Recommendation Systems**: Find similar products, users, or content |
@@ -272,66 +275,91 @@ index = db.create_vector_index( |
272 | 275 |
|
273 | 276 | ## Index Parameters |
274 | 277 |
|
275 | | -### Max Connections (m) |
| 278 | +### Max Connections |
276 | 279 |
|
277 | | -Controls connections per node in the graph. Maps to `maxConnections` in JVector. |
| 280 | +Controls connections per node in the graph. Maps to `maxConnections` in JVector and `M` |
| 281 | +in HNSW. |
278 | 282 |
|
279 | 283 | ```python |
280 | 284 | index = db.create_vector_index( |
281 | 285 | vertex_type="Doc", |
282 | 286 | vector_property="embedding", |
283 | 287 | dimensions=384, |
284 | | - max_connections=16 # Number of connections |
| 288 | + max_connections=32 # Number of connections (default: 32) |
285 | 289 | ) |
286 | 290 | ``` |
287 | 291 |
|
288 | 292 | **Trade-offs:** |
289 | 293 |
|
290 | 294 | | Max Connections | Recall | Memory | Build Speed | Search Speed | |
291 | 295 | |-----------------|--------|--------|-------------|--------------| |
292 | | -| 8-12 | Lower | Low | Fast | Fast | |
293 | | -| 16-24 | Good | Medium | Medium | Medium | |
294 | | -| 32-48 | High | High | Slow | Slow | |
| 296 | +| 16 | Lower | Low | Fast | Fast | |
| 297 | +| 32 (Default) | Good | Medium | Medium | Medium | |
| 298 | +| 64 | High | High | Slow | Slow | |
295 | 299 |
|
296 | 300 | **Recommendations:** |
297 | 301 | - **Small datasets (<100K)**: max_connections=16 |
298 | | -- **Medium datasets (100K-1M)**: max_connections=24 |
299 | | -- **Large datasets (>1M)**: max_connections=32-48 |
| 302 | +- **Medium datasets (100K-1M)**: max_connections=32 (default) |
| 303 | +- **Large datasets (>1M)**: max_connections=64 |
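The size-based recommendations above can be written as a small helper. This is only a sketch: `recommend_max_connections` is a hypothetical name, not part of the ArcadeDB API, and the thresholds are copied from the list above.

```python
def recommend_max_connections(num_vectors: int) -> int:
    """Map a dataset size to the max_connections value suggested in this guide.

    Hypothetical helper for illustration only; tune against your own recall
    and latency measurements.
    """
    if num_vectors < 100_000:
        return 16  # small datasets
    if num_vectors <= 1_000_000:
        return 32  # medium datasets (the default)
    return 64      # large datasets


for size in (50_000, 500_000, 5_000_000):
    print(f"{size:>9,} vectors: max_connections={recommend_max_connections(size)}")
```

The returned value would then be passed as `max_connections` to `db.create_vector_index(...)` as in the example earlier in this section.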
300 | 304 |
|
301 | 305 | --- |
302 | 306 |
|
303 | 307 | ### Beam Width (ef) |
304 | 308 |
|
305 | | -Controls search quality vs speed. Maps to `beamWidth` in JVector. |
| 309 | +Controls search quality vs speed. Maps to `beamWidth` in JVector and `ef_construction` |
| 310 | +in HNSW. |
306 | 311 |
|
307 | 312 | ```python |
308 | 313 | index = db.create_vector_index( |
309 | 314 | vertex_type="Doc", |
310 | 315 | vector_property="embedding", |
311 | 316 | dimensions=384, |
312 | | - beam_width=128 # Search candidate list size |
| 317 | + beam_width=256 # Search candidate list size (default: 256) |
313 | 318 | ) |
314 | 319 | ``` |
315 | 320 |
|
316 | 321 | **Trade-offs:** |
317 | 322 |
|
318 | 323 | | Beam Width | Recall | Search Speed | |
319 | 324 | |------------|--------|--------------| |
320 | | -| 50-100 | Lower | Fast | |
321 | | -| 128-200 | Good | Medium | |
322 | | -| 200-400 | High | Slow | |
| 325 | +| <256 | Lower | Fast | |
| 326 | +| 256 (Default) | Good | Medium | |
| 327 | +| >256 | High | Slow | |
323 | 328 |
|
324 | 329 | **Recommendations:** |
325 | | -- **Fast search**: beam_width=50-100 |
326 | | -- **Balanced**: beam_width=128-200 |
327 | | -- **High accuracy**: beam_width=200-400 |
| 330 | +- **Fast search**: beam_width=128 |
| 331 | +- **Balanced**: beam_width=256 (default) |
| 332 | +- **High accuracy**: beam_width=512 |
| 333 | + |
| 334 | +--- |
| 335 | + |
| 336 | +### Overquery Factor |
| 337 | + |
| 338 | +Controls search-time accuracy by exploring more candidates than requested. This is |
| 339 | +similar to `ef_search` in HNSW. |
| 340 | + |
| 341 | +```python |
| 342 | +# Actual search will explore k * overquery_factor candidates |
| 343 | +results = index.find_nearest( |
| 344 | + query_embedding, |
| 345 | + k=10, |
| 346 | + overquery_factor=16 # Default: 16 |
| 347 | +) |
| 348 | +``` |
| 349 | + |
| 350 | +**Trade-offs:** |
| 351 | + |
| 352 | +| Factor | Recall | Search Speed | |
| 353 | +|--------|--------|--------------| |
| 354 | +| <16 | Low | Fast | |
| 355 | +| 16 | Decent | Medium | |
| 356 | +| >16 | High | Slow | |
328 | 357 |
|
329 | | -**Recommendations:** |
330 | | -- **Fast iteration**: ef_construction=100 |
331 | | -- **Production**: ef_construction=200 |
332 | | -- **Maximum quality**: ef_construction=400 |
333 | 358 |
|
334 | | -**Note:** Higher ef_construction improves recall but only affects index building, not search. |
| 359 | +**Recommendations:** |
| 360 | +- **Fast search**: overquery_factor=8 |
| 361 | +- **Balanced**: overquery_factor=16 (default) |
| 362 | +- **High accuracy**: overquery_factor=32 |
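To make the cost of each setting concrete, the arithmetic from the code comment above (the search explores `k * overquery_factor` candidates) can be tabulated. This is a sketch with a hypothetical helper name, `candidates_explored`; it assumes the multiplication is exact, whereas the real engine may cap or round the value.

```python
def candidates_explored(k: int, overquery_factor: int = 16) -> int:
    """Number of candidates the search explores for a top-k query.

    Assumes the simple product described in this guide; illustrative only.
    """
    return k * overquery_factor


k = 10
for factor in (8, 16, 32):
    print(f"overquery_factor={factor:>2}: {candidates_explored(k, factor)} candidates for k={k}")
```

Doubling the factor doubles the candidates examined per query, which is why recall and latency rise together.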
335 | 363 |
|
336 | 364 | ## Schema Design |
337 | 365 |
|
@@ -426,10 +454,33 @@ for vertex, distance in results: |
426 | 454 |
|
427 | 455 | ### Hybrid Search (Vector + Filters) |
428 | 456 |
|
429 | | -Combine vector similarity with metadata filters: |
| 457 | +Combine vector similarity with metadata filters. |
| 458 | + |
| 459 | +**Option 1: Pre-filtering (Recommended)** |
| 460 | + |
| 461 | +Filter candidates *before* vector search using `allowed_rids`. This is usually preferable |
| 462 | +because every returned result (up to `k`) already matches your criteria. |
| 463 | + |
| 464 | +```python |
| 465 | +# 1. Query for matching RIDs using SQL or index lookup |
| 466 | +rs = db.query("sql", "SELECT @rid FROM Article WHERE category = 'Programming'") |
| 467 | +allowed_rids = [doc.getIdentity().toString() for doc in rs] |
| 468 | + |
| 469 | +# 2. Perform vector search restricted to those RIDs |
| 470 | +query_embedding = model.encode("python tutorial") |
| 471 | +results = index.find_nearest(query_embedding, k=10, allowed_rids=allowed_rids) |
| 472 | + |
| 473 | +for vertex, distance in results: |
| 474 | + print(f"{vertex.get('title')} (distance: {distance:.4f})") |
| 475 | +``` |
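The effect of filtering before the search rather than after it can be seen on a toy dataset. The sketch below does not use ArcadeDB at all: `brute_force_nearest` is a hypothetical stand-in for `index.find_nearest`, and the RIDs, categories, and 2-D embeddings are made up for illustration.

```python
import math

# Toy corpus: (rid, category, 2-D embedding). All values are invented.
DOCS = [
    ("#1:0", "Cooking", (0.9, 0.1)),
    ("#1:1", "Cooking", (0.8, 0.2)),
    ("#1:2", "Programming", (0.1, 0.9)),
    ("#1:3", "Cooking", (0.7, 0.3)),
    ("#1:4", "Programming", (0.2, 0.8)),
]


def brute_force_nearest(query, k, allowed_rids=None):
    """Return the k closest (rid, distance) pairs, optionally restricted to allowed_rids."""
    pool = [d for d in DOCS if allowed_rids is None or d[0] in allowed_rids]
    scored = [(rid, math.dist(query, vec)) for rid, _, vec in pool]
    return sorted(scored, key=lambda pair: pair[1])[:k]


query = (1.0, 0.0)
programming = {rid for rid, category, _ in DOCS if category == "Programming"}

# Pre-filtering: restrict the candidate pool first, then take the top k.
pre = brute_force_nearest(query, k=2, allowed_rids=programming)

# Filtering afterwards: take the top k overall, then drop non-matching rows.
post = [(rid, d) for rid, d in brute_force_nearest(query, k=2) if rid in programming]

print(f"pre-filtered results:  {len(pre)}")
print(f"post-filtered results: {len(post)}")
```

Here the two nearest neighbours overall are both `Cooking` documents, so filtering afterwards leaves nothing, while pre-filtering still returns two `Programming` matches.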
| 476 | + |
| 477 | +**Option 2: Post-filtering** |
| 478 | + |
| 479 | +Filter candidates *after* vector search. This is simpler but may return fewer than `k` |
| 480 | +results if many top candidates are filtered out. |
430 | 481 |
|
431 | 482 | ```python |
432 | | -# Get candidates from vector search |
| 483 | +# Get candidates from vector search (oversample with larger k) |
433 | 484 | query_embedding = model.encode("python tutorial") |
434 | 485 | candidates = index.find_nearest(query_embedding, k=100) |
435 | 486 |
|
|