| 1 | +# Vector Search Modes |
| 2 | + |
| 3 | +Databricks Vector Search supports three search modes: **ANN** (semantic, default), **HYBRID** (semantic + keyword), and **FULL_TEXT** (keyword only, beta). ANN and HYBRID work with Delta Sync and Direct Access indexes. |
| 4 | + |
| 5 | +## Semantic Search (ANN) |
| 6 | + |
| 7 | +ANN (Approximate Nearest Neighbor) is the default search mode. It finds documents by vector similarity — matching the *meaning* of your query against stored embeddings. |
| 8 | + |
| 9 | +### When to use |
| 10 | + |
| 11 | +- Conceptual or meaning-based queries ("How do I handle errors in my pipeline?") |
| 12 | +- Paraphrased input where exact terms may not appear in the documents |
| 13 | +- Multilingual scenarios where query and document languages may differ |
| 14 | +- General-purpose RAG retrieval |
| 15 | + |
| 16 | +### Example |
| 17 | + |
| 18 | +```python |
| 19 | +# Assumes w = databricks.sdk.WorkspaceClient(). ANN is the default, so query_type is omitted. |
| 20 | +results = w.vector_search_indexes.query_index( |
| 21 | + index_name="catalog.schema.my_index", |
| 22 | + columns=["id", "content"], |
| 23 | + query_text="How do I handle errors in my pipeline?", |
| 24 | + num_results=5 |
| 25 | +) |
| 26 | +``` |
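To make "vector similarity" concrete, here is a minimal sketch of the exact ranking that an ANN index approximates. It assumes cosine similarity and uses made-up two-dimensional vectors; this document does not state which metric the service uses, and `cosine_similarity`, `nearest`, and `toy_docs` are hypothetical names for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query_vec, doc_vecs, k):
    """Exact top-k document ids by cosine similarity (brute force)."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (assumed values, not produced by a real model).
toy_docs = {
    "retry-guide": [0.9, 0.1],
    "billing-faq": [0.0, 1.0],
    "error-handling": [0.7, 0.7],
}
top = nearest([1.0, 0.0], toy_docs, k=2)
```

A real index avoids this full scan by searching an approximate structure, trading a small amount of recall for speed.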
| 27 | + |
| 28 | +## Hybrid Search |
| 29 | + |
| 30 | +Hybrid search combines vector similarity (ANN) with BM25 keyword scoring. It retrieves documents that are both semantically similar *and* contain matching keywords, then merges the results. |
| 31 | + |
| 32 | +### When to use |
| 33 | + |
| 34 | +- Queries containing exact terms that must appear: SKUs, product codes, error codes, acronyms |
| 35 | +- Proper nouns — company names, people, specific technologies |
| 36 | +- Technical documentation where terminology precision matters |
| 37 | +- Mixed-intent queries combining concepts with specific terms |
| 38 | + |
| 39 | +### Example |
| 40 | + |
| 41 | +```python |
| 42 | +results = w.vector_search_indexes.query_index( |
| 43 | + index_name="catalog.schema.my_index", |
| 44 | + columns=["id", "content"], |
| 45 | + query_text="SPARK-12345 executor memory error", |
| 46 | + query_type="HYBRID", |
| 47 | + num_results=10 |
| 48 | +) |
| 49 | +``` |
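The "merge" step can be pictured with reciprocal rank fusion (RRF), a common way to combine two ranked lists. This is an illustrative sketch only: this document does not say which fusion method the service applies, and the function name, the constant `k=60`, and the document ids are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: each list adds 1 / (k + rank) to an id's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


ann_ranking = ["a", "b", "c"]    # hypothetical vector-similarity order
bm25_ranking = ["c", "a", "d"]   # hypothetical keyword order
fused = reciprocal_rank_fusion([ann_ranking, bm25_ranking])
```

Documents that rank well in both lists (`a` and `c` here) rise above documents that appear in only one.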
| 50 | + |
| 51 | +## Decision Guide |
| 52 | + |
| 53 | +| Mode | Best for | Trade-off | Choose when | |
| 54 | +|------|----------|-----------|-------------| |
| 55 | +| **ANN** (default) | Conceptual queries, paraphrases, meaning-based search | Fastest; may miss exact keyword matches | You want documents *about* a topic regardless of exact wording | |
| 56 | +| **HYBRID** | Exact terms, codes, proper nouns, mixed-intent queries | ~2x resource usage vs ANN; max 200 results | Your queries contain specific identifiers or technical terms that must appear in results | |
| 57 | +| **FULL_TEXT** (beta) | Pure keyword search without vector embeddings | No semantic understanding; max 200 results | You need keyword matching only, without vector similarity | |
| 58 | + |
| 59 | +**Start with ANN.** Switch to HYBRID if relevant documents are being missed even though they contain the exact terms, codes, or names used in the query. |
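If the mode is chosen per query, the guidance in the table can be approximated with a simple heuristic. The sketch below is an assumption, not a Databricks feature: `suggest_query_type` is a hypothetical helper that picks HYBRID when a query contains a token with digits (codes, SKUs) or an all-caps acronym, and ANN otherwise.

```python
import string


def suggest_query_type(query_text):
    """Heuristic: identifier-like tokens suggest HYBRID, otherwise ANN."""
    for raw in query_text.split():
        token = raw.strip(string.punctuation)
        has_digit = any(ch.isdigit() for ch in token)
        is_acronym = len(token) >= 2 and token.isalpha() and token.isupper()
        if has_digit or is_acronym:
            return "HYBRID"
    return "ANN"


mode_for_code = suggest_query_type("SPARK-12345 executor memory error")
mode_for_concept = suggest_query_type("How do I handle errors in my pipeline?")
```

The returned string could be passed as `query_type`. Tune or replace the rules based on the queries your index actually receives.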
| 60 | + |
| 61 | +## Combining Search Modes with Filters |
| 62 | + |
| 63 | +ANN and HYBRID both support filters. The filter syntax depends on your endpoint type: |
| 64 | + |
| 65 | +- **Standard endpoints** → `filters` as a dict (via `databricks-vectorsearch`), or `filters_json` as a JSON string (via `databricks-sdk`) |
| 66 | +- **Storage-Optimized endpoints** → `filters` as SQL-like string (via `databricks-vectorsearch` package) |
| 67 | + |
| 68 | +### Standard endpoint with hybrid search |
| 69 | + |
| 70 | +```python |
| 71 | +results = w.vector_search_indexes.query_index( |
| 72 | + index_name="catalog.schema.my_index", |
| 73 | + columns=["id", "content", "category"], |
| 74 | + query_text="SPARK-12345 executor memory error", |
| 75 | + query_type="HYBRID", |
| 76 | + num_results=10, |
| 77 | + filters_json='{"category": "troubleshooting", "status": ["open", "in_progress"]}' |
| 78 | +) |
| 79 | +``` |
| 80 | + |
| 81 | +### Storage-Optimized endpoint with hybrid search |
| 82 | + |
| 83 | +```python |
| 84 | +from databricks.vector_search.client import VectorSearchClient |
| 85 | + |
| 86 | +vsc = VectorSearchClient() |
| 87 | +index = vsc.get_index(endpoint_name="my-storage-endpoint", index_name="catalog.schema.my_index") |
| 88 | + |
| 89 | +results = index.similarity_search( |
| 90 | + query_text="SPARK-12345 executor memory error", |
| 91 | + columns=["id", "content", "category"], |
| 92 | + query_type="hybrid", |
| 93 | + num_results=10, |
| 94 | + filters="category = 'troubleshooting' AND status IN ('open', 'in_progress')" |
| 95 | +) |
| 96 | +``` |
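The two filter forms above express the same conditions, so they can be generated from one dict. The helpers below are a hypothetical sketch covering only equality and list membership. Doubling single quotes to escape string values is an assumption about the SQL-like syntax; check it against the filter documentation before relying on it.

```python
import json


def to_filters_json(conditions):
    """Serialize a filter dict for `filters_json` (Standard endpoints)."""
    return json.dumps(conditions)


def _sql_literal(value):
    # Assumption: strings are single-quoted, with embedded quotes doubled.
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def to_sql_filter(conditions):
    """Render a filter dict as a SQL-like string (Storage-Optimized endpoints)."""
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            items = ", ".join(_sql_literal(v) for v in value)
            clauses.append(f"{column} IN ({items})")
        else:
            clauses.append(f"{column} = {_sql_literal(value)}")
    return " AND ".join(clauses)


conditions = {"category": "troubleshooting", "status": ["open", "in_progress"]}
as_json = to_filters_json(conditions)
as_sql = to_sql_filter(conditions)
```

Here `as_json` and `as_sql` reproduce the filter strings used in the two hybrid examples in this section.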
| 97 | + |
| 98 | +## Using with Pre-Computed Embeddings |
| 99 | + |
| 100 | +If you compute embeddings yourself, use `query_vector` instead of `query_text` for ANN search: |
| 101 | + |
| 102 | +```python |
| 103 | +# ANN with pre-computed embedding (default) |
| 104 | +results = w.vector_search_indexes.query_index( |
| 105 | + index_name="catalog.schema.my_index", |
| 106 | + columns=["id", "content"], |
| 107 | + query_vector=[0.1, 0.2, 0.3, ...], # Your embedding vector |
| 108 | + num_results=10 |
| 109 | +) |
| 110 | +``` |
| 111 | + |
| 112 | +For **hybrid search with self-managed embeddings** (indexes without an associated model endpoint), you must provide **both** `query_vector` and `query_text`. The vector is used for the ANN component and the text for the BM25 keyword component: |
| 113 | + |
| 114 | +```python |
| 115 | +# HYBRID with self-managed embeddings — requires both vector AND text |
| 116 | +results = w.vector_search_indexes.query_index( |
| 117 | + index_name="catalog.schema.my_index", |
| 118 | + columns=["id", "content"], |
| 119 | + query_vector=[0.1, 0.2, 0.3, ...], # For ANN similarity |
| 120 | + query_text="executor memory error", # For BM25 keyword matching |
| 121 | + query_type="HYBRID", |
| 122 | + num_results=10 |
| 123 | +) |
| 124 | +``` |
| 125 | + |
| 126 | +**Notes:** |
| 127 | +- For **ANN** queries: provide either `query_text` or `query_vector`, not both. |
| 128 | +- For **HYBRID** queries on **managed embedding indexes**: provide only `query_text` (the system handles both components). |
| 129 | +- For **HYBRID** queries on **self-managed indexes without a model endpoint**: provide both `query_vector` and `query_text`. |
| 130 | +- When using `query_text` alone, the index must have an associated embedding model (managed embeddings or `embedding_model_endpoint_name` on a Direct Access index). |
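The notes above can be encoded as a small pre-flight check that runs before a query is sent. This is a sketch under stated assumptions: `check_query_args` and the `index_has_embedding_model` flag are hypothetical and not part of either package, and FULL_TEXT is left out because its argument rules are not described here.

```python
def check_query_args(query_type, index_has_embedding_model,
                     query_text=None, query_vector=None):
    """Return a list of problems; an empty list means the notes are satisfied."""
    has_text = query_text is not None
    has_vector = query_vector is not None
    problems = []
    if query_type == "ANN":
        if has_text == has_vector:
            problems.append("ANN needs exactly one of query_text or query_vector")
        elif has_text and not index_has_embedding_model:
            problems.append("query_text alone needs an embedding model on the index")
    elif query_type == "HYBRID":
        if index_has_embedding_model:
            if not has_text or has_vector:
                problems.append("HYBRID on a managed embedding index takes only query_text")
        elif not (has_text and has_vector):
            problems.append("HYBRID without a model endpoint needs both query_vector and query_text")
    else:
        problems.append(f"query_type {query_type!r} is not covered by this sketch")
    return problems


ok = check_query_args("HYBRID", False, query_text="executor memory error",
                      query_vector=[0.1, 0.2, 0.3])
missing_vector = check_query_args("HYBRID", False, query_text="executor memory error")
```

Returning a list of messages rather than raising lets the caller decide whether to log, fail, or fall back to another mode.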
| 131 | + |
| 132 | +## Parameter Reference |
| 133 | + |
| 134 | +| Parameter | Type | Package | Description | |
| 135 | +|-----------|------|---------|-------------| |
| 136 | +| `query_text` | `str` | Both | Text query; when passed without `query_vector`, the index must have an associated embedding model |
| 137 | +| `query_vector` | `list[float]` | Both | Pre-computed embedding vector | |
| 138 | +| `query_type` | `str` | Both | `"ANN"` (default), `"HYBRID"`, or `"FULL_TEXT"` (beta) |
| 139 | +| `columns` | `list[str]` | Both | Column names to return in results | |
| 140 | +| `num_results` | `int` | Both | Number of results (default: 10 in `databricks-sdk`, 5 in `databricks-vectorsearch`) | |
| 141 | +| `filters_json` | `str` | `databricks-sdk` | Filter dict serialized as a JSON string (Standard endpoints) |
| 142 | +| `filters` | `str` or `dict` | `databricks-vectorsearch` | Dict for Standard, SQL-like string for Storage-Optimized | |