Commit f364a82

Praneeth16 and praneeth_paikray-data authored
fix: update databricks-vector-search skill with search modes and operations docs (#318)
Add two new reference files and update existing docs for the databricks-vector-search skill:

- search-modes.md: ANN vs HYBRID vs FULL_TEXT decision guide, filter combinations, self-managed embedding patterns
- troubleshooting-and-operations.md: endpoint/index monitoring, cost optimization, capacity planning, migration workflows
- SKILL.md: expand hybrid search section, add reference file links, update latency numbers and filter syntax
- end-to-end-rag.md: align filter parameter names with databricks-vectorsearch package

Co-authored-by: Isaac
Co-authored-by: praneeth_paikray-data <praneeth.paikray@databricks.com>
1 parent 5528628 commit f364a82

4 files changed: +350 -21 lines changed

databricks-skills/databricks-vector-search/SKILL.md

Lines changed: 28 additions & 18 deletions
Original file line number | Diff line number | Diff line change
@@ -31,8 +31,8 @@ Databricks Vector Search provides managed vector similarity search with automati
3131

3232
| Type | Latency | Capacity | Cost | Best For |
3333
|------|---------|----------|------|----------|
34-
| **Standard** | ~50-100ms | 320M vectors (768 dim) | Higher | Real-time, low-latency |
35-
| **Storage-Optimized** | ~250ms | 1B+ vectors (768 dim) | 7x lower | Large-scale, cost-sensitive |
34+
| **Standard** | 20-50ms | 320M vectors (768 dim) | Higher | Real-time, low-latency |
35+
| **Storage-Optimized** | 300-500ms | 1B+ vectors (768 dim) | 7x lower | Large-scale, cost-sensitive |
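The trade-off in the updated table can be read as a simple decision rule. The sketch below is illustrative only: `choose_endpoint_type`, its constants, and the returned labels are hypothetical names, and the thresholds are taken directly from the table rows above (which assume 768-dimensional vectors), not from any API.

```python
# Thresholds copied from the endpoint table (768-dim vectors); treat as rough guides.
STANDARD_MAX_VECTORS = 320_000_000          # capacity listed for Standard
STORAGE_OPTIMIZED_MIN_LATENCY_MS = 300      # low end of the 300-500ms range


def choose_endpoint_type(num_vectors: int, latency_budget_ms: float) -> str:
    """Pick an endpoint type from index size and a per-query latency budget."""
    too_big_for_standard = num_vectors > STANDARD_MAX_VECTORS
    too_slow_on_storage_optimized = latency_budget_ms < STORAGE_OPTIMIZED_MIN_LATENCY_MS
    if too_big_for_standard and too_slow_on_storage_optimized:
        raise ValueError("No endpoint type in the table meets both size and latency needs")
    if too_slow_on_storage_optimized:
        return "STANDARD"
    # Either the index exceeds Standard capacity, or the latency budget is
    # relaxed enough that the lower-cost option is acceptable.
    return "STORAGE_OPTIMIZED"
```

For example, a 50M-vector index with a 100 ms budget maps to Standard, while the same index with a 1 s budget maps to Storage-Optimized on cost grounds.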
3636

3737
## Index Types
3838

@@ -184,13 +184,15 @@ results = w.vector_search_indexes.query_index(
184184

185185
### Hybrid Search (Semantic + Keyword)
186186

187+
Hybrid search combines vector similarity (ANN) with BM25 keyword scoring. Use it when queries contain exact terms that must match — SKUs, error codes, proper nouns, or technical terminology — where pure semantic search might miss keyword-specific results. See [search-modes.md](search-modes.md) for detailed guidance on choosing between ANN and hybrid search.
188+
187189
```python
188190
# Combines vector similarity with keyword matching
189191
results = w.vector_search_indexes.query_index(
190192
index_name="catalog.schema.my_index",
191193
columns=["id", "content"],
192-
query_text="machine learning algorithms",
193-
query_type="hybrid", # Enable hybrid search
194+
query_text="SPARK-12345 executor memory error",
195+
query_type="HYBRID",
194196
num_results=10
195197
)
196198
```
@@ -212,20 +214,26 @@ results = w.vector_search_indexes.query_index(
212214

213215
### Storage-Optimized Filters (SQL-like)
214216

217+
Storage-Optimized endpoints use SQL-like filter syntax via the `databricks-vectorsearch` package's `filters` parameter (accepts a string):
218+
215219
```python
216-
# filter_string uses SQL-like syntax
217-
results = w.vector_search_indexes.query_index(
218-
index_name="catalog.schema.my_index",
219-
columns=["id", "content"],
220+
from databricks.vector_search.client import VectorSearchClient
221+
222+
vsc = VectorSearchClient()
223+
index = vsc.get_index(endpoint_name="my-storage-endpoint", index_name="catalog.schema.my_index")
224+
225+
# SQL-like filter syntax for storage-optimized endpoints
226+
results = index.similarity_search(
220227
query_text="machine learning",
228+
columns=["id", "content"],
221229
num_results=10,
222-
filter_string="category = 'ai' AND status IN ('active', 'pending')"
230+
filters="category = 'ai' AND status IN ('active', 'pending')"
223231
)
224232

225233
# More filter examples
226-
filter_string="price > 100 AND price < 500"
227-
filter_string="department LIKE 'eng%'"
228-
filter_string="created_at >= '2024-01-01'"
234+
# filters="price > 100 AND price < 500"
235+
# filters="department LIKE 'eng%'"
236+
# filters="created_at >= '2024-01-01'"
229237
```
230238

231239
### Trigger Index Sync
@@ -253,6 +261,8 @@ scan_result = w.vector_search_indexes.scan_index(
253261
|-------|------|-------------|
254262
| Index Types | [index-types.md](index-types.md) | Detailed comparison of Delta Sync (managed/self-managed) vs Direct Access |
255263
| End-to-End RAG | [end-to-end-rag.md](end-to-end-rag.md) | Complete walkthrough: source table → endpoint → index → query → agent integration |
264+
| Search Modes | [search-modes.md](search-modes.md) | When to use semantic (ANN) vs hybrid search, decision guide |
265+
| Operations | [troubleshooting-and-operations.md](troubleshooting-and-operations.md) | Monitoring, cost optimization, capacity planning, migration |
256266

257267
## CLI Quick Reference
258268

@@ -288,7 +298,7 @@ databricks vector-search indexes delete-index \
288298
|-------|----------|
289299
| **Index sync slow** | Use Storage-Optimized endpoints (20x faster indexing) |
290300
| **Query latency high** | Use Standard endpoint for <100ms latency |
291-
| **filters_json not working** | Storage-Optimized uses `filter_string` (SQL syntax) |
301+
| **filters_json not working** | Storage-Optimized uses SQL-like string filters via the `databricks-vectorsearch` package's `filters` parameter |
292302
| **Embedding dimension mismatch** | Ensure query and index dimensions match |
293303
| **Index not updating** | Check pipeline_type; use sync_index() for TRIGGERED |
294304
| **Out of capacity** | Upgrade to Storage-Optimized (1B+ vectors) |
@@ -298,10 +308,10 @@ databricks vector-search indexes delete-index \
298308

299309
Databricks provides built-in embedding models:
300310

301-
| Model | Dimensions | Use Case |
302-
|-------|------------|----------|
303-
| `databricks-gte-large-en` | 1024 | English text, high quality |
304-
| `databricks-bge-large-en` | 1024 | English text, general |
311+
| Model | Dimensions | Context Window | Use Case |
312+
|-------|------------|----------------|----------|
313+
| `databricks-gte-large-en` | 1024 | 8192 tokens | English text, high quality |
314+
| `databricks-bge-large-en` | 1024 | 512 tokens | English text, general purpose |
305315

306316
```python
307317
# Use with managed embeddings
@@ -396,7 +406,7 @@ manage_vs_data(index_name="catalog.schema.my_index", operation="sync")
396406
- **Delta Sync recommended** — easier than Direct Access for most scenarios
397407
- **Hybrid search** — available for both Delta Sync and Direct Access indexes
398408
- **`columns_to_sync` matters** — only synced columns are available in query results; include all columns you need
399-
- **Filter syntax differs by endpoint** — Standard uses `filters_json` (dict), Storage-Optimized uses `filter_string` (SQL)
409+
- **Filter syntax differs by endpoint** — Standard uses dict-format filters, Storage-Optimized uses SQL-like string filters. Use the `databricks-vectorsearch` package's `filters` parameter, which accepts both formats.
400410
- **Management vs runtime** — MCP tools above handle lifecycle management; for agent tool-calling at runtime, use `VectorSearchRetrieverTool` or the Databricks managed Vector Search MCP server
401411

402412
## Related Skills

databricks-skills/databricks-vector-search/end-to-end-rag.md

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -119,13 +119,13 @@ query_vs_index(
119119
The filter syntax depends on the endpoint type used when creating the index.
120120

121121
```python
122-
# Storage-Optimized endpoint (used in this walkthrough): SQL-like filter_string
122+
# Storage-Optimized endpoint (used in this walkthrough): SQL-like filter syntax
123123
query_vs_index(
124124
index_name="catalog.schema.knowledge_base_index",
125125
columns=["doc_id", "title", "content"],
126126
query_text="How do I govern my data?",
127127
num_results=3,
128-
filter_string="category = 'governance'"
128+
filters="category = 'governance'"
129129
)
130130

131131
# Standard endpoint (if you created a Standard endpoint instead): JSON filters_json
@@ -236,4 +236,4 @@ Then sync — the index automatically handles deletions via Delta change data fe
236236
| **"Column not found in index"** | Column must be in `columns_to_sync`. Recreate index with the column included |
237237
| **Embeddings not computed** | Ensure `embedding_model_endpoint_name` is a valid serving endpoint |
238238
| **Stale results after table update** | For TRIGGERED pipelines, you must call `sync_vs_index` manually |
239-
| **Filter not working** | Standard endpoints use `filters_json` (dict), Storage-Optimized use `filter_string` (SQL) |
239+
| **Filter not working** | Standard endpoints use dict-format filters (`filters_json`); Storage-Optimized endpoints use SQL-like string filters (`filters`) |
databricks-skills/databricks-vector-search/search-modes.md

Lines changed: 142 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,142 @@
1+
# Vector Search Modes
2+
3+
Databricks Vector Search supports three search modes: **ANN** (semantic, default), **HYBRID** (semantic + keyword), and **FULL_TEXT** (keyword only, beta). ANN and HYBRID work with Delta Sync and Direct Access indexes.
4+
5+
## Semantic Search (ANN)
6+
7+
ANN (Approximate Nearest Neighbor) is the default search mode. It finds documents by vector similarity — matching the *meaning* of your query against stored embeddings.
8+
9+
### When to use
10+
11+
- Conceptual or meaning-based queries ("How do I handle errors in my pipeline?")
12+
- Paraphrased input where exact terms may not appear in the documents
13+
- Multilingual scenarios where query and document languages may differ
14+
- General-purpose RAG retrieval
15+
16+
### Example
17+
18+
```python
19+
# ANN is the default — no query_type parameter needed
20+
results = w.vector_search_indexes.query_index(
21+
index_name="catalog.schema.my_index",
22+
columns=["id", "content"],
23+
query_text="How do I handle errors in my pipeline?",
24+
num_results=5
25+
)
26+
```
27+
28+
## Hybrid Search
29+
30+
Hybrid search combines vector similarity (ANN) with BM25 keyword scoring. It runs a semantic match and a keyword match for the same query and merges the two result sets, so a document can surface because it is semantically similar, because it contains matching keywords, or both.
31+
32+
### When to use
33+
34+
- Queries containing exact terms that must appear: SKUs, product codes, error codes, acronyms
35+
- Proper nouns — company names, people, specific technologies
36+
- Technical documentation where terminology precision matters
37+
- Mixed-intent queries combining concepts with specific terms
38+
39+
### Example
40+
41+
```python
42+
results = w.vector_search_indexes.query_index(
43+
index_name="catalog.schema.my_index",
44+
columns=["id", "content"],
45+
query_text="SPARK-12345 executor memory error",
46+
query_type="HYBRID",
47+
num_results=10
48+
)
49+
```
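To make "merging" two rankings concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to combine a semantic ranking with a keyword ranking. This is an assumption for illustration: the text above does not say which merge method Databricks uses, and `rrf_merge`, the constant `k=60`, and the sample document IDs are all hypothetical.

```python
from collections import defaultdict


def rrf_merge(ranked_lists, k=60):
    """Merge ranked lists of document IDs with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks contribute more.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, descending; break ties on doc_id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = ["a", "b", "c"]  # hypothetical ANN ranking
keyword_hits = ["c", "a", "d"]   # hypothetical BM25 ranking
merged = rrf_merge([semantic_hits, keyword_hits])
print(merged)
```

Documents that appear in both lists ("a" and "c") end up ahead of documents found by only one retriever, which is the behavior hybrid search relies on for exact-term queries.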
50+
51+
## Decision Guide
52+
53+
| Mode | Best for | Trade-off | Choose when |
54+
|------|----------|-----------|-------------|
55+
| **ANN** (default) | Conceptual queries, paraphrases, meaning-based search | Fastest; may miss exact keyword matches | You want documents *about* a topic regardless of exact wording |
56+
| **HYBRID** | Exact terms, codes, proper nouns, mixed-intent queries | ~2x resource usage vs ANN; max 200 results | Your queries contain specific identifiers or technical terms that must appear in results |
57+
| **FULL_TEXT** (beta) | Pure keyword search without vector embeddings | No semantic understanding; max 200 results | You need keyword matching only, without vector similarity |
58+
59+
**Start with ANN.** Switch to HYBRID if relevant documents are being missed even though they contain the query's exact terms, such as identifiers, error codes, or product names.
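If you want to route queries automatically, the decision guide can be approximated with a small heuristic. This is a rough sketch, not part of either Databricks package: `looks_like_identifier` and `choose_query_type` are hypothetical helpers, and the rule (digits or all-caps acronyms imply HYBRID) is an assumption you should tune against your own queries.

```python
import string


def looks_like_identifier(token: str) -> bool:
    """True for tokens with digits (SKUs, error codes) or all-caps acronyms."""
    has_digit = any(ch.isdigit() for ch in token)
    is_acronym = len(token) >= 3 and token.isupper()
    return has_digit or is_acronym


def choose_query_type(query: str) -> str:
    """Return "HYBRID" when the query contains identifier-like tokens, else "ANN"."""
    tokens = (raw.strip(string.punctuation) for raw in query.split())
    return "HYBRID" if any(looks_like_identifier(t) for t in tokens) else "ANN"


print(choose_query_type("SPARK-12345 executor memory error"))
print(choose_query_type("How do I handle errors in my pipeline?"))
```

The returned string can be passed as `query_type` in the query calls shown in this file.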
60+
61+
## Combining Search Modes with Filters
62+
63+
ANN and HYBRID both support filters. The filter syntax depends on your endpoint type:
64+
65+
- **Standard endpoints**: `filters` as dict (or `filters_json` as JSON string via `databricks-sdk`)
66+
- **Storage-Optimized endpoints**: `filters` as SQL-like string (via `databricks-vectorsearch` package)
67+
68+
### Standard endpoint with hybrid search
69+
70+
```python
71+
results = w.vector_search_indexes.query_index(
72+
index_name="catalog.schema.my_index",
73+
columns=["id", "content", "category"],
74+
query_text="SPARK-12345 executor memory error",
75+
query_type="HYBRID",
76+
num_results=10,
77+
filters_json='{"category": "troubleshooting", "status": ["open", "in_progress"]}'
78+
)
79+
```
80+
81+
### Storage-Optimized endpoint with hybrid search
82+
83+
```python
84+
from databricks.vector_search.client import VectorSearchClient
85+
86+
vsc = VectorSearchClient()
87+
index = vsc.get_index(endpoint_name="my-storage-endpoint", index_name="catalog.schema.my_index")
88+
89+
results = index.similarity_search(
90+
query_text="SPARK-12345 executor memory error",
91+
columns=["id", "content", "category"],
92+
query_type="hybrid",
93+
num_results=10,
94+
filters="category = 'troubleshooting' AND status IN ('open', 'in_progress')"
95+
)
96+
```
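The two filter formats used in this section can be generated from one set of conditions. The helpers below are a sketch under stated assumptions: `to_filters_json` and `to_sql_filter` are hypothetical names, only equality and `IN` conditions are handled, and values containing single quotes are rejected because quote-escaping rules are not covered here.

```python
import json


def to_filters_json(conditions: dict) -> str:
    """Serialize conditions as the JSON string form used with Standard endpoints."""
    return json.dumps(conditions)


def _sql_literal(value) -> str:
    """Render one value as a literal for the SQL-like filter string."""
    if isinstance(value, str):
        if "'" in value:
            raise ValueError("quote escaping is not covered by this sketch")
        return f"'{value}'"
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    raise TypeError(f"unsupported filter value: {value!r}")


def to_sql_filter(conditions: dict) -> str:
    """Render conditions as a SQL-like string for Storage-Optimized endpoints."""
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            items = ", ".join(_sql_literal(v) for v in value)
            clauses.append(f"{column} IN ({items})")
        else:
            clauses.append(f"{column} = {_sql_literal(value)}")
    return " AND ".join(clauses)


conditions = {"category": "troubleshooting", "status": ["open", "in_progress"]}
print(to_filters_json(conditions))
print(to_sql_filter(conditions))
```

With the sample conditions, the two helpers reproduce the `filters_json` and `filters` values from the Standard and Storage-Optimized examples above.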
97+
98+
## Using with Pre-Computed Embeddings
99+
100+
If you compute embeddings yourself, use `query_vector` instead of `query_text` for ANN search:
101+
102+
```python
103+
# ANN with pre-computed embedding (default)
104+
results = w.vector_search_indexes.query_index(
105+
index_name="catalog.schema.my_index",
106+
columns=["id", "content"],
107+
query_vector=[0.1, 0.2, 0.3, ...], # Your embedding vector
108+
num_results=10
109+
)
110+
```
111+
112+
For **hybrid search with self-managed embeddings** (indexes without an associated model endpoint), you must provide **both** `query_vector` and `query_text`. The vector is used for the ANN component and the text for the BM25 keyword component:
113+
114+
```python
115+
# HYBRID with self-managed embeddings — requires both vector AND text
116+
results = w.vector_search_indexes.query_index(
117+
index_name="catalog.schema.my_index",
118+
columns=["id", "content"],
119+
query_vector=[0.1, 0.2, 0.3, ...], # For ANN similarity
120+
query_text="executor memory error", # For BM25 keyword matching
121+
query_type="HYBRID",
122+
num_results=10
123+
)
124+
```
125+
126+
**Notes:**
127+
- For **ANN** queries: provide either `query_text` or `query_vector`, not both.
128+
- For **HYBRID** queries on **managed embedding indexes**: provide only `query_text` (the system handles both components).
129+
- For **HYBRID** queries on **self-managed indexes without a model endpoint**: provide both `query_vector` and `query_text`.
130+
- When using `query_text` alone, the index must have an associated embedding model (managed embeddings or `embedding_model_endpoint_name` on a Direct Access index).
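The rules in these notes can be checked before a query is sent. The validator below is a hedged sketch: `validate_query_args` and its `has_embedding_model` flag are hypothetical and not part of either package, and FULL_TEXT is deliberately left out because the notes do not describe its argument rules.

```python
from typing import Optional, Sequence


def validate_query_args(
    query_type: str,
    query_text: Optional[str] = None,
    query_vector: Optional[Sequence[float]] = None,
    has_embedding_model: bool = True,
) -> None:
    """Raise ValueError if the arguments break the ANN/HYBRID rules in the notes."""
    has_text = query_text is not None
    has_vector = query_vector is not None
    mode = query_type.upper()

    # query_text on its own needs an embedding model attached to the index.
    if has_text and not has_vector and not has_embedding_model:
        raise ValueError("query_text alone requires an embedding model on the index")

    if mode == "ANN":
        if has_text == has_vector:
            raise ValueError("ANN: provide exactly one of query_text or query_vector")
    elif mode == "HYBRID":
        if has_embedding_model:
            if not has_text or has_vector:
                raise ValueError("HYBRID with a model endpoint: provide only query_text")
        elif not (has_text and has_vector):
            raise ValueError("HYBRID without a model endpoint: provide both arguments")
    else:
        raise ValueError(f"mode not covered by this sketch: {query_type}")
```

Calling it with `query_type="HYBRID"`, both `query_text` and `query_vector`, and `has_embedding_model=False` passes silently; dropping `query_text` from that call raises.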
131+
132+
## Parameter Reference
133+
134+
| Parameter | Type | Package | Description |
135+
|-----------|------|---------|-------------|
136+
| `query_text` | `str` | Both | Text query — requires embedding model on the index |
137+
| `query_vector` | `list[float]` | Both | Pre-computed embedding vector |
138+
| `query_type` | `str` | Both | `"ANN"` (default), `"HYBRID"`, or `"FULL_TEXT"` (beta) |
139+
| `columns` | `list[str]` | Both | Column names to return in results |
140+
| `num_results` | `int` | Both | Number of results (default: 10 in `databricks-sdk`, 5 in `databricks-vectorsearch`) |
141+
| `filters_json` | `str` | `databricks-sdk` | Filter dict serialized as a JSON string (Standard endpoints) |
142+
| `filters` | `str` or `dict` | `databricks-vectorsearch` | Dict for Standard, SQL-like string for Storage-Optimized |
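Because the filter parameter name and the `num_results` default differ between the two packages, it can help to build the keyword arguments in one place. The sketch below is illustrative: `build_query_kwargs` is a hypothetical helper that only assembles a dict using the parameter names from the table, and it does not call either package.

```python
import json


def build_query_kwargs(package, columns, num_results, query_text=None,
                       query_type="ANN", filters=None):
    """Assemble query keyword arguments using each package's parameter names."""
    if package not in ("databricks-sdk", "databricks-vectorsearch"):
        raise ValueError(f"unknown package: {package}")
    # num_results is always explicit so the differing defaults (10 vs 5) never apply.
    kwargs = {"columns": columns, "num_results": num_results, "query_type": query_type}
    if query_text is not None:
        kwargs["query_text"] = query_text
    if filters is not None:
        if package == "databricks-sdk":
            if not isinstance(filters, dict):
                raise TypeError("filters_json is built from a dict in this sketch")
            kwargs["filters_json"] = json.dumps(filters)
        else:
            # dict for Standard endpoints, SQL-like string for Storage-Optimized
            kwargs["filters"] = filters
    return kwargs


sdk_kwargs = build_query_kwargs(
    "databricks-sdk", ["id", "content"], 10,
    query_text="executor memory error", filters={"category": "troubleshooting"},
)
print(sdk_kwargs)
```

The resulting dict would be unpacked into the query call along with whatever identifies the index, for example `index_name` for `query_index`.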
