Fix/vector search skill tool names (#364)

jacksandom · web-flow · commit 6bf677655259 · 2026-03-23T12:51:50.000-04:00
* fix: correct vector-search SKILL.md MCP tool names (#326)

  The MCP Tools section referenced 14 tool names, but the actual MCP server only implements 8. Two tools had wrong names and six don't exist (their functionality is bundled into other tools). Any session using the skill would call non-existent tools and fail silently.

  Changes:
  - Rewrote MCP Tools section to document the actual 8 tools
  - Added explicit code examples showing correct parameter names (e.g. `get_vs_index(index_name=...)` not `get_vs_index(name=...)`)
  - Added 2 new ground truth test cases to catch regressions

  Skill eval: 0.646 → 0.727 (+0.081), all 8 test cases corr=yes

  Co-authored-by: Isaac

* Document query_vector MCP truncation issue

* Lint fix
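The kind of drift this commit fixes (documented tool names that the server does not implement) could also be caught mechanically. The following is a minimal sketch, not part of this change: `IMPLEMENTED_TOOLS` is copied by hand from the rewritten MCP Tools section, and `documented_tools` and `check_tool_names` are hypothetical helper names introduced only for illustration. A real check would presumably read the tool list from the MCP server rather than hard-code it.

```python
import re

# Assumption: the 8 tool names are copied from the rewritten MCP Tools
# section of SKILL.md; they are not read from the MCP server itself.
IMPLEMENTED_TOOLS = {
    "create_or_update_vs_endpoint",
    "get_vs_endpoint",
    "delete_vs_endpoint",
    "create_or_update_vs_index",
    "get_vs_index",
    "delete_vs_index",
    "query_vs_index",
    "manage_vs_data",
}

# Matches a markdown table row whose first cell is a single backticked
# identifier, e.g. "| `get_vs_index` | Get index details ... |".
TOOL_ROW = re.compile(r"^\|\s*`([A-Za-z_][A-Za-z0-9_]*)`\s*\|")


def documented_tools(markdown: str) -> set[str]:
    """Collect backticked names from the first column of markdown tables."""
    names = set()
    for line in markdown.splitlines():
        match = TOOL_ROW.match(line.strip())
        if match:
            names.add(match.group(1))
    return names


def check_tool_names(markdown: str, implemented: set[str]) -> dict[str, set[str]]:
    """Report names that are documented but not implemented, and vice versa."""
    documented = documented_tools(markdown)
    return {
        "unknown": documented - implemented,
        "undocumented": implemented - documented,
    }


if __name__ == "__main__":
    # Two rows in the style of the old (incorrect) table.
    sample = "\n".join([
        "| Tool | Description |",
        "|------|-------------|",
        "| `create_vs_endpoint` | Create endpoint |",
        "| `get_vs_endpoint` | Get endpoint details |",
    ])
    report = check_tool_names(sample, IMPLEMENTED_TOOLS)
    print(sorted(report["unknown"]))
```

Run against the old table rows, the `unknown` set would flag names such as `create_vs_endpoint`. Rows whose first cell is not a bare backticked identifier (header rows, separator rows, bold troubleshooting entries) are ignored by the pattern.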
diff --git a/.test/skills/databricks-vector-search/ground_truth.yaml b/.test/skills/databricks-vector-search/ground_truth.yaml
@@ -297,6 +297,74 @@ test_cases:
     difficulty: easy
     source: auto_generated
 
+- id: vs_mcp_create_endpoint_008
+  inputs:
+    prompt: "Use MCP tools to create a vector search endpoint called my-rag-endpoint"
+  outputs:
+    response: |
+      ```python
+      result = create_or_update_vs_endpoint(
+          name="my-rag-endpoint",
+          endpoint_type="STANDARD"
+      )
+      ```
+
+      The `create_or_update_vs_endpoint` tool is idempotent — it returns the existing endpoint if one with that name already exists.
+    execution_success: true
+  expectations:
+    expected_facts:
+    - "create_or_update_vs_endpoint is the correct MCP tool name"
+    - "The tool is idempotent"
+    expected_patterns:
+    - pattern: "create_or_update_vs_endpoint"
+      min_count: 1
+      description: "Uses correct MCP tool name (not create_vs_endpoint)"
+    guidelines:
+    - "Must use create_or_update_vs_endpoint, NOT create_vs_endpoint"
+    - "Must mention that the tool is idempotent"
+  metadata:
+    category: happy_path
+    difficulty: easy
+    source: manual
+
+- id: vs_mcp_manage_data_009
+  inputs:
+    prompt: "How do I upsert documents into a Direct Access vector search index using MCP tools?"
+  outputs:
+    response: |
+      ```python
+      result = manage_vs_data(
+          index_name="catalog.schema.my_index",
+          operation="upsert",
+          inputs_json=[
+              {"id": "doc1", "content": "Sample document", "embedding": [0.1, 0.2, ...]},
+              {"id": "doc2", "content": "Another document", "embedding": [0.3, 0.4, ...]}
+          ]
+      )
+      ```
+
+      Use `manage_vs_data` with `operation="upsert"` to insert or update vectors. Other supported operations: `"delete"`, `"scan"`, `"sync"`.
+    execution_success: true
+  expectations:
+    expected_facts:
+    - "manage_vs_data is the correct MCP tool for data operations"
+    - "operation parameter accepts upsert, delete, scan, sync"
+    - "inputs_json contains the vector data to upsert"
+    expected_patterns:
+    - pattern: "manage_vs_data"
+      min_count: 1
+      description: "Uses manage_vs_data (not upsert_vs_data)"
+    - pattern: "upsert"
+      min_count: 1
+      description: "Specifies upsert operation"
+    guidelines:
+    - "Must use manage_vs_data with operation='upsert', NOT upsert_vs_data"
+    - "Must mention other available operations (delete, scan, sync)"
+  metadata:
+    category: happy_path
+    difficulty: medium
+    source: manual
+
 - id: vs_embedding_models_007
   inputs:
     prompt: "What embedding models are available for vector search indexes?"
diff --git a/databricks-skills/databricks-vector-search/SKILL.md b/databricks-skills/databricks-vector-search/SKILL.md
@@ -292,6 +292,7 @@ databricks vector-search indexes delete-index \
 | **Embedding dimension mismatch** | Ensure query and index dimensions match |
 | **Index not updating** | Check pipeline_type; use sync_index() for TRIGGERED |
 | **Out of capacity** | Upgrade to Storage-Optimized (1B+ vectors) |
+| **`query_vector` truncated by MCP tool** | MCP tool calls serialize arrays as JSON and can truncate large vectors (e.g. 1024-dim). Use `query_text` instead (for managed embedding indexes), or use the Databricks SDK/CLI to pass raw vectors |
 
 ## Embedding Models
 
@@ -320,29 +321,74 @@ The following MCP tools are available for managing Vector Search infrastructure.
 
 | Tool | Description |
 |------|-------------|
-| `create_vs_endpoint` | Create endpoint (STANDARD or STORAGE_OPTIMIZED). Async — check status with `get_vs_endpoint` |
-| `get_vs_endpoint` | Get endpoint details and status by name |
-| `list_vs_endpoints` | List all Vector Search endpoints in the workspace |
-| `delete_vs_endpoint` | Delete an endpoint (indexes must be deleted first) |
+| `create_or_update_vs_endpoint` | Create or update an endpoint (STANDARD or STORAGE_OPTIMIZED). Idempotent — returns existing if found |
+| `get_vs_endpoint` | Get endpoint details by name. Omit `name` to list all endpoints in the workspace |
+| `delete_vs_endpoint` | Delete an endpoint (all indexes must be deleted first) |
+
+```python
+# Create or update an endpoint
+result = create_or_update_vs_endpoint(name="my-vs-endpoint", endpoint_type="STANDARD")
+# Returns {"name": "my-vs-endpoint", "endpoint_type": "STANDARD", "created": True}
+
+# List all endpoints
+endpoints = get_vs_endpoint()  # omit name to list all
+```
 
 ### Index Management
 
 | Tool | Description |
 |------|-------------|
-| `create_vs_index` | Create a Delta Sync or Direct Access index on an endpoint |
-| `get_vs_index` | Get index details, status, and configuration |
-| `list_vs_indexes` | List all indexes on an endpoint |
-| `delete_vs_index` | Delete an index |
-| `sync_vs_index` | Trigger sync for TRIGGERED pipeline indexes |
+| `create_or_update_vs_index` | Create or update an index. Idempotent — auto-triggers initial sync for DELTA_SYNC indexes |
+| `get_vs_index` | Get index details by `index_name`. Pass `endpoint_name` (no `index_name`) to list all indexes on an endpoint |
+| `delete_vs_index` | Delete an index by fully-qualified name (catalog.schema.index_name) |
+
+```python
+# Create a Delta Sync index with managed embeddings
+result = create_or_update_vs_index(
+    name="catalog.schema.my_index",
+    endpoint_name="my-vs-endpoint",
+    primary_key="id",
+    index_type="DELTA_SYNC",
+    delta_sync_index_spec={
+        "source_table": "catalog.schema.docs",
+        "embedding_source_columns": [{"name": "content", "embedding_model_endpoint_name": "databricks-gte-large-en"}],
+        "pipeline_type": "TRIGGERED"
+    }
+)
+
+# Get a specific index by name — parameter is index_name, not name
+index = get_vs_index(index_name="catalog.schema.my_index")
+
+# List all indexes on an endpoint
+indexes = get_vs_index(endpoint_name="my-vs-endpoint")
+```
 
 ### Query and Data
 
 | Tool | Description |
 |------|-------------|
-| `query_vs_index` | Query index with `query_text`, `query_vector`, or hybrid (`query_type="HYBRID"`) |
-| `upsert_vs_data` | Upsert vectors into a Direct Access index |
-| `delete_vs_data` | Delete vectors from a Direct Access index by primary key |
-| `scan_vs_index` | Retrieve all vectors from an index (for debugging/export) |
+| `query_vs_index` | Query index with `query_text`, `query_vector`, or hybrid (`query_type="HYBRID"`). Prefer `query_text` over `query_vector` — MCP tool calls can truncate large embedding arrays (1024-dim) |
+| `manage_vs_data` | Data operations on an index. `operation`: `"upsert"` and `"delete"` (Direct Access indexes), `"scan"`, or `"sync"` (manual sync for TRIGGERED pipeline indexes) |
+
+```python
+# Query an index
+results = query_vs_index(
+    index_name="catalog.schema.my_index",
+    columns=["id", "content"],
+    query_text="machine learning best practices",
+    num_results=5
+)
+
+# Upsert data into a Direct Access index
+manage_vs_data(
+    index_name="catalog.schema.my_index",
+    operation="upsert",
+    inputs_json=[{"id": "doc1", "content": "...", "embedding": [0.1, 0.2, ...]}]
+)
+
+# Trigger manual sync for a TRIGGERED pipeline index
+manage_vs_data(index_name="catalog.schema.my_index", operation="sync")
+```
 
 ## Notes
 
diff --git a/databricks-tools-core/databricks_tools_core/sql/sql_utils/executor.py b/databricks-tools-core/databricks_tools_core/sql/sql_utils/executor.py
@@ -87,6 +87,7 @@ def execute(
             exec_params["row_limit"] = row_limit
         if query_tags:
             from databricks.sdk.service.sql import QueryTag
+
             exec_params["query_tags"] = [
                 QueryTag(key=k.strip(), value=v.strip())
                 for pair in query_tags.split(",")