Document multi-vector embeddings and late-interaction search

lukekim · lukekim · commit 750f85ce9ab8 · 2026-04-21T11:10:41.000-07:00
Add a new Multi-Vector Search feature page covering column-of-vectors
embeddings (List<Utf8> source columns), aggregation strategies
(max/mean/sum), max_elements_per_row caps, and ColBERT-style
late-interaction multi-query vector_search.

Cross-link from the search index, vector search, and embeddings
component pages, and document the previously-missing aggregation and
max_elements_per_row fields in the datasets reference.
diff --git a/website/docs/components/embeddings/index.md b/website/docs/components/embeddings/index.md
@@ -303,6 +303,28 @@ datasets:
             row_id: id
 ```
 
+### Multi-Vector Embeddings
+
+When the source column is `List<Utf8>` (or `LargeList<Utf8>`), Spice embeds each list element independently and produces a `List<FixedSizeList<Float32, N>>` column. This is the multi-vector (column-of-vectors) mode, useful for rows that carry several independent pieces of text such as tags, section headings, or historical queries.
+
+```yaml
+datasets:
+  - from: file:products.parquet
+    name: products
+    acceleration:
+      enabled: true
+    columns:
+      - name: tags              # List<Utf8>
+        embeddings:
+          - from: local_embedding_model
+            aggregation: max
+            max_elements_per_row: 64
+```
+
+The `aggregation` field controls how per-element similarities are combined into a per-row score during vector search. `max` (default) is ColBERT-style `MaxSim`; `mean` and `sum` are also supported. The `max_elements_per_row` field caps how many list elements are embedded per row (default `32`, hard limit `1024`). Multi-vector columns also support [ColBERT-style late-interaction search](../features/search/multi-vector#late-interaction-multi-query-search) via an array of query strings.
+
+See [Multi-Vector Search](../features/search/multi-vector) for query usage and [`columns[*].embeddings[*]`](../reference/spicepod/datasets#columnsembeddings) for the full field reference.
+
 import DocCardList from '@theme/DocCardList';
 
 <DocCardList />
diff --git a/website/docs/features/search/index.md b/website/docs/features/search/index.md
@@ -22,6 +22,7 @@ Spice provides comprehensive search capabilities enabling developers to query da
 Spice supports multiple search methods:
 
 - **Vector Search**: Semantic search using embeddings to retrieve data by meaning and similarity.
+- **Multi-Vector Search**: Search over columns of vectors, including ColBERT-style late-interaction queries.
 - **Full-Text Search**: Keyword-driven search optimized for text data retrieval.
 - **Hybrid Search**: Combine multiple search methods using Reciprocal Rank Fusion (RRF) for improved relevance.
 - **SQL Search**: Traditional SQL queries for precise and structured searches.
@@ -52,6 +53,27 @@ LIMIT 5
 
 For complete SQL UDTF specifications, see [Vector-Based Search SQL UDTF](search/vector-search#sql-udtf).
 
+### Multi-Vector Search
+
+Multi-vector search operates on columns that store many vectors per row, such as per-tag or per-section embeddings. It also supports ColBERT-style late-interaction queries where the query itself is an array of strings.
+
+**Requirements:**
+
+- A list-typed source column (`List<Utf8>` or `LargeList<Utf8>`) with an embedding configured on it
+
+**Getting Started:**
+
+- [Multi-Vector Search Docs](search/multi-vector)
+
+**Example SQL Multi-Vector Search:**
+
+```sql
+SELECT product_id, name, score
+FROM vector_search(products, ['hiking', 'waterproof'], tags)
+ORDER BY score DESC
+LIMIT 10
+```
+
 ### Full-Text Search
 
 Full-text search efficiently retrieves records matching specific keywords.
diff --git a/website/docs/features/search/multi-vector.md b/website/docs/features/search/multi-vector.md
@@ -0,0 +1,110 @@
+---
+title: 'Multi-Vector Search'
+sidebar_label: 'Multi-Vector Search'
+description: 'Embed list-of-strings columns as a column of vectors and use ColBERT-style late-interaction search in Spice.'
+sidebar_position: 3
+tags:
+  - search
+  - embeddings
+  - models
+---
+
+A multi-vector column stores many embedding vectors per row rather than a single vector. Spice produces a multi-vector column by embedding each element of a `List<Utf8>` source column independently, yielding a `List<FixedSizeList<Float32, N>>` embedding column.
+
+Multi-vector embeddings are useful when a single row has several distinct pieces of text — for example, a product with many tags, a paper with multiple titles and section headings, or a user with a set of historical queries. Each element is embedded and scored separately, and per-row results are produced by aggregating the per-element similarities.
+
+## How Multi-Vector Differs from Chunking
+
+Chunking splits one long string (such as a document body) into pieces and embeds each piece. Multi-vector starts from a column that is already a list of independent strings and embeds each list element as-is.
+
+| Source column type | Embedding mode         | Produced embedding type           |
+| ------------------ | ---------------------- | --------------------------------- |
+| `Utf8`             | Scalar (default)       | `FixedSizeList<Float32, N>`       |
+| `Utf8` + chunking  | Chunked                | `List<FixedSizeList<Float32, N>>` |
+| `List<Utf8>`       | Multi-vector (default) | `List<FixedSizeList<Float32, N>>` |
+
+Multi-vector and chunked columns share the same Arrow type, but the per-element offsets column (`<column>_offsets`) is only produced for chunked columns.
+
+## Configuring a Multi-Vector Column
+
+Define an embedding on a `List<Utf8>` column the same way as on a scalar string column. Spice detects the list type and embeds each element independently.
+
+```yaml
+datasets:
+  - from: file:products.parquet
+    name: products
+    acceleration:
+      enabled: true
+    columns:
+      - name: tags              # List<Utf8>
+        embeddings:
+          - from: local_embedding_model
+            aggregation: max
+            max_elements_per_row: 64
+
+embeddings:
+  - from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
+    name: local_embedding_model
+```
+
+### Aggregation Strategies
+
+When a multi-vector column is queried with a single query string, each element's similarity to the query is computed, and the per-row score is the aggregate of those similarities.
+
+| `aggregation` | Description                                                                        |
+| ------------- | ---------------------------------------------------------------------------------- |
+| `max`         | ColBERT-style `MaxSim`. Row scores as high as its best-matching element (default). |
+| `mean`        | Average similarity across elements. Favors rows where most elements are relevant.  |
+| `sum`         | Sum of similarities. Biases toward rows with many matching elements.               |
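+
+As a worked illustration, assume a hypothetical row whose three tags have cosine similarities of 0.9, 0.5, and 0.1 to the query (made-up values, not output from Spice). Under that assumption, the three strategies would give:
+
+$$
+\begin{aligned}
+\text{max} &= \max(0.9, 0.5, 0.1) = 0.9 \\
+\text{mean} &= \tfrac{0.9 + 0.5 + 0.1}{3} = 0.5 \\
+\text{sum} &= 0.9 + 0.5 + 0.1 = 1.5
+\end{aligned}
+$$
+
+The one strong match sets the `max` score, the two weaker tags pull `mean` down, and every element adds to `sum`.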
+
+### Element Caps
+
+By default, Spice embeds only the first 32 elements of each row. Raise the cap with `max_elements_per_row` (hard-capped at `1024`). Elements beyond the cap are dropped with a warning log, so rows with unbounded tag counts do not inflate embedding cost.
+
+## Querying with `vector_search`
+
+A multi-vector column is queried with the standard `vector_search` UDTF. The configured `aggregation` is applied automatically.
+
+```sql
+SELECT product_id, name, score
+FROM vector_search(products, 'travel accessories', tags)
+ORDER BY score DESC
+LIMIT 10;
+```
+
+## Late-Interaction (Multi-Query) Search
+
+Multi-vector columns also support ColBERT-style late-interaction search, where the query itself is an array of strings. Each query is embedded independently, the best-matching element is selected for each query (`MaxSim`), and the per-row score is the sum of those maxima across queries. For a row `d` with element embeddings `e` and a set of query embeddings `Q`:
+
+$$
+\text{score}(d) = \sum_{q \in Q} \max_{e \in d} \cos(q, e)
+$$
+
+```sql
+SELECT product_id, name, score
+FROM vector_search(
+  products,
+  ['hiking', 'waterproof', 'lightweight'],
+  tags
+)
+ORDER BY score DESC
+LIMIT 10;
+```
+
+Late-interaction search is only supported on multi-vector columns; passing an array of queries to a scalar or chunked column returns an error. At most 32 query strings are accepted per call.
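+
+To make the scoring concrete, here is a sketch with assumed values. Suppose the query array is `['hiking', 'waterproof']` and a hypothetical row has three tags. Their cosine similarities are 0.8, 0.3, and 0.2 to `hiking`, and 0.1, 0.7, and 0.4 to `waterproof`. The numbers are illustrative only. Applying the formula above:
+
+$$
+\text{score}(d) = \max(0.8, 0.3, 0.2) + \max(0.1, 0.7, 0.4) = 0.8 + 0.7 = 1.5
+$$
+
+Because the score is a sum over queries, it can exceed 1. Scores from calls with different numbers of query strings are therefore probably not directly comparable.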
+
+## Passthrough Multi-Vector Columns
+
+Datasets that already contain multi-vector columns can be used directly when their schema matches the conventions in [Vector-Based Search](vector-search#using-existing-embeddings):
+
+- Column name: `<original_column>_embedding`
+- Type: `List<FixedSizeList<Float32 or Float64, N>>`
+- No offsets column (that is only required for chunked scalar columns)
+
+Declare the underlying column's embedding in `spicepod.yaml` so that Spice knows which embedding model the existing vectors came from.
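+
+A minimal sketch of such a declaration, reusing the fields shown earlier on this page. The file name is hypothetical, and the example assumes the file already holds both a `tags` column and a matching `tags_embedding` column:
+
+```yaml
+datasets:
+  - from: file:products_with_embeddings.parquet # hypothetical file with tags and tags_embedding
+    name: products
+    acceleration:
+      enabled: true
+    columns:
+      - name: tags # underlying List<Utf8> column
+        embeddings:
+          - from: local_embedding_model # assumed to be the model that produced tags_embedding
+```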
+
+## Limitations
+
+- Multi-vector embeddings require the source column to be `List<Utf8>` or `LargeList<Utf8>`.
+- Late-interaction search accepts at most 32 query strings per call.
+- Multi-vector columns cannot currently be stored in an external vector engine; use a [data accelerator](../../components/data-accelerators) with `acceleration.enabled: true` to cache embeddings.
diff --git a/website/docs/features/search/vector-search.md b/website/docs/features/search/vector-search.md
@@ -18,6 +18,8 @@ Vector search uses embeddings (numerical representations of text or data) to fin
 - Retrieval-augmented generation (RAG) applications
 - Recommendation systems
 
+For embedding columns that contain many vectors per row (for example, one vector per tag or per section), see [Multi-Vector Search](multi-vector).
+
 ## Embedding Models
 
 Spice supports two types of embedding providers:
diff --git a/website/docs/reference/spicepod/datasets.md b/website/docs/reference/spicepod/datasets.md
@@ -806,6 +806,38 @@ columns:
         vector_size: 1024
 ```
 
+## `columns[*].embeddings[*].aggregation`
+
+Optional. For multi-vector columns (`List<Utf8>` source), the strategy used to combine per-element similarities into a single per-row score during vector search. Only meaningful when the underlying column is list-typed.
+
+| Value  | Description                                                                        |
+| ------ | ---------------------------------------------------------------------------------- |
+| `max`  | ColBERT-style `MaxSim`. Row scores as high as its best-matching element (default). |
+| `mean` | Average similarity across elements.                                                |
+| `sum`  | Sum of similarities across elements.                                               |
+
+See [Multi-Vector Search](../../features/search/multi-vector) for details.
+
+```yaml
+columns:
+  - name: tags
+    embeddings:
+      - from: local_embedding_model
+        aggregation: max
+```
+
+## `columns[*].embeddings[*].max_elements_per_row`
+
+Optional. For multi-vector columns, the maximum number of list elements embedded per row. Defaults to `32`; hard-capped at `1024`. Excess elements are dropped with a warning log.
+
+```yaml
+columns:
+  - name: tags
+    embeddings:
+      - from: local_embedding_model
+        max_elements_per_row: 128
+```
+
 ## `columns[*].full_text_search` {#columns-search-full-text}
 
 ## `columns[*].full_text_search.enabled`
@@ -910,11 +942,11 @@ The `metadata` field serves two purposes:
 
 2. **File metadata columns** — For [file-based connectors](../../components/data-connectors/#metadata-columns) (S3, ABFS, File, FTP, SFTP, SMB, NFS, HTTP/HTTPS), the following reserved keys enable virtual columns that expose per-file object store metadata in query results:
 
-    | Key              | Value       | Column Type            | Description                        |
-    | ---------------- | ----------- | ---------------------- | ---------------------------------- |
-    | `_location`      | `enabled`   | `Utf8`                 | Full URI of the source file        |
-    | `_last_modified` | `enabled`   | `Timestamp(µs, "UTC")` | When the file was last modified    |
-    | `_size`          | `enabled`   | `UInt64`               | File size in bytes                 |
+    | Key              | Value     | Column Type            | Description                     |
+    | ---------------- | --------- | ---------------------- | ------------------------------- |
+    | `_location`      | `enabled` | `Utf8`                 | Full URI of the source file     |
+    | `_last_modified` | `enabled` | `Timestamp(µs, "UTC")` | When the file was last modified |
+    | `_size`          | `enabled` | `UInt64`               | File size in bytes              |
 
     ```yaml
     datasets:
diff --git a/website/docs/reference/sql/search.md b/website/docs/reference/sql/search.md
@@ -13,6 +13,7 @@ This section documents search capabilities in Spice SQL, including vector search
 - [Vector Search (`vector_search`)](#vector-search-vector_search)
   - [Usage](#usage)
     - [Example](#example)
+  - [Multi-Query (Late-Interaction) Form](#multi-query-late-interaction-form)
 - [Full-Text Search (`text_search`)](#full-text-search-text_search)
   - [Usage](#usage-1)
     - [Example](#example-1)
@@ -61,6 +62,17 @@ LIMIT 2;
 
 See [Vector-Based Search](../../features/search/vector-search) for configuration and advanced usage.
 
+### Multi-Query (Late-Interaction) Form
+
+When the target column is a [multi-vector column](../../features/search/multi-vector), `vector_search` also accepts an array of query strings. Each query is embedded independently and the per-row score is `Σ_q max_e cos(q, e)` — ColBERT-style late interaction. Passing an array to a scalar or chunked column returns an error. At most 32 query strings are accepted per call.
+
+```sql
+SELECT product_id, name, score
+FROM vector_search(products, ['hiking', 'waterproof', 'lightweight'], tags)
+ORDER BY score DESC
+LIMIT 10;
+```
+
 ---
 
 ## Full-Text Search (`text_search`)