Commit 750f85c

Document multi-vector embeddings and late-interaction search
Add a new Multi-Vector Search feature page covering column-of-vectors embeddings (List<Utf8> source columns), aggregation strategies (max/mean/sum), max_elements_per_row caps, and ColBERT-style late-interaction multi-query vector_search. Cross-link from the search index, vector search, and embeddings component pages, and document the previously-missing aggregation and max_elements_per_row fields in the datasets reference.
1 parent 789f138 commit 750f85c

6 files changed

Lines changed: 205 additions & 5 deletions

website/docs/components/embeddings/index.md

Lines changed: 22 additions & 0 deletions
@@ -303,6 +303,28 @@ datasets:
row_id: id
```

### Multi-Vector Embeddings

When the source column is `List<Utf8>` (or `LargeList<Utf8>`), Spice embeds each list element independently and produces a `List<FixedSizeList<Float32, N>>` column. This is the multi-vector (column-of-vectors) mode, useful for rows that carry several independent pieces of text such as tags, section headings, or historical queries.

```yaml
datasets:
  - from: file:products.parquet
    name: products
    acceleration:
      enabled: true
    columns:
      - name: tags # List<Utf8>
        embeddings:
          - from: local_embedding_model
            aggregation: max
            max_elements_per_row: 64
```

The `aggregation` field controls how per-element similarities are combined into a per-row score during vector search. `max` (default) is ColBERT-style `MaxSim`; `mean` and `sum` are also supported. The `max_elements_per_row` field caps how many list elements are embedded per row (default `32`, hard limit `1024`). Multi-vector columns also support [ColBERT-style late-interaction search](../features/search/multi-vector#late-interaction-multi-query-search) via an array of query strings.

See [Multi-Vector Search](../features/search/multi-vector) for query usage and [`columns[*].embeddings[*]`](../reference/spicepod/datasets#columnsembeddings) for the full field reference.

import DocCardList from '@theme/DocCardList';

<DocCardList />

website/docs/features/search/index.md

Lines changed: 22 additions & 0 deletions
@@ -22,6 +22,7 @@ Spice provides comprehensive search capabilities enabling developers to query da
Spice supports multiple search methods:

- **Vector Search**: Semantic search using embeddings to retrieve data by meaning and similarity.
- **Multi-Vector Search**: Search over columns of vectors, including ColBERT-style late-interaction queries.
- **Full-Text Search**: Keyword-driven search optimized for text data retrieval.
- **Hybrid Search**: Combine multiple search methods using Reciprocal Rank Fusion (RRF) for improved relevance.
- **SQL Search**: Traditional SQL queries for precise and structured searches.
@@ -52,6 +53,27 @@ LIMIT 5

For complete SQL UDTF specifications, see [Vector-Based Search SQL UDTF](search/vector-search#sql-udtf).

### Multi-Vector Search

Multi-vector search operates on columns that store many vectors per row, such as per-tag or per-section embeddings. It also supports ColBERT-style late-interaction queries where the query itself is an array of strings.

**Requirements:**

- A list-typed source column (`List<Utf8>`) embedded with a multi-vector aggregation

**Getting Started:**

- [Multi-Vector Search Docs](search/multi-vector)

**Example SQL Multi-Vector Search:**

```sql
SELECT product_id, name, score
FROM vector_search(products, ['hiking', 'waterproof'], tags)
ORDER BY score DESC
LIMIT 10
```

### Full-Text Search

Full-text search efficiently retrieves records matching specific keywords.
Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
---
title: 'Multi-Vector Search'
sidebar_label: 'Multi-Vector Search'
description: 'Embed list-of-strings columns as a column of vectors and use ColBERT-style late-interaction search in Spice.'
sidebar_position: 3
tags:
  - search
  - embeddings
  - models
---

A multi-vector column stores many embedding vectors per row rather than a single vector. Spice produces a multi-vector column by embedding each element of a `List<Utf8>` source column independently, yielding a `List<FixedSizeList<Float32, N>>` embedding column.

Multi-vector embeddings are useful when a single row has several distinct pieces of text: for example, a product with many tags, a paper with multiple titles and section headings, or a user with a set of historical queries. Each element is embedded and scored separately, and per-row results are produced by aggregating the per-element similarities.

## How Multi-Vector Differs from Chunking

Chunking splits one long string (such as a document body) into pieces and embeds each piece. Multi-vector starts from a column that is already a list of independent strings and embeds each list element as-is.

| Source column type | Embedding mode         | Produced embedding type           |
| ------------------ | ---------------------- | --------------------------------- |
| `Utf8`             | Scalar (default)       | `FixedSizeList<Float32, N>`       |
| `Utf8` + chunking  | Chunked                | `List<FixedSizeList<Float32, N>>` |
| `List<Utf8>`       | Multi-vector (default) | `List<FixedSizeList<Float32, N>>` |

Multi-vector and chunked columns share the same Arrow type, but the per-element offsets column (`<column>_offsets`) is only produced for chunked columns.
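
To make the distinction concrete, here is a minimal Python sketch. It is an illustration only, not how Spice implements either mode: `toy_embed`, `embed_chunked`, and `embed_multi_vector` are hypothetical names, the fixed-size character split stands in for a real chunker, and treating offsets as chunk start positions is an assumption.

```python
def toy_embed(text):
    """Stand-in for a real embedding model: returns a 2-dimensional vector."""
    vowels = sum(ch in "aeiou" for ch in text.lower())
    return [float(len(text)), float(vowels)]


def embed_chunked(document, chunk_size=20):
    """Chunked mode: split ONE string into pieces and embed each piece.

    Also returns the start position of each chunk (assumed meaning of the
    offsets that accompany chunked columns).
    """
    offsets = list(range(0, len(document), chunk_size))
    vectors = [toy_embed(document[start:start + chunk_size]) for start in offsets]
    return vectors, offsets


def embed_multi_vector(elements):
    """Multi-vector mode: the list already exists; embed each element as-is."""
    return [toy_embed(element) for element in elements]


body = "A lightweight waterproof jacket for hiking."
tags = ["hiking", "waterproof", "lightweight"]

chunk_vectors, chunk_offsets = embed_chunked(body)  # vectors plus offsets
tag_vectors = embed_multi_vector(tags)              # vectors only, no offsets
```

Both calls return a list of fixed-length vectors, which mirrors the shared `List<FixedSizeList<Float32, N>>` type; only the chunked path has offsets to report.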

## Configuring a Multi-Vector Column

Define an embedding on a `List<Utf8>` column the same way as a scalar string column. Spice detects the list type and embeds each element independently.

```yaml
datasets:
  - from: file:products.parquet
    name: products
    acceleration:
      enabled: true
    columns:
      - name: tags # List<Utf8>
        embeddings:
          - from: local_embedding_model
            aggregation: max
            max_elements_per_row: 64

embeddings:
  - from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
    name: local_embedding_model
```

### Aggregation Strategies

When a multi-vector column is queried with a single query string, each element's similarity to the query is computed, and the per-row score is the aggregate of those similarities.

| `aggregation` | Description                                                                        |
| ------------- | ---------------------------------------------------------------------------------- |
| `max`         | ColBERT-style `MaxSim`. Row scores as high as its best-matching element (default). |
| `mean`        | Average similarity across elements. Favors rows where most elements are relevant.  |
| `sum` | Sum of similarities. Biases toward rows with many matching elements. |
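
The three strategies can be sketched in a few lines of Python. This is a hedged illustration rather than Spice's implementation: `cosine` and `row_score` are hypothetical helper names, cosine similarity is assumed as the per-element measure, and returning `0.0` for an empty list is an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def row_score(query_vec, element_vecs, aggregation="max"):
    """Combine per-element similarities into a single per-row score."""
    sims = [cosine(query_vec, e) for e in element_vecs]
    if not sims:
        return 0.0  # assumption: a row with no elements scores zero
    if aggregation == "max":
        return max(sims)              # best-matching element wins
    if aggregation == "mean":
        return sum(sims) / len(sims)  # average over all elements
    if aggregation == "sum":
        return sum(sims)              # grows with the number of matches
    raise ValueError(f"unknown aggregation: {aggregation}")


query = [1.0, 0.0]
row = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three element embeddings

scores = {name: row_score(query, row, name) for name in ("max", "mean", "sum")}
```

For the same row, `max` ignores the two weaker elements, `mean` is pulled down by the unrelated one, and `sum` rewards the row for having more than one partial match.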

### Element Caps

Multi-vector columns default to embedding the first 32 elements per row. Raise the cap with `max_elements_per_row` (hard-capped at `1024`). Excess elements are dropped with a warning log so that rows with unbounded tag counts do not blow up embedding cost.
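
The capping behavior described above can be approximated as follows. This is only a sketch: `cap_elements` and the logger name are hypothetical, and clamping an over-limit setting to `1024` (rather than rejecting it) is an assumption, since the text does not say how an out-of-range value is handled.

```python
import logging

logger = logging.getLogger("multi_vector_sketch")  # hypothetical logger name

DEFAULT_MAX_ELEMENTS = 32  # default stated in the docs
HARD_LIMIT = 1024          # hard cap stated in the docs


def cap_elements(elements, max_elements_per_row=DEFAULT_MAX_ELEMENTS):
    """Keep only the first N list elements of a row before embedding."""
    # Assumption: a configured value above the hard limit is clamped to it.
    limit = min(max_elements_per_row, HARD_LIMIT)
    if len(elements) > limit:
        logger.warning(
            "dropping %d elements beyond the cap of %d",
            len(elements) - limit,
            limit,
        )
    return elements[:limit]


many_tags = [f"tag-{i}" for i in range(100)]
kept_default = cap_elements(many_tags)     # uses the default cap
kept_raised = cap_elements(many_tags, 64)  # cap raised via configuration
```

Because elements are kept in list order, the most important values should come first in the source column when rows may exceed the cap.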

## Querying with `vector_search`

A multi-vector column is queried with the standard `vector_search` UDTF. The configured `aggregation` is applied automatically.

```sql
SELECT product_id, name, score
FROM vector_search(products, 'travel accessories', tags)
ORDER BY score DESC
LIMIT 10;
```

## Late-Interaction (Multi-Query) Search

Multi-vector columns also support ColBERT-style late-interaction search, where the query itself is an array of strings. Each query is embedded independently, the best-matching element is selected for each query (`MaxSim`), and the per-row score is the sum across queries:

$$
\text{score}(d) = \sum_{q \in Q} \max_{e \in d} \cos(q, e)
$$

```sql
SELECT product_id, name, score
FROM vector_search(
  products,
  ['hiking', 'waterproof', 'lightweight'],
  tags
)
ORDER BY score DESC
LIMIT 10;
```

Late-interaction search is only supported on multi-vector columns; passing an array of queries to a scalar or chunked column returns an error. A maximum of 32 query strings are accepted per call.
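
The scoring formula above translates directly into code. The following Python sketch is illustrative only: `late_interaction_score` and `cosine` are hypothetical names, raising `ValueError` when more than 32 queries are passed is an assumed way of modeling the limit, and an empty row is assumed to score `0.0`.

```python
import math

MAX_QUERIES = 32  # per-call limit stated in the docs


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def late_interaction_score(query_vecs, element_vecs):
    """score(d) = sum over queries q of max over elements e of cos(q, e)."""
    if len(query_vecs) > MAX_QUERIES:
        # Assumption: exceeding the limit is modeled as an error.
        raise ValueError(f"at most {MAX_QUERIES} queries are accepted")
    if not element_vecs:
        return 0.0  # assumption: a row with no elements scores zero
    return sum(max(cosine(q, e) for e in element_vecs) for q in query_vecs)


queries = [[1.0, 0.0], [0.0, 1.0]]        # two embedded query strings
row_full = [[1.0, 0.0], [0.0, 1.0]]       # has a match for each query
row_partial = [[1.0, 0.0]]                # matches only the first query

score_full = late_interaction_score(queries, row_full)
score_partial = late_interaction_score(queries, row_partial)
```

A row is rewarded once per query it can answer, so a row covering every query term outranks a row that matches a single term very strongly.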

## Passthrough Multi-Vector Columns

Datasets that already contain multi-vector columns can be used directly when their schema matches the conventions in [Vector-Based Search](vector-search#using-existing-embeddings):

- Column name: `<original_column>_embedding`
- Type: `List<FixedSizeList<Float32 or Float64, N>>`
- No offsets column (that is only required for chunked scalar columns)

Declare the underlying column's embedding in `spicepod.yaml` so that Spice knows which embedding model the existing vectors came from.

## Limitations

- Multi-vector embeddings require the source column to be `List<Utf8>` or `LargeList<Utf8>`.
- Late-interaction search accepts at most 32 query strings per call.
- Multi-vector columns cannot currently be stored in an external vector engine; use a [data accelerator](../../components/data-accelerators) with `acceleration.enabled: true` to cache embeddings.

website/docs/features/search/vector-search.md

Lines changed: 2 additions & 0 deletions
@@ -18,6 +18,8 @@ Vector search uses embeddings (numerical representations of text or data) to fin
- Retrieval-augmented generation (RAG) applications
- Recommendation systems

For embedding columns that contain many vectors per row (for example, one vector per tag or per section), see [Multi-Vector Search](multi-vector).

## Embedding Models

Spice supports two types of embedding providers:

website/docs/reference/spicepod/datasets.md

Lines changed: 37 additions & 5 deletions
@@ -806,6 +806,38 @@ columns:
vector_size: 1024
```

## `columns[*].embeddings[*].aggregation`

Optional. For multi-vector columns (`List<Utf8>` source), the strategy used to combine per-element similarities into a single per-row score during vector search. Only meaningful when the underlying column is list-typed.

| Value  | Description                                                                        |
| ------ | ---------------------------------------------------------------------------------- |
| `max`  | ColBERT-style `MaxSim`. Row scores as high as its best-matching element (default). |
| `mean` | Average similarity across elements.                                                |
| `sum`  | Sum of similarities across elements.                                               |

See [Multi-Vector Search](../../features/search/multi-vector) for details.

```yaml
columns:
  - name: tags
    embeddings:
      - from: local_embedding_model
        aggregation: max
```

## `columns[*].embeddings[*].max_elements_per_row`

Optional. For multi-vector columns, the maximum number of list elements embedded per row. Defaults to `32`; hard-capped at `1024`. Excess elements are dropped with a warning log.

```yaml
columns:
  - name: tags
    embeddings:
      - from: local_embedding_model
        max_elements_per_row: 128
```

## `columns[*].full_text_search` {#columns-search-full-text}

## `columns[*].full_text_search.enabled`
@@ -910,11 +942,11 @@ The `metadata` field serves two purposes:

2. **File metadata columns**: For [file-based connectors](../../components/data-connectors/#metadata-columns) (S3, ABFS, File, FTP, SFTP, SMB, NFS, HTTP/HTTPS), the following reserved keys enable virtual columns that expose per-file object store metadata in query results:

| Key              | Value     | Column Type            | Description                     |
| ---------------- | --------- | ---------------------- | ------------------------------- |
| `_location`      | `enabled` | `Utf8`                 | Full URI of the source file     |
| `_last_modified` | `enabled` | `Timestamp(µs, "UTC")` | When the file was last modified |
| `_size`          | `enabled` | `UInt64`               | File size in bytes              |

```yaml
datasets:

website/docs/reference/sql/search.md

Lines changed: 12 additions & 0 deletions
@@ -13,6 +13,7 @@ This section documents search capabilities in Spice SQL, including vector search
- [Vector Search (`vector_search`)](#vector-search-vector_search)
  - [Usage](#usage)
  - [Example](#example)
  - [Multi-Query (Late-Interaction) Form](#multi-query-late-interaction-form)
- [Full-Text Search (`text_search`)](#full-text-search-text_search)
  - [Usage](#usage-1)
  - [Example](#example-1)
@@ -61,6 +62,17 @@ LIMIT 2;

See [Vector-Based Search](../../features/search/vector-search) for configuration and advanced usage.

### Multi-Query (Late-Interaction) Form

When the target column is a [multi-vector column](../../features/search/multi-vector), `vector_search` also accepts an array of query strings. Each query is embedded independently and the per-row score is `Σ_q max_e cos(q, e)`, which is ColBERT-style late interaction. Passing an array to a scalar or chunked column returns an error. At most 32 query strings are accepted per call.

```sql
SELECT product_id, name, score
FROM vector_search(products, ['hiking', 'waterproof', 'lightweight'], tags)
ORDER BY score DESC
LIMIT 10;
```

---

## Full-Text Search (`text_search`)
