| 1 | +--- |
| 2 | +title: 'Multi-Vector Search' |
| 3 | +sidebar_label: 'Multi-Vector Search' |
| 4 | +description: 'Embed list-of-strings columns as a column of vectors and use ColBERT-style late-interaction search in Spice.' |
| 5 | +sidebar_position: 3 |
| 6 | +tags: |
| 7 | + - search |
| 8 | + - embeddings |
| 9 | + - models |
| 10 | +--- |
| 11 | + |
| 12 | +A multi-vector column stores many embedding vectors per row rather than a single vector. Spice produces a multi-vector column by embedding each element of a `List<Utf8>` source column independently, yielding a `List<FixedSizeList<Float32, N>>` embedding column. |
| 13 | + |
| 14 | +Multi-vector embeddings are useful when a single row has several distinct pieces of text, such as a product with many tags, a paper with multiple titles and section headings, or a user with a set of historical queries. Each element is embedded and scored separately, and the per-row result is produced by aggregating the per-element similarities. |
| 15 | + |
| 16 | +## How Multi-Vector Differs from Chunking |
| 17 | + |
| 18 | +Chunking splits one long string (such as a document body) into pieces and embeds each piece. Multi-vector starts from a column that is already a list of independent strings and embeds each list element as-is. |
| 19 | + |
| 20 | +| Source column type | Embedding mode | Produced embedding type | |
| 21 | +| ------------------ | ---------------------- | --------------------------------- | |
| 22 | +| `Utf8` | Scalar (default) | `FixedSizeList<Float32, N>` | |
| 23 | +| `Utf8` + chunking | Chunked | `List<FixedSizeList<Float32, N>>` | |
| 24 | +| `List<Utf8>` | Multi-vector (default) | `List<FixedSizeList<Float32, N>>` | |
| 25 | + |
| 26 | +Multi-vector and chunked columns share the same Arrow type, but the per-element offsets column (`<column>_offsets`) is only produced for chunked columns. |
| 27 | + |
| 28 | +## Configuring a Multi-Vector Column |
| 29 | + |
| 30 | +Define an embedding on a `List<Utf8>` column the same way as a scalar string column. Spice detects the list type and embeds each element independently. |
| 31 | + |
| 32 | +```yaml |
| 33 | +datasets: |
| 34 | + - from: file:products.parquet |
| 35 | + name: products |
| 36 | + acceleration: |
| 37 | + enabled: true |
| 38 | + columns: |
| 39 | + - name: tags # List<Utf8> |
| 40 | + embeddings: |
| 41 | + - from: local_embedding_model |
| 42 | + aggregation: max |
| 43 | + max_elements_per_row: 64 |
| 44 | + |
| 45 | +embeddings: |
| 46 | + - from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2 |
| 47 | + name: local_embedding_model |
| 48 | +``` |
| 49 | + |
| 50 | +### Aggregation Strategies |
| 51 | + |
| 52 | +When a multi-vector column is queried with a single query string, each element's similarity to the query is computed, and the per-row score is the aggregate of those similarities. |
| 53 | + |
| 54 | +| `aggregation` | Description | |
| 55 | +| ------------- | ---------------------------------------------------------------------------------- | |
| 56 | +| `max` | ColBERT-style `MaxSim`. A row scores as high as its best-matching element (default). |
| 57 | +| `mean` | Average similarity across elements. Favors rows where most elements are relevant. | |
| 58 | +| `sum` | Sum of similarities. Biases toward rows with many matching elements. | |
| 59 | + |
| 60 | +### Element Caps |
| 61 | + |
| 62 | +Multi-vector columns default to embedding the first 32 elements per row. Raise the cap with `max_elements_per_row` (hard-capped at `1024`). Elements beyond the cap are dropped and a warning is logged, so rows with unbounded tag counts do not inflate embedding cost. |
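The truncation behavior described above can be sketched as follows. The function name `cap_elements` and the constants are hypothetical, and rejecting a cap above `1024` with an error is an assumption made for this sketch; Spice may handle out-of-range values differently.

```python
import logging

logger = logging.getLogger("multi_vector_sketch")

DEFAULT_CAP = 32   # default number of elements embedded per row (from the text above)
HARD_CAP = 1024    # upper bound on max_elements_per_row (from the text above)


def cap_elements(elements, max_elements_per_row=DEFAULT_CAP):
    """Keep only the first `max_elements_per_row` elements of a row (hypothetical helper)."""
    # Assumption: an out-of-range cap is rejected rather than silently clamped.
    if not 1 <= max_elements_per_row <= HARD_CAP:
        raise ValueError(f"max_elements_per_row must be between 1 and {HARD_CAP}")
    if len(elements) > max_elements_per_row:
        logger.warning(
            "dropping %d excess elements", len(elements) - max_elements_per_row
        )
    return elements[:max_elements_per_row]


tags = [f"tag-{i}" for i in range(100)]
print(len(cap_elements(tags)))                           # default cap
print(len(cap_elements(tags, max_elements_per_row=64)))  # raised cap
```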
| 63 | + |
| 64 | +## Querying with `vector_search` |
| 65 | + |
| 66 | +A multi-vector column is queried with the standard `vector_search` UDTF. The configured `aggregation` is applied automatically. |
| 67 | + |
| 68 | +```sql |
| 69 | +SELECT product_id, name, score |
| 70 | +FROM vector_search(products, 'travel accessories', tags) |
| 71 | +ORDER BY score DESC |
| 72 | +LIMIT 10; |
| 73 | +``` |
| 74 | + |
| 75 | +## Late-Interaction (Multi-Query) Search |
| 76 | + |
| 77 | +Multi-vector columns also support ColBERT-style late-interaction search, where the query itself is an array of strings. Each query is embedded independently, the best-matching element is selected for each query (`MaxSim`), and the per-row score is the sum across queries: |
| 78 | + |
| 79 | +$$ |
| 80 | +\text{score}(d) = \sum_{q \in Q} \max_{e \in d} \cos(q, e) |
| 81 | +$$ |
| 82 | + |
| 83 | +```sql |
| 84 | +SELECT product_id, name, score |
| 85 | +FROM vector_search( |
| 86 | + products, |
| 87 | + ['hiking', 'waterproof', 'lightweight'], |
| 88 | + tags |
| 89 | +) |
| 90 | +ORDER BY score DESC |
| 91 | +LIMIT 10; |
| 92 | +``` |
| 93 | + |
| 94 | +Late-interaction search is only supported on multi-vector columns; passing an array of queries to a scalar or chunked column returns an error. At most 32 query strings are accepted per call. |
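The scoring formula above can be worked through with a short Python sketch. It is illustrative only: `late_interaction_score` is a hypothetical name, the vectors are toy values standing in for real embeddings, and the 32-query check mirrors the stated limit without claiming to reproduce Spice's actual error handling.

```python
import math

MAX_QUERIES = 32  # per-call limit stated in the text above


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def late_interaction_score(query_vecs, element_vecs):
    """Sum over queries of the best (MaxSim) element similarity for each query."""
    if len(query_vecs) > MAX_QUERIES:
        raise ValueError(f"at most {MAX_QUERIES} queries are accepted")
    return sum(
        max(cosine(q, e) for e in element_vecs)  # best element for this query
        for q in query_vecs
    )


# Toy data: two query vectors and two rows of element vectors.
queries = [[1.0, 0.0], [0.0, 1.0]]
row_a = [[1.0, 0.0], [0.0, 1.0]]  # has a strong match for each query
row_b = [[1.0, 0.0]]              # matches only the first query

print(late_interaction_score(queries, row_a))
print(late_interaction_score(queries, row_b))
```

Because each query contributes its own best match, `row_a` outranks `row_b` here: a row is rewarded for covering every query term, not just one.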
| 95 | + |
| 96 | +## Passthrough Multi-Vector Columns |
| 97 | + |
| 98 | +Datasets that already contain multi-vector columns can be used directly when their schema matches the conventions in [Vector-Based Search](vector-search#using-existing-embeddings): |
| 99 | + |
| 100 | +- Column name: `<original_column>_embedding` |
| 101 | +- Type: `List<FixedSizeList<Float32 or Float64, N>>` |
| 102 | +- No offsets column (that is only required for chunked scalar columns) |
| 103 | + |
| 104 | +Declare the underlying column's embedding in `spicepod.yaml` so that Spice knows which embedding model the existing vectors came from. |
| 105 | + |
| 106 | +## Limitations |
| 107 | + |
| 108 | +- Multi-vector embeddings require the source column to be `List<Utf8>` or `LargeList<Utf8>`. |
| 109 | +- Late-interaction search accepts at most 32 query strings per call. |
| 110 | +- Multi-vector columns cannot currently be stored in an external vector engine; use a [data accelerator](../../components/data-accelerators) with `acceleration.enabled: true` to cache embeddings. |