Commit b9129dc

refine vector search doc (#2955)
## Versions
- [X] dev
- [ ] 3.0
- [ ] 2.1
- [ ] 2.0

## Languages
- [X] Chinese
- [X] English

## Docs Checklist
- [ ] Checked by AI
- [ ] Test Cases Built
1 parent e805b62 commit b9129dc

2 files changed

Lines changed: 115 additions & 55 deletions

File tree

docs/ai/vector-search.md

Lines changed: 54 additions & 27 deletions
@@ -34,9 +34,8 @@ Vector retrieval in RAG is not limited to text; it naturally extends to multimod
3434

3535
Starting from version 2.0, Apache Doris supports nearest-neighbor search based on vector distance. Performing vector search with SQL is natural and simple:
3636

37-
```
38-
SELECT id,
39-
l2_distance(embedding, [1.0, 2.0, xxx, 10.0]) AS distance
37+
```sql
38+
SELECT id, l2_distance(embedding, [1.0, 2.0, xxx, 10.0]) AS distance
4039
FROM vector_table
4140
ORDER BY distance
4241
LIMIT 10;
@@ -50,7 +49,7 @@ From version 4.0, Apache Doris officially supports ANN search. No additional dat
5049

5150
Using the common [SIFT](http://corpus-texmex.irisa.fr/) dataset as an example, you can create a table like this:
5251

53-
```
52+
```sql
5453
CREATE TABLE sift_1M (
5554
id int NOT NULL,
5655
embedding array<float> NOT NULL COMMENT "",
@@ -87,10 +86,11 @@ Import via S3 TVF:
8786
```sql
8887
INSERT INTO sift_1M
8988
SELECT *
90-
FROM S3("uri" =
91-
"https://selectdb-customers-tools-bj.oss-cn-beijing.aliyuncs.com/sift_database.tsv", "format" = "csv");
89+
FROM S3(
90+
"uri" = "https://selectdb-customers-tools-bj.oss-cn-beijing.aliyuncs.com/sift_database.tsv",
91+
"format" = "csv");
9292

93-
select count(*) from sift_1M
93+
SELECT count(*) FROM sift_1M
9494

9595
+----------+
9696
| count(*) |
@@ -101,8 +101,15 @@ select count(*) from sift_1M
101101

102102
The SIFT dataset ships with a ground-truth set for result validation. Pick one query vector and first run an exact Top-N using the precise distance:
103103

104-
```
105-
SELECT id, l2_distance(embedding, [0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2]) as distance FROM sift_1M ORDER BY distance limit 10
104+
```sql
105+
SELECT id,
106+
l2_distance(
107+
embedding,
108+
[0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2]
109+
) AS distance
110+
FROM sift_1M
111+
ORDER BY distance
112+
LIMIT 10;
106113
--------------
107114

108115
+--------+----------+
@@ -124,8 +131,15 @@ SELECT id, l2_distance(embedding, [0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44
124131

125132
When using `l2_distance` or `inner_product`, Doris computes the distance between the query vector and all 1,000,000 candidate vectors, then applies a TopN operator globally. Using `l2_distance_approximate` / `inner_product_approximate` triggers the index path:
126133

127-
```
128-
SELECT id, l2_distance_approximate(embedding, [0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2]) as distance FROM sift_1M ORDER BY distance limit 10
134+
```sql
135+
SELECT id,
136+
l2_distance_approximate(
137+
embedding,
138+
[0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2]
139+
) AS distance
140+
FROM sift_1M
141+
ORDER BY distance
142+
LIMIT 10;
129143
--------------
130144

131145
+--------+----------+
@@ -157,8 +171,13 @@ Beyond the common TopN nearest neighbor search (returning the closest N records)
157171

158172
Example SQL:
159173

160-
```
161-
SELECT count(*) FROM sift_1M WHERE l2_distance_approximate(embedding, [0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2]) > 300
174+
```sql
175+
SELECT count(*)
176+
FROM sift_1M
177+
WHERE l2_distance_approximate(
178+
embedding,
179+
[0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2])
180+
> 300
162181
--------------
163182

164183
+----------+
@@ -175,8 +194,15 @@ These range-based vector searches are also accelerated by the ANN index: the ind
175194

176195
Compound Search combines an ANN TopN search with a range predicate in the same SQL statement, returning the TopN results that also satisfy a distance constraint.
177196

178-
```
179-
SELECT id, l2_distance_approximate(embedding, [0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2]) as dist FROM sift_1M WHERE l2_distance_approximate(embedding, [0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2]) > 300 ORDER BY dist limit 10
197+
```sql
198+
SELECT id,
199+
l2_distance_approximate(
200+
embedding, [0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2]) as dist
201+
FROM sift_1M
202+
WHERE l2_distance_approximate(
203+
embedding, [0,11,77,24,3,0,0,0,28,70,125,8,0,0,0,0,44,35,50,45,9,0,0,0,4,0,4,56,18,0,3,9,16,17,59,10,10,8,57,57,100,105,125,41,1,0,6,92,8,14,73,125,29,7,0,5,0,0,8,124,66,6,3,1,63,5,0,1,49,32,17,35,125,21,0,3,2,12,6,109,21,0,0,35,74,125,14,23,0,0,6,50,25,70,64,7,59,18,7,16,22,5,0,1,125,23,1,0,7,30,14,32,4,0,2,2,59,125,19,4,0,0,2,1,6,53,33,2])
204+
> 300
205+
ORDER BY dist LIMIT 10;
180206
--------------
181207

182208
+--------+----------+
@@ -206,21 +232,21 @@ This refers to applying other predicates before the ANN TopN and returning the T
206232

207233
Example with a small 8-D vector and a text filter:
208234

209-
```
210-
create table ann_with_fulltext (
211-
id int not null,
212-
embedding array<float> not null,
213-
comment String not null,
214-
value int null,
235+
```sql
236+
CREATE TABLE ann_with_fulltext (
237+
id int NOT NULL,
238+
embedding array<float> NOT NULL,
239+
comment String NOT NULL,
240+
value int NULL,
215241
INDEX idx_comment(`comment`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for comment',
216242
INDEX ann_embedding(`embedding`) USING ANN PROPERTIES("index_type"="hnsw","metric_type"="l2_distance","dim"="8")
217-
) duplicate key (`id`)
218-
distributed by hash(`id`) buckets 1
219-
properties("replication_num"="1");
243+
) DUPLICATE KEY (`id`)
244+
DISTRIBUTED BY HASH(`id`) BUCKETS 1
245+
PROPERTIES("replication_num"="1");
220246
```
221247

222248
Insert sample data and search only within rows where `comment` contains “music”:
223-
```
249+
```sql
224250
INSERT INTO ann_with_fulltext VALUES
225251
(1, [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8], 'this is about music', 10),
226252
(2, [0.2,0.1,0.5,0.3,0.9,0.4,0.7,0.1], 'sports news today', 20),
@@ -259,7 +285,7 @@ With FLAT encoding, an HNSW index (raw vectors plus graph structure) may consume
259285

260286
Vector quantization compresses float32 storage to reduce memory. Doris currently supports two scalar quantization schemes: INT8 and INT4 (SQ8 / SQ4). Example using SQ8:
261287

262-
```
288+
```sql
263289
CREATE TABLE sift_1M (
264290
id int NOT NULL,
265291
embedding array<float> NOT NULL COMMENT "",
@@ -343,13 +369,14 @@ Disable query profiling for ultra latency-sensitive queries.
343369
2. The ANN index is supported only on the Duplicate Key table model.
344370
3. Doris uses pre-filter semantics (predicates applied before ANN TopN). If predicates include columns without secondary indexes that can precisely locate rows (e.g., no inverted index), Doris falls back to brute force to preserve correctness.
345371
Example:
346-
```
372+
```sql
347373
SELECT id, l2_distance_approximate(embedding, [xxx]) AS distance
348374
FROM sift_1M
349375
WHERE round(id) > 100
350376
ORDER BY distance LIMIT 10;
351377
```
352378
Although `id` is a key, without a secondary index (such as an inverted index), its predicate is applied after index analysis, so Doris falls back to brute force to honor pre-filter semantics.
379+
353380
4. If the distance function in SQL does not match the metric type defined in the index DDL, Doris cannot use the ANN index for TopN, even if you call `l2_distance_approximate` / `inner_product_approximate`.
354381
5. For metric type `inner_product`, only `ORDER BY inner_product_approximate(...) DESC LIMIT N` (DESC required) can be accelerated by the ANN index.
355382
6. The first parameter of `xxx_approximate()` must be a ColumnArray, and the second must be a CAST or ArrayLiteral. Reversing them triggers brute-force search.
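
Notes 4 and 5 can be illustrated with a short sketch. The table name `ann_ip_demo` and the 8-D query vector are hypothetical; the DDL simply reuses the pattern of the `ann_with_fulltext` example with `metric_type` set to `inner_product`. Treat it as an illustration under those assumptions rather than a tested recipe:

```sql
-- Hypothetical table: same DDL pattern as ann_with_fulltext,
-- but the ANN index is declared with the inner_product metric.
CREATE TABLE ann_ip_demo (
    id int NOT NULL,
    embedding array<float> NOT NULL,
    INDEX ann_embedding(`embedding`) USING ANN PROPERTIES("index_type"="hnsw","metric_type"="inner_product","dim"="8")
) DUPLICATE KEY (`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1
PROPERTIES("replication_num"="1");

-- Matches the index metric and sorts DESC with a LIMIT:
-- per note 5, this shape can be accelerated by the ANN index.
SELECT id,
       inner_product_approximate(embedding, [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8]) AS score
FROM ann_ip_demo
ORDER BY score DESC
LIMIT 10;

-- Same function but ascending order: per note 5, not accelerated.
SELECT id,
       inner_product_approximate(embedding, [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8]) AS score
FROM ann_ip_demo
ORDER BY score
LIMIT 10;

-- Distance function does not match the index metric:
-- per note 4, the ANN index is not used for this TopN.
SELECT id,
       l2_distance_approximate(embedding, [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8]) AS distance
FROM ann_ip_demo
ORDER BY distance
LIMIT 10;
```

All three queries follow note 6 by passing the column first and the array literal second, so the only variables are the metric and the sort direction.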
