|
1 | 1 | # DuckDB + Faiss Handler |
2 | 2 |
|
| 3 | +## Using duckdb_faiss handler |
| 4 | + |
3 | 5 | This handler combines DuckDB for metadata storage and SQL filtering with Faiss for high-performance vector similarity search. |
4 | 6 |
|
5 | | -## Features |
6 | 7 |
|
7 | | -- **DuckDB**: Store metadata, content, and IDs with full SQL filtering capabilities |
8 | | -- **Faiss**: High-speed vector indexing and similarity search (CPU/GPU support) |
9 | | -- **Hybrid Search**: Combine metadata filtering with vector similarity search |
10 | | -- **Persistence**: Automatic persistence via MindsDB's handler storage system |
| 8 | +### 1. Create a FAISS Database and Knowledge Base |
11 | 9 |
|
12 | | -## Configuration |
| 10 | +The `duckdb_faiss` handler is installed by default with MindsDB. When the `storage` parameter is not specified, a default vector storage is created:
| 11 | +- pgvector, if the `KB_PGVECTOR_URL` environment variable is defined
| 12 | +- otherwise, a duckdb_faiss database
13 | 13 |
|
14 | | -### Connection Parameters |
| 14 | +Create a knowledge base with the default vector database:
| 15 | +```sql
| 16 | +CREATE KNOWLEDGE BASE kb_animals |
| 17 | +USING |
| 18 | + embedding_model = {"provider": "openai", "model_name": "text-embedding-3-small"}; |
| 19 | +``` |
15 | 20 |
|
16 | | -- `metric`: Distance metric - "cosine" or "l2" (default: "cosine") |
17 | | -- `backend`: Faiss backend - "ivf", "flat", "hnsw" (default: "hnsw") |
18 | | -- `use_gpu`: Enable GPU acceleration (default: False) |
19 | | -- `nlist`: IVF parameter for clustering (default: 1024) |
20 | | -- `nprobe`: IVF search parameter (default: 32) |
21 | | -- `hnsw_m`: HNSW connectivity parameter (default: 32) |
22 | | -- `hnsw_ef_search`: HNSW search parameter (default: 64) |
23 | | -- `persist_directory`: Optional custom storage path |
| 21 | +You can create your own duckdb_faiss database manually as well: |
| 22 | + |
| 23 | +```sql |
| 24 | +CREATE DATABASE mindsdb_faiss |
| 25 | +WITH ENGINE = 'duckdb_faiss', |
| 26 | +PARAMETERS = { |
| 27 | + "persist_directory": "/data/faiss_db_location", |
| 28 | + "metric": "ip", |
| 29 | + "use_gpu": false, |
| 30 | + "nlist": 10, |
| 31 | + "nprobe": 2 |
| 32 | +};
| 33 | +``` |
| 34 | + |
| 35 | +Then use it in a knowledge base:
| 36 | +```sql |
| 37 | +CREATE KNOWLEDGE BASE kb_animals |
| 38 | +USING |
| 39 | + storage = mindsdb_faiss.animals_table, |
| 40 | + embedding_model = {"provider": "openai", "model_name": "text-embedding-3-small"}; |
| 41 | +``` |
24 | 42 |
|
25 | | -## Usage |
| 43 | +Parameters for a duckdb_faiss database:
| 44 | +- `persist_directory`: Optional, custom storage path. If not set, the handler storage is used
| 45 | +- `metric`: Optional, distance metric. Possible values: `cosine`, `ip`, `l1`, `l2`. Default is `cosine`
| 46 | +- `use_gpu`: Optional, enables GPU acceleration. Default is False
| 47 | +- `nlist`: Optional, IVF parameter for clustering. Used as the default value when creating an IVF index. Default is 1024
| 48 | +- `nprobe`: Optional, controls the number of clusters to search during a query. Default is 1
26 | 49 |
|
27 | | -### Create Database Connection |
28 | 50 |
|
| 51 | +### 2. Insert data |
| 52 | + |
| 53 | +As with other vector storages, insert from a `SELECT` or from `VALUES`:
29 | 54 | ```sql |
30 | | -CREATE DATABASE faiss_db |
31 | | -WITH |
32 | | - ENGINE = 'duckdb_faiss', |
33 | | - PARAMETERS = {}; |
| 55 | +INSERT INTO kb_animals (id, content, legs) |
| 56 | +VALUES (1, 'duck', 2), (2, 'cat', 4); |
34 | 57 | ``` |
35 | 58 |
|
36 | | -### Create knowledge base |
| 59 | +### 3. Querying the Knowledge Base |
| 60 | + |
| 61 | +**Vector similarity search** |
| 62 | +```sql |
| 63 | +SELECT * FROM kb_animals |
| 64 | +WHERE content = 'cat' AND distance < 0.5; |
| 65 | +``` |
37 | 66 |
|
| 67 | +**Mixed search** |
38 | 68 | ```sql |
39 | | -create knowledge base kb_faiss |
40 | | -using storage = faiss_db.kb_faiss, |
41 | | -embedding_model={"provider": "openai", "model_name": "text-embedding-3-small"}, |
42 | | -metadata_columns=["title", "category"]; |
| 69 | +SELECT * FROM kb_animals |
| 70 | +WHERE content = 'cat' AND legs = 4; |
43 | 71 | ``` |
| 72 | +Metadata columns support the `LIKE`, `NOT LIKE`, `>`, `>=`, `<`, `<=` filters.
44 | 73 |
|
45 | | -### Insert Data |
46 | 74 |
|
| 75 | +**Hybrid search** |
47 | 76 | ```sql |
48 | | -INSERT INTO kb_faiss (id, content, metadata, title, category, embeddings) |
49 | | -VALUES |
50 | | - ('doc1', 'This is a news article about technology', 'Tech News', 'news'), |
51 | | - ('doc2', 'A scientific paper about AI research', 'AI Research', 'science'), |
52 | | - ('doc3', 'Business update on market trends', 'Market Update', 'business'); |
| 77 | +SELECT * FROM kb_animals |
| 78 | +WHERE content = 'cat' AND legs = 4 |
| 79 | + AND hybrid_search = TRUE; |
53 | 80 | ``` |
54 | 81 |
|
55 | | -### Vector Search |
| 82 | +Hybrid search can be enabled with the boolean `hybrid_search` parameter or the float `hybrid_search_alpha` parameter.
| 83 | + |
| 84 | + |
| 85 | +### 4. Create FAISS Indexes
| 86 | + |
| 87 | +When a new duckdb_faiss database is created, it starts with a [flat FAISS index](https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexFlat.html). A flat index finds similar vectors by scanning the entire index. It is also held in RAM, so its size is limited by available memory.
| 88 | +To speed up vector search, you can convert it to another index type. Available options:
| 89 | +- `ivf`: [Inverted File](https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html). It is also held in memory, but is faster than a flat index.
| 90 | +- `ivf_file`: the same as `ivf`, but stored on disk, so it does not need to be loaded into RAM. This index type isn't supported on Windows.
| 91 | + |
| 92 | +Important: It is not possible to create an index for an empty FAISS knowledge base because both types of indexes require data in the knowledge base before creating it. The loaded data is used to train the index. The size of the training data and the number of clusters can affect index quality. |
56 | 93 |
|
| 94 | +Query: |
57 | 95 | ```sql |
58 | | --- Vector similarity search |
59 | | -SELECT * FROM kb_faiss |
60 | | -WHERE content = 'paper' and distance < 0.5 |
61 | | -LIMIT 10; |
62 | | - |
63 | | --- With metadata search |
64 | | -SELECT * FROM kb_faiss |
65 | | -WHERE content = 'paper' and category = 'news' |
66 | | -LIMIT 10; |
67 | | - |
68 | | --- Hybrid search (keyword + vector) |
69 | | -SELECT * FROM kb_faiss |
70 | | -WHERE content = 'paper' and category = 'news' and hybrid_search=true |
71 | | -LIMIT 10; |
| 96 | +CREATE INDEX ON KNOWLEDGE_BASE kb_animals |
| 97 | +WITH ( |
| 98 | + type = 'ivf_file', |
| 99 | + nlist = 100, |
| 100 | + train_count = 10000 |
| 101 | +); |
72 | 102 | ``` |
73 | 103 |
|
74 | | -### Delete document |
| 104 | +Parameters: |
| 105 | +- `type`: optional, default is `ivf_file`
| 106 | +  - on Windows the default is `ivf`
| 107 | +- `nlist`: optional, number of clusters for IVF, default is 1024
| 108 | +- `train_count`: optional, number of vectors to use for training; the default is calculated from `nlist`
| 109 | + |
| 110 | + |
| 111 | +## Implementation details |
| 112 | + |
| 113 | +### How it works |
| 114 | + |
| 115 | +When a duckdb_faiss table is created, the handler creates a folder for it. It contains: |
| 116 | +- duckdb.db - a DuckDB database that stores metadata for the knowledge base
| 117 | +- faiss_index - the FAISS index file
| 118 | +The folder name is the table name.
| 119 | +
| 120 | +Other files in the table folder:
| 121 | +- duckdb.db* - all files related to DuckDB (for example, duckdb.db.wal)
| 122 | +- faiss_index* - all files related to the FAISS index (partitions, merged index for ivf_file)
| 123 | +- dump/ - temporary folder for extracted vectors
| 124 | +- recover/ - temporary folder for the index backup
| 125 | + |
| 126 | +### Locks and concurrency |
| 127 | + |
| 128 | +Because IVF and FLAT indexes are loaded in RAM and the disk copy is used only to store changes in the index (insert/delete records), small indexes are unloaded from RAM after each request and loaded again before the next request. |
| 129 | + |
| 130 | +When the index becomes large the read time increases, so the index is cached in RAM and locked to prevent using it in different processes or threads. If mindsdb is used from different threads or processes, an `index file locked` exception might appear. The lock is released when the handler cache is cleared (default timeout is 1 min). |
| 131 | + |
| 132 | +Because insert-from-select into the knowledge base is performed in the background, the background process can't use the FAISS index if it is locked by a GUI. The implemented workaround is: |
| 133 | +- before the query is sent to the background:
| 134 | +  - find all locks on the vector stores of the KBs in the query and unload the FAISS database from the cache
| 135 | +- after the query is executed in the background:
| 136 | +  - do the same (unload the FAISS database from the cache)
| 137 | + |
| 138 | +Locks also prevent inserting into the knowledge base using threads. This query won't work: |
75 | 139 | ```sql |
76 | | -DELETE FROM kb_faiss |
77 | | -WHERE id = 'doc2'; |
| 140 | +INSERT INTO my_kb SELECT * FROM db1.table1 |
| 141 | +USING threads = 10;
78 | 142 | ``` |
| 143 | + |
| 144 | + |
| 145 | +Important: The FAISS index isn't locked on Windows; the FAISS library can write to a locked file there. |
| 146 | + |
| 147 | +### Checking resources |
| 148 | + |
| 149 | +**RAM** |
| 150 | +For indexes located in RAM, the handler estimates the required memory when data is inserted into the FAISS index and rejects the insert if the estimate exceeds available memory.
| 151 | +This check runs after every 10k inserted records.
| 152 | + |
| 153 | +**Disk**
| 154 | +Creating an index requires two to three times more disk space than the existing index (depending on the index type). The free disk space is checked before index creation starts.
| 155 | +The disk space is occupied by:
| 156 | +- the old faiss_index file (its backup)
| 157 | +- the vectors fetched from the old index
| 158 | +- the new index
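
The disk check described above can be pictured with a small Python sketch. It is only an illustration and not the handler's code: the function names and the fixed `factor` of 3 are assumptions taken from the "two to three times" figure in the text.

```python
import shutil


def free_space(path: str) -> int:
    """Return the free bytes on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free


def has_room_for_reindex(index_size_bytes: int, free_bytes: int, factor: float = 3.0) -> bool:
    """Hypothetical pre-check: backup + dumped vectors + new index must fit.

    `factor` is an assumed worst-case multiplier, not a value from the handler.
    """
    return free_bytes >= index_size_bytes * factor


if __name__ == "__main__":
    # Example: would a 2 GiB index fit on the current disk?
    two_gib = 2 * 1024 ** 3
    print(has_room_for_reindex(two_gib, free_space(".")))
```

A real check would take the index size from the faiss_index file and the free space from the folder that holds it.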
| 159 | + |
| 160 | +### Keyword search |
| 161 | + |
| 162 | +Keyword search is implemented with the DuckDB [fts extension](https://duckdb.org/docs/stable/core_extensions/full_text_search#match_bm25-function).
| 163 | +When keyword search is used and the FTS index doesn't exist, it is created. The index is removed whenever a record is inserted into the KB, because DuckDB does not update an FTS index after inserts.
| 164 | + |
| 165 | +### Mixed search optimizations |
| 166 | +For queries that mix vectors and rich metadata: |
| 167 | +- The handler estimates metadata selectivity (`COUNT(*) WHERE <filters>`) to choose the best execution plan. |
| 168 | +- **Vector-first strategy** fetches an expanding set of candidates from FAISS until enough records satisfy the metadata filters. |
| 169 | +- **Metadata-first strategy** constrains candidate IDs via DuckDB before scoring them in FAISS batches (`META_BATCH = 10,000`). |
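
The two strategies above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the handler's implementation: `choose_strategy`, its `threshold` cut-off, the doubling of `k`, and the `search`/`passes_filter` callables are all hypothetical; only `META_BATCH = 10,000` comes from the text.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

META_BATCH = 10_000  # batch size named in the text above

Candidate = Tuple[int, float]  # (record id, distance)


def choose_strategy(matching_rows: int, total_rows: int, threshold: float = 0.1) -> str:
    """Pick a plan from metadata selectivity; `threshold` is an assumed cut-off."""
    if total_rows == 0:
        return "metadata_first"
    selectivity = matching_rows / total_rows
    return "metadata_first" if selectivity <= threshold else "vector_first"


def vector_first(
    search: Callable[[int], List[Candidate]],
    passes_filter: Callable[[int], bool],
    limit: int,
) -> List[Candidate]:
    """Fetch an expanding candidate set until `limit` records pass the filter.

    `search(k)` stands in for a nearest-neighbour query that returns up to
    k candidates ordered by distance.
    """
    k = limit
    while True:
        candidates = search(k)
        hits = [c for c in candidates if passes_filter(c[0])]
        # Stop when enough hits were found or the index is exhausted.
        if len(hits) >= limit or len(candidates) < k:
            return hits[:limit]
        k *= 2  # assumed growth policy


def batched(ids: Sequence[int], size: int = META_BATCH) -> Iterator[Sequence[int]]:
    """Metadata-first: yield pre-filtered ids in batches for vector scoring."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]
```

With a highly selective filter (few matching rows), scoring only the matching ids is cheaper than repeatedly widening the vector search, which is why selectivity drives the choice.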