Commit 708c57b

ea-rus and lucas-koontz authored
Faiss doc (#12346)
Co-authored-by: Lucas Koontz <lucas.emanuel.koontz@gmail.com>
1 parent 4e9d974 commit 708c57b

2 files changed

Lines changed: 144 additions & 53 deletions

Lines changed: 141 additions & 50 deletions
@@ -1,78 +1,169 @@
11
# DuckDB + Faiss Handler
22

3+
## Using the duckdb_faiss handler
4+
35
This handler combines DuckDB for metadata storage and SQL filtering with Faiss for high-performance vector similarity search.
46

5-
## Features
67

7-
- **DuckDB**: Store metadata, content, and IDs with full SQL filtering capabilities
8-
- **Faiss**: High-speed vector indexing and similarity search (CPU/GPU support)
9-
- **Hybrid Search**: Combine metadata filtering with vector similarity search
10-
- **Persistence**: Automatic persistence via MindsDB's handler storage system
8+
### 1. Create a FAISS Database and Knowledge Base
119

12-
## Configuration
10+
The `duckdb_faiss` handler is installed by default with MindsDB. When the `storage` parameter is not specified, a default vector storage is created. It can be:
11+
- pgvector, if the `KB_PGVECTOR_URL` environment variable is defined
12+
- otherwise, a duckdb_faiss database will be created by default
1313

14-
### Connection Parameters
14+
Create a knowledge base with the default vector database:
15+
```sql
16+
CREATE KNOWLEDGE BASE kb_animals
17+
USING
18+
embedding_model = {"provider": "openai", "model_name": "text-embedding-3-small"};
19+
```
1520

16-
- `metric`: Distance metric - "cosine" or "l2" (default: "cosine")
17-
- `backend`: Faiss backend - "ivf", "flat", "hnsw" (default: "hnsw")
18-
- `use_gpu`: Enable GPU acceleration (default: False)
19-
- `nlist`: IVF parameter for clustering (default: 1024)
20-
- `nprobe`: IVF search parameter (default: 32)
21-
- `hnsw_m`: HNSW connectivity parameter (default: 32)
22-
- `hnsw_ef_search`: HNSW search parameter (default: 64)
23-
- `persist_directory`: Optional custom storage path
21+
You can create your own duckdb_faiss database manually as well:
22+
23+
```sql
24+
CREATE DATABASE mindsdb_faiss
25+
WITH ENGINE = 'duckdb_faiss',
26+
PARAMETERS = {
27+
"persist_directory": "/data/faiss_db_location",
28+
"metric": "ip",
29+
"use_gpu": false,
30+
"nlist": 10,
31+
"nprobe": 2
32+
};
33+
```
34+
35+
Then use it in a knowledge base:
36+
```sql
37+
CREATE KNOWLEDGE BASE kb_animals
38+
USING
39+
storage = mindsdb_faiss.animals_table,
40+
embedding_model = {"provider": "openai", "model_name": "text-embedding-3-small"};
41+
```
2442

25-
## Usage
43+
Parameters for duckdb_faiss database:
44+
- `persist_directory`: Optional, custom storage path. If not set, the handler storage is used
45+
- `metric`: Optional, distance metric. Possible values: `cosine`, `ip`, `l1`, `l2`. Default is `cosine`
46+
- `use_gpu`: Optional, enables GPU acceleration. Default is false
47+
- `nlist`: Optional, IVF parameter for clustering. Used as the default value when creating an IVF index. Default is 1024
48+
- `nprobe`: Optional, controls the number of clusters to search during a query. Default is 1
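To make the `nlist`/`nprobe` trade-off concrete, here is a minimal pure-Python sketch of an inverted-file index. It is not the handler's code and does not use the FAISS API: `ToyIVF` and its hand-picked centroids are hypothetical, and in FAISS the centroids come from k-means training with `nlist` clusters.

```python
import math


class ToyIVF:
    """Toy inverted-file index; an illustration only, not the FAISS implementation."""

    def __init__(self, centroids, nprobe=1):
        # len(centroids) plays the role of nlist.
        self.centroids = centroids
        self.nprobe = nprobe
        self.lists = [[] for _ in centroids]

    def add(self, vec_id, vec):
        # Each vector is stored in the list of its nearest centroid.
        nearest = min(
            range(len(self.centroids)),
            key=lambda c: math.dist(vec, self.centroids[c]),
        )
        self.lists[nearest].append((vec_id, vec))

    def search(self, query, k):
        # Only the nprobe clusters closest to the query are scanned.
        order = sorted(
            range(len(self.centroids)),
            key=lambda c: math.dist(query, self.centroids[c]),
        )
        candidates = [item for c in order[: self.nprobe] for item in self.lists[c]]
        candidates.sort(key=lambda item: math.dist(query, item[1]))
        return [vec_id for vec_id, _ in candidates[:k]]


index = ToyIVF(centroids=[(0.0, 0.0), (10.0, 0.0)], nprobe=1)
index.add(1, (4.9, 0.0))  # lands in the first cluster
index.add(2, (5.6, 0.0))  # lands in the second cluster
```

With `nprobe = 1` a query near a cluster boundary can miss its true nearest neighbour, because that neighbour may sit in a cluster that is never scanned. Raising `nprobe` scans more clusters, which improves recall at the cost of speed.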
2649

27-
### Create Database Connection
2850

51+
### 2. Insert data
52+
53+
As with other vector storages, insert data from a SELECT or from VALUES:
2954
```sql
30-
CREATE DATABASE faiss_db
31-
WITH
32-
ENGINE = 'duckdb_faiss',
33-
PARAMETERS = {};
55+
INSERT INTO kb_animals (id, content, legs)
56+
VALUES (1, 'duck', 2), (2, 'cat', 4);
3457
```
3558

36-
### Create knowledge base
59+
### 3. Querying the Knowledge Base
60+
61+
**Vector similarity search**
62+
```sql
63+
SELECT * FROM kb_animals
64+
WHERE content = 'cat' AND distance < 0.5;
65+
```
3766

67+
**Mixed search**
3868
```sql
39-
create knowledge base kb_faiss
40-
using storage = faiss_db.kb_faiss,
41-
embedding_model={"provider": "openai", "model_name": "text-embedding-3-small"},
42-
metadata_columns=["title", "category"];
69+
SELECT * FROM kb_animals
70+
WHERE content = 'cat' AND legs = 4;
4371
```
72+
The `LIKE`, `NOT LIKE`, `>`, `>=`, `<`, and `<=` filters are supported for metadata columns.
4473

45-
### Insert Data
4674

75+
**Hybrid search**
4776
```sql
48-
INSERT INTO kb_faiss (id, content, metadata, title, category, embeddings)
49-
VALUES
50-
('doc1', 'This is a news article about technology', 'Tech News', 'news'),
51-
('doc2', 'A scientific paper about AI research', 'AI Research', 'science'),
52-
('doc3', 'Business update on market trends', 'Market Update', 'business');
77+
SELECT * FROM kb_animals
78+
WHERE content = 'cat' AND legs = 4
79+
AND hybrid_search = TRUE;
5380
```
5481

55-
### Vector Search
82+
Hybrid search can be enabled with the boolean `hybrid_search` parameter or the float `hybrid_search_alpha` parameter.
83+
84+
85+
### 4. Create FAISS Indexes
86+
87+
When a new duckdb_faiss storage is created, it starts with a [flat FAISS index](https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexFlat.html). A flat index finds similar vectors by scanning the entire index. It is also held in RAM, so its size is limited by available memory.
88+
To speed up vector search, you can convert to another index type. Available options:
89+
- ivf - [Inverted File](https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html). It is also held in memory, but is faster than a flat index
90+
- ivf_file - the same as ivf, but stored on disk, so it does not need to be loaded into RAM. This index type isn't supported on Windows.
91+
92+
Important: An index cannot be created for an empty FAISS knowledge base, because both index types need existing data, which is used to train the index. The size of the training data and the number of clusters can affect index quality.
5693

94+
Query:
5795
```sql
58-
-- Vector similarity search
59-
SELECT * FROM kb_faiss
60-
WHERE content = 'paper' and distance < 0.5
61-
LIMIT 10;
62-
63-
-- With metadata search
64-
SELECT * FROM kb_faiss
65-
WHERE content = 'paper' and category = 'news'
66-
LIMIT 10;
67-
68-
-- Hybrid search (keyword + vector)
69-
SELECT * FROM kb_faiss
70-
WHERE content = 'paper' and category = 'news' and hybrid_search=true
71-
LIMIT 10;
96+
CREATE INDEX ON KNOWLEDGE_BASE kb_animals
97+
WITH (
98+
type = 'ivf_file',
99+
nlist = 100,
100+
train_count = 10000
101+
);
72102
```
73103

74-
### Delete document
104+
Parameters:
105+
- `type` - optional, default is `ivf_file`
106+
  - on Windows, the default is `ivf`
107+
- `nlist` - optional, number of clusters for IVF, default is 1024
108+
- `train_count` - optional, number of vectors to use for training, default is calculated from `nlist`
109+
110+
111+
## Implementation details
112+
113+
### How it works
114+
115+
When a duckdb_faiss table is created, the handler creates a folder for it. It contains:
116+
- duckdb.db - a DuckDB database that stores metadata for the knowledge base
117+
- faiss_index - the FAISS index file
118+
The folder name is the table name.
119+
120+
Other files in a FAISS table folder:
121+
- duckdb.db* - all files related to DuckDB (for example, duckdb.db.wal)
122+
- faiss_index* - all files related to the FAISS index (partitions, merged index for ivf_file)
123+
- dump/ - temporary folder for extracted vectors
124+
- recover/ - temporary folder for index backup
125+
126+
### Locks and concurrency
127+
128+
Because IVF and FLAT indexes are loaded in RAM and the disk copy is used only to store changes in the index (insert/delete records), small indexes are unloaded from RAM after each request and loaded again before the next request.
129+
130+
When the index becomes large the read time increases, so the index is cached in RAM and locked to prevent using it in different processes or threads. If mindsdb is used from different threads or processes, an `index file locked` exception might appear. The lock is released when the handler cache is cleared (default timeout is 1 min).
131+
132+
Because insert-from-select into the knowledge base is performed in the background, the background process can't use the FAISS index if it is locked by a GUI. The implemented workaround is:
133+
- before the query is sent to the background
134+
  - search all locks for vector bases of KBs in the query and unload the FAISS database from cache
135+
- after the query has run in the background
136+
  - do the same (unload the FAISS database from cache)
137+
138+
Locks also prevent inserting into the knowledge base using threads. This query won't work:
75139
```sql
76-
DELETE FROM kb_faiss
77-
WHERE id = 'doc2';
140+
INSERT INTO my_kb SELECT * FROM db1.table1
141+
USING threads=10;
78142
```
143+
144+
145+
Important: The FAISS index isn't locked on Windows; the FAISS library can write to a locked file there.
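The locking behaviour described in this section can be pictured with an advisory file lock. The sketch below uses the standard-library `fcntl` module, which exists only on Unix-like systems. Whether the handler uses `fcntl` is an assumption, and `IndexFileLocked`, `acquire_index_lock`, and `release_index_lock` are hypothetical names, not handler APIs.

```python
import fcntl
import os


class IndexFileLocked(Exception):
    """Raised when another holder already has the index file locked."""


def acquire_index_lock(path):
    # Open (or create) the index file and try to take an exclusive,
    # non-blocking advisory lock on it.
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        raise IndexFileLocked(f"index file locked: {path}")
    return fd


def release_index_lock(fd):
    # Corresponds to the moment the cached index is unloaded.
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A second caller that tries to lock the same file while the first still holds it gets `IndexFileLocked` straight away instead of waiting, which mirrors the exception described above. The lock becomes available again once the holder releases it.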
146+
147+
### Checking resources
148+
149+
**RAM**
150+
For indexes located in RAM, the handler forecasts the memory required when data is inserted into the FAISS index and rejects the insert if it would exceed available memory.
151+
This check is run after every 10k records inserted.
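A rough sketch of such a check is shown below. It assumes uncompressed float32 vectors (4 bytes per dimension, as in a flat index); the handler's actual forecast formula is not given in this document. `insert_with_memory_checks` and the `get_available_bytes` callback are hypothetical.

```python
from typing import Callable

FLOAT32_BYTES = 4
CHECK_EVERY = 10_000  # check interval named in the text above


def forecast_bytes(n_vectors: int, dim: int) -> int:
    # Raw storage for n_vectors float32 vectors of the given dimension.
    return n_vectors * dim * FLOAT32_BYTES


def insert_with_memory_checks(
    n_new: int, dim: int, get_available_bytes: Callable[[], int]
) -> int:
    """Simulate an insert that re-checks memory for every batch of 10k records."""
    inserted = 0
    while inserted < n_new:
        batch = min(CHECK_EVERY, n_new - inserted)
        needed = forecast_bytes(batch, dim)
        if needed > get_available_bytes():
            raise MemoryError(
                f"need {needed} bytes for the next {batch} vectors, not enough RAM"
            )
        inserted += batch  # a real handler would add the vectors to the index here
    return inserted
```

For scale, 10,000 vectors of dimension 1536 need about 61 MB of raw float32 storage (10,000 × 1536 × 4 bytes).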
152+
153+
**Disk**
154+
When an index is created, it requires two to three times more disk space (depending on the index type). The free disk space is also checked before starting to create the index.
155+
Disk space is occupied by:
156+
- an old faiss_index file (its backup)
157+
- fetched vectors from old index
158+
- a new index
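A pre-flight check along these lines could look like the following sketch. The factor of 3 is the conservative end of the "two to three times" range above, and `required_bytes` and `has_room_for_reindex` are hypothetical helpers, not handler functions.

```python
import os
import shutil


def required_bytes(index_size: int, factor: float = 3.0) -> float:
    # Backup of the old index + extracted vectors + the new index.
    return index_size * factor


def has_room_for_reindex(index_path: str, factor: float = 3.0) -> bool:
    """Return True if the volume holding the index has enough free space."""
    index_size = os.path.getsize(index_path)
    folder = os.path.dirname(os.path.abspath(index_path))
    free = shutil.disk_usage(folder).free
    return free >= required_bytes(index_size, factor)
```

Running the check before any files are written avoids failing halfway through, when the backup and the extracted vectors already occupy disk space.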
159+
160+
### Keyword search
161+
162+
Keyword search is implemented using the DuckDB [fts extension](https://duckdb.org/docs/stable/core_extensions/full_text_search#match_bm25-function).
163+
When keyword search is used and the FTS index doesn't exist, it is created. The index is removed whenever a record is inserted into the KB, because DuckDB does not update an FTS index after inserts.
164+
165+
### Mixed search optimizations
166+
For queries that mix vectors and rich metadata:
167+
- The handler estimates metadata selectivity (`COUNT(*) WHERE <filters>`) to choose the best execution plan.
168+
- **Vector-first strategy** fetches an expanding set of candidates from FAISS until enough records satisfy the metadata filters.
169+
- **Metadata-first strategy** constrains candidate IDs via DuckDB before scoring them in FAISS batches (`META_BATCH = 10,000`).
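The two strategies above can be sketched in plain Python. This is an illustration under stated assumptions, not the handler's implementation: a sorted list stands in for a FAISS search, a dict of rows stands in for DuckDB, and `SELECTIVITY_THRESHOLD` is a made-up cut-off.

```python
import math

META_BATCH = 10_000  # batch size named in the text above
SELECTIVITY_THRESHOLD = 0.1  # hypothetical cut-off, not taken from the handler


def vector_first(query, vectors, metadata, predicate, limit):
    # Sorting by distance stands in for a FAISS top-k search.
    ranked = sorted(vectors, key=lambda i: math.dist(query, vectors[i]))
    k = limit
    while True:
        hits = [i for i in ranked[:k] if predicate(metadata[i])]
        if len(hits) >= limit or k >= len(ranked):
            return hits[:limit]
        k *= 2  # widen the candidate set and try again


def metadata_first(query, vectors, metadata, predicate, limit, batch=META_BATCH):
    # Filter IDs by metadata first, then score the survivors in batches.
    ids = [i for i in vectors if predicate(metadata[i])]
    scored = []
    for start in range(0, len(ids), batch):
        chunk = ids[start:start + batch]
        scored.extend((math.dist(query, vectors[i]), i) for i in chunk)
    scored.sort()
    return [i for _, i in scored[:limit]]


def mixed_search(query, vectors, metadata, predicate, limit):
    # Equivalent of COUNT(*) WHERE <filters>, divided by the table size.
    matching = sum(1 for row in metadata.values() if predicate(row))
    selectivity = matching / max(len(metadata), 1)
    if selectivity < SELECTIVITY_THRESHOLD:
        return metadata_first(query, vectors, metadata, predicate, limit)
    return vector_first(query, vectors, metadata, predicate, limit)


vectors = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (5.0, 5.0), 4: (0.1, 0.0)}
metadata = {1: {"legs": 2}, 2: {"legs": 4}, 3: {"legs": 4}, 4: {"legs": 4}}
four_legs = lambda row: row["legs"] == 4
result = mixed_search((0.0, 0.0), vectors, metadata, four_legs, limit=2)
```

In this sketch both strategies return the same rows and differ only in how much work they do. Metadata-first pays off when few rows pass the filter, while vector-first pays off when most rows pass it.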

mindsdb/integrations/handlers/duckdb_faiss_handler/faiss_index.py

Lines changed: 3 additions & 3 deletions
@@ -28,9 +28,7 @@ class FaissParams(BaseModel):
2828
metric: str | None = "cosine"
2929
use_gpu: bool | None = False
3030
nlist: int | None = 1024
31-
nprobe: int | None = 32
32-
hnsw_m: int | None = 32
33-
hnsw_ef_search: int | None = 64
31+
nprobe: int | None = None
3432

3533

3634
def merge_ondisk(trained_index: faiss.Index, shard_fnames: List[str], ivfdata_fname: str, shift_ids=False) -> None:
@@ -161,6 +159,8 @@ def _load_index(self):
161159
self.index_type = "ivf_file"
162160
else:
163161
self.index_type = "ivf"
162+
if self.config.nprobe is not None:
163+
self.index.nprobe = self.config.nprobe
164164

165165
def close(self):
166166
if self.index_fd is not None:
