Commit a7f4732

Add FAISS DocumentStore integration (#2844)
* refactor: remove dead code from FAISSDocumentStore
  - Remove duplicate _check_condition method (lines 111-148)
  - Remove unreachable comparison operator checks (lines 348-355)
  - Reduces codebase by 46 lines of unreachable code
  - All tests still passing (55 passed, 2 skipped)

* docs: add pydoc configuration for API reference generation

* refactor: remove unused legacy parameters from __init__
  - Addresses review feedback from @davidsbatista
  - Removed: sql_url, faiss_index_factory_str, similarity, isolation_level, duplicate_documents, return_embedding, progress_bar
  - Updated docstrings

* test: add tests for advanced filtering, metadata, and persistence edge cases
  - Added custom test methods for `count_documents_by_filter`, `get_metadata_fields_info`, and `count_unique_metadata_by_filter`
  - Added test for search with and without filters
  - Added tests for persistence with and without embeddings, and missing files
  - Added test for to_dict / from_dict roundtrip

* fix: resolve lint errors, add py.typed, and clean up test mixins
  - Fixed Ruff E501, EM101, EM102, B905, E721, and PLC0415
  - Added empty py.typed marker to src package
  - Removed redundant FilterableDocsFixtureMixin from TestFAISSDocumentStore

* fix: clean docstrings, add encoding and write validation
  - Cleaned up developer comments in _matches_filters and count_unique_metadata_by_filter
  - Added ValueError raising for invalid input in write_documents
  - Added missing docstrings for get_metadata_fields_info, delete_by_filter, etc.
  - Added explicit utf-8 encoding to open() calls
  - Removed dead pass blocks

* docs: split overlong docstring to pass Ruff checks after escaping braces

* adding LICENSE header to tests + converting method to static

* feat: add FAISSEmbeddingRetriever component
  - Add components/retrievers/faiss/embedding_retriever.py with @component decorator, run(), run_async(), to_dict(), from_dict() with FilterPolicy support and backward-compat deserialization guard
  - Add components/__init__.py, components/retrievers/__init__.py, components/retrievers/faiss/__init__.py namespace packages
  - Add tests/test_embedding_retriever.py with 8 tests covering: basic run, runtime filters, top_k override, to_dict/from_dict roundtrip, FilterPolicy REPLACE/MERGE, ValueError on wrong store type, and end-to-end pipeline execution
  - Update pyproject.toml types script to also typecheck haystack_integrations.components.retrievers.faiss

* small fixes

* build a new doc instance instead of mutating score in place.

* updating NOT filter handling

* fixing config_docusaurus.yml

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
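One of the fixes listed in the commit message is building a new document instance instead of mutating `score` in place. The document store code itself is not shown on this page, so the following is only a minimal sketch of that pattern. `Doc` and `with_score` are hypothetical stand-ins, not Haystack's `Document` class or any function from this integration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a stored document (not Haystack's Document)."""

    id: str
    content: str
    score: Optional[float] = None


def with_score(doc: Doc, score: float) -> Doc:
    """Return a copy of `doc` that carries `score`; the original is left untouched."""
    return replace(doc, score=score)


stored = Doc(id="1", content="hello")
hit = with_score(stored, 0.87)
```

With this approach the instance kept by the store never picks up a score from an earlier query, which is presumably the motivation for the fix.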
1 parent 1c9b7c1 commit a7f4732

15 files changed

Lines changed: 1197 additions & 0 deletions

integrations/faiss/README.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+# faiss-haystack
+
+This package provides a [FAISS](https://github.com/facebookresearch/faiss) document store for [Haystack](https://github.com/deepset-ai/haystack).
+
+## Installation
+
+```bash
+pip install faiss-haystack
+```
+
+## Usage
+
+```python
+from haystack_integrations.document_stores.faiss import FAISSDocumentStore
+
+document_store = FAISSDocumentStore(index_path="my_index")
+```
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+loaders:
+  - modules:
+      - haystack_integrations.components.retrievers.faiss.embedding_retriever
+      - haystack_integrations.document_stores.faiss.document_store
+    search_path: [../src]
+processors:
+  - type: filter
+    documented_only: true
+    skip_empty_modules: true
+renderer:
+  description: FAISS integration for Haystack
+  id: integrations-faiss
+  filename: faiss.md
+  title: FAISS

integrations/faiss/pyproject.toml

Lines changed: 159 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,159 @@
1+
[build-system]
2+
requires = ["hatchling", "hatch-vcs"]
3+
build-backend = "hatchling.build"
4+
5+
[project]
6+
name = "faiss-haystack"
7+
dynamic = ["version"]
8+
description = ''
9+
readme = "README.md"
10+
requires-python = ">=3.10"
11+
license = "Apache-2.0"
12+
keywords = []
13+
authors = [{ name = "Deepset", email = "info@deepset.ai" }]
14+
classifiers = [
15+
"License :: OSI Approved :: Apache Software License",
16+
"Development Status :: 4 - Beta",
17+
"Programming Language :: Python",
18+
"Programming Language :: Python :: 3.10",
19+
"Programming Language :: Python :: 3.11",
20+
"Programming Language :: Python :: 3.12",
21+
"Programming Language :: Python :: Implementation :: CPython",
22+
"Programming Language :: Python :: Implementation :: PyPy",
23+
]
24+
dependencies = [
25+
"haystack-ai>=2.24.0",
26+
"faiss-cpu>=1.8.0",
27+
"numpy",
28+
]
29+
30+
[project.urls]
31+
Documentation = "https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/faiss#readme"
32+
Issues = "https://github.com/deepset-ai/haystack-core-integrations/issues"
33+
Source = "https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/faiss"
34+
35+
[tool.hatch.build.targets.wheel]
36+
packages = ["src/haystack_integrations"]
37+
38+
[tool.hatch.version]
39+
source = "vcs"
40+
tag-pattern = 'integrations\/faiss-v(?P<version>.*)'
41+
42+
[tool.hatch.version.raw-options]
43+
root = "../.."
44+
git_describe_command = 'git describe --tags --match="integrations/faiss-v[0-9]*"'
45+
46+
[tool.hatch.envs.default]
47+
installer = "uv"
48+
dependencies = ["haystack-pydoc-tools", "ruff"]
49+
50+
[tool.hatch.envs.default.scripts]
51+
docs = ["pydoc-markdown pydoc/config_docusaurus.yml"]
52+
fmt = "ruff check --fix {args}; ruff format {args}"
53+
fmt-check = "ruff check {args} && ruff format --check {args}"
54+
55+
[tool.hatch.envs.test]
56+
dependencies = [
57+
"pytest",
58+
"pytest-cov",
59+
"pytest-rerunfailures",
60+
"mypy",
61+
"pandas",
62+
]
63+
64+
[tool.hatch.envs.test.scripts]
65+
unit = 'pytest -m "not integration" {args:tests}'
66+
integration = 'pytest -m "integration" {args:tests}'
67+
all = 'pytest {args:tests}'
68+
cov-retry = 'pytest --cov=haystack_integrations --reruns 3 --reruns-delay 30 -x {args:tests}'
69+
70+
types = "mypy -p haystack_integrations.document_stores.faiss -p haystack_integrations.components.retrievers.faiss {args}"
71+
72+
[tool.mypy]
73+
install_types = true
74+
non_interactive = true
75+
check_untyped_defs = true
76+
disallow_incomplete_defs = true
77+
78+
[tool.hatch.metadata]
79+
allow-direct-references = true
80+
81+
[tool.ruff]
82+
line-length = 120
83+
84+
[tool.ruff.lint]
85+
select = [
86+
"A",
87+
"ARG",
88+
"B",
89+
"C",
90+
"DTZ",
91+
"E",
92+
"EM",
93+
"F",
94+
"FBT",
95+
"I",
96+
"ICN",
97+
"ISC",
98+
"N",
99+
"PLC",
100+
"PLE",
101+
"PLR",
102+
"PLW",
103+
"Q",
104+
"RUF",
105+
"S",
106+
"T",
107+
"TID",
108+
"UP",
109+
"W",
110+
"YTT",
111+
]
112+
ignore = [
113+
# Allow non-abstract empty methods in abstract base classes
114+
"B027",
115+
# Allow boolean positional values in function calls, like `dict.get(... True)`
116+
"FBT003",
117+
# Ignore checks for possible passwords
118+
"S105",
119+
"S106",
120+
"S107",
121+
# Ignore complexity
122+
"C901",
123+
"PLR0911",
124+
"PLR0912",
125+
"PLR0913",
126+
"PLR0915",
127+
# Ignore unused params
128+
"ARG002",
129+
# Allow assertions
130+
"S101",
131+
]
132+
exclude = ["example"]
133+
134+
[tool.ruff.lint.isort]
135+
known-first-party = ["haystack_integrations"]
136+
137+
[tool.ruff.lint.flake8-tidy-imports]
138+
ban-relative-imports = "parents"
139+
140+
[tool.ruff.lint.per-file-ignores]
141+
# Tests can use magic values, assertions, and relative imports
142+
"tests/**/*" = ["PLR2004", "S101", "TID252"]
143+
"example/**/*" = ["T201"]
144+
145+
[tool.coverage.run]
146+
source = ["haystack_integrations"]
147+
branch = true
148+
parallel = false
149+
150+
151+
[tool.coverage.report]
152+
omit = ["*/tests/*", "*/__init__.py"]
153+
show_missing = true
154+
exclude_lines = ["no cov", "if __name__ == .__main__.:", "if TYPE_CHECKING:"]
155+
156+
157+
[tool.pytest.ini_options]
158+
minversion = "6.0"
159+
markers = ["integration: integration tests"]
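The `tag-pattern` under `[tool.hatch.version]` in `pyproject.toml` is a regular expression with a named `version` group. As a rough illustration of what it captures, the sketch below applies the same pattern with Python's `re` module. The `version_from_tag` helper and the sample tags are assumptions made for this example, and how hatch-vcs applies the pattern internally is not shown in this commit.

```python
import re
from typing import Optional

# Same expression as the `tag-pattern` value in pyproject.toml.
TAG_PATTERN = re.compile(r"integrations\/faiss-v(?P<version>.*)")


def version_from_tag(tag: str) -> Optional[str]:
    """Return the captured version, or None if the tag belongs to another integration."""
    match = TAG_PATTERN.match(tag)
    return match.group("version") if match else None


faiss_version = version_from_tag("integrations/faiss-v1.2.3")
other_version = version_from_tag("integrations/chroma-v0.1.0")
```

Because the repository hosts many integrations, scoping both the tag pattern and the `git describe --match` glob to the `integrations/faiss-v` prefix keeps tags from other packages out of the version computation.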

integrations/faiss/src/haystack_integrations/__init__.py

Whitespace-only changes.
Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
# SPDX-FileCopyrightText: 2023-present deepset GmbH <info@deepset.ai>
2+
#
3+
# SPDX-License-Identifier: Apache-2.0
Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
# SPDX-FileCopyrightText: 2023-present deepset GmbH <info@deepset.ai>
2+
#
3+
# SPDX-License-Identifier: Apache-2.0
Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
# SPDX-FileCopyrightText: 2023-present deepset GmbH <info@deepset.ai>
2+
#
3+
# SPDX-License-Identifier: Apache-2.0
4+
from .embedding_retriever import FAISSEmbeddingRetriever
5+
6+
__all__ = ["FAISSEmbeddingRetriever"]
Lines changed: 153 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,153 @@
1+
# SPDX-FileCopyrightText: 2023-present deepset GmbH <info@deepset.ai>
2+
#
3+
# SPDX-License-Identifier: Apache-2.0
4+
5+
from typing import Any
6+
7+
from haystack import component, default_from_dict, default_to_dict
8+
from haystack.dataclasses import Document
9+
from haystack.document_stores.types import FilterPolicy
10+
from haystack.document_stores.types.filter_policy import apply_filter_policy
11+
12+
from haystack_integrations.document_stores.faiss import FAISSDocumentStore
13+
14+
15+
@component
16+
class FAISSEmbeddingRetriever:
17+
"""
18+
Retrieves documents from the `FAISSDocumentStore`, based on their dense embeddings.
19+
20+
Example usage:
21+
```python
22+
from haystack import Document, Pipeline
23+
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
24+
from haystack.document_stores.types import DuplicatePolicy
25+
26+
from haystack_integrations.document_stores.faiss import FAISSDocumentStore
27+
from haystack_integrations.components.retrievers.faiss import FAISSEmbeddingRetriever
28+
29+
document_store = FAISSDocumentStore(embedding_dim=768)
30+
31+
documents = [
32+
Document(content="There are over 7,000 languages spoken around the world today."),
33+
Document(content="Elephants have been observed to behave in a way that indicates a high level of intelligence."),
34+
Document(content="In certain places, you can witness the phenomenon of bioluminescent waves."),
35+
]
36+
37+
document_embedder = SentenceTransformersDocumentEmbedder()
38+
document_embedder.warm_up()
39+
documents_with_embeddings = document_embedder.run(documents)["documents"]
40+
41+
document_store.write_documents(documents_with_embeddings, policy=DuplicatePolicy.OVERWRITE)
42+
43+
query_pipeline = Pipeline()
44+
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
45+
query_pipeline.add_component("retriever", FAISSEmbeddingRetriever(document_store=document_store))
46+
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
47+
48+
query = "How many languages are there?"
49+
res = query_pipeline.run({"text_embedder": {"text": query}})
50+
51+
assert res["retriever"]["documents"][0].content == "There are over 7,000 languages spoken around the world today."
52+
```
53+
""" # noqa: E501
54+
55+
def __init__(
56+
self,
57+
*,
58+
document_store: FAISSDocumentStore,
59+
filters: dict[str, Any] | None = None,
60+
top_k: int = 10,
61+
filter_policy: str | FilterPolicy = FilterPolicy.REPLACE,
62+
):
63+
"""
64+
:param document_store: An instance of `FAISSDocumentStore`.
65+
:param filters: Filters applied to the retrieved Documents at initialisation time. At runtime, these are merged
66+
with any runtime filters according to the `filter_policy`.
67+
:param top_k: Maximum number of Documents to return.
68+
:param filter_policy: Policy to determine how init-time and runtime filters are combined.
69+
See `FilterPolicy` for details. Defaults to `FilterPolicy.REPLACE`.
70+
:raises ValueError: If `document_store` is not an instance of `FAISSDocumentStore`.
71+
"""
72+
if not isinstance(document_store, FAISSDocumentStore):
73+
msg = "document_store must be an instance of FAISSDocumentStore"
74+
raise ValueError(msg)
75+
76+
self.document_store = document_store
77+
self.filters = filters or {}
78+
self.top_k = top_k
79+
self.filter_policy = (
80+
filter_policy if isinstance(filter_policy, FilterPolicy) else FilterPolicy.from_str(filter_policy)
81+
)
82+
83+
def to_dict(self) -> dict[str, Any]:
84+
"""
85+
Serializes the component to a dictionary.
86+
87+
:returns: Dictionary with serialized data.
88+
"""
89+
return default_to_dict(
90+
self,
91+
filters=self.filters,
92+
top_k=self.top_k,
93+
filter_policy=self.filter_policy.value,
94+
document_store=self.document_store.to_dict(),
95+
)
96+
97+
@classmethod
98+
def from_dict(cls, data: dict[str, Any]) -> "FAISSEmbeddingRetriever":
99+
"""
100+
Deserializes the component from a dictionary.
101+
102+
:param data: Dictionary to deserialize from.
103+
:returns: Deserialized component.
104+
"""
105+
doc_store_params = data["init_parameters"]["document_store"]
106+
data["init_parameters"]["document_store"] = FAISSDocumentStore.from_dict(doc_store_params)
107+
return default_from_dict(cls, data)
108+
109+
@component.output_types(documents=list[Document])
110+
def run(
111+
self,
112+
query_embedding: list[float],
113+
filters: dict[str, Any] | None = None,
114+
top_k: int | None = None,
115+
) -> dict[str, list[Document]]:
116+
"""
117+
Retrieve documents from the `FAISSDocumentStore`, based on their embeddings.
118+
119+
:param query_embedding: Embedding of the query.
120+
:param filters: Filters applied to the retrieved Documents. The way runtime filters are applied depends on
121+
the `filter_policy` chosen at retriever initialization. See init method docstring for more
122+
details.
123+
:param top_k: Maximum number of Documents to return. Overrides the value set at initialization.
124+
:returns: A dictionary with the following keys:
125+
- `documents`: List of `Document`s that are similar to `query_embedding`.
126+
"""
127+
filters = apply_filter_policy(self.filter_policy, self.filters, filters)
128+
top_k = top_k or self.top_k
129+
docs = self.document_store.search(query_embedding=query_embedding, top_k=top_k, filters=filters)
130+
return {"documents": docs}
131+
132+
@component.output_types(documents=list[Document])
133+
async def run_async(
134+
self,
135+
query_embedding: list[float],
136+
filters: dict[str, Any] | None = None,
137+
top_k: int | None = None,
138+
) -> dict[str, list[Document]]:
139+
"""
140+
Asynchronously retrieve documents from the `FAISSDocumentStore`, based on their embeddings.
141+
142+
Since FAISS search is CPU-bound and fully in-memory, this delegates directly to the synchronous
143+
`run()` method. No I/O or network calls are involved.
144+
145+
:param query_embedding: Embedding of the query.
146+
:param filters: Filters applied to the retrieved Documents. The way runtime filters are applied depends on
147+
the `filter_policy` chosen at retriever initialization. See init method docstring for more
148+
details.
149+
:param top_k: Maximum number of Documents to return. Overrides the value set at initialization.
150+
:returns: A dictionary with the following keys:
151+
- `documents`: List of `Document`s that are similar to `query_embedding`.
152+
"""
153+
return self.run(query_embedding=query_embedding, filters=filters, top_k=top_k)
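The retriever's `run()` method resolves its effective arguments in two steps: `apply_filter_policy(...)` combines init-time and runtime filters, and `top_k or self.top_k` picks the result limit. The sketch below imitates both steps in plain Python so the behaviour can be read at a glance. `Policy`, `resolve_filters`, and `resolve_top_k` are hypothetical stand-ins, and the merge branch is a simplification that always wraps both filters in an `AND`; Haystack's own `apply_filter_policy` may combine filters differently in some cases.

```python
from enum import Enum
from typing import Any, Optional


class Policy(Enum):
    """Hypothetical stand-in for Haystack's FilterPolicy."""

    REPLACE = "replace"
    MERGE = "merge"


def resolve_filters(
    policy: Policy,
    init_filters: Optional[dict[str, Any]],
    runtime_filters: Optional[dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Simplified imitation of combining init-time and runtime filters."""
    if not runtime_filters:
        # Nothing passed at runtime: fall back to the init-time filters.
        return init_filters
    if policy is Policy.MERGE and init_filters:
        # Assumption: both filters are ANDed together.
        return {"operator": "AND", "conditions": [init_filters, runtime_filters]}
    # REPLACE (or no init-time filters): the runtime filters win.
    return runtime_filters


def resolve_top_k(runtime_top_k: Optional[int], default_top_k: int) -> int:
    """Mirror `top_k or self.top_k`: both None and 0 fall back to the default."""
    return runtime_top_k or default_top_k


init = {"field": "meta.lang", "operator": "==", "value": "en"}
runtime = {"field": "meta.year", "operator": ">=", "value": 2020}
merged = resolve_filters(Policy.MERGE, init, runtime)
```

One consequence of the `or` idiom is that passing `top_k=0` at runtime does not request zero documents; it silently uses the value set at initialization.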

integrations/faiss/src/haystack_integrations/document_stores/__init__.py

Whitespace-only changes.
Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
from .document_store import FAISSDocumentStore
2+
3+
__all__ = ["FAISSDocumentStore"]