| title | Document Stores |
|---|---|
| id | document-stores-api |
| description | Stores your texts and metadata and provides them to the Retriever at query time. |
| slug | /document-stores-api |
A dataclass for managing document statistics for BM25 retrieval.
Parameters:
- freq_token (dict[str, int]) – A Counter of token frequencies in the document.
- doc_len (int) – Number of tokens in the document.
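To make the two fields concrete, here is a minimal sketch of how such statistics could be computed for a single document. The names `DocStats` and `build_stats` are hypothetical stand-ins rather than the library's own code, and the lowercasing step is an assumption; the regex is the default `bm25_tokenization_regex` listed on this page.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Default tokenization pattern from this page: runs of word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w+\b")


@dataclass
class DocStats:
    """Hypothetical stand-in for the BM25 statistics dataclass."""

    freq_token: dict[str, int]  # token -> number of occurrences
    doc_len: int  # total number of tokens in the document


def build_stats(text: str) -> DocStats:
    # Lowercasing is an assumption of this sketch, not a documented behavior.
    tokens = TOKEN_PATTERN.findall(text.lower())
    return DocStats(freq_token=dict(Counter(tokens)), doc_len=len(tokens))


stats = build_stats("The cat sat on the mat")
print(stats.doc_len, stats.freq_token["the"])
```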
Stores data in memory. The store is ephemeral: its contents are lost when the process ends unless you write them out with save_to_disk.
__init__(
bm25_tokenization_regex: str = "(?u)\\b\\w+\\b",
bm25_algorithm: Literal["BM25Okapi", "BM25L", "BM25Plus"] = "BM25L",
bm25_parameters: dict | None = None,
embedding_similarity_function: Literal[
"dot_product", "cosine"
] = "dot_product",
index: str | None = None,
async_executor: ThreadPoolExecutor | None = None,
return_embedding: bool = True,
) -> None
Initializes the DocumentStore.
Parameters:
- bm25_tokenization_regex (str) – The regular expression used to tokenize the text for BM25 retrieval.
- bm25_algorithm (Literal['BM25Okapi', 'BM25L', 'BM25Plus']) – The BM25 algorithm to use. One of "BM25Okapi", "BM25L" (default), or "BM25Plus".
- bm25_parameters (dict | None) – Parameters for the BM25 implementation, as a dictionary. For example: {'k1': 1.5, 'b': 0.75, 'epsilon': 0.25}. To learn more about these parameters, see https://github.com/dorianbrown/rank_bm25.
- embedding_similarity_function (Literal['dot_product', 'cosine']) – The similarity function used to compare Document embeddings. One of "dot_product" (default) or "cosine". To choose the most appropriate function, check the documentation of your embedding model.
- index (str | None) – The index in which to store the documents. If not specified, a random UUID is used. InMemoryDocumentStore instances that use the same index share the same documents.
- async_executor (ThreadPoolExecutor | None) – Optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded executor is created and used.
- return_embedding (bool) – Whether to return the embedding of the retrieved Documents. Default is True.
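The practical difference between the two `embedding_similarity_function` options can be sketched in plain Python. This illustrates the math only and is not the store's implementation; the function names are hypothetical.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the two vector lengths.
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
short_doc = [1.0, 0.0]  # same direction as the query, length 1
long_doc = [3.0, 4.0]   # different direction, length 5

# Dot product rewards vector length; cosine only looks at direction.
print(dot_product(query, short_doc), dot_product(query, long_doc))
print(cosine(query, short_doc), cosine(query, long_doc))
```

When an embedding model emits unit-length vectors, both functions produce the same ranking, so the choice matters mainly for models whose vectors are not normalized.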
shutdown() -> None
Explicitly shuts down the executor if this instance owns it.
storage: dict[str, Document]
Utility property that returns the storage used by this instance of InMemoryDocumentStore.
to_dict() -> dict[str, Any]
Serializes the component to a dictionary.
Returns:
- dict[str, Any] – Dictionary with serialized data.
from_dict(data: dict[str, Any]) -> InMemoryDocumentStore
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – The dictionary to deserialize from.
Returns:
- InMemoryDocumentStore – The deserialized component.
save_to_disk(path: str) -> None
Writes the database and its data to disk as a JSON file.
Parameters:
- path (str) – The path to the JSON file.
load_from_disk(path: str) -> InMemoryDocumentStore
Loads the database and its data from a JSON file on disk.
Parameters:
- path (str) – The path to the JSON file.
Returns:
- InMemoryDocumentStore – The loaded InMemoryDocumentStore.
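The save and load pair amounts to a JSON round trip. The sketch below shows that idea with the standard library only; `save_store`, `load_store`, and the JSON layout are invented for illustration and should not be read as the file format the store actually writes.

```python
import json
import tempfile
from pathlib import Path


def save_store(path: str, index: str, documents: list[dict]) -> None:
    # The payload layout is an assumption of this sketch.
    payload = {"index": index, "documents": documents}
    Path(path).write_text(json.dumps(payload), encoding="utf-8")


def load_store(path: str) -> dict:
    return json.loads(Path(path).read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "store.json")
    docs = [{"id": "1", "content": "hello", "meta": {"lang": "en"}}]
    save_store(target, "my-index", docs)
    restored = load_store(target)

print(restored["index"], len(restored["documents"]))
```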
count_documents() -> int
Returns the number of documents present in the DocumentStore.
filter_documents(filters: dict[str, Any] | None = None) -> list[Document]
Returns the documents that match the filters provided.
Parameters:
- filters (dict[str, Any] | None) – The filters to apply. For a detailed specification of the filters, refer to the documentation.
Returns:
- list[Document] – A list of Documents that match the given filters.
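As a rough illustration of what filter matching involves, the sketch below evaluates a nested filter against plain dictionaries. It assumes the comparison and logic dictionary shape used by Haystack 2.x filters (`field`, `operator`, `value`, and `conditions`); the operator set shown and the `matches` and `lookup` helpers are hypothetical, so check the filter documentation for the authoritative syntax.

```python
from typing import Any

# A small, assumed subset of comparison operators.
COMPARATORS = {
    "==": lambda left, right: left == right,
    "!=": lambda left, right: left != right,
    ">": lambda left, right: left is not None and left > right,
    ">=": lambda left, right: left is not None and left >= right,
    "<": lambda left, right: left is not None and left < right,
    "<=": lambda left, right: left is not None and left <= right,
    "in": lambda left, right: left in right,
}


def lookup(doc: dict[str, Any], field: str) -> Any:
    # "meta.year" reads doc["meta"]["year"]; other names read top-level keys.
    if field.startswith("meta."):
        return doc.get("meta", {}).get(field[len("meta."):])
    return doc.get(field)


def matches(doc: dict[str, Any], flt: dict[str, Any]) -> bool:
    if "conditions" in flt:  # logical node combining child filters
        results = (matches(doc, child) for child in flt["conditions"])
        if flt["operator"] == "AND":
            return all(results)
        if flt["operator"] == "OR":
            return any(results)
        raise ValueError(f"Unsupported logical operator: {flt['operator']}")
    if flt["operator"] not in COMPARATORS:
        raise ValueError(f"Unsupported comparison operator: {flt['operator']}")
    return COMPARATORS[flt["operator"]](lookup(doc, flt["field"]), flt["value"])


docs = [
    {"id": "1", "content": "a", "meta": {"year": 2019, "lang": "en"}},
    {"id": "2", "content": "b", "meta": {"year": 2022, "lang": "en"}},
    {"id": "3", "content": "c", "meta": {"year": 2023, "lang": "de"}},
]
flt = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.year", "operator": ">=", "value": 2020},
        {"field": "meta.lang", "operator": "==", "value": "en"},
    ],
}
selected = [doc["id"] for doc in docs if matches(doc, flt)]
print(selected)
```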
write_documents(
documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
Refer to the DocumentStore.write_documents() protocol documentation.
If policy is set to DuplicatePolicy.NONE, it defaults to DuplicatePolicy.FAIL.
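The fallback from NONE to FAIL can be sketched as follows. `Policy` is a hypothetical mirror of the duplicate-handling enum; SKIP and OVERWRITE are the other members this sketch assumes exist, and `ValueError` stands in for whatever error the library actually raises on a duplicate.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical mirror of the duplicate-handling options."""

    NONE = "none"
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


def write(storage: dict[str, str], docs: dict[str, str],
          policy: Policy = Policy.NONE) -> int:
    if policy is Policy.NONE:
        policy = Policy.FAIL  # the fallback described above
    written = 0
    for doc_id, content in docs.items():
        if doc_id in storage:
            if policy is Policy.FAIL:
                raise ValueError(f"Document {doc_id!r} already exists")
            if policy is Policy.SKIP:
                continue  # keep the stored version untouched
        storage[doc_id] = content  # new document, or OVERWRITE
        written += 1
    return written


storage = {"a": "old"}
skipped = write(storage, {"a": "new", "b": "fresh"}, Policy.SKIP)
value_after_skip = storage["a"]
overwritten = write(storage, {"a": "new"}, Policy.OVERWRITE)
print(skipped, value_after_skip, overwritten, storage["a"])
```

Calling `write(storage, {"a": "again"})` with the default policy would raise, because NONE is treated as FAIL.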
delete_documents(document_ids: list[str]) -> None
Deletes all documents with matching document_ids from the DocumentStore.
Parameters:
- document_ids (list[str]) – The document_ids to delete.
delete_all_documents() -> None
Deletes all documents in the document store.
update_by_filter(filters: dict[str, Any], meta: dict[str, Any]) -> int
Updates the metadata of all documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to select documents for updating. For filter syntax, see filter_documents.
- meta (dict[str, Any]) – The metadata fields to update. These are merged with the existing metadata.
Returns:
- int – The number of documents updated.
Raises:
- ValueError – If the filters have invalid syntax.
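The merge behavior can be pictured with a small sketch. It assumes a shallow merge in which new keys are added and existing keys are overwritten, and it uses a plain predicate in place of a filter dictionary; `update_meta` is a hypothetical helper, not the library method.

```python
from typing import Any, Callable


def update_meta(docs: list[dict[str, Any]],
                predicate: Callable[[dict[str, Any]], bool],
                meta: dict[str, Any]) -> int:
    updated = 0
    for doc in docs:
        if predicate(doc):
            # Shallow merge: adds new keys and overwrites existing ones.
            doc["meta"].update(meta)
            updated += 1
    return updated


docs = [
    {"id": "1", "meta": {"lang": "en", "status": "draft"}},
    {"id": "2", "meta": {"lang": "de"}},
]
updated = update_meta(
    docs,
    lambda doc: doc["meta"].get("lang") == "en",
    {"status": "final", "reviewed": True},
)
print(updated, docs[0]["meta"])
```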
delete_by_filter(filters: dict[str, Any]) -> int
Deletes all documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to select documents for deletion. For filter syntax, see filter_documents.
Returns:
- int – The number of documents deleted.
Raises:
- ValueError – If the filters have invalid syntax.
count_documents_by_filter(filters: dict[str, Any]) -> int
Returns the number of documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply. For a detailed specification of the filters, refer to the documentation.
Returns:
- int – The number of documents that match the filters.
count_unique_metadata_by_filter(
filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]
Returns the number of unique values for each specified metadata field from documents matching the filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply. For a detailed specification of the filters, refer to the documentation.
- metadata_fields (list[str]) – List of field names to count unique values for. Field names can include or omit the "meta." prefix.
Returns:
- dict[str, int] – A dictionary mapping each metadata field name (without the "meta." prefix) to the count of its unique values among the filtered documents.
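The counting step and the prefix handling can be sketched as below, operating on documents that have already been filtered. `strip_prefix` and `count_unique` are hypothetical helpers, and the sketch assumes metadata values are hashable.

```python
from typing import Any


def strip_prefix(field: str) -> str:
    # Accept both "meta.lang" and "lang".
    return field[len("meta."):] if field.startswith("meta.") else field


def count_unique(docs: list[dict[str, Any]], fields: list[str]) -> dict[str, int]:
    counts: dict[str, int] = {}
    for field in fields:
        key = strip_prefix(field)
        # Documents that lack the field contribute nothing.
        values = {doc["meta"][key] for doc in docs if key in doc["meta"]}
        counts[key] = len(values)
    return counts


docs = [
    {"id": "1", "meta": {"lang": "en", "year": 2022}},
    {"id": "2", "meta": {"lang": "en", "year": 2023}},
    {"id": "3", "meta": {"lang": "de"}},
]
result = count_unique(docs, ["meta.lang", "year"])
print(result)
```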
get_metadata_fields_info() -> dict[str, dict[str, str]]
Returns information about the metadata fields present in the stored documents.
Types are inferred from the stored values (keyword, int, float, boolean).
Returns:
- dict[str, dict[str, str]] – A dictionary mapping each metadata field name to a dict with a "type" key.
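One way such type inference could work is sketched below. The mapping of Python types to the four labels and the rule that the first value seen decides the type are assumptions of this sketch; `infer_type` and `fields_info` are hypothetical names.

```python
from typing import Any


def infer_type(value: Any) -> str:
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    return "keyword"  # assumed fallback for strings and everything else


def fields_info(docs: list[dict[str, Any]]) -> dict[str, dict[str, str]]:
    info: dict[str, dict[str, str]] = {}
    for doc in docs:
        for name, value in doc["meta"].items():
            # The first value seen decides the type in this sketch.
            info.setdefault(name, {"type": infer_type(value)})
    return info


docs = [{"meta": {"lang": "en", "year": 2023, "score": 0.5, "public": True}}]
info = fields_info(docs)
print(info)
```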
get_metadata_field_min_max(metadata_field: str) -> dict[str, Any]
Returns the minimum and maximum values for the given metadata field across all documents.
Parameters:
- metadata_field (str) – The metadata field name. Can include or omit the "meta." prefix.
Returns:
- dict[str, Any] – A dictionary with "min" and "max" keys. Returns {"min": None, "max": None} if the field is missing or has no values.
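The return shape, including the empty case, can be illustrated with a short sketch. `field_min_max` is a hypothetical helper that assumes the values of a field are mutually comparable.

```python
from typing import Any


def field_min_max(docs: list[dict[str, Any]], metadata_field: str) -> dict[str, Any]:
    prefix = "meta."
    key = metadata_field[len(prefix):] if metadata_field.startswith(prefix) else metadata_field
    values = [doc["meta"][key] for doc in docs if doc["meta"].get(key) is not None]
    if not values:
        # Field missing everywhere, or no usable values.
        return {"min": None, "max": None}
    return {"min": min(values), "max": max(values)}


docs = [
    {"id": "1", "meta": {"year": 2019}},
    {"id": "2", "meta": {"year": 2023}},
    {"id": "3", "meta": {}},
]
year_range = field_min_max(docs, "meta.year")
missing_range = field_min_max(docs, "pages")
print(year_range, missing_range)
```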
get_metadata_field_unique_values(
metadata_field: str, search_term: str | None = None
) -> tuple[list[str], int]
Returns unique values for a metadata field, optionally filtered by a search term in content.
Parameters:
- metadata_field (str) – The metadata field name. Can include or omit the "meta." prefix.
- search_term (str | None) – If set, only documents whose content contains this term (case-insensitive) are considered.
Returns:
- tuple[list[str], int] – A tuple of (list of unique values, total count of unique values).
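The interaction between the search term and the collected values can be sketched as follows. `unique_values` is a hypothetical helper; converting values to strings and sorting them are assumptions made so the result is deterministic.

```python
from typing import Any, Optional


def unique_values(docs: list[dict[str, Any]], metadata_field: str,
                  search_term: Optional[str] = None) -> tuple[list[str], int]:
    prefix = "meta."
    key = metadata_field[len(prefix):] if metadata_field.startswith(prefix) else metadata_field
    seen: set[str] = set()
    for doc in docs:
        # Case-insensitive substring match on the document content.
        if search_term is not None and search_term.lower() not in doc["content"].lower():
            continue
        if key in doc["meta"]:
            seen.add(str(doc["meta"][key]))
    values = sorted(seen)
    return values, len(values)


docs = [
    {"content": "Python tips", "meta": {"lang": "en"}},
    {"content": "python tricks", "meta": {"lang": "de"}},
    {"content": "Rust notes", "meta": {"lang": "en"}},
]
python_langs = unique_values(docs, "meta.lang", "PYTHON")
rust_langs = unique_values(docs, "lang", "rust")
print(python_langs, rust_langs)
```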
bm25_retrieval(
query: str,
filters: dict[str, Any] | None = None,
top_k: int = 10,
scale_score: bool = False,
) -> list[Document]
Retrieves the documents that are most relevant to the query using the BM25 algorithm.
Parameters:
- query (str) – The query string.
- filters (dict[str, Any] | None) – A dictionary with filters to narrow down the search space.
- top_k (int) – The number of top documents to retrieve. Default is 10.
- scale_score (bool) – Whether to scale the scores of the retrieved documents. Default is False.
Returns:
- list[Document] – A list of the top_k documents most relevant to the query.
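For intuition about how BM25 ranks documents, here is a simplified Okapi-style scorer. It is not the store's implementation and does not cover the BM25L or BM25Plus variants; the `k1` and `b` defaults, the smoothed IDF formula, and the `bm25_rank` helper are all assumptions of this sketch.

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w+\b")


def tokenize(text: str) -> list[str]:
    return TOKEN_PATTERN.findall(text.lower())


def bm25_rank(query: str, corpus: dict[str, str], top_k: int = 10,
              k1: float = 1.5, b: float = 0.75) -> list[tuple[str, float]]:
    if not corpus:
        return []
    tokenized = {doc_id: tokenize(text) for doc_id, text in corpus.items()}
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / len(tokenized)
    # Document frequency: how many documents contain each token.
    doc_freq = Counter(tok for tokens in tokenized.values() for tok in set(tokens))
    scores: dict[str, float] = {}
    for doc_id, tokens in tokenized.items():
        freqs = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = freqs[term]
            if tf == 0:
                continue
            n = doc_freq[term]
            idf = math.log((len(tokenized) - n + 0.5) / (n + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[doc_id] = score
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


corpus = {
    "a": "the cat sat on the mat",
    "b": "dogs chase cats in the park",
    "c": "a cat and another cat",
}
top = bm25_rank("cat", corpus, top_k=2)
print([doc_id for doc_id, _ in top])
```

Document "c" ranks first because it mentions the query term twice in a shorter text; "b" scores zero because "cats" is a different token from "cat" under this tokenizer.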
embedding_retrieval(
query_embedding: list[float],
filters: dict[str, Any] | None = None,
top_k: int = 10,
scale_score: bool = False,
return_embedding: bool | None = False,
) -> list[Document]
Retrieves the documents that are most similar to the query embedding using a vector similarity metric.
Parameters:
- query_embedding (list[float]) – Embedding of the query.
- filters (dict[str, Any] | None) – A dictionary with filters to narrow down the search space.
- top_k (int) – The number of top documents to retrieve. Default is 10.
- scale_score (bool) – Whether to scale the scores of the retrieved Documents. Default is False.
- return_embedding (bool | None) – Whether to return the embedding of the retrieved Documents. If None, the value of the return_embedding parameter set at component initialization is used. Default is False.
Returns:
- list[Document] – A list of the top_k documents most relevant to the query.
Raises:
- ValueError – If the filters have invalid syntax.
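The roles of `top_k` and `return_embedding` can be sketched with a brute-force search over plain dictionaries. This is an illustration using dot-product scoring only; `embedding_rank` and the result layout are hypothetical, and score scaling is left out.

```python
from typing import Any


def embedding_rank(query_embedding: list[float], docs: list[dict[str, Any]],
                   top_k: int = 10, return_embedding: bool = False) -> list[dict[str, Any]]:
    scored = []
    for doc in docs:
        score = sum(q * d for q, d in zip(query_embedding, doc["embedding"]))
        result: dict[str, Any] = {"id": doc["id"], "score": score}
        if return_embedding:
            # Only attach the vector when the caller asks for it.
            result["embedding"] = doc["embedding"]
        scored.append(result)
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:top_k]


docs = [
    {"id": "a", "embedding": [1.0, 0.0]},
    {"id": "b", "embedding": [0.0, 1.0]},
    {"id": "c", "embedding": [0.5, 0.5]},
]
hits = embedding_rank([1.0, 0.0], docs, top_k=2)
hits_with_vectors = embedding_rank([1.0, 0.0], docs, top_k=1, return_embedding=True)
print([hit["id"] for hit in hits])
```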
count_documents_async() -> int
Returns the number of documents present in the DocumentStore.
filter_documents_async(filters: dict[str, Any] | None = None) -> list[Document]
Returns the documents that match the filters provided.
Parameters:
- filters (dict[str, Any] | None) – The filters to apply. For a detailed specification of the filters, refer to the documentation.
Returns:
- list[Document] – A list of Documents that match the given filters.
write_documents_async(
documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
Refer to the DocumentStore.write_documents() protocol documentation.
If policy is set to DuplicatePolicy.NONE, it defaults to DuplicatePolicy.FAIL.
delete_documents_async(document_ids: list[str]) -> None
Deletes all documents with matching document_ids from the DocumentStore.
Parameters:
- document_ids (list[str]) – The document_ids to delete.
bm25_retrieval_async(
query: str,
filters: dict[str, Any] | None = None,
top_k: int = 10,
scale_score: bool = False,
) -> list[Document]
Retrieves the documents that are most relevant to the query using the BM25 algorithm.
Parameters:
- query (str) – The query string.
- filters (dict[str, Any] | None) – A dictionary with filters to narrow down the search space.
- top_k (int) – The number of top documents to retrieve. Default is 10.
- scale_score (bool) – Whether to scale the scores of the retrieved documents. Default is False.
Returns:
- list[Document] – A list of the top_k documents most relevant to the query.
embedding_retrieval_async(
query_embedding: list[float],
filters: dict[str, Any] | None = None,
top_k: int = 10,
scale_score: bool = False,
return_embedding: bool = False,
) -> list[Document]
Retrieves the documents that are most similar to the query embedding using a vector similarity metric.
Parameters:
- query_embedding (list[float]) – Embedding of the query.
- filters (dict[str, Any] | None) – A dictionary with filters to narrow down the search space.
- top_k (int) – The number of top documents to retrieve. Default is 10.
- scale_score (bool) – Whether to scale the scores of the retrieved Documents. Default is False.
- return_embedding (bool) – Whether to return the embedding of the retrieved Documents. Default is False.
Returns:
- list[Document] – A list of the top_k documents most relevant to the query.
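The pairing of each synchronous method with an `_async` counterpart, together with the `async_executor` and `shutdown()` behavior described earlier, can be pictured with the sketch below. `TinyStore` is a hypothetical class; delegating to the synchronous method through `run_in_executor` is one plausible design, not a statement of how the library implements it.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


class TinyStore:
    """Hypothetical store showing the sync/async pairing."""

    def __init__(self, executor: Optional[ThreadPoolExecutor] = None) -> None:
        self._docs: dict[str, str] = {}
        # Only shut down an executor that this instance created itself.
        self._owns_executor = executor is None
        self._executor = executor or ThreadPoolExecutor(max_workers=1)

    def count_documents(self) -> int:
        return len(self._docs)

    async def count_documents_async(self) -> int:
        # Run the blocking method on the executor so the event loop stays free.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, self.count_documents)

    def shutdown(self) -> None:
        if self._owns_executor:
            self._executor.shutdown(wait=True)


store = TinyStore()
store._docs["1"] = "hello"
count = asyncio.run(store.count_documents_async())
store.shutdown()
print(count)
```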