| title | Weaviate |
|---|---|
| id | integrations-weaviate |
| description | Weaviate integration for Haystack |
| slug | /integrations-weaviate |
A component for retrieving documents from Weaviate using the BM25 algorithm.
Example usage:
from haystack_integrations.document_stores.weaviate.document_store import (
    WeaviateDocumentStore,
)
from haystack_integrations.components.retrievers.weaviate.bm25_retriever import (
    WeaviateBM25Retriever,
)

document_store = WeaviateDocumentStore(url="http://localhost:8080")
retriever = WeaviateBM25Retriever(document_store=document_store)
retriever.run(query="How to make a pizza", top_k=3)

__init__(
    *,
    document_store: WeaviateDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    filter_policy: str | FilterPolicy = FilterPolicy.REPLACE
) -> None

Creates a new instance of WeaviateBM25Retriever.
Parameters:
- document_store (WeaviateDocumentStore) – Instance of WeaviateDocumentStore that this retriever uses.
- filters (dict[str, Any] | None) – Custom filters applied when running the retriever.
- top_k (int) – Maximum number of documents to return.
- filter_policy (str | FilterPolicy) – Policy to determine how filters are applied.
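The filter_policy setting decides what happens when filters are supplied both at initialization and at run time. The sketch below shows one plausible reading of the two policies in plain Python. SketchFilterPolicy and resolve_filters are hypothetical names, not part of the integration, and the MERGE branch is a simplification that ANDs the two filter sets; Haystack's actual merge logic may differ.

```python
from enum import Enum
from typing import Any, Optional


class SketchFilterPolicy(Enum):
    """Hypothetical stand-in for Haystack's FilterPolicy, for illustration only."""

    REPLACE = "replace"
    MERGE = "merge"


def resolve_filters(
    policy: SketchFilterPolicy,
    init_filters: Optional[dict[str, Any]],
    runtime_filters: Optional[dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Pick the filters one run() call would use under the assumed semantics."""
    if runtime_filters is None:
        # Nothing passed at run time: fall back to the init-time filters.
        return init_filters
    if policy is SketchFilterPolicy.REPLACE or init_filters is None:
        # REPLACE (or nothing to merge with): the runtime filters are used alone.
        return runtime_filters
    # MERGE, simplified: a document must satisfy both filter sets.
    return {"operator": "AND", "conditions": [init_filters, runtime_filters]}
```

Under these assumptions, a retriever created with init-time filters and the REPLACE policy ignores them whenever run() receives its own filters.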
to_dict() -> dict[str, Any]

Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict(data: dict[str, Any]) -> WeaviateBM25Retriever

Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
WeaviateBM25Retriever – Deserialized component.
run(
    query: str, filters: dict[str, Any] | None = None, top_k: int | None = None
) -> dict[str, list[Document]]

Retrieves documents from Weaviate using the BM25 algorithm.
Parameters:
- query (str) – The query text.
- filters (dict[str, Any] | None) – Filters applied to the retrieved Documents. How runtime filters are applied depends on the filter_policy chosen at retriever initialization. See the __init__ method docstring for more details.
- top_k (int | None) – The maximum number of documents to return.
Returns:
dict[str, list[Document]] – A dictionary with the following keys:
- documents: List of documents returned by the search engine.
run_async(
    query: str, filters: dict[str, Any] | None = None, top_k: int | None = None
) -> dict[str, list[Document]]

Asynchronously retrieves documents from Weaviate using the BM25 algorithm.
Parameters:
- query (str) – The query text.
- filters (dict[str, Any] | None) – Filters applied to the retrieved Documents. How runtime filters are applied depends on the filter_policy chosen at retriever initialization. See the __init__ method docstring for more details.
- top_k (int | None) – The maximum number of documents to return.
Returns:
dict[str, list[Document]] – A dictionary with the following keys:
- documents: List of documents returned by the search engine.
A retriever that uses Weaviate's vector search to find similar documents based on the embeddings of the query.
__init__(
    *,
    document_store: WeaviateDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    distance: float | None = None,
    certainty: float | None = None,
    filter_policy: str | FilterPolicy = FilterPolicy.REPLACE
) -> None

Creates a new instance of WeaviateEmbeddingRetriever.
Parameters:
- document_store (WeaviateDocumentStore) – Instance of WeaviateDocumentStore that this retriever uses.
- filters (dict[str, Any] | None) – Custom filters applied when running the retriever.
- top_k (int) – Maximum number of documents to return.
- distance (float | None) – The maximum allowed distance between Documents' embeddings.
- certainty (float | None) – Normalized distance between the result item and the search vector.
- filter_policy (str | FilterPolicy) – Policy to determine how filters are applied.
Raises:
ValueError – If both distance and certainty are provided. See https://weaviate.io/developers/weaviate/api/graphql/search-operators#variables to learn more about the distance and certainty parameters.
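Because distance and certainty are mutually exclusive, a caller may want to validate thresholds before building the retriever. The following is a minimal sketch, assuming cosine distance, for which Weaviate's documentation relates the two as certainty = 1 - distance / 2. check_thresholds and certainty_from_cosine_distance are hypothetical helpers, not part of the integration.

```python
from typing import Optional


def check_thresholds(distance: Optional[float], certainty: Optional[float]) -> None:
    """Reject input that sets both thresholds, mirroring the documented ValueError."""
    if distance is not None and certainty is not None:
        raise ValueError("Provide either distance or certainty, not both.")


def certainty_from_cosine_distance(distance: float) -> float:
    """Assumed cosine-only conversion: distance in [0, 2] maps to certainty in [0, 1]."""
    return 1.0 - distance / 2.0
```

With this conversion, a cosine distance of 0.5 corresponds to a certainty of 0.75, so either threshold can express the same cutoff.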
to_dict() -> dict[str, Any]

Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict(data: dict[str, Any]) -> WeaviateEmbeddingRetriever

Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
WeaviateEmbeddingRetriever – Deserialized component.
run(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
    distance: float | None = None,
    certainty: float | None = None,
) -> dict[str, list[Document]]

Retrieves documents from Weaviate using vector search.
Parameters:
- query_embedding (list[float]) – Embedding of the query.
- filters (dict[str, Any] | None) – Filters applied to the retrieved Documents. How runtime filters are applied depends on the filter_policy chosen at retriever initialization. See the __init__ method docstring for more details.
- top_k (int | None) – The maximum number of documents to return.
- distance (float | None) – The maximum allowed distance between Documents' embeddings.
- certainty (float | None) – Normalized distance between the result item and the search vector.
Returns:
dict[str, list[Document]] – A dictionary with the following keys:
- documents: List of documents returned by the search engine.
Raises:
ValueError – If both distance and certainty are provided. See https://weaviate.io/developers/weaviate/api/graphql/search-operators#variables to learn more about the distance and certainty parameters.
run_async(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
    distance: float | None = None,
    certainty: float | None = None,
) -> dict[str, list[Document]]

Asynchronously retrieves documents from Weaviate using vector search.
Parameters:
- query_embedding (list[float]) – Embedding of the query.
- filters (dict[str, Any] | None) – Filters applied to the retrieved Documents. How runtime filters are applied depends on the filter_policy chosen at retriever initialization. See the __init__ method docstring for more details.
- top_k (int | None) – The maximum number of documents to return.
- distance (float | None) – The maximum allowed distance between Documents' embeddings.
- certainty (float | None) – Normalized distance between the result item and the search vector.
Returns:
dict[str, list[Document]] – A dictionary with the following keys:
- documents: List of documents returned by the search engine.
Raises:
ValueError – If both distance and certainty are provided. See https://weaviate.io/developers/weaviate/api/graphql/search-operators#variables to learn more about the distance and certainty parameters.
A retriever that uses Weaviate's hybrid search to find similar documents based on both the query text and the query embedding.
__init__(
    *,
    document_store: WeaviateDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    alpha: float = 0.7,
    max_vector_distance: float | None = None,
    filter_policy: str | FilterPolicy = FilterPolicy.REPLACE
) -> None

Creates a new instance of WeaviateHybridRetriever.
Parameters:
- document_store (WeaviateDocumentStore) – Instance of WeaviateDocumentStore that this retriever uses.
- filters (dict[str, Any] | None) – Custom filters applied when running the retriever.
- top_k (int) – Maximum number of documents to return.
- alpha (float) – Blending factor for hybrid retrieval in Weaviate. Must be in the range [0.0, 1.0].
  Weaviate hybrid search combines keyword (BM25) and vector scores into a single ranking. alpha controls how much each part contributes to the final score:
  - alpha = 0.0: only keyword (BM25) scoring is used.
  - alpha = 1.0: only vector similarity scoring is used.
  - Values in between blend the two; higher values favor the vector score, lower values favor BM25.
  By default, 0.7 is used, which is the Weaviate server default.
  See the official Weaviate docs on Hybrid Search parameters for more details:
  - Hybrid search parameters
  - Hybrid Search
- max_vector_distance (float | None) – Optional threshold that restricts the vector part of the hybrid search to candidates within a maximum vector distance. Candidates with a distance larger than this threshold are excluded from the vector portion before blending.
  Use this to prune low-quality vector matches while still benefitting from keyword recall. Leave None to use Weaviate's default behavior without an explicit cutoff.
  See the official Weaviate docs on Hybrid Search parameters for more details:
  - Hybrid search parameters
  - Hybrid Search
- filter_policy (str | FilterPolicy) – Policy to determine how filters are applied.
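To make the role of alpha concrete, here is a minimal sketch of a weighted blend of two scores that are already on a common scale. blend_scores is a hypothetical helper, not a function of the integration or of Weaviate, and it is only an illustration: Weaviate's own fusion step normalizes or ranks the keyword and vector results before combining them, so real scores will not match this arithmetic exactly.

```python
def blend_scores(keyword_score: float, vector_score: float, alpha: float = 0.7) -> float:
    """Weighted blend: alpha weights the vector score, (1 - alpha) the keyword score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in the range [0.0, 1.0]")
    return alpha * vector_score + (1.0 - alpha) * keyword_score
```

With alpha = 0.0 the result equals the keyword score, with alpha = 1.0 it equals the vector score, and values in between move the result toward one side or the other.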
to_dict() -> dict[str, Any]

Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict(data: dict[str, Any]) -> WeaviateHybridRetriever

Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
WeaviateHybridRetriever – Deserialized component.
run(
    query: str,
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
    alpha: float | None = None,
    max_vector_distance: float | None = None,
) -> dict[str, list[Document]]

Retrieves documents from Weaviate using hybrid search.
Parameters:
- query (str) – The query text.
- query_embedding (list[float]) – Embedding of the query.
- filters (dict[str, Any] | None) – Filters applied to the retrieved Documents. How runtime filters are applied depends on the filter_policy chosen at retriever initialization. See the __init__ method docstring for more details.
- top_k (int | None) – The maximum number of documents to return.
- alpha (float | None) – Blending factor for hybrid retrieval in Weaviate. Must be in the range [0.0, 1.0].
  Weaviate hybrid search combines keyword (BM25) and vector scores into a single ranking. alpha controls how much each part contributes to the final score:
  - alpha = 0.0: only keyword (BM25) scoring is used.
  - alpha = 1.0: only vector similarity scoring is used.
  - Values in between blend the two; higher values favor the vector score, lower values favor BM25.
  If None, the Weaviate server default is used.
  See the official Weaviate docs on Hybrid Search parameters for more details:
  - Hybrid search parameters
  - Hybrid Search
- max_vector_distance (float | None) – Optional threshold that restricts the vector part of the hybrid search to candidates within a maximum vector distance. Candidates with a distance larger than this threshold are excluded from the vector portion before blending.
  Use this to prune low-quality vector matches while still benefitting from keyword recall. Leave None to use Weaviate's default behavior without an explicit cutoff.
  See the official Weaviate docs on Hybrid Search parameters for more details.
Returns:
dict[str, list[Document]] – A dictionary with the following keys:
- documents: List of documents returned by the search engine.
run_async(
    query: str,
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
    alpha: float | None = None,
    max_vector_distance: float | None = None,
) -> dict[str, list[Document]]

Asynchronously retrieves documents from Weaviate using hybrid search.
Parameters:
- query (str) – The query text.
- query_embedding (list[float]) – Embedding of the query.
- filters (dict[str, Any] | None) – Filters applied to the retrieved Documents. How runtime filters are applied depends on the filter_policy chosen at retriever initialization. See the __init__ method docstring for more details.
- top_k (int | None) – The maximum number of documents to return.
- alpha (float | None) – Blending factor for hybrid retrieval in Weaviate. Must be in the range [0.0, 1.0].
  Weaviate hybrid search combines keyword (BM25) and vector scores into a single ranking. alpha controls how much each part contributes to the final score:
  - alpha = 0.0: only keyword (BM25) scoring is used.
  - alpha = 1.0: only vector similarity scoring is used.
  - Values in between blend the two; higher values favor the vector score, lower values favor BM25.
  If None, the Weaviate server default is used.
  See the official Weaviate docs on Hybrid Search parameters for more details:
  - Hybrid search parameters
  - Hybrid Search
- max_vector_distance (float | None) – Optional threshold that restricts the vector part of the hybrid search to candidates within a maximum vector distance. Candidates with a distance larger than this threshold are excluded from the vector portion before blending.
  Use this to prune low-quality vector matches while still benefitting from keyword recall. Leave None to use Weaviate's default behavior without an explicit cutoff.
  See the official Weaviate docs on Hybrid Search parameters for more details.
Returns:
dict[str, list[Document]] – A dictionary with the following keys:
- documents: List of documents returned by the search engine.
Bases: Enum
Supported auth credentials for WeaviateDocumentStore.
from_class(auth_class: type[AuthCredentials]) -> SupportedAuthTypes

Return the SupportedAuthTypes enum value corresponding to the given auth credentials class.
Bases: ABC
Base class for all auth credentials supported by WeaviateDocumentStore.
Can be used to deserialize from dict any of the supported auth credentials.
to_dict() -> dict[str, Any]

Converts the object to a dictionary representation for serialization.

from_dict(data: dict[str, Any]) -> AuthCredentials

Converts a dictionary representation to an auth credentials object.

resolve_value() -> (
    WeaviateAuthApiKey
    | WeaviateAuthBearerToken
    | WeaviateAuthClientCredentials
    | WeaviateAuthClientPassword
)

Resolves all the secrets in the auth credentials object and returns the corresponding Weaviate object.
All subclasses must implement this method.
Bases: AuthCredentials
AuthCredentials for API key authentication.
By default, it loads api_key from the environment variable WEAVIATE_API_KEY.

resolve_value() -> WeaviateAuthApiKey

Resolve the API key secret and return the corresponding Weaviate auth object.
Bases: AuthCredentials
AuthCredentials for Bearer token authentication.
By default, it loads access_token from the environment variable WEAVIATE_ACCESS_TOKEN and refresh_token from the environment variable WEAVIATE_REFRESH_TOKEN.
The WEAVIATE_REFRESH_TOKEN environment variable is optional.

resolve_value() -> WeaviateAuthBearerToken

Resolve the bearer token secrets and return the corresponding Weaviate auth object.
Bases: AuthCredentials
AuthCredentials for client credentials authentication.
By default, it loads client_secret from the environment variable WEAVIATE_CLIENT_SECRET and scope from the environment variable WEAVIATE_SCOPE.
The WEAVIATE_SCOPE environment variable is optional. If set, it can be either a single string or a list of space-separated strings, e.g. "scope1" or "scope1 scope2".

resolve_value() -> WeaviateAuthClientCredentials

Resolve the client credentials secrets and return the corresponding Weaviate auth object.
Bases: AuthCredentials
AuthCredentials for username and password authentication.
By default, it loads username from the environment variable WEAVIATE_USERNAME, password from the environment variable WEAVIATE_PASSWORD, and scope from the environment variable WEAVIATE_SCOPE.
The WEAVIATE_SCOPE environment variable is optional. If set, it can be either a single string or a list of space-separated strings, e.g. "scope1" or "scope1 scope2".

resolve_value() -> WeaviateAuthClientPassword

Resolve the username and password secrets and return the corresponding Weaviate auth object.
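The WEAVIATE_SCOPE format used by the client credentials and username/password classes (one scope, or several separated by spaces) can be illustrated with a few lines of plain Python. read_scope is a hypothetical helper written for this illustration; how the integration itself parses the variable is an assumption here.

```python
import os
from typing import Optional


def read_scope(env_var: str = "WEAVIATE_SCOPE") -> Optional[list[str]]:
    """Return the scopes found in env_var as a list, or None when unset or empty."""
    raw = os.environ.get(env_var, "")
    # str.split() with no argument splits on any run of whitespace,
    # so "scope1 scope2" yields two items and "scope1" yields one.
    scopes = raw.split()
    return scopes or None
```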
A WeaviateDocumentStore instance you can use with Weaviate Cloud Services or self-hosted instances.
Usage example with Weaviate Cloud Services:
import os

from haystack_integrations.document_stores.weaviate.auth import AuthApiKey
from haystack_integrations.document_stores.weaviate.document_store import (
    WeaviateDocumentStore,
)

os.environ["WEAVIATE_API_KEY"] = "MY_API_KEY"
document_store = WeaviateDocumentStore(
    url="rAnD0mD1g1t5.something.weaviate.cloud",
    auth_client_secret=AuthApiKey(),
)

Usage example with self-hosted Weaviate:

from haystack_integrations.document_stores.weaviate.document_store import (
    WeaviateDocumentStore,
)

document_store = WeaviateDocumentStore(url="http://localhost:8080")

__init__(
    *,
    url: str | None = None,
    collection_settings: dict[str, Any] | None = None,
    auth_client_secret: AuthCredentials | None = None,
    additional_headers: dict | None = None,
    embedded_options: EmbeddedOptions | None = None,
    additional_config: AdditionalConfig | None = None,
    grpc_port: int = 50051,
    grpc_secure: bool = False
) -> None

Creates a new instance of WeaviateDocumentStore and connects to the Weaviate instance.
Parameters:
- url (str | None) – The URL of the Weaviate instance.
- collection_settings (dict[str, Any] | None) – The collection settings to use. If None, it will use a collection named default with the following properties:
  - _original_id: text
  - content: text
  - blob_data: blob
  - blob_mime_type: text
  - score: number
  The Document meta fields are omitted in the default collection settings because we can't make assumptions about the structure of the meta field. We strongly recommend creating a custom collection with the correct meta properties for your use case. Another option is relying on automatic schema generation, but that's not recommended for production use. See the official Weaviate documentation for more information on collections and their properties.
- auth_client_secret (AuthCredentials | None) – Authentication credentials. Can be one of the following types depending on the authentication mode:
  - AuthBearerToken to use existing access and (optionally, but recommended) refresh tokens
  - AuthClientPassword to use username and password for the OIDC Resource Owner Password flow
  - AuthClientCredentials to use a client secret for the OIDC client credentials flow
  - AuthApiKey to use an API key
- additional_headers (dict | None) – Additional headers to include in the requests. Can be used to set OpenAI/HuggingFace keys, which look like this:
  {"X-OpenAI-Api-Key": "<THE-KEY>"}, {"X-HuggingFace-Api-Key": "<THE-KEY>"}
- embedded_options (EmbeddedOptions | None) – If set, create an embedded Weaviate cluster inside the client. For a full list of options see weaviate.embedded.EmbeddedOptions.
- additional_config (AdditionalConfig | None) – Additional and advanced configuration options for Weaviate.
- grpc_port (int) – The port to use for the gRPC connection.
- grpc_secure (bool) – Whether to use a secure channel for the underlying gRPC API.
client: weaviate.WeaviateClient

Return the synchronous Weaviate client, creating and connecting it if necessary.

async_client: weaviate.WeaviateAsyncClient

Return the asynchronous Weaviate client, creating and connecting it if necessary.

collection: Collection[dict[str, Any], None]

Return the synchronous Weaviate collection, initializing it via the client if necessary.

async_collection: CollectionAsync[dict[str, Any], None]

Return the asynchronous Weaviate collection, initializing it via the async client if necessary.

close() -> None

Close the synchronous Weaviate client connection.

close_async() -> None

Close the asynchronous Weaviate client connection.
to_dict() -> dict[str, Any]

Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict(data: dict[str, Any]) -> WeaviateDocumentStore

Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – The dictionary to deserialize from.
Returns:
WeaviateDocumentStore – The deserialized component.
count_documents() -> int

Returns the number of documents present in the DocumentStore.

count_documents_async() -> int

Asynchronously returns the number of documents present in the DocumentStore.
count_documents_by_filter(filters: dict[str, Any]) -> int

Returns the number of documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to count documents. For filter syntax, see Haystack metadata filtering.
Returns:
int – The number of documents that match the filters.
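The filter-based methods take filters as nested dictionaries in Haystack's metadata filtering syntax: comparison conditions with field, operator, and value keys, optionally grouped by a logical operator. The sketch below builds such a dictionary; comparison and all_of are hypothetical convenience helpers, and the field names meta.category and meta.year are made up for the example.

```python
from typing import Any


def comparison(field: str, operator: str, value: Any) -> dict[str, Any]:
    """Build a single comparison condition."""
    return {"field": field, "operator": operator, "value": value}


def all_of(*conditions: dict[str, Any]) -> dict[str, Any]:
    """Group conditions so that a document must satisfy every one of them."""
    return {"operator": "AND", "conditions": list(conditions)}


filters = all_of(
    comparison("meta.category", "==", "news"),
    comparison("meta.year", ">=", 2020),
)

# With a connected store, the dictionary would be passed as-is, for example:
# document_store.count_documents_by_filter(filters=filters)
```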
count_documents_by_filter_async(filters: dict[str, Any]) -> int

Asynchronously returns the number of documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to count documents. For filter syntax, see Haystack metadata filtering.
Returns:
int – The number of documents that match the filters.
get_metadata_fields_info() -> dict[str, dict[str, str]]

Returns metadata field names and their types, excluding special fields.
Special fields (content, blob_data, blob_mime_type, _original_id, score) are excluded as they are not user metadata fields.
Returns:
dict[str, dict[str, str]] – A dictionary where keys are field names and values are dictionaries containing type information, e.g.:
{
    'number': {'type': 'int'},
    'date': {'type': 'date'},
    'category': {'type': 'text'},
    'status': {'type': 'text'}
}

get_metadata_fields_info_async() -> dict[str, dict[str, str]]

Asynchronously returns metadata field names and their types, excluding special fields.
Special fields (content, blob_data, blob_mime_type, _original_id, score) are excluded as they are not user metadata fields.
Returns:
dict[str, dict[str, str]] – A dictionary where keys are field names and values are dictionaries containing type information, e.g.:
{
    'number': {'type': 'int'},
    'date': {'type': 'date'},
    'category': {'type': 'text'},
    'status': {'type': 'text'}
}

get_metadata_field_min_max(metadata_field: str) -> dict[str, Any]

Returns the minimum and maximum values for a numeric or date metadata field.
Parameters:
- metadata_field (str) – The metadata field name to get min/max for. Can be prefixed with 'meta.' (e.g., 'meta.year' or 'year').
Returns:
dict[str, Any] – A dictionary with 'min' and 'max' keys containing the respective values.
Raises:
ValueError – If the field is not found or doesn't support min/max operations.
get_metadata_field_min_max_async(metadata_field: str) -> dict[str, Any]

Asynchronously returns the minimum and maximum values for a numeric or date metadata field.
Parameters:
- metadata_field (str) – The metadata field name to get min/max for. Can be prefixed with 'meta.' (e.g., 'meta.year' or 'year').
Returns:
dict[str, Any] – A dictionary with 'min' and 'max' keys containing the respective values.
Raises:
ValueError – If the field is not found or doesn't support min/max operations.
count_unique_metadata_by_filter(
    filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]

Returns the count of unique values for each specified metadata field.
Parameters:
- filters (dict[str, Any]) – The filters to apply when counting unique values. For filter syntax, see Haystack metadata filtering.
- metadata_fields (list[str]) – List of metadata field names to count unique values for. Field names can be prefixed with 'meta.' (e.g., 'meta.category' or 'category').
Returns:
dict[str, int] – A dictionary mapping field names to counts of unique values.
Raises:
ValueError – If any of the requested fields don't exist in the collection schema.
count_unique_metadata_by_filter_async(
    filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]

Asynchronously returns the count of unique values for each specified metadata field.
Parameters:
- filters (dict[str, Any]) – The filters to apply when counting unique values. For filter syntax, see Haystack metadata filtering.
- metadata_fields (list[str]) – List of metadata field names to count unique values for. Field names can be prefixed with 'meta.' (e.g., 'meta.category' or 'category').
Returns:
dict[str, int] – A dictionary mapping field names to counts of unique values.
Raises:
ValueError – If any of the requested fields don't exist in the collection schema.
get_metadata_field_unique_values(
    metadata_field: str,
    search_term: str | None = None,
    from_: int = 0,
    size: int = 10000,
) -> tuple[list[str], int]

Returns unique values for a metadata field with pagination support.
Parameters:
- metadata_field (str) – The metadata field name to get unique values for. Can be prefixed with 'meta.' (e.g., 'meta.category' or 'category').
- search_term (str | None) – Optional term to filter documents by content before extracting unique values. If provided, only documents whose content contains this term will be considered. Note: Uses substring matching (case-sensitive, no stemming).
- from_ (int) – The starting offset for pagination (0-indexed). Defaults to 0.
- size (int) – The maximum number of unique values to return. Defaults to 10000.
Returns:
tuple[list[str], int] – A tuple of (list of unique values, total count of unique values).
Raises:
ValueError – If the field is not found in the collection schema.
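The from_ and size parameters, together with the total count in the returned tuple, allow a caller to page through all unique values. A minimal sketch of such a loop follows; collect_unique_values is a hypothetical helper that accepts any callable taking (from_, size) and returning (values, total), so it can be exercised without a running Weaviate instance.

```python
from typing import Callable


def collect_unique_values(
    fetch_page: Callable[[int, int], tuple[list[str], int]],
    page_size: int = 1000,
) -> list[str]:
    """Request pages of page_size values until the reported total is covered."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    values: list[str] = []
    offset = 0
    while True:
        page, total = fetch_page(offset, page_size)
        values.extend(page)
        offset += page_size
        # Stop on an empty page or once the offset has passed the total count.
        if not page or offset >= total:
            return values
```

With a connected store, fetch_page could be a small wrapper such as lambda from_, size: document_store.get_metadata_field_unique_values("meta.category", from_=from_, size=size), where meta.category is a made-up field name.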
get_metadata_field_unique_values_async(
    metadata_field: str,
    search_term: str | None = None,
    from_: int = 0,
    size: int = 10000,
) -> tuple[list[str], int]

Asynchronously returns unique values for a metadata field with pagination support.
Parameters:
- metadata_field (str) – The metadata field name to get unique values for. Can be prefixed with 'meta.' (e.g., 'meta.category' or 'category').
- search_term (str | None) – Optional term to filter documents by content before extracting unique values. If provided, only documents whose content contains this term will be considered. Note: Uses substring matching (case-sensitive, no stemming).
- from_ (int) – The starting offset for pagination (0-indexed). Defaults to 0.
- size (int) – The maximum number of unique values to return. Defaults to 10000.
Returns:
tuple[list[str], int] – A tuple of (list of unique values, total count of unique values).
Raises:
ValueError – If the field is not found in the collection schema.
filter_documents(filters: dict[str, Any] | None = None) -> list[Document]

Returns the documents that match the filters provided.
For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol documentation.
Note: The contains filter operator is case-sensitive (substring matching). For case-insensitive matching, normalize the value before building the filter.
Parameters:
- filters (dict[str, Any] | None) – The filters to apply to the document list.
Returns:
list[Document] – A list of Documents that match the given filters.
filter_documents_async(filters: dict[str, Any] | None = None) -> list[Document]

Asynchronously returns the documents that match the filters provided.
For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol documentation.
Note: The contains filter operator is case-sensitive (substring matching). For case-insensitive matching, normalize the value before building the filter.
Parameters:
- filters (dict[str, Any] | None) – The filters to apply to the document list.
Returns:
list[Document] – A list of Documents that match the given filters.
write_documents(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int

Writes documents to Weaviate using the specified policy.
We recommend using the OVERWRITE policy because it is faster than the other policies for Weaviate: it uses the batch API. The batch API can't be used for the other policies because it doesn't report whether a document already exists, which prevents us from raising errors under the FAIL policy or skipping a Document under the SKIP policy.
Parameters:
- documents (list[Document]) – A list of documents to write into the document store.
- policy (DuplicatePolicy) – DuplicatePolicy to apply when a document with the same ID already exists in the document store.
Returns:
int – The number of documents written.
Raises:
- ValueError – When input is not valid.
- DuplicateDocumentError – When duplicate documents are found and using a FAIL policy.
- DocumentStoreError – When documents have failed to be batch written.
write_documents_async(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int

Asynchronously writes documents to Weaviate using the specified policy.
We recommend using the OVERWRITE policy because it is faster than the other policies for Weaviate: it uses the batch API. The batch API can't be used for the other policies because it doesn't report whether a document already exists, which prevents us from raising errors under the FAIL policy or skipping a Document under the SKIP policy.
Parameters:
- documents (list[Document]) – A list of documents to write into the document store.
- policy (DuplicatePolicy) – DuplicatePolicy to apply when a document with the same ID already exists in the document store.
Returns:
int – The number of documents written.
Raises:
- ValueError – When input is not valid.
- DuplicateDocumentError – When duplicate documents are found and using a FAIL policy.
- DocumentStoreError – When documents have failed to be batch written.
delete_documents(document_ids: list[str]) -> None

Deletes all documents with matching document_ids from the DocumentStore.
Parameters:
- document_ids (list[str]) – The object_ids to delete.
delete_documents_async(document_ids: list[str]) -> None

Asynchronously deletes all documents with matching document_ids from the DocumentStore.
Parameters:
- document_ids (list[str]) – The object_ids to delete.
delete_all_documents(
    *, recreate_index: bool = False, batch_size: int = 1000
) -> None

Deletes all documents in a collection.
If recreate_index is False, it keeps the collection but deletes documents iteratively. If recreate_index is True, the collection is dropped and faithfully recreated. This is recommended for performance reasons.
Parameters:
- recreate_index (bool) – Use the drop-and-recreate strategy (recommended for performance).
- batch_size (int) – Only relevant if recreate_index is False. Defines the deletion batch size. This value must be less than or equal to the QUERY_MAXIMUM_RESULTS variable set for the Weaviate deployment (default is 10000). Reference: https://docs.weaviate.io/weaviate/manage-objects/delete#delete-all-objects
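Since batch_size must not exceed the deployment's QUERY_MAXIMUM_RESULTS, a caller can cap the requested value before starting an iterative delete. safe_batch_size is a hypothetical helper; its query_maximum_results argument is assumed to hold whatever limit the Weaviate deployment is configured with (10000 by default, per the description above).

```python
def safe_batch_size(requested: int, query_maximum_results: int = 10000) -> int:
    """Cap the requested deletion batch size at the server-side result limit."""
    if requested < 1:
        raise ValueError("batch size must be at least 1")
    return min(requested, query_maximum_results)


# With a connected store, the capped value would be passed on, for example:
# document_store.delete_all_documents(batch_size=safe_batch_size(20000))
```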
delete_all_documents_async(
    *, recreate_index: bool = False, batch_size: int = 1000
) -> None

Asynchronously deletes all documents in a collection.
If recreate_index is False, it keeps the collection but deletes documents iteratively. If recreate_index is True, the collection is dropped and faithfully recreated. This is recommended for performance reasons.
Parameters:
- recreate_index (bool) – Use the drop-and-recreate strategy (recommended for performance).
- batch_size (int) – Only relevant if recreate_index is False. Defines the deletion batch size. This value must be less than or equal to the QUERY_MAXIMUM_RESULTS variable set for the Weaviate deployment (default is 10000). Reference: https://docs.weaviate.io/weaviate/manage-objects/delete#delete-all-objects
delete_by_filter(filters: dict[str, Any]) -> int

Deletes all documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to select documents for deletion. For filter syntax, see Haystack metadata filtering.
Returns:
int – The number of documents deleted.
delete_by_filter_async(filters: dict[str, Any]) -> int

Asynchronously deletes all documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to select documents for deletion. For filter syntax, see Haystack metadata filtering.
Returns:
int – The number of documents deleted.
update_by_filter(filters: dict[str, Any], meta: dict[str, Any]) -> int

Updates the metadata of all documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to select documents for updating. For filter syntax, see Haystack metadata filtering.
- meta (dict[str, Any]) – The metadata fields to update. These will be merged with existing metadata.
Returns:
int – The number of documents updated.
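The meta argument is described as being merged with existing metadata. The sketch below shows what a shallow merge would look like for a single document, assuming that keys present in the update overwrite existing keys and all other keys are kept. merge_meta is a hypothetical helper, and whether the integration merges nested values the same way is not stated in this reference.

```python
from typing import Any


def merge_meta(existing: dict[str, Any], updates: dict[str, Any]) -> dict[str, Any]:
    """Shallow merge: keys in updates win, untouched keys in existing are kept."""
    return {**existing, **updates}
```

Under this assumption, updating {"status": "draft", "year": 2023} with {"status": "published"} keeps year and changes only status.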
update_by_filter_async(filters: dict[str, Any], meta: dict[str, Any]) -> int

Asynchronously updates the metadata of all documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to select documents for updating. For filter syntax, see Haystack metadata filtering.
- meta (dict[str, Any]) – The metadata fields to update. These will be merged with existing metadata.
Returns:
int – The number of documents updated.