| title | AlloyDB |
|---|---|
| id | integrations-alloydb |
| description | AlloyDB integration for Haystack |
| slug | /integrations-alloydb |
Retrieves documents from the AlloyDBDocumentStore by embedding similarity.
Must be connected to the AlloyDBDocumentStore.
__init__(
*,
document_store: AlloyDBDocumentStore,
filters: dict[str, Any] | None = None,
top_k: int = 10,
vector_function: (
Literal["cosine_similarity", "inner_product", "l2_distance"] | None
) = None,
filter_policy: str | FilterPolicy = FilterPolicy.REPLACE
) -> None
Create the AlloyDBEmbeddingRetriever component.
Parameters:
- document_store (AlloyDBDocumentStore) – An instance of AlloyDBDocumentStore to use as the document store.
- filters (dict[str, Any] | None) – Filters applied to the retrieved documents.
- top_k (int) – Maximum number of documents to return.
- vector_function (Literal['cosine_similarity', 'inner_product', 'l2_distance'] | None) – The similarity function to use when searching for similar embeddings. Overrides the vector_function set in the AlloyDBDocumentStore. "cosine_similarity" and "inner_product" are similarity functions, so higher scores indicate greater similarity between the documents. "l2_distance" returns the straight-line distance between vectors, so the most similar documents are the ones with the smallest score. Important: when using the "hnsw" search strategy, use the same vector function as the one used when the HNSW index was created. If not specified, the vector_function of the AlloyDBDocumentStore is used.
- filter_policy (str | FilterPolicy) – Policy to determine how filters are applied at query time. FilterPolicy.REPLACE (default) replaces the init filters with the run-time filters. FilterPolicy.MERGE merges the init filters with the run-time filters.
Raises:
- ValueError – If document_store is not an instance of AlloyDBDocumentStore.
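To make the scoring semantics of vector_function concrete, here is a minimal pure-Python sketch of the three functions and of the resulting ranking order. It does not call AlloyDB or pgvector; the helper names (`inner_product`, `cosine_similarity`, `l2_distance`, `rank`) and the sample vectors are illustrative assumptions, not part of the integration.

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    norms = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norms


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


FUNCTIONS = {
    "cosine_similarity": cosine_similarity,
    "inner_product": inner_product,
    "l2_distance": l2_distance,
}


def rank(query, docs, vector_function):
    """Return document names ordered from most to least similar."""
    score = FUNCTIONS[vector_function]
    # Similarities: larger is better. Distances: smaller is better.
    best_first = vector_function != "l2_distance"
    return sorted(docs, key=lambda name: score(query, docs[name]), reverse=best_first)


query = [1.0, 0.0]
docs = {"near": [0.9, 0.1], "long": [5.0, 5.0], "far": [-1.0, 0.0]}

for name in FUNCTIONS:
    print(name, rank(query, docs, name))
```

The three functions can rank the same documents differently: inner product rewards vector length, cosine similarity ignores it, and L2 distance penalizes it.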
run(
query_embedding: list[float],
filters: dict[str, Any] | None = None,
top_k: int | None = None,
vector_function: (
Literal["cosine_similarity", "inner_product", "l2_distance"] | None
) = None,
) -> dict[str, list[Document]]
Retrieve documents from the AlloyDBDocumentStore by embedding similarity.
Parameters:
- query_embedding (list[float]) – A vector representation of the query.
- filters (dict[str, Any] | None) – Filters applied to the retrieved documents. The filter_policy set at initialization determines how these are combined with the init filters.
- top_k (int | None) – Maximum number of documents to return. Overrides the top_k set at initialization.
- vector_function (Literal['cosine_similarity', 'inner_product', 'l2_distance'] | None) – The similarity function to use when searching for similar embeddings. Overrides the vector_function set at initialization.
Returns:
- dict[str, list[Document]] – A dictionary containing the documents retrieved from the document store, under the "documents" key.
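The interaction between init filters, run-time filters, and filter_policy can be sketched as follows. This is a simplified model under stated assumptions: the `FilterPolicy` enum here is a local stand-in for the Haystack class, `combine_filters` is a hypothetical helper, and MERGE is modeled as a plain AND of both filters, whereas the actual merge rules may handle more cases.

```python
from enum import Enum


class FilterPolicy(Enum):
    """Local stand-in for the Haystack FilterPolicy enum (assumption)."""

    REPLACE = "replace"
    MERGE = "merge"


def combine_filters(policy, init_filters, runtime_filters):
    """Return the filters that a run() call would effectively apply."""
    if not runtime_filters:
        # Nothing passed at run time: the init filters stay in effect.
        return init_filters
    if policy is FilterPolicy.REPLACE or not init_filters:
        return runtime_filters
    # Simplified MERGE: both filters must match.
    return {"operator": "AND", "conditions": [init_filters, runtime_filters]}


init = {"field": "meta.lang", "operator": "==", "value": "en"}
runtime = {"field": "meta.year", "operator": ">=", "value": 2023}

print(combine_filters(FilterPolicy.REPLACE, init, runtime))
print(combine_filters(FilterPolicy.MERGE, init, runtime))
```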
to_dict() -> dict[str, Any]
Serializes the component to a dictionary.
Returns:
- dict[str, Any] – Dictionary with serialized data.
from_dict(data: dict[str, Any]) -> AlloyDBEmbeddingRetriever
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
- AlloyDBEmbeddingRetriever – Deserialized component.
Retrieves documents from the AlloyDBDocumentStore by keyword search.
Uses PostgreSQL full-text search (to_tsvector / plainto_tsquery) to find documents.
Must be connected to the AlloyDBDocumentStore.
__init__(
*,
document_store: AlloyDBDocumentStore,
filters: dict[str, Any] | None = None,
top_k: int = 10,
filter_policy: str | FilterPolicy = FilterPolicy.REPLACE
) -> None
Create the AlloyDBKeywordRetriever component.
Parameters:
- document_store (AlloyDBDocumentStore) – An instance of AlloyDBDocumentStore to use as the document store.
- filters (dict[str, Any] | None) – Filters applied to the retrieved documents.
- top_k (int) – Maximum number of documents to return.
- filter_policy (str | FilterPolicy) – Policy to determine how filters are applied at query time. FilterPolicy.REPLACE (default) replaces the init filters with the run-time filters. FilterPolicy.MERGE merges the init filters with the run-time filters.
Raises:
- ValueError – If document_store is not an instance of AlloyDBDocumentStore.
run(
query: str, filters: dict[str, Any] | None = None, top_k: int | None = None
) -> dict[str, list[Document]]
Retrieve documents from the AlloyDBDocumentStore by keyword search.
Parameters:
- query (str) – A keyword query to search for.
- filters (dict[str, Any] | None) – Filters applied to the retrieved documents. The filter_policy set at initialization determines how these are combined with the init filters.
- top_k (int | None) – Maximum number of documents to return. Overrides the top_k set at initialization.
Returns:
- dict[str, list[Document]] – A dictionary containing the documents retrieved from the document store, under the "documents" key.
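For readers unfamiliar with PostgreSQL full-text search, the general to_tsvector / plainto_tsquery pattern can be sketched as a parameterized query. This is not the integration's actual SQL: the column names `id` and `content`, the use of `ts_rank_cd` for scoring, the `%s` placeholder style, and the helper `build_keyword_query` are all assumptions made for illustration.

```python
import re

# Identifiers cannot be passed as query parameters, so validate them instead.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_keyword_query(schema_name, table_name, language, query, top_k):
    """Return (sql, params) for a ranked full-text search (illustrative only)."""
    for name in (schema_name, table_name):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    sql = (
        "SELECT id, content, "
        "ts_rank_cd(to_tsvector(%s::regconfig, content), "
        "plainto_tsquery(%s::regconfig, %s)) AS score "
        f"FROM {schema_name}.{table_name} "
        "WHERE to_tsvector(%s::regconfig, content) "
        "@@ plainto_tsquery(%s::regconfig, %s) "
        "ORDER BY score DESC LIMIT %s"
    )
    params = (language, language, query, language, language, query, top_k)
    return sql, params


sql, params = build_keyword_query(
    "public", "haystack_documents", "english", "managed postgres", 10
)
print(sql)
print(params)
```

plainto_tsquery treats the query as plain text and combines its normalized terms with AND, so users do not need to know tsquery operator syntax.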
to_dict() -> dict[str, Any]
Serializes the component to a dictionary.
Returns:
- dict[str, Any] – Dictionary with serialized data.
from_dict(data: dict[str, Any]) -> AlloyDBKeywordRetriever
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
- AlloyDBKeywordRetriever – Deserialized component.
Bases: DocumentStore
A Document Store backed by Google Cloud AlloyDB.
Uses the pgvector extension for vector search.
AlloyDB is a fully managed, PostgreSQL-compatible database service on Google Cloud. Connection is handled securely via the AlloyDB Python Connector, which provides TLS encryption and IAM-based authorization without requiring manual SSL certificate management, firewall rules, or IP allowlisting.
Filter limitations: the NOT logical operator is not supported. Use != or not in
comparison operators to express negation.
Usage example:
import os
from haystack_integrations.document_stores.alloydb import AlloyDBDocumentStore
# Set required environment variables:
# ALLOYDB_INSTANCE_URI = "projects/MY_PROJECT/locations/MY_REGION/clusters/MY_CLUSTER/instances/MY_INSTANCE"
# ALLOYDB_USER = "my-db-user"
# ALLOYDB_PASSWORD = "my-db-password"
document_store = AlloyDBDocumentStore(
db="my-database",
embedding_dimension=768,
recreate_table=True,
)
__init__(
*,
instance_uri: Secret = Secret.from_env_var("ALLOYDB_INSTANCE_URI"),
user: Secret = Secret.from_env_var("ALLOYDB_USER"),
password: Secret = Secret.from_env_var("ALLOYDB_PASSWORD", strict=False),
db: str = "postgres",
enable_iam_auth: bool = False,
ip_type: Literal["PRIVATE", "PUBLIC", "PSC"] = "PRIVATE",
create_extension: bool = True,
schema_name: str = "public",
table_name: str = "haystack_documents",
language: str = "english",
embedding_dimension: int = 768,
vector_function: Literal[
"cosine_similarity", "inner_product", "l2_distance"
] = "cosine_similarity",
recreate_table: bool = False,
search_strategy: Literal[
"exact_nearest_neighbor", "hnsw"
] = "exact_nearest_neighbor",
hnsw_recreate_index_if_exists: bool = False,
hnsw_index_creation_kwargs: dict[str, int] | None = None,
hnsw_index_name: str = "haystack_hnsw_index",
hnsw_ef_search: int | None = None,
keyword_index_name: str = "haystack_keyword_index"
) -> None
Creates a new AlloyDBDocumentStore instance.
Connection to AlloyDB is established lazily on first use via the AlloyDB Python Connector. A specific table to store Haystack documents will be created if it doesn't exist yet.
Parameters:
- instance_uri (Secret) – The AlloyDB instance URI in the format "projects/PROJECT/locations/REGION/clusters/CLUSTER/instances/INSTANCE". Read from the ALLOYDB_INSTANCE_URI environment variable by default.
- user (Secret) – The database user. Read from the ALLOYDB_USER environment variable by default. When using IAM database authentication, use the service account email (omitting .gserviceaccount.com) or the full IAM user email.
- password (Secret) – The database password. Read from the ALLOYDB_PASSWORD environment variable by default. Not required when enable_iam_auth=True.
- db (str) – The name of the database to connect to. Defaults to "postgres".
- enable_iam_auth (bool) – Whether to use IAM database authentication instead of a password. When True, password is ignored. The IAM principal must be granted the AlloyDB Client role and have an IAM database user created. See the AlloyDB documentation for details.
- ip_type (Literal['PRIVATE', 'PUBLIC', 'PSC']) – The IP address type to use for the connection. "PRIVATE" (default) connects over a private VPC IP. "PUBLIC" connects over a public IP. "PSC" connects via Private Service Connect.
- create_extension (bool) – Whether to create the pgvector extension if it doesn't exist. Set this to True (default) to automatically create the extension if it is missing. Creating the extension may require superuser privileges. If set to False, ensure the extension is already installed; otherwise, an error will be raised.
- schema_name (str) – The name of the schema the table is created in. The schema must already exist.
- table_name (str) – The name of the table to use to store Haystack documents.
- language (str) – The language to be used to parse query and document content in keyword retrieval. To see the list of available languages, you can run the following SQL query in your PostgreSQL database: SELECT cfgname FROM pg_ts_config;.
- embedding_dimension (int) – The dimension of the embedding.
- vector_function (Literal['cosine_similarity', 'inner_product', 'l2_distance']) – The similarity function to use when searching for similar embeddings. "cosine_similarity" and "inner_product" are similarity functions, so higher scores indicate greater similarity between the documents. "l2_distance" returns the straight-line distance between vectors, so the most similar documents are the ones with the smallest score. Important: when using the "hnsw" search strategy, an index will be created that depends on the vector_function passed here. Make sure subsequent queries keep using the same vector similarity function in order to take advantage of the index.
- recreate_table (bool) – Whether to recreate the table if it already exists.
- search_strategy (Literal['exact_nearest_neighbor', 'hnsw']) – The search strategy to use when searching for similar embeddings. "exact_nearest_neighbor" provides perfect recall but can be slow for large numbers of documents. "hnsw" is an approximate nearest neighbor search strategy, which trades off some accuracy for speed; it is recommended for large numbers of documents. Important: when using the "hnsw" search strategy, an index will be created that depends on the vector_function passed here. Make sure subsequent queries keep using the same vector similarity function in order to take advantage of the index.
- hnsw_recreate_index_if_exists (bool) – Whether to recreate the HNSW index if it already exists. Only used if search_strategy is set to "hnsw".
- hnsw_index_creation_kwargs (dict[str, int] | None) – Additional keyword arguments to pass to the HNSW index creation. Only used if search_strategy is set to "hnsw". Valid arguments are m and ef_construction. See the pgvector documentation for details.
- hnsw_index_name (str) – Index name for the HNSW index.
- hnsw_ef_search (int | None) – The ef_search parameter to use at query time. Only used if search_strategy is set to "hnsw". See the pgvector documentation.
- keyword_index_name (str) – Index name for the keyword GIN index.
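The reason an HNSW index "depends on" the vector function is that pgvector builds each index for one operator class, and only queries using the matching distance operator can use it. The sketch below shows that mapping and the general shape of the index statement. It is illustrative only: the column name `embedding` and the helper `hnsw_index_sql` are assumptions, and the statement the integration actually issues may differ.

```python
# pgvector operator class and distance operator per vector function.
PGVECTOR = {
    "cosine_similarity": {"opclass": "vector_cosine_ops", "operator": "<=>"},
    "inner_product": {"opclass": "vector_ip_ops", "operator": "<#>"},
    "l2_distance": {"opclass": "vector_l2_ops", "operator": "<->"},
}

ALLOWED_HNSW_KWARGS = {"m", "ef_construction"}


def hnsw_index_sql(index_name, schema_name, table_name, vector_function, creation_kwargs=None):
    """Build a CREATE INDEX statement for an HNSW index (illustrative only)."""
    kwargs = creation_kwargs or {}
    unknown = set(kwargs) - ALLOWED_HNSW_KWARGS
    if unknown:
        raise ValueError(f"unsupported HNSW arguments: {sorted(unknown)}")
    opclass = PGVECTOR[vector_function]["opclass"]
    sql = (
        f"CREATE INDEX IF NOT EXISTS {index_name} "
        f"ON {schema_name}.{table_name} USING hnsw (embedding {opclass})"
    )
    if kwargs:
        options = ", ".join(f"{key} = {int(value)}" for key, value in sorted(kwargs.items()))
        sql += f" WITH ({options})"
    return sql


statement = hnsw_index_sql(
    "haystack_hnsw_index",
    "public",
    "haystack_documents",
    "cosine_similarity",
    {"m": 16, "ef_construction": 64},
)
print(statement)
```

An index created with vector_cosine_ops accelerates `<=>` queries only; a later query using `<->` or `<#>` would fall back to a sequential scan.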
to_dict() -> dict[str, Any]
Serializes the component to a dictionary.
Returns:
- dict[str, Any] – Dictionary with serialized data.
from_dict(data: dict[str, Any]) -> AlloyDBDocumentStore
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
- AlloyDBDocumentStore – Deserialized component.
close() -> None
Closes the database connection and the AlloyDB connector.
Call this when you are done using the document store to release resources.
For long-lived applications the connector runs a background refresh thread;
calling close() ensures that thread is stopped cleanly.
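One way to make sure close() always runs, even when an error occurs, is `contextlib.closing` from the standard library, which calls the object's close() method on exit. The sketch below uses a hypothetical `FakeStore` stand-in so it runs without a database; with the real document store the pattern would be the same, assuming only that close() behaves as described above.

```python
from contextlib import closing


class FakeStore:
    """Hypothetical stand-in that exposes the same close() method."""

    def __init__(self):
        self.closed = False

    def count_documents(self):
        return 0

    def close(self):
        self.closed = True


store = FakeStore()

# closing() calls store.close() when the block exits, including on exceptions.
with closing(store) as active_store:
    document_count = active_store.count_documents()

print(document_count, store.closed)
```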
delete_table() -> None
Deletes the table used to store Haystack documents.
The name of the schema (schema_name) and the name of the table (table_name)
are defined when initializing the AlloyDBDocumentStore.
count_documents() -> int
Returns how many documents are in the document store.
Returns:
- int – The number of documents in the document store.
filter_documents(filters: dict[str, Any] | None = None) -> list[Document]
Returns the documents that match the filters provided.
For a detailed specification of the filters, refer to the Haystack metadata filtering documentation.
Filter operator support: comparison operators (==, !=, >, >=, <, <=, in,
not in, like, not like) and logical operators AND and OR are fully supported.
The NOT logical operator is not supported — use != or not in comparison
operators instead.
Parameters:
- filters (dict[str, Any] | None) – The filters to apply to the document list.
Returns:
- list[Document] – A list of Documents that match the given filters.
Raises:
- TypeError – If filters is not a dictionary.
- ValueError – If filters syntax is invalid.
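Because the NOT logical operator is not supported, a filter that uses it has to be rewritten with inverted comparison operators before it is passed to the store. The sketch below does this with De Morgan's laws. The helpers `negate` and `remove_not` are hypothetical, and the sketch assumes that NOT over several conditions means NOT(AND(...)). Documents where the field is missing may behave differently after the rewrite, so treat the result as a starting point rather than a guaranteed equivalent.

```python
INVERSE = {
    "==": "!=", "!=": "==",
    ">": "<=", "<=": ">",
    ">=": "<", "<": ">=",
    "in": "not in", "not in": "in",
    "like": "not like", "not like": "like",
}


def _group(operator, conditions):
    """Build a logical node, unwrapping groups that hold a single condition."""
    if len(conditions) == 1:
        return conditions[0]
    return {"operator": operator, "conditions": conditions}


def negate(node):
    """Return a NOT-free filter equivalent to the negation of node."""
    op = node["operator"]
    if op in INVERSE:
        return {**node, "operator": INVERSE[op]}
    if op == "AND":
        return _group("OR", [negate(c) for c in node["conditions"]])
    if op == "OR":
        return _group("AND", [negate(c) for c in node["conditions"]])
    if op == "NOT":
        # Double negation cancels out.
        return _group("AND", [remove_not(c) for c in node["conditions"]])
    raise ValueError(f"unknown operator: {op!r}")


def remove_not(node):
    """Rewrite a filter so that it no longer contains the NOT operator."""
    op = node["operator"]
    if op == "NOT":
        return negate({"operator": "AND", "conditions": node["conditions"]})
    if op in ("AND", "OR"):
        return _group(op, [remove_not(c) for c in node["conditions"]])
    return node


is_article = {"field": "meta.type", "operator": "==", "value": "article"}
recent = {"field": "meta.year", "operator": "in", "value": [2023, 2024]}

print(remove_not({"operator": "NOT", "conditions": [is_article]}))
print(remove_not({"operator": "NOT", "conditions": [is_article, recent]}))
```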
write_documents(
documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.FAIL
) -> int
Writes documents to the document store.
Parameters:
- documents (list[Document]) – A list of Documents to write to the document store.
- policy (DuplicatePolicy) – The duplicate policy to use when writing documents.
Returns:
- int – The number of documents written to the document store.
Raises:
- ValueError – If documents contains objects that are not of type Document.
- DuplicateDocumentError – If a document with the same id already exists in the document store and the policy is set to DuplicatePolicy.FAIL (or not specified).
- DocumentStoreError – If the write operation fails for any other reason.
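The effect of the duplicate policy on the return value can be illustrated with an in-memory model. Everything here is a stand-in: the `DuplicatePolicy` enum and `DuplicateDocumentError` are redefined locally, documents are plain dictionaries, and the SKIP and OVERWRITE behavior is modeled on the usual Haystack semantics rather than taken from this integration's code. A real database write would also be transactional, which this sketch does not model.

```python
from enum import Enum


class DuplicatePolicy(Enum):
    """Local stand-in for the Haystack DuplicatePolicy enum (assumption)."""

    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


class DuplicateDocumentError(Exception):
    """Local stand-in for the Haystack exception of the same name."""


def write_documents(table, documents, policy=DuplicatePolicy.FAIL):
    """Write dict-based documents into table and return how many were written."""
    written = 0
    for doc in documents:
        if doc["id"] in table:
            if policy is DuplicatePolicy.FAIL:
                raise DuplicateDocumentError(f"document {doc['id']!r} already exists")
            if policy is DuplicatePolicy.SKIP:
                continue  # keep the stored version, do not count it
        table[doc["id"]] = doc
        written += 1
    return written


table = {}
first = write_documents(table, [{"id": "a", "content": "v1"}])
skipped = write_documents(
    table,
    [{"id": "a", "content": "v2"}, {"id": "b", "content": "v1"}],
    DuplicatePolicy.SKIP,
)
print(first, skipped, table["a"]["content"])
```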
delete_documents(document_ids: list[str]) -> None
Deletes documents that match the provided document_ids from the document store.
Parameters:
- document_ids (list[str]) – The document IDs to delete.
delete_all_documents() -> None
Deletes all documents in the document store.
delete_by_filter(filters: dict[str, Any]) -> int
Deletes all documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to select documents for deletion. For filter syntax, see Haystack metadata filtering.
Returns:
- int – The number of documents deleted.
update_by_filter(filters: dict[str, Any], meta: dict[str, Any]) -> int
Updates the metadata of all documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to select documents for updating. For filter syntax, see Haystack metadata filtering.
- meta (dict[str, Any]) – The metadata fields to update.
Returns:
- int – The number of documents updated.
count_documents_by_filter(filters: dict[str, Any]) -> int
Returns the number of documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to count documents. For filter syntax, see Haystack metadata filtering.
Returns:
- int – The number of documents that match the filters.
count_unique_metadata_by_filter(
filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]
Returns the count of unique values for each specified metadata field.
Considers only documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to select documents. For filter syntax, see Haystack metadata filtering.
- metadata_fields (list[str]) – List of metadata field names to count unique values for. Field names can include or omit the "meta." prefix.
Returns:
- dict[str, int] – A dictionary mapping field names to their unique value counts.
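As a rough mental model of the result, the following in-memory sketch counts distinct values per field over documents that are assumed to have already passed the filters. The helper `count_unique_metadata` and the dict-based documents are hypothetical, and returning keys without the "meta." prefix is an assumption; the source only states that the prefix is optional on input.

```python
import json


def count_unique_metadata(documents, metadata_fields):
    """Count distinct values per metadata field across the given documents."""
    counts = {}
    for field in metadata_fields:
        key = field.removeprefix("meta.")  # accept "meta.category" or "category"
        # Serialize values so that lists and dicts can be compared as well.
        distinct = {
            json.dumps(doc["meta"][key], sort_keys=True)
            for doc in documents
            if key in doc["meta"]
        }
        counts[key] = len(distinct)
    return counts


matching_documents = [
    {"meta": {"category": "news", "priority": 1}},
    {"meta": {"category": "news", "priority": 2}},
    {"meta": {"category": "blog"}},
]

result = count_unique_metadata(matching_documents, ["meta.category", "priority"])
print(result)
```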
get_metadata_fields_info() -> dict[str, dict[str, str]]
Returns information about the metadata fields in the document store.
Since metadata is stored in a JSONB field, this method analyzes actual data to infer field types.
Example return:
{
    'category': {'type': 'text'},
    'priority': {'type': 'integer'},
}
Returns:
- dict[str, dict[str, str]] – A dictionary mapping field names to their type information.
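Inferring types from stored values, rather than from a schema, can be pictured with the in-memory sketch below. The helpers `infer_type` and `metadata_fields_info` are hypothetical. The labels 'text', 'integer', and 'real' appear in this reference, while 'boolean' and the first-value-wins rule are assumptions made for the example.

```python
def infer_type(value):
    """Map a Python value to a type label (labels partly assumed)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "real"
    return "text"


def metadata_fields_info(documents):
    """Infer a type for every metadata field found in the documents."""
    info = {}
    for doc in documents:
        for key, value in doc["meta"].items():
            # The first value seen for a field decides its type in this sketch.
            info.setdefault(key, {"type": infer_type(value)})
    return info


documents = [
    {"meta": {"category": "news", "priority": 1}},
    {"meta": {"category": "blog", "score": 0.5}},
]

fields = metadata_fields_info(documents)
print(fields)
```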
get_metadata_field_min_max(field: str) -> dict[str, Any]
Returns the minimum and maximum values for a metadata field.
For numeric fields (integer, real), returns numeric min/max.
For text and other non-numeric fields, returns lexicographic min/max
using the "C" collation.
Parameters:
- field (str) – The metadata field name (with or without the "meta." prefix).
Returns:
- dict[str, Any] – A dictionary with min and max keys. Returns {"min": None, "max": None} when the field has no values or the store is empty.
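The difference between numeric and "C"-collation ordering matters for mixed-case text: the "C" collation compares raw bytes, so uppercase ASCII letters sort before lowercase ones. The in-memory sketch below imitates that behavior; the helper `metadata_field_min_max` and the dict-based documents are hypothetical and do not reflect how the store computes the result in SQL.

```python
def metadata_field_min_max(documents, field):
    """Return min/max for a metadata field, imitating the documented rules."""
    key = field.removeprefix("meta.")
    values = [doc["meta"][key] for doc in documents if doc["meta"].get(key) is not None]
    if not values:
        return {"min": None, "max": None}
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return {"min": min(values), "max": max(values)}

    def byte_order(text):
        # Byte-wise comparison, as the "C" collation does.
        return text.encode("utf-8")

    as_text = [str(v) for v in values]
    return {"min": min(as_text, key=byte_order), "max": max(as_text, key=byte_order)}


documents = [
    {"meta": {"priority": 3, "name": "apple"}},
    {"meta": {"priority": 1, "name": "Zebra"}},
    {"meta": {"priority": 2, "name": "mango"}},
]

print(metadata_field_min_max(documents, "meta.priority"))
print(metadata_field_min_max(documents, "name"))
```

Note that "Zebra" is the minimum here, which differs from the result a case-insensitive or locale-aware sort would give.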
get_metadata_field_unique_values(
field: str, filters: dict[str, Any] | None = None
) -> list[Any]
Returns a list of unique values for a metadata field.
Parameters:
- field (str) – The metadata field name (with or without the "meta." prefix).
- filters (dict[str, Any] | None) – Optional filters to restrict the documents considered.
Returns:
- list[Any] – A list of unique values for the given field.