diff --git a/docs-website/reference/integrations-api/presidio.md b/docs-website/reference/integrations-api/presidio.md new file mode 100644 index 0000000000..0d19fbfd8f --- /dev/null +++ b/docs-website/reference/integrations-api/presidio.md @@ -0,0 +1,239 @@ +--- +title: "Presidio" +id: integrations-presidio +description: "Presidio integration for Haystack" +slug: "/integrations-presidio" +--- + + +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner + +### PresidioDocumentCleaner + +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. + +Documents without text content are passed through unchanged. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioDocumentCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. 
+ See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Anonymizes PII in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. + +## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor + +### PresidioEntityExtractor + +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. + +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. + +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. + +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. 
+ +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioEntityExtractor. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer engine. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Detects PII entities in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. 
+ +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. + +## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner + +### PresidioTextCleaner + +Anonymizes PII in plain strings using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of strings, detects personally identifiable information (PII), and returns +a new list of strings with PII replaced by entity type placeholders (e.g. `<PERSON>`). +Useful for sanitizing user queries before they are sent to an LLM. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner() +result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"]) +print(result["texts"][0]) +# Hi, I am <PERSON>, call me at <PHONE_NUMBER> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioTextCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). 
+ +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(texts: list[str]) -> dict[str, list[str]] +``` + +Anonymizes PII in the provided strings. + +**Parameters:** + +- **texts** (list\[str\]) – List of strings to anonymize. + +**Returns:** + +- dict\[str, list\[str\]\] – A dictionary with key `texts` containing the cleaned strings. diff --git a/docs-website/reference_versioned_docs/version-2.18/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.18/integrations-api/presidio.md new file mode 100644 index 0000000000..0d19fbfd8f --- /dev/null +++ b/docs-website/reference_versioned_docs/version-2.18/integrations-api/presidio.md @@ -0,0 +1,239 @@ +--- +title: "Presidio" +id: integrations-presidio +description: "Presidio integration for Haystack" +slug: "/integrations-presidio" +--- + + +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner + +### PresidioDocumentCleaner + +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. + +Documents without text content are passed through unchanged. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. 
+ +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioDocumentCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Anonymizes PII in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. 
+ +## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor + +### PresidioEntityExtractor + +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. + +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. + +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. + +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioEntityExtractor. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. 
Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer engine. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Detects PII entities in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. + +## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner + +### PresidioTextCleaner + +Anonymizes PII in plain strings using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of strings, detects personally identifiable information (PII), and returns +a new list of strings with PII replaced by entity type placeholders (e.g. `<PERSON>`). +Useful for sanitizing user queries before they are sent to an LLM. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner() +result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"]) +print(result["texts"][0]) +# Hi, I am <PERSON>, call me at <PHONE_NUMBER> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioTextCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. 
+ See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(texts: list[str]) -> dict[str, list[str]] +``` + +Anonymizes PII in the provided strings. + +**Parameters:** + +- **texts** (list\[str\]) – List of strings to anonymize. + +**Returns:** + +- dict\[str, list\[str\]\] – A dictionary with key `texts` containing the cleaned strings. diff --git a/docs-website/reference_versioned_docs/version-2.19/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.19/integrations-api/presidio.md new file mode 100644 index 0000000000..0d19fbfd8f --- /dev/null +++ b/docs-website/reference_versioned_docs/version-2.19/integrations-api/presidio.md @@ -0,0 +1,239 @@ +--- +title: "Presidio" +id: integrations-presidio +description: "Presidio integration for Haystack" +slug: "/integrations-presidio" +--- + + +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner + +### PresidioDocumentCleaner + +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). 
+ +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. + +Documents without text content are passed through unchanged. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioDocumentCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. 
+ +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Anonymizes PII in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. + +## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor + +### PresidioEntityExtractor + +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. + +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. + +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. + +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioEntityExtractor. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). 
+- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer engine. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Detects PII entities in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. + +## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner + +### PresidioTextCleaner + +Anonymizes PII in plain strings using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of strings, detects personally identifiable information (PII), and returns +a new list of strings with PII replaced by entity type placeholders (e.g. `<PERSON>`). +Useful for sanitizing user queries before they are sent to an LLM. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. 
+ +### Usage example + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner() +result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"]) +print(result["texts"][0]) +# Hi, I am <PERSON>, call me at <PHONE_NUMBER> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioTextCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(texts: list[str]) -> dict[str, list[str]] +``` + +Anonymizes PII in the provided strings. + +**Parameters:** + +- **texts** (list\[str\]) – List of strings to anonymize. + +**Returns:** + +- dict\[str, list\[str\]\] – A dictionary with key `texts` containing the cleaned strings. 
diff --git a/docs-website/reference_versioned_docs/version-2.20/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.20/integrations-api/presidio.md new file mode 100644 index 0000000000..0d19fbfd8f --- /dev/null +++ b/docs-website/reference_versioned_docs/version-2.20/integrations-api/presidio.md @@ -0,0 +1,239 @@ +--- +title: "Presidio" +id: integrations-presidio +description: "Presidio integration for Haystack" +slug: "/integrations-presidio" +--- + + +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner + +### PresidioDocumentCleaner + +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. + +Documents without text content are passed through unchanged. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioDocumentCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. 
`["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Anonymizes PII in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. + +## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor + +### PresidioEntityExtractor + +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. + +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. + +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. + +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. 
+ +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioEntityExtractor. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer engine. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Detects PII entities in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. 
+ +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. + +## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner + +### PresidioTextCleaner + +Anonymizes PII in plain strings using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of strings, detects personally identifiable information (PII), and returns +a new list of strings with PII replaced by entity type placeholders (e.g. `<PERSON>`). +Useful for sanitizing user queries before they are sent to an LLM. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner() +result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"]) +print(result["texts"][0]) +# Hi, I am <PERSON>, call me at <PHONE_NUMBER> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioTextCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). 
+ +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(texts: list[str]) -> dict[str, list[str]] +``` + +Anonymizes PII in the provided strings. + +**Parameters:** + +- **texts** (list\[str\]) – List of strings to anonymize. + +**Returns:** + +- dict\[str, list\[str\]\] – A dictionary with key `texts` containing the cleaned strings. diff --git a/docs-website/reference_versioned_docs/version-2.21/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.21/integrations-api/presidio.md new file mode 100644 index 0000000000..0d19fbfd8f --- /dev/null +++ b/docs-website/reference_versioned_docs/version-2.21/integrations-api/presidio.md @@ -0,0 +1,239 @@ +--- +title: "Presidio" +id: integrations-presidio +description: "Presidio integration for Haystack" +slug: "/integrations-presidio" +--- + + +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner + +### PresidioDocumentCleaner + +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. + +Documents without text content are passed through unchanged. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. 
+ +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioDocumentCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Anonymizes PII in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents.
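The placeholder replacement described for `PresidioDocumentCleaner` can be pictured with a small standard-library sketch. This is not the integration's or Presidio's actual implementation: `redact` is a hypothetical helper, the entity dictionaries are assumed to have the `entity_type`/`start`/`end` shape shown for `PresidioEntityExtractor`, and the `<ENTITY_TYPE>` placeholder format is an assumption. The sketch also assumes the detected spans do not overlap.

```python
def redact(text: str, entities: list[dict]) -> str:
    """Replace each detected span with an <ENTITY_TYPE> placeholder (hypothetical helper)."""
    # Work from the rightmost span to the leftmost so that the offsets of
    # earlier spans stay valid while the string changes length.
    # Assumption: spans do not overlap.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"<{entity['entity_type']}>"
        text = text[: entity["start"]] + placeholder + text[entity["end"] :]
    return text


text = "Contact Alice at alice@example.com"
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]
redacted = redact(text, entities)
print(redacted)
# Contact <PERSON> at <EMAIL_ADDRESS>
```

Replacing from right to left avoids recomputing offsets after each substitution, which is why the spans are sorted in reverse before the loop.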
+ +## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor + +### PresidioEntityExtractor + +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. + +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. + +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. + +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioEntityExtractor. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. 
Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer engine. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Detects PII entities in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. + +## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner + +### PresidioTextCleaner + +Anonymizes PII in plain strings using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of strings, detects personally identifiable information (PII), and returns +a new list of strings with PII replaced by entity type placeholders (e.g. `<PERSON>`). +Useful for sanitizing user queries before they are sent to an LLM. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner() +result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"]) +print(result["texts"][0]) +# Hi, I am <PERSON>, call me at <PHONE_NUMBER> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioTextCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`.
+ See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(texts: list[str]) -> dict[str, list[str]] +``` + +Anonymizes PII in the provided strings. + +**Parameters:** + +- **texts** (list\[str\]) – List of strings to anonymize. + +**Returns:** + +- dict\[str, list\[str\]\] – A dictionary with key `texts` containing the cleaned strings. diff --git a/docs-website/reference_versioned_docs/version-2.22/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.22/integrations-api/presidio.md new file mode 100644 index 0000000000..0d19fbfd8f --- /dev/null +++ b/docs-website/reference_versioned_docs/version-2.22/integrations-api/presidio.md @@ -0,0 +1,239 @@ +--- +title: "Presidio" +id: integrations-presidio +description: "Presidio integration for Haystack" +slug: "/integrations-presidio" +--- + + +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner + +### PresidioDocumentCleaner + +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). 
+ +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. + +Documents without text content are passed through unchanged. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioDocumentCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run.
+ +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Anonymizes PII in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. + +## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor + +### PresidioEntityExtractor + +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. + +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. + +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. + +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioEntityExtractor. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). 
+- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer engine. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Detects PII entities in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. + +## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner + +### PresidioTextCleaner + +Anonymizes PII in plain strings using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of strings, detects personally identifiable information (PII), and returns +a new list of strings with PII replaced by entity type placeholders (e.g. `<PERSON>`). +Useful for sanitizing user queries before they are sent to an LLM. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand.
+ +### Usage example + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner() +result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"]) +print(result["texts"][0]) +# Hi, I am <PERSON>, call me at <PHONE_NUMBER> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioTextCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(texts: list[str]) -> dict[str, list[str]] +``` + +Anonymizes PII in the provided strings. + +**Parameters:** + +- **texts** (list\[str\]) – List of strings to anonymize. + +**Returns:** + +- dict\[str, list\[str\]\] – A dictionary with key `texts` containing the cleaned strings.
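The `entities` and `score_threshold` parameters shared by the three components can be read as a filter over detections. The sketch below illustrates that reading with plain dictionaries; `filter_entities` is a hypothetical helper, not part of the integration or of Presidio, and the inclusive comparison against the threshold is an assumption based on the phrase "minimum confidence score".

```python
from __future__ import annotations


def filter_entities(
    detected: list[dict],
    entities: list[str] | None = None,
    score_threshold: float = 0.35,
) -> list[dict]:
    """Keep detections of the requested types at or above the threshold (hypothetical helper)."""
    return [
        d
        for d in detected
        # entities=None means every entity type is accepted.
        if (entities is None or d["entity_type"] in entities)
        # Assumption: the threshold is inclusive ("minimum confidence score").
        and d["score"] >= score_threshold
    ]


detected = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]
only_people = filter_entities(detected, entities=["PERSON"])
confident = filter_entities(detected, score_threshold=0.9)
print([d["entity_type"] for d in only_people])  # ['PERSON']
print([d["entity_type"] for d in confident])  # ['EMAIL_ADDRESS']
```

Raising `score_threshold` trades recall for precision: fewer false positives are anonymized, at the risk of leaving low-confidence PII in place.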
diff --git a/docs-website/reference_versioned_docs/version-2.23/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.23/integrations-api/presidio.md new file mode 100644 index 0000000000..0d19fbfd8f --- /dev/null +++ b/docs-website/reference_versioned_docs/version-2.23/integrations-api/presidio.md @@ -0,0 +1,239 @@ +--- +title: "Presidio" +id: integrations-presidio +description: "Presidio integration for Haystack" +slug: "/integrations-presidio" +--- + + +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner + +### PresidioDocumentCleaner + +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. + +Documents without text content are passed through unchanged. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioDocumentCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g.
`["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Anonymizes PII in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. + +## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor + +### PresidioEntityExtractor + +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. + +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. + +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. + +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. 
+ +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioEntityExtractor. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer engine. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Detects PII entities in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. 
+ +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. + +## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner + +### PresidioTextCleaner + +Anonymizes PII in plain strings using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of strings, detects personally identifiable information (PII), and returns +a new list of strings with PII replaced by entity type placeholders (e.g. `<PERSON>`). +Useful for sanitizing user queries before they are sent to an LLM. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner() +result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"]) +print(result["texts"][0]) +# Hi, I am <PERSON>, call me at <PHONE_NUMBER> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioTextCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
+ +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(texts: list[str]) -> dict[str, list[str]] +``` + +Anonymizes PII in the provided strings. + +**Parameters:** + +- **texts** (list\[str\]) – List of strings to anonymize. + +**Returns:** + +- dict\[str, list\[str\]\] – A dictionary with key `texts` containing the cleaned strings. diff --git a/docs-website/reference_versioned_docs/version-2.24/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.24/integrations-api/presidio.md new file mode 100644 index 0000000000..0d19fbfd8f --- /dev/null +++ b/docs-website/reference_versioned_docs/version-2.24/integrations-api/presidio.md @@ -0,0 +1,239 @@ +--- +title: "Presidio" +id: integrations-presidio +description: "Presidio integration for Haystack" +slug: "/integrations-presidio" +--- + + +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner + +### PresidioDocumentCleaner + +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. + +Documents without text content are passed through unchanged. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand.
+ +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioDocumentCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Anonymizes PII in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents.
+ +## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor + +### PresidioEntityExtractor + +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. + +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. + +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. + +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioEntityExtractor. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. 
Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer engine. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Detects PII entities in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. + +## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner + +### PresidioTextCleaner + +Anonymizes PII in plain strings using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of strings, detects personally identifiable information (PII), and returns +a new list of strings with PII replaced by entity type placeholders (e.g. `<PERSON>`). +Useful for sanitizing user queries before they are sent to an LLM. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner() +result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"]) +print(result["texts"][0]) +# Hi, I am <PERSON>, call me at <PHONE_NUMBER> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioTextCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`.
+ See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(texts: list[str]) -> dict[str, list[str]] +``` + +Anonymizes PII in the provided strings. + +**Parameters:** + +- **texts** (list\[str\]) – List of strings to anonymize. + +**Returns:** + +- dict\[str, list\[str\]\] – A dictionary with key `texts` containing the cleaned strings. diff --git a/docs-website/reference_versioned_docs/version-2.25/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.25/integrations-api/presidio.md new file mode 100644 index 0000000000..0d19fbfd8f --- /dev/null +++ b/docs-website/reference_versioned_docs/version-2.25/integrations-api/presidio.md @@ -0,0 +1,239 @@ +--- +title: "Presidio" +id: integrations-presidio +description: "Presidio integration for Haystack" +slug: "/integrations-presidio" +--- + + +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner + +### PresidioDocumentCleaner + +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). 
+ +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. + +Documents without text content are passed through unchanged. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioDocumentCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run.
+ +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Anonymizes PII in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. + +## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor + +### PresidioEntityExtractor + +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. + +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. + +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. + +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioEntityExtractor. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). 
+- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer engine. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Detects PII entities in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. + +## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner + +### PresidioTextCleaner + +Anonymizes PII in plain strings using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of strings, detects personally identifiable information (PII), and returns +a new list of strings with PII replaced by entity type placeholders (e.g. `<PERSON>`). +Useful for sanitizing user queries before they are sent to an LLM. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand.
+ +### Usage example + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner() +result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"]) +print(result["texts"][0]) +# Hi, I am <PERSON>, call me at <PHONE_NUMBER> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioTextCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(texts: list[str]) -> dict[str, list[str]] +``` + +Anonymizes PII in the provided strings. + +**Parameters:** + +- **texts** (list\[str\]) – List of strings to anonymize. + +**Returns:** + +- dict\[str, list\[str\]\] – A dictionary with key `texts` containing the cleaned strings.
diff --git a/docs-website/reference_versioned_docs/version-2.26/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.26/integrations-api/presidio.md new file mode 100644 index 0000000000..0d19fbfd8f --- /dev/null +++ b/docs-website/reference_versioned_docs/version-2.26/integrations-api/presidio.md @@ -0,0 +1,239 @@ +--- +title: "Presidio" +id: integrations-presidio +description: "Presidio integration for Haystack" +slug: "/integrations-presidio" +--- + + +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner + +### PresidioDocumentCleaner + +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. + +Documents without text content are passed through unchanged. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioDocumentCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g.
`["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Anonymizes PII in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. + +## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor + +### PresidioEntityExtractor + +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. + +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. + +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. + +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. 
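As a rough illustration of how the stored offsets can be consumed, the sketch below slices the matched text out of a Document's content. The entity dictionaries are written by hand in the shape shown in this page's usage example (`entity_type`, `start`, `end`, `score`); the helper `matched_spans` and its `min_score` parameter are hypothetical and not part of the integration, and the sketch assumes `end` is exclusive, which matches the offsets in that example.

```python
def matched_spans(content: str, entities: list[dict], min_score: float = 0.0) -> list[tuple[str, str]]:
    """Return (entity_type, matched_text) pairs for entities at or above min_score."""
    return [
        (e["entity_type"], content[e["start"]:e["end"]])
        for e in entities
        if e["score"] >= min_score
    ]


content = "Contact Alice at alice@example.com"
# Hand-written sample in the shape of meta["entities"]; the scores are illustrative.
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]

spans = matched_spans(content, entities)
# [("PERSON", "Alice"), ("EMAIL_ADDRESS", "alice@example.com")]
```

Because the offsets index into the Document's own `content`, they stay meaningful only as long as that text is left unmodified.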
+ +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioEntityExtractor. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer engine. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Detects PII entities in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. 
+ +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. + +## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner + +### PresidioTextCleaner + +Anonymizes PII in plain strings using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of strings, detects personally identifiable information (PII), and returns +a new list of strings with PII replaced by entity type placeholders (e.g. `<PERSON>`). +Useful for sanitizing user queries before they are sent to an LLM. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner() +result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"]) +print(result["texts"][0]) +# Hi, I am <PERSON>, call me at <PHONE_NUMBER> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioTextCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
+ +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(texts: list[str]) -> dict[str, list[str]] +``` + +Anonymizes PII in the provided strings. + +**Parameters:** + +- **texts** (list\[str\]) – List of strings to anonymize. + +**Returns:** + +- dict\[str, list\[str\]\] – A dictionary with key `texts` containing the cleaned strings. diff --git a/docs-website/reference_versioned_docs/version-2.27/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.27/integrations-api/presidio.md new file mode 100644 index 0000000000..0d19fbfd8f --- /dev/null +++ b/docs-website/reference_versioned_docs/version-2.27/integrations-api/presidio.md @@ -0,0 +1,239 @@ +--- +title: "Presidio" +id: integrations-presidio +description: "Presidio integration for Haystack" +slug: "/integrations-presidio" +--- + + +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner + +### PresidioDocumentCleaner + +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. + +Documents without text content are passed through unchanged. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand.
+ +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioDocumentCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Anonymizes PII in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents.
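The non-mutating, pass-through contract described for this component can be sketched with a stand-in dataclass. `Doc` and `clean_all` are hypothetical names used only for illustration, the `anonymize` callable stands in for the Presidio engines, and the real component works on Haystack `Document` objects; treating an empty string as "no text content" is an assumption of this sketch.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document."""
    content: Optional[str] = None
    meta: dict = field(default_factory=dict)


def clean_all(documents: list[Doc], anonymize: Callable[[str], str]) -> dict[str, list[Doc]]:
    cleaned = []
    for doc in documents:
        if not doc.content:
            # Missing or empty text: pass the Document through unchanged.
            cleaned.append(doc)
        else:
            # Build a new object so the caller's Document is left untouched.
            # Note: replace() is shallow, so the meta dict is shared in this sketch.
            cleaned.append(replace(doc, content=anonymize(doc.content)))
    return {"documents": cleaned}


original = Doc(content="My name is John")
empty = Doc(content=None)
result = clean_all([original, empty], anonymize=lambda t: t.replace("John", "<PERSON>"))
```

After the call, `original.content` still holds the raw text, while `result["documents"][0]` is a separate object carrying the cleaned text.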
+ +## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor + +### PresidioEntityExtractor + +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. + +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. + +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. + +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioEntityExtractor. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. 
Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer engine. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Detects PII entities in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. + +## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner + +### PresidioTextCleaner + +Anonymizes PII in plain strings using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of strings, detects personally identifiable information (PII), and returns +a new list of strings with PII replaced by entity type placeholders (e.g. `<PERSON>`). +Useful for sanitizing user queries before they are sent to an LLM. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner() +result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"]) +print(result["texts"][0]) +# Hi, I am <PERSON>, call me at <PHONE_NUMBER> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioTextCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`.
+ See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(texts: list[str]) -> dict[str, list[str]] +``` + +Anonymizes PII in the provided strings. + +**Parameters:** + +- **texts** (list\[str\]) – List of strings to anonymize. + +**Returns:** + +- dict\[str, list\[str\]\] – A dictionary with key `texts` containing the cleaned strings. diff --git a/docs-website/reference_versioned_docs/version-2.28/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.28/integrations-api/presidio.md new file mode 100644 index 0000000000..0d19fbfd8f --- /dev/null +++ b/docs-website/reference_versioned_docs/version-2.28/integrations-api/presidio.md @@ -0,0 +1,239 @@ +--- +title: "Presidio" +id: integrations-presidio +description: "Presidio integration for Haystack" +slug: "/integrations-presidio" +--- + + +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner + +### PresidioDocumentCleaner + +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). 
+ +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. + +Documents without text content are passed through unchanged. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioDocumentCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run.
+ +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Anonymizes PII in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. + +## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor + +### PresidioEntityExtractor + +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. + +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. + +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. + +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand. + +### Usage example + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioEntityExtractor. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). 
+- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer engine. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(documents: list[Document]) -> dict[str, list[Document]] +``` + +Detects PII entities in the provided Documents. + +**Parameters:** + +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. + +**Returns:** + +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. + +## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner + +### PresidioTextCleaner + +Anonymizes PII in plain strings using [Microsoft Presidio](https://microsoft.github.io/presidio/). + +Accepts a list of strings, detects personally identifiable information (PII), and returns +a new list of strings with PII replaced by entity type placeholders (e.g. `<PERSON>`). +Useful for sanitizing user queries before they are sent to an LLM. + +The analyzer and anonymizer engines are loaded on the first call to `run()`, +or by calling `warm_up()` explicitly beforehand.
+ +### Usage example + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner() +result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"]) +print(result["texts"][0]) +# Hi, I am <PERSON>, call me at <PHONE_NUMBER> +``` + +#### __init__ + +```python +__init__( + *, + language: str = "en", + entities: list[str] | None = None, + score_threshold: float = 0.35 +) -> None +``` + +Initializes the PresidioTextCleaner. + +**Parameters:** + +- **language** (str) – Language code for PII detection. Defaults to `"en"`. + See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`). + If `None`, all supported entity types are used. + See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. + See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). + +#### warm_up + +```python +warm_up() -> None +``` + +Initializes the Presidio analyzer and anonymizer engines. + +This method loads the underlying NLP models. In a Haystack Pipeline, +this is called automatically before the first run. + +#### run + +```python +run(texts: list[str]) -> dict[str, list[str]] +``` + +Anonymizes PII in the provided strings. + +**Parameters:** + +- **texts** (list\[str\]) – List of strings to anonymize. + +**Returns:** + +- dict\[str, list\[str\]\] – A dictionary with key `texts` containing the cleaned strings.
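To make the placeholder substitution concrete, here is a minimal sketch that replaces detected spans with `<ENTITY_TYPE>` markers. It is an illustration under stated assumptions, not Presidio's anonymizer: the spans and scores are hand-written instead of detected, `replace_with_placeholders` is a hypothetical helper, the threshold is assumed to be inclusive, and overlapping spans are assumed not to occur.

```python
def replace_with_placeholders(text: str, entities: list[dict], score_threshold: float = 0.35) -> str:
    """Replace each span at or above score_threshold with <ENTITY_TYPE>."""
    # Assumption: the threshold is inclusive.
    kept = [e for e in entities if e["score"] >= score_threshold]
    # Work right to left so earlier offsets stay valid after each substitution.
    for e in sorted(kept, key=lambda e: e["start"], reverse=True):
        text = text[: e["start"]] + f"<{e['entity_type']}>" + text[e["end"]:]
    return text


text = "Hi, I am John Smith, call me at 212-555-1234"
# Hand-written spans and scores; a real run would obtain them from the analyzer.
entities = [
    {"entity_type": "PERSON", "start": 9, "end": 19, "score": 0.85},
    {"entity_type": "PHONE_NUMBER", "start": 32, "end": 44, "score": 0.75},
]
cleaned = replace_with_placeholders(text, entities)
```

Raising `score_threshold` above a span's score leaves that span as it was, which is the trade-off the parameter controls: fewer false positives at the cost of more PII slipping through.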