diff --git a/docs-website/reference/integrations-api/presidio.md b/docs-website/reference/integrations-api/presidio.md index 0d19fbfd8f..d17af77bee 100644 --- a/docs-website/reference/integrations-api/presidio.md +++ b/docs-website/reference/integrations-api/presidio.md @@ -6,31 +6,34 @@ slug: "/integrations-presidio" --- -## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner +## haystack_integrations.components.extractors.presidio.presidio_entity_extractor -### PresidioDocumentCleaner +### PresidioEntityExtractor -Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. -Accepts a list of Documents, detects personally identifiable information (PII) in their -text content, and returns new Documents with PII replaced by entity type placeholders -(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. -Documents without text content are passed through unchanged. +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. -The analyzer and anonymizer engines are loaded on the first call to `run()`, +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand.
### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor -cleaner = PresidioDocumentCleaner() -result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) -print(result["documents"][0].content) -# My name is <PERSON> and my email is <EMAIL_ADDRESS> +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] ``` #### __init__ @@ -44,16 +47,16 @@ __init__( ) -> None ``` -Initializes the PresidioDocumentCleaner. +Initializes the PresidioEntityExtractor. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are used. +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -62,7 +65,7 @@ Initializes the PresidioDocumentCleaner.
warm_up() -> None ``` -Initializes the Presidio analyzer and anonymizer engines. +Initializes the Presidio analyzer engine. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -73,44 +76,42 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Anonymizes PII in the provided Documents. +Detects PII entities in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. - -## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. -### PresidioEntityExtractor +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner -Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. +### PresidioDocumentCleaner -See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). -Accepts a list of Documents and returns new Documents with detected PII entities stored -in each Document's metadata under the key `"entities"`. Each entry in the list contains -the entity type, start/end character offsets, and the confidence score. +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. -Original Documents are not mutated.
Documents without text content are passed through unchanged. +Documents without text content are passed through unchanged. -The analyzer engine is loaded on the first call to `run()`, +The analyzer and anonymizer engines are loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. ### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner -extractor = PresidioEntityExtractor() -result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) -print(result["documents"][0].meta["entities"]) -# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, -# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> ``` #### __init__ @@ -124,16 +125,16 @@ __init__( ) -> None ``` -Initializes the PresidioEntityExtractor. +Initializes the PresidioDocumentCleaner. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are detected. +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`.
+- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -142,7 +143,7 @@ Initializes the PresidioEntityExtractor. warm_up() -> None ``` -Initializes the Presidio analyzer engine. +Initializes the Presidio analyzer and anonymizer engines. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -153,16 +154,15 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Detects PII entities in the provided Documents. +Anonymizes PII in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities - stored in metadata under the key `"entities"`. +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. 
## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner diff --git a/docs-website/reference_versioned_docs/version-2.18/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.18/integrations-api/presidio.md index 0d19fbfd8f..d17af77bee 100644 --- a/docs-website/reference_versioned_docs/version-2.18/integrations-api/presidio.md +++ b/docs-website/reference_versioned_docs/version-2.18/integrations-api/presidio.md @@ -6,31 +6,34 @@ slug: "/integrations-presidio" --- -## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner +## haystack_integrations.components.extractors.presidio.presidio_entity_extractor -### PresidioDocumentCleaner +### PresidioEntityExtractor -Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. -Accepts a list of Documents, detects personally identifiable information (PII) in their -text content, and returns new Documents with PII replaced by entity type placeholders -(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. -Documents without text content are passed through unchanged. +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. -The analyzer and anonymizer engines are loaded on the first call to `run()`, +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand.
### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor -cleaner = PresidioDocumentCleaner() -result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) -print(result["documents"][0].content) -# My name is <PERSON> and my email is <EMAIL_ADDRESS> +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] ``` #### __init__ @@ -44,16 +47,16 @@ __init__( ) -> None ``` -Initializes the PresidioDocumentCleaner. +Initializes the PresidioEntityExtractor. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are used. +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -62,7 +65,7 @@ Initializes the PresidioDocumentCleaner.
warm_up() -> None ``` -Initializes the Presidio analyzer and anonymizer engines. +Initializes the Presidio analyzer engine. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -73,44 +76,42 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Anonymizes PII in the provided Documents. +Detects PII entities in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. - -## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. -### PresidioEntityExtractor +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner -Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. +### PresidioDocumentCleaner -See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). -Accepts a list of Documents and returns new Documents with detected PII entities stored -in each Document's metadata under the key `"entities"`. Each entry in the list contains -the entity type, start/end character offsets, and the confidence score. +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. -Original Documents are not mutated.
Documents without text content are passed through unchanged. +Documents without text content are passed through unchanged. -The analyzer engine is loaded on the first call to `run()`, +The analyzer and anonymizer engines are loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. ### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner -extractor = PresidioEntityExtractor() -result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) -print(result["documents"][0].meta["entities"]) -# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, -# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> ``` #### __init__ @@ -124,16 +125,16 @@ __init__( ) -> None ``` -Initializes the PresidioEntityExtractor. +Initializes the PresidioDocumentCleaner. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are detected. +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`.
+- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -142,7 +143,7 @@ Initializes the PresidioEntityExtractor. warm_up() -> None ``` -Initializes the Presidio analyzer engine. +Initializes the Presidio analyzer and anonymizer engines. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -153,16 +154,15 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Detects PII entities in the provided Documents. +Anonymizes PII in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities - stored in metadata under the key `"entities"`. +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. 
## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner diff --git a/docs-website/reference_versioned_docs/version-2.19/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.19/integrations-api/presidio.md index 0d19fbfd8f..d17af77bee 100644 --- a/docs-website/reference_versioned_docs/version-2.19/integrations-api/presidio.md +++ b/docs-website/reference_versioned_docs/version-2.19/integrations-api/presidio.md @@ -6,31 +6,34 @@ slug: "/integrations-presidio" --- -## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner +## haystack_integrations.components.extractors.presidio.presidio_entity_extractor -### PresidioDocumentCleaner +### PresidioEntityExtractor -Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. -Accepts a list of Documents, detects personally identifiable information (PII) in their -text content, and returns new Documents with PII replaced by entity type placeholders -(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. -Documents without text content are passed through unchanged. +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. -The analyzer and anonymizer engines are loaded on the first call to `run()`, +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand.
### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor -cleaner = PresidioDocumentCleaner() -result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) -print(result["documents"][0].content) -# My name is <PERSON> and my email is <EMAIL_ADDRESS> +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] ``` #### __init__ @@ -44,16 +47,16 @@ __init__( ) -> None ``` -Initializes the PresidioDocumentCleaner. +Initializes the PresidioEntityExtractor. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are used. +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -62,7 +65,7 @@ Initializes the PresidioDocumentCleaner.
warm_up() -> None ``` -Initializes the Presidio analyzer and anonymizer engines. +Initializes the Presidio analyzer engine. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -73,44 +76,42 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Anonymizes PII in the provided Documents. +Detects PII entities in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. - -## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. -### PresidioEntityExtractor +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner -Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. +### PresidioDocumentCleaner -See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). -Accepts a list of Documents and returns new Documents with detected PII entities stored -in each Document's metadata under the key `"entities"`. Each entry in the list contains -the entity type, start/end character offsets, and the confidence score. +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. -Original Documents are not mutated.
Documents without text content are passed through unchanged. +Documents without text content are passed through unchanged. -The analyzer engine is loaded on the first call to `run()`, +The analyzer and anonymizer engines are loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. ### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner -extractor = PresidioEntityExtractor() -result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) -print(result["documents"][0].meta["entities"]) -# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, -# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> ``` #### __init__ @@ -124,16 +125,16 @@ __init__( ) -> None ``` -Initializes the PresidioEntityExtractor. +Initializes the PresidioDocumentCleaner. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are detected. +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`.
+- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -142,7 +143,7 @@ Initializes the PresidioEntityExtractor. warm_up() -> None ``` -Initializes the Presidio analyzer engine. +Initializes the Presidio analyzer and anonymizer engines. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -153,16 +154,15 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Detects PII entities in the provided Documents. +Anonymizes PII in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities - stored in metadata under the key `"entities"`. +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. 
## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner diff --git a/docs-website/reference_versioned_docs/version-2.20/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.20/integrations-api/presidio.md index 0d19fbfd8f..d17af77bee 100644 --- a/docs-website/reference_versioned_docs/version-2.20/integrations-api/presidio.md +++ b/docs-website/reference_versioned_docs/version-2.20/integrations-api/presidio.md @@ -6,31 +6,34 @@ slug: "/integrations-presidio" --- -## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner +## haystack_integrations.components.extractors.presidio.presidio_entity_extractor -### PresidioDocumentCleaner +### PresidioEntityExtractor -Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. -Accepts a list of Documents, detects personally identifiable information (PII) in their -text content, and returns new Documents with PII replaced by entity type placeholders -(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. -Documents without text content are passed through unchanged. +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. -The analyzer and anonymizer engines are loaded on the first call to `run()`, +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand.
### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor -cleaner = PresidioDocumentCleaner() -result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) -print(result["documents"][0].content) -# My name is <PERSON> and my email is <EMAIL_ADDRESS> +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] ``` #### __init__ @@ -44,16 +47,16 @@ __init__( ) -> None ``` -Initializes the PresidioDocumentCleaner. +Initializes the PresidioEntityExtractor. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are used. +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -62,7 +65,7 @@ Initializes the PresidioDocumentCleaner.
warm_up() -> None ``` -Initializes the Presidio analyzer and anonymizer engines. +Initializes the Presidio analyzer engine. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -73,44 +76,42 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Anonymizes PII in the provided Documents. +Detects PII entities in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. - -## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. -### PresidioEntityExtractor +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner -Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. +### PresidioDocumentCleaner -See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). -Accepts a list of Documents and returns new Documents with detected PII entities stored -in each Document's metadata under the key `"entities"`. Each entry in the list contains -the entity type, start/end character offsets, and the confidence score. +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. -Original Documents are not mutated.
Documents without text content are passed through unchanged. +Documents without text content are passed through unchanged. -The analyzer engine is loaded on the first call to `run()`, +The analyzer and anonymizer engines are loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. ### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner -extractor = PresidioEntityExtractor() -result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) -print(result["documents"][0].meta["entities"]) -# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, -# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> ``` #### __init__ @@ -124,16 +125,16 @@ __init__( ) -> None ``` -Initializes the PresidioEntityExtractor. +Initializes the PresidioDocumentCleaner. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are detected. +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. 
+- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -142,7 +143,7 @@ Initializes the PresidioEntityExtractor. warm_up() -> None ``` -Initializes the Presidio analyzer engine. +Initializes the Presidio analyzer and anonymizer engines. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -153,16 +154,15 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Detects PII entities in the provided Documents. +Anonymizes PII in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities - stored in metadata under the key `"entities"`. +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. 
## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner diff --git a/docs-website/reference_versioned_docs/version-2.21/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.21/integrations-api/presidio.md index 0d19fbfd8f..d17af77bee 100644 --- a/docs-website/reference_versioned_docs/version-2.21/integrations-api/presidio.md +++ b/docs-website/reference_versioned_docs/version-2.21/integrations-api/presidio.md @@ -6,31 +6,34 @@ slug: "/integrations-presidio" --- -## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner +## haystack_integrations.components.extractors.presidio.presidio_entity_extractor -### PresidioDocumentCleaner +### PresidioEntityExtractor -Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. -Accepts a list of Documents, detects personally identifiable information (PII) in their -text content, and returns new Documents with PII replaced by entity type placeholders -(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. -Documents without text content are passed through unchanged. +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. -The analyzer and anonymizer engines are loaded on the first call to `run()`, +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. 
### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor -cleaner = PresidioDocumentCleaner() -result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) -print(result["documents"][0].content) -# My name is <PERSON> and my email is <EMAIL_ADDRESS> +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] ``` #### __init__ @@ -44,16 +47,16 @@ __init__( ) -> None ``` -Initializes the PresidioDocumentCleaner. +Initializes the PresidioEntityExtractor. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are used. +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -62,7 +65,7 @@ Initializes the PresidioDocumentCleaner. 
warm_up() -> None ``` -Initializes the Presidio analyzer and anonymizer engines. +Initializes the Presidio analyzer engine. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -73,44 +76,42 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Anonymizes PII in the provided Documents. +Detects PII entities in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. - -## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. -### PresidioEntityExtractor +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner -Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. +### PresidioDocumentCleaner -See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). -Accepts a list of Documents and returns new Documents with detected PII entities stored -in each Document's metadata under the key `"entities"`. Each entry in the list contains -the entity type, start/end character offsets, and the confidence score. +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. -Original Documents are not mutated. 
Documents without text content are passed through unchanged. +Documents without text content are passed through unchanged. -The analyzer engine is loaded on the first call to `run()`, +The analyzer and anonymizer engines are loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. ### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner -extractor = PresidioEntityExtractor() -result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) -print(result["documents"][0].meta["entities"]) -# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, -# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> ``` #### __init__ @@ -124,16 +125,16 @@ __init__( ) -> None ``` -Initializes the PresidioEntityExtractor. +Initializes the PresidioDocumentCleaner. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are detected. +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. 
+- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -142,7 +143,7 @@ Initializes the PresidioEntityExtractor. warm_up() -> None ``` -Initializes the Presidio analyzer engine. +Initializes the Presidio analyzer and anonymizer engines. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -153,16 +154,15 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Detects PII entities in the provided Documents. +Anonymizes PII in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities - stored in metadata under the key `"entities"`. +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. 
## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner diff --git a/docs-website/reference_versioned_docs/version-2.22/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.22/integrations-api/presidio.md index 0d19fbfd8f..d17af77bee 100644 --- a/docs-website/reference_versioned_docs/version-2.22/integrations-api/presidio.md +++ b/docs-website/reference_versioned_docs/version-2.22/integrations-api/presidio.md @@ -6,31 +6,34 @@ slug: "/integrations-presidio" --- -## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner +## haystack_integrations.components.extractors.presidio.presidio_entity_extractor -### PresidioDocumentCleaner +### PresidioEntityExtractor -Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. -Accepts a list of Documents, detects personally identifiable information (PII) in their -text content, and returns new Documents with PII replaced by entity type placeholders -(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. -Documents without text content are passed through unchanged. +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. -The analyzer and anonymizer engines are loaded on the first call to `run()`, +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. 
### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor -cleaner = PresidioDocumentCleaner() -result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) -print(result["documents"][0].content) -# My name is <PERSON> and my email is <EMAIL_ADDRESS> +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] ``` #### __init__ @@ -44,16 +47,16 @@ __init__( ) -> None ``` -Initializes the PresidioDocumentCleaner. +Initializes the PresidioEntityExtractor. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are used. +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -62,7 +65,7 @@ Initializes the PresidioDocumentCleaner. 
warm_up() -> None ``` -Initializes the Presidio analyzer and anonymizer engines. +Initializes the Presidio analyzer engine. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -73,44 +76,42 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Anonymizes PII in the provided Documents. +Detects PII entities in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. - -## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. -### PresidioEntityExtractor +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner -Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. +### PresidioDocumentCleaner -See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). -Accepts a list of Documents and returns new Documents with detected PII entities stored -in each Document's metadata under the key `"entities"`. Each entry in the list contains -the entity type, start/end character offsets, and the confidence score. +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. -Original Documents are not mutated. 
Documents without text content are passed through unchanged. +Documents without text content are passed through unchanged. -The analyzer engine is loaded on the first call to `run()`, +The analyzer and anonymizer engines are loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. ### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner -extractor = PresidioEntityExtractor() -result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) -print(result["documents"][0].meta["entities"]) -# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, -# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> ``` #### __init__ @@ -124,16 +125,16 @@ __init__( ) -> None ``` -Initializes the PresidioEntityExtractor. +Initializes the PresidioDocumentCleaner. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are detected. +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. 
+- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -142,7 +143,7 @@ Initializes the PresidioEntityExtractor. warm_up() -> None ``` -Initializes the Presidio analyzer engine. +Initializes the Presidio analyzer and anonymizer engines. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -153,16 +154,15 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Detects PII entities in the provided Documents. +Anonymizes PII in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities - stored in metadata under the key `"entities"`. +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. 
## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner diff --git a/docs-website/reference_versioned_docs/version-2.23/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.23/integrations-api/presidio.md index 0d19fbfd8f..d17af77bee 100644 --- a/docs-website/reference_versioned_docs/version-2.23/integrations-api/presidio.md +++ b/docs-website/reference_versioned_docs/version-2.23/integrations-api/presidio.md @@ -6,31 +6,34 @@ slug: "/integrations-presidio" --- -## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner +## haystack_integrations.components.extractors.presidio.presidio_entity_extractor -### PresidioDocumentCleaner +### PresidioEntityExtractor -Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. -Accepts a list of Documents, detects personally identifiable information (PII) in their -text content, and returns new Documents with PII replaced by entity type placeholders -(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. -Documents without text content are passed through unchanged. +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. -The analyzer and anonymizer engines are loaded on the first call to `run()`, +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. 
### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor -cleaner = PresidioDocumentCleaner() -result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) -print(result["documents"][0].content) -# My name is <PERSON> and my email is <EMAIL_ADDRESS> +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] ``` #### __init__ @@ -44,16 +47,16 @@ __init__( ) -> None ``` -Initializes the PresidioDocumentCleaner. +Initializes the PresidioEntityExtractor. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are used. +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -62,7 +65,7 @@ Initializes the PresidioDocumentCleaner. 
warm_up() -> None ``` -Initializes the Presidio analyzer and anonymizer engines. +Initializes the Presidio analyzer engine. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -73,44 +76,42 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Anonymizes PII in the provided Documents. +Detects PII entities in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. - -## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. -### PresidioEntityExtractor +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner -Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. +### PresidioDocumentCleaner -See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). -Accepts a list of Documents and returns new Documents with detected PII entities stored -in each Document's metadata under the key `"entities"`. Each entry in the list contains -the entity type, start/end character offsets, and the confidence score. +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. -Original Documents are not mutated. 
Documents without text content are passed through unchanged. +Documents without text content are passed through unchanged. -The analyzer engine is loaded on the first call to `run()`, +The analyzer and anonymizer engines are loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. ### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner -extractor = PresidioEntityExtractor() -result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) -print(result["documents"][0].meta["entities"]) -# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, -# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> ``` #### __init__ @@ -124,16 +125,16 @@ __init__( ) -> None ``` -Initializes the PresidioEntityExtractor. +Initializes the PresidioDocumentCleaner. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are detected. +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. 
+- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -142,7 +143,7 @@ Initializes the PresidioEntityExtractor. warm_up() -> None ``` -Initializes the Presidio analyzer engine. +Initializes the Presidio analyzer and anonymizer engines. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -153,16 +154,15 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Detects PII entities in the provided Documents. +Anonymizes PII in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities - stored in metadata under the key `"entities"`. +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. 
## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner diff --git a/docs-website/reference_versioned_docs/version-2.24/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.24/integrations-api/presidio.md index 0d19fbfd8f..d17af77bee 100644 --- a/docs-website/reference_versioned_docs/version-2.24/integrations-api/presidio.md +++ b/docs-website/reference_versioned_docs/version-2.24/integrations-api/presidio.md @@ -6,31 +6,34 @@ slug: "/integrations-presidio" --- -## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner +## haystack_integrations.components.extractors.presidio.presidio_entity_extractor -### PresidioDocumentCleaner +### PresidioEntityExtractor -Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. -Accepts a list of Documents, detects personally identifiable information (PII) in their -text content, and returns new Documents with PII replaced by entity type placeholders -(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. -Documents without text content are passed through unchanged. +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. -The analyzer and anonymizer engines are loaded on the first call to `run()`, +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. 
### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor -cleaner = PresidioDocumentCleaner() -result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) -print(result["documents"][0].content) -# My name is <PERSON> and my email is <EMAIL_ADDRESS> +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] ``` #### __init__ @@ -44,16 +47,16 @@ __init__( ) -> None ``` -Initializes the PresidioDocumentCleaner. +Initializes the PresidioEntityExtractor. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are used. +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -62,7 +65,7 @@ Initializes the PresidioDocumentCleaner. 
warm_up() -> None ``` -Initializes the Presidio analyzer and anonymizer engines. +Initializes the Presidio analyzer engine. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -73,44 +76,42 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Anonymizes PII in the provided Documents. +Detects PII entities in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. - -## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. -### PresidioEntityExtractor +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner -Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. +### PresidioDocumentCleaner -See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). -Accepts a list of Documents and returns new Documents with detected PII entities stored -in each Document's metadata under the key `"entities"`. Each entry in the list contains -the entity type, start/end character offsets, and the confidence score. +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. ``, ``). Original Documents are not mutated. -Original Documents are not mutated. 
Documents without text content are passed through unchanged. +Documents without text content are passed through unchanged. -The analyzer engine is loaded on the first call to `run()`, +The analyzer and anonymizer engines are loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. ### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner -extractor = PresidioEntityExtractor() -result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) -print(result["documents"][0].meta["entities"]) -# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, -# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is and my email is ``` #### __init__ @@ -124,16 +125,16 @@ __init__( ) -> None ``` -Initializes the PresidioEntityExtractor. +Initializes the PresidioDocumentCleaner. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are detected. +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. 
+- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -142,7 +143,7 @@ Initializes the PresidioEntityExtractor. warm_up() -> None ``` -Initializes the Presidio analyzer engine. +Initializes the Presidio analyzer and anonymizer engines. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -153,16 +154,15 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Detects PII entities in the provided Documents. +Anonymizes PII in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities - stored in metadata under the key `"entities"`. +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. 
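The documented entity format (entity type plus start/end character offsets) is enough to reproduce placeholder-style masking by hand. The sketch below is illustrative only and does not call Presidio: `mask_entities` is a hypothetical helper, the angle-bracket placeholder format is an assumption, and the spans are assumed not to overlap.

```python
from typing import Any


def mask_entities(text: str, entities: list[dict[str, Any]]) -> str:
    """Replace each detected span with a placeholder built from its entity type.

    Hypothetical helper, not part of the integration. Assumes spans do not overlap.
    """
    masked = text
    # Work from the rightmost span to the leftmost so that the offsets of
    # spans not yet processed stay valid while the string changes length.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"<{entity['entity_type']}>"  # assumed placeholder format
        masked = masked[: entity["start"]] + placeholder + masked[entity["end"] :]
    return masked


text = "Contact Alice at alice@example.com"
# Entity entries copied from the usage example's documented output.
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]
print(mask_entities(text, entities))
```

Replacing from right to left avoids recomputing offsets after each substitution.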
## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner diff --git a/docs-website/reference_versioned_docs/version-2.25/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.25/integrations-api/presidio.md index 0d19fbfd8f..d17af77bee 100644 --- a/docs-website/reference_versioned_docs/version-2.25/integrations-api/presidio.md +++ b/docs-website/reference_versioned_docs/version-2.25/integrations-api/presidio.md @@ -6,31 +6,34 @@ slug: "/integrations-presidio" --- -## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner +## haystack_integrations.components.extractors.presidio.presidio_entity_extractor -### PresidioDocumentCleaner +### PresidioEntityExtractor -Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. -Accepts a list of Documents, detects personally identifiable information (PII) in their -text content, and returns new Documents with PII replaced by entity type placeholders -(e.g. ``, ``). Original Documents are not mutated. +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. -Documents without text content are passed through unchanged. +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. -The analyzer and anonymizer engines are loaded on the first call to `run()`, +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. 
### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor -cleaner = PresidioDocumentCleaner() -result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) -print(result["documents"][0].content) -# My name is and my email is +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] ``` #### __init__ @@ -44,16 +47,16 @@ __init__( ) -> None ``` -Initializes the PresidioDocumentCleaner. +Initializes the PresidioEntityExtractor. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are used. +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -62,7 +65,7 @@ Initializes the PresidioDocumentCleaner. 
warm_up() -> None ``` -Initializes the Presidio analyzer and anonymizer engines. +Initializes the Presidio analyzer engine. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -73,44 +76,42 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Anonymizes PII in the provided Documents. +Detects PII entities in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. - -## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. -### PresidioEntityExtractor +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner -Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. +### PresidioDocumentCleaner -See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). -Accepts a list of Documents and returns new Documents with detected PII entities stored -in each Document's metadata under the key `"entities"`. Each entry in the list contains -the entity type, start/end character offsets, and the confidence score. +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. ``, ``). Original Documents are not mutated. -Original Documents are not mutated. 
Documents without text content are passed through unchanged. +Documents without text content are passed through unchanged. -The analyzer engine is loaded on the first call to `run()`, +The analyzer and anonymizer engines are loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. ### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner -extractor = PresidioEntityExtractor() -result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) -print(result["documents"][0].meta["entities"]) -# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, -# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is and my email is ``` #### __init__ @@ -124,16 +125,16 @@ __init__( ) -> None ``` -Initializes the PresidioEntityExtractor. +Initializes the PresidioDocumentCleaner. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are detected. +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. 
+- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -142,7 +143,7 @@ Initializes the PresidioEntityExtractor. warm_up() -> None ``` -Initializes the Presidio analyzer engine. +Initializes the Presidio analyzer and anonymizer engines. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -153,16 +154,15 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Detects PII entities in the provided Documents. +Anonymizes PII in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities - stored in metadata under the key `"entities"`. +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. 
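As a rough illustration of what `score_threshold` does, the following sketch filters entity entries shaped like the documented `"entities"` metadata. It does not call Presidio; `filter_by_score` is a hypothetical helper, the 0.2 score is made up for the example, and treating the threshold as inclusive is an assumption.

```python
from typing import Any


def filter_by_score(
    entities: list[dict[str, Any]], score_threshold: float = 0.35
) -> list[dict[str, Any]]:
    """Keep only entities whose confidence score reaches the threshold.

    Hypothetical helper; the inclusive comparison is an assumption.
    """
    return [e for e in entities if e["score"] >= score_threshold]


detected = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "LOCATION", "start": 0, "end": 7, "score": 0.2},  # made-up low score
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]

kept_default = filter_by_score(detected)        # default threshold 0.35
kept_strict = filter_by_score(detected, 0.9)    # stricter threshold
print([e["entity_type"] for e in kept_default])
print([e["entity_type"] for e in kept_strict])
```

A higher threshold trades recall for precision: fewer false positives, at the cost of possibly missing real PII.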
## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner diff --git a/docs-website/reference_versioned_docs/version-2.26/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.26/integrations-api/presidio.md index 0d19fbfd8f..d17af77bee 100644 --- a/docs-website/reference_versioned_docs/version-2.26/integrations-api/presidio.md +++ b/docs-website/reference_versioned_docs/version-2.26/integrations-api/presidio.md @@ -6,31 +6,34 @@ slug: "/integrations-presidio" --- -## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner +## haystack_integrations.components.extractors.presidio.presidio_entity_extractor -### PresidioDocumentCleaner +### PresidioEntityExtractor -Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. -Accepts a list of Documents, detects personally identifiable information (PII) in their -text content, and returns new Documents with PII replaced by entity type placeholders -(e.g. ``, ``). Original Documents are not mutated. +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. -Documents without text content are passed through unchanged. +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. -The analyzer and anonymizer engines are loaded on the first call to `run()`, +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. 
### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor -cleaner = PresidioDocumentCleaner() -result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) -print(result["documents"][0].content) -# My name is and my email is +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] ``` #### __init__ @@ -44,16 +47,16 @@ __init__( ) -> None ``` -Initializes the PresidioDocumentCleaner. +Initializes the PresidioEntityExtractor. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are used. +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -62,7 +65,7 @@ Initializes the PresidioDocumentCleaner. 
warm_up() -> None ``` -Initializes the Presidio analyzer and anonymizer engines. +Initializes the Presidio analyzer engine. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -73,44 +76,42 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Anonymizes PII in the provided Documents. +Detects PII entities in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. - -## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. -### PresidioEntityExtractor +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner -Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. +### PresidioDocumentCleaner -See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). -Accepts a list of Documents and returns new Documents with detected PII entities stored -in each Document's metadata under the key `"entities"`. Each entry in the list contains -the entity type, start/end character offsets, and the confidence score. +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. ``, ``). Original Documents are not mutated. -Original Documents are not mutated. 
Documents without text content are passed through unchanged. +Documents without text content are passed through unchanged. -The analyzer engine is loaded on the first call to `run()`, +The analyzer and anonymizer engines are loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. ### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner -extractor = PresidioEntityExtractor() -result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) -print(result["documents"][0].meta["entities"]) -# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, -# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is and my email is ``` #### __init__ @@ -124,16 +125,16 @@ __init__( ) -> None ``` -Initializes the PresidioEntityExtractor. +Initializes the PresidioDocumentCleaner. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are detected. +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. 
+- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -142,7 +143,7 @@ Initializes the PresidioEntityExtractor. warm_up() -> None ``` -Initializes the Presidio analyzer engine. +Initializes the Presidio analyzer and anonymizer engines. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -153,16 +154,15 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Detects PII entities in the provided Documents. +Anonymizes PII in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities - stored in metadata under the key `"entities"`. +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. 
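The "loaded on the first call to `run()`, or by calling `warm_up()`" behavior can be pictured as a lazy-initialization pattern. The sketch below is not the integration's code: `LazyComponent`, `load_engine`, and the `"results"` key are hypothetical stand-ins, and `str.upper` takes the place of an expensive NLP engine.

```python
from typing import Callable, Optional


class LazyComponent:
    """Hypothetical component that defers loading its engine until needed."""

    def __init__(self, engine_factory: Callable[[], Callable[[str], str]]) -> None:
        self._engine_factory = engine_factory
        self._engine: Optional[Callable[[str], str]] = None

    def warm_up(self) -> None:
        # Load the engine only once; later calls are no-ops.
        if self._engine is None:
            self._engine = self._engine_factory()

    def run(self, texts: list[str]) -> dict[str, list[str]]:
        # Ensure the engine exists even if warm_up() was never called explicitly.
        self.warm_up()
        assert self._engine is not None
        return {"results": [self._engine(t) for t in texts]}


loads: list[int] = []  # records every time the stand-in engine is "loaded"


def load_engine() -> Callable[[str], str]:
    loads.append(1)
    return str.upper  # stand-in for a costly model load


component = LazyComponent(load_engine)
first = component.run(["alice"])   # triggers the single load
component.warm_up()                # no-op: engine already loaded
second = component.run(["bob"])
print(first, second, len(loads))
```

Calling `warm_up()` ahead of time moves the loading cost out of the first `run()` call, which can matter for latency-sensitive services.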
## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner diff --git a/docs-website/reference_versioned_docs/version-2.27/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.27/integrations-api/presidio.md index 0d19fbfd8f..d17af77bee 100644 --- a/docs-website/reference_versioned_docs/version-2.27/integrations-api/presidio.md +++ b/docs-website/reference_versioned_docs/version-2.27/integrations-api/presidio.md @@ -6,31 +6,34 @@ slug: "/integrations-presidio" --- -## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner +## haystack_integrations.components.extractors.presidio.presidio_entity_extractor -### PresidioDocumentCleaner +### PresidioEntityExtractor -Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. -Accepts a list of Documents, detects personally identifiable information (PII) in their -text content, and returns new Documents with PII replaced by entity type placeholders -(e.g. ``, ``). Original Documents are not mutated. +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. -Documents without text content are passed through unchanged. +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. -The analyzer and anonymizer engines are loaded on the first call to `run()`, +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. 
### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor -cleaner = PresidioDocumentCleaner() -result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) -print(result["documents"][0].content) -# My name is and my email is +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] ``` #### __init__ @@ -44,16 +47,16 @@ __init__( ) -> None ``` -Initializes the PresidioDocumentCleaner. +Initializes the PresidioEntityExtractor. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are used. +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -62,7 +65,7 @@ Initializes the PresidioDocumentCleaner. 
warm_up() -> None ``` -Initializes the Presidio analyzer and anonymizer engines. +Initializes the Presidio analyzer engine. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -73,44 +76,42 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Anonymizes PII in the provided Documents. +Detects PII entities in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. - -## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. -### PresidioEntityExtractor +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner -Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. +### PresidioDocumentCleaner -See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). -Accepts a list of Documents and returns new Documents with detected PII entities stored -in each Document's metadata under the key `"entities"`. Each entry in the list contains -the entity type, start/end character offsets, and the confidence score. +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. ``, ``). Original Documents are not mutated. -Original Documents are not mutated. 
Documents without text content are passed through unchanged. +Documents without text content are passed through unchanged. -The analyzer engine is loaded on the first call to `run()`, +The analyzer and anonymizer engines are loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. ### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner -extractor = PresidioEntityExtractor() -result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) -print(result["documents"][0].meta["entities"]) -# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, -# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is and my email is ``` #### __init__ @@ -124,16 +125,16 @@ __init__( ) -> None ``` -Initializes the PresidioEntityExtractor. +Initializes the PresidioDocumentCleaner. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are detected. +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. 
+- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -142,7 +143,7 @@ Initializes the PresidioEntityExtractor. warm_up() -> None ``` -Initializes the Presidio analyzer engine. +Initializes the Presidio analyzer and anonymizer engines. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -153,16 +154,15 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Detects PII entities in the provided Documents. +Anonymizes PII in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities - stored in metadata under the key `"entities"`. +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. 
## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner diff --git a/docs-website/reference_versioned_docs/version-2.28/integrations-api/presidio.md b/docs-website/reference_versioned_docs/version-2.28/integrations-api/presidio.md index 0d19fbfd8f..d17af77bee 100644 --- a/docs-website/reference_versioned_docs/version-2.28/integrations-api/presidio.md +++ b/docs-website/reference_versioned_docs/version-2.28/integrations-api/presidio.md @@ -6,31 +6,34 @@ slug: "/integrations-presidio" --- -## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner +## haystack_integrations.components.extractors.presidio.presidio_entity_extractor -### PresidioDocumentCleaner +### PresidioEntityExtractor -Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). +Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. -Accepts a list of Documents, detects personally identifiable information (PII) in their -text content, and returns new Documents with PII replaced by entity type placeholders -(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. +See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. -Documents without text content are passed through unchanged. +Accepts a list of Documents and returns new Documents with detected PII entities stored +in each Document's metadata under the key `"entities"`. Each entry in the list contains +the entity type, start/end character offsets, and the confidence score. -The analyzer and anonymizer engines are loaded on the first call to `run()`, +Original Documents are not mutated. Documents without text content are passed through unchanged. + +The analyzer engine is loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand.
### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor -cleaner = PresidioDocumentCleaner() -result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) -print(result["documents"][0].content) -# My name is <PERSON> and my email is <EMAIL_ADDRESS> +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] ``` #### __init__ @@ -44,16 +47,16 @@ __init__( ) -> None ``` -Initializes the PresidioDocumentCleaner. +Initializes the PresidioEntityExtractor. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are used. +- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are detected. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. +- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -62,7 +65,7 @@ Initializes the PresidioDocumentCleaner.
warm_up() -> None ``` -Initializes the Presidio analyzer and anonymizer engines. +Initializes the Presidio analyzer engine. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -73,44 +76,42 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Anonymizes PII in the provided Documents. +Detects PII entities in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. +- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. - -## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities + stored in metadata under the key `"entities"`. -### PresidioEntityExtractor +## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner -Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer. +### PresidioDocumentCleaner -See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details. +Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/). -Accepts a list of Documents and returns new Documents with detected PII entities stored -in each Document's metadata under the key `"entities"`. Each entry in the list contains -the entity type, start/end character offsets, and the confidence score. +Accepts a list of Documents, detects personally identifiable information (PII) in their +text content, and returns new Documents with PII replaced by entity type placeholders +(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated. -Original Documents are not mutated.
Documents without text content are passed through unchanged. +Documents without text content are passed through unchanged. -The analyzer engine is loaded on the first call to `run()`, +The analyzer and anonymizer engines are loaded on the first call to `run()`, or by calling `warm_up()` explicitly beforehand. ### Usage example ```python from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner -extractor = PresidioEntityExtractor() -result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")]) -print(result["documents"][0].meta["entities"]) -# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, -# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")]) +print(result["documents"][0].content) +# My name is <PERSON> and my email is <EMAIL_ADDRESS> ``` #### __init__ @@ -124,16 +125,16 @@ __init__( ) -> None ``` -Initializes the PresidioEntityExtractor. +Initializes the PresidioDocumentCleaner. **Parameters:** - **language** (str) – Language code for PII detection. Defaults to `"en"`. See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/). -- **entities** (list\[str\] | None) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). - If `None`, all supported entity types are detected. +- **entities** (list\[str\] | None) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`). + If `None`, all supported entity types are used. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/). -- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`.
+- **score_threshold** (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`. See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/). #### warm_up @@ -142,7 +143,7 @@ Initializes the PresidioEntityExtractor. warm_up() -> None ``` -Initializes the Presidio analyzer engine. +Initializes the Presidio analyzer and anonymizer engines. This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run. @@ -153,16 +154,15 @@ this is called automatically before the first run. run(documents: list[Document]) -> dict[str, list[Document]] ``` -Detects PII entities in the provided Documents. +Anonymizes PII in the provided Documents. **Parameters:** -- **documents** (list\[Document\]) – List of Documents to analyze for PII entities. +- **documents** (list\[Document\]) – List of Documents whose text content will be anonymized. **Returns:** -- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing Documents with detected entities - stored in metadata under the key `"entities"`. +- dict\[str, list\[Document\]\] – A dictionary with key `documents` containing the cleaned Documents. ## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner