diff --git a/docs-website/docs/pipeline-components/extractors.mdx b/docs-website/docs/pipeline-components/extractors.mdx index 1bd396143b..d5a13da690 100644 --- a/docs-website/docs/pipeline-components/extractors.mdx +++ b/docs-website/docs/pipeline-components/extractors.mdx @@ -11,4 +11,5 @@ slug: "/extractors" | [LLMDocumentContentExtractor](extractors/llmdocumentcontentextractor.mdx) | Extracts textual content from image-based documents using a vision-enabled Large Language Model (LLM). | | [LLMMetadataExtractor](extractors/llmmetadataextractor.mdx) | Extracts metadata from documents using a Large Language Model. The metadata is extracted by providing a prompt to a LLM that generates it. | | [NamedEntityExtractor](extractors/namedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents' meta field. | +| [PresidioEntityExtractor](extractors/presidioentityextractor.mdx) | Detects PII in Documents and stores entities as structured metadata, without modifying the text. Powered by Microsoft Presidio. | | [RegexTextExtractor](extractors/regextextextractor.mdx) | Extracts text from chat messages or strings using a regular expression pattern. | diff --git a/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx b/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx new file mode 100644 index 0000000000..dfdb52c7b9 --- /dev/null +++ b/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx @@ -0,0 +1,75 @@ +--- +title: "PresidioEntityExtractor" +id: presidioentityextractor +slug: "/presidioentityextractor" +description: "Use `PresidioEntityExtractor` to detect PII in Documents and store the entities as structured metadata, powered by Microsoft Presidio." 
+--- + +# PresidioEntityExtractor + +`PresidioEntityExtractor` detects personally identifiable information (PII) in Documents and stores the detected entities as structured metadata under the `"entities"` key, without modifying the document text. Each entry contains the entity type, character offsets, and confidence score. + +
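+Because each entry is a plain dictionary, downstream code can audit or filter the findings without importing Presidio at all. A minimal, illustrative sketch — the sample entities are hard-coded here, mirroring the documented shape, not produced by the extractor:
+
+```python
+# Sample entities as PresidioEntityExtractor stores them in meta["entities"]:
+# entity type, character offsets into the text, and a confidence score.
+entities = [
+    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
+    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
+]
+text = "Contact Alice at alice@example.com"
+
+# Recover the matched substrings from the character offsets.
+matches = {e["entity_type"]: text[e["start"]:e["end"]] for e in entities}
+print(matches)
+# {'PERSON': 'Alice', 'EMAIL_ADDRESS': 'alice@example.com'}
+```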
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | +| **Mandatory run variables** | `documents`: A list of Document objects | +| **Output variables** | `documents`: A list of Document objects with PII metadata added | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+ +## Overview + +[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioEntityExtractor` uses Presidio's Analyzer Engine to scan document text and identify entities such as names, email addresses, phone numbers, and more. + +The extractor does **not** modify the document text. Instead, it adds the detected entities as structured metadata, letting you inspect or act on PII findings without altering the original content. This is useful when you want to audit what PII is present before deciding how to handle it. + +## Configuration + +| Parameter | Default | Description | +| --- | --- | --- | +| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | +| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). | +| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | + +## Usage + +Install the `presidio-haystack` package to use the `PresidioEntityExtractor`. + +```bash +pip install presidio-haystack +# Download the spaCy NLP model for English. For other languages, replace with the appropriate spaCy model. 
+python -m spacy download en_core_web_lg +``` + +### On its own + +```python +from haystack import Document +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[ + Document(content="Contact Alice at alice@example.com") +]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +### Using Custom Parameters + +To customize entity detection, pass parameters when initializing the extractor: + +```python +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor( + language="de", + entities=["PERSON", "EMAIL_ADDRESS"], + score_threshold=0.7, +) +``` diff --git a/docs-website/docs/pipeline-components/preprocessors.mdx b/docs-website/docs/pipeline-components/preprocessors.mdx index 4d41e1a4af..f8e7f6669c 100644 --- a/docs-website/docs/pipeline-components/preprocessors.mdx +++ b/docs-website/docs/pipeline-components/preprocessors.mdx @@ -19,5 +19,7 @@ Use the PreProcessors to prepare your data normalize white spaces, remove header | [DocumentSplitter](preprocessors/documentsplitter.mdx) | Splits a list of text documents into a list of text documents with shorter texts. | | [HierarchicalDocumentSplitter](preprocessors/hierarchicaldocumentsplitter.mdx) | Creates a multi-level document structure based on parent-children relationships between text segments. | | [MarkdownHeaderSplitter](preprocessors/markdownheadersplitter.mdx) | Splits documents at ATX-style Markdown headers (#), with optional secondary splitting. Preserves header hierarchy as metadata. | +| [PresidioDocumentCleaner](preprocessors/presidiodocumentcleaner.mdx) | Replaces PII in Document text with entity type placeholders using Microsoft Presidio. 
| +| [PresidioTextCleaner](preprocessors/presidiotextcleaner.mdx) | Replaces PII in plain strings — useful for sanitizing user queries before they reach an LLM. | | [RecursiveSplitter](preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks, it does so by recursively applying a list of separators
to the text, applied in the order they are provided. |
| [TextCleaner](preprocessors/textcleaner.mdx) | Removes regexes, punctuation, and numbers, as well as converts text to lowercase. Useful to clean up text data before evaluation. |
diff --git a/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx b/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx
new file mode 100644
index 0000000000..e85d8a0295
--- /dev/null
+++ b/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx
@@ -0,0 +1,99 @@
+---
+title: "PresidioDocumentCleaner"
+id: presidiodocumentcleaner
+slug: "/presidiodocumentcleaner"
+description: "Use `PresidioDocumentCleaner` to replace PII in Document text with entity type placeholders, powered by Microsoft Presidio."
+---
+
+# PresidioDocumentCleaner
+
+`PresidioDocumentCleaner` replaces personally identifiable information (PII) in the text content of Documents with entity type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. Original Documents are not mutated. Documents without text content pass through unchanged.
+
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | +| **Mandatory run variables** | `documents`: A list of Document objects | +| **Output variables** | `documents`: A list of Document objects with PII replaced | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+
+## Overview
+
+[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioDocumentCleaner` uses Presidio's Analyzer and Anonymizer engines to scan document text and replace detected entities with type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`.
+
+This is useful when you want to store sanitized versions of your documents in a Document Store — for example, to prevent sensitive information from being indexed or returned in search results.
+
+## Configuration
+
+| Parameter | Default | Description |
+| --- | --- | --- |
+| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). |
+| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). |
+| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. |
+
+## Usage
+
+Install the `presidio-haystack` package to use the `PresidioDocumentCleaner`.
+
+```bash
+pip install presidio-haystack
+# Download the English NLP model required by Presidio's analyzer engine
+python -m spacy download en_core_web_lg
+```
+
+### On its own
+
+```python
+from haystack import Document
+from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner
+
+cleaner = PresidioDocumentCleaner()
+result = cleaner.run(documents=[
+    Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.")
+])
+print(result["documents"][0].content)
+# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
+``` + +### In a pipeline + +```python +from haystack import Document, Pipeline +from haystack.components.writers import DocumentWriter +from haystack.document_stores.in_memory import InMemoryDocumentStore +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +document_store = InMemoryDocumentStore() + +indexing_pipeline = Pipeline() +indexing_pipeline.add_component("cleaner", PresidioDocumentCleaner()) +indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store)) +indexing_pipeline.connect("cleaner", "writer") + +indexing_pipeline.run({ + "cleaner": { + "documents": [ + Document(content="Alice Smith's email is alice@example.com"), + Document(content="Call Bob at 212-555-9876"), + ] + } +}) +``` + +### Using Custom Parameters + +To customize PII detection, pass parameters when initializing the cleaner: + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner( + language="de", + entities=["PERSON", "EMAIL_ADDRESS"], + score_threshold=0.7, +) +``` diff --git a/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx b/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx new file mode 100644 index 0000000000..36c07cf5f9 --- /dev/null +++ b/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx @@ -0,0 +1,94 @@ +--- +title: "PresidioTextCleaner" +id: presidiotextcleaner +slug: "/presidiotextcleaner" +description: "Use `PresidioTextCleaner` to replace PII in plain strings, powered by Microsoft Presidio." +--- + +# PresidioTextCleaner + +`PresidioTextCleaner` replaces personally identifiable information (PII) in plain strings. It takes a `list[str]` as input and returns a `list[str]`, making it easy to sanitize user queries before they are sent to an LLM. + +
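+The placeholder scheme is easy to picture: each detected span is replaced by its entity type in angle brackets. The sketch below is not Presidio itself — the detections are hard-coded for illustration — but it shows the replacement step the anonymizer performs:
+
+```python
+text = "My name is John Doe, my SSN is 123-45-6789"
+# Hard-coded (start, end, entity_type) spans, as an analyzer might report them.
+detections = [(11, 19, "PERSON"), (31, 42, "US_SSN")]
+
+# Replace right-to-left so earlier offsets stay valid after each substitution.
+for start, end, entity_type in sorted(detections, reverse=True):
+    text = text[:start] + f"<{entity_type}>" + text[end:]
+print(text)
+# My name is <PERSON>, my SSN is <US_SSN>
+```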
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In a query pipeline, before a Generator or Chat Generator | +| **Mandatory run variables** | `texts`: A list of strings | +| **Output variables** | `texts`: A list of strings with PII replaced | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+
+## Overview
+
+[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioTextCleaner` uses Presidio's Analyzer and Anonymizer engines to scan plain text strings and replace detected entities with type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`.
+
+This is useful when you want to sanitize user queries before sending them to an LLM, ensuring that no personally identifiable information is passed to the model.
+
+## Configuration
+
+| Parameter | Default | Description |
+| --- | --- | --- |
+| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). |
+| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). |
+| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. |
+
+## Usage
+
+Install the `presidio-haystack` package to use the `PresidioTextCleaner`.
+
+```bash
+pip install presidio-haystack
+# Download the English NLP model required by Presidio's analyzer engine
+python -m spacy download en_core_web_lg
+```
+
+### On its own
+
+```python
+from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner
+
+cleaner = PresidioTextCleaner()
+result = cleaner.run(texts=["My name is John Doe, my SSN is 123-45-6789"])
+print(result["texts"][0])
+# My name is <PERSON>, my SSN is <US_SSN>
+```
+
+### In a pipeline
+
+```python
+from haystack import Pipeline
+from haystack.components.builders import ChatPromptBuilder
+from haystack.components.generators.chat import OpenAIChatGenerator
+from haystack.dataclasses import ChatMessage
+from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner
+
+template = [ChatMessage.from_user("Answer this question: {{query}}")]
+
+query_pipeline = Pipeline()
+query_pipeline.add_component("cleaner", PresidioTextCleaner())
+query_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template))
+query_pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini"))
+query_pipeline.connect("cleaner.texts[0]", "prompt_builder.query")
+query_pipeline.connect("prompt_builder", "llm")
+
+query_pipeline.run({
+    "cleaner": {"texts": ["My name is John Smith. 
What is the capital of France?"]} +}) +``` + +### Using Custom Parameters + +To customize PII detection, pass parameters when initializing the cleaner: + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner( + language="de", + entities=["PERSON", "EMAIL_ADDRESS"], + score_threshold=0.7, +) +``` diff --git a/docs-website/sidebars.js b/docs-website/sidebars.js index 0286dbf439..ba04433a9d 100644 --- a/docs-website/sidebars.js +++ b/docs-website/sidebars.js @@ -352,6 +352,7 @@ export default { 'pipeline-components/extractors/llmdocumentcontentextractor', 'pipeline-components/extractors/llmmetadataextractor', 'pipeline-components/extractors/namedentityextractor', + 'pipeline-components/extractors/presidioentityextractor', 'pipeline-components/extractors/regextextextractor', ], }, @@ -469,6 +470,8 @@ export default { 'pipeline-components/preprocessors/hierarchicaldocumentsplitter', 'pipeline-components/preprocessors/recursivesplitter', 'pipeline-components/preprocessors/textcleaner', + 'pipeline-components/preprocessors/presidiodocumentcleaner', + 'pipeline-components/preprocessors/presidiotextcleaner', ], }, { diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/extractors.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/extractors.mdx index 1bd396143b..d5a13da690 100644 --- a/docs-website/versioned_docs/version-2.28/pipeline-components/extractors.mdx +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/extractors.mdx @@ -11,4 +11,5 @@ slug: "/extractors" | [LLMDocumentContentExtractor](extractors/llmdocumentcontentextractor.mdx) | Extracts textual content from image-based documents using a vision-enabled Large Language Model (LLM). | | [LLMMetadataExtractor](extractors/llmmetadataextractor.mdx) | Extracts metadata from documents using a Large Language Model. The metadata is extracted by providing a prompt to a LLM that generates it. 
| | [NamedEntityExtractor](extractors/namedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents' meta field. | +| [PresidioEntityExtractor](extractors/presidioentityextractor.mdx) | Detects PII in Documents and stores entities as structured metadata, without modifying the text. Powered by Microsoft Presidio. | | [RegexTextExtractor](extractors/regextextextractor.mdx) | Extracts text from chat messages or strings using a regular expression pattern. | diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx new file mode 100644 index 0000000000..dfdb52c7b9 --- /dev/null +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx @@ -0,0 +1,75 @@ +--- +title: "PresidioEntityExtractor" +id: presidioentityextractor +slug: "/presidioentityextractor" +description: "Use `PresidioEntityExtractor` to detect PII in Documents and store the entities as structured metadata, powered by Microsoft Presidio." +--- + +# PresidioEntityExtractor + +`PresidioEntityExtractor` detects personally identifiable information (PII) in Documents and stores the detected entities as structured metadata under the `"entities"` key, without modifying the document text. Each entry contains the entity type, character offsets, and confidence score. + +
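+Because each entry is a plain dictionary, downstream code can audit or filter the findings without importing Presidio at all. A minimal, illustrative sketch — the sample entities are hard-coded here, mirroring the documented shape, not produced by the extractor:
+
+```python
+# Sample entities as PresidioEntityExtractor stores them in meta["entities"]:
+# entity type, character offsets into the text, and a confidence score.
+entities = [
+    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
+    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
+]
+text = "Contact Alice at alice@example.com"
+
+# Recover the matched substrings from the character offsets.
+matches = {e["entity_type"]: text[e["start"]:e["end"]] for e in entities}
+print(matches)
+# {'PERSON': 'Alice', 'EMAIL_ADDRESS': 'alice@example.com'}
+```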
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | +| **Mandatory run variables** | `documents`: A list of Document objects | +| **Output variables** | `documents`: A list of Document objects with PII metadata added | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+ +## Overview + +[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioEntityExtractor` uses Presidio's Analyzer Engine to scan document text and identify entities such as names, email addresses, phone numbers, and more. + +The extractor does **not** modify the document text. Instead, it adds the detected entities as structured metadata, letting you inspect or act on PII findings without altering the original content. This is useful when you want to audit what PII is present before deciding how to handle it. + +## Configuration + +| Parameter | Default | Description | +| --- | --- | --- | +| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | +| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). | +| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | + +## Usage + +Install the `presidio-haystack` package to use the `PresidioEntityExtractor`. + +```bash +pip install presidio-haystack +# Download the spaCy NLP model for English. For other languages, replace with the appropriate spaCy model. 
+python -m spacy download en_core_web_lg +``` + +### On its own + +```python +from haystack import Document +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[ + Document(content="Contact Alice at alice@example.com") +]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +### Using Custom Parameters + +To customize entity detection, pass parameters when initializing the extractor: + +```python +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor( + language="de", + entities=["PERSON", "EMAIL_ADDRESS"], + score_threshold=0.7, +) +``` diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors.mdx index 4d41e1a4af..f8e7f6669c 100644 --- a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors.mdx +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors.mdx @@ -19,5 +19,7 @@ Use the PreProcessors to prepare your data normalize white spaces, remove header | [DocumentSplitter](preprocessors/documentsplitter.mdx) | Splits a list of text documents into a list of text documents with shorter texts. | | [HierarchicalDocumentSplitter](preprocessors/hierarchicaldocumentsplitter.mdx) | Creates a multi-level document structure based on parent-children relationships between text segments. | | [MarkdownHeaderSplitter](preprocessors/markdownheadersplitter.mdx) | Splits documents at ATX-style Markdown headers (#), with optional secondary splitting. Preserves header hierarchy as metadata. 
| +| [PresidioDocumentCleaner](preprocessors/presidiodocumentcleaner.mdx) | Replaces PII in Document text with entity type placeholders using Microsoft Presidio. | +| [PresidioTextCleaner](preprocessors/presidiotextcleaner.mdx) | Replaces PII in plain strings — useful for sanitizing user queries before they reach an LLM. | | [RecursiveSplitter](preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks, it does so by recursively applying a list of separators
to the text, applied in the order they are provided. |
| [TextCleaner](preprocessors/textcleaner.mdx) | Removes regexes, punctuation, and numbers, as well as converts text to lowercase. Useful to clean up text data before evaluation. |
diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx
new file mode 100644
index 0000000000..e85d8a0295
--- /dev/null
+++ b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx
@@ -0,0 +1,99 @@
+---
+title: "PresidioDocumentCleaner"
+id: presidiodocumentcleaner
+slug: "/presidiodocumentcleaner"
+description: "Use `PresidioDocumentCleaner` to replace PII in Document text with entity type placeholders, powered by Microsoft Presidio."
+---
+
+# PresidioDocumentCleaner
+
+`PresidioDocumentCleaner` replaces personally identifiable information (PII) in the text content of Documents with entity type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. Original Documents are not mutated. Documents without text content pass through unchanged.
+
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | +| **Mandatory run variables** | `documents`: A list of Document objects | +| **Output variables** | `documents`: A list of Document objects with PII replaced | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+
+## Overview
+
+[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioDocumentCleaner` uses Presidio's Analyzer and Anonymizer engines to scan document text and replace detected entities with type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`.
+
+This is useful when you want to store sanitized versions of your documents in a Document Store — for example, to prevent sensitive information from being indexed or returned in search results.
+
+## Configuration
+
+| Parameter | Default | Description |
+| --- | --- | --- |
+| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). |
+| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). |
+| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. |
+
+## Usage
+
+Install the `presidio-haystack` package to use the `PresidioDocumentCleaner`.
+
+```bash
+pip install presidio-haystack
+# Download the English NLP model required by Presidio's analyzer engine
+python -m spacy download en_core_web_lg
+```
+
+### On its own
+
+```python
+from haystack import Document
+from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner
+
+cleaner = PresidioDocumentCleaner()
+result = cleaner.run(documents=[
+    Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.")
+])
+print(result["documents"][0].content)
+# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
+``` + +### In a pipeline + +```python +from haystack import Document, Pipeline +from haystack.components.writers import DocumentWriter +from haystack.document_stores.in_memory import InMemoryDocumentStore +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +document_store = InMemoryDocumentStore() + +indexing_pipeline = Pipeline() +indexing_pipeline.add_component("cleaner", PresidioDocumentCleaner()) +indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store)) +indexing_pipeline.connect("cleaner", "writer") + +indexing_pipeline.run({ + "cleaner": { + "documents": [ + Document(content="Alice Smith's email is alice@example.com"), + Document(content="Call Bob at 212-555-9876"), + ] + } +}) +``` + +### Using Custom Parameters + +To customize PII detection, pass parameters when initializing the cleaner: + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner( + language="de", + entities=["PERSON", "EMAIL_ADDRESS"], + score_threshold=0.7, +) +``` diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx new file mode 100644 index 0000000000..36c07cf5f9 --- /dev/null +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx @@ -0,0 +1,94 @@ +--- +title: "PresidioTextCleaner" +id: presidiotextcleaner +slug: "/presidiotextcleaner" +description: "Use `PresidioTextCleaner` to replace PII in plain strings, powered by Microsoft Presidio." +--- + +# PresidioTextCleaner + +`PresidioTextCleaner` replaces personally identifiable information (PII) in plain strings. It takes a `list[str]` as input and returns a `list[str]`, making it easy to sanitize user queries before they are sent to an LLM. + +
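+The placeholder scheme is easy to picture: each detected span is replaced by its entity type in angle brackets. The sketch below is not Presidio itself — the detections are hard-coded for illustration — but it shows the replacement step the anonymizer performs:
+
+```python
+text = "My name is John Doe, my SSN is 123-45-6789"
+# Hard-coded (start, end, entity_type) spans, as an analyzer might report them.
+detections = [(11, 19, "PERSON"), (31, 42, "US_SSN")]
+
+# Replace right-to-left so earlier offsets stay valid after each substitution.
+for start, end, entity_type in sorted(detections, reverse=True):
+    text = text[:start] + f"<{entity_type}>" + text[end:]
+print(text)
+# My name is <PERSON>, my SSN is <US_SSN>
+```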
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In a query pipeline, before a Generator or Chat Generator | +| **Mandatory run variables** | `texts`: A list of strings | +| **Output variables** | `texts`: A list of strings with PII replaced | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+
+## Overview
+
+[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioTextCleaner` uses Presidio's Analyzer and Anonymizer engines to scan plain text strings and replace detected entities with type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`.
+
+This is useful when you want to sanitize user queries before sending them to an LLM, ensuring that no personally identifiable information is passed to the model.
+
+## Configuration
+
+| Parameter | Default | Description |
+| --- | --- | --- |
+| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). |
+| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). |
+| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. |
+
+## Usage
+
+Install the `presidio-haystack` package to use the `PresidioTextCleaner`.
+
+```bash
+pip install presidio-haystack
+# Download the English NLP model required by Presidio's analyzer engine
+python -m spacy download en_core_web_lg
+```
+
+### On its own
+
+```python
+from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner
+
+cleaner = PresidioTextCleaner()
+result = cleaner.run(texts=["My name is John Doe, my SSN is 123-45-6789"])
+print(result["texts"][0])
+# My name is <PERSON>, my SSN is <US_SSN>
+```
+
+### In a pipeline
+
+```python
+from haystack import Pipeline
+from haystack.components.builders import ChatPromptBuilder
+from haystack.components.generators.chat import OpenAIChatGenerator
+from haystack.dataclasses import ChatMessage
+from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner
+
+template = [ChatMessage.from_user("Answer this question: {{query}}")]
+
+query_pipeline = Pipeline()
+query_pipeline.add_component("cleaner", PresidioTextCleaner())
+query_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template))
+query_pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini"))
+query_pipeline.connect("cleaner.texts[0]", "prompt_builder.query")
+query_pipeline.connect("prompt_builder", "llm")
+
+query_pipeline.run({
+    "cleaner": {"texts": ["My name is John Smith. 
What is the capital of France?"]} +}) +``` + +### Using Custom Parameters + +To customize PII detection, pass parameters when initializing the cleaner: + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner( + language="de", + entities=["PERSON", "EMAIL_ADDRESS"], + score_threshold=0.7, +) +``` diff --git a/docs-website/versioned_sidebars/version-2.28-sidebars.json b/docs-website/versioned_sidebars/version-2.28-sidebars.json index 6986e4d154..fc57827dc4 100644 --- a/docs-website/versioned_sidebars/version-2.28-sidebars.json +++ b/docs-website/versioned_sidebars/version-2.28-sidebars.json @@ -340,6 +340,7 @@ "pipeline-components/extractors/llmdocumentcontentextractor", "pipeline-components/extractors/llmmetadataextractor", "pipeline-components/extractors/namedentityextractor", + "pipeline-components/extractors/presidioentityextractor", "pipeline-components/extractors/regextextextractor" ] }, @@ -456,7 +457,9 @@ "pipeline-components/preprocessors/embeddingbaseddocumentsplitter", "pipeline-components/preprocessors/hierarchicaldocumentsplitter", "pipeline-components/preprocessors/recursivesplitter", - "pipeline-components/preprocessors/textcleaner" + "pipeline-components/preprocessors/textcleaner", + "pipeline-components/preprocessors/presidiodocumentcleaner", + "pipeline-components/preprocessors/presidiotextcleaner" ] }, {