From 0ba983e073e4dabda06fe9d3b674d251c94d539e Mon Sep 17 00:00:00 2001 From: unknown Date: Tue, 21 Apr 2026 19:53:12 +0500 Subject: [PATCH 1/7] docs: add Presidio preprocessors docs page Adds documentation for PresidioDocumentCleaner, PresidioTextCleaner, and PresidioEntityExtractor under the Preprocessors section. Related: deepset-ai/haystack-core-integrations#3063 --- .../pipeline-components/preprocessors.mdx | 1 + .../preprocessors/presidio.mdx | 185 ++++++++++++++++++ docs-website/sidebars.js | 1 + .../pipeline-components/preprocessors.mdx | 1 + .../preprocessors/presidio.mdx | 185 ++++++++++++++++++ .../version-2.28-sidebars.json | 3 +- 6 files changed, 375 insertions(+), 1 deletion(-) create mode 100644 docs-website/docs/pipeline-components/preprocessors/presidio.mdx create mode 100644 docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidio.mdx diff --git a/docs-website/docs/pipeline-components/preprocessors.mdx b/docs-website/docs/pipeline-components/preprocessors.mdx index 4d41e1a4af..bd3aa3360d 100644 --- a/docs-website/docs/pipeline-components/preprocessors.mdx +++ b/docs-website/docs/pipeline-components/preprocessors.mdx @@ -21,3 +21,4 @@ Use the PreProcessors to prepare your data normalize white spaces, remove header | [MarkdownHeaderSplitter](preprocessors/markdownheadersplitter.mdx) | Splits documents at ATX-style Markdown headers (#), with optional secondary splitting. Preserves header hierarchy as metadata. | | [RecursiveSplitter](preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks, it does so by recursively applying a list of separators
to the text, applied in the order they are provided. | | [TextCleaner](preprocessors/textcleaner.mdx) | Removes regexes, punctuation, and numbers, as well as converts text to lowercase. Useful to clean up text data before evaluation. | +| [Presidio](preprocessors/presidio.mdx) | Detects and anonymizes PII in Documents and text strings using Microsoft Presidio. Includes `PresidioDocumentCleaner`, `PresidioTextCleaner`, and `PresidioEntityExtractor`. | diff --git a/docs-website/docs/pipeline-components/preprocessors/presidio.mdx b/docs-website/docs/pipeline-components/preprocessors/presidio.mdx new file mode 100644 index 0000000000..c7a2414ac8 --- /dev/null +++ b/docs-website/docs/pipeline-components/preprocessors/presidio.mdx @@ -0,0 +1,185 @@ +--- +title: "Presidio" +id: presidio +slug: "/presidio" +description: "Use the Presidio components to detect and anonymize PII in Documents and text strings." +--- + +# Presidio + +Use the Presidio components to detect and anonymize personally identifiable information (PII) in Documents and text strings, powered by [Microsoft Presidio](https://microsoft.github.io/presidio/). + +The integration provides three components: + +| Component | Description | +| --- | --- | +| [PresidioDocumentCleaner](#presidio-document-cleaner) | Replaces PII in Document text with entity type placeholders. | +| [PresidioTextCleaner](#presidio-text-cleaner) | Replaces PII in plain strings — useful for sanitizing user queries before they reach an LLM. | +| [PresidioEntityExtractor](#presidio-entity-extractor) | Detects PII and stores entities as structured metadata on Documents, without modifying their text. | + +All three components run locally — no external API or key required. Presidio uses spaCy NLP models under the hood. 
+ +## Installation + +```bash +pip install presidio-haystack +python -m spacy download en_core_web_lg +``` + +## PresidioDocumentCleaner + +`PresidioDocumentCleaner` replaces PII in the text content of Documents with entity type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. Original Documents are not mutated. Documents without text content pass through unchanged. + +
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | +| **Mandatory run variables** | `documents`: A list of Document objects | +| **Output variables** | `documents`: A list of Document objects with PII replaced | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+ +### On its own + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[ + Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.") +]) +print(result["documents"][0].content) +# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>. +``` + +### In a pipeline + +```python +from haystack import Document, Pipeline +from haystack.components.writers import DocumentWriter +from haystack.document_stores.in_memory import InMemoryDocumentStore +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +document_store = InMemoryDocumentStore() + +indexing_pipeline = Pipeline() +indexing_pipeline.add_component("cleaner", PresidioDocumentCleaner()) +indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store)) +indexing_pipeline.connect("cleaner", "writer") + +indexing_pipeline.run({ + "cleaner": { + "documents": [ + Document(content="Alice Smith's email is alice@example.com"), + Document(content="Call Bob at 212-555-9876"), + ] + } +}) +``` + +## PresidioTextCleaner + +`PresidioTextCleaner` replaces PII in plain strings. It takes a `list[str]` as input and returns a `list[str]`, making it easy to sanitize user queries before they are sent to an LLM. + +
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In a query pipeline, before a Generator or Chat Generator | +| **Mandatory run variables** | `texts`: A list of strings | +| **Output variables** | `texts`: A list of strings with PII replaced | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+ +### On its own + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner() +result = cleaner.run(texts=["My name is John Doe, my SSN is 123-45-6789"]) +print(result["texts"][0]) +# My name is <PERSON>, my SSN is <US_SSN> +``` + +### In a pipeline + +```python +from haystack import Pipeline +from haystack.components.builders import ChatPromptBuilder +from haystack.components.generators.chat import OpenAIChatGenerator +from haystack.dataclasses import ChatMessage +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +template = [ChatMessage.from_user("Answer this question: {{query}}")] + +query_pipeline = Pipeline() +query_pipeline.add_component("cleaner", PresidioTextCleaner()) +query_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template)) +query_pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini")) +query_pipeline.connect("cleaner.texts[0]", "prompt_builder.query") +query_pipeline.connect("prompt_builder", "llm") + +query_pipeline.run({ + "cleaner": {"texts": ["My name is John Smith. What is the capital of France?"]} +}) +``` + +## PresidioEntityExtractor + +`PresidioEntityExtractor` detects PII in Documents and stores the detected entities as structured metadata under the `"entities"` key, without modifying the document text. Each entry contains the entity type, character offsets, and confidence score. + +
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | +| **Mandatory run variables** | `documents`: A list of Document objects | +| **Output variables** | `documents`: A list of Document objects with PII metadata added | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+ +### On its own + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[ + Document(content="Contact Alice at alice@example.com") +]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +## Configuration + +All three components accept the same init parameters: + +| Parameter | Default | Description | +| --- | --- | --- | +| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | +| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | +| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner( + language="de", + entities=["PERSON", "EMAIL_ADDRESS"], + score_threshold=0.7, +) +``` + +See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. 
diff --git a/docs-website/sidebars.js b/docs-website/sidebars.js index 0286dbf439..eb5aaf8cbe 100644 --- a/docs-website/sidebars.js +++ b/docs-website/sidebars.js @@ -469,6 +469,7 @@ export default { 'pipeline-components/preprocessors/hierarchicaldocumentsplitter', 'pipeline-components/preprocessors/recursivesplitter', 'pipeline-components/preprocessors/textcleaner', + 'pipeline-components/preprocessors/presidio', ], }, { diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors.mdx index 4d41e1a4af..bd3aa3360d 100644 --- a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors.mdx +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors.mdx @@ -21,3 +21,4 @@ Use the PreProcessors to prepare your data normalize white spaces, remove header | [MarkdownHeaderSplitter](preprocessors/markdownheadersplitter.mdx) | Splits documents at ATX-style Markdown headers (#), with optional secondary splitting. Preserves header hierarchy as metadata. | | [RecursiveSplitter](preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks, it does so by recursively applying a list of separators
to the text, applied in the order they are provided. | | [TextCleaner](preprocessors/textcleaner.mdx) | Removes regexes, punctuation, and numbers, as well as converts text to lowercase. Useful to clean up text data before evaluation. | +| [Presidio](preprocessors/presidio.mdx) | Detects and anonymizes PII in Documents and text strings using Microsoft Presidio. Includes `PresidioDocumentCleaner`, `PresidioTextCleaner`, and `PresidioEntityExtractor`. | diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidio.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidio.mdx new file mode 100644 index 0000000000..c7a2414ac8 --- /dev/null +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidio.mdx @@ -0,0 +1,185 @@ +--- +title: "Presidio" +id: presidio +slug: "/presidio" +description: "Use the Presidio components to detect and anonymize PII in Documents and text strings." +--- + +# Presidio + +Use the Presidio components to detect and anonymize personally identifiable information (PII) in Documents and text strings, powered by [Microsoft Presidio](https://microsoft.github.io/presidio/). + +The integration provides three components: + +| Component | Description | +| --- | --- | +| [PresidioDocumentCleaner](#presidio-document-cleaner) | Replaces PII in Document text with entity type placeholders. | +| [PresidioTextCleaner](#presidio-text-cleaner) | Replaces PII in plain strings — useful for sanitizing user queries before they reach an LLM. | +| [PresidioEntityExtractor](#presidio-entity-extractor) | Detects PII and stores entities as structured metadata on Documents, without modifying their text. | + +All three components run locally — no external API or key required. Presidio uses spaCy NLP models under the hood. 
+ +## Installation + +```bash +pip install presidio-haystack +python -m spacy download en_core_web_lg +``` + +## PresidioDocumentCleaner + +`PresidioDocumentCleaner` replaces PII in the text content of Documents with entity type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. Original Documents are not mutated. Documents without text content pass through unchanged. + +
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | +| **Mandatory run variables** | `documents`: A list of Document objects | +| **Output variables** | `documents`: A list of Document objects with PII replaced | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+ +### On its own + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[ + Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.") +]) +print(result["documents"][0].content) +# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>. +``` + +### In a pipeline + +```python +from haystack import Document, Pipeline +from haystack.components.writers import DocumentWriter +from haystack.document_stores.in_memory import InMemoryDocumentStore +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +document_store = InMemoryDocumentStore() + +indexing_pipeline = Pipeline() +indexing_pipeline.add_component("cleaner", PresidioDocumentCleaner()) +indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store)) +indexing_pipeline.connect("cleaner", "writer") + +indexing_pipeline.run({ + "cleaner": { + "documents": [ + Document(content="Alice Smith's email is alice@example.com"), + Document(content="Call Bob at 212-555-9876"), + ] + } +}) +``` + +## PresidioTextCleaner + +`PresidioTextCleaner` replaces PII in plain strings. It takes a `list[str]` as input and returns a `list[str]`, making it easy to sanitize user queries before they are sent to an LLM. + +
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In a query pipeline, before a Generator or Chat Generator | +| **Mandatory run variables** | `texts`: A list of strings | +| **Output variables** | `texts`: A list of strings with PII replaced | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+ +### On its own + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner() +result = cleaner.run(texts=["My name is John Doe, my SSN is 123-45-6789"]) +print(result["texts"][0]) +# My name is <PERSON>, my SSN is <US_SSN> +``` + +### In a pipeline + +```python +from haystack import Pipeline +from haystack.components.builders import ChatPromptBuilder +from haystack.components.generators.chat import OpenAIChatGenerator +from haystack.dataclasses import ChatMessage +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +template = [ChatMessage.from_user("Answer this question: {{query}}")] + +query_pipeline = Pipeline() +query_pipeline.add_component("cleaner", PresidioTextCleaner()) +query_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template)) +query_pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini")) +query_pipeline.connect("cleaner.texts[0]", "prompt_builder.query") +query_pipeline.connect("prompt_builder", "llm") + +query_pipeline.run({ + "cleaner": {"texts": ["My name is John Smith. What is the capital of France?"]} +}) +``` + +## PresidioEntityExtractor + +`PresidioEntityExtractor` detects PII in Documents and stores the detected entities as structured metadata under the `"entities"` key, without modifying the document text. Each entry contains the entity type, character offsets, and confidence score. + +
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | +| **Mandatory run variables** | `documents`: A list of Document objects | +| **Output variables** | `documents`: A list of Document objects with PII metadata added | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+ +### On its own + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[ + Document(content="Contact Alice at alice@example.com") +]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +## Configuration + +All three components accept the same init parameters: + +| Parameter | Default | Description | +| --- | --- | --- | +| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | +| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | +| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner( + language="de", + entities=["PERSON", "EMAIL_ADDRESS"], + score_threshold=0.7, +) +``` + +See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. 
diff --git a/docs-website/versioned_sidebars/version-2.28-sidebars.json b/docs-website/versioned_sidebars/version-2.28-sidebars.json index 6986e4d154..807dc90794 100644 --- a/docs-website/versioned_sidebars/version-2.28-sidebars.json +++ b/docs-website/versioned_sidebars/version-2.28-sidebars.json @@ -456,7 +456,8 @@ "pipeline-components/preprocessors/embeddingbaseddocumentsplitter", "pipeline-components/preprocessors/hierarchicaldocumentsplitter", "pipeline-components/preprocessors/recursivesplitter", - "pipeline-components/preprocessors/textcleaner" + "pipeline-components/preprocessors/textcleaner", + "pipeline-components/preprocessors/presidio" ] }, { From 3a6b827fe696a09a0710822e71233c6a199cfe46 Mon Sep 17 00:00:00 2001 From: unknown Date: Wed, 22 Apr 2026 13:37:56 +0500 Subject: [PATCH 2/7] docs(presidio): split into per-component files and move extractor to extractors section --- .../docs/pipeline-components/extractors.mdx | 1 + .../extractors/presidioentityextractor.mdx | 66 +++++++ .../pipeline-components/preprocessors.mdx | 3 +- .../preprocessors/presidio.mdx | 185 ------------------ .../preprocessors/presidiodocumentcleaner.mdx | 90 +++++++++ .../preprocessors/presidiotextcleaner.mdx | 85 ++++++++ docs-website/sidebars.js | 4 +- .../pipeline-components/extractors.mdx | 1 + .../extractors/presidioentityextractor.mdx | 66 +++++++ .../pipeline-components/preprocessors.mdx | 3 +- .../preprocessors/presidio.mdx | 185 ------------------ .../preprocessors/presidiodocumentcleaner.mdx | 90 +++++++++ .../preprocessors/presidiotextcleaner.mdx | 85 ++++++++ .../version-2.28-sidebars.json | 4 +- 14 files changed, 494 insertions(+), 374 deletions(-) create mode 100644 docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx delete mode 100644 docs-website/docs/pipeline-components/preprocessors/presidio.mdx create mode 100644 docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx create mode 100644 
docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx create mode 100644 docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx delete mode 100644 docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidio.mdx create mode 100644 docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx create mode 100644 docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx diff --git a/docs-website/docs/pipeline-components/extractors.mdx b/docs-website/docs/pipeline-components/extractors.mdx index 1bd396143b..d5a13da690 100644 --- a/docs-website/docs/pipeline-components/extractors.mdx +++ b/docs-website/docs/pipeline-components/extractors.mdx @@ -11,4 +11,5 @@ slug: "/extractors" | [LLMDocumentContentExtractor](extractors/llmdocumentcontentextractor.mdx) | Extracts textual content from image-based documents using a vision-enabled Large Language Model (LLM). | | [LLMMetadataExtractor](extractors/llmmetadataextractor.mdx) | Extracts metadata from documents using a Large Language Model. The metadata is extracted by providing a prompt to a LLM that generates it. | | [NamedEntityExtractor](extractors/namedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents' meta field. | +| [PresidioEntityExtractor](extractors/presidioentityextractor.mdx) | Detects PII in Documents and stores entities as structured metadata, without modifying the text. Powered by Microsoft Presidio. | | [RegexTextExtractor](extractors/regextextextractor.mdx) | Extracts text from chat messages or strings using a regular expression pattern. 
| diff --git a/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx b/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx new file mode 100644 index 0000000000..d98defb143 --- /dev/null +++ b/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx @@ -0,0 +1,66 @@ +--- +title: "PresidioEntityExtractor" +id: presidioentityextractor +slug: "/presidioentityextractor" +description: "Use `PresidioEntityExtractor` to detect PII in Documents and store the entities as structured metadata, powered by Microsoft Presidio." +--- + +# PresidioEntityExtractor + +`PresidioEntityExtractor` detects personally identifiable information (PII) in Documents and stores the detected entities as structured metadata under the `"entities"` key, without modifying the document text. Each entry contains the entity type, character offsets, and confidence score. + +
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | +| **Mandatory run variables** | `documents`: A list of Document objects | +| **Output variables** | `documents`: A list of Document objects with PII metadata added | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+ +## Installation + +```bash +pip install presidio-haystack +python -m spacy download en_core_web_lg +``` + +## Usage + +### On its own + +```python +from haystack import Document +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[ + Document(content="Contact Alice at alice@example.com") +]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +## Configuration + +| Parameter | Default | Description | +| --- | --- | --- | +| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | +| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | +| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | + +```python +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor( + language="de", + entities=["PERSON", "EMAIL_ADDRESS"], + score_threshold=0.7, +) +``` + +See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. diff --git a/docs-website/docs/pipeline-components/preprocessors.mdx b/docs-website/docs/pipeline-components/preprocessors.mdx index bd3aa3360d..0c3a70c2e3 100644 --- a/docs-website/docs/pipeline-components/preprocessors.mdx +++ b/docs-website/docs/pipeline-components/preprocessors.mdx @@ -21,4 +21,5 @@ Use the PreProcessors to prepare your data normalize white spaces, remove header | [MarkdownHeaderSplitter](preprocessors/markdownheadersplitter.mdx) | Splits documents at ATX-style Markdown headers (#), with optional secondary splitting. 
Preserves header hierarchy as metadata. | | [RecursiveSplitter](preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks, it does so by recursively applying a list of separators
to the text, applied in the order they are provided. | | [TextCleaner](preprocessors/textcleaner.mdx) | Removes regexes, punctuation, and numbers, as well as converts text to lowercase. Useful to clean up text data before evaluation. | -| [Presidio](preprocessors/presidio.mdx) | Detects and anonymizes PII in Documents and text strings using Microsoft Presidio. Includes `PresidioDocumentCleaner`, `PresidioTextCleaner`, and `PresidioEntityExtractor`. | +| [PresidioDocumentCleaner](preprocessors/presidiodocumentcleaner.mdx) | Replaces PII in Document text with entity type placeholders using Microsoft Presidio. | +| [PresidioTextCleaner](preprocessors/presidiotextcleaner.mdx) | Replaces PII in plain strings — useful for sanitizing user queries before they reach an LLM. | diff --git a/docs-website/docs/pipeline-components/preprocessors/presidio.mdx b/docs-website/docs/pipeline-components/preprocessors/presidio.mdx deleted file mode 100644 index c7a2414ac8..0000000000 --- a/docs-website/docs/pipeline-components/preprocessors/presidio.mdx +++ /dev/null @@ -1,185 +0,0 @@ ---- -title: "Presidio" -id: presidio -slug: "/presidio" -description: "Use the Presidio components to detect and anonymize PII in Documents and text strings." ---- - -# Presidio - -Use the Presidio components to detect and anonymize personally identifiable information (PII) in Documents and text strings, powered by [Microsoft Presidio](https://microsoft.github.io/presidio/). - -The integration provides three components: - -| Component | Description | -| --- | --- | -| [PresidioDocumentCleaner](#presidio-document-cleaner) | Replaces PII in Document text with entity type placeholders. | -| [PresidioTextCleaner](#presidio-text-cleaner) | Replaces PII in plain strings — useful for sanitizing user queries before they reach an LLM. | -| [PresidioEntityExtractor](#presidio-entity-extractor) | Detects PII and stores entities as structured metadata on Documents, without modifying their text. 
| - -All three components run locally — no external API or key required. Presidio uses spaCy NLP models under the hood. - -## Installation - -```bash -pip install presidio-haystack -python -m spacy download en_core_web_lg -``` - -## PresidioDocumentCleaner - -`PresidioDocumentCleaner` replaces PII in the text content of Documents with entity type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. Original Documents are not mutated. Documents without text content pass through unchanged. - -
- -| | | -| --- | --- | -| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | -| **Mandatory run variables** | `documents`: A list of Document objects | -| **Output variables** | `documents`: A list of Document objects with PII replaced | -| **API reference** | [Presidio](/reference/integrations-presidio) | -| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | - -
- -### On its own - -```python -from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner - -cleaner = PresidioDocumentCleaner() -result = cleaner.run(documents=[ - Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.") -]) -print(result["documents"][0].content) -# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>. -``` - -### In a pipeline - -```python -from haystack import Document, Pipeline -from haystack.components.writers import DocumentWriter -from haystack.document_stores.in_memory import InMemoryDocumentStore -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner - -document_store = InMemoryDocumentStore() - -indexing_pipeline = Pipeline() -indexing_pipeline.add_component("cleaner", PresidioDocumentCleaner()) -indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store)) -indexing_pipeline.connect("cleaner", "writer") - -indexing_pipeline.run({ - "cleaner": { - "documents": [ - Document(content="Alice Smith's email is alice@example.com"), - Document(content="Call Bob at 212-555-9876"), - ] - } -}) -``` - -## PresidioTextCleaner - -`PresidioTextCleaner` replaces PII in plain strings. It takes a `list[str]` as input and returns a `list[str]`, making it easy to sanitize user queries before they are sent to an LLM. - -
- -| | | -| --- | --- | -| **Most common position in a pipeline** | In a query pipeline, before a Generator or Chat Generator | -| **Mandatory run variables** | `texts`: A list of strings | -| **Output variables** | `texts`: A list of strings with PII replaced | -| **API reference** | [Presidio](/reference/integrations-presidio) | -| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | - -
- -### On its own - -```python -from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner - -cleaner = PresidioTextCleaner() -result = cleaner.run(texts=["My name is John Doe, my SSN is 123-45-6789"]) -print(result["texts"][0]) -# My name is <PERSON>, my SSN is <US_SSN> -``` - -### In a pipeline - -```python -from haystack import Pipeline -from haystack.components.builders import ChatPromptBuilder -from haystack.components.generators.chat import OpenAIChatGenerator -from haystack.dataclasses import ChatMessage -from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner - -template = [ChatMessage.from_user("Answer this question: {{query}}")] - -query_pipeline = Pipeline() -query_pipeline.add_component("cleaner", PresidioTextCleaner()) -query_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template)) -query_pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini")) -query_pipeline.connect("cleaner.texts[0]", "prompt_builder.query") -query_pipeline.connect("prompt_builder", "llm") - -query_pipeline.run({ - "cleaner": {"texts": ["My name is John Smith. What is the capital of France?"]} -}) -``` - -## PresidioEntityExtractor - -`PresidioEntityExtractor` detects PII in Documents and stores the detected entities as structured metadata under the `"entities"` key, without modifying the document text. Each entry contains the entity type, character offsets, and confidence score. - -
- -| | | -| --- | --- | -| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | -| **Mandatory run variables** | `documents`: A list of Document objects | -| **Output variables** | `documents`: A list of Document objects with PII metadata added | -| **API reference** | [Presidio](/reference/integrations-presidio) | -| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | - -
- -### On its own - -```python -from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor - -extractor = PresidioEntityExtractor() -result = extractor.run(documents=[ - Document(content="Contact Alice at alice@example.com") -]) -print(result["documents"][0].meta["entities"]) -# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, -# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] -``` - -## Configuration - -All three components accept the same init parameters: - -| Parameter | Default | Description | -| --- | --- | --- | -| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | -| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | -| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | - -```python -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner - -cleaner = PresidioDocumentCleaner( - language="de", - entities=["PERSON", "EMAIL_ADDRESS"], - score_threshold=0.7, -) -``` - -See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. diff --git a/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx b/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx new file mode 100644 index 0000000000..94fb1cc5cb --- /dev/null +++ b/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx @@ -0,0 +1,90 @@ +--- +title: "PresidioDocumentCleaner" +id: presidiodocumentcleaner +slug: "/presidiodocumentcleaner" +description: "Use `PresidioDocumentCleaner` to replace PII in Document text with entity type placeholders, powered by Microsoft Presidio." 
+--- + +# PresidioDocumentCleaner + +`PresidioDocumentCleaner` replaces personally identifiable information (PII) in the text content of Documents with entity type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. Original Documents are not mutated. Documents without text content pass through unchanged. + +
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | +| **Mandatory run variables** | `documents`: A list of Document objects | +| **Output variables** | `documents`: A list of Document objects with PII replaced | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+ +## Installation + +```bash +pip install presidio-haystack +python -m spacy download en_core_web_lg +``` + +## Usage + +### On its own + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[ + Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.") +]) +print(result["documents"][0].content) +# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>. +``` + +### In a pipeline + +```python +from haystack import Document, Pipeline +from haystack.components.writers import DocumentWriter +from haystack.document_stores.in_memory import InMemoryDocumentStore +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +document_store = InMemoryDocumentStore() + +indexing_pipeline = Pipeline() +indexing_pipeline.add_component("cleaner", PresidioDocumentCleaner()) +indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store)) +indexing_pipeline.connect("cleaner", "writer") + +indexing_pipeline.run({ + "cleaner": { + "documents": [ + Document(content="Alice Smith's email is alice@example.com"), + Document(content="Call Bob at 212-555-9876"), + ] + } +}) +``` + +## Configuration + +| Parameter | Default | Description | +| --- | --- | --- | +| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | +| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | +| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. 
| + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner( + language="de", + entities=["PERSON", "EMAIL_ADDRESS"], + score_threshold=0.7, +) +``` + +See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. diff --git a/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx b/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx new file mode 100644 index 0000000000..1b8b5a8ebb --- /dev/null +++ b/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx @@ -0,0 +1,85 @@ +--- +title: "PresidioTextCleaner" +id: presidiotextcleaner +slug: "/presidiotextcleaner" +description: "Use `PresidioTextCleaner` to replace PII in plain strings, powered by Microsoft Presidio." +--- + +# PresidioTextCleaner + +`PresidioTextCleaner` replaces personally identifiable information (PII) in plain strings. It takes a `list[str]` as input and returns a `list[str]`, making it easy to sanitize user queries before they are sent to an LLM. + +
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In a query pipeline, before a Generator or Chat Generator | +| **Mandatory run variables** | `texts`: A list of strings | +| **Output variables** | `texts`: A list of strings with PII replaced | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+ +## Installation + +```bash +pip install presidio-haystack +python -m spacy download en_core_web_lg +``` + +## Usage + +### On its own + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner() +result = cleaner.run(texts=["My name is John Doe, my SSN is 123-45-6789"]) +print(result["texts"][0]) +# My name is <PERSON>, my SSN is <US_SSN> +``` + +### In a pipeline + +```python +from haystack import Pipeline +from haystack.components.builders import ChatPromptBuilder +from haystack.components.generators.chat import OpenAIChatGenerator +from haystack.dataclasses import ChatMessage +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +template = [ChatMessage.from_user("Answer this question: {{query}}")] + +query_pipeline = Pipeline() +query_pipeline.add_component("cleaner", PresidioTextCleaner()) +query_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template)) +query_pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini")) +query_pipeline.connect("cleaner.texts[0]", "prompt_builder.query") +query_pipeline.connect("prompt_builder", "llm") + +query_pipeline.run({ + "cleaner": {"texts": ["My name is John Smith. What is the capital of France?"]} +}) +``` + +## Configuration + +| Parameter | Default | Description | +| --- | --- | --- | +| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | +| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | +| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. 
| + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner( + language="de", + entities=["PERSON", "EMAIL_ADDRESS"], + score_threshold=0.7, +) +``` + +See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. diff --git a/docs-website/sidebars.js b/docs-website/sidebars.js index eb5aaf8cbe..ba04433a9d 100644 --- a/docs-website/sidebars.js +++ b/docs-website/sidebars.js @@ -352,6 +352,7 @@ export default { 'pipeline-components/extractors/llmdocumentcontentextractor', 'pipeline-components/extractors/llmmetadataextractor', 'pipeline-components/extractors/namedentityextractor', + 'pipeline-components/extractors/presidioentityextractor', 'pipeline-components/extractors/regextextextractor', ], }, @@ -469,7 +470,8 @@ export default { 'pipeline-components/preprocessors/hierarchicaldocumentsplitter', 'pipeline-components/preprocessors/recursivesplitter', 'pipeline-components/preprocessors/textcleaner', - 'pipeline-components/preprocessors/presidio', + 'pipeline-components/preprocessors/presidiodocumentcleaner', + 'pipeline-components/preprocessors/presidiotextcleaner', ], }, { diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/extractors.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/extractors.mdx index 1bd396143b..d5a13da690 100644 --- a/docs-website/versioned_docs/version-2.28/pipeline-components/extractors.mdx +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/extractors.mdx @@ -11,4 +11,5 @@ slug: "/extractors" | [LLMDocumentContentExtractor](extractors/llmdocumentcontentextractor.mdx) | Extracts textual content from image-based documents using a vision-enabled Large Language Model (LLM). | | [LLMMetadataExtractor](extractors/llmmetadataextractor.mdx) | Extracts metadata from documents using a Large Language Model. 
The metadata is extracted by providing a prompt to a LLM that generates it. | | [NamedEntityExtractor](extractors/namedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents' meta field. | +| [PresidioEntityExtractor](extractors/presidioentityextractor.mdx) | Detects PII in Documents and stores entities as structured metadata, without modifying the text. Powered by Microsoft Presidio. | | [RegexTextExtractor](extractors/regextextextractor.mdx) | Extracts text from chat messages or strings using a regular expression pattern. | diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx new file mode 100644 index 0000000000..d98defb143 --- /dev/null +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx @@ -0,0 +1,66 @@ +--- +title: "PresidioEntityExtractor" +id: presidioentityextractor +slug: "/presidioentityextractor" +description: "Use `PresidioEntityExtractor` to detect PII in Documents and store the entities as structured metadata, powered by Microsoft Presidio." +--- + +# PresidioEntityExtractor + +`PresidioEntityExtractor` detects personally identifiable information (PII) in Documents and stores the detected entities as structured metadata under the `"entities"` key, without modifying the document text. Each entry contains the entity type, character offsets, and confidence score. + +
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | +| **Mandatory run variables** | `documents`: A list of Document objects | +| **Output variables** | `documents`: A list of Document objects with PII metadata added | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+ +## Installation + +```bash +pip install presidio-haystack +python -m spacy download en_core_web_lg +``` + +## Usage + +### On its own + +```python +from haystack import Document +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor() +result = extractor.run(documents=[ + Document(content="Contact Alice at alice@example.com") +]) +print(result["documents"][0].meta["entities"]) +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] +``` + +## Configuration + +| Parameter | Default | Description | +| --- | --- | --- | +| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | +| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | +| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | + +```python +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor + +extractor = PresidioEntityExtractor( + language="de", + entities=["PERSON", "EMAIL_ADDRESS"], + score_threshold=0.7, +) +``` + +See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. 
diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors.mdx index bd3aa3360d..0c3a70c2e3 100644 --- a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors.mdx +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors.mdx @@ -21,4 +21,5 @@ Use the PreProcessors to prepare your data normalize white spaces, remove header | [MarkdownHeaderSplitter](preprocessors/markdownheadersplitter.mdx) | Splits documents at ATX-style Markdown headers (#), with optional secondary splitting. Preserves header hierarchy as metadata. | | [RecursiveSplitter](preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks, it does so by recursively applying a list of separators
to the text, applied in the order they are provided. | | [TextCleaner](preprocessors/textcleaner.mdx) | Removes regexes, punctuation, and numbers, as well as converts text to lowercase. Useful to clean up text data before evaluation. | -| [Presidio](preprocessors/presidio.mdx) | Detects and anonymizes PII in Documents and text strings using Microsoft Presidio. Includes `PresidioDocumentCleaner`, `PresidioTextCleaner`, and `PresidioEntityExtractor`. | +| [PresidioDocumentCleaner](preprocessors/presidiodocumentcleaner.mdx) | Replaces PII in Document text with entity type placeholders using Microsoft Presidio. | +| [PresidioTextCleaner](preprocessors/presidiotextcleaner.mdx) | Replaces PII in plain strings — useful for sanitizing user queries before they reach an LLM. | diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidio.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidio.mdx deleted file mode 100644 index c7a2414ac8..0000000000 --- a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidio.mdx +++ /dev/null @@ -1,185 +0,0 @@ ---- -title: "Presidio" -id: presidio -slug: "/presidio" -description: "Use the Presidio components to detect and anonymize PII in Documents and text strings." ---- - -# Presidio - -Use the Presidio components to detect and anonymize personally identifiable information (PII) in Documents and text strings, powered by [Microsoft Presidio](https://microsoft.github.io/presidio/). - -The integration provides three components: - -| Component | Description | -| --- | --- | -| [PresidioDocumentCleaner](#presidio-document-cleaner) | Replaces PII in Document text with entity type placeholders. | -| [PresidioTextCleaner](#presidio-text-cleaner) | Replaces PII in plain strings — useful for sanitizing user queries before they reach an LLM. 
| -| [PresidioEntityExtractor](#presidio-entity-extractor) | Detects PII and stores entities as structured metadata on Documents, without modifying their text. | - -All three components run locally — no external API or key required. Presidio uses spaCy NLP models under the hood. - -## Installation - -```bash -pip install presidio-haystack -python -m spacy download en_core_web_lg -``` - -## PresidioDocumentCleaner - -`PresidioDocumentCleaner` replaces PII in the text content of Documents with entity type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. Original Documents are not mutated. Documents without text content pass through unchanged. - -
- -| | | -| --- | --- | -| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | -| **Mandatory run variables** | `documents`: A list of Document objects | -| **Output variables** | `documents`: A list of Document objects with PII replaced | -| **API reference** | [Presidio](/reference/integrations-presidio) | -| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | - -
- -### On its own - -```python -from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner - -cleaner = PresidioDocumentCleaner() -result = cleaner.run(documents=[ - Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.") -]) -print(result["documents"][0].content) -# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>. -``` - -### In a pipeline - -```python -from haystack import Document, Pipeline -from haystack.components.writers import DocumentWriter -from haystack.document_stores.in_memory import InMemoryDocumentStore -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner - -document_store = InMemoryDocumentStore() - -indexing_pipeline = Pipeline() -indexing_pipeline.add_component("cleaner", PresidioDocumentCleaner()) -indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store)) -indexing_pipeline.connect("cleaner", "writer") - -indexing_pipeline.run({ - "cleaner": { - "documents": [ - Document(content="Alice Smith's email is alice@example.com"), - Document(content="Call Bob at 212-555-9876"), - ] - } -}) -``` - -## PresidioTextCleaner - -`PresidioTextCleaner` replaces PII in plain strings. It takes a `list[str]` as input and returns a `list[str]`, making it easy to sanitize user queries before they are sent to an LLM. - -
- -| | | -| --- | --- | -| **Most common position in a pipeline** | In a query pipeline, before a Generator or Chat Generator | -| **Mandatory run variables** | `texts`: A list of strings | -| **Output variables** | `texts`: A list of strings with PII replaced | -| **API reference** | [Presidio](/reference/integrations-presidio) | -| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | - -
- -### On its own - -```python -from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner - -cleaner = PresidioTextCleaner() -result = cleaner.run(texts=["My name is John Doe, my SSN is 123-45-6789"]) -print(result["texts"][0]) -# My name is <PERSON>, my SSN is <US_SSN> -``` - -### In a pipeline - -```python -from haystack import Pipeline -from haystack.components.builders import ChatPromptBuilder -from haystack.components.generators.chat import OpenAIChatGenerator -from haystack.dataclasses import ChatMessage -from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner - -template = [ChatMessage.from_user("Answer this question: {{query}}")] - -query_pipeline = Pipeline() -query_pipeline.add_component("cleaner", PresidioTextCleaner()) -query_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template)) -query_pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini")) -query_pipeline.connect("cleaner.texts[0]", "prompt_builder.query") -query_pipeline.connect("prompt_builder", "llm") - -query_pipeline.run({ - "cleaner": {"texts": ["My name is John Smith. What is the capital of France?"]} -}) -``` - -## PresidioEntityExtractor - -`PresidioEntityExtractor` detects PII in Documents and stores the detected entities as structured metadata under the `"entities"` key, without modifying the document text. Each entry contains the entity type, character offsets, and confidence score. - -
- -| | | -| --- | --- | -| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | -| **Mandatory run variables** | `documents`: A list of Document objects | -| **Output variables** | `documents`: A list of Document objects with PII metadata added | -| **API reference** | [Presidio](/reference/integrations-presidio) | -| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | - -
- -### On its own - -```python -from haystack import Document -from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor - -extractor = PresidioEntityExtractor() -result = extractor.run(documents=[ - Document(content="Contact Alice at alice@example.com") -]) -print(result["documents"][0].meta["entities"]) -# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, -# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] -``` - -## Configuration - -All three components accept the same init parameters: - -| Parameter | Default | Description | -| --- | --- | --- | -| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | -| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | -| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | - -```python -from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner - -cleaner = PresidioDocumentCleaner( - language="de", - entities=["PERSON", "EMAIL_ADDRESS"], - score_threshold=0.7, -) -``` - -See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. 
diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx new file mode 100644 index 0000000000..94fb1cc5cb --- /dev/null +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx @@ -0,0 +1,90 @@ +--- +title: "PresidioDocumentCleaner" +id: presidiodocumentcleaner +slug: "/presidiodocumentcleaner" +description: "Use `PresidioDocumentCleaner` to replace PII in Document text with entity type placeholders, powered by Microsoft Presidio." +--- + +# PresidioDocumentCleaner + +`PresidioDocumentCleaner` replaces personally identifiable information (PII) in the text content of Documents with entity type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. Original Documents are not mutated. Documents without text content pass through unchanged. + +
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | +| **Mandatory run variables** | `documents`: A list of Document objects | +| **Output variables** | `documents`: A list of Document objects with PII replaced | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+ +## Installation + +```bash +pip install presidio-haystack +python -m spacy download en_core_web_lg +``` + +## Usage + +### On its own + +```python +from haystack import Document +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner() +result = cleaner.run(documents=[ + Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.") +]) +print(result["documents"][0].content) +# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>. +``` + +### In a pipeline + +```python +from haystack import Document, Pipeline +from haystack.components.writers import DocumentWriter +from haystack.document_stores.in_memory import InMemoryDocumentStore +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +document_store = InMemoryDocumentStore() + +indexing_pipeline = Pipeline() +indexing_pipeline.add_component("cleaner", PresidioDocumentCleaner()) +indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store)) +indexing_pipeline.connect("cleaner", "writer") + +indexing_pipeline.run({ + "cleaner": { + "documents": [ + Document(content="Alice Smith's email is alice@example.com"), + Document(content="Call Bob at 212-555-9876"), + ] + } +}) +``` + +## Configuration + +| Parameter | Default | Description | +| --- | --- | --- | +| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | +| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | +| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. 
| + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner + +cleaner = PresidioDocumentCleaner( + language="de", + entities=["PERSON", "EMAIL_ADDRESS"], + score_threshold=0.7, +) +``` + +See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx new file mode 100644 index 0000000000..1b8b5a8ebb --- /dev/null +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx @@ -0,0 +1,85 @@ +--- +title: "PresidioTextCleaner" +id: presidiotextcleaner +slug: "/presidiotextcleaner" +description: "Use `PresidioTextCleaner` to replace PII in plain strings, powered by Microsoft Presidio." +--- + +# PresidioTextCleaner + +`PresidioTextCleaner` replaces personally identifiable information (PII) in plain strings. It takes a `list[str]` as input and returns a `list[str]`, making it easy to sanitize user queries before they are sent to an LLM. + +
+ +| | | +| --- | --- | +| **Most common position in a pipeline** | In a query pipeline, before a Generator or Chat Generator | +| **Mandatory run variables** | `texts`: A list of strings | +| **Output variables** | `texts`: A list of strings with PII replaced | +| **API reference** | [Presidio](/reference/integrations-presidio) | +| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | + +
+ +## Installation + +```bash +pip install presidio-haystack +python -m spacy download en_core_web_lg +``` + +## Usage + +### On its own + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner() +result = cleaner.run(texts=["My name is John Doe, my SSN is 123-45-6789"]) +print(result["texts"][0]) +# My name is <PERSON>, my SSN is <US_SSN> +``` + +### In a pipeline + +```python +from haystack import Pipeline +from haystack.components.builders import ChatPromptBuilder +from haystack.components.generators.chat import OpenAIChatGenerator +from haystack.dataclasses import ChatMessage +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +template = [ChatMessage.from_user("Answer this question: {{query}}")] + +query_pipeline = Pipeline() +query_pipeline.add_component("cleaner", PresidioTextCleaner()) +query_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template)) +query_pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini")) +query_pipeline.connect("cleaner.texts[0]", "prompt_builder.query") +query_pipeline.connect("prompt_builder", "llm") + +query_pipeline.run({ + "cleaner": {"texts": ["My name is John Smith. What is the capital of France?"]} +}) +``` + +## Configuration + +| Parameter | Default | Description | +| --- | --- | --- | +| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | +| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | +| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. 
| + +```python +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner + +cleaner = PresidioTextCleaner( + language="de", + entities=["PERSON", "EMAIL_ADDRESS"], + score_threshold=0.7, +) +``` + +See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. diff --git a/docs-website/versioned_sidebars/version-2.28-sidebars.json b/docs-website/versioned_sidebars/version-2.28-sidebars.json index 807dc90794..fc57827dc4 100644 --- a/docs-website/versioned_sidebars/version-2.28-sidebars.json +++ b/docs-website/versioned_sidebars/version-2.28-sidebars.json @@ -340,6 +340,7 @@ "pipeline-components/extractors/llmdocumentcontentextractor", "pipeline-components/extractors/llmmetadataextractor", "pipeline-components/extractors/namedentityextractor", + "pipeline-components/extractors/presidioentityextractor", "pipeline-components/extractors/regextextextractor" ] }, @@ -457,7 +458,8 @@ "pipeline-components/preprocessors/hierarchicaldocumentsplitter", "pipeline-components/preprocessors/recursivesplitter", "pipeline-components/preprocessors/textcleaner", - "pipeline-components/preprocessors/presidio" + "pipeline-components/preprocessors/presidiodocumentcleaner", + "pipeline-components/preprocessors/presidiotextcleaner" ] }, { From d80028e2b73e45b9166d63dc9e2c5a29a0f1b5c1 Mon Sep 17 00:00:00 2001 From: unknown Date: Wed, 22 Apr 2026 13:53:26 +0500 Subject: [PATCH 3/7] docs(presidio): sort preprocessors table alphabetically --- docs-website/docs/pipeline-components/preprocessors.mdx | 4 ++-- .../version-2.28/pipeline-components/preprocessors.mdx | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs-website/docs/pipeline-components/preprocessors.mdx b/docs-website/docs/pipeline-components/preprocessors.mdx index 0c3a70c2e3..f8e7f6669c 100644 --- a/docs-website/docs/pipeline-components/preprocessors.mdx +++ 
b/docs-website/docs/pipeline-components/preprocessors.mdx @@ -19,7 +19,7 @@ Use the PreProcessors to prepare your data normalize white spaces, remove header | [DocumentSplitter](preprocessors/documentsplitter.mdx) | Splits a list of text documents into a list of text documents with shorter texts. | | [HierarchicalDocumentSplitter](preprocessors/hierarchicaldocumentsplitter.mdx) | Creates a multi-level document structure based on parent-children relationships between text segments. | | [MarkdownHeaderSplitter](preprocessors/markdownheadersplitter.mdx) | Splits documents at ATX-style Markdown headers (#), with optional secondary splitting. Preserves header hierarchy as metadata. | -| [RecursiveSplitter](preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks, it does so by recursively applying a list of separators
to the text, applied in the order they are provided. | -| [TextCleaner](preprocessors/textcleaner.mdx) | Removes regexes, punctuation, and numbers, as well as converts text to lowercase. Useful to clean up text data before evaluation. | | [PresidioDocumentCleaner](preprocessors/presidiodocumentcleaner.mdx) | Replaces PII in Document text with entity type placeholders using Microsoft Presidio. | | [PresidioTextCleaner](preprocessors/presidiotextcleaner.mdx) | Replaces PII in plain strings — useful for sanitizing user queries before they reach an LLM. | +| [RecursiveSplitter](preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks, it does so by recursively applying a list of separators
to the text, applied in the order they are provided. | +| [TextCleaner](preprocessors/textcleaner.mdx) | Removes regexes, punctuation, and numbers, as well as converts text to lowercase. Useful to clean up text data before evaluation. | diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors.mdx index 0c3a70c2e3..f8e7f6669c 100644 --- a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors.mdx +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors.mdx @@ -19,7 +19,7 @@ Use the PreProcessors to prepare your data normalize white spaces, remove header | [DocumentSplitter](preprocessors/documentsplitter.mdx) | Splits a list of text documents into a list of text documents with shorter texts. | | [HierarchicalDocumentSplitter](preprocessors/hierarchicaldocumentsplitter.mdx) | Creates a multi-level document structure based on parent-children relationships between text segments. | | [MarkdownHeaderSplitter](preprocessors/markdownheadersplitter.mdx) | Splits documents at ATX-style Markdown headers (#), with optional secondary splitting. Preserves header hierarchy as metadata. | -| [RecursiveSplitter](preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks, it does so by recursively applying a list of separators
to the text, applied in the order they are provided. | -| [TextCleaner](preprocessors/textcleaner.mdx) | Removes regexes, punctuation, and numbers, as well as converts text to lowercase. Useful to clean up text data before evaluation. | | [PresidioDocumentCleaner](preprocessors/presidiodocumentcleaner.mdx) | Replaces PII in Document text with entity type placeholders using Microsoft Presidio. | | [PresidioTextCleaner](preprocessors/presidiotextcleaner.mdx) | Replaces PII in plain strings — useful for sanitizing user queries before they reach an LLM. | +| [RecursiveSplitter](preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks, it does so by recursively applying a list of separators
to the text, applied in the order they are provided. | +| [TextCleaner](preprocessors/textcleaner.mdx) | Removes regexes, punctuation, and numbers, as well as converts text to lowercase. Useful to clean up text data before evaluation. | From d683e273bcca99a7dedb0553f83db078e5ccb732 Mon Sep 17 00:00:00 2001 From: unknown Date: Wed, 22 Apr 2026 13:56:37 +0500 Subject: [PATCH 4/7] docs(presidio): explain spaCy model download in installation steps --- .../pipeline-components/extractors/presidioentityextractor.mdx | 1 + .../preprocessors/presidiodocumentcleaner.mdx | 1 + .../pipeline-components/preprocessors/presidiotextcleaner.mdx | 1 + .../pipeline-components/extractors/presidioentityextractor.mdx | 1 + .../preprocessors/presidiodocumentcleaner.mdx | 1 + .../pipeline-components/preprocessors/presidiotextcleaner.mdx | 1 + 6 files changed, 6 insertions(+) diff --git a/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx b/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx index d98defb143..55b7e05db0 100644 --- a/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx +++ b/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx @@ -25,6 +25,7 @@ description: "Use `PresidioEntityExtractor` to detect PII in Documents and store ```bash pip install presidio-haystack +# Download the English NLP model required by Presidio's analyzer engine python -m spacy download en_core_web_lg ``` diff --git a/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx b/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx index 94fb1cc5cb..6f9d07c154 100644 --- a/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx +++ b/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx @@ -25,6 +25,7 @@ description: "Use `PresidioDocumentCleaner` to replace PII in Document text with ```bash pip install 
presidio-haystack +# Download the English NLP model required by Presidio's analyzer engine python -m spacy download en_core_web_lg ``` diff --git a/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx b/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx index 1b8b5a8ebb..d77ca1b09d 100644 --- a/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx +++ b/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx @@ -25,6 +25,7 @@ description: "Use `PresidioTextCleaner` to replace PII in plain strings, powered ```bash pip install presidio-haystack +# Download the English NLP model required by Presidio's analyzer engine python -m spacy download en_core_web_lg ``` diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx index d98defb143..55b7e05db0 100644 --- a/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx @@ -25,6 +25,7 @@ description: "Use `PresidioEntityExtractor` to detect PII in Documents and store ```bash pip install presidio-haystack +# Download the English NLP model required by Presidio's analyzer engine python -m spacy download en_core_web_lg ``` diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx index 94fb1cc5cb..6f9d07c154 100644 --- a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx @@ -25,6 +25,7 @@ description: "Use 
`PresidioDocumentCleaner` to replace PII in Document text with ```bash pip install presidio-haystack +# Download the English NLP model required by Presidio's analyzer engine python -m spacy download en_core_web_lg ``` diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx index 1b8b5a8ebb..d77ca1b09d 100644 --- a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx @@ -25,6 +25,7 @@ description: "Use `PresidioTextCleaner` to replace PII in plain strings, powered ```bash pip install presidio-haystack +# Download the English NLP model required by Presidio's analyzer engine python -m spacy download en_core_web_lg ``` From c8c3be35d3283fa0a4612cb8e7ae7f55aa160754 Mon Sep 17 00:00:00 2001 From: unknown Date: Wed, 22 Apr 2026 14:14:55 +0500 Subject: [PATCH 5/7] docs(presidio): restructure PresidioEntityExtractor page per review feedback --- .../extractors/presidioentityextractor.mdx | 28 +++++++++++++------ .../extractors/presidioentityextractor.mdx | 28 +++++++++++++------ 2 files changed, 38 insertions(+), 18 deletions(-) diff --git a/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx b/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx index 55b7e05db0..f43c82cbb9 100644 --- a/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx +++ b/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx @@ -21,16 +21,32 @@ description: "Use `PresidioEntityExtractor` to detect PII in Documents and store +## Overview + +[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. 
`PresidioEntityExtractor` uses Presidio's Analyzer Engine to scan document text and identify entities such as names, email addresses, phone numbers, and more. + +Unlike the cleaner components, the extractor does **not** modify the document text. Instead, it adds the detected entities as structured metadata, letting you inspect or act on PII findings without altering the original content. This is useful when you want to audit what PII is present before deciding how to handle it. + ## Installation ```bash pip install presidio-haystack -# Download the English NLP model required by Presidio's analyzer engine +# Download the spaCy NLP model for English. For other languages, replace with the appropriate spaCy model. python -m spacy download en_core_web_lg ``` +## Configuration + +| Parameter | Default | Description | +| --- | --- | --- | +| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | +| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). | +| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | + ## Usage +Install the `presidio-haystack` package to use the `PresidioEntityExtractor`. + ### On its own ```python @@ -46,13 +62,9 @@ print(result["documents"][0].meta["entities"]) # {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] ``` -## Configuration +### Using Custom Parameters -| Parameter | Default | Description | -| --- | --- | --- | -| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | -| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. 
| -| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | +To customize entity detection, pass parameters when initializing the extractor: ```python from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor @@ -63,5 +75,3 @@ extractor = PresidioEntityExtractor( score_threshold=0.7, ) ``` - -See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx index 55b7e05db0..f43c82cbb9 100644 --- a/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx @@ -21,16 +21,32 @@ description: "Use `PresidioEntityExtractor` to detect PII in Documents and store +## Overview + +[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioEntityExtractor` uses Presidio's Analyzer Engine to scan document text and identify entities such as names, email addresses, phone numbers, and more. + +Unlike the cleaner components, the extractor does **not** modify the document text. Instead, it adds the detected entities as structured metadata, letting you inspect or act on PII findings without altering the original content. This is useful when you want to audit what PII is present before deciding how to handle it. + ## Installation ```bash pip install presidio-haystack -# Download the English NLP model required by Presidio's analyzer engine +# Download the spaCy NLP model for English. For other languages, replace with the appropriate spaCy model. 
python -m spacy download en_core_web_lg ``` +## Configuration + +| Parameter | Default | Description | +| --- | --- | --- | +| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | +| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). | +| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | + ## Usage +Install the `presidio-haystack` package to use the `PresidioEntityExtractor`. + ### On its own ```python @@ -46,13 +62,9 @@ print(result["documents"][0].meta["entities"]) # {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] ``` -## Configuration +### Using Custom Parameters -| Parameter | Default | Description | -| --- | --- | --- | -| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | -| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | -| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | +To customize entity detection, pass parameters when initializing the extractor: ```python from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor @@ -63,5 +75,3 @@ extractor = PresidioEntityExtractor( score_threshold=0.7, ) ``` - -See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. 
From 2f7662f53d723a4cf4008138589ac16eb937d01d Mon Sep 17 00:00:00 2001 From: unknown Date: Thu, 23 Apr 2026 12:02:39 +0500 Subject: [PATCH 6/7] docs(presidio): move installation into Usage section, drop standalone Installation heading Per sjrl review: removes the separate ## Installation section from all three Presidio component pages and moves the pip install + spaCy download block into the Usage section, right after the intro sentence. Also removes the "Unlike the cleaner components" phrasing from PresidioEntityExtractor's Overview since it's not clear in context on a standalone page. Applied to both current docs and versioned docs (version-2.28). --- .../extractors/presidioentityextractor.mdx | 16 +++++++--------- .../preprocessors/presidiodocumentcleaner.mdx | 6 +++--- .../preprocessors/presidiotextcleaner.mdx | 6 +++--- .../extractors/presidioentityextractor.mdx | 16 +++++++--------- .../preprocessors/presidiodocumentcleaner.mdx | 6 +++--- .../preprocessors/presidiotextcleaner.mdx | 6 +++--- 6 files changed, 26 insertions(+), 30 deletions(-) diff --git a/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx b/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx index f43c82cbb9..dfdb52c7b9 100644 --- a/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx +++ b/docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx @@ -25,15 +25,7 @@ description: "Use `PresidioEntityExtractor` to detect PII in Documents and store [Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioEntityExtractor` uses Presidio's Analyzer Engine to scan document text and identify entities such as names, email addresses, phone numbers, and more. -Unlike the cleaner components, the extractor does **not** modify the document text. 
Instead, it adds the detected entities as structured metadata, letting you inspect or act on PII findings without altering the original content. This is useful when you want to audit what PII is present before deciding how to handle it. - -## Installation - -```bash -pip install presidio-haystack -# Download the spaCy NLP model for English. For other languages, replace with the appropriate spaCy model. -python -m spacy download en_core_web_lg -``` +The extractor does **not** modify the document text. Instead, it adds the detected entities as structured metadata, letting you inspect or act on PII findings without altering the original content. This is useful when you want to audit what PII is present before deciding how to handle it. ## Configuration @@ -47,6 +39,12 @@ python -m spacy download en_core_web_lg Install the `presidio-haystack` package to use the `PresidioEntityExtractor`. +```bash +pip install presidio-haystack +# Download the spaCy NLP model for English. For other languages, replace with the appropriate spaCy model. +python -m spacy download en_core_web_lg +``` + ### On its own ```python diff --git a/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx b/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx index 6f9d07c154..d2a12a2782 100644 --- a/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx +++ b/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx @@ -21,7 +21,9 @@ description: "Use `PresidioDocumentCleaner` to replace PII in Document text with -## Installation +## Usage + +Install the `presidio-haystack` package to use the `PresidioDocumentCleaner`. 
```bash pip install presidio-haystack @@ -29,8 +31,6 @@ pip install presidio-haystack python -m spacy download en_core_web_lg ``` -## Usage - ### On its own ```python diff --git a/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx b/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx index d77ca1b09d..665b956931 100644 --- a/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx +++ b/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx @@ -21,7 +21,9 @@ description: "Use `PresidioTextCleaner` to replace PII in plain strings, powered -## Installation +## Usage + +Install the `presidio-haystack` package to use the `PresidioTextCleaner`. ```bash pip install presidio-haystack @@ -29,8 +31,6 @@ pip install presidio-haystack python -m spacy download en_core_web_lg ``` -## Usage - ### On its own ```python diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx index f43c82cbb9..dfdb52c7b9 100644 --- a/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx @@ -25,15 +25,7 @@ description: "Use `PresidioEntityExtractor` to detect PII in Documents and store [Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioEntityExtractor` uses Presidio's Analyzer Engine to scan document text and identify entities such as names, email addresses, phone numbers, and more. -Unlike the cleaner components, the extractor does **not** modify the document text. Instead, it adds the detected entities as structured metadata, letting you inspect or act on PII findings without altering the original content. 
This is useful when you want to audit what PII is present before deciding how to handle it. - -## Installation - -```bash -pip install presidio-haystack -# Download the spaCy NLP model for English. For other languages, replace with the appropriate spaCy model. -python -m spacy download en_core_web_lg -``` +The extractor does **not** modify the document text. Instead, it adds the detected entities as structured metadata, letting you inspect or act on PII findings without altering the original content. This is useful when you want to audit what PII is present before deciding how to handle it. ## Configuration @@ -47,6 +39,12 @@ python -m spacy download en_core_web_lg Install the `presidio-haystack` package to use the `PresidioEntityExtractor`. +```bash +pip install presidio-haystack +# Download the spaCy NLP model for English. For other languages, replace with the appropriate spaCy model. +python -m spacy download en_core_web_lg +``` + ### On its own ```python diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx index 6f9d07c154..d2a12a2782 100644 --- a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx @@ -21,7 +21,9 @@ description: "Use `PresidioDocumentCleaner` to replace PII in Document text with -## Installation +## Usage + +Install the `presidio-haystack` package to use the `PresidioDocumentCleaner`. 
```bash pip install presidio-haystack @@ -29,8 +31,6 @@ pip install presidio-haystack python -m spacy download en_core_web_lg ``` -## Usage - ### On its own ```python diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx index d77ca1b09d..665b956931 100644 --- a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx @@ -21,7 +21,9 @@ description: "Use `PresidioTextCleaner` to replace PII in plain strings, powered -## Installation +## Usage + +Install the `presidio-haystack` package to use the `PresidioTextCleaner`. ```bash pip install presidio-haystack @@ -29,8 +31,6 @@ pip install presidio-haystack python -m spacy download en_core_web_lg ``` -## Usage - ### On its own ```python From a2d584bd1bf43bc8bf364e74ceb956d0df0e3a96 Mon Sep 17 00:00:00 2001 From: unknown Date: Thu, 23 Apr 2026 16:23:09 +0500 Subject: [PATCH 7/7] docs(presidio): add Overview, move Configuration before Usage, add supported entities link, add Using Custom Parameters subsection Per sjrl review: adds ## Overview section to PresidioDocumentCleaner and PresidioTextCleaner pages explaining what Presidio is and when to use the component. Moves ## Configuration to right after Overview (before Usage), adds supported entities link into the entities config table row (removing standalone sentence at bottom), and moves the custom parameters code block into a ### Using Custom Parameters subsection under Usage. Applied to both current docs and versioned docs (version-2.28). 
--- .../preprocessors/presidiodocumentcleaner.mdx | 24 ++++++++++++------- .../preprocessors/presidiotextcleaner.mdx | 24 ++++++++++++------- .../preprocessors/presidiodocumentcleaner.mdx | 24 ++++++++++++------- .../preprocessors/presidiotextcleaner.mdx | 24 ++++++++++++------- 4 files changed, 64 insertions(+), 32 deletions(-) diff --git a/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx b/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx index d2a12a2782..e85d8a0295 100644 --- a/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx +++ b/docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx @@ -21,6 +21,20 @@ description: "Use `PresidioDocumentCleaner` to replace PII in Document text with +## Overview + +[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioDocumentCleaner` uses Presidio's Analyzer and Anonymizer engines to scan document text and replace detected entities with type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. + +This is useful when you want to store sanitized versions of your documents in a Document Store — for example, to prevent sensitive information from being indexed or returned in search results. + +## Configuration + +| Parameter | Default | Description | +| --- | --- | --- | +| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | +| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). | +| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | + ## Usage Install the `presidio-haystack` package to use the `PresidioDocumentCleaner`.
@@ -70,13 +84,9 @@ indexing_pipeline.run({ }) ``` -## Configuration +### Using Custom Parameters -| Parameter | Default | Description | -| --- | --- | --- | -| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | -| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | -| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | +To customize PII detection, pass parameters when initializing the cleaner: ```python from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner @@ -87,5 +97,3 @@ cleaner = PresidioDocumentCleaner( score_threshold=0.7, ) ``` - -See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. diff --git a/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx b/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx index 665b956931..36c07cf5f9 100644 --- a/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx +++ b/docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx @@ -21,6 +21,20 @@ description: "Use `PresidioTextCleaner` to replace PII in plain strings, powered +## Overview + +[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioTextCleaner` uses Presidio's Analyzer and Anonymizer engines to scan plain text strings and replace detected entities with type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. + +This is useful when you want to sanitize user queries before sending them to an LLM, ensuring that no personally identifiable information is passed to the model.
+ +## Configuration + +| Parameter | Default | Description | +| --- | --- | --- | +| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | +| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). | +| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | + ## Usage Install the `presidio-haystack` package to use the `PresidioTextCleaner`. @@ -65,13 +79,9 @@ query_pipeline.run({ }) ``` -## Configuration +### Using Custom Parameters -| Parameter | Default | Description | -| --- | --- | --- | -| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | -| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | -| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | +To customize PII detection, pass parameters when initializing the cleaner: ```python from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner @@ -82,5 +92,3 @@ cleaner = PresidioTextCleaner( score_threshold=0.7, ) ``` - -See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. 
diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx index d2a12a2782..e85d8a0295 100644 --- a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiodocumentcleaner.mdx @@ -21,6 +21,20 @@ description: "Use `PresidioDocumentCleaner` to replace PII in Document text with +## Overview + +[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioDocumentCleaner` uses Presidio's Analyzer and Anonymizer engines to scan document text and replace detected entities with type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. + +This is useful when you want to store sanitized versions of your documents in a Document Store — for example, to prevent sensitive information from being indexed or returned in search results. + +## Configuration + +| Parameter | Default | Description | +| --- | --- | --- | +| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | +| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). | +| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | + ## Usage Install the `presidio-haystack` package to use the `PresidioDocumentCleaner`. @@ -70,13 +84,9 @@ indexing_pipeline.run({ }) ``` -## Configuration +### Using Custom Parameters -| Parameter | Default | Description | -| --- | --- | --- | -| `language` | `"en"` | Language code for PII detection.
See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | -| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | -| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | +To customize PII detection, pass parameters when initializing the cleaner: ```python from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner @@ -87,5 +97,3 @@ cleaner = PresidioDocumentCleaner( score_threshold=0.7, ) ``` - -See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. diff --git a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx index 665b956931..36c07cf5f9 100644 --- a/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx +++ b/docs-website/versioned_docs/version-2.28/pipeline-components/preprocessors/presidiotextcleaner.mdx @@ -21,6 +21,20 @@ description: "Use `PresidioTextCleaner` to replace PII in plain strings, powered +## Overview + +[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioTextCleaner` uses Presidio's Analyzer and Anonymizer engines to scan plain text strings and replace detected entities with type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. + +This is useful when you want to sanitize user queries before sending them to an LLM, ensuring that no personally identifiable information is passed to the model. + +## Configuration + +| Parameter | Default | Description | +| --- | --- | --- | +| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
| +| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). | +| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | + ## Usage Install the `presidio-haystack` package to use the `PresidioTextCleaner`. @@ -65,13 +79,9 @@ query_pipeline.run({ }) ``` -## Configuration +### Using Custom Parameters -| Parameter | Default | Description | -| --- | --- | --- | -| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | -| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | -| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | +To customize PII detection, pass parameters when initializing the cleaner: ```python from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner @@ -82,5 +92,3 @@ cleaner = PresidioTextCleaner( score_threshold=0.7, ) ``` - -See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types.