1 change: 1 addition & 0 deletions docs-website/docs/pipeline-components/extractors.mdx
@@ -11,4 +11,5 @@ slug: "/extractors"
| [LLMDocumentContentExtractor](extractors/llmdocumentcontentextractor.mdx) | Extracts textual content from image-based documents using a vision-enabled Large Language Model (LLM). |
| [LLMMetadataExtractor](extractors/llmmetadataextractor.mdx) | Extracts metadata from documents using a Large Language Model. The metadata is extracted by providing a prompt to an LLM that generates it. |
| [NamedEntityExtractor](extractors/namedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents' meta field. |
| [PresidioEntityExtractor](extractors/presidioentityextractor.mdx) | Detects PII in Documents and stores entities as structured metadata, without modifying the text. Powered by Microsoft Presidio. |
| [RegexTextExtractor](extractors/regextextextractor.mdx) | Extracts text from chat messages or strings using a regular expression pattern. |
@@ -0,0 +1,75 @@
---
title: "PresidioEntityExtractor"
id: presidioentityextractor
slug: "/presidioentityextractor"
description: "Use `PresidioEntityExtractor` to detect PII in Documents and store the entities as structured metadata, powered by Microsoft Presidio."
---

# PresidioEntityExtractor

`PresidioEntityExtractor` detects personally identifiable information (PII) in Documents and stores the detected entities as structured metadata under the `"entities"` key, without modifying the document text. Each entry contains the entity type, character offsets, and confidence score.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store |
| **Mandatory run variables** | `documents`: A list of Document objects |
| **Output variables** | `documents`: A list of Document objects with PII metadata added |
| **API reference** | [Presidio](/reference/integrations-presidio) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio |

</div>

## Overview

[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioEntityExtractor` uses Presidio's Analyzer Engine to scan document text and identify entities such as names, email addresses, phone numbers, and more.

The extractor does **not** modify the document text. Instead, it adds the detected entities as structured metadata, letting you inspect or act on PII findings without altering the original content. This is useful when you want to audit what PII is present before deciding how to handle it.

## Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). |
| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). |
| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. |
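For example, `score_threshold` drops any detection whose confidence falls below the cutoff. A minimal sketch of that filtering step, using hypothetical detection dicts whose `entity_type`/`start`/`end`/`score` fields mirror what Presidio's analyzer reports:

```python
# Hypothetical detections, in the same shape Presidio's analyzer reports
detections = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "DATE_TIME", "start": 20, "end": 25, "score": 0.20},
]

score_threshold = 0.35
# Keep only detections at or above the confidence cutoff
kept = [d for d in detections if d["score"] >= score_threshold]
print([d["entity_type"] for d in kept])
# ['PERSON']
```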

## Usage

Install the `presidio-haystack` package to use the `PresidioEntityExtractor`.

```bash
pip install presidio-haystack
# Download the spaCy NLP model for English. For other languages, replace with the appropriate spaCy model.
python -m spacy download en_core_web_lg
```

### On its own

```python
from haystack import Document
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

extractor = PresidioEntityExtractor()
result = extractor.run(documents=[
    Document(content="Contact Alice at alice@example.com")
])
print(result["documents"][0].meta["entities"])
# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
#  {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
```
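The `start` and `end` values are character offsets into the document's content, so you can slice the original text to recover each matched span. A quick sketch using the output above:

```python
text = "Contact Alice at alice@example.com"
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]

# Slice the text with each entity's character offsets
spans = {e["entity_type"]: text[e["start"]:e["end"]] for e in entities}
print(spans)
# {'PERSON': 'Alice', 'EMAIL_ADDRESS': 'alice@example.com'}
```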

### Using custom parameters

To customize entity detection, pass parameters when initializing the extractor:

```python
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

extractor = PresidioEntityExtractor(
    language="de",
    entities=["PERSON", "EMAIL_ADDRESS"],
    score_threshold=0.7,
)
```
2 changes: 2 additions & 0 deletions docs-website/docs/pipeline-components/preprocessors.mdx
@@ -19,5 +19,7 @@ Use the PreProcessors to prepare your data: normalize white spaces, remove header
| [DocumentSplitter](preprocessors/documentsplitter.mdx) | Splits a list of text documents into a list of text documents with shorter texts. |
| [HierarchicalDocumentSplitter](preprocessors/hierarchicaldocumentsplitter.mdx) | Creates a multi-level document structure based on parent-children relationships between text segments. |
| [MarkdownHeaderSplitter](preprocessors/markdownheadersplitter.mdx) | Splits documents at ATX-style Markdown headers (#), with optional secondary splitting. Preserves header hierarchy as metadata. |
| [PresidioDocumentCleaner](preprocessors/presidiodocumentcleaner.mdx) | Replaces PII in Document text with entity type placeholders using Microsoft Presidio. |
| [PresidioTextCleaner](preprocessors/presidiotextcleaner.mdx) | Replaces PII in plain strings — useful for sanitizing user queries before they reach an LLM. |
| [RecursiveSplitter](preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks by recursively applying a list of separators to the text, in the order they are provided. |
| [TextCleaner](preprocessors/textcleaner.mdx) | Removes substrings matching the given regexes, as well as punctuation and numbers, and converts text to lowercase. Useful for cleaning up text data before evaluation. |
@@ -0,0 +1,99 @@
---
title: "PresidioDocumentCleaner"
id: presidiodocumentcleaner
slug: "/presidiodocumentcleaner"
description: "Use `PresidioDocumentCleaner` to replace PII in Document text with entity type placeholders, powered by Microsoft Presidio."
---

# PresidioDocumentCleaner

`PresidioDocumentCleaner` replaces personally identifiable information (PII) in the text content of Documents with entity type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. Original Documents are not mutated. Documents without text content pass through unchanged.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store |
| **Mandatory run variables** | `documents`: A list of Document objects |
| **Output variables** | `documents`: A list of Document objects with PII replaced |
| **API reference** | [Presidio](/reference/integrations-presidio) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio |

</div>

## Overview

[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioDocumentCleaner` uses Presidio's Analyzer and Anonymizer engines to scan document text and replace detected entities with type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`.
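Conceptually, the anonymization step swaps each detected span for its entity-type placeholder. A simplified, self-contained sketch of that logic, using hypothetical detections with character offsets (an illustration, not the component's actual implementation):

```python
def replace_with_placeholders(text: str, entities: list[dict]) -> str:
    """Replace detected spans, working right to left so earlier offsets stay valid."""
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f"<{e['entity_type']}>" + text[e["end"]:]
    return text

# Hypothetical detections for the example text below
text = "Contact Alice Smith at alice@example.com or 212-555-1234."
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 19},
    {"entity_type": "EMAIL_ADDRESS", "start": 23, "end": 40},
    {"entity_type": "PHONE_NUMBER", "start": 44, "end": 56},
]
print(replace_with_placeholders(text, entities))
# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
```

Replacing from right to left matters: a placeholder is rarely the same length as the text it replaces, so substituting left to right would invalidate the offsets of later entities.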

This is useful when you want to store sanitized versions of your documents in a Document Store — for example, to prevent sensitive information from being indexed or returned in search results.

## Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). |
| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). |
| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. |

## Usage

Install the `presidio-haystack` package to use the `PresidioDocumentCleaner`.

```bash
pip install presidio-haystack
# Download the English NLP model required by Presidio's analyzer engine
python -m spacy download en_core_web_lg
```

### On its own

```python
from haystack import Document
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner

cleaner = PresidioDocumentCleaner()
result = cleaner.run(documents=[
    Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.")
])
print(result["documents"][0].content)
# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
```

### In a pipeline

```python
from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner

document_store = InMemoryDocumentStore()

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("cleaner", PresidioDocumentCleaner())
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("cleaner", "writer")

indexing_pipeline.run({
    "cleaner": {
        "documents": [
            Document(content="Alice Smith's email is alice@example.com"),
            Document(content="Call Bob at 212-555-9876"),
        ]
    }
})
```

### Using custom parameters

To customize PII detection, pass parameters when initializing the cleaner:

```python
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner

cleaner = PresidioDocumentCleaner(
    language="de",
    entities=["PERSON", "EMAIL_ADDRESS"],
    score_threshold=0.7,
)
```
@@ -0,0 +1,94 @@
---
title: "PresidioTextCleaner"
id: presidiotextcleaner
slug: "/presidiotextcleaner"
description: "Use `PresidioTextCleaner` to replace PII in plain strings, powered by Microsoft Presidio."
---

# PresidioTextCleaner

`PresidioTextCleaner` replaces personally identifiable information (PII) in plain strings. It takes a `list[str]` as input and returns a `list[str]`, making it easy to sanitize user queries before they are sent to an LLM.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | In a query pipeline, before a Generator or Chat Generator |
| **Mandatory run variables** | `texts`: A list of strings |
| **Output variables** | `texts`: A list of strings with PII replaced |
| **API reference** | [Presidio](/reference/integrations-presidio) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio |

</div>

## Overview

[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioTextCleaner` uses Presidio's Analyzer and Anonymizer engines to scan plain text strings and replace detected entities with type placeholders such as `<PERSON>` or `<US_SSN>`.

This is useful when you want to sanitize user queries before sending them to an LLM, ensuring that no personally identifiable information is passed to the model.

## Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). |
| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). |
| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. |

## Usage

Install the `presidio-haystack` package to use the `PresidioTextCleaner`.

```bash
pip install presidio-haystack
# Download the English NLP model required by Presidio's analyzer engine
python -m spacy download en_core_web_lg
```

### On its own

```python
from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner

cleaner = PresidioTextCleaner()
result = cleaner.run(texts=["My name is John Doe, my SSN is 123-45-6789"])
print(result["texts"][0])
# My name is <PERSON>, my SSN is <US_SSN>
```

### In a pipeline

```python
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner

template = [ChatMessage.from_user("Answer this question: {{query}}")]

query_pipeline = Pipeline()
query_pipeline.add_component("cleaner", PresidioTextCleaner())
query_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template))
query_pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini"))
query_pipeline.connect("cleaner.texts[0]", "prompt_builder.query")
query_pipeline.connect("prompt_builder", "llm")

query_pipeline.run({
    "cleaner": {"texts": ["My name is John Smith. What is the capital of France?"]}
})
```

### Using custom parameters

To customize PII detection, pass parameters when initializing the cleaner:

```python
from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner

cleaner = PresidioTextCleaner(
    language="de",
    entities=["PERSON", "EMAIL_ADDRESS"],
    score_threshold=0.7,
)
```
3 changes: 3 additions & 0 deletions docs-website/sidebars.js
@@ -352,6 +352,7 @@ export default {
'pipeline-components/extractors/llmdocumentcontentextractor',
'pipeline-components/extractors/llmmetadataextractor',
'pipeline-components/extractors/namedentityextractor',
'pipeline-components/extractors/presidioentityextractor',
'pipeline-components/extractors/regextextextractor',
],
},
@@ -469,6 +470,8 @@ export default {
'pipeline-components/preprocessors/hierarchicaldocumentsplitter',
'pipeline-components/preprocessors/recursivesplitter',
'pipeline-components/preprocessors/textcleaner',
'pipeline-components/preprocessors/presidiodocumentcleaner',
'pipeline-components/preprocessors/presidiotextcleaner',
],
},
{