Commit 08a029a

docs: add Presidio component docs pages (#11165)
1 parent 602c497 commit 08a029a

12 files changed
Lines changed: 549 additions & 1 deletion

File tree

docs-website/docs/pipeline-components/extractors.mdx

Lines changed: 1 addition & 0 deletions

@@ -11,4 +11,5 @@ slug: "/extractors"
 | [LLMDocumentContentExtractor](extractors/llmdocumentcontentextractor.mdx) | Extracts textual content from image-based documents using a vision-enabled Large Language Model (LLM). |
 | [LLMMetadataExtractor](extractors/llmmetadataextractor.mdx) | Extracts metadata from documents using a Large Language Model. The metadata is extracted by providing a prompt to an LLM that generates it. |
 | [NamedEntityExtractor](extractors/namedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents' meta field. |
+| [PresidioEntityExtractor](extractors/presidioentityextractor.mdx) | Detects PII in Documents and stores entities as structured metadata, without modifying the text. Powered by Microsoft Presidio. |
 | [RegexTextExtractor](extractors/regextextextractor.mdx) | Extracts text from chat messages or strings using a regular expression pattern. |
Lines changed: 75 additions & 0 deletions

@@ -0,0 +1,75 @@

---
title: "PresidioEntityExtractor"
id: presidioentityextractor
slug: "/presidioentityextractor"
description: "Use `PresidioEntityExtractor` to detect PII in Documents and store the entities as structured metadata, powered by Microsoft Presidio."
---

# PresidioEntityExtractor

`PresidioEntityExtractor` detects personally identifiable information (PII) in Documents and stores the detected entities as structured metadata under the `"entities"` key, without modifying the document text. Each entry contains the entity type, character offsets, and confidence score.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store |
| **Mandatory run variables** | `documents`: A list of Document objects |
| **Output variables** | `documents`: A list of Document objects with PII metadata added |
| **API reference** | [Presidio](/reference/integrations-presidio) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio |

</div>

## Overview

[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioEntityExtractor` uses Presidio's Analyzer Engine to scan document text and identify entities such as names, email addresses, phone numbers, and more.

The extractor does **not** modify the document text. Instead, it adds the detected entities as structured metadata, letting you inspect or act on PII findings without altering the original content. This is useful when you want to audit what PII is present before deciding how to handle it.

## Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). |
| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). |
| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. |
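To make the `score_threshold` cutoff concrete, here is a plain-Python sketch of how a confidence threshold filters detections. The detection dicts below are illustrative stand-ins in the shape the extractor stores, not output produced by Presidio:

```python
# Illustrative detections in the shape stored under meta["entities"].
detections = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "DATE_TIME", "start": 20, "end": 30, "score": 0.20},
]

score_threshold = 0.35  # the default cutoff
# Keep only detections at or above the threshold.
kept = [d for d in detections if d["score"] >= score_threshold]
print([d["entity_type"] for d in kept])
# ['PERSON']
```

Raising the threshold (for example to `0.7`, as in the customization example below) trades recall for precision: low-confidence detections such as the `DATE_TIME` entry here are dropped.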
## Usage

Install the `presidio-haystack` package to use the `PresidioEntityExtractor`.

```bash
pip install presidio-haystack
# Download the spaCy NLP model for English. For other languages, replace with the appropriate spaCy model.
python -m spacy download en_core_web_lg
```

### On its own

```python
from haystack import Document
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

extractor = PresidioEntityExtractor()
result = extractor.run(documents=[
    Document(content="Contact Alice at alice@example.com")
])
print(result["documents"][0].meta["entities"])
# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
#  {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
```
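Because entities are stored as character offsets, you can recover the matched substrings directly from the document content. A minimal plain-Python sketch, with the entity metadata hardcoded to match the example output above:

```python
content = "Contact Alice at alice@example.com"
# Entity metadata in the shape the extractor stores (copied from the example above).
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]

# Slice the content with the stored offsets to see what each detection matched.
spans = {e["entity_type"]: content[e["start"]:e["end"]] for e in entities}
print(spans)
# {'PERSON': 'Alice', 'EMAIL_ADDRESS': 'alice@example.com'}
```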
### Using custom parameters

To customize entity detection, pass parameters when initializing the extractor:

```python
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

extractor = PresidioEntityExtractor(
    language="de",
    entities=["PERSON", "EMAIL_ADDRESS"],
    score_threshold=0.7,
)
```

docs-website/docs/pipeline-components/preprocessors.mdx

Lines changed: 2 additions & 0 deletions

@@ -19,5 +19,7 @@ Use the PreProcessors to prepare your data normalize white spaces, remove header
 | [DocumentSplitter](preprocessors/documentsplitter.mdx) | Splits a list of text documents into a list of text documents with shorter texts. |
 | [HierarchicalDocumentSplitter](preprocessors/hierarchicaldocumentsplitter.mdx) | Creates a multi-level document structure based on parent-children relationships between text segments. |
 | [MarkdownHeaderSplitter](preprocessors/markdownheadersplitter.mdx) | Splits documents at ATX-style Markdown headers (#), with optional secondary splitting. Preserves header hierarchy as metadata. |
+| [PresidioDocumentCleaner](preprocessors/presidiodocumentcleaner.mdx) | Replaces PII in Document text with entity type placeholders using Microsoft Presidio. |
+| [PresidioTextCleaner](preprocessors/presidiotextcleaner.mdx) | Replaces PII in plain strings — useful for sanitizing user queries before they reach an LLM. |
 | [RecursiveSplitter](preprocessors/recursivesplitter.mdx) | Splits text into smaller chunks by recursively applying a list of separators to the text, in the order they are provided. |
 | [TextCleaner](preprocessors/textcleaner.mdx) | Removes regexes, punctuation, and numbers, as well as converts text to lowercase. Useful to clean up text data before evaluation. |
Lines changed: 99 additions & 0 deletions

@@ -0,0 +1,99 @@

---
title: "PresidioDocumentCleaner"
id: presidiodocumentcleaner
slug: "/presidiodocumentcleaner"
description: "Use `PresidioDocumentCleaner` to replace PII in Document text with entity type placeholders, powered by Microsoft Presidio."
---

# PresidioDocumentCleaner

`PresidioDocumentCleaner` replaces personally identifiable information (PII) in the text content of Documents with entity type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. Original Documents are not mutated. Documents without text content pass through unchanged.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store |
| **Mandatory run variables** | `documents`: A list of Document objects |
| **Output variables** | `documents`: A list of Document objects with PII replaced |
| **API reference** | [Presidio](/reference/integrations-presidio) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio |

</div>

## Overview

[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioDocumentCleaner` uses Presidio's Analyzer and Anonymizer engines to scan document text and replace detected entities with type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`.

This is useful when you want to store sanitized versions of your documents in a Document Store — for example, to prevent sensitive information from being indexed or returned in search results.

## Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). |
| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). |
| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. |

## Usage

Install the `presidio-haystack` package to use the `PresidioDocumentCleaner`.

```bash
pip install presidio-haystack
# Download the English NLP model required by Presidio's analyzer engine
python -m spacy download en_core_web_lg
```

### On its own

```python
from haystack import Document
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner

cleaner = PresidioDocumentCleaner()
result = cleaner.run(documents=[
    Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.")
])
print(result["documents"][0].content)
# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
```
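Conceptually, anonymization splices a placeholder into the text at each detected offset. The following plain-Python sketch illustrates that replacement step; it is not the integration's actual implementation, and the offsets are hardcoded to match the example above:

```python
def replace_with_placeholders(text: str, entities: list[dict]) -> str:
    # Splice right-to-left so earlier offsets stay valid after each replacement.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f"<{e['entity_type']}>" + text[e["end"]:]
    return text

text = "Contact Alice Smith at alice@example.com or 212-555-1234."
detected = [  # illustrative detections matching the example above
    {"entity_type": "PERSON", "start": 8, "end": 19},
    {"entity_type": "EMAIL_ADDRESS", "start": 23, "end": 40},
    {"entity_type": "PHONE_NUMBER", "start": 44, "end": 56},
]
print(replace_with_placeholders(text, detected))
# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
```

Replacing from the end of the string backwards is what keeps the stored character offsets valid even though each placeholder changes the text length.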
### In a pipeline

```python
from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner

document_store = InMemoryDocumentStore()

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("cleaner", PresidioDocumentCleaner())
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("cleaner", "writer")

indexing_pipeline.run({
    "cleaner": {
        "documents": [
            Document(content="Alice Smith's email is alice@example.com"),
            Document(content="Call Bob at 212-555-9876"),
        ]
    }
})
```

### Using custom parameters

To customize PII detection, pass parameters when initializing the cleaner:

```python
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner

cleaner = PresidioDocumentCleaner(
    language="de",
    entities=["PERSON", "EMAIL_ADDRESS"],
    score_threshold=0.7,
)
```
Lines changed: 94 additions & 0 deletions

@@ -0,0 +1,94 @@

---
title: "PresidioTextCleaner"
id: presidiotextcleaner
slug: "/presidiotextcleaner"
description: "Use `PresidioTextCleaner` to replace PII in plain strings, powered by Microsoft Presidio."
---

# PresidioTextCleaner

`PresidioTextCleaner` replaces personally identifiable information (PII) in plain strings. It takes a `list[str]` as input and returns a `list[str]`, making it easy to sanitize user queries before they are sent to an LLM.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | In a query pipeline, before a Generator or Chat Generator |
| **Mandatory run variables** | `texts`: A list of strings |
| **Output variables** | `texts`: A list of strings with PII replaced |
| **API reference** | [Presidio](/reference/integrations-presidio) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio |

</div>

## Overview

[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioTextCleaner` uses Presidio's Analyzer and Anonymizer engines to scan plain text strings and replace detected entities with type placeholders such as `<PERSON>` or `<US_SSN>`.

This is useful when you want to sanitize user queries before sending them to an LLM, ensuring that no personally identifiable information is passed to the model.

## Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). |
| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). |
| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. |

## Usage

Install the `presidio-haystack` package to use the `PresidioTextCleaner`.

```bash
pip install presidio-haystack
# Download the English NLP model required by Presidio's analyzer engine
python -m spacy download en_core_web_lg
```

### On its own

```python
from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner

cleaner = PresidioTextCleaner()
result = cleaner.run(texts=["My name is John Doe, my SSN is 123-45-6789"])
print(result["texts"][0])
# My name is <PERSON>, my SSN is <US_SSN>
```

### In a pipeline

```python
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner

template = [ChatMessage.from_user("Answer this question: {{query}}")]

query_pipeline = Pipeline()
query_pipeline.add_component("cleaner", PresidioTextCleaner())
query_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template))
query_pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini"))
query_pipeline.connect("cleaner.texts[0]", "prompt_builder.query")
query_pipeline.connect("prompt_builder", "llm")

query_pipeline.run({
    "cleaner": {"texts": ["My name is John Smith. What is the capital of France?"]}
})
```

### Using custom parameters

To customize PII detection, pass parameters when initializing the cleaner:

```python
from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner

cleaner = PresidioTextCleaner(
    language="de",
    entities=["PERSON", "EMAIL_ADDRESS"],
    score_threshold=0.7,
)
```

docs-website/sidebars.js

Lines changed: 3 additions & 0 deletions

@@ -352,6 +352,7 @@ export default {
 'pipeline-components/extractors/llmdocumentcontentextractor',
 'pipeline-components/extractors/llmmetadataextractor',
 'pipeline-components/extractors/namedentityextractor',
+'pipeline-components/extractors/presidioentityextractor',
 'pipeline-components/extractors/regextextextractor',
 ],
 },
@@ -469,6 +470,8 @@ export default {
 'pipeline-components/preprocessors/hierarchicaldocumentsplitter',
 'pipeline-components/preprocessors/recursivesplitter',
 'pipeline-components/preprocessors/textcleaner',
+'pipeline-components/preprocessors/presidiodocumentcleaner',
+'pipeline-components/preprocessors/presidiotextcleaner',
 ],
 },
 {

docs-website/versioned_docs/version-2.28/pipeline-components/extractors.mdx

Lines changed: 1 addition & 0 deletions

@@ -11,4 +11,5 @@ slug: "/extractors"
 | [LLMDocumentContentExtractor](extractors/llmdocumentcontentextractor.mdx) | Extracts textual content from image-based documents using a vision-enabled Large Language Model (LLM). |
 | [LLMMetadataExtractor](extractors/llmmetadataextractor.mdx) | Extracts metadata from documents using a Large Language Model. The metadata is extracted by providing a prompt to an LLM that generates it. |
 | [NamedEntityExtractor](extractors/namedentityextractor.mdx) | Extracts predefined entities out of a piece of text and writes them into documents' meta field. |
+| [PresidioEntityExtractor](extractors/presidioentityextractor.mdx) | Detects PII in Documents and stores entities as structured metadata, without modifying the text. Powered by Microsoft Presidio. |
 | [RegexTextExtractor](extractors/regextextextractor.mdx) | Extracts text from chat messages or strings using a regular expression pattern. |