Commit 193cde6

HaystackBot and sjrl authored
docs: sync Core Integrations API reference (presidio) on Docusaurus (#11161)
Co-authored-by: sjrl <10526848+sjrl@users.noreply.github.com>
1 parent 8996e48 commit 193cde6

12 files changed

Lines changed: 2868 additions & 0 deletions

Lines changed: 239 additions & 0 deletions
@@ -0,0 +1,239 @@
---
title: "Presidio"
id: integrations-presidio
description: "Presidio integration for Haystack"
slug: "/integrations-presidio"
---

## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner

### PresidioDocumentCleaner

Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/).

Accepts a list of Documents, detects personally identifiable information (PII) in their
text content, and returns new Documents with PII replaced by entity type placeholders
(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated.

Documents without text content are passed through unchanged.

The analyzer and anonymizer engines are loaded on the first call to `run()`,
or by calling `warm_up()` explicitly beforehand.

### Usage example

```python
from haystack import Document
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner

cleaner = PresidioDocumentCleaner()
result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")])
print(result["documents"][0].content)
# My name is <PERSON> and my email is <EMAIL_ADDRESS>
```

#### __init__

```python
__init__(
    *,
    language: str = "en",
    entities: list[str] | None = None,
    score_threshold: float = 0.35
) -> None
```

Initializes the PresidioDocumentCleaner.

**Parameters:**

- **language** (<code>str</code>) – Language code for PII detection. Defaults to `"en"`.
  See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`).
  If `None`, all supported entity types are used.
  See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`.
  See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).

#### warm_up

```python
warm_up() -> None
```

Initializes the Presidio analyzer and anonymizer engines.

This method loads the underlying NLP models. In a Haystack Pipeline,
this is called automatically before the first run.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Anonymizes PII in the provided Documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents whose text content will be anonymized.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with key `documents` containing the cleaned Documents.
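
The placeholder substitution described for this component can be pictured with plain Python. The sketch below is not Presidio's anonymizer and does not call the integration: `replace_with_placeholders` is a hypothetical helper, and the spans are hand-written stand-ins for what a detector might report. It assumes non-overlapping spans with an exclusive `end` offset, and only illustrates how each span is swapped for its `<ENTITY_TYPE>` tag while the input string stays untouched.

```python
def replace_with_placeholders(text: str, spans: list[dict]) -> str:
    """Return a copy of `text` with every span replaced by `<ENTITY_TYPE>`.

    `spans` is a hypothetical list of dicts with "entity_type", "start" and
    "end" keys ("end" exclusive). Spans are assumed not to overlap.
    """
    result = text
    # Work from right to left so that earlier offsets remain valid
    # after each replacement changes the string length.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"<{span['entity_type']}>"
        result = result[: span["start"]] + placeholder + result[span["end"]:]
    return result


original = "My name is John and my email is john@example.com"
name = "John"
email = "john@example.com"

# Hand-written spans standing in for detector output (illustrative only).
spans = [
    {
        "entity_type": "PERSON",
        "start": original.index(name),
        "end": original.index(name) + len(name),
    },
    {
        "entity_type": "EMAIL_ADDRESS",
        "start": original.index(email),
        "end": original.index(email) + len(email),
    },
]

cleaned = replace_with_placeholders(original, spans)
print(cleaned)
# My name is <PERSON> and my email is <EMAIL_ADDRESS>
```

Because Python strings are immutable, `original` is unchanged after the call, which mirrors the documented behaviour that the component returns new Documents rather than mutating its input.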

## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor

### PresidioEntityExtractor

Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer.

See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details.

Accepts a list of Documents and returns new Documents with detected PII entities stored
in each Document's metadata under the key `"entities"`. Each entry in the list contains
the entity type, start/end character offsets, and the confidence score.

Original Documents are not mutated. Documents without text content are passed through unchanged.

The analyzer engine is loaded on the first call to `run()`,
or by calling `warm_up()` explicitly beforehand.

### Usage example

```python
from haystack import Document
from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor

extractor = PresidioEntityExtractor()
result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")])
print(result["documents"][0].meta["entities"])
# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
#  {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
```

#### __init__

```python
__init__(
    *,
    language: str = "en",
    entities: list[str] | None = None,
    score_threshold: float = 0.35
) -> None
```

Initializes the PresidioEntityExtractor.

**Parameters:**

- **language** (<code>str</code>) – Language code for PII detection. Defaults to `"en"`.
  See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`).
  If `None`, all supported entity types are detected.
  See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`.
  See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
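
If the constructor-level `score_threshold` is kept low to favour recall, the stored entities can still be narrowed afterwards in user code. The sketch below works on a hand-written list in the shape shown in the usage example; `filter_entities`, the extra `URL` entry, and its score are illustrative assumptions, not output from the integration.

```python
from typing import Optional


def filter_entities(
    entities: list[dict],
    min_score: float,
    allowed_types: Optional[set] = None,
) -> list[dict]:
    """Keep entries scoring at least `min_score`, optionally restricted by type."""
    kept = []
    for entity in entities:
        if entity["score"] < min_score:
            continue  # below the stricter, user-chosen cutoff
        if allowed_types is not None and entity["entity_type"] not in allowed_types:
            continue  # type not requested
        kept.append(entity)
    return kept


# Hand-written sample in the documented shape (values are illustrative).
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
    {"entity_type": "URL", "start": 23, "end": 34, "score": 0.5},
]

strict = filter_entities(entities, min_score=0.8)
emails_only = filter_entities(entities, min_score=0.0, allowed_types={"EMAIL_ADDRESS"})

print([e["entity_type"] for e in strict])
# ['PERSON', 'EMAIL_ADDRESS']
print([e["entity_type"] for e in emails_only])
# ['EMAIL_ADDRESS']
```

Filtering after extraction keeps one extractor instance reusable across consumers with different precision needs, at the cost of storing detections that some of them will discard.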

#### warm_up

```python
warm_up() -> None
```

Initializes the Presidio analyzer engine.

This method loads the underlying NLP models. In a Haystack Pipeline,
this is called automatically before the first run.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Detects PII entities in the provided Documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents to analyze for PII entities.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with key `documents` containing Documents with detected entities
  stored in metadata under the key `"entities"`.
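
The offsets stored under `"entities"` can be turned back into the matched substrings. This sketch reuses the values from the usage example rather than calling the extractor, and assumes `end` is exclusive, which is consistent with that example (`content[8:13]` is `"Alice"`). `entity_texts` is a hypothetical helper.

```python
from collections import defaultdict


def entity_texts(content: str, entities: list[dict]) -> dict:
    """Group the matched substrings of `content` by entity type."""
    grouped = defaultdict(list)
    for entity in entities:
        # Slice with an exclusive end offset (assumption based on the example).
        matched = content[entity["start"]:entity["end"]]
        grouped[entity["entity_type"]].append(matched)
    return dict(grouped)


content = "Contact Alice at alice@example.com"
# Values copied from the usage example; the scores are illustrative.
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]

found = entity_texts(content, entities)
print(found)
# {'PERSON': ['Alice'], 'EMAIL_ADDRESS': ['alice@example.com']}
```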

## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner

### PresidioTextCleaner

Anonymizes PII in plain strings using [Microsoft Presidio](https://microsoft.github.io/presidio/).

Accepts a list of strings, detects personally identifiable information (PII), and returns
a new list of strings with PII replaced by entity type placeholders (e.g. `<PERSON>`).
Useful for sanitizing user queries before they are sent to an LLM.

The analyzer and anonymizer engines are loaded on the first call to `run()`,
or by calling `warm_up()` explicitly beforehand.

### Usage example

```python
from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner

cleaner = PresidioTextCleaner()
result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"])
print(result["texts"][0])
# Hi, I am <PERSON>, call me at <PHONE_NUMBER>
```

#### __init__

```python
__init__(
    *,
    language: str = "en",
    entities: list[str] | None = None,
    score_threshold: float = 0.35
) -> None
```

Initializes the PresidioTextCleaner.

**Parameters:**

- **language** (<code>str</code>) – Language code for PII detection. Defaults to `"en"`.
  See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`).
  If `None`, all supported entity types are used.
  See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`.
  See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).

#### warm_up

```python
warm_up() -> None
```

Initializes the Presidio analyzer and anonymizer engines.

This method loads the underlying NLP models. In a Haystack Pipeline,
this is called automatically before the first run.

#### run

```python
run(texts: list[str]) -> dict[str, list[str]]
```

Anonymizes PII in the provided strings.

**Parameters:**

- **texts** (<code>list\[str\]</code>) – List of strings to anonymize.

**Returns:**

- <code>dict\[str, list\[str\]\]</code> – A dictionary with key `texts` containing the cleaned strings.
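
When cleaned strings are forwarded to an LLM, it can be useful to record which kinds of PII were masked without logging the original values. The sketch below tallies placeholders in already-cleaned strings. It assumes placeholders always look like `<UPPER_SNAKE_CASE>`, as in the examples in this reference, so unrelated text of the same shape (for instance `<HTML>`) would also be counted. `count_placeholders` is a hypothetical helper and the sample strings are hand-written.

```python
import re
from collections import Counter

# Assumed placeholder shape, inferred from the documented examples.
PLACEHOLDER = re.compile(r"<([A-Z_]+)>")


def count_placeholders(texts: list[str]) -> Counter:
    """Count entity-type placeholders across a list of cleaned strings."""
    counts = Counter()
    for text in texts:
        counts.update(PLACEHOLDER.findall(text))
    return counts


# Hand-written stand-ins for the value under the "texts" key.
cleaned_texts = [
    "Hi, I am <PERSON>, call me at <PHONE_NUMBER>",
    "No personal data here",
]

counts = count_placeholders(cleaned_texts)
print(dict(counts))
# {'PERSON': 1, 'PHONE_NUMBER': 1}
```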
