Commit 227d03f

feat: add Presidio integration page (#455)
* feat: add Presidio integration page

  Adds integration tile for presidio-haystack with usage examples for PresidioDocumentCleaner, PresidioTextCleaner, and PresidioEntityExtractor.

  Related: deepset-ai/haystack-core-integrations#3063

* docs(presidio): update PresidioEntityExtractor import path to extractors

* docs(presidio): address review feedback

  - Add Shahmeer Ali as co-author
  - Remove unrelated thunderbolt files
  - Expand installation section with spaCy model guidance, language support note, and sm vs lg clarification

* Apply suggestions from code review

  Co-authored-by: Kacper Łukawski <kacperlukawski@users.noreply.github.com>

---------

Co-authored-by: Kacper Łukawski <kacperlukawski@users.noreply.github.com>
1 parent 7de3a7e commit 227d03f

1 file changed

Lines changed: 112 additions & 0 deletions

integrations/presidio.md

@@ -0,0 +1,112 @@
---
layout: integration
name: Presidio
description: PII detection and anonymization for Haystack Documents and text strings, powered by Microsoft Presidio.
authors:
  - name: deepset
    socials:
      github: deepset-ai
      twitter: deepset_ai
      linkedin: https://www.linkedin.com/company/deepset-ai/
  - name: Shahmeer Ali
    socials:
      github: SyedShahmeerAli12
pypi: https://pypi.org/project/presidio-haystack/
repo: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio
type: Custom Component
report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues
logo: /logos/microsoft.png
version: Haystack 2.0
toc: true
---

### Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
  - [Document Cleaning](#document-cleaning)
  - [Text Cleaning](#text-cleaning)
  - [Entity Extraction](#entity-extraction)
- [License](#license)

## Overview

[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source library for PII detection and anonymization using NLP-based entity recognition.

`presidio-haystack` provides three Haystack components:

| Component | Input | Purpose |
|-----------|-------|---------|
| `PresidioDocumentCleaner` | `list[Document]` | Replace PII in document text with entity type placeholders |
| `PresidioTextCleaner` | `list[str]` | Replace PII in plain strings; useful for sanitizing user queries |
| `PresidioEntityExtractor` | `list[Document]` | Detect PII and store entities as structured document metadata |

All components run locally; no external API is required. Presidio uses spaCy NLP models under the hood.

## Installation

```bash
pip install presidio-haystack
```

For English, `en_core_web_lg` is the recommended spaCy model for best accuracy. For a lighter footprint, `en_core_web_sm` also works; see the [full list of spaCy models](https://spacy.io/models/en) for other options.

Each component accepts a `language` parameter (default `"en"`). To process a non-English language, pass its language code and also provide a model mapping, unless you want to use the large model.

## Usage

### Document Cleaning

Replace PII in document content before indexing:

```python
from haystack import Document
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner

cleaner = PresidioDocumentCleaner()
result = cleaner.run(documents=[
    Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.")
])
print(result["documents"][0].content)
# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
```

Original documents are not mutated. Documents with no text content pass through unchanged.

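The placeholder substitution in the example above can be pictured as replacing character spans with `<ENTITY_TYPE>` tags. The following is a minimal, library-free sketch of that idea, not Presidio's actual implementation: the `replace_spans` helper and the hand-built span list are hypothetical and only stand in for what a real analyzer would detect.

```python
def replace_spans(text: str, spans: list[dict]) -> str:
    """Replace each span with a <ENTITY_TYPE> placeholder.

    Hypothetical helper for illustration; spans are assumed not to overlap.
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[: span["start"]] + f"<{span['entity_type']}>" + text[span["end"]:]
    return text


text = "Contact Alice Smith at alice@example.com or 212-555-1234."

# Hand-built spans standing in for analyzer output (assumed, not detected).
spans = []
for entity_type, value in [
    ("PERSON", "Alice Smith"),
    ("EMAIL_ADDRESS", "alice@example.com"),
    ("PHONE_NUMBER", "212-555-1234"),
]:
    start = text.find(value)
    spans.append({"entity_type": entity_type, "start": start, "end": start + len(value)})

redacted = replace_spans(text, spans)
print(redacted)
# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
```

Because Python strings are immutable, `text` itself is left unchanged and a new string is returned, which mirrors the "originals are not mutated" behaviour described for the component.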
### Text Cleaning

Sanitize user queries before they reach your LLM:

```python
from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner

cleaner = PresidioTextCleaner()
result = cleaner.run(texts=["My name is John Doe, my SSN is 123-45-6789"])
print(result["texts"][0])
# My name is <PERSON>, my SSN is <US_SSN>
```

### Entity Extraction

Detect PII and attach it as structured metadata without modifying the document text:

```python
from haystack import Document
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

extractor = PresidioEntityExtractor()
result = extractor.run(documents=[
    Document(content="Contact Alice at alice@example.com")
])
print(result["documents"][0].meta["entities"])
# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
#  {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
```

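Because the text is left untouched, the `start` and `end` offsets in each entity can be used to slice the matched substring back out of the document content. A minimal sketch, reusing the illustrative output above as literal data: no Presidio call is made, and the offsets are assumed to be end-exclusive character indices, as the example suggests.

```python
content = "Contact Alice at alice@example.com"

# Entities copied from the example output above; scores are illustrative.
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]

# Map each entity type to the text it covers.
matches = {e["entity_type"]: content[e["start"]:e["end"]] for e in entities}
print(matches)
# {'PERSON': 'Alice', 'EMAIL_ADDRESS': 'alice@example.com'}
```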
All three components accept `language`, `entities`, and `score_threshold` parameters at init time. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types.

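To give a feel for what `entities` and `score_threshold` control, here is a library-free sketch that filters a list of detections by type and minimum score. It only mimics the idea: the `filter_entities` helper is hypothetical, and both the inclusive threshold comparison and the "`None` means all types" behaviour are assumptions rather than documented behaviour of the components.

```python
def filter_entities(detections, entities=None, score_threshold=0.0):
    """Keep detections of the requested types whose score meets the threshold.

    Hypothetical helper; `entities=None` is assumed to mean "all types".
    """
    return [
        d for d in detections
        if (entities is None or d["entity_type"] in entities)
        and d["score"] >= score_threshold
    ]


# Illustrative detections, shaped like the entity metadata shown earlier.
detections = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]

high_confidence = filter_entities(detections, score_threshold=0.9)
people_only = filter_entities(detections, entities=["PERSON"])
print([d["entity_type"] for d in high_confidence])  # ['EMAIL_ADDRESS']
print([d["entity_type"] for d in people_only])      # ['PERSON']
```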
## License

`presidio-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.
