| 1 | +--- |
| 2 | +layout: integration |
| 3 | +name: Presidio |
| 4 | +description: PII detection and anonymization for Haystack Documents and text strings, powered by Microsoft Presidio. |
| 5 | +authors: |
| 6 | + - name: deepset |
| 7 | + socials: |
| 8 | + github: deepset-ai |
| 9 | + twitter: deepset_ai |
| 10 | + linkedin: https://www.linkedin.com/company/deepset-ai/ |
| 11 | + - name: Shahmeer Ali |
| 12 | + socials: |
| 13 | + github: SyedShahmeerAli12 |
| 14 | +pypi: https://pypi.org/project/presidio-haystack/ |
| 15 | +repo: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio |
| 16 | +type: Custom Component |
| 17 | +report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues |
| 18 | +logo: /logos/microsoft.png |
| 19 | +version: Haystack 2.0 |
| 20 | +toc: true |
| 21 | +--- |
| 22 | + |
| 23 | +### Table of Contents |
| 24 | + |
| 25 | +- [Overview](#overview) |
| 26 | +- [Installation](#installation) |
| 27 | +- [Usage](#usage) |
| 28 | + - [Document Cleaning](#document-cleaning) |
| 29 | + - [Text Cleaning](#text-cleaning) |
| 30 | + - [Entity Extraction](#entity-extraction) |
| 31 | +- [License](#license) |
| 32 | + |
| 33 | +## Overview |
| 34 | + |
| 35 | +[Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source library for PII detection and anonymization using NLP-based entity recognition. |
| 36 | + |
| 37 | +`presidio-haystack` provides three Haystack components: |
| 38 | + |
| 39 | +| Component | Input | Purpose | |
| 40 | +|-----------|-------|---------| |
| 41 | +| `PresidioDocumentCleaner` | `list[Document]` | Replace PII in document text with entity type placeholders | |
| 42 | +| `PresidioTextCleaner` | `list[str]` | Replace PII in plain strings — useful for sanitizing user queries | |
| 43 | +| `PresidioEntityExtractor` | `list[Document]` | Detect PII and store entities as structured document metadata | |
| 44 | + |
| 45 | +All components run locally — no external API required. Presidio uses spaCy NLP models under the hood. |
| 46 | + |
| 47 | +## Installation |
| 48 | + |
| 49 | +```bash |
| 50 | +pip install presidio-haystack |
| 51 | +``` |
| 52 | + |
| 53 | +`en_core_web_lg` is the recommended English model for best accuracy. For a lighter footprint, `en_core_web_sm` works too — see the [full list of spaCy models](https://spacy.io/models/en) for options. |
| 54 | + |
| 55 | +Each component accepts a `language` parameter (default `"en"`). To process text in another language, pass its language code and provide a mapping from that language to the spaCy model to load, unless the large model for that language is the one you want. |
| 56 | + |
| 57 | + |
| 58 | +## Usage |
| 59 | + |
| 60 | +### Document Cleaning |
| 61 | + |
| 62 | +Replace PII in document content before indexing: |
| 63 | + |
| 64 | +```python |
| 65 | +from haystack import Document |
| 66 | +from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner |
| 67 | + |
| 68 | +cleaner = PresidioDocumentCleaner() |
| 69 | +result = cleaner.run(documents=[ |
| 70 | + Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.") |
| 71 | +]) |
| 72 | +print(result["documents"][0].content) |
| 73 | +# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>. |
| 74 | +``` |
| 75 | + |
| 76 | +Original documents are not mutated. Documents with no text content pass through unchanged. |
| 77 | + |
| 78 | +### Text Cleaning |
| 79 | + |
| 80 | +Sanitize user queries before they reach your LLM: |
| 81 | + |
| 82 | +```python |
| 83 | +from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner |
| 84 | + |
| 85 | +cleaner = PresidioTextCleaner() |
| 86 | +result = cleaner.run(texts=["My name is John Doe, my SSN is 123-45-6789"]) |
| 87 | +print(result["texts"][0]) |
| 88 | +# My name is <PERSON>, my SSN is <US_SSN> |
| 89 | +``` |
| 90 | + |
| 91 | +### Entity Extraction |
| 92 | + |
| 93 | +Detect PII and attach it as structured metadata without modifying the document text: |
| 94 | + |
| 95 | +```python |
| 96 | +from haystack import Document |
| 97 | +from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor |
| 98 | + |
| 99 | +extractor = PresidioEntityExtractor() |
| 100 | +result = extractor.run(documents=[ |
| 101 | + Document(content="Contact Alice at alice@example.com") |
| 102 | +]) |
| 103 | +print(result["documents"][0].meta["entities"]) |
| 104 | +# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, |
| 105 | +# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] |
| 106 | +``` |
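The `start` and `end` offsets of each entity index into the document text, which is enough to reproduce placeholder-style redaction by hand. The following is a minimal sketch in plain Python, not the component's implementation: `redact_spans` is a hypothetical helper, it assumes spans do not overlap, and the entity dicts simply mirror the output format shown above.

```python
def redact_spans(text: str, entities: list[dict]) -> str:
    """Replace each detected span with an <ENTITY_TYPE> placeholder.

    Hypothetical helper for illustration; assumes spans do not overlap.
    """
    # Work from the last span to the first so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"<{ent['entity_type']}>"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


text = "Contact Alice at alice@example.com"
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]
redacted = redact_spans(text, entities)
print(redacted)
# Contact <PERSON> at <EMAIL_ADDRESS>
```

Replacing from right to left matters because each placeholder changes the length of the string; going left to right would shift every later offset.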
| 107 | + |
| 108 | +All three components accept `language`, `entities`, and `score_threshold` parameters at init time. See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. |
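Conceptually, `entities` restricts which PII types are reported and `score_threshold` drops low-confidence detections. The sketch below applies a similar pair of filters to already extracted entity metadata in plain Python. `filter_entities` is a hypothetical helper, not part of `presidio-haystack`, and treating a score equal to the threshold as passing is an assumption made here.

```python
def filter_entities(found, entity_types=None, score_threshold=0.0):
    """Keep entities of the requested types whose score meets the threshold.

    Hypothetical helper for illustration only.
    """
    kept = []
    for ent in found:
        # entity_types=None means "accept every type".
        if entity_types is not None and ent["entity_type"] not in entity_types:
            continue
        # Assumption: a score exactly at the threshold is kept.
        if ent["score"] < score_threshold:
            continue
        kept.append(ent)
    return kept


found = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]

emails_only = filter_entities(found, entity_types=["EMAIL_ADDRESS"])
confident = filter_entities(found, score_threshold=0.9)
print([e["entity_type"] for e in emails_only])
# ['EMAIL_ADDRESS']
print([e["entity_type"] for e in confident])
# ['EMAIL_ADDRESS']
```

Setting these parameters on the components themselves avoids a post-processing pass like this one; the sketch is only meant to show what each parameter narrows down.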
| 109 | + |
| 110 | +## License |
| 111 | + |
| 112 | +`presidio-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license. |