feat: add Presidio integration for PII detection and anonymization #3075

Open
SyedShahmeerAli12 wants to merge 9 commits into deepset-ai:main from SyedShahmeerAli12:feat/presidio-integration

@SyedShahmeerAli12 SyedShahmeerAli12 commented Apr 1, 2026

Closes #3063

What this adds

Three new Haystack components using Microsoft Presidio:

  • PresidioDocumentCleaner — anonymizes PII in list[Document], returns new Documents without mutating inputs
  • PresidioTextCleaner — anonymizes PII in list[str], useful for sanitizing user queries before LLM calls
  • PresidioEntityExtractor — detects PII entities and stores them in Document metadata under the "entities" key

Usage example

```python
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner
from haystack import Document

cleaner = PresidioDocumentCleaner()

result = cleaner.run(
    documents=[Document(content="My name is shahhmeer, email: ashahmeer73@gmail.com")]
)

# → "My name is <PERSON>, email: <EMAIL_ADDRESS>"
```

@SyedShahmeerAli12 SyedShahmeerAli12 requested a review from a team as a code owner April 1, 2026 07:53
@SyedShahmeerAli12 SyedShahmeerAli12 requested review from sjrl and removed request for a team April 1, 2026 07:53
@github-actions github-actions bot added the topic:CI and type:documentation labels Apr 1, 2026

sjrl commented Apr 1, 2026

@SyedShahmeerAli12 thanks for contributing! Please read the contribution guidelines here https://github.com/deepset-ai/haystack-core-integrations/blob/main/CONTRIBUTING.md#create-a-new-integration and run the scaffolding script to help fill in some missing aspects of your contribution.

Also, make sure to run the linter and the tests.

@sjrl sjrl self-assigned this Apr 1, 2026
SyedShahmeerAli12 commented

Hello @sjrl, I've implemented all three components proposed in this issue:

  • PresidioDocumentCleaner: anonymizes PII in list[Document]
  • PresidioTextCleaner: anonymizes PII in list[str] for query sanitization
  • PresidioEntityExtractor: detects PII entities and stores them in Document metadata

All CI checks are passing. Happy to make any changes based on your feedback.

@sjrl sjrl removed their assignment Apr 2, 2026

sjrl commented Apr 2, 2026

@SyedShahmeerAli12 a few more high-level comments before I do an in-depth review:

@SyedShahmeerAli12 SyedShahmeerAli12 force-pushed the feat/presidio-integration branch from 081b9d3 to 82de46b Compare April 2, 2026 14:40
Implements three Haystack components using Microsoft Presidio:
- PresidioDocumentCleaner: anonymizes PII in list[Document]
- PresidioTextCleaner: anonymizes PII in list[str] (for query sanitization)
- PresidioEntityExtractor: detects PII entities and stores them in Document metadata
@SyedShahmeerAli12 SyedShahmeerAli12 force-pushed the feat/presidio-integration branch from 82de46b to 80c8c1d Compare April 2, 2026 14:45
SyedShahmeerAli12 commented

Addressed all comments: removed the ## Contributing header from the README to match the pgvector format, and fixed the alphabetical ordering in both the root README table and CI_coverage_comment.yml. Python 3.14 was already in place.

…, type hints, dataclasses.replace, doc links

- Add keyword-only arguments (*, ) to all three component __init__ methods
- Move AnalyzerEngine/AnonymizerEngine initialization to warm_up() since they load spaCy ML models
- Fix run() return types from dict[str, Any] to proper typed dicts
- Use dataclasses.replace() in PresidioEntityExtractor instead of Document()
- Add Presidio documentation links for language, entities, and score_threshold params
- Update integration tests to call warm_up() before run()
- Add missing _anonymizer mock in test_run_skips_on_error tests
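
The `dataclasses.replace()` change in the list above can be illustrated with a minimal sketch. It is not the PR's code: `Doc` is a hypothetical stand-in for Haystack's `Document`, `with_entities` is an illustrative helper name, and the shape of each entity dict is assumed. The point is only that `replace()` returns a new object, so the input document and its metadata stay untouched.

```python
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass
class Doc:
    """Hypothetical stand-in for haystack.Document (not the real class)."""

    content: str
    meta: dict[str, Any] = field(default_factory=dict)


def with_entities(doc: Doc, entities: list[dict[str, Any]]) -> Doc:
    """Return a copy of `doc` with `entities` stored under meta["entities"]."""
    # Build a fresh meta dict so the original document's meta is not mutated.
    return replace(doc, meta={**doc.meta, "entities": entities})


original = Doc(content="Contact: jane@example.com", meta={"source": "crm"})
# Assumed entity shape, for illustration only.
found = [{"entity_type": "EMAIL_ADDRESS", "start": 9, "end": 25, "score": 1.0}]
updated = with_entities(original, found)
```

Compared with constructing a new document field by field, `replace()` carries over every field that is not explicitly overridden, so fields added to the dataclass later are not silently dropped.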

:param language:
Language code for PII detection. Defaults to `"en"`.
See [Presidio supported languages](https://microsoft.github.io/presidio/supported_languages/).
sjrl commented

This link leads to a 404

SyedShahmeerAli12 commented

Let me check.

Comment on lines +73 to +76
        if self._analyzer is None:
            self._analyzer = AnalyzerEngine()
        if self._anonymizer is None:
            self._anonymizer = AnonymizerEngine()
sjrl commented

We prefer to use the following pattern:

Suggested change
-        if self._analyzer is None:
-            self._analyzer = AnalyzerEngine()
-        if self._anonymizer is None:
-            self._anonymizer = AnonymizerEngine()
+        if self._is_warmed_up:
+            return
+        self._analyzer = AnalyzerEngine()
+        self._anonymizer = AnonymizerEngine()
+        self._is_warmed_up = True

and then add `self._is_warmed_up = False` to the `__init__` method.

Comment on lines +87 to +88
        """
        cleaned: list[Document] = []
sjrl commented

We auto-warm up our components at first run time.

Suggested change
-        """
-        cleaned: list[Document] = []
+        """
+        if not self._is_warmed_up:
+            self.warm_up()
+        cleaned: list[Document] = []
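
Taken together, the two review suggestions above (an `_is_warmed_up` guard in `warm_up()` and an automatic warm-up at the start of `run()`) might look roughly like the following sketch. It uses only the standard library: `StubEngine` and `LazyCleaner` are hypothetical names standing in for the Presidio engines and the component, and the `"texts"` output key is an assumption rather than the PR's actual interface.

```python
from typing import Optional


class StubEngine:
    """Hypothetical stand-in for an engine that is expensive to construct."""

    created = 0  # counts constructions so the effect of the guard is observable

    def __init__(self) -> None:
        StubEngine.created += 1

    def redact(self, text: str) -> str:
        # Placeholder behaviour only; a real engine would detect and replace PII.
        return text.replace("secret", "<REDACTED>")


class LazyCleaner:
    """Sketch of a component that defers engine loading until first use."""

    def __init__(self) -> None:
        self._engine: Optional[StubEngine] = None
        self._is_warmed_up = False

    def warm_up(self) -> None:
        # Idempotent: repeated calls do not rebuild the engine.
        if self._is_warmed_up:
            return
        self._engine = StubEngine()
        self._is_warmed_up = True

    def run(self, texts: list[str]) -> dict[str, list[str]]:
        # Auto-warm up on first run so callers need not call warm_up() themselves.
        if not self._is_warmed_up:
            self.warm_up()
        assert self._engine is not None  # set by warm_up()
        return {"texts": [self._engine.redact(t) for t in texts]}


cleaner = LazyCleaner()
first = cleaner.run(["my secret"])     # triggers warm_up() implicitly
cleaner.warm_up()                      # no-op: already warmed up
second = cleaner.run(["no pii here"])  # reuses the same engine
```

A single boolean flag keeps the guard independent of how many engines the component holds, which is presumably why it is preferred over checking each attribute for `None`.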


Documents without text content are passed through unchanged.

Call `warm_up()` before running this component to load the Presidio analyzer and anonymizer engines.
sjrl commented

Once this comment is addressed (https://github.com/deepset-ai/haystack-core-integrations/pull/3075/changes#r3062974319), this sentence should be dropped, or updated to explain that the engines are loaded on first run.

Labels

integration:presidio, topic:CI, type:documentation

Development

Successfully merging this pull request may close these issues.

Add Presidio integration

2 participants