feat: add Presidio integration for PII detection and anonymization#3075
SyedShahmeerAli12 wants to merge 9 commits into deepset-ai:main
@SyedShahmeerAli12 thanks for contributing! Please read the contribution guidelines here https://github.com/deepset-ai/haystack-core-integrations/blob/main/CONTRIBUTING.md#create-a-new-integration and run the scaffolding script to help fill in some missing aspects of your contribution.

Hello @sjrl, I've implemented all three components proposed in this issue: PresidioDocumentCleaner: anonymizes PII in list[Document]. All CI checks are passing. Happy to make any changes based on your feedback.

@SyedShahmeerAli12 a few more high-level comments before I do an in-depth review:
081b9d3 to 82de46b
Implements three Haystack components using Microsoft Presidio:
- PresidioDocumentCleaner: anonymizes PII in list[Document]
- PresidioTextCleaner: anonymizes PII in list[str] (for query sanitization)
- PresidioEntityExtractor: detects PII entities and stores them in Document metadata
…d coverage entries
82de46b to 80c8c1d
Addressed all comments: removed the ## Contributing header from the README to match the pgvector format, and fixed alphabetical ordering in both the root README table and CI_coverage_comment.yml. Python 3.14 was already in place.
…, type hints, dataclasses.replace, doc links
- Add keyword-only arguments (*, ) to all three component __init__ methods
- Move AnalyzerEngine/AnonymizerEngine initialization to warm_up() since they load spaCy ML models
- Fix run() return types from dict[str, Any] to proper typed dicts
- Use dataclasses.replace() in PresidioEntityExtractor instead of Document()
- Add Presidio documentation links for language, entities, and score_threshold params
- Update integration tests to call warm_up() before run()
- Add missing _anonymizer mock in test_run_skips_on_error tests
    :param language:
        Language code for PII detection. Defaults to `"en"`.
        See [Presidio supported languages](https://microsoft.github.io/presidio/supported_languages/).
let me check
    if self._analyzer is None:
        self._analyzer = AnalyzerEngine()
    if self._anonymizer is None:
        self._anonymizer = AnonymizerEngine()
We prefer to use the pattern
    if self._is_warmed_up:
        return
    self._analyzer = AnalyzerEngine()
    self._anonymizer = AnonymizerEngine()
    self._is_warmed_up = True
and then add self._is_warmed_up = False to the init method
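For reference, a minimal sketch of the suggested pattern as a whole class. `FakeEngine` and `Cleaner` are hypothetical stand-ins (not part of this PR or of Presidio) so the snippet runs without Presidio installed; in the real component the two attributes would hold `AnalyzerEngine()` and `AnonymizerEngine()`.

```python
class FakeEngine:
    """Hypothetical stand-in for an expensive engine such as AnalyzerEngine."""

    instances = 0  # counts how many engines were constructed

    def __init__(self) -> None:
        FakeEngine.instances += 1


class Cleaner:
    """Hypothetical component showing the guarded warm_up() pattern."""

    def __init__(self) -> None:
        self._analyzer = None
        self._anonymizer = None
        self._is_warmed_up = False  # set in __init__, as requested in the review

    def warm_up(self) -> None:
        # Return early so repeated calls never rebuild the engines.
        if self._is_warmed_up:
            return
        self._analyzer = FakeEngine()
        self._anonymizer = FakeEngine()
        self._is_warmed_up = True


cleaner = Cleaner()
cleaner.warm_up()
cleaner.warm_up()  # no-op: the flag is already set
print(FakeEngine.instances)  # 2 (one analyzer, one anonymizer)
```

The single boolean flag replaces the two per-attribute `is None` checks, so the warmed-up state is tracked in one place.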
    """
    cleaned: list[Document] = []
We auto-warm up our components at first run time.
    """
    if not self._is_warmed_up:
        self.warm_up()
    cleaned: list[Document] = []
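A rough sketch of how auto-warm-up at first run could look end to end. Everything here is illustrative: `Doc` is a hypothetical stand-in for Haystack's `Document`, `LazyCleaner` is not the PR's class, and the string replacement is only a placeholder for the real Presidio anonymization step.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for haystack's Document."""

    content: Optional[str] = None


class LazyCleaner:
    """Hypothetical component that warms itself up on the first run() call."""

    def __init__(self) -> None:
        self._is_warmed_up = False
        self.warm_up_calls = 0  # only here to make the behaviour observable

    def warm_up(self) -> None:
        if self._is_warmed_up:
            return
        self.warm_up_calls += 1  # the real component would build its engines here
        self._is_warmed_up = True

    def run(self, documents: list[Doc]) -> dict[str, list[Doc]]:
        if not self._is_warmed_up:
            self.warm_up()
        cleaned: list[Doc] = []
        for doc in documents:
            if doc.content is None:
                cleaned.append(doc)  # no text content: pass through unchanged
                continue
            # Placeholder for anonymization; replace() returns a new object,
            # so the input document is not mutated.
            cleaned.append(replace(doc, content=doc.content.replace("Alice", "<PERSON>")))
        return {"documents": cleaned}


cleaner = LazyCleaner()
docs = [Doc("Alice called"), Doc(None)]
result = cleaner.run(docs)  # warms up implicitly
cleaner.run(docs)           # already warm, no second warm-up
```

With this shape, callers no longer need to remember `warm_up()`, while pipelines that call it explicitly still work.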
    Documents without text content are passed through unchanged.

    Call `warm_up()` before running this component to load the Presidio analyzer and anonymizer engines.
Once this comment is addressed (https://github.com/deepset-ai/haystack-core-integrations/pull/3075/changes#r3062974319), this should be dropped or updated to explain that the engines are loaded on first run.
Closes #3063
What this adds
Three new Haystack components using Microsoft Presidio:
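- PresidioDocumentCleaner: anonymizes PII in list[Document], returns new Documents without mutating inputs
- PresidioTextCleaner: anonymizes PII in list[str], useful for sanitizing user queries before LLM calls
- PresidioEntityExtractor: detects PII entities and stores them in Document metadata under the "entities" key
Usage example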