feat: add Presidio integration for PII detection and anonymization #3075
|
@SyedShahmeerAli12 thanks for contributing! Please read the contribution guidelines here https://github.com/deepset-ai/haystack-core-integrations/blob/main/CONTRIBUTING.md#create-a-new-integration and run the scaffolding script to help fill in some missing aspects of your contribution. |
|
Hello @sjrl, I've implemented all three components proposed in this issue: PresidioDocumentCleaner: anonymizes PII in list[Document]. All CI checks are passing. Happy to make any changes based on your feedback. |
|
@SyedShahmeerAli12 a few more high-level comments before I do an in-depth review:
|
081b9d3 to 82de46b
Implements three Haystack components using Microsoft Presidio:
- PresidioDocumentCleaner: anonymizes PII in list[Document]
- PresidioTextCleaner: anonymizes PII in list[str] (for query sanitization)
- PresidioEntityExtractor: detects PII entities and stores them in Document metadata
…d coverage entries
82de46b to 80c8c1d
|
Addressed all comments: removed the ## Contributing header from the README to match the pgvector format, and fixed the alphabetical ordering in both the root README table and CI_coverage_comment.yml. Python 3.14 was already in place. |
…, type hints, dataclasses.replace, doc links
- Add keyword-only arguments (*, ) to all three component __init__ methods
- Move AnalyzerEngine/AnonymizerEngine initialization to warm_up() since they load spaCy ML models
- Fix run() return types from dict[str, Any] to proper typed dicts
- Use dataclasses.replace() in PresidioEntityExtractor instead of Document()
- Add Presidio documentation links for language, entities, and score_threshold params
- Update integration tests to call warm_up() before run()
- Add missing _anonymizer mock in test_run_skips_on_error tests
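Two of the changes in this commit, keyword-only `__init__` arguments and copying with `dataclasses.replace()`, can be sketched roughly as follows. This is an illustration only, not the PR's code: `Doc` and `EntityExtractorSketch` are hypothetical stand-ins, and the default values are placeholders.

```python
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document (not the real class)."""

    content: str
    meta: dict[str, Any] = field(default_factory=dict)


class EntityExtractorSketch:
    """Illustrative only; class name and default values are assumptions."""

    def __init__(self, *, language: str = "en", score_threshold: float = 0.5) -> None:
        # The bare `*` makes every parameter after it keyword-only.
        self.language = language
        self.score_threshold = score_threshold

    def annotate(self, doc: Doc, entities: list[dict[str, Any]]) -> Doc:
        # replace() builds a copy and overrides only `meta`,
        # so the input document is left untouched.
        return replace(doc, meta={**doc.meta, "entities": entities})


original = Doc(content="Call Jane on 555-0100", meta={"source": "crm"})
extractor = EntityExtractorSketch(language="en")
annotated = extractor.annotate(
    original, [{"entity_type": "PERSON", "start": 5, "end": 9}]
)
```

With keyword-only arguments, a positional call such as `EntityExtractorSketch("en")` raises `TypeError`, which keeps call sites explicit if parameters are later reordered.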
|
- Add `_is_warmed_up` guard to `warm_up()` so repeated calls are idempotent
- Auto-warm on first `run()` call instead of raising RuntimeError
- Update component docstrings to reflect lazy loading behavior
- Fix broken Presidio doc link (supported_languages → analyzer/languages)
- Add `_make_*_with_mocks()` helper in each test class to centralize mock setup and prevent auto-warm from overwriting injected mocks
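The lazy-loading behavior described in this commit could look something like the sketch below. It is a minimal illustration under assumptions: `LazyComponentSketch` and `engine_factory` are hypothetical names, and a plain function stands in for the Presidio engines.

```python
from typing import Callable, Optional


class LazyComponentSketch:
    """Illustrative only; not the PR's component."""

    def __init__(self, *, engine_factory: Callable[[], Callable[[str], str]]) -> None:
        self._engine_factory = engine_factory
        self._engine: Optional[Callable[[str], str]] = None
        self._is_warmed_up = False

    def warm_up(self) -> None:
        # Guard: repeated calls return early, so the expensive load runs once.
        if self._is_warmed_up:
            return
        self._engine = self._engine_factory()
        self._is_warmed_up = True

    def run(self, texts: list[str]) -> dict[str, list[str]]:
        # Auto-warm on first use instead of raising RuntimeError.
        self.warm_up()
        assert self._engine is not None
        return {"texts": [self._engine(text) for text in texts]}


load_calls: list[str] = []


def fake_engine_factory() -> Callable[[str], str]:
    # Stand-in for loading a heavy model; records each load.
    load_calls.append("loaded")
    return lambda text: text.replace("Jane", "<PERSON>")


component = LazyComponentSketch(engine_factory=fake_engine_factory)
result = component.run(["Hello Jane"])  # triggers the single load
component.warm_up()  # no-op: already warmed up
component.run(["Bye Jane"])  # reuses the loaded engine
```

In a test, a mock engine would have to be injected together with `_is_warmed_up = True`; otherwise the first `run()` call would replace it, which is presumably why the commit centralizes mock setup in a helper.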
|
Thanks @sjrl! Addressed all four points:
|
02ea61c to 10d90f5
- Regenerate presidio.yml workflow from template (compute-test-matrix job, pinned action versions, push trigger, coverage steps)
- Add integration-cov-append-retry script to pyproject.toml
- Drop un-anonymized documents on error in PresidioDocumentCleaner instead of passing them through unanonymized
- Clarify warning message in PresidioEntityExtractor to say extraction is skipped but document is kept
- Update test to assert failed docs are dropped in PresidioDocumentCleaner
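The two error policies in this commit differ on purpose: a cleaner that fails must not leak raw text, while an extractor that fails loses only metadata. A rough sketch of that distinction, using hypothetical helper functions and a stand-in `Doc` class rather than the PR's code:

```python
import logging
from dataclasses import dataclass, field, replace
from typing import Any, Callable

logger = logging.getLogger(__name__)


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document (not the real class)."""

    content: str
    meta: dict[str, Any] = field(default_factory=dict)


def clean_documents(docs: list[Doc], anonymize: Callable[[str], str]) -> list[Doc]:
    cleaned = []
    for doc in docs:
        try:
            cleaned.append(replace(doc, content=anonymize(doc.content)))
        except Exception as exc:
            # Fail closed: a document that could not be anonymized is dropped.
            logger.warning("Could not anonymize document; dropping it: %s", exc)
    return cleaned


def extract_entities(docs: list[Doc], detect: Callable[[str], list[str]]) -> list[Doc]:
    result = []
    for doc in docs:
        try:
            result.append(replace(doc, meta={**doc.meta, "entities": detect(doc.content)}))
        except Exception as exc:
            # Extraction is skipped, but the document itself is kept.
            logger.warning("Entity extraction skipped; document kept: %s", exc)
            result.append(doc)
    return result


def flaky_anonymize(text: str) -> str:
    # Simulated engine that fails on some inputs.
    if "boom" in text:
        raise ValueError("engine failure")
    return text.replace("Jane", "<PERSON>")


def flaky_detect(text: str) -> list[str]:
    if "boom" in text:
        raise ValueError("engine failure")
    return ["PERSON"]


docs = [Doc("Hi Jane"), Doc("boom Jane")]
cleaned = clean_documents(docs, flaky_anonymize)
extracted = extract_entities(docs, flaky_detect)
```

Here `cleaned` holds only the first document, while `extracted` holds both, with the second one unchanged.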
|
Hi @sjrl, addressed all three comments:
|
|
Thanks @SyedShahmeerAli12 ! Almost there, could you look at the failing CI check here |
|
@sjrl all checks are passing now. Ready for another review when you get a chance! |
sjrl
left a comment
Thanks for the contribution!
Closes #3063
What this adds
Three new Haystack components using Microsoft Presidio:
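- PresidioDocumentCleaner: anonymizes PII in list[Document], returns new Documents without mutating inputs
- PresidioTextCleaner: anonymizes PII in list[str], useful for sanitizing user queries before LLM calls
- PresidioEntityExtractor: detects PII entities and stores them in Document metadata under the "entities" key
Usage example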