Summary and motivation
Presidio is an open-source framework for detecting, redacting, masking, and anonymizing sensitive data (PII) across text, images, and structured data.
Detailed design
The idea is to have a new Haystack component called PresidioDocumentCleaner that accepts a list of Documents as input and replaces each document's text with a redacted version. It should return a list of new documents and leave the inputs unchanged. Details on why we need to avoid in-place mutation are here: https://docs.haystack.deepset.ai/docs/custom-components#requirements
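To make the "new documents, inputs unchanged" requirement concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption for illustration: `Doc` is a stand-in for Haystack's Document, `redact_emails` is a regex placeholder for the Presidio call, and `DocumentCleanerSketch` is not the proposed implementation. A real component would use Haystack's `@component` decorator and delegate redaction to Presidio.

```python
import re
from dataclasses import dataclass, field, replace
from typing import Callable


@dataclass(frozen=True)
class Doc:
    """Stand-in for a Haystack Document (hypothetical, only the fields needed here)."""
    content: str
    meta: dict = field(default_factory=dict)


def redact_emails(text: str) -> str:
    """Placeholder for Presidio-based redaction; only handles email-like strings."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "<EMAIL_ADDRESS>", text)


class DocumentCleanerSketch:
    """Returns redacted copies of the input documents without touching the inputs."""

    def __init__(self, redact: Callable[[str], str]):
        self.redact = redact

    def run(self, documents: list[Doc]) -> dict[str, list[Doc]]:
        # Build a new Doc per input; copy meta so the output does not share it.
        cleaned = [
            replace(doc, content=self.redact(doc.content), meta=dict(doc.meta))
            for doc in documents
        ]
        return {"documents": cleaned}


original = [Doc(content="Contact jane@example.com for access.")]
result = DocumentCleanerSketch(redact_emails).run(original)
print(result["documents"][0].content)
print(original[0].content)  # the input still holds the unredacted text
```

Passing the redaction function in through the constructor is just a convenience for the sketch; it keeps the copy-on-write logic testable without Presidio installed.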
I believe we should also have a second component, PresidioEntityExtractor. With that split, PresidioEntityExtractor corresponds to the Presidio Analyzer and PresidioDocumentCleaner corresponds to the Presidio Anonymizer.
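The analyzer/anonymizer split could look roughly like this: one step finds entity spans, a separate step rewrites the text from those spans. This is a standard-library sketch under stated assumptions, not Presidio's API. `Span`, `extract_entities`, and `anonymize` are hypothetical names, and the regex only finds email-like strings, whereas the real analyzer covers many more entity types.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical detection result: entity type plus character offsets."""
    entity_type: str
    start: int
    end: int


def extract_entities(text: str) -> list[Span]:
    """Analyzer stand-in: detects email-like strings only."""
    pattern = r"[\w.+-]+@[\w-]+\.\w+"
    return [Span("EMAIL_ADDRESS", m.start(), m.end()) for m in re.finditer(pattern, text)]


def anonymize(text: str, spans: list[Span]) -> str:
    """Anonymizer stand-in: replaces each span with a <TYPE> placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


text = "Mail bob@example.org or amy@example.net today."
spans = extract_entities(text)
redacted = anonymize(text, spans)
print(spans)
print(redacted)
```

Keeping the two steps separate means the extractor's output can also be stored on its own (for example in document metadata) when redaction is not wanted.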
For example, an input document with the text
should be transformed into an output document with the text
Haystack recently added a similar integration with Tonic Textual: https://haystack.deepset.ai/integrations/tonic-textual and the components TonicTextualDocumentCleaner and TonicTextualEntityExtractor can serve as a reference for PresidioDocumentCleaner and PresidioEntityExtractor.
We should add a third component, PresidioTextCleaner, analogous to PresidioDocumentCleaner. It should take strings as input instead of documents and return strings instead of documents.
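A string-in, string-out variant might be shaped like the sketch below. The names `TextCleanerSketch` and `redact_phone_numbers`, the `texts` output key, and the regex are all assumptions for illustration; the regex is a placeholder for the Presidio call and only matches the NNN-NNN-NNNN pattern.

```python
import re
from typing import Callable


def redact_phone_numbers(text: str) -> str:
    """Placeholder for Presidio-based redaction; matches NNN-NNN-NNNN only."""
    return re.sub(r"\b\d{3}-\d{3}-\d{4}\b", "<PHONE_NUMBER>", text)


class TextCleanerSketch:
    """Takes plain strings and returns redacted strings."""

    def __init__(self, redact: Callable[[str], str]):
        self.redact = redact

    def run(self, texts: list[str]) -> dict[str, list[str]]:
        # Strings are immutable, so the inputs cannot be modified by accident.
        return {"texts": [self.redact(t) for t in texts]}


queries = ["Call me at 212-555-0123 tomorrow."]
cleaned = TextCleanerSketch(redact_phone_numbers).run(queries)["texts"]
print(cleaned[0])
```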
Checklist
Follow https://github.com/deepset-ai/haystack-core-integrations/blob/main/CONTRIBUTING.md#create-a-new-integration if you would like to contribute. In particular, use the scaffolding script:
python scripts/create_new_integration.py --name presidio --type preprocessors
Ensure the following checklist is complete before closing this issue.
Tasks
main branch
integration:<your integration name> has been added to the list of labels for this repository