Skip to content

Add Presidio integration #3063

@julian-risch

Description

@julian-risch

Summary and motivation

Presidio is an open-source framework for detecting, redacting, masking, and anonymizing sensitive data (PII) across text, images, and structured data.

Detailed design

The idea is to have a new Haystack component called PresidioDocumentCleaner that accepts a list of Documents as input and transforms the text of each document's text into redacted output text. It should return a list of new documents and leave the inputs unchanged. Details on why we need to avoid inplace mutation are here: https://docs.haystack.deepset.ai/docs/custom-components#requirements
I believe we should have another component, PresidioEntityExtractor. With that approach PresidioEntityExtractor corresponds to a Presidio Analyzer and PresidioDocumentCleaner corresponds to a Presidio Anonymizer.

For example an input document with the the text

Image

should be transformed into an output document with the text

Image

Haystack recently added a similar integration with Tonic Textual: https://haystack.deepset.ai/integrations/tonic-textual and the components TonicTextualDocumentCleaner and TonicTextualEntityExtractor can serve as a reference for PresidioDocumentCleaner and PresidioEntityExtractor.
We should add a thrid component PresidioTextCleaner analogue of the PresidioDocumentCleaner. It should take strings as input instead of documents and also return strings instead of documents.

Checklist

Follow https://github.com/deepset-ai/haystack-core-integrations/blob/main/CONTRIBUTING.md#create-a-new-integration if you would like to contribute. In particular, use the scaffolding script:

python scripts/create_new_integration.py --name presidio --type preprocessors

Ensure the following checklist is complete before closing this issue.

Tasks

  • The code is documented with docstrings and was merged in the main branch
  • Docs are published at https://docs.haystack.deepset.ai/
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A new label named like integration:<your integration name> has been added to the list of labels for this repository
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • An integration tile with a usage example has been added to https://github.com/deepset-ai/haystack-integrations
  • The integration has been listed in the Inventory section of this repo README
  • The feature was announced through social media

Metadata

Metadata

Assignees

Labels

P2contributions wanted!Looking for external contributionsnew integrationDiscuss the creation of a new integration in Core

Projects

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions