Add Presidio integration

## Summary and motivation

Presidio is an open-source framework for detecting, redacting, masking, and anonymizing sensitive data (PII) across text, images, and structured data. 

## Detailed design

The idea is to have a new Haystack component called PresidioDocumentCleaner that accepts a list of Documents as input and transforms the text of each document's text into redacted output text. It should return a list of new documents and leave the inputs unchanged. Details on why we need to avoid inplace mutation are here: https://docs.haystack.deepset.ai/docs/custom-components#requirements
I believe we should have another component, PresidioEntityExtractor. With that approach PresidioEntityExtractor corresponds to a [Presidio Analyzer](https://microsoft.github.io/presidio/analyzer/) and PresidioDocumentCleaner corresponds to a [Presidio Anonymizer](https://microsoft.github.io/presidio/anonymizer/).

For example an input document with the the text

<img width="875" height="123" alt="Image" src="https://github.com/user-attachments/assets/82ebc0a3-0b38-4086-9605-c48a8863fdf3" />

should be transformed into an output document with the text

<img width="879" height="128" alt="Image" src="https://github.com/user-attachments/assets/5d47de3e-6f08-4180-b027-c7be8d28f6ee" />

Haystack recently added a similar integration with Tonic Textual: https://haystack.deepset.ai/integrations/tonic-textual and the components TonicTextualDocumentCleaner and TonicTextualEntityExtractor can serve as a reference for PresidioDocumentCleaner and PresidioEntityExtractor.
We should add a thrid component PresidioTextCleaner analogue of the PresidioDocumentCleaner. It should take strings as input instead of documents and also return strings instead of documents.

## Checklist

Follow https://github.com/deepset-ai/haystack-core-integrations/blob/main/CONTRIBUTING.md#create-a-new-integration if you would like to contribute. In particular, use the scaffolding script:

`python scripts/create_new_integration.py --name presidio --type preprocessors`

Ensure the following checklist is complete before closing this issue.

### Tasks
- [ ] The code is documented with docstrings and was merged in the `main` branch
- [ ] Docs are published at https://docs.haystack.deepset.ai/
- [ ] There is a Github workflow running the tests for the integration nightly and at every PR
- [ ] A new label named like `integration:<your integration name>` has been added to the list of labels for this [repository](https://github.com/deepset-ai/haystack-core-integrations/labels)
- [ ] The [labeler.yml](https://github.com/deepset-ai/haystack-core-integrations/blob/main/.github/labeler.yml) file has been updated
- [ ] The package has been released on PyPI
- [ ] An integration tile with a usage example has been added to https://github.com/deepset-ai/haystack-integrations
- [ ] The integration has been listed in the [Inventory section](https://github.com/deepset-ai/haystack-core-integrations#inventory) of this repo README
- [ ] The feature was announced through social media


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Presidio integration #3063

Summary and motivation

Detailed design

Checklist

Tasks

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Add Presidio integration #3063

Description

Summary and motivation

Detailed design

Checklist

Tasks

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions