feat: add Presidio integration page #455

Merged
kacperlukawski merged 5 commits into deepset-ai:main from SyedShahmeerAli12:feat/add-presidio-integration
Apr 24, 2026

Conversation

@SyedShahmeerAli12
Contributor

Summary

  • Adds integration tile for presidio-haystack with usage examples for all three components
  • Covers PresidioDocumentCleaner, PresidioTextCleaner, and PresidioEntityExtractor
  • Uses the existing microsoft.png logo (Presidio is a Microsoft open-source project)

Related issue: deepset-ai/haystack-core-integrations#3063

Test plan

  • Integration page renders correctly
  • Code examples match the actual component APIs

Adds integration tile for presidio-haystack with usage examples for
PresidioDocumentCleaner, PresidioTextCleaner, and PresidioEntityExtractor.

Related: deepset-ai/haystack-core-integrations#3063
@sjrl
Contributor

sjrl commented Apr 23, 2026

Hey @SyedShahmeerAli12, once deepset-ai/haystack#11165 is merged, let's make sure to update this PR to reflect the same changes. For example, the import path of the entity extractor changed.

@SyedShahmeerAli12
Contributor Author

@sjrl Noted. I will update this PR to reflect all the changes once #11165 is merged.

@sjrl
Contributor

sjrl commented Apr 24, 2026

> @sjrl Noted. I will update this PR to reflect all the changes once #11165 is merged.

@SyedShahmeerAli12 please update this PR when you get the chance

@SyedShahmeerAli12
Contributor Author

SyedShahmeerAli12 commented Apr 24, 2026

@sjrl Updated: the PresidioEntityExtractor import path changed from preprocessors to extractors, reflecting the changes from #11165.

Comment thread integrations/thunderbolt.md Outdated
Contributor

This file should be removed; it is unrelated to Presidio.

Comment thread logos/thunderbolt.png Outdated
Contributor

This file should be removed as well.

@sjrl
Contributor

sjrl commented Apr 24, 2026

@SyedShahmeerAli12 thanks! The final review needs to come from @deepset-ai/devrel

kacperlukawski self-requested a review April 24, 2026 13:10
Comment thread integrations/presidio.md Outdated

```bash
pip install presidio-haystack
python -m spacy download en_core_web_lg
```
Member

Could we also document how to download the model from within the application itself, and link to the list of available spaCy models? For those who have never used spaCy, it might also be worth describing how to set it up for other languages.

Out of curiosity: are we able to use the small and large models in the same application? It seems each component accepts a language, but not a model name.

Contributor

We also changed this slightly in deepset-ai/haystack-core-integrations#3209, where we added a default mapping of languages to models as a ClassVar.

Contributor Author

SyedShahmeerAli12 commented Apr 24, 2026

Hey, good questions! I've updated the Installation section to cover all of this. Added a note
about en_core_web_sm as a lighter option (with a link to the full spaCy models list), plus a
quick example showing how to set things up for a non-English language.
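
For the in-application download asked about above, a minimal sketch could look like the following. It assumes spaCy is installed and relies on `spacy.cli.download`; the helper name `ensure_spacy_model` and its `download` parameter are hypothetical and not part of presidio-haystack.

```python
import importlib.util
from typing import Callable, Optional


def ensure_spacy_model(
    model_name: str,
    download: Optional[Callable[[str], None]] = None,
) -> bool:
    """Download `model_name` if it is missing; return True if a download ran."""
    # spaCy models are installed as regular Python packages,
    # so an import lookup is enough to detect an existing install.
    if importlib.util.find_spec(model_name) is not None:
        return False
    if download is None:
        # Imported lazily so this helper can be defined without spaCy present.
        from spacy.cli import download as spacy_download

        download = spacy_download
    download(model_name)
    return True
```

Calling `ensure_spacy_model("en_core_web_sm")` once at startup, before any Presidio component is created, would be one way to use it. A freshly downloaded model may not be importable until the process restarts, so installing models at build time is usually the safer option.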

On your question about sm vs lg: each language value maps to one spaCy model at a time,
whichever is registered in your environment, so there's no way to pick between them per
component.
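
To illustrate the one-model-per-language point, here is a rough sketch of a language-to-model mapping written in the dictionary shape that Presidio's `NlpEngineProvider` takes as `nlp_configuration`. The helper name and the example mapping are hypothetical, and this thread does not confirm whether presidio-haystack exposes such a configuration directly.

```python
from typing import Dict, List, Union

NlpConfiguration = Dict[str, Union[str, List[Dict[str, str]]]]


def build_nlp_configuration(language_models: Dict[str, str]) -> NlpConfiguration:
    """Turn {language code: spaCy model name} into a Presidio-style NLP config."""
    return {
        "nlp_engine_name": "spacy",
        # Dictionary keys are unique, so each language code gets exactly one model.
        "models": [
            {"lang_code": lang, "model_name": model}
            for lang, model in language_models.items()
        ],
    }


# Hypothetical mapping: choosing "en_core_web_sm" instead would replace the
# large English model for every component, not sit alongside it.
LANGUAGE_MODELS = {"en": "en_core_web_lg", "es": "es_core_news_md"}
config = build_nlp_configuration(LANGUAGE_MODELS)
```

Because the mapping is keyed by language code, switching between sm and lg means changing the entry for that language, which then applies to every component using that language.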

Comment thread integrations/presidio.md
Comment on lines +5 to +10
authors:
  - name: deepset
    socials:
      github: deepset-ai
      twitter: deepset_ai
      linkedin: https://www.linkedin.com/company/deepset-ai/
Member

@SyedShahmeerAli12 Don't you want to add yourself as an author?

Contributor Author

Thanks for the review. I've addressed all the feedback:

  • Removed the unrelated thunderbolt.md and thunderbolt.png files
  • Added myself as co-author in the front-matter
  • Expanded the Installation section with:
    • en_core_web_sm as a lighter alternative, with a link to the full spaCy models list
    • An example showing how to download and use a non-English model (e.g. Spanish)
    • A note answering your question about sm vs lg: each language maps to one spaCy model at a
      time; whichever is registered in the environment is used, so you can't pick between sm and lg per component

- Add Shahmeer Ali as co-author
- Remove unrelated thunderbolt files
- Expand installation section with spaCy model guidance, language support note, and sm vs lg clarification
Comment thread integrations/presidio.md Outdated
Comment thread integrations/presidio.md Outdated
Co-authored-by: Kacper Łukawski <kacperlukawski@users.noreply.github.com>
kacperlukawski merged commit 227d03f into deepset-ai:main Apr 24, 2026

3 participants