feat: add Presidio integration page #455
Adds integration tile for presidio-haystack with usage examples for PresidioDocumentCleaner, PresidioTextCleaner, and PresidioEntityExtractor. Related: deepset-ai/haystack-core-integrations#3063
Hey @SyedShahmeerAli12, once deepset-ai/haystack#11165 is merged, let's make sure to update this PR to reflect all of the same changes. For example, the import path of the entity extractor changed.
@sjrl Noted. I will update this PR to reflect all the changes once #11165 is merged.
@SyedShahmeerAli12 please update this PR when you get the chance |
@sjrl updated |
This file should be removed; it is unrelated to Presidio.
@SyedShahmeerAli12 thanks! The final review needs to come from @deepset-ai/devrel |
```bash
pip install presidio-haystack
python -m spacy download en_core_web_lg
```
Could we also document how to download the model from within the application itself and link to the list of available spaCy models? For those who have never used spaCy, it might also be worth describing how to set it up for other languages.
Out of curiosity: are we able to use the small and large models in the same application? It seems each component accepts a language, but not a model name.
We also changed this slightly in deepset-ai/haystack-core-integrations#3209, where we added a default mapping of languages to models as a ClassVar.
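For readers unfamiliar with the pattern, a default language-to-model mapping stored as a ClassVar could look roughly like the sketch below. The class name `LanguageModelDefaults`, the attribute `DEFAULT_MODELS`, and the method `model_for` are hypothetical and are not taken from presidio-haystack; only the spaCy package names are real, and the choice of Spanish model is an assumption for illustration.

```python
from typing import ClassVar, Dict


class LanguageModelDefaults:
    """Hypothetical holder for language -> spaCy model defaults (not the integration's real class)."""

    # Assumed defaults for illustration only; the actual mapping lives in the integration.
    DEFAULT_MODELS: ClassVar[Dict[str, str]] = {
        "en": "en_core_web_lg",
        "es": "es_core_news_md",
    }

    @classmethod
    def model_for(cls, language: str) -> str:
        """Return the single model registered for a language, or raise if none is known."""
        try:
            return cls.DEFAULT_MODELS[language]
        except KeyError:
            raise ValueError(f"No spaCy model registered for language {language!r}") from None


print(LanguageModelDefaults.model_for("en"))
```

Because the mapping is keyed by language code, each language resolves to exactly one model in this sketch, which matches the behaviour discussed in this thread.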
Hey, good questions! I've updated the Installation section to cover all of this. I added a note about en_core_web_sm as a lighter option (with a link to the full spaCy models list), plus a quick example showing how to set things up for a non-English language.
On your question about sm vs lg: each language value maps to one spaCy model at a time, namely whichever is registered in your environment, so there's no way to pick between them per component.
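As a rough illustration of downloading a model from within the application, the sketch below checks whether a spaCy model package is importable and, if not, runs the same `python -m spacy download` command shown in the Installation snippet. The helper names `is_model_installed`, `download_command`, and `ensure_model` are hypothetical and are not part of presidio-haystack or spaCy; the sketch relies only on the fact that spaCy models are installed as regular Python packages.

```python
import importlib.util
import subprocess
import sys
from typing import List


def is_model_installed(model_name: str) -> bool:
    """spaCy models are installed as Python packages, so a spec lookup is enough."""
    return importlib.util.find_spec(model_name) is not None


def download_command(model_name: str) -> List[str]:
    """Build `python -m spacy download <model>` for the current interpreter."""
    return [sys.executable, "-m", "spacy", "download", model_name]


def ensure_model(model_name: str) -> bool:
    """Download the model only if it is missing; return True when a download ran."""
    if is_model_installed(model_name):
        return False
    subprocess.check_call(download_command(model_name))
    return True
```

An application could call `ensure_model("en_core_web_sm")` at startup, before building the pipeline. Depending on the environment, a freshly installed model may only become importable after the process is restarted.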
authors:
  - name: deepset
    socials:
      github: deepset-ai
      twitter: deepset_ai
      linkedin: https://www.linkedin.com/company/deepset-ai/
@SyedShahmeerAli12 Don't you want to add yourself as an author?
Thanks for the review. I've addressed all the feedback:
- Removed the unrelated thunderbolt.md and thunderbolt.png files
- Added myself as co-author in the front-matter
- Expanded the Installation section with:
  - en_core_web_sm as a lighter alternative, with a link to the full spaCy models list
  - An example showing how to download and use a non-English model (e.g. Spanish)
  - A note answering your question about sm vs lg: each language maps to one spaCy model at a time. Whichever model is registered in the environment is used, so you can't pick between sm and lg per component.
- Add Shahmeer Ali as co-author
- Remove unrelated thunderbolt files
- Expand installation section with spaCy model guidance, language support note, and sm vs lg clarification
Co-authored-by: Kacper Łukawski <kacperlukawski@users.noreply.github.com>
Summary
- Integration tile for presidio-haystack with usage examples for all three components: PresidioDocumentCleaner, PresidioTextCleaner, and PresidioEntityExtractor
- microsoft.png logo (Presidio is a Microsoft open-source project)

Related issue: deepset-ai/haystack-core-integrations#3063
Test plan