feat: Add better language support presidio#3209
Conversation
Coverage report (presidio)Click to see where and how coverage changed
This report was generated by python-coverage-comment-action |
||||||||||||||||||||||||||||||||||||||||||||||||||||||
bogdankostic
left a comment
There was a problem hiding this comment.
Looks good in principle, I'm just wondering if we should remove the language parameter, given that if a user wants to use a non-english model, they have to set the models parameter. In that case self.language is not used to initialize the AnalyzerEngine.
Maybe instead we could add a set of pre-defined mappings between language and model so users could just use the language param? |
SOunds good, that way users wouldn't need to figure out which model to use for their language. |
bogdankostic
left a comment
There was a problem hiding this comment.
Looking good to me now! Just added a minor suggestion to improve the doc string of the class var.
…ractors/presidio/presidio_entity_extractor.py Co-authored-by: bogdankostic <bogdankostic@web.de>
…processors/presidio/presidio_document_cleaner.py Co-authored-by: bogdankostic <bogdankostic@web.de>
…processors/presidio/presidio_text_cleaner.py Co-authored-by: bogdankostic <bogdankostic@web.de>
Related Issues
Proposed Changes:
Update the presidio integration to properly pass
languageto thesupported_languagesparam ofAnalyzerEngineand add a new param calledmodelswhich is required if using languages other than English to tell Presidio what model to use to power the NlpEngine.For example, to run the document cleaner in german this now works and the german model is automatically downloaded at warm up time.
How did you test it?
Added unit and integration tests.
Notes for the reviewer
Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:.