feat: Add better language support presidio#3209

Merged
sjrl merged 11 commits into main from add-language-support-presidio on Apr 24, 2026

@sjrl sjrl commented Apr 23, 2026

Related Issues

  • fixes #issue-number

Proposed Changes:

Update the Presidio integration to pass language correctly to the supported_languages param of AnalyzerEngine, and add a new param called models. The models param is required for languages other than English because it tells Presidio which model to use to power the NlpEngine.

For example, running the document cleaner in German now works, and the German model is downloaded automatically at warm-up time.

from haystack import Document
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner

cleaner = PresidioDocumentCleaner(
    language="de",
    models=[{"lang_code": "de", "model_name": "de_core_news_lg"}],
)
cleaner.warm_up()
docs = [Document(content="Mein Name ist Hans Müller und meine E-Mail ist hans@example.com")]
result = cleaner.run(documents=docs)
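
For context, the wiring described above can be sketched directly against Presidio. This is a minimal sketch, not the integration's actual code: the helper names build_nlp_configuration and build_analyzer are hypothetical, and using spaCy as the NLP engine is an assumption. NlpEngineProvider, AnalyzerEngine, and its supported_languages argument come from presidio-analyzer.

```python
from typing import Any


def build_nlp_configuration(models: list[dict[str, str]]) -> dict[str, Any]:
    """Build the configuration dict that Presidio's NlpEngineProvider accepts.

    Assumption: spaCy is the NLP engine. `models` uses the same
    lang_code/model_name entries as the `models` param in this PR.
    """
    return {"nlp_engine_name": "spacy", "models": [dict(m) for m in models]}


def build_analyzer(language: str, models: list[dict[str, str]]):
    """Hypothetical helper: create an AnalyzerEngine for a single language."""
    # Imported lazily so the configuration helper works without Presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    provider = NlpEngineProvider(nlp_configuration=build_nlp_configuration(models))
    nlp_engine = provider.create_engine()
    # The language must appear both in the NLP models and in supported_languages.
    return AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=[language])


config = build_nlp_configuration(
    [{"lang_code": "de", "model_name": "de_core_news_lg"}]
)
```

Calling build_analyzer("de", ...) requires presidio-analyzer and the spaCy model to be available. The returned engine can then be queried with analyzer.analyze(text=..., language="de").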

How did you test it?

Added unit and integration tests.

Notes for the reviewer

Checklist

@sjrl sjrl requested a review from a team as a code owner April 23, 2026 07:56
@sjrl sjrl requested review from bogdankostic and removed request for a team April 23, 2026 07:56
@sjrl sjrl self-assigned this Apr 23, 2026
@github-actions github-actions Bot added the integration:presidio and type:documentation (Improvements or additions to documentation) labels Apr 23, 2026

github-actions Bot commented Apr 23, 2026

Coverage report (presidio)

File | Statements | Missing | Coverage | Coverage (new stmts) | Lines missing
  integrations/presidio/src/haystack_integrations/components/common/presidio
  utils.py
  integrations/presidio/src/haystack_integrations/components/extractors/presidio
  presidio_entity_extractor.py 101
  integrations/presidio/src/haystack_integrations/components/preprocessors/presidio
  presidio_document_cleaner.py 99
  presidio_text_cleaner.py 96
Project Total  

This report was generated by python-coverage-comment-action

@bogdankostic bogdankostic left a comment

Looks good in principle. I'm just wondering whether we should remove the language parameter: a user who wants a non-English model has to set the models parameter anyway, and in that case self.language is not used to initialize the AnalyzerEngine.

sjrl commented Apr 23, 2026

> Looks good in principle, I'm just wondering if we should remove the language parameter, given that if a user wants to use a non-English model, they have to set the models parameter. In that case self.language is not used to initialize the AnalyzerEngine.

Maybe instead we could add a set of pre-defined mappings between language and model so users could just use the language param?

@bogdankostic

> > Looks good in principle, I'm just wondering if we should remove the language parameter, given that if a user wants to use a non-English model, they have to set the models parameter. In that case self.language is not used to initialize the AnalyzerEngine.

> Maybe instead we could add a set of pre-defined mappings between language and model so users could just use the language param?

Sounds good; that way users wouldn't need to figure out which model to use for their language.
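
The pre-defined mapping idea discussed above could look roughly like this. It is a sketch of the suggestion, not the code that was merged: DEFAULT_MODELS and resolve_models are hypothetical names, and the spaCy model names listed are illustrative choices.

```python
from typing import Optional

# Hypothetical defaults: language code -> spaCy model name (illustrative only).
DEFAULT_MODELS = {
    "en": "en_core_web_lg",
    "de": "de_core_news_lg",
    "es": "es_core_news_lg",
    "fr": "fr_core_news_lg",
}


def resolve_models(
    language: str, models: Optional[list[dict[str, str]]] = None
) -> list[dict[str, str]]:
    """Return the models list to hand to the NLP engine.

    Explicitly passed models take precedence; otherwise the language is
    looked up in the default mapping.
    """
    if models:
        return models
    try:
        model_name = DEFAULT_MODELS[language]
    except KeyError:
        raise ValueError(
            f"No default model for language '{language}'; pass `models` explicitly."
        ) from None
    return [{"lang_code": language, "model_name": model_name}]
```

With a mapping like this, a user could set only language="de" and still get a German model, while models stays available as an override for languages outside the mapping.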

@bogdankostic bogdankostic left a comment

Looks good to me now! I just added a minor suggestion to improve the docstring of the class var.

sjrl and others added 3 commits April 24, 2026 13:40
…ractors/presidio/presidio_entity_extractor.py

Co-authored-by: bogdankostic <bogdankostic@web.de>
…processors/presidio/presidio_document_cleaner.py

Co-authored-by: bogdankostic <bogdankostic@web.de>
…processors/presidio/presidio_text_cleaner.py

Co-authored-by: bogdankostic <bogdankostic@web.de>
@sjrl sjrl merged commit 6dba45b into main Apr 24, 2026
15 checks passed
@sjrl sjrl deleted the add-language-support-presidio branch April 24, 2026 11:45

Labels

integration:presidio type:documentation Improvements or additions to documentation
