@@ -36,28 +36,49 @@ print(result["documents"][0].meta["entities"])
 # {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
 ```

+#### SPACY_DEFAULT_MODELS
+
+```python
+SPACY_DEFAULT_MODELS: dict[str, str] = _SPACY_DEFAULT_MODELS
+```
+
+Mapping from ISO 639-1 language code to the largest available spaCy model for that language.
+
+Used to automatically select an NLP model when `models` is not specified.
+See [spaCy documentation](https://spacy.io/models) for the full list of available spaCy models.
+
 #### __init__

 ```python
 __init__(
     *,
     language: str = "en",
     entities: list[str] | None = None,
-    score_threshold: float = 0.35
+    score_threshold: float = 0.35,
+    models: list[dict[str, str]] | None = None
 ) -> None
 ```

 Initializes the PresidioEntityExtractor.

 **Parameters:**

-- **language** (<code>str</code>) – Language code for PII detection. Defaults to `"en"`.
+- **language** (<code>str</code>) – ISO 639-1 language code for PII detection. Defaults to `"en"`.
+  For languages in the built-in mapping (e.g. `"de"`, `"fr"`, `"es"`), the appropriate
+  spaCy model is loaded automatically at warm-up time, so there is no need to set `models`.
+  For unsupported languages, use the `models` parameter to configure a custom model.
   See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
 - **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`).
   If `None`, all supported entity types are detected.
   See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
 - **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`.
   See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
+- **models** (<code>list\[dict\[str, str\]\] | None</code>) – Advanced override: list of spaCy model configurations.
+  Each entry must contain `"lang_code"` and `"model_name"` keys,
+  e.g. `[{"lang_code": "fr", "model_name": "fr_core_news_md"}]`.
+  Use this only when you need a specific model variant or a language not covered by the
+  built-in mapping. If `None`, the model is selected automatically from `SPACY_DEFAULT_MODELS`
+  based on `language`.

 #### warm_up

@@ -114,28 +135,49 @@ print(result["documents"][0].content)
 # My name is <PERSON> and my email is <EMAIL_ADDRESS>
 ```

+#### SPACY_DEFAULT_MODELS
+
+```python
+SPACY_DEFAULT_MODELS: dict[str, str] = _SPACY_DEFAULT_MODELS
+```
+
+Mapping from ISO 639-1 language code to the largest available spaCy model for that language.
+
+Used to automatically select an NLP model when `models` is not specified.
+See [spaCy documentation](https://spacy.io/models) for the full list of available spaCy models.
+
 #### __init__

 ```python
 __init__(
     *,
     language: str = "en",
     entities: list[str] | None = None,
-    score_threshold: float = 0.35
+    score_threshold: float = 0.35,
+    models: list[dict[str, str]] | None = None
 ) -> None
 ```

 Initializes the PresidioDocumentCleaner.

 **Parameters:**

-- **language** (<code>str</code>) – Language code for PII detection. Defaults to `"en"`.
+- **language** (<code>str</code>) – ISO 639-1 language code for PII detection. Defaults to `"en"`.
+  For languages in the built-in mapping (e.g. `"de"`, `"fr"`, `"es"`), the appropriate
+  spaCy model is loaded automatically at warm-up time, so there is no need to set `models`.
+  For unsupported languages, use the `models` parameter to configure a custom model.
   See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
 - **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`).
   If `None`, all supported entity types are used.
   See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
 - **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`.
   See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
+- **models** (<code>list\[dict\[str, str\]\] | None</code>) – Advanced override: list of spaCy model configurations.
+  Each entry must contain `"lang_code"` and `"model_name"` keys,
+  e.g. `[{"lang_code": "fr", "model_name": "fr_core_news_md"}]`.
+  Use this only when you need a specific model variant or a language not covered by the
+  built-in mapping. If `None`, the model is selected automatically from `SPACY_DEFAULT_MODELS`
+  based on `language`.

 #### warm_up

@@ -188,28 +230,49 @@ print(result["texts"][0])
 # Hi, I am <PERSON>, call me at <PHONE_NUMBER>
 ```

+#### SPACY_DEFAULT_MODELS
+
+```python
+SPACY_DEFAULT_MODELS: dict[str, str] = _SPACY_DEFAULT_MODELS
+```
+
+Mapping from ISO 639-1 language code to the largest available spaCy model for that language.
+
+Used to automatically select an NLP model when `models` is not specified.
+See [spaCy documentation](https://spacy.io/models) for the full list of available spaCy models.
+
 #### __init__

 ```python
 __init__(
     *,
     language: str = "en",
     entities: list[str] | None = None,
-    score_threshold: float = 0.35
+    score_threshold: float = 0.35,
+    models: list[dict[str, str]] | None = None
 ) -> None
 ```

 Initializes the PresidioTextCleaner.

 **Parameters:**

-- **language** (<code>str</code>) – Language code for PII detection. Defaults to `"en"`.
+- **language** (<code>str</code>) – ISO 639-1 language code for PII detection. Defaults to `"en"`.
+  For languages in the built-in mapping (e.g. `"de"`, `"fr"`, `"es"`), the appropriate
+  spaCy model is loaded automatically at warm-up time, so there is no need to set `models`.
+  For unsupported languages, use the `models` parameter to configure a custom model.
   See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
 - **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`).
   If `None`, all supported entity types are used.
   See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
 - **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`.
   See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
+- **models** (<code>list\[dict\[str, str\]\] | None</code>) – Advanced override: list of spaCy model configurations.
+  Each entry must contain `"lang_code"` and `"model_name"` keys,
+  e.g. `[{"lang_code": "fr", "model_name": "fr_core_news_md"}]`.
+  Use this only when you need a specific model variant or a language not covered by the
+  built-in mapping. If `None`, the model is selected automatically from `SPACY_DEFAULT_MODELS`
+  based on `language`.

 #### warm_up

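The selection rule that the new `models` parameter documents (an explicit override wins, otherwise fall back to the language mapping) can be sketched in plain Python. This is an illustrative sketch only, not the integration's code: `resolve_models` and the two-entry `DEFAULT_MODELS` dict are hypothetical stand-ins, and raising `ValueError` for a language missing from the mapping is an assumption, since the diff does not show what the components do in that case. Only the `"lang_code"` / `"model_name"` keys and the fallback to the mapping come from the documentation in this diff.

```python
from __future__ import annotations

# Hypothetical stand-in for SPACY_DEFAULT_MODELS; the real mapping's
# contents are not shown in this diff, so these two entries are assumed.
DEFAULT_MODELS: dict[str, str] = {
    "en": "en_core_web_lg",
    "de": "de_core_news_lg",
}

REQUIRED_KEYS = {"lang_code", "model_name"}


def resolve_models(
    language: str,
    models: list[dict[str, str]] | None = None,
) -> list[dict[str, str]]:
    """Return the model configuration list for `language`.

    An explicit `models` list is validated and returned unchanged;
    otherwise the entry is built from DEFAULT_MODELS.
    """
    if models is not None:
        for entry in models:
            missing = REQUIRED_KEYS - entry.keys()
            if missing:
                raise ValueError(f"model entry {entry!r} is missing keys: {sorted(missing)}")
        return models

    if language not in DEFAULT_MODELS:
        # Assumption: the real components may handle this case differently.
        raise ValueError(f"no default model for {language!r}; pass `models` explicitly")

    return [{"lang_code": language, "model_name": DEFAULT_MODELS[language]}]


# Automatic selection from the mapping.
auto = resolve_models("de")

# Explicit override for a language outside the hypothetical mapping.
custom = resolve_models("fr", [{"lang_code": "fr", "model_name": "fr_core_news_md"}])
```

Keeping the override path separate from the mapping lookup mirrors the documented intent that `models` is an advanced option, needed only for a specific model variant or a language the mapping does not cover.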