Commit edf6ac3

HaystackBot and sjrl authored
docs: sync Core Integrations API reference (presidio) on Docusaurus (#11187)
Co-authored-by: sjrl <10526848+sjrl@users.noreply.github.com>
1 parent 08a029a commit edf6ac3

12 files changed

Lines changed: 828 additions & 72 deletions

docs-website/reference/integrations-api/presidio.md

Lines changed: 69 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -36,28 +36,49 @@ print(result["documents"][0].meta["entities"])
3636
# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
3737
```
3838

39+
#### SPACY_DEFAULT_MODELS
40+
41+
```python
42+
SPACY_DEFAULT_MODELS: dict[str, str] = _SPACY_DEFAULT_MODELS
43+
```
44+
45+
Mapping from ISO 639-1 language code to the largest available spaCy model for that language.
46+
47+
Used to automatically select an NLP model when `models` is not specified.
48+
See [spaCy documentation](https://spacy.io/models) for the full list of available spaCy models.
49+
3950
#### __init__
4051

4152
```python
4253
__init__(
4354
*,
4455
language: str = "en",
4556
entities: list[str] | None = None,
46-
score_threshold: float = 0.35
57+
score_threshold: float = 0.35,
58+
models: list[dict[str, str]] | None = None
4759
) -> None
4860
```
4961

5062
Initializes the PresidioEntityExtractor.
5163

5264
**Parameters:**
5365

54-
- **language** (<code>str</code>) – Language code for PII detection. Defaults to `"en"`.
66+
- **language** (<code>str</code>) – ISO 639-1 language code for PII detection. Defaults to `"en"`.
67+
For languages in the built-in mapping (e.g. `"de"`, `"fr"`, `"es"`), the appropriate
68+
spaCy model is loaded automatically at warm-up time — no need to set `models`.
69+
For unsupported languages, use the `models` parameter to configure a custom model.
5570
See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
5671
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`).
5772
If `None`, all supported entity types are detected.
5873
See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
5974
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`.
6075
See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
76+
- **models** (<code>list\[dict\[str, str\]\] | None</code>) – Advanced override: list of spaCy model configurations.
77+
Each entry must contain `"lang_code"` and `"model_name"` keys,
78+
e.g. `[{"lang_code": "fr", "model_name": "fr_core_news_md"}]`.
79+
Use this only when you need a specific model variant or a language not covered by the
80+
built-in mapping. If `None`, the model is selected automatically from `SPACY_DEFAULT_MODELS`
81+
based on `language`.
6182

6283
#### warm_up
6384
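The fallback described for `models` and `language` above can be sketched as follows. This is a minimal illustration, not the integration's actual code: `resolve_models` is a hypothetical helper, and `EXAMPLE_DEFAULT_MODELS` is a stand-in whose entries are assumptions, since the contents of `SPACY_DEFAULT_MODELS` are not shown in this diff.

```python
from __future__ import annotations

# Illustrative stand-in for SPACY_DEFAULT_MODELS. The real mapping's contents
# are not shown in the diff, so these entries are assumptions.
EXAMPLE_DEFAULT_MODELS: dict[str, str] = {
    "en": "en_core_web_lg",
    "de": "de_core_news_lg",
    "fr": "fr_core_news_lg",
}


def resolve_models(
    language: str,
    models: list[dict[str, str]] | None = None,
) -> list[dict[str, str]]:
    """Hypothetical helper: pick the model configurations to load."""
    # An explicit `models` list always wins over the built-in mapping.
    if models is not None:
        return models
    try:
        model_name = EXAMPLE_DEFAULT_MODELS[language]
    except KeyError:
        raise ValueError(
            f"No default model for {language!r}; pass `models` explicitly."
        ) from None
    return [{"lang_code": language, "model_name": model_name}]
```

Under these assumptions, `resolve_models("fr")` yields a single French entry from the mapping, while passing a custom list returns that list unchanged.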

@@ -114,28 +135,49 @@ print(result["documents"][0].content)
114135
# My name is <PERSON> and my email is <EMAIL_ADDRESS>
115136
```
116137

138+
#### SPACY_DEFAULT_MODELS
139+
140+
```python
141+
SPACY_DEFAULT_MODELS: dict[str, str] = _SPACY_DEFAULT_MODELS
142+
```
143+
144+
Mapping from ISO 639-1 language code to the largest available spaCy model for that language.
145+
146+
Used to automatically select an NLP model when `models` is not specified.
147+
See [spaCy documentation](https://spacy.io/models) for the full list of available spaCy models.
148+
117149
#### __init__
118150

119151
```python
120152
__init__(
121153
*,
122154
language: str = "en",
123155
entities: list[str] | None = None,
124-
score_threshold: float = 0.35
156+
score_threshold: float = 0.35,
157+
models: list[dict[str, str]] | None = None
125158
) -> None
126159
```
127160

128161
Initializes the PresidioDocumentCleaner.
129162

130163
**Parameters:**
131164

132-
- **language** (<code>str</code>) – Language code for PII detection. Defaults to `"en"`.
165+
- **language** (<code>str</code>) – ISO 639-1 language code for PII detection. Defaults to `"en"`.
166+
For languages in the built-in mapping (e.g. `"de"`, `"fr"`, `"es"`), the appropriate
167+
spaCy model is loaded automatically at warm-up time — no need to set `models`.
168+
For unsupported languages, use the `models` parameter to configure a custom model.
133169
See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
134170
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`).
135171
If `None`, all supported entity types are used.
136172
See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
137173
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`.
138174
See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
175+
- **models** (<code>list\[dict\[str, str\]\] | None</code>) – Advanced override: list of spaCy model configurations.
176+
Each entry must contain `"lang_code"` and `"model_name"` keys,
177+
e.g. `[{"lang_code": "fr", "model_name": "fr_core_news_md"}]`.
178+
Use this only when you need a specific model variant or a language not covered by the
179+
built-in mapping. If `None`, the model is selected automatically from `SPACY_DEFAULT_MODELS`
180+
based on `language`.
139181

140182
#### warm_up
141183
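The requirement that each `models` entry contain `"lang_code"` and `"model_name"` keys could be checked before constructing a component. The sketch below assumes a hypothetical `validate_models` helper. It is not part of the integration, which may validate differently or not at all.

```python
REQUIRED_KEYS = ("lang_code", "model_name")


def validate_models(models: list) -> list:
    """Hypothetical check: every entry must provide both required keys."""
    for index, entry in enumerate(models):
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            raise ValueError(f"models[{index}] is missing keys: {missing}")
    return models


# The example value from the parameter description passes the check.
validate_models([{"lang_code": "fr", "model_name": "fr_core_news_md"}])
```

Failing early with the index of the offending entry is a design choice for this sketch. It surfaces a malformed configuration before any model is loaded.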

@@ -188,28 +230,49 @@ print(result["texts"][0])
188230
# Hi, I am <PERSON>, call me at <PHONE_NUMBER>
189231
```
190232

233+
#### SPACY_DEFAULT_MODELS
234+
235+
```python
236+
SPACY_DEFAULT_MODELS: dict[str, str] = _SPACY_DEFAULT_MODELS
237+
```
238+
239+
Mapping from ISO 639-1 language code to the largest available spaCy model for that language.
240+
241+
Used to automatically select an NLP model when `models` is not specified.
242+
See [spaCy documentation](https://spacy.io/models) for the full list of available spaCy models.
243+
191244
#### __init__
192245

193246
```python
194247
__init__(
195248
*,
196249
language: str = "en",
197250
entities: list[str] | None = None,
198-
score_threshold: float = 0.35
251+
score_threshold: float = 0.35,
252+
models: list[dict[str, str]] | None = None
199253
) -> None
200254
```
201255

202256
Initializes the PresidioTextCleaner.
203257

204258
**Parameters:**
205259

206-
- **language** (<code>str</code>) – Language code for PII detection. Defaults to `"en"`.
260+
- **language** (<code>str</code>) – ISO 639-1 language code for PII detection. Defaults to `"en"`.
261+
For languages in the built-in mapping (e.g. `"de"`, `"fr"`, `"es"`), the appropriate
262+
spaCy model is loaded automatically at warm-up time — no need to set `models`.
263+
For unsupported languages, use the `models` parameter to configure a custom model.
207264
See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
208265
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`).
209266
If `None`, all supported entity types are used.
210267
See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
211268
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`.
212269
See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
270+
- **models** (<code>list\[dict\[str, str\]\] | None</code>) – Advanced override: list of spaCy model configurations.
271+
Each entry must contain `"lang_code"` and `"model_name"` keys,
272+
e.g. `[{"lang_code": "fr", "model_name": "fr_core_news_md"}]`.
273+
Use this only when you need a specific model variant or a language not covered by the
274+
built-in mapping. If `None`, the model is selected automatically from `SPACY_DEFAULT_MODELS`
275+
based on `language`.
213276

214277
#### warm_up
215278
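The effect of `score_threshold` can be illustrated on entity dictionaries shaped like the ones printed in the usage examples. This is a sketch only: `filter_by_score` is a hypothetical function, the second entity is made-up sample data, and the inclusive comparison (`>=`) is an assumption based on the word "minimum" in the parameter description.

```python
def filter_by_score(entities: list, score_threshold: float = 0.35) -> list:
    """Hypothetical filter: keep entities at or above the threshold."""
    return [entity for entity in entities if entity["score"] >= score_threshold]


detected = [
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
    # Made-up low-confidence match, included only to show it being dropped.
    {"entity_type": "PERSON", "start": 0, "end": 5, "score": 0.2},
]

kept = filter_by_score(detected)
print([entity["entity_type"] for entity in kept])
```

With the default threshold of `0.35`, only the `EMAIL_ADDRESS` entry remains in `kept`.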

docs-website/reference_versioned_docs/version-2.18/integrations-api/presidio.md

Lines changed: 69 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -36,28 +36,49 @@ print(result["documents"][0].meta["entities"])
3636
# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
3737
```
3838

39+
#### SPACY_DEFAULT_MODELS
40+
41+
```python
42+
SPACY_DEFAULT_MODELS: dict[str, str] = _SPACY_DEFAULT_MODELS
43+
```
44+
45+
Mapping from ISO 639-1 language code to the largest available spaCy model for that language.
46+
47+
Used to automatically select an NLP model when `models` is not specified.
48+
See [spaCy documentation](https://spacy.io/models) for the full list of available spaCy models.
49+
3950
#### __init__
4051

4152
```python
4253
__init__(
4354
*,
4455
language: str = "en",
4556
entities: list[str] | None = None,
46-
score_threshold: float = 0.35
57+
score_threshold: float = 0.35,
58+
models: list[dict[str, str]] | None = None
4759
) -> None
4860
```
4961

5062
Initializes the PresidioEntityExtractor.
5163

5264
**Parameters:**
5365

54-
- **language** (<code>str</code>) – Language code for PII detection. Defaults to `"en"`.
66+
- **language** (<code>str</code>) – ISO 639-1 language code for PII detection. Defaults to `"en"`.
67+
For languages in the built-in mapping (e.g. `"de"`, `"fr"`, `"es"`), the appropriate
68+
spaCy model is loaded automatically at warm-up time — no need to set `models`.
69+
For unsupported languages, use the `models` parameter to configure a custom model.
5570
See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
5671
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`).
5772
If `None`, all supported entity types are detected.
5873
See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
5974
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`.
6075
See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
76+
- **models** (<code>list\[dict\[str, str\]\] | None</code>) – Advanced override: list of spaCy model configurations.
77+
Each entry must contain `"lang_code"` and `"model_name"` keys,
78+
e.g. `[{"lang_code": "fr", "model_name": "fr_core_news_md"}]`.
79+
Use this only when you need a specific model variant or a language not covered by the
80+
built-in mapping. If `None`, the model is selected automatically from `SPACY_DEFAULT_MODELS`
81+
based on `language`.
6182

6283
#### warm_up
6384

@@ -114,28 +135,49 @@ print(result["documents"][0].content)
114135
# My name is <PERSON> and my email is <EMAIL_ADDRESS>
115136
```
116137

138+
#### SPACY_DEFAULT_MODELS
139+
140+
```python
141+
SPACY_DEFAULT_MODELS: dict[str, str] = _SPACY_DEFAULT_MODELS
142+
```
143+
144+
Mapping from ISO 639-1 language code to the largest available spaCy model for that language.
145+
146+
Used to automatically select an NLP model when `models` is not specified.
147+
See [spaCy documentation](https://spacy.io/models) for the full list of available spaCy models.
148+
117149
#### __init__
118150

119151
```python
120152
__init__(
121153
*,
122154
language: str = "en",
123155
entities: list[str] | None = None,
124-
score_threshold: float = 0.35
156+
score_threshold: float = 0.35,
157+
models: list[dict[str, str]] | None = None
125158
) -> None
126159
```
127160

128161
Initializes the PresidioDocumentCleaner.
129162

130163
**Parameters:**
131164

132-
- **language** (<code>str</code>) – Language code for PII detection. Defaults to `"en"`.
165+
- **language** (<code>str</code>) – ISO 639-1 language code for PII detection. Defaults to `"en"`.
166+
For languages in the built-in mapping (e.g. `"de"`, `"fr"`, `"es"`), the appropriate
167+
spaCy model is loaded automatically at warm-up time — no need to set `models`.
168+
For unsupported languages, use the `models` parameter to configure a custom model.
133169
See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
134170
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`).
135171
If `None`, all supported entity types are used.
136172
See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
137173
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`.
138174
See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
175+
- **models** (<code>list\[dict\[str, str\]\] | None</code>) – Advanced override: list of spaCy model configurations.
176+
Each entry must contain `"lang_code"` and `"model_name"` keys,
177+
e.g. `[{"lang_code": "fr", "model_name": "fr_core_news_md"}]`.
178+
Use this only when you need a specific model variant or a language not covered by the
179+
built-in mapping. If `None`, the model is selected automatically from `SPACY_DEFAULT_MODELS`
180+
based on `language`.
139181

140182
#### warm_up
141183

@@ -188,28 +230,49 @@ print(result["texts"][0])
188230
# Hi, I am <PERSON>, call me at <PHONE_NUMBER>
189231
```
190232

233+
#### SPACY_DEFAULT_MODELS
234+
235+
```python
236+
SPACY_DEFAULT_MODELS: dict[str, str] = _SPACY_DEFAULT_MODELS
237+
```
238+
239+
Mapping from ISO 639-1 language code to the largest available spaCy model for that language.
240+
241+
Used to automatically select an NLP model when `models` is not specified.
242+
See [spaCy documentation](https://spacy.io/models) for the full list of available spaCy models.
243+
191244
#### __init__
192245

193246
```python
194247
__init__(
195248
*,
196249
language: str = "en",
197250
entities: list[str] | None = None,
198-
score_threshold: float = 0.35
251+
score_threshold: float = 0.35,
252+
models: list[dict[str, str]] | None = None
199253
) -> None
200254
```
201255

202256
Initializes the PresidioTextCleaner.
203257

204258
**Parameters:**
205259

206-
- **language** (<code>str</code>) – Language code for PII detection. Defaults to `"en"`.
260+
- **language** (<code>str</code>) – ISO 639-1 language code for PII detection. Defaults to `"en"`.
261+
For languages in the built-in mapping (e.g. `"de"`, `"fr"`, `"es"`), the appropriate
262+
spaCy model is loaded automatically at warm-up time — no need to set `models`.
263+
For unsupported languages, use the `models` parameter to configure a custom model.
207264
See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
208265
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "PHONE_NUMBER"]`).
209266
If `None`, all supported entity types are used.
210267
See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
211268
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`.
212269
See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).
270+
- **models** (<code>list\[dict\[str, str\]\] | None</code>) – Advanced override: list of spaCy model configurations.
271+
Each entry must contain `"lang_code"` and `"model_name"` keys,
272+
e.g. `[{"lang_code": "fr", "model_name": "fr_core_news_md"}]`.
273+
Use this only when you need a specific model variant or a language not covered by the
274+
built-in mapping. If `None`, the model is selected automatically from `SPACY_DEFAULT_MODELS`
275+
based on `language`.
213276

214277
#### warm_up
215278
