Merged
92 changes: 46 additions & 46 deletions docs-website/reference/integrations-api/presidio.md
@@ -6,31 +6,34 @@ slug: "/integrations-presidio"
---


## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner
## haystack_integrations.components.extractors.presidio.presidio_entity_extractor

### PresidioDocumentCleaner
### PresidioEntityExtractor

Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/).
Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer.

Accepts a list of Documents, detects personally identifiable information (PII) in their
text content, and returns new Documents with PII replaced by entity type placeholders
(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated.
See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details.

Documents without text content are passed through unchanged.
Accepts a list of Documents and returns new Documents with detected PII entities stored
in each Document's metadata under the key `"entities"`. Each entry in the list contains
the entity type, start/end character offsets, and the confidence score.

The analyzer and anonymizer engines are loaded on the first call to `run()`,
Original Documents are not mutated. Documents without text content are passed through unchanged.

The analyzer engine is loaded on the first call to `run()`,
or by calling `warm_up()` explicitly beforehand.

### Usage example

```python
from haystack import Document
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

cleaner = PresidioDocumentCleaner()
result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")])
print(result["documents"][0].content)
# My name is <PERSON> and my email is <EMAIL_ADDRESS>
extractor = PresidioEntityExtractor()
result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")])
print(result["documents"][0].meta["entities"])
# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
```
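
Because each entry carries character offsets, the matched text can be recovered by slicing the Document content. The following is a minimal sketch in plain Python that reuses the entity dictionaries from the example output above. The helper name `entity_texts` is hypothetical and not part of the integration, and the sketch assumes `end` is an exclusive offset, which matches the values shown.

```python
def entity_texts(content: str, entities: list[dict]) -> list[tuple[str, str]]:
    """Pair each entity type with the substring its offsets point at.

    Hypothetical helper for illustration; assumes `end` is exclusive.
    """
    return [(e["entity_type"], content[e["start"]:e["end"]]) for e in entities]


content = "Contact Alice at alice@example.com"
# Entity entries copied from the example output above.
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]
pairs = entity_texts(content, entities)
print(pairs)
```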

#### __init__
@@ -44,16 +47,16 @@ __init__(
) -> None
```

Initializes the PresidioDocumentCleaner.
Initializes the PresidioEntityExtractor.

**Parameters:**

- **language** (<code>str</code>) – Language code for PII detection. Defaults to `"en"`.
See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`).
If `None`, all supported entity types are used.
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`).
If `None`, all supported entity types are detected.
See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`.
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`.
See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).

#### warm_up
@@ -62,7 +65,7 @@ Initializes the PresidioDocumentCleaner.
warm_up() -> None
```

Initializes the Presidio analyzer and anonymizer engines.
Initializes the Presidio analyzer engine.

This method loads the underlying NLP models. In a Haystack Pipeline,
this is called automatically before the first run.
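
The load-on-first-use behaviour described here can be pictured with a toy component. This is only a sketch of the general lazy-initialization pattern, not the integration's actual code: the class `LazyEngineComponent`, its `load_count` attribute, and the assumption that `warm_up()` is idempotent are all hypothetical.

```python
class LazyEngineComponent:
    """Toy component: the expensive engine is built once, on demand."""

    def __init__(self) -> None:
        self._engine = None  # nothing heavy is loaded at construction time
        self.load_count = 0  # counts how often the (fake) engine was loaded

    def warm_up(self) -> None:
        # Assumed idempotent: repeated calls do not reload the engine.
        if self._engine is None:
            self._engine = object()  # stand-in for loading NLP models
            self.load_count += 1

    def run(self, texts: list[str]) -> dict[str, list[str]]:
        self.warm_up()  # the first run() triggers loading if warm_up() was skipped
        return {"texts": list(texts)}


component = LazyEngineComponent()
component.run(["a"])   # loads the engine implicitly
component.warm_up()    # no-op: already loaded
component.run(["b"])   # reuses the loaded engine
print(component.load_count)
```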
@@ -73,44 +76,42 @@ this is called automatically before the first run.
run(documents: list[Document]) -> dict[str, list[Document]]
```

Anonymizes PII in the provided Documents.
Detects PII entities in the provided Documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents whose text content will be anonymized.
- **documents** (<code>list\[Document\]</code>) – List of Documents to analyze for PII entities.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with key `documents` containing the cleaned Documents.

## haystack_integrations.components.preprocessors.presidio.presidio_entity_extractor
- <code>dict\[str, list\[Document\]\]</code> – A dictionary with key `documents` containing Documents with detected entities
stored in metadata under the key `"entities"`.

### PresidioEntityExtractor
## haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner

Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer.
### PresidioDocumentCleaner

See [Presidio Analyzer](https://microsoft.github.io/presidio/) for details.
Anonymizes PII in Haystack Documents using [Microsoft Presidio](https://microsoft.github.io/presidio/).

Accepts a list of Documents and returns new Documents with detected PII entities stored
in each Document's metadata under the key `"entities"`. Each entry in the list contains
the entity type, start/end character offsets, and the confidence score.
Accepts a list of Documents, detects personally identifiable information (PII) in their
text content, and returns new Documents with PII replaced by entity type placeholders
(e.g. `<PERSON>`, `<EMAIL_ADDRESS>`). Original Documents are not mutated.

Original Documents are not mutated. Documents without text content are passed through unchanged.
Documents without text content are passed through unchanged.

The analyzer engine is loaded on the first call to `run()`,
The analyzer and anonymizer engines are loaded on the first call to `run()`,
or by calling `warm_up()` explicitly beforehand.

### Usage example

```python
from haystack import Document
from haystack_integrations.components.preprocessors.presidio import PresidioEntityExtractor
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner

extractor = PresidioEntityExtractor()
result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")])
print(result["documents"][0].meta["entities"])
# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
cleaner = PresidioDocumentCleaner()
result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")])
print(result["documents"][0].content)
# My name is <PERSON> and my email is <EMAIL_ADDRESS>
```
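
To illustrate what replacing PII with entity type placeholders means, here is a minimal plain-Python sketch. It is not how the component or Presidio's anonymizer is implemented: the helper `replace_with_placeholders` is hypothetical, the entity spans are computed by hand rather than detected, and spans are assumed not to overlap.

```python
def replace_with_placeholders(content: str, entities: list[dict]) -> str:
    """Replace each span with an <ENTITY_TYPE> placeholder (illustrative only)."""
    # Work from the rightmost span to the left so earlier offsets stay valid.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        content = content[: e["start"]] + f"<{e['entity_type']}>" + content[e["end"]:]
    return content


original = "My name is John and my email is john@example.com"
# Hand-built spans standing in for detector output.
name_start = original.index("John")
email_start = original.index("john@example.com")
entities = [
    {"entity_type": "PERSON", "start": name_start, "end": name_start + len("John")},
    {
        "entity_type": "EMAIL_ADDRESS",
        "start": email_start,
        "end": email_start + len("john@example.com"),
    },
]
cleaned = replace_with_placeholders(original, entities)
print(cleaned)
```

Processing spans from right to left avoids recomputing offsets after each replacement, since a placeholder rarely has the same length as the text it replaces.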

#### __init__
@@ -124,16 +125,16 @@ __init__(
) -> None
```

Initializes the PresidioEntityExtractor.
Initializes the PresidioDocumentCleaner.

**Parameters:**

- **language** (<code>str</code>) – Language code for PII detection. Defaults to `"en"`.
See [Presidio supported languages](https://microsoft.github.io/presidio/analyzer/languages/).
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`).
If `None`, all supported entity types are detected.
- **entities** (<code>list\[str\] | None</code>) – List of PII entity types to detect and anonymize (e.g. `["PERSON", "EMAIL_ADDRESS"]`).
If `None`, all supported entity types are used.
See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/).
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to `0.35`.
- **score_threshold** (<code>float</code>) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to `0.35`.
See [Presidio analyzer documentation](https://microsoft.github.io/presidio/analyzer/).

#### warm_up
@@ -142,7 +143,7 @@ Initializes the PresidioEntityExtractor.
warm_up() -> None
```

Initializes the Presidio analyzer engine.
Initializes the Presidio analyzer and anonymizer engines.

This method loads the underlying NLP models. In a Haystack Pipeline,
this is called automatically before the first run.
@@ -153,16 +154,15 @@ this is called automatically before the first run.
run(documents: list[Document]) -> dict[str, list[Document]]
```

Detects PII entities in the provided Documents.
Anonymizes PII in the provided Documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – List of Documents to analyze for PII entities.
- **documents** (<code>list\[Document\]</code>) – List of Documents whose text content will be anonymized.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with key `documents` containing Documents with detected entities
stored in metadata under the key `"entities"`.
- <code>dict\[str, list\[Document\]\]</code> – A dictionary with key `documents` containing the cleaned Documents.

## haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner
