docs: add Presidio component docs pages #11165
Merged: sjrl merged 7 commits into deepset-ai:main from SyedShahmeerAli12:docs/presidio-components on Apr 24, 2026
0ba983e docs: add Presidio preprocessors docs page (SyedShahmeerAli12)
3a6b827 docs(presidio): split into per-component files and move extractor to … (SyedShahmeerAli12)
d80028e docs(presidio): sort preprocessors table alphabetically (SyedShahmeerAli12)
d683e27 docs(presidio): explain spaCy model download in installation steps (SyedShahmeerAli12)
c8c3be3 docs(presidio): restructure PresidioEntityExtractor page per review f… (SyedShahmeerAli12)
2f7662f docs(presidio): move installation into Usage section, drop standalone… (SyedShahmeerAli12)
a2d584b docs(presidio): add Overview, move Configuration before Usage, add su… (SyedShahmeerAli12)
75 changes: 75 additions & 0 deletions
docs-website/docs/pipeline-components/extractors/presidioentityextractor.mdx
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,75 @@ | ||
| --- | ||
| title: "PresidioEntityExtractor" | ||
| id: presidioentityextractor | ||
| slug: "/presidioentityextractor" | ||
| description: "Use `PresidioEntityExtractor` to detect PII in Documents and store the entities as structured metadata, powered by Microsoft Presidio." | ||
| --- | ||
|
|
||
| # PresidioEntityExtractor | ||
|
|
||
| `PresidioEntityExtractor` detects personally identifiable information (PII) in Documents and stores the detected entities as structured metadata under the `"entities"` key, without modifying the document text. Each entry contains the entity type, character offsets, and confidence score. | ||
|
|
||
| <div className="key-value-table"> | ||
|
|
||
| | | | | ||
| | --- | --- | | ||
| | **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | | ||
| | **Mandatory run variables** | `documents`: A list of Document objects | | ||
| | **Output variables** | `documents`: A list of Document objects with PII metadata added | | ||
| | **API reference** | [Presidio](/reference/integrations-presidio) | | ||
| | **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | | ||
|
|
||
| </div> | ||
|
|
||
| ## Overview | ||
|
|
||
| [Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioEntityExtractor` uses Presidio's Analyzer Engine to scan document text and identify entities such as names, email addresses, and phone numbers. | ||
|
|
||
| The extractor does **not** modify the document text. Instead, it adds the detected entities as structured metadata, letting you inspect or act on PII findings without altering the original content. This is useful when you want to audit what PII is present before deciding how to handle it. | ||
|
|
||
| ## Configuration | ||
|
|
||
| | Parameter | Default | Description | | ||
| | --- | --- | --- | | ||
| | `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | | ||
| | `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). | | ||
| | `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | | ||
|
|
||
| ## Usage | ||
|
|
||
| Install the `presidio-haystack` package to use the `PresidioEntityExtractor`. | ||
|
|
||
|
|
||
| ```bash | ||
| pip install presidio-haystack | ||
| # Download the spaCy NLP model for English. For other languages, replace with the appropriate spaCy model. | ||
| python -m spacy download en_core_web_lg | ||
| ``` | ||
|
|
||
| ### On its own | ||
|
|
||
| ```python | ||
| from haystack import Document | ||
| from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor | ||
|
|
||
| extractor = PresidioEntityExtractor() | ||
| result = extractor.run(documents=[ | ||
|     Document(content="Contact Alice at alice@example.com") | ||
| ]) | ||
| print(result["documents"][0].meta["entities"]) | ||
| # [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, | ||
| # {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] | ||
| ``` | ||
|
|
||
| ### Using custom parameters | ||
|
|
||
| To customize entity detection, pass parameters when initializing the extractor: | ||
|
|
||
| ```python | ||
| from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor | ||
|
|
||
| extractor = PresidioEntityExtractor( | ||
|     language="de",  # install the matching spaCy model for this language first | ||
|     entities=["PERSON", "EMAIL_ADDRESS"], | ||
|     score_threshold=0.7, | ||
| ) | ||
| ``` | ||
|
|
||
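Because each entry under `meta["entities"]` carries character offsets, the matched text can be recovered by slicing the document content. The following is a minimal sketch that works on plain Python values shaped like the example output above. It does not call Presidio, and `summarize_entities` is a hypothetical helper name used only for illustration.

```python
from collections import Counter

# Values copied from the "On its own" example; in practice they would come
# from document.content and document.meta["entities"].
content = "Contact Alice at alice@example.com"
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]


def summarize_entities(text, found):
    """Return (matched substrings, count per entity type) for an audit report."""
    # start is inclusive and end is exclusive, so a plain slice recovers the match.
    matches = [(e["entity_type"], text[e["start"]:e["end"]]) for e in found]
    counts = Counter(e["entity_type"] for e in found)
    return matches, counts


matches, counts = summarize_entities(content, entities)
print(matches)
# [('PERSON', 'Alice'), ('EMAIL_ADDRESS', 'alice@example.com')]
print(dict(counts))
# {'PERSON': 1, 'EMAIL_ADDRESS': 1}
```

A summary like this can support the audit use case described in the Overview, since the document text itself is left untouched.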
99 changes: 99 additions & 0 deletions
docs-website/docs/pipeline-components/preprocessors/presidiodocumentcleaner.mdx
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,99 @@ | ||
| --- | ||
| title: "PresidioDocumentCleaner" | ||
| id: presidiodocumentcleaner | ||
| slug: "/presidiodocumentcleaner" | ||
| description: "Use `PresidioDocumentCleaner` to replace PII in Document text with entity type placeholders, powered by Microsoft Presidio." | ||
| --- | ||
|
|
||
| # PresidioDocumentCleaner | ||
|
|
||
| `PresidioDocumentCleaner` replaces personally identifiable information (PII) in the text content of Documents with entity type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. Original Documents are not mutated. Documents without text content pass through unchanged. | ||
|
|
||
| <div className="key-value-table"> | ||
|
|
||
| | | | | ||
| | --- | --- | | ||
| | **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | | ||
| | **Mandatory run variables** | `documents`: A list of Document objects | | ||
| | **Output variables** | `documents`: A list of Document objects with PII replaced | | ||
| | **API reference** | [Presidio](/reference/integrations-presidio) | | ||
| | **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | | ||
|
|
||
| </div> | ||
|
|
||
|
|
||
| ## Overview | ||
|
|
||
| [Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioDocumentCleaner` uses Presidio's Analyzer and Anonymizer engines to scan document text and replace detected entities with type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. | ||
|
|
||
| This is useful when you want to store sanitized versions of your documents in a Document Store, for example to keep sensitive information from being indexed or returned in search results. | ||
|
|
||
| ## Configuration | ||
|
|
||
| | Parameter | Default | Description | | ||
| | --- | --- | --- | | ||
| | `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | | ||
| | `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). | | ||
| | `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | | ||
|
|
||
| ## Usage | ||
|
|
||
| Install the `presidio-haystack` package to use the `PresidioDocumentCleaner`. | ||
|
|
||
| ```bash | ||
| pip install presidio-haystack | ||
| # Download the English NLP model required by Presidio's analyzer engine | ||
| python -m spacy download en_core_web_lg | ||
| ``` | ||
|
|
||
| ### On its own | ||
|
|
||
| ```python | ||
| from haystack import Document | ||
| from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner | ||
|
|
||
| cleaner = PresidioDocumentCleaner() | ||
| result = cleaner.run(documents=[ | ||
|     Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.") | ||
| ]) | ||
| print(result["documents"][0].content) | ||
| # Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>. | ||
| ``` | ||
|
|
||
| ### In a pipeline | ||
|
|
||
| ```python | ||
| from haystack import Document, Pipeline | ||
| from haystack.components.writers import DocumentWriter | ||
| from haystack.document_stores.in_memory import InMemoryDocumentStore | ||
| from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner | ||
|
|
||
| document_store = InMemoryDocumentStore() | ||
|
|
||
| indexing_pipeline = Pipeline() | ||
| indexing_pipeline.add_component("cleaner", PresidioDocumentCleaner()) | ||
| indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store)) | ||
| indexing_pipeline.connect("cleaner", "writer") | ||
|
|
||
| indexing_pipeline.run({ | ||
|     "cleaner": { | ||
|         "documents": [ | ||
|             Document(content="Alice Smith's email is alice@example.com"), | ||
|             Document(content="Call Bob at 212-555-9876"), | ||
|         ] | ||
|     } | ||
| }) | ||
| ``` | ||
|
|
||
| ### Using custom parameters | ||
|
|
||
| To customize PII detection, pass parameters when initializing the cleaner: | ||
|
|
||
| ```python | ||
|
|
||
| from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner | ||
|
|
||
| cleaner = PresidioDocumentCleaner( | ||
|     language="de",  # install the matching spaCy model for this language first | ||
|     entities=["PERSON", "EMAIL_ADDRESS"], | ||
|     score_threshold=0.7, | ||
| ) | ||
| ``` | ||
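The placeholder replacement described in this file can be pictured as a span substitution over the original string. The sketch below is an illustration of that idea in plain Python, not Presidio's implementation: `span_of` and `replace_spans` are hypothetical helpers, the spans are located by exact string search instead of an analyzer, and non-overlapping spans are assumed.

```python
def span_of(text, needle, entity_type):
    """Locate needle in text and return a (start, end, entity_type) span."""
    start = text.index(needle)
    return (start, start + len(needle), entity_type)


def replace_spans(text, spans):
    """Replace each span with a <ENTITY_TYPE> placeholder."""
    # Work from the rightmost span to the leftmost so that earlier offsets
    # stay valid while the string length changes.
    for start, end, entity_type in sorted(spans, reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


original = "Contact Alice Smith at alice@example.com or 212-555-1234."
# Stand-in detections; a real analyzer would supply these offsets.
spans = [
    span_of(original, "Alice Smith", "PERSON"),
    span_of(original, "alice@example.com", "EMAIL_ADDRESS"),
    span_of(original, "212-555-1234", "PHONE_NUMBER"),
]

cleaned = replace_spans(original, spans)
print(cleaned)
# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
```

Replacing from right to left avoids recomputing offsets after each substitution, and building a new string leaves the input unchanged, which matches the statement that original Documents are not mutated.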
94 changes: 94 additions & 0 deletions
docs-website/docs/pipeline-components/preprocessors/presidiotextcleaner.mdx
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,94 @@ | ||
| --- | ||
| title: "PresidioTextCleaner" | ||
| id: presidiotextcleaner | ||
| slug: "/presidiotextcleaner" | ||
| description: "Use `PresidioTextCleaner` to replace PII in plain strings, powered by Microsoft Presidio." | ||
| --- | ||
|
|
||
| # PresidioTextCleaner | ||
|
|
||
| `PresidioTextCleaner` replaces personally identifiable information (PII) in plain strings with entity type placeholders such as `<PERSON>`. It takes a `list[str]` as input and returns a `list[str]`, which makes it suited to sanitizing user queries before they are sent to an LLM. | ||
|
|
||
| <div className="key-value-table"> | ||
|
|
||
| | | | | ||
| | --- | --- | | ||
| | **Most common position in a pipeline** | In a query pipeline, before a Generator or Chat Generator | | ||
| | **Mandatory run variables** | `texts`: A list of strings | | ||
| | **Output variables** | `texts`: A list of strings with PII replaced | | ||
| | **API reference** | [Presidio](/reference/integrations-presidio) | | ||
| | **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | | ||
|
|
||
| </div> | ||
|
|
||
|
|
||
| ## Overview | ||
|
|
||
| [Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioTextCleaner` uses Presidio's Analyzer and Anonymizer engines to scan plain text strings and replace detected entities with type placeholders such as `<PERSON>` or `<US_SSN>`. | ||
|
|
||
| This is useful when you want to sanitize user queries before sending them to an LLM, reducing the risk that personally identifiable information is passed to the model. Detection is score-based, so entities scoring below `score_threshold` are left in place. | ||
|
|
||
| ## Configuration | ||
|
|
||
| | Parameter | Default | Description | | ||
| | --- | --- | --- | | ||
| | `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | | ||
| | `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). | | ||
| | `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | | ||
|
|
||
| ## Usage | ||
|
|
||
| Install the `presidio-haystack` package to use the `PresidioTextCleaner`. | ||
|
|
||
| ```bash | ||
| pip install presidio-haystack | ||
| # Download the English NLP model required by Presidio's analyzer engine | ||
| python -m spacy download en_core_web_lg | ||
| ``` | ||
|
|
||
| ### On its own | ||
|
|
||
| ```python | ||
| from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner | ||
|
|
||
| cleaner = PresidioTextCleaner() | ||
| result = cleaner.run(texts=["My name is John Doe, my SSN is 123-45-6789"]) | ||
| print(result["texts"][0]) | ||
| # My name is <PERSON>, my SSN is <US_SSN> | ||
| ``` | ||
|
|
||
| ### In a pipeline | ||
|
|
||
| ```python | ||
| from haystack import Pipeline | ||
| from haystack.components.builders import ChatPromptBuilder | ||
| from haystack.components.generators.chat import OpenAIChatGenerator | ||
| from haystack.dataclasses import ChatMessage | ||
| from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner | ||
|
|
||
| template = [ChatMessage.from_user("Answer this question: {{query}}")] | ||
|
|
||
| query_pipeline = Pipeline() | ||
| query_pipeline.add_component("cleaner", PresidioTextCleaner()) | ||
| query_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template)) | ||
| query_pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini")) | ||
| query_pipeline.connect("cleaner.texts[0]", "prompt_builder.query") | ||
| query_pipeline.connect("prompt_builder", "llm") | ||
|
|
||
| query_pipeline.run({ | ||
|     "cleaner": {"texts": ["My name is John Smith. What is the capital of France?"]} | ||
| }) | ||
| ``` | ||
|
|
||
| ### Using custom parameters | ||
|
|
||
| To customize PII detection, pass parameters when initializing the cleaner: | ||
|
|
||
| ```python | ||
|
|
||
| from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner | ||
|
|
||
| cleaner = PresidioTextCleaner( | ||
|     language="de",  # install the matching spaCy model for this language first | ||
|     entities=["PERSON", "EMAIL_ADDRESS"], | ||
|     score_threshold=0.7, | ||
| ) | ||
| ``` | ||
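The `entities` and `score_threshold` parameters both narrow the set of detections that are acted on. A minimal sketch of that filtering logic on plain dictionaries follows. `keep_detection` is a hypothetical helper, the sample scores are invented for illustration, and treating a score equal to the threshold as included is an assumption, not confirmed behavior of the component.

```python
def keep_detection(detection, entities=None, score_threshold=0.35):
    """Return True if a detection passes the entity-type and score filters."""
    # entities=None means no type restriction, mirroring the documented default.
    type_ok = entities is None or detection["entity_type"] in entities
    # Assumption: scores equal to the threshold are kept.
    return type_ok and detection["score"] >= score_threshold


# Invented sample detections, not real analyzer output.
detections = [
    {"entity_type": "PERSON", "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "score": 1.0},
    {"entity_type": "US_SSN", "score": 0.5},
    {"entity_type": "PERSON", "score": 0.4},
]

default_kept = [d for d in detections if keep_detection(d)]
strict_kept = [
    d
    for d in detections
    if keep_detection(d, entities=["PERSON", "EMAIL_ADDRESS"], score_threshold=0.7)
]

print(len(default_kept), len(strict_kept))
# 4 2
```

Raising the threshold trades recall for precision: fewer false positives are replaced, but low-scoring true PII is more likely to pass through.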
75 changes: 75 additions & 0 deletions
...ed_docs/version-2.28/pipeline-components/extractors/presidioentityextractor.mdx
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,75 @@ | ||
| --- | ||
| title: "PresidioEntityExtractor" | ||
| id: presidioentityextractor | ||
| slug: "/presidioentityextractor" | ||
| description: "Use `PresidioEntityExtractor` to detect PII in Documents and store the entities as structured metadata, powered by Microsoft Presidio." | ||
| --- | ||
|
|
||
| # PresidioEntityExtractor | ||
|
|
||
| `PresidioEntityExtractor` detects personally identifiable information (PII) in Documents and stores the detected entities as structured metadata under the `"entities"` key, without modifying the document text. Each entry contains the entity type, character offsets, and confidence score. | ||
|
|
||
| <div className="key-value-table"> | ||
|
|
||
| | | | | ||
| | --- | --- | | ||
| | **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | | ||
| | **Mandatory run variables** | `documents`: A list of Document objects | | ||
| | **Output variables** | `documents`: A list of Document objects with PII metadata added | | ||
| | **API reference** | [Presidio](/reference/integrations-presidio) | | ||
| | **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | | ||
|
|
||
| </div> | ||
|
|
||
| ## Overview | ||
|
|
||
| [Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioEntityExtractor` uses Presidio's Analyzer Engine to scan document text and identify entities such as names, email addresses, and phone numbers. | ||
|
|
||
| The extractor does **not** modify the document text. Instead, it adds the detected entities as structured metadata, letting you inspect or act on PII findings without altering the original content. This is useful when you want to audit what PII is present before deciding how to handle it. | ||
|
|
||
| ## Configuration | ||
|
|
||
| | Parameter | Default | Description | | ||
| | --- | --- | --- | | ||
| | `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | | ||
| | `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). | | ||
| | `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | | ||
|
|
||
| ## Usage | ||
|
|
||
| Install the `presidio-haystack` package to use the `PresidioEntityExtractor`. | ||
|
|
||
| ```bash | ||
| pip install presidio-haystack | ||
| # Download the spaCy NLP model for English. For other languages, replace with the appropriate spaCy model. | ||
| python -m spacy download en_core_web_lg | ||
| ``` | ||
|
|
||
| ### On its own | ||
|
|
||
| ```python | ||
| from haystack import Document | ||
| from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor | ||
|
|
||
| extractor = PresidioEntityExtractor() | ||
| result = extractor.run(documents=[ | ||
|     Document(content="Contact Alice at alice@example.com") | ||
| ]) | ||
| print(result["documents"][0].meta["entities"]) | ||
| # [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, | ||
| # {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] | ||
| ``` | ||
|
|
||
| ### Using custom parameters | ||
|
|
||
| To customize entity detection, pass parameters when initializing the extractor: | ||
|
|
||
| ```python | ||
| from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor | ||
|
|
||
| extractor = PresidioEntityExtractor( | ||
|     language="de",  # install the matching spaCy model for this language first | ||
|     entities=["PERSON", "EMAIL_ADDRESS"], | ||
|     score_threshold=0.7, | ||
| ) | ||
| ``` |