docs: add Presidio component docs pages #11165
Changes from 5 commits
0ba983e
3a6b827
d80028e
d683e27
c8c3be3
2f7662f
a2d584b
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,77 @@ | ||
| --- | ||
| title: "PresidioEntityExtractor" | ||
| id: presidioentityextractor | ||
| slug: "/presidioentityextractor" | ||
| description: "Use `PresidioEntityExtractor` to detect PII in Documents and store the entities as structured metadata, powered by Microsoft Presidio." | ||
| --- | ||
|
|
||
| # PresidioEntityExtractor | ||
|
|
||
| `PresidioEntityExtractor` detects personally identifiable information (PII) in Documents and stores the detected entities as structured metadata under the `"entities"` key, without modifying the document text. Each entry contains the entity type, character offsets, and confidence score. | ||
|
|
||
| <div className="key-value-table"> | ||
|
|
||
| | | | | ||
| | --- | --- | | ||
| | **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | | ||
| | **Mandatory run variables** | `documents`: A list of Document objects | | ||
| | **Output variables** | `documents`: A list of Document objects with PII metadata added | | ||
| | **API reference** | [Presidio](/reference/integrations-presidio) | | ||
| | **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | | ||
|
|
||
| </div> | ||
|
|
||
| ## Overview | ||
|
|
||
| [Microsoft Presidio](https://microsoft.github.io/presidio/) is an open-source framework for PII detection and anonymization. `PresidioEntityExtractor` uses Presidio's Analyzer Engine to scan document text and identify entities such as names, email addresses, and phone numbers. | ||
|
|
||
| Unlike the cleaner components, the extractor does **not** modify the document text. Instead, it adds the detected entities as structured metadata, letting you inspect or act on PII findings without altering the original content. This is useful when you want to audit what PII is present before deciding how to handle it. | ||
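As a rough illustration of the auditing use case, the sketch below tallies entity types across documents. It assumes each entry in `meta["entities"]` is a dict with an `entity_type` key, matching the example output in the Usage section. The `summarize_pii` helper and the plain-dict stand-ins for Document metadata are hypothetical, so the snippet runs without Presidio installed.

```python
from collections import Counter

# Hypothetical stand-ins for Document.meta after the extractor has run.
# The entry shape mirrors the example output in the Usage section.
metas = [
    {"entities": [
        {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
        {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
    ]},
    {"entities": [
        {"entity_type": "PERSON", "start": 5, "end": 8, "score": 0.85},
    ]},
    {},  # a document with no detected entities
]


def summarize_pii(metas):
    """Count how often each entity type occurs across all metadata dicts."""
    counts = Counter()
    for meta in metas:
        for entity in meta.get("entities", []):
            counts[entity["entity_type"]] += 1
    return counts


summary = summarize_pii(metas)
print(dict(summary))
```

A summary like this could feed a decision on whether to run one of the cleaner components before indexing.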
|
|
||
|
|
||
| ## Installation | ||
|
|
||
| ```bash | ||
| pip install presidio-haystack | ||
| # Download the spaCy NLP model for English. For other languages, replace with the appropriate spaCy model. | ||
| python -m spacy download en_core_web_lg | ||
|
|
||
| ``` | ||
|
|
||
| ## Configuration | ||
|
|
||
| | Parameter | Default | Description | | ||
| | --- | --- | --- | | ||
| | `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | | ||
| | `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See [supported entities](https://microsoft.github.io/presidio/supported_entities/). | | ||
| | `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | | ||
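To make the effect of `score_threshold` concrete, here is a minimal sketch of threshold filtering over already-detected entities. It only illustrates the idea and is not the component's implementation: `filter_by_score` is a hypothetical helper, the sample scores are made up, and treating the boundary as inclusive is an assumption.

```python
def filter_by_score(entities, score_threshold=0.35):
    """Keep entities whose confidence meets the threshold (inclusive, by assumption)."""
    return [e for e in entities if e["score"] >= score_threshold]


# Made-up detections; the last one has a deliberately low confidence score.
detected = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
    {"entity_type": "URL", "start": 23, "end": 34, "score": 0.3},
]

default_kept = [e["entity_type"] for e in filter_by_score(detected)]
strict_kept = [e["entity_type"] for e in filter_by_score(detected, score_threshold=0.9)]
print(default_kept)
print(strict_kept)
```

Raising the threshold trades recall for precision: fewer false positives, but some genuine PII may go unreported.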
|
|
||
| ## Usage | ||
|
|
||
|
|
||
| Install the `presidio-haystack` package to use the `PresidioEntityExtractor`. | ||
|
|
||
|
|
||
| ### On its own | ||
|
|
||
| ```python | ||
| from haystack import Document | ||
| from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor | ||
|
|
||
| extractor = PresidioEntityExtractor() | ||
| result = extractor.run(documents=[ | ||
| Document(content="Contact Alice at alice@example.com") | ||
| ]) | ||
| print(result["documents"][0].meta["entities"]) | ||
| # [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}, | ||
| # {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}] | ||
| ``` | ||
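The `start` and `end` values can be used to slice the matched text out of the document content. This is a small sketch that assumes `end` is exclusive, which is consistent with the offsets in the example output above. It uses a hard-coded copy of that output so it runs without Presidio installed.

```python
content = "Contact Alice at alice@example.com"

# Copied from the example output above; in practice this would come from
# result["documents"][0].meta["entities"].
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]

# Pair each detected entity type with the substring it covers.
matches = [(e["entity_type"], content[e["start"]:e["end"]]) for e in entities]
for entity_type, text in matches:
    print(f"{entity_type}: {text}")
```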
|
|
||
| ### Using custom parameters | ||
|
|
||
| To customize entity detection, pass parameters when initializing the extractor: | ||
|
|
||
| ```python | ||
| from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor | ||
|
|
||
| extractor = PresidioEntityExtractor( | ||
| language="de", | ||
| entities=["PERSON", "EMAIL_ADDRESS"], | ||
| score_threshold=0.7, | ||
| ) | ||
| ``` | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,91 @@ | ||
| --- | ||
| title: "PresidioDocumentCleaner" | ||
| id: presidiodocumentcleaner | ||
| slug: "/presidiodocumentcleaner" | ||
| description: "Use `PresidioDocumentCleaner` to replace PII in Document text with entity type placeholders, powered by Microsoft Presidio." | ||
| --- | ||
|
|
||
| # PresidioDocumentCleaner | ||
|
|
||
| `PresidioDocumentCleaner` replaces personally identifiable information (PII) in the text content of Documents with entity type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. Original Documents are not mutated. Documents without text content pass through unchanged. | ||
|
|
||
| <div className="key-value-table"> | ||
|
|
||
| | | | | ||
| | --- | --- | | ||
| | **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store | | ||
| | **Mandatory run variables** | `documents`: A list of Document objects | | ||
| | **Output variables** | `documents`: A list of Document objects with PII replaced | | ||
| | **API reference** | [Presidio](/reference/integrations-presidio) | | ||
| | **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | | ||
|
|
||
| </div> | ||
|
|
||
|
|
||
| ## Installation | ||
|
|
||
| ```bash | ||
| pip install presidio-haystack | ||
| # Download the English NLP model required by Presidio's analyzer engine | ||
| python -m spacy download en_core_web_lg | ||
| ``` | ||
|
|
||
| ## Usage | ||
|
|
||
| ### On its own | ||
|
|
||
| ```python | ||
| from haystack import Document | ||
| from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner | ||
|
|
||
| cleaner = PresidioDocumentCleaner() | ||
| result = cleaner.run(documents=[ | ||
| Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.") | ||
| ]) | ||
| print(result["documents"][0].content) | ||
| # Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>. | ||
| ``` | ||
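Conceptually, the replacement swaps each detected span for a placeholder built from its entity type. The sketch below illustrates that idea in plain Python. It is not how Presidio's anonymizer is implemented: `replace_with_placeholders` is a hypothetical helper, and the spans are hand-written and assumed not to overlap.

```python
def replace_with_placeholders(text, entities):
    """Replace each (non-overlapping) span with <ENTITY_TYPE>.

    Spans are processed from right to left so that earlier offsets stay
    valid while the string changes length.
    """
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f"<{e['entity_type']}>" + text[e["end"]:]
    return text


text = "Contact Alice at alice@example.com"
# Hand-written spans for illustration; a real run would take them from the analyzer.
spans = [
    {"entity_type": "PERSON", "start": 8, "end": 13},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34},
]
cleaned = replace_with_placeholders(text, spans)
print(cleaned)
```

The function returns a new string rather than editing in place, which mirrors the statement above that original Documents are not mutated.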
|
|
||
| ### In a pipeline | ||
|
|
||
| ```python | ||
| from haystack import Document, Pipeline | ||
| from haystack.components.writers import DocumentWriter | ||
| from haystack.document_stores.in_memory import InMemoryDocumentStore | ||
| from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner | ||
|
|
||
| document_store = InMemoryDocumentStore() | ||
|
|
||
| indexing_pipeline = Pipeline() | ||
| indexing_pipeline.add_component("cleaner", PresidioDocumentCleaner()) | ||
| indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store)) | ||
| indexing_pipeline.connect("cleaner", "writer") | ||
|
|
||
| indexing_pipeline.run({ | ||
| "cleaner": { | ||
| "documents": [ | ||
| Document(content="Alice Smith's email is alice@example.com"), | ||
| Document(content="Call Bob at 212-555-9876"), | ||
| ] | ||
| } | ||
| }) | ||
| ``` | ||
|
|
||
| ## Configuration | ||
|
Contributor
Let's follow the same structure as the entity extractor page and put this configuration section right after the overview section.
||
|
|
||
| | Parameter | Default | Description | | ||
| | --- | --- | --- | | ||
| | `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | | ||
| | `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | | ||
| | `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | | ||
|
|
||
| ```python | ||
|
|
||
| from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner | ||
|
|
||
| cleaner = PresidioDocumentCleaner( | ||
| language="de", | ||
| entities=["PERSON", "EMAIL_ADDRESS"], | ||
| score_threshold=0.7, | ||
| ) | ||
| ``` | ||
|
|
||
| See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. | ||
|
Contributor
Forgot to remove this line and also put the link in the configuration table. Make sure to do this for the text cleaner as well.
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,86 @@ | ||
| --- | ||
| title: "PresidioTextCleaner" | ||
| id: presidiotextcleaner | ||
| slug: "/presidiotextcleaner" | ||
| description: "Use `PresidioTextCleaner` to replace PII in plain strings, powered by Microsoft Presidio." | ||
| --- | ||
|
|
||
| # PresidioTextCleaner | ||
|
|
||
| `PresidioTextCleaner` replaces personally identifiable information (PII) in plain strings with entity type placeholders such as `<PERSON>`. It takes a `list[str]` as input and returns a `list[str]`, making it easy to sanitize user queries before they are sent to an LLM. | ||
|
|
||
| <div className="key-value-table"> | ||
|
|
||
| | | | | ||
| | --- | --- | | ||
| | **Most common position in a pipeline** | In a query pipeline, before a Generator or Chat Generator | | ||
| | **Mandatory run variables** | `texts`: A list of strings | | ||
| | **Output variables** | `texts`: A list of strings with PII replaced | | ||
| | **API reference** | [Presidio](/reference/integrations-presidio) | | ||
| | **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio | | ||
|
|
||
| </div> | ||
|
|
||
|
|
||
| ## Installation | ||
|
|
||
| ```bash | ||
| pip install presidio-haystack | ||
| # Download the English NLP model required by Presidio's analyzer engine | ||
| python -m spacy download en_core_web_lg | ||
| ``` | ||
|
|
||
| ## Usage | ||
|
|
||
| ### On its own | ||
|
|
||
| ```python | ||
| from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner | ||
|
|
||
| cleaner = PresidioTextCleaner() | ||
| result = cleaner.run(texts=["My name is John Doe, my SSN is 123-45-6789"]) | ||
| print(result["texts"][0]) | ||
| # My name is <PERSON>, my SSN is <US_SSN> | ||
| ``` | ||
|
|
||
| ### In a pipeline | ||
|
|
||
| ```python | ||
| from haystack import Pipeline | ||
| from haystack.components.builders import ChatPromptBuilder | ||
| from haystack.components.generators.chat import OpenAIChatGenerator | ||
| from haystack.dataclasses import ChatMessage | ||
| from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner | ||
|
|
||
| template = [ChatMessage.from_user("Answer this question: {{query}}")] | ||
|
|
||
| query_pipeline = Pipeline() | ||
| query_pipeline.add_component("cleaner", PresidioTextCleaner()) | ||
| query_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template)) | ||
| query_pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini")) | ||
| query_pipeline.connect("cleaner.texts[0]", "prompt_builder.query") | ||
| query_pipeline.connect("prompt_builder", "llm") | ||
|
|
||
| query_pipeline.run({ | ||
| "cleaner": {"texts": ["My name is John Smith. What is the capital of France?"]} | ||
| }) | ||
| ``` | ||
|
|
||
| ## Configuration | ||
|
Contributor
Same comment here: put this right after the overview section.
||
|
|
||
| | Parameter | Default | Description | | ||
| | --- | --- | --- | | ||
| | `language` | `"en"` | Language code for PII detection. See [supported languages](https://microsoft.github.io/presidio/analyzer/languages/). | | ||
| | `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. | | ||
| | `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. | | ||
|
|
||
| ```python | ||
|
|
||
| from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner | ||
|
|
||
| cleaner = PresidioTextCleaner( | ||
| language="de", | ||
| entities=["PERSON", "EMAIL_ADDRESS"], | ||
| score_threshold=0.7, | ||
| ) | ||
| ``` | ||
|
|
||
| See [Presidio supported entities](https://microsoft.github.io/presidio/supported_entities/) for the full list of detectable PII types. | ||
|
Contributor
Same comment here: make sure to remove this line and add the link to the configuration table.
||