title: PresidioDocumentCleaner
id: presidiodocumentcleaner
slug: /presidiodocumentcleaner
description: Use `PresidioDocumentCleaner` to replace PII in Document text with entity type placeholders, powered by Microsoft Presidio.

PresidioDocumentCleaner

`PresidioDocumentCleaner` replaces personally identifiable information (PII) in the text content of Documents with entity type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. Original Documents are not mutated. Documents without text content pass through unchanged.

| | |
| --- | --- |
| **Most common position in a pipeline** | In an indexing pipeline, before writing Documents to a Document Store |
| **Mandatory run variables** | `documents`: A list of Document objects |
| **Output variables** | `documents`: A list of Document objects with PII replaced |
| **API reference** | Presidio |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio |

Installation

pip install presidio-haystack
# Download the English NLP model required by Presidio's analyzer engine
python -m spacy download en_core_web_lg

Usage

On its own

from haystack import Document
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner

cleaner = PresidioDocumentCleaner()
result = cleaner.run(documents=[
    Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.")
])
print(result["documents"][0].content)
# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
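
To make the replacement step concrete, here is a minimal, dependency-free sketch of how detected character spans can be turned into placeholders while leaving the input untouched. It is an illustration only, not the component's actual implementation: `Span`, `Doc`, `find`, `replace_with_placeholders`, and `clean` are hypothetical names, and the spans are located by plain string search instead of by Presidio's analyzer.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical detection result: a character range plus an entity type."""
    start: int
    end: int
    entity_type: str


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a Document; only the text content matters here."""
    content: Optional[str]


def replace_with_placeholders(text: str, spans: List[Span]) -> str:
    # Replace from the last span to the first so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


def clean(doc: Doc, spans: List[Span]) -> Doc:
    # Documents without text content pass through unchanged.
    if doc.content is None:
        return doc
    # Build a new object instead of mutating the input.
    return replace(doc, content=replace_with_placeholders(doc.content, spans))


def find(text: str, needle: str, entity_type: str) -> Span:
    # Stand-in for a real analyzer: locate a known substring.
    start = text.index(needle)
    return Span(start, start + len(needle), entity_type)


text = "Contact Alice Smith at alice@example.com or 212-555-1234."
spans = [
    find(text, "Alice Smith", "PERSON"),
    find(text, "alice@example.com", "EMAIL_ADDRESS"),
    find(text, "212-555-1234", "PHONE_NUMBER"),
]

original = Doc(content=text)
cleaned = clean(original, spans)
print(cleaned.content)
# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
print(original.content == text)
# True
```

Processing spans from right to left is a design choice in this sketch: each placeholder has a different length than the text it replaces, so replacing left to right would shift the offsets of every later span.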

In a pipeline

from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner

document_store = InMemoryDocumentStore()

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("cleaner", PresidioDocumentCleaner())
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("cleaner", "writer")

indexing_pipeline.run({
    "cleaner": {
        "documents": [
            Document(content="Alice Smith's email is alice@example.com"),
            Document(content="Call Bob at 212-555-9876"),
        ]
    }
})

Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| `language` | `"en"` | Language code for PII detection. See supported languages. |
| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. |
| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. |

from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner

cleaner = PresidioDocumentCleaner(
    language="de",
    entities=["PERSON", "EMAIL_ADDRESS"],
    score_threshold=0.7,
)

See Presidio supported entities for the full list of detectable PII types.
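
As a rough mental model of how `entities` and `score_threshold` narrow what gets replaced, the sketch below filters a list of detections the way the parameter descriptions suggest. It does not call Presidio: `Detection` and `select_detections` are hypothetical names, the scores are made up for illustration, and treating the threshold as inclusive (`>=`) is an assumption.

```python
from typing import List, NamedTuple, Optional


class Detection(NamedTuple):
    """Hypothetical detection: an entity type and a confidence score in [0, 1]."""
    entity_type: str
    score: float


def select_detections(
    detections: List[Detection],
    entities: Optional[List[str]] = None,
    score_threshold: float = 0.35,
) -> List[Detection]:
    # Keep a detection only if it is confident enough (assumed inclusive)
    # and, when an entity list is given, its type is on that list.
    return [
        d for d in detections
        if d.score >= score_threshold
        and (entities is None or d.entity_type in entities)
    ]


# Made-up scores, for illustration only.
detections = [
    Detection("PERSON", 0.85),
    Detection("EMAIL_ADDRESS", 1.0),
    Detection("PHONE_NUMBER", 0.4),
    Detection("DATE_TIME", 0.3),
]

# Defaults: every entity type, threshold 0.35.
default_selection = select_detections(detections)
print([d.entity_type for d in default_selection])
# ['PERSON', 'EMAIL_ADDRESS', 'PHONE_NUMBER']

# Stricter settings, mirroring the configuration example above.
strict_selection = select_detections(
    detections,
    entities=["PERSON", "EMAIL_ADDRESS"],
    score_threshold=0.7,
)
print([d.entity_type for d in strict_selection])
# ['PERSON', 'EMAIL_ADDRESS']
```

Raising the threshold trades recall for precision: fewer false positives are replaced, but low-confidence PII is more likely to remain in the text.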