| title | id | slug | description |
|---|---|---|---|
| PresidioDocumentCleaner | presidiodocumentcleaner | /presidiodocumentcleaner | Use `PresidioDocumentCleaner` to replace PII in Document text with entity type placeholders, powered by Microsoft Presidio. |
`PresidioDocumentCleaner` replaces personally identifiable information (PII) in the text content of Documents with entity type placeholders such as `<PERSON>` or `<EMAIL_ADDRESS>`. The original Documents are not mutated, and Documents without text content pass through unchanged.
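The behaviour described above (placeholders in the output, untouched inputs, pass-through for Documents without text) can be illustrated with a small standard-library sketch. This is not the component's implementation: `ToyDocument`, `toy_clean`, and the single email regex are hypothetical stand-ins, whereas Presidio relies on NLP models and many recognizers.

```python
import re
from dataclasses import dataclass, replace
from typing import List, Optional

# Stand-in detector: one email regex. This is an illustrative assumption,
# not how Presidio detects PII.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass(frozen=True)
class ToyDocument:
    """Hypothetical minimal document type, not haystack.Document."""
    content: Optional[str] = None


def toy_clean(documents: List[ToyDocument]) -> List[ToyDocument]:
    """Return cleaned copies; never edit the input documents in place."""
    cleaned = []
    for doc in documents:
        if doc.content is None:
            # No text content: pass the document through unchanged.
            cleaned.append(doc)
            continue
        # Build a new document that carries the redacted text.
        redacted = EMAIL_RE.sub("<EMAIL_ADDRESS>", doc.content)
        cleaned.append(replace(doc, content=redacted))
    return cleaned


docs = [ToyDocument("Write to alice@example.com today."), ToyDocument(None)]
result = toy_clean(docs)

print(result[0].content)     # Write to <EMAIL_ADDRESS> today.
print(docs[0].content)       # Write to alice@example.com today.
print(result[1] is docs[1])  # True
```

The input list still holds the original text after the call, which is the property that makes it safe to keep the raw Documents elsewhere while indexing only the cleaned copies.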
pip install presidio-haystack
# Download the English NLP model required by Presidio's analyzer engine
python -m spacy download en_core_web_lg
from haystack import Document
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner
cleaner = PresidioDocumentCleaner()
result = cleaner.run(documents=[
    Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.")
])
print(result["documents"][0].content)
# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner
document_store = InMemoryDocumentStore()
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("cleaner", PresidioDocumentCleaner())
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("cleaner", "writer")
indexing_pipeline.run({
    "cleaner": {
        "documents": [
            Document(content="Alice Smith's email is alice@example.com"),
            Document(content="Call Bob at 212-555-9876"),
        ]
    }
})
| Parameter | Default | Description |
|---|---|---|
| `language` | `"en"` | Language code for PII detection. See supported languages. |
| `entities` | `None` | List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. |
| `score_threshold` | `0.35` | Minimum confidence score (0–1) for a detected entity to be included. |
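To make the effect of `entities` and `score_threshold` concrete, here is a minimal sketch that filters a list of detections and then substitutes placeholders. The `Detection` record, the helper functions, and the scores are all hypothetical and chosen for illustration; the sketch also assumes the threshold comparison is inclusive. Real detections and scores come from Presidio's analyzer.

```python
from typing import List, NamedTuple, Optional


class Detection(NamedTuple):
    """Hypothetical detection record: entity type, character span, confidence."""
    entity_type: str
    start: int
    end: int
    score: float


def select(detections: List[Detection],
           entities: Optional[List[str]] = None,
           score_threshold: float = 0.35) -> List[Detection]:
    """Keep detections of the requested types that reach the threshold."""
    return [
        d for d in detections
        if (entities is None or d.entity_type in entities)
        and d.score >= score_threshold  # assumed inclusive comparison
    ]


def apply_placeholders(text: str, detections: List[Detection]) -> str:
    """Replace each span with <ENTITY_TYPE>, working from the end of the text
    so that earlier character offsets stay valid."""
    for d in sorted(detections, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"<{d.entity_type}>" + text[d.end:]
    return text


text = "Contact Alice Smith at alice@example.com or 212-555-1234."


def detect(entity_type: str, needle: str, score: float) -> Detection:
    """Build a made-up detection by locating a known substring."""
    start = text.index(needle)
    return Detection(entity_type, start, start + len(needle), score)


# Made-up scores, for illustration only.
detections = [
    detect("PERSON", "Alice Smith", 0.85),
    detect("EMAIL_ADDRESS", "alice@example.com", 1.0),
    detect("PHONE_NUMBER", "212-555-1234", 0.4),
]

default_out = apply_placeholders(text, select(detections))
strict_out = apply_placeholders(
    text, select(detections, entities=["PERSON", "EMAIL_ADDRESS"], score_threshold=0.7)
)

print(default_out)  # Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
print(strict_out)   # Contact <PERSON> at <EMAIL_ADDRESS> or 212-555-1234.
```

Raising the threshold trades recall for precision: fewer false positives are redacted, but low-confidence PII such as the phone number in this toy data is left in the text.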
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner
cleaner = PresidioDocumentCleaner(
    language="de",
    entities=["PERSON", "EMAIL_ADDRESS"],
    score_threshold=0.7,
)
See Presidio supported entities for the full list of detectable PII types.