| 1 | +--- |
| 2 | +layout: integration |
| 3 | +name: Tonic Textual |
| 4 | +description: PII detection, transformation, and entity extraction for Haystack pipelines, powered by Tonic Textual. |
| 5 | +authors: |
| 6 | + - name: Tonic AI |
| 7 | + socials: |
| 8 | + github: tonicai |
| 9 | +pypi: https://pypi.org/project/textual-haystack/ |
| 10 | +repo: https://github.com/tonicai/textual-haystack |
| 11 | +type: Custom Component |
| 12 | +report_issue: https://github.com/tonicai/textual-haystack/issues |
| 13 | +logo: /logos/tonic-textual.png |
| 14 | +version: Haystack 2.0 |
| 15 | +toc: true |
| 16 | +--- |
| 17 | + |
| 18 | +**Table of Contents** |
| 19 | + |
| 20 | +- [Overview](#overview) |
| 21 | +- [Installation](#installation) |
| 22 | +- [Usage](#usage) |
| 23 | + - [Document Cleaning](#document-cleaning) |
| 24 | + - [Entity Extraction](#entity-extraction) |
| 25 | + - [Pipeline Usage](#pipeline-usage) |
| 26 | + - [Configuration](#configuration) |
| 27 | +- [License](#license) |
| 28 | + |
| 29 | +## Overview |
| 30 | + |
| 31 | +[Tonic Textual](https://docs.tonic.ai/textual) is a PII detection and transformation platform powered by transformer-based NER models that identify 46+ entity types across 50+ languages. |
| 32 | + |
| 33 | +`textual-haystack` provides two Haystack components: |
| 34 | + |
| 35 | +| Component | Purpose | |
| 36 | +|-----------|---------| |
| 37 | +| `TonicTextualDocumentCleaner` | Synthesize or tokenize PII in document content before ingestion | |
| 38 | +| `TonicTextualEntityExtractor` | Extract PII entities and store them as structured document metadata | |
| 39 | + |
| 40 | +Use the document cleaner to sanitize documents before they enter your RAG pipeline — replacing real PII with realistic synthetic data or reversible placeholder tokens. Use the entity extractor to detect PII and attach structured metadata (entity type, value, location, confidence) to documents for hybrid retrieval, auditing, or compliance workflows. |
| 41 | + |
| 42 | +## Installation |
| 43 | + |
| 44 | +```bash |
| 45 | +pip install textual-haystack |
| 46 | +``` |
| 47 | + |
| 48 | +Set your [Tonic Textual](https://textual.tonic.ai) API key in the `TONIC_TEXTUAL_API_KEY` environment variable: |
| 49 | + |
| 50 | +```bash |
| 51 | +export TONIC_TEXTUAL_API_KEY="your-api-key" |
| 52 | +``` |
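It can help to confirm the variable is set before constructing any components. The following is a minimal sketch and not part of `textual-haystack`: the helper name `require_api_key` is hypothetical, and only the variable name `TONIC_TEXTUAL_API_KEY` comes from the instructions above.

```python
import os


def require_api_key(env=None, name="TONIC_TEXTUAL_API_KEY"):
    """Return the API key from the environment or raise a clear error.

    Hypothetical helper for illustration; not provided by textual-haystack.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the pipeline.")
    return key


# Pass an explicit mapping so the example runs without a real key.
print(require_api_key({"TONIC_TEXTUAL_API_KEY": "your-api-key"}))
```

Calling `require_api_key()` with no arguments reads from `os.environ`, which is what you would do at application startup.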
| 53 | + |
| 54 | +## Usage |
| 55 | + |
| 56 | +### Document Cleaning |
| 57 | + |
| 58 | +Sanitize documents before ingestion by synthesizing PII with realistic fake data: |
| 59 | + |
| 60 | +```python |
| 61 | +from haystack.dataclasses import Document |
| 62 | +from haystack_integrations.components.tonic_textual import TonicTextualDocumentCleaner |
| 63 | + |
| 64 | +cleaner = TonicTextualDocumentCleaner(generator_default="Synthesis") |
| 65 | +result = cleaner.run(documents=[ |
| 66 | + Document(content="Patient John Smith, DOB 03/15/1982, was admitted for chest pain.") |
| 67 | +]) |
| 68 | +print(result["documents"][0].content) |
| 69 | +# Example output: "Patient Maria Chen, DOB 07/22/1975, was admitted for chest pain." |
| 70 | +``` |
| 71 | + |
| 72 | +Or set `generator_default="Redaction"` to tokenize PII with reversible placeholder tokens: |
| 73 | + |
| 74 | +```python |
| 75 | +cleaner = TonicTextualDocumentCleaner(generator_default="Redaction") |
| 76 | +result = cleaner.run(documents=[ |
| 77 | + Document(content="Contact Jane Doe at jane@example.com.") |
| 78 | +]) |
| 79 | +print(result["documents"][0].content) |
| 80 | +# "Contact [NAME_GIVEN_xxxx] [NAME_FAMILY_xxxx] at [EMAIL_ADDRESS_xxxx]." |
| 81 | +``` |
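To audit what was tokenized, you can count the placeholders in the cleaned text. This is a rough sketch that assumes every token has the shape `[ENTITY_TYPE_suffix]` shown in the example output above; the regular expression and the helper name `count_placeholders` are illustrative and not part of the library, and the real suffix format may differ.

```python
import re
from collections import Counter

# Assumed token shape, based on the example output above: [ENTITY_TYPE_suffix].
# The suffix is taken to be everything after the last underscore.
TOKEN_PATTERN = re.compile(r"\[([A-Z][A-Z_]*)_([^\[\]_\s]+)\]")


def count_placeholders(text):
    """Count placeholder tokens in text, keyed by entity type."""
    return Counter(match.group(1) for match in TOKEN_PATTERN.finditer(text))


tokenized = "Contact [NAME_GIVEN_xxxx] [NAME_FAMILY_xxxx] at [EMAIL_ADDRESS_xxxx]."
counts = count_placeholders(tokenized)
print(dict(counts))
```

A per-type count like this is a quick sanity check that the expected entity types were caught before documents are indexed.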
| 82 | + |
| 83 | +### Entity Extraction |
| 84 | + |
| 85 | +Detect PII entities and store them as structured metadata on documents: |
| 86 | + |
| 87 | +```python |
| 88 | +from haystack.dataclasses import Document |
| 89 | +from haystack_integrations.components.tonic_textual import TonicTextualEntityExtractor |
| 90 | + |
| 91 | +extractor = TonicTextualEntityExtractor() |
| 92 | +result = extractor.run(documents=[ |
| 93 | + Document(content="My name is John Smith and my email is john@example.com.") |
| 94 | +]) |
| 95 | + |
| 96 | +for entity in TonicTextualEntityExtractor.get_stored_annotations(result["documents"][0]): |
| 97 | + print(f"{entity.entity}: {entity.text} (confidence: {entity.score:.2f})") |
| 98 | +# NAME_GIVEN: John (confidence: 0.90) |
| 99 | +# NAME_FAMILY: Smith (confidence: 0.90) |
| 100 | +# EMAIL_ADDRESS: john@example.com (confidence: 0.95) |
| 101 | +``` |
| 102 | + |
| 103 | +Annotations are stored in `doc.meta["named_entities"]` as `PiiEntityAnnotation` dataclass instances with `entity`, `text`, `start`, `end`, and `score` fields. |
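Once annotations are on the document, downstream code can filter and group them like any other metadata. The sketch below uses a stand-in dataclass with the five documented fields so that it runs without an API key; `Annotation`, `make_annotation`, and `group_by_type` are hypothetical names, and treating `end` as an exclusive offset is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Annotation:
    """Stand-in mirroring the documented fields; not the library class."""
    entity: str
    text: str
    start: int
    end: int
    score: float


def make_annotation(content, entity, text, score):
    """Build an annotation whose offsets point at text inside content."""
    start = content.index(text)
    return Annotation(entity, text, start, start + len(text), score)


def group_by_type(annotations, min_score=0.0):
    """Group entity texts by type, dropping annotations below min_score."""
    grouped = defaultdict(list)
    for ann in annotations:
        if ann.score >= min_score:
            grouped[ann.entity].append(ann.text)
    return dict(grouped)


content = "My name is John Smith and my email is john@example.com."
meta = {
    "named_entities": [
        make_annotation(content, "NAME_GIVEN", "John", 0.90),
        make_annotation(content, "NAME_FAMILY", "Smith", 0.90),
        make_annotation(content, "EMAIL_ADDRESS", "john@example.com", 0.95),
    ]
}

# Keep only high-confidence entities.
print(group_by_type(meta["named_entities"], min_score=0.92))
```

With real documents, the list would come from `TonicTextualEntityExtractor.get_stored_annotations(doc)` instead of the hand-built `meta` dictionary.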
| 104 | + |
| 105 | +### Pipeline Usage |
| 106 | + |
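| 107 | +Both components accept and return `list[Document]`, so they slot directly into any Haystack pipeline. The example below chains them: the cleaner replaces PII first, then the extractor runs on the cleaned text, so the entities it reports are the synthetic replacements rather than the original values: |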
| 108 | + |
| 109 | +```python |
| 110 | +from haystack import Pipeline |
| 111 | +from haystack.dataclasses import Document |
| 112 | +from haystack_integrations.components.tonic_textual import ( |
| 113 | + TonicTextualDocumentCleaner, |
| 114 | + TonicTextualEntityExtractor, |
| 115 | +) |
| 116 | + |
| 117 | +pipeline = Pipeline() |
| 118 | +pipeline.add_component("cleaner", TonicTextualDocumentCleaner(generator_default="Synthesis")) |
| 119 | +pipeline.add_component("extractor", TonicTextualEntityExtractor()) |
| 120 | +pipeline.connect("cleaner", "extractor") |
| 121 | + |
| 122 | +result = pipeline.run({ |
| 123 | + "cleaner": { |
| 124 | + "documents": [ |
| 125 | + Document(content="Contact Jane Doe at jane@example.com or (555) 867-5309."), |
| 126 | + ] |
| 127 | + } |
| 128 | +}) |
| 129 | + |
| 130 | +for doc in result["extractor"]["documents"]: |
| 131 | + entities = TonicTextualEntityExtractor.get_stored_annotations(doc) |
| 132 | + print(f"Cleaned: {doc.content}") |
| 133 | + print(f"Entities: {[(e.entity, e.text) for e in entities]}") |
| 134 | +``` |
| 135 | + |
| 136 | +### Configuration |
| 137 | + |
| 138 | +**Per-entity control:** set `generator_default="Off"` so unlisted entity types are left unchanged, then choose synthesis or tokenization for each PII type: |
| 139 | + |
| 140 | +```python |
| 141 | +cleaner = TonicTextualDocumentCleaner( |
| 142 | + generator_default="Off", |
| 143 | + generator_config={ |
| 144 | + "NAME_GIVEN": "Synthesis", |
| 145 | + "NAME_FAMILY": "Synthesis", |
| 146 | + "EMAIL_ADDRESS": "Redaction", |
| 147 | + "US_SSN": "Redaction", |
| 148 | + }, |
| 149 | +) |
| 150 | +``` |
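When the per-entity mapping grows, building it from two lists makes conflicts easier to spot. This is a small sketch in plain Python; `build_generator_config` is a hypothetical helper, not part of `textual-haystack`, and it only produces a dictionary in the same shape as the `generator_config` argument shown above.

```python
def build_generator_config(synthesize=(), redact=()):
    """Map entity types to generator modes, rejecting conflicting entries."""
    overlap = set(synthesize) & set(redact)
    if overlap:
        raise ValueError(f"Entity types listed twice: {sorted(overlap)}")
    config = {entity: "Synthesis" for entity in synthesize}
    config.update({entity: "Redaction" for entity in redact})
    return config


generator_config = build_generator_config(
    synthesize=["NAME_GIVEN", "NAME_FAMILY"],
    redact=["EMAIL_ADDRESS", "US_SSN"],
)
print(generator_config)
```

The resulting dictionary can be passed as `generator_config=generator_config` alongside `generator_default="Off"`.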
| 151 | + |
| 152 | +**Self-hosted deployment:** |
| 153 | + |
| 154 | +```python |
| 155 | +cleaner = TonicTextualDocumentCleaner( |
| 156 | + base_url="https://textual.your-company.com" |
| 157 | +) |
| 158 | +``` |
| 159 | + |
| 160 | +**Explicit API key:** |
| 161 | + |
| 162 | +```python |
| 163 | +from haystack.utils.auth import Secret |
| 164 | + |
| 165 | +cleaner = TonicTextualDocumentCleaner( |
| 166 | + api_key=Secret.from_token("your-api-key") |
| 167 | +) |
| 168 | +``` |
| 169 | + |
| 170 | +## License |
| 171 | + |
| 172 | +`textual-haystack` is licensed under the [MIT License](https://github.com/tonicai/textual-haystack/blob/main/LICENSE). |