Commit ccba19d

akamor and bilgeyucel authored
Add Tonic Textual integration (#420)
* Adding Textual documentation
* Update integrations/tonic-textual.md

---------

Co-authored-by: Bilge Yücel <bilge.yucel@deepset.ai>
1 parent 189fc64 commit ccba19d

2 files changed: +172 -0 lines changed

integrations/tonic-textual.md

Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
---
layout: integration
name: Tonic Textual
description: PII detection, transformation, and entity extraction for Haystack pipelines, powered by Tonic Textual.
authors:
  - name: Tonic AI
    socials:
      github: tonicai
pypi: https://pypi.org/project/textual-haystack/
repo: https://github.com/tonicai/textual-haystack
type: Custom Component
report_issue: https://github.com/tonicai/textual-haystack/issues
logo: /logos/tonic-textual.png
version: Haystack 2.0
toc: true
---

**Table of Contents**

- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
  - [Document Cleaning](#document-cleaning)
  - [Entity Extraction](#entity-extraction)
  - [Pipeline Usage](#pipeline-usage)
  - [Configuration](#configuration)
- [License](#license)

## Overview

[Tonic Textual](https://docs.tonic.ai/textual) is a PII detection and transformation platform powered by transformer-based NER models that identify 46+ entity types across 50+ languages.

`textual-haystack` provides two Haystack components:

| Component | Purpose |
|-----------|---------|
| `TonicTextualDocumentCleaner` | Synthesize or tokenize PII in document content before ingestion |
| `TonicTextualEntityExtractor` | Extract PII entities and store them as structured document metadata |

Use the document cleaner to sanitize documents before they enter your RAG pipeline, replacing real PII with realistic synthetic data or reversible placeholder tokens. Use the entity extractor to detect PII and attach structured metadata (entity type, value, location, confidence) to documents for hybrid retrieval, auditing, or compliance workflows.

## Installation

```bash
pip install textual-haystack
```

You will need a [Tonic Textual](https://textual.tonic.ai) API key:

```bash
export TONIC_TEXTUAL_API_KEY="your-api-key"
```
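
Before constructing either component, it can help to fail fast when the key is missing. The following is a minimal sketch, assuming the components read `TONIC_TEXTUAL_API_KEY` from the environment as the export above suggests. `require_api_key` is a hypothetical helper for illustration and is not part of `textual-haystack`:

```python
import os


def require_api_key(env_var: str = "TONIC_TEXTUAL_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of textual-haystack.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before running the pipeline.")
    return key
```

Calling `require_api_key()` at startup surfaces a missing key before any documents are sent for processing.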

## Usage

### Document Cleaning

Sanitize documents before ingestion by synthesizing PII with realistic fake data:

```python
from haystack.dataclasses import Document
from haystack_integrations.components.tonic_textual import TonicTextualDocumentCleaner

cleaner = TonicTextualDocumentCleaner(generator_default="Synthesis")
result = cleaner.run(documents=[
    Document(content="Patient John Smith, DOB 03/15/1982, was admitted for chest pain.")
])
print(result["documents"][0].content)
# "Patient Maria Chen, DOB 07/22/1975, was admitted for chest pain."
```

Or tokenize PII with reversible placeholder tokens:

```python
cleaner = TonicTextualDocumentCleaner(generator_default="Redaction")
result = cleaner.run(documents=[
    Document(content="Contact Jane Doe at jane@example.com.")
])
print(result["documents"][0].content)
# "Contact [NAME_GIVEN_xxxx] [NAME_FAMILY_xxxx] at [EMAIL_ADDRESS_xxxx]."
```
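
One way to sanity-check tokenized output downstream is to scan it for placeholder tokens. The following is a minimal sketch, assuming tokens follow the bracketed `[ENTITY_TYPE_suffix]` shape shown in the example output above. The exact token format is an assumption here and should be checked against real output, and `find_placeholder_tokens` is a hypothetical helper, not part of `textual-haystack`:

```python
import re

# Assumed token shape: [ENTITY_TYPE_suffix], e.g. [EMAIL_ADDRESS_xxxx].
# ENTITY_TYPE is upper-case words joined by underscores; the suffix is alphanumeric.
TOKEN_PATTERN = re.compile(r"\[([A-Z]+(?:_[A-Z]+)*)_([A-Za-z0-9]+)\]")


def find_placeholder_tokens(text: str) -> list[tuple[str, str]]:
    """Return (entity_type, suffix) pairs for each placeholder token in text."""
    return [(m.group(1), m.group(2)) for m in TOKEN_PATTERN.finditer(text)]


cleaned = "Contact [NAME_GIVEN_xxxx] [NAME_FAMILY_xxxx] at [EMAIL_ADDRESS_xxxx]."
tokens = find_placeholder_tokens(cleaned)
print(tokens)
# [('NAME_GIVEN', 'xxxx'), ('NAME_FAMILY', 'xxxx'), ('EMAIL_ADDRESS', 'xxxx')]
```

A check like this can run in tests to confirm that an expected entity type was actually tokenized.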

### Entity Extraction

Detect PII entities and store them as structured metadata on documents:

```python
from haystack.dataclasses import Document
from haystack_integrations.components.tonic_textual import TonicTextualEntityExtractor

extractor = TonicTextualEntityExtractor()
result = extractor.run(documents=[
    Document(content="My name is John Smith and my email is john@example.com.")
])

for entity in TonicTextualEntityExtractor.get_stored_annotations(result["documents"][0]):
    print(f"{entity.entity}: {entity.text} (confidence: {entity.score:.2f})")
# NAME_GIVEN: John (confidence: 0.90)
# NAME_FAMILY: Smith (confidence: 0.90)
# EMAIL_ADDRESS: john@example.com (confidence: 0.95)
```

Annotations are stored in `doc.meta["named_entities"]` as `PiiEntityAnnotation` dataclass instances with `entity`, `text`, `start`, `end`, and `score` fields.
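
Because each annotation carries a type, a span, and a score, downstream filtering is plain Python. The following is a minimal sketch that uses a local stand-in dataclass with the five fields listed above, so it runs without the library or an API key. `Annotation`, `make_annotation`, `group_by_entity`, and the 0.92 threshold are illustrative assumptions, and treating `end` as exclusive is also an assumption rather than documented behavior:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Annotation:
    """Local stand-in mirroring the fields described for PiiEntityAnnotation."""
    entity: str
    text: str
    start: int
    end: int  # assumed exclusive for this sketch
    score: float


def make_annotation(content: str, entity: str, text: str, score: float) -> Annotation:
    """Build a stand-in annotation, locating the first occurrence of text."""
    start = content.index(text)
    return Annotation(entity, text, start, start + len(text), score)


def group_by_entity(annotations, min_score=0.0):
    """Group annotation texts by entity type, dropping low-confidence hits."""
    grouped = defaultdict(list)
    for ann in annotations:
        if ann.score >= min_score:
            grouped[ann.entity].append(ann.text)
    return dict(grouped)


content = "My name is John Smith and my email is john@example.com."
annotations = [
    make_annotation(content, "NAME_GIVEN", "John", 0.90),
    make_annotation(content, "NAME_FAMILY", "Smith", 0.90),
    make_annotation(content, "EMAIL_ADDRESS", "john@example.com", 0.95),
]

print(group_by_entity(annotations, min_score=0.92))
# {'EMAIL_ADDRESS': ['john@example.com']}
```

With real output, the list returned by `get_stored_annotations` would take the place of the hand-built `annotations` list.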

### Pipeline Usage

Both components accept and return `list[Document]`, so they slot directly into any Haystack pipeline. Here they are chained together, cleaning PII first and then extracting entities from the cleaned text:

```python
from haystack import Pipeline
from haystack.dataclasses import Document
from haystack_integrations.components.tonic_textual import (
    TonicTextualDocumentCleaner,
    TonicTextualEntityExtractor,
)

pipeline = Pipeline()
pipeline.add_component("cleaner", TonicTextualDocumentCleaner(generator_default="Synthesis"))
pipeline.add_component("extractor", TonicTextualEntityExtractor())
pipeline.connect("cleaner", "extractor")

result = pipeline.run({
    "cleaner": {
        "documents": [
            Document(content="Contact Jane Doe at jane@example.com or (555) 867-5309."),
        ]
    }
})

for doc in result["extractor"]["documents"]:
    entities = TonicTextualEntityExtractor.get_stored_annotations(doc)
    print(f"Cleaned: {doc.content}")
    print(f"Entities: {[(e.entity, e.text) for e in entities]}")
```

### Configuration

**Per-entity control.** Mix synthesis and tokenization per PII type:

```python
cleaner = TonicTextualDocumentCleaner(
    generator_default="Off",
    generator_config={
        "NAME_GIVEN": "Synthesis",
        "NAME_FAMILY": "Synthesis",
        "EMAIL_ADDRESS": "Redaction",
        "US_SSN": "Redaction",
    },
)
```
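
When the per-entity mapping grows, it can be easier to build it from two lists. The following is a minimal sketch: `build_generator_config` is a hypothetical helper (not part of `textual-haystack`) that produces a dict in the same shape as `generator_config` above, using only the `"Synthesis"` and `"Redaction"` values shown in this document:

```python
def build_generator_config(synthesize=(), redact=()):
    """Map each entity type to "Synthesis" or "Redaction".

    Raises ValueError if an entity type appears in both groups.
    """
    overlap = set(synthesize) & set(redact)
    if overlap:
        raise ValueError(f"Entity types listed twice: {sorted(overlap)}")
    config = {entity: "Synthesis" for entity in synthesize}
    config.update({entity: "Redaction" for entity in redact})
    return config


generator_config = build_generator_config(
    synthesize=["NAME_GIVEN", "NAME_FAMILY"],
    redact=["EMAIL_ADDRESS", "US_SSN"],
)
print(generator_config)
```

The resulting dict can then be passed as `generator_config=` alongside `generator_default="Off"`, as in the example above.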

**Self-hosted deployment:**

```python
cleaner = TonicTextualDocumentCleaner(
    base_url="https://textual.your-company.com"
)
```

**Explicit API key:**

```python
from haystack.utils.auth import Secret

cleaner = TonicTextualDocumentCleaner(
    api_key=Secret.from_token("your-api-key")
)
```

## License

`textual-haystack` is licensed under the [MIT License](https://github.com/tonicai/textual-haystack/blob/main/LICENSE).

logos/tonic-textual.png

29.5 KB