"Contact John Smith at john.smith@company.com or +1-555-0123"
)
# Output: "Contact Jennifer Williams at lisa45@example.net or +1-555-0198"
# Same entity always maps to same fake value within a document
```

> **Note**: Synthetic replacement preserves data utility for downstream ML tasks but is NOT GDPR-compliant anonymization. Same-document consistency is maintained (same entity text always maps to the same fake value).

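The same-document consistency described in the note can be pictured as a per-document cache from original entity text to its fake value. The sketch below is a minimal illustration under stated assumptions, not SqueakyCleanText's implementation: `ConsistentFaker` and the `FAKE_NAMES` pool are hypothetical stand-ins for whatever generator the library uses.

```python
import itertools

# Hypothetical pool standing in for a real fake-data generator.
FAKE_NAMES = ["Jennifer Williams", "Carlos Ortega", "Mei Tanaka"]


class ConsistentFaker:
    """Map each distinct entity text to one fake value for a single document."""

    def __init__(self, pool):
        # Cycling means values repeat once the pool is exhausted (sketch only).
        self._pool = itertools.cycle(pool)
        self._seen = {}  # original entity text -> fake value

    def replace(self, entity_text):
        # The first sighting draws a new fake value; later sightings reuse it.
        if entity_text not in self._seen:
            self._seen[entity_text] = next(self._pool)
        return self._seen[entity_text]


faker = ConsistentFaker(FAKE_NAMES)
first = faker.replace("John Smith")
second = faker.replace("Jane Doe")
again = faker.replace("John Smith")  # same entity, same fake value
print(first, second, again)
```

Creating a fresh `ConsistentFaker` per document keeps the mapping scoped to that document, which matches the "within a document" wording above.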
### Reversible Anonymization

Replace entities with indexed placeholders (`<PERSON_0>`, `<LOCATION_1>`) and get a mapping for round-trip deanonymization:

```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_mode='pii',
    replacement_mode='reversible',
)

cleaner = TextCleaner(cfg=cfg)
result = cleaner.process("John Smith works at Google in London.")

print(result.lm_text)
# "<PERSON_0> works at <ORGANISATION_0> in <LOCATION_0>."

# Access the anonymization map via metadata
anon_map = result.metadata['anon_map']
restored = anon_map.deanonymize(result.lm_text)
# "John Smith works at Google in London."

# Serialize the map for storage
import json
json.dumps(anon_map.to_dict())
```
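To make the round trip concrete, here is a minimal pure-Python sketch of how indexed placeholders and a reverse mapping could work. It is an illustration under stated assumptions: `AnonMapSketch` and its methods are hypothetical and only mirror the `deanonymize` and `to_dict` calls shown above; the library's real `AnonymizationMap` may be structured differently. The entity list is hard-coded in place of an NER step.

```python
import json
import re


class AnonMapSketch:
    """Hypothetical stand-in for an anonymization map (not the library class)."""

    def __init__(self):
        self.placeholder_to_text = {}
        self._counters = {}  # tag -> next free index
        self._text_to_placeholder = {}  # (tag, text) -> placeholder

    def placeholder_for(self, tag, text):
        # Reuse the placeholder when the same entity text appears again.
        key = (tag, text)
        if key not in self._text_to_placeholder:
            index = self._counters.get(tag, 0)
            self._counters[tag] = index + 1
            placeholder = f"<{tag}_{index}>"
            self._text_to_placeholder[key] = placeholder
            self.placeholder_to_text[placeholder] = text
        return self._text_to_placeholder[key]

    def deanonymize(self, text):
        # Swap every known placeholder back; unknown ones are left untouched.
        pattern = re.compile(r"<[A-Z]+_\d+>")
        return pattern.sub(
            lambda m: self.placeholder_to_text.get(m.group(0), m.group(0)), text
        )

    def to_dict(self):
        return dict(self.placeholder_to_text)


# Entities would normally come from NER; they are hard-coded for this sketch.
entities = [("PERSON", "John Smith"), ("ORGANISATION", "Google"), ("LOCATION", "London")]
original = "John Smith works at Google in London."

anon_map = AnonMapSketch()
anonymized = original
for tag, text in entities:
    anonymized = anonymized.replace(text, anon_map.placeholder_for(tag, text))

restored = anon_map.deanonymize(anonymized)
serialized = json.dumps(anon_map.to_dict())
print(anonymized)
print(restored)
```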

> **Note**: `ProcessResult` from `process()` unpacks as a 3-tuple (`lm_text, stat_text, language`) for backward compatibility, but also exposes `.metadata` for reversible maps and document classification.

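One way a result object can unpack as a 3-tuple while still carrying extra attributes is to subclass `tuple`. The sketch below only illustrates that pattern: `ProcessResultSketch` is a hypothetical name, the field values are placeholders, and the library's actual `ProcessResult` may be built another way.

```python
class ProcessResultSketch(tuple):
    """Hypothetical 3-tuple that also carries a metadata dict."""

    def __new__(cls, lm_text, stat_text, language, metadata=None):
        # Only the three legacy fields take part in tuple unpacking.
        instance = super().__new__(cls, (lm_text, stat_text, language))
        instance.metadata = dict(metadata or {})
        return instance

    @property
    def lm_text(self):
        return self[0]

    @property
    def stat_text(self):
        return self[1]

    @property
    def language(self):
        return self[2]


# Placeholder values; real output depends on the configured pipeline.
result = ProcessResultSketch(
    "<PERSON_0> works here.",
    "works here",
    "ENGLISH",
    metadata={"anon_map": {"<PERSON_0>": "John Smith"}},
)

# Legacy callers keep unpacking three values.
lm_text, stat_text, language = result

# Newer callers read the extra data by attribute.
print(result.lm_text, result.metadata["anon_map"])
```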
### Document Classification (GLiClass)

Classify documents before processing using zero-shot classification with [GLiClass](https://github.com/Knowledgator/GLiClass):
|`onnx` (default) | ONNX Runtime inference with quantized XLM-RoBERTa models | Base install | Production: fast, torch-free |
|`torch`| PyTorch/Transformers pipeline with full XLM-RoBERTa models |`[torch]` extra | Compatibility with existing PyTorch workflows |
-|`gliner`| GLiNER zero-shot NER with custom entity labels |`[gliner]` or `[gliner2]` extra | Custom entity types (PRODUCT, SKILL, EVENT, etc.)|
+|`gliner`| GLiNER zero-shot NER with custom entity labels |`[gliner]` or `[gliner2]` extra | Custom entity types, PII detection, bi-encoder models|
|`ensemble_onnx`| ONNX + GLiNER ensemble voting |`[gliner]` extra | Maximum recall with custom entities |
|`ensemble_torch`| Torch + GLiNER ensemble voting |`[torch,gliner]` extra | Maximum recall with PyTorch |
|`presidio_gliner`| Presidio + GLiNER recognizer (beta) |`presidio-analyzer`, `[gliner]`| Context-aware NER via Presidio's pipeline |

### Default NER Models (ONNX)
@@ -270,6 +412,17 @@ SqueakyCleanText supports five NER backends, selectable via the `ner_backend` co
| French / Portuguese / Italian |[`rhnfzl/wikineural-multilingual-ner-onnx`](https://huggingface.co/rhnfzl/wikineural-multilingual-ner-onnx) (shared session) |
|`urchade/gliner_large-v2.1`| Uni-encoder (DeBERTa) | 512 | Multi | Legacy, high accuracy on short texts |
|`MatteoFasulo/ModernBERT-base-NER`| ModernBERT | 8192 | English | English-only, very long context |

> **GLiNER2 note**: `pip install squeakycleantext[gliner2]` installs [Knowledgator's gliner2 package](https://github.com/Knowledgator/GLiNER), not Fastino AI's GLiNER2 from EMNLP 2025 (different API).

### GLiNER Label Mapping

GLiNER uses lowercase free-text labels (e.g., `'person'`, `'product'`). To map them to standard NER tags used by the anonymizer, use `gliner_label_map`:
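The mapping idea can be sketched in plain Python without loading a model. Everything here is an assumption made for illustration: the target tags (`PER`, `ORG`, `LOC`, `MISC`), the fallback tag, and the `map_labels` helper are hypothetical, and the exact value format that `gliner_label_map` expects is not shown in this excerpt.

```python
# Hypothetical mapping from GLiNER free-text labels to standard NER tags.
gliner_label_map = {
    "person": "PER",
    "organization": "ORG",
    "location": "LOC",
    "product": "MISC",
}


def map_labels(entities, label_map, default_tag="MISC"):
    """Translate (text, free_text_label) pairs into (text, standard_tag) pairs."""
    return [
        (text, label_map.get(label.lower(), default_tag))
        for text, label in entities
    ]


# Detected entities are hard-coded here in place of real GLiNER output.
detected = [("John Smith", "person"), ("Pixel 9", "product"), ("Berlin", "location")]
mapped = map_labels(detected, gliner_label_map)
print(mapped)
```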
|`gliner_label_descriptions`|`None`| ZERONER-style: `{label: "description"}` for improved zero-shot accuracy |
|`fuzzy_date_score_cutoff`|`85`| Fuzzy matching threshold (0-100) for misspelled months |
|`custom_pipeline_steps`|`()`| Tuple of `(text: str) -> str` callables appended after all built-in steps |

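To give a feel for what a 0-100 cutoff on misspelled months means, the sketch below scores a token against each month name and accepts the best match only if it reaches the threshold. It uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer. This excerpt does not say which matcher the library uses, so the scores here will not necessarily agree with its own, and `fuzzy_month` is a hypothetical helper.

```python
from difflib import SequenceMatcher

MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def fuzzy_month(token, score_cutoff=85):
    """Return the closest month name if its similarity score meets the cutoff."""
    token = token.lower()
    best_month, best_score = None, 0.0
    for month in MONTHS:
        # ratio() lies in [0, 1]; scale it to the 0-100 range used by the cutoff.
        score = 100 * SequenceMatcher(None, token, month).ratio()
        if score > best_score:
            best_month, best_score = month, score
    return best_month if best_score >= score_cutoff else None


print(fuzzy_month("Janury"))
print(fuzzy_month("Febuary"))
print(fuzzy_month("xyz"))
```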
**Language settings**:

| Field | Default | Description |
|-------|---------|-------------|
-|`language`|`None`| Pin language (skip detection)|
-|`extra_languages`|`()`| Additional language names for detection |
+|`language`|`None`| Pin language (`'en'`), restrict detection to a set (`('en','nl')`), or `None` for auto-detect. Accepts Lingua names, ISO 639-1, ISO 639-3 codes.|
+|`extra_languages`|`()`| Additional language names/codes for detection |
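The three shapes accepted by `language` (a single value, a collection, or `None`) and the mix of names and ISO codes can be pictured with a small resolver. This is a sketch under stated assumptions: the alias table covers only three languages, and `normalize_language` and `resolve_language_setting` are hypothetical helpers, not functions from the library.

```python
# Small illustrative lookup: Lingua-style name -> (ISO 639-1, ISO 639-3).
LANGUAGE_ALIASES = {
    "ENGLISH": ("en", "eng"),
    "DUTCH": ("nl", "nld"),
    "GERMAN": ("de", "deu"),
}


def normalize_language(value):
    """Map a language name or ISO code to its canonical upper-case name."""
    key = value.strip()
    for name, codes in LANGUAGE_ALIASES.items():
        if key.upper() == name or key.lower() in codes:
            return name
    raise ValueError(f"Unknown language: {value!r}")


def resolve_language_setting(language):
    """Return ('auto', None), ('pinned', name) or ('restricted', names)."""
    if language is None:
        return ("auto", None)  # detect among all available languages
    if isinstance(language, str):
        return ("pinned", normalize_language(language))  # skip detection
    # Any other iterable restricts detection to the listed languages.
    return ("restricted", tuple(normalize_language(item) for item in language))


print(resolve_language_setting(None))
print(resolve_language_setting("en"))
print(resolve_language_setting(("en", "nl")))
```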
- **Synthetic replacement** (`replacement_mode='synthetic'`): Faker-generated realistic values instead of `<TAG>` placeholders, with per-document consistency
- **Reversible anonymization** (`replacement_mode='reversible'`): indexed placeholders (`<PERSON_0>`) with `AnonymizationMap` for round-trip deanonymization
- **Document classification** (`check_classify_document=True`): zero-shot GLiClass pre-classification before text processing
- **ProcessResult**: `process()` returns `ProcessResult` (backward-compatible 3-tuple) with `.metadata` for anonymization maps and classification results
- **GLiNER ONNX mode** (`gliner_onnx=True`): load GLiNER with pre-built ONNX weights from HuggingFace Hub (auto-set for PII + ONNX backend)
- **Bi-encoder support**: auto-detects ModernBERT and other bi-encoder GLiNER models, caches label embeddings, dynamic context windows (2048-8192 tokens)
- **Entity description labels**: ZERONER-style natural-language descriptions for improved zero-shot accuracy