Commit 9480bc2

Release v0.6.0: GLiNER modernization, reversible anonymization, document classification
- Bi-encoder support with auto-detection, label embedding cache, dynamic context windows
- PII mode (ner_mode='pii') auto-configures GLiNER with 60+ entity labels
- Synthetic replacement (replacement_mode='synthetic') via Faker for realistic anonymization
- Reversible anonymization (replacement_mode='reversible') with indexed placeholders and AnonymizationMap for round-trip deanonymize
- GLiClass document-level pre-classification (check_classify_document=True)
- ProcessResult return type: backward-compatible 3-tuple unpacking + .metadata dict
- GLiNER ONNX loading support (gliner_onnx=True)
- Presidio GLiNER recognizer backend (ner_backend='presidio_gliner', beta)
- ZERONER-style entity description labels (gliner_label_descriptions)
- ModernBERT ONNX export script support
- Updated all dependency versions to latest stable releases
- 178 tests passing across 17 test classes
1 parent def20dc commit 9480bc2

19 files changed

Lines changed: 2262 additions & 210 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -171,6 +171,8 @@ snap.py
171171
CLAUDE.md
172172
SPECS.md
173173
docs/plans/
174+
docs/GLINER_GAP_ANALYSIS.md
175+
docs/V060_PLAN.md
174176
squeakycleantext-explorer.html
175177
ralph-loop-prompt.md
176178
.claude/

README.md

Lines changed: 175 additions & 6 deletions
@@ -99,6 +99,9 @@ The base install uses **ONNX Runtime** for NER inference - no PyTorch or Transfo
9999
| PyTorch NER | `pip install SqueakyCleanText[torch]` | PyTorch/Transformers NER backend |
100100
| GLiNER | `pip install SqueakyCleanText[gliner]` | [GLiNER](https://github.com/urchade/GLiNER) zero-shot NER |
101101
| GLiNER2 | `pip install SqueakyCleanText[gliner2]` | [GLiNER2](https://github.com/Knowledgator/GLiNER) (knowledgator) backend |
102+
| Synthetic | `pip install SqueakyCleanText[synthetic]` | Faker-based synthetic replacement (realistic fake values instead of `<TAG>` tokens) |
103+
| Presidio | `pip install SqueakyCleanText[presidio]` | Presidio-analyzer for `presidio_gliner` backend |
104+
| Classify | `pip install SqueakyCleanText[classify]` | GLiClass document-level pre-classification |
102105
| All NER | `pip install SqueakyCleanText[all-ner]` | All NER backends combined |
103106
| Development | `pip install SqueakyCleanText[dev]` | Testing and linting tools |
104107

@@ -143,7 +146,7 @@ cfg = TextCleanerConfig(
143146
replace_with_url="<URL>",
144147
replace_with_email="<EMAIL>",
145148
replace_with_phone_numbers="<PHONE>",
146-
language="ENGLISH", # Skip auto-detection
149+
language="en", # Pin to English (also accepts 'ENGLISH', 'eng')
147150
)
148151

149152
# Initialize with config
@@ -193,6 +196,144 @@ cleaner = TextCleaner(cfg=cfg)
193196
lm_text, stat_text, lang = cleaner.process("Angela Merkel visited the Bundestag in Berlin.")
194197
```
195198

199+
### PII Detection Mode
200+
201+
Automatically configure GLiNER for comprehensive PII detection with 60+ entity types (personal, financial, healthcare, identity, digital):
202+
203+
```python
204+
from sct import TextCleaner, TextCleanerConfig
205+
206+
cfg = TextCleanerConfig(ner_mode='pii')
207+
208+
cleaner = TextCleaner(cfg=cfg)
209+
lm_text, stat_text, lang = cleaner.process(
210+
"John Smith's SSN is 123-45-6789, email john@example.com, DOB 1990-01-15"
211+
)
212+
# Entities are anonymized: names, SSNs, emails, dates of birth, and 50+ more PII types
213+
```
214+
215+
PII mode auto-configures: `ner_backend='gliner'`, uses [`knowledgator/gliner-pii-base-v1.0`](https://huggingface.co/knowledgator/gliner-pii-base-v1.0), sets threshold to 0.3 (recall-focused), and expands positional tags. User-provided values always take priority.
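
The precedence rule in the paragraph above can be pictured as a plain dictionary merge. This is only an illustrative sketch: `PII_DEFAULTS` and `resolve_pii_settings` are hypothetical names, not part of SqueakyCleanText, and the library's actual resolution logic may differ. The default values are the ones listed above.

```python
# Hypothetical sketch: PII-mode defaults that yield to explicit user values.
PII_DEFAULTS = {
    "ner_backend": "gliner",
    "gliner_model": "knowledgator/gliner-pii-base-v1.0",
    "gliner_threshold": 0.3,  # recall-focused threshold
}


def resolve_pii_settings(user_settings):
    """Return the PII defaults, overridden by any value the user set."""
    resolved = dict(PII_DEFAULTS)
    for key, value in user_settings.items():
        if value is not None:  # None is treated here as "not provided"
            resolved[key] = value
    return resolved


settings = resolve_pii_settings({"gliner_threshold": 0.5, "gliner_model": None})
print(settings)
```

Treating `None` as "not provided" is an assumption of this sketch; it keeps the defaults for fields the user left untouched while letting any explicit value win.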
216+
217+
**Alternative PII models** (pass as `gliner_model`):
218+
219+
| Model | Type | Size | Labels | F1 |
220+
|-------|------|------|--------|-----|
221+
| [`knowledgator/gliner-pii-base-v1.0`](https://huggingface.co/knowledgator/gliner-pii-base-v1.0) | Uni-encoder | 330MB (ONNX FP16) | 60+ | 80.99% |
222+
| [`nvidia/gliner-PII`](https://huggingface.co/nvidia/gliner-PII) | Bi-encoder | 570MB | 55+ | |
223+
| [`gretelai/gretel-gliner-bi-base-v1.0`](https://huggingface.co/gretelai/gretel-gliner-bi-base-v1.0) | Bi-encoder | ~800MB | 40+ | 95% |
224+
| [`urchade/gliner_multi_pii-v1`](https://huggingface.co/urchade/gliner_multi_pii-v1) | Multilingual | | | |
225+
226+
### Synthetic Replacement
227+
228+
Replace detected entities with realistic fake values (via [Faker](https://faker.readthedocs.io/)) instead of `<TAG>` placeholder tokens:
229+
230+
```python
231+
from sct import TextCleaner, TextCleanerConfig
232+
233+
cfg = TextCleanerConfig(
234+
ner_mode='pii',
235+
replacement_mode='synthetic', # pip install squeakycleantext[synthetic]
236+
)
237+
238+
cleaner = TextCleaner(cfg=cfg)
239+
lm_text, stat_text, lang = cleaner.process(
240+
"Contact John Smith at john.smith@company.com or +1-555-0123"
241+
)
242+
# Example output (generated values will typically differ): "Contact Jennifer Williams at lisa45@example.net or +1-555-0198"
243+
# Same entity always maps to same fake value within a document
244+
```
245+
246+
> **Note**: Synthetic replacement preserves data utility for downstream ML tasks but is NOT GDPR-compliant anonymization. Same-document consistency is maintained (same entity text always maps to the same fake value).
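
One way to picture the same-document consistency described in the note is a small cache keyed by entity label and text. The sketch below is an assumption about the general technique, not the library's implementation: `ConsistentReplacer` and `fake_value` are hypothetical names, and a counter-based stand-in generator is used so the example runs without Faker installed.

```python
import itertools


class ConsistentReplacer:
    """Map each (label, text) pair to one stable replacement per document."""

    def __init__(self, generate):
        self._generate = generate  # callable: label -> new fake value
        self._cache = {}

    def replace(self, label, text):
        key = (label, text)
        if key not in self._cache:
            # First sighting of this entity: generate and remember a value.
            self._cache[key] = self._generate(label)
        return self._cache[key]


# Stand-in generator (hypothetical); a real setup could call Faker here.
_counter = itertools.count(1)


def fake_value(label):
    return f"{label.lower()}-{next(_counter)}"


replacer = ConsistentReplacer(fake_value)
first = replacer.replace("PERSON", "John Smith")
again = replacer.replace("PERSON", "John Smith")
other = replacer.replace("PERSON", "Jane Doe")
print(first, again, other)
```

Creating one replacer per document and discarding it afterwards would give per-document consistency without linking entities across documents.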
247+
248+
### Reversible Anonymization
249+
250+
Replace entities with indexed placeholders (`<PERSON_0>`, `<LOCATION_1>`) and get a mapping for round-trip deanonymization:
251+
252+
```python
253+
from sct import TextCleaner, TextCleanerConfig
254+
255+
cfg = TextCleanerConfig(
256+
ner_mode='pii',
257+
replacement_mode='reversible',
258+
)
259+
260+
cleaner = TextCleaner(cfg=cfg)
261+
result = cleaner.process("John Smith works at Google in London.")
262+
263+
print(result.lm_text)
264+
# "<PERSON_0> works at <ORGANISATION_0> in <LOCATION_0>."
265+
266+
# Access the anonymization map via metadata
267+
anon_map = result.metadata['anon_map']
268+
restored = anon_map.deanonymize(result.lm_text)
269+
# "John Smith works at Google in London."
270+
271+
# Serialize the map for storage
272+
import json
273+
json.dumps(anon_map.to_dict())
274+
```
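
To make the round trip concrete, here is a minimal, self-contained sketch of an indexed-placeholder map. It is not the library's `AnonymizationMap`: the class name `AnonymizationMapSketch` and the method `placeholder_for` are hypothetical, and only the `deanonymize` and `to_dict` method names are borrowed from the example above.

```python
import re


class AnonymizationMapSketch:
    """Hypothetical indexed-placeholder map with round-trip restoration."""

    def __init__(self):
        self._by_entity = {}       # (label, text) -> placeholder
        self._by_placeholder = {}  # placeholder -> original text
        self._next_index = {}      # label -> next free index

    def placeholder_for(self, label, text):
        key = (label, text)
        if key not in self._by_entity:
            index = self._next_index.get(label, 0)
            self._next_index[label] = index + 1
            placeholder = f"<{label}_{index}>"
            self._by_entity[key] = placeholder
            self._by_placeholder[placeholder] = text
        return self._by_entity[key]

    def deanonymize(self, text):
        # Swap every known placeholder back; unknown ones are left as-is.
        return re.sub(
            r"<[A-Z]+_\d+>",
            lambda m: self._by_placeholder.get(m.group(0), m.group(0)),
            text,
        )

    def to_dict(self):
        return dict(self._by_placeholder)


anon = AnonymizationMapSketch()
masked = "{} works at {}.".format(
    anon.placeholder_for("PERSON", "John Smith"),
    anon.placeholder_for("ORGANISATION", "Google"),
)
restored = anon.deanonymize(masked)
print(masked)
print(restored)
```

Indexing per label keeps placeholders short and lets the same entity reuse its placeholder wherever it reappears in the text.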
275+
276+
> **Note**: `ProcessResult` from `process()` unpacks as a 3-tuple (`lm_text, stat_text, language`) for backward compatibility, but also exposes `.metadata` for reversible maps and document classification.
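
A return type that unpacks as three values yet carries extra data can be built as a `tuple` subclass. The following is a sketch of that general pattern under the assumption that it resembles what the note describes; `ProcessResultSketch` is a hypothetical name and the real `ProcessResult` may be implemented differently.

```python
class ProcessResultSketch(tuple):
    """A 3-tuple (lm_text, stat_text, language) that also carries metadata."""

    def __new__(cls, lm_text, stat_text, language, metadata=None):
        # Only three items go into the tuple, so unpacking stays a 3-tuple.
        self = super().__new__(cls, (lm_text, stat_text, language))
        # Tuple subclasses without __slots__ accept instance attributes.
        self.metadata = dict(metadata or {})
        return self

    @property
    def lm_text(self):
        return self[0]

    @property
    def stat_text(self):
        return self[1]

    @property
    def language(self):
        return self[2]


result = ProcessResultSketch("<PERSON_0> left.", "left", "ENGLISH", {"anon_map": {}})
lm_text, stat_text, language = result  # older call sites keep working
print(result.lm_text, result.metadata)
```

Keeping `metadata` out of the tuple itself is what preserves three-way unpacking; a four-field named tuple would break existing `a, b, c = ...` call sites.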
277+
278+
### Document Classification (GLiClass)
279+
280+
Classify documents before processing using zero-shot classification with [GLiClass](https://github.com/Knowledgator/GLiClass):
281+
282+
```python
283+
from sct import TextCleaner, TextCleanerConfig
284+
285+
cfg = TextCleanerConfig(
286+
check_classify_document=True,
287+
gliclass_labels=('email', 'code', 'legal', 'medical'),
288+
# gliclass_model defaults to 'knowledgator/gliclass-edge-v3.0' (32.7M params)
289+
)
290+
291+
cleaner = TextCleaner(cfg=cfg) # pip install squeakycleantext[classify]
292+
result = cleaner.process("Dear Sir, please find attached the contract...")
293+
294+
# Classification results in metadata
295+
print(result.metadata['classes'])
296+
# [{"label": "email", "score": 0.92}, {"label": "legal", "score": 0.78}]
297+
```
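
Downstream code might use the classification result to decide how to handle a document. The helper below is a hypothetical sketch (`top_label` is not a library function) that assumes `metadata['classes']` has the list-of-dicts shape shown in the comment above.

```python
def top_label(classes, min_score=0.5):
    """Return the highest-scoring label at or above min_score, else None."""
    eligible = [c for c in classes if c["score"] >= min_score]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["score"])["label"]


classes = [{"label": "email", "score": 0.92}, {"label": "legal", "score": 0.78}]
best = top_label(classes)
print(best)
```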
298+
299+
### Bi-Encoder GLiNER Models
300+
301+
Bi-encoder models (ModernBERT, etc.) are auto-detected and leverage pre-computed label embeddings for faster inference with larger context windows:
302+
303+
```python
304+
from sct import TextCleaner, TextCleanerConfig
305+
306+
cfg = TextCleanerConfig(
307+
ner_backend='gliner',
308+
gliner_model='knowledgator/gliner-bi-base-v2.0',
309+
gliner_labels=('person', 'organization', 'location'),
310+
)
311+
312+
cleaner = TextCleaner(cfg=cfg)
313+
# Auto-detects bi-encoder → caches label embeddings → uses 2048+ token context window
314+
```
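
The benefit of caching label embeddings can be illustrated with a memoized encoder: the label set is encoded once and reused for later calls. This is a sketch of the general idea only; `encode_labels` is a stand-in for a real (expensive) label encoder, and both function names are hypothetical rather than taken from GLiNER or SqueakyCleanText.

```python
from functools import lru_cache

calls = []  # records each time the stand-in encoder actually runs


def encode_labels(labels):
    """Stand-in for an expensive label encoder (hypothetical)."""
    calls.append(labels)
    return tuple(float(len(label)) for label in labels)


@lru_cache(maxsize=32)
def cached_label_embeddings(labels):
    # The argument must be hashable, hence a tuple such as gliner_labels.
    return encode_labels(labels)


labels = ("person", "organization", "location")
first = cached_label_embeddings(labels)
second = cached_label_embeddings(labels)  # served from the cache
print(first, len(calls))
```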
315+
316+
### Entity Description Labels (ZERONER-Style)
317+
318+
Provide natural-language descriptions for labels to improve zero-shot recognition accuracy:
319+
320+
```python
321+
from sct import TextCleaner, TextCleanerConfig
322+
323+
cfg = TextCleanerConfig(
324+
ner_backend='gliner',
325+
gliner_model='knowledgator/gliner-bi-base-v2.0',
326+
gliner_label_descriptions={
327+
'person': "a person's full legal name",
328+
'location': "a geographical place or address",
329+
'organization': "a company, institution, or government body",
330+
},
331+
)
332+
333+
cleaner = TextCleaner(cfg=cfg)
334+
# Descriptions are used for inference, results are mapped back to original label names
335+
```
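
Mapping results back to the original label names, as the closing comment in the example describes, amounts to inverting the description dictionary. The sketch below assumes predictions arrive as dicts whose `label` field holds the description text; that shape, and the names `description_to_label` and `map_back`, are illustrative assumptions.

```python
label_descriptions = {
    "person": "a person's full legal name",
    "location": "a geographical place or address",
}

# Invert the mapping so a predicted description leads back to its label.
description_to_label = {desc: label for label, desc in label_descriptions.items()}


def map_back(predictions):
    """Rewrite each prediction's label from description to original name."""
    return [
        {**p, "label": description_to_label.get(p["label"], p["label"])}
        for p in predictions
    ]


# Hypothetical model output, labelled with the description used at inference.
raw = [{"text": "Angela Merkel", "label": "a person's full legal name", "score": 0.97}]
mapped = map_back(raw)
print(mapped)
```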
336+
196337
### Batch Processing
197338

198339
```python
@@ -249,15 +390,16 @@ cleaner = sct.TextCleaner()
249390

250391
## NER Backends
251392

252-
SqueakyCleanText supports five NER backends, selectable via the `ner_backend` config field:
393+
SqueakyCleanText supports six NER backends, selectable via the `ner_backend` config field:
253394

254395
| Backend | Description | Dependencies | Best for |
255396
|---------|-------------|-------------|----------|
256397
| `onnx` (default) | ONNX Runtime inference with quantized XLM-RoBERTa models | Base install | Production: fast, torch-free |
257398
| `torch` | PyTorch/Transformers pipeline with full XLM-RoBERTa models | `[torch]` extra | Compatibility with existing PyTorch workflows |
258-
| `gliner` | GLiNER zero-shot NER with custom entity labels | `[gliner]` or `[gliner2]` extra | Custom entity types (PRODUCT, SKILL, EVENT, etc.) |
399+
| `gliner` | GLiNER zero-shot NER with custom entity labels | `[gliner]` or `[gliner2]` extra | Custom entity types, PII detection, bi-encoder models |
259400
| `ensemble_onnx` | ONNX + GLiNER ensemble voting | `[gliner]` extra | Maximum recall with custom entities |
260401
| `ensemble_torch` | Torch + GLiNER ensemble voting | `[torch,gliner]` extra | Maximum recall with PyTorch |
402+
| `presidio_gliner` | Presidio + GLiNER recognizer (beta) | `[presidio]` and `[gliner]` extras | Context-aware NER via Presidio's pipeline |
261403

262404
### Default NER Models (ONNX)
263405

@@ -270,6 +412,17 @@ SqueakyCleanText supports five NER backends, selectable via the `ner_backend` co
270412
| French / Portuguese / Italian | [`rhnfzl/wikineural-multilingual-ner-onnx`](https://huggingface.co/rhnfzl/wikineural-multilingual-ner-onnx) (shared session) |
271413
| Multilingual (fallback) | [`rhnfzl/wikineural-multilingual-ner-onnx`](https://huggingface.co/rhnfzl/wikineural-multilingual-ner-onnx) |
272414

415+
### GLiNER Model Recommendations
416+
417+
| Model | Architecture | Context | Languages | Best for |
418+
|-------|-------------|---------|-----------|----------|
419+
| `knowledgator/gliner-bi-base-v2.0` | Bi-encoder (ModernBERT) | 2048 | Multi | General NER, long documents |
420+
| `knowledgator/gliner-pii-base-v1.0` | Bi-encoder | 2048 | Multi | PII detection (60+ entity types) |
421+
| `urchade/gliner_large-v2.1` | Uni-encoder (DeBERTa) | 512 | Multi | Legacy, high accuracy on short texts |
422+
| `MatteoFasulo/ModernBERT-base-NER` | ModernBERT | 8192 | English | English-only, very long context |
423+
424+
> **GLiNER2 note**: `pip install squeakycleantext[gliner2]` installs [Knowledgator's gliner2 package](https://github.com/Knowledgator/GLiNER), not Fastino AI's GLiNER2 from EMNLP 2025 (different API).
425+
273426
### GLiNER Label Mapping
274427

275428
GLiNER uses lowercase free-text labels (e.g., `'person'`, `'product'`). To map them to standard NER tags used by the anonymizer, use `gliner_label_map`:
@@ -380,7 +533,9 @@ new_cfg = dataclasses.replace(cfg, check_ner_process=False)
380533

381534
| Field | Default | Description |
382535
|-------|---------|-------------|
383-
| `ner_backend` | `'onnx'` | Backend: `onnx`, `torch`, `gliner`, `ensemble_onnx`, `ensemble_torch` |
536+
| `ner_backend` | `'onnx'` | Backend: `onnx`, `torch`, `gliner`, `ensemble_onnx`, `ensemble_torch`, `presidio_gliner` |
537+
| `ner_mode` | `'standard'` | `'standard'` or `'pii'` (auto-configures GLiNER for PII detection) |
538+
| `replacement_mode` | `'placeholder'` | `'placeholder'`, `'synthetic'` (Faker), or `'reversible'` (indexed placeholders + deanonymize map) |
384539
| `positional_tags` | `('PER', 'LOC', 'ORG', 'MISC')` | Entity types to recognize |
385540
| `ner_confidence_threshold` | `0.85` | Minimum confidence score |
386541
| `ner_batch_size` | `8` | Inference batch size (must be >= 1) |
@@ -391,15 +546,16 @@ new_cfg = dataclasses.replace(cfg, check_ner_process=False)
391546
| `gliner_labels` | `('person', 'organization', 'location')` | GLiNER entity labels |
392547
| `gliner_label_map` | `None` | Maps GLiNER labels to NER tags |
393548
| `gliner_threshold` | `0.4` | GLiNER confidence threshold |
549+
| `gliner_label_descriptions` | `None` | ZERONER-style: `{label: "description"}` for improved zero-shot accuracy |
394550
| `fuzzy_date_score_cutoff` | `85` | Fuzzy matching threshold (0-100) for misspelled months |
395551
| `custom_pipeline_steps` | `()` | Tuple of `(text: str) -> str` callables appended after all built-in steps |
396552

397553
**Language settings**:
398554

399555
| Field | Default | Description |
400556
|-------|---------|-------------|
401-
| `language` | `None` | Pin language (skip detection) |
402-
| `extra_languages` | `()` | Additional language names for detection |
557+
| `language` | `None` | Pin language (`'en'`), restrict detection to a set (`('en','nl')`), or `None` for auto-detect. Accepts Lingua names, ISO 639-1, ISO 639-3 codes. |
558+
| `extra_languages` | `()` | Additional language names/codes for detection |
403559
| `custom_stopwords` | `None` | `{LANG: frozenset({...})}` custom stopword sets |
404560
| `custom_month_names` | `None` | `{LANG: ('Jan', 'Feb', ...)}` for date detection |
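
The three accepted shapes of `language` (a single name or code, a tuple of them, or `None`) can be normalized with a small alias table. This is a hedged sketch: `_ALIASES`, `normalize_language`, and `normalize_language_setting` are hypothetical names, only three languages are included for brevity, and the library's own lookup may work differently.

```python
# Hypothetical alias table: canonical name -> (ISO 639-1, ISO 639-3).
_ALIASES = {
    "ENGLISH": ("en", "eng"),
    "DUTCH": ("nl", "nld"),
    "GERMAN": ("de", "deu"),
}

# Build one case-insensitive lookup covering names and both code sets.
_LOOKUP = {name: name for name in _ALIASES}
for name, codes in _ALIASES.items():
    for code in codes:
        _LOOKUP[code.upper()] = name


def normalize_language(value):
    """Resolve a language name or ISO code to its canonical name."""
    key = value.strip().upper()
    if key not in _LOOKUP:
        raise ValueError(f"Unsupported language: {value!r}")
    return _LOOKUP[key]


def normalize_language_setting(setting):
    """None -> auto-detect; str -> pinned language; tuple -> restricted set."""
    if setting is None:
        return None
    if isinstance(setting, str):
        return (normalize_language(setting),)
    return tuple(normalize_language(item) for item in setting)


print(normalize_language_setting("eng"))
print(normalize_language_setting(("en", "nl")))
```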
405561

@@ -442,6 +598,19 @@ Each step is toggled by a `TextCleanerConfig` field. The pipeline is built once
442598

443599
## What's New
444600

601+
**v0.6.0**
602+
- **PII detection mode** (`ner_mode='pii'`): auto-configures GLiNER with 60+ PII entity labels (personal, financial, healthcare, identity, digital)
603+
- **Synthetic replacement** (`replacement_mode='synthetic'`): Faker-generated realistic values instead of `<TAG>` placeholders, with per-document consistency
604+
- **Reversible anonymization** (`replacement_mode='reversible'`): indexed placeholders (`<PERSON_0>`) with `AnonymizationMap` for round-trip deanonymization
605+
- **Document classification** (`check_classify_document=True`): zero-shot GLiClass pre-classification before text processing
606+
- **ProcessResult**: `process()` returns `ProcessResult` (backward-compatible 3-tuple) with `.metadata` for anonymization maps and classification results
607+
- **GLiNER ONNX mode** (`gliner_onnx=True`): load GLiNER with pre-built ONNX weights from HuggingFace Hub (auto-set for PII + ONNX backend)
608+
- **Bi-encoder support**: auto-detects ModernBERT and other bi-encoder GLiNER models, caches label embeddings, dynamic context windows (2048-8192 tokens)
609+
- **Entity description labels**: ZERONER-style natural-language descriptions for improved zero-shot accuracy
610+
- **Presidio GLiNER backend** (beta): opt-in `ner_backend='presidio_gliner'` for Presidio's context-aware recognition pipeline
611+
- **ModernBERT ONNX export**: updated export script with ModernBERT support (English, 8192 token context)
612+
- **Dynamic chunk sizing**: GLiNER chunk size adapts to model's actual context window instead of hardcoded 384
613+
445614
**v0.5.x**
446615
- `aprocess_batch()`: async batch processing for FastAPI / aiohttp integrations
447616
- `warmup(languages)`: pre-load NER models at startup to eliminate first-request latency
