| 1 | +# SqueakyCleanText |
| 2 | + |
| 3 | +> Production-ready text preprocessing and PII anonymization pipeline for Python 3.11+. |
| 4 | +> Removes noise (URLs, emails, phone numbers, stopwords), detects language automatically, |
| 5 | +> and runs ensemble NER (ONNX/PyTorch/GLiNER) for entity anonymization. |
| 6 | +> Published on PyPI as `squeakycleantext`. |
| 7 | + |
| 8 | +## Quick Start |
| 9 | + |
| 10 | +```bash |
| 11 | +pip install squeakycleantext |
| 12 | +``` |
| 13 | + |
| 14 | +```python |
| 15 | +from sct import TextCleaner, TextCleanerConfig |
| 16 | + |
| 17 | +cleaner = TextCleaner(cfg=TextCleanerConfig(check_ner_process=True)) |
| 18 | +lm_text, stat_text, lang = cleaner.process("Contact John at john@acme.com or +1-555-123-4567.") |
| 19 | +# lm_text: "Contact <PERSON> at <EMAIL> or <PHONE>." |
| 20 | +# stat_text: "contact" |
| 21 | +# lang: "ENGLISH" |
| 22 | +``` |
| 23 | + |
| 24 | +## API Surface |
| 25 | + |
| 26 | +- `TextCleaner(cfg=TextCleanerConfig(...))` — main pipeline class |
| 27 | +- `cleaner.process(text: str) -> (lm_text, stat_text, language)` |
| 28 | +- `cleaner.process_batch(texts: List[str]) -> List[tuple]` — thread-parallel |
| 29 | +- `cleaner.aprocess_batch(texts) -> List[tuple]` — async (FastAPI/aiohttp) |
| 30 | +- `cleaner.warmup(languages=None)` — pre-load NER models at startup |
| 31 | + |
| 32 | +## TextCleanerConfig — Key Fields |
| 33 | + |
| 34 | +``` |
| 35 | +check_ner_process=True # NER entity anonymization |
| 36 | +check_replace_urls=True # Replace URLs with <URL> |
| 37 | +check_replace_emails=True # Replace emails with <EMAIL> |
| 38 | +check_replace_phone_numbers=True # Replace phones with <PHONE> |
| 39 | +check_replace_dates=False # Replace dates with <DATE> |
| 40 | +check_fuzzy_replace_dates=False # Fuzzy misspelled month matching (requires [fuzzy]) |
| 41 | +check_remove_stopwords=True # Language-aware stopword removal |
| 42 | +check_remove_punctuation=True # Punctuation removal (stat output) |
| 43 | +check_remove_emoji=False # Remove emoji characters |
| 44 | +check_statistical_model_processing=True # Generate stat_text output |
| 45 | + |
| 46 | +ner_backend='onnx' # 'onnx'|'torch'|'gliner'|'ensemble_onnx'|'ensemble_torch' |
| 47 | +ner_confidence_threshold=0.85 # Min confidence for entity tagging |
| 48 | +ner_batch_size=8 # Inference batch size (must be >= 1) |
| 49 | +positional_tags=('PER','LOC','ORG','MISC') |
| 50 | + |
| 51 | +language=None # Pin language (e.g. 'ENGLISH') or None for auto-detect |
| 52 | +extra_languages=() # Add languages: 'FRENCH', 'PORTUGUESE', 'ITALIAN' |
| 53 | +custom_stopwords=None # {LANG: frozenset({...})} |
| 54 | +custom_pipeline_steps=() # Tuple of (text: str) -> str callables |
| 55 | +``` |
| 56 | + |
| 57 | +## NER Backends |
| 58 | + |
| 59 | +``` |
| 60 | +onnx (default) — ONNX Runtime, torch-free, ~3-5x faster than PyTorch — base install |
| 61 | +torch — PyTorch/Transformers pipeline — pip install squeakycleantext[torch] |
| 62 | +gliner — Zero-shot custom entities (PRODUCT, EVENT, SKILL) — pip install squeakycleantext[gliner] |
| 63 | +ensemble_onnx — ONNX + GLiNER voting — pip install squeakycleantext[gliner] |
| 64 | +ensemble_torch — Torch + GLiNER voting — pip install squeakycleantext[torch,gliner] |
| 65 | +``` |
| 66 | + |
| 67 | +## Supported Languages |
| 68 | + |
| 69 | +``` |
| 70 | +English — rhnfzl/xlm-roberta-large-conll03-english-onnx |
| 71 | +Dutch — rhnfzl/xlm-roberta-large-conll02-dutch-onnx |
| 72 | +German — rhnfzl/xlm-roberta-large-conll03-german-onnx |
| 73 | +Spanish — rhnfzl/xlm-roberta-large-conll02-spanish-onnx |
| 74 | +French / Portuguese / Italian — rhnfzl/wikineural-multilingual-ner-onnx (shared ONNX session) |
| 75 | +``` |
| 76 | + |
| 77 | +## Common Q&A |
| 78 | + |
| 79 | +**Q: How do I anonymize PII in text?** |
| 80 | +A: Set `check_ner_process=True` (the default). Detected entities are replaced with tags such as `<PERSON>`, `<ORGANISATION>`, and `<LOCATION>`. |
| 81 | + |
| 82 | +**Q: How do I process texts in a FastAPI route without blocking the event loop?** |
| 83 | +A: Await the async batch method inside the route handler: `results = await cleaner.aprocess_batch(texts)` |
| 84 | + |
| 85 | +**Q: How do I pre-load models to avoid first-request latency?** |
| 86 | +A: Call `cleaner.warmup(['ENGLISH', 'DUTCH'])` during application startup. |
| 87 | + |
| 88 | +**Q: How do I add French/Portuguese/Italian support?** |
| 89 | +A: Pass `extra_languages=('FRENCH',)` in `TextCleanerConfig`. NER and detection both route via the multilingual model automatically. |
| 90 | + |
| 91 | +**Q: Can I add custom text transformation steps?** |
| 92 | +A: Yes — `custom_pipeline_steps=(my_fn,)` where `my_fn` accepts and returns `str`. Steps run after all built-in steps. |
| 93 | + |
| 94 | +**Q: What's the difference between lm_text and stat_text?** |
| 95 | +A: `lm_text` preserves sentence structure and inserts replacement tokens (for LLM fine-tuning/inference). `stat_text` is lowercased, stopword-free, and stripped of punctuation (for TF-IDF, embeddings, classical ML). |
| 96 | + |
| 97 | +## Key Source Files |
| 98 | + |
| 99 | +``` |
| 100 | +sct/sct.py — TextCleaner orchestrator, pipeline assembly |
| 101 | +sct/config.py — TextCleanerConfig frozen dataclass, all defaults |
| 102 | +sct/utils/ner.py — GeneralNER: ensemble NER, lazy loading, ONNX session sharing |
| 103 | +sct/utils/onnx_pipeline.py — ONNXNERPipeline: ONNX inference, BIO aggregation |
| 104 | +sct/utils/constants.py — Pre-compiled regexes (URL, email, phone, date, currency, etc.) |
| 105 | +sct/utils/stopwords.py — Language-aware stopword removal (O(1) set lookup) |
| 106 | +sct/utils/resources.py — Lingua language detector (lazy singleton) |
| 107 | +tests/test_sct.py — Full test suite (hypothesis, faker, pytest-timeout) |
| 108 | +``` |
| 109 | + |
| 110 | +## Resources |
| 111 | + |
| 112 | +- GitHub: https://github.com/rhnfzl/SqueakyCleanText |
| 113 | +- PyPI: https://pypi.org/project/squeakycleantext/ |
| 114 | +- Issues: https://github.com/rhnfzl/SqueakyCleanText/issues |
| 115 | +- Releases: https://github.com/rhnfzl/SqueakyCleanText/releases |
| 116 | +- License: MIT (Rehan Fazal, 2024) |
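
The Q&A entry on custom steps says each item in `custom_pipeline_steps` is a callable that accepts and returns `str`. A minimal sketch of such a step follows. The name `collapse_whitespace` is hypothetical and chosen only for illustration. The commented-out wiring reuses the constructor call from Quick Start and is not executed here.

```python
import re

# Pre-compiled pattern matching any run of whitespace (spaces, tabs, newlines).
_WHITESPACE_RUN = re.compile(r"\s+")


def collapse_whitespace(text: str) -> str:
    """Hypothetical custom step: squeeze whitespace runs into single spaces."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


# Wiring as described in the Q&A (assumes squeakycleantext is installed):
# from sct import TextCleaner, TextCleanerConfig
# cleaner = TextCleaner(
#     cfg=TextCleanerConfig(custom_pipeline_steps=(collapse_whitespace,))
# )

print(collapse_whitespace("Contact   John\n at\tthe office "))
```

Because the diff states that custom steps run after all built-in steps, a step like this would see text that already contains the replacement tokens.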
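
The API Surface section lists `process_batch` as returning `List[tuple]`. Assuming each tuple has the same `(lm_text, stat_text, language)` shape that `process` returns (the diff does not confirm this), the batch result can be transposed into three parallel lists. The sketch below uses stand-in data rather than a real call, so it runs without the library installed.

```python
from typing import List, Tuple

# Stand-in for `cleaner.process_batch(texts)`; the tuple shape is an assumption.
results: List[Tuple[str, str, str]] = [
    # First row copied from the Quick Start example output.
    ("Contact <PERSON> at <EMAIL> or <PHONE>.", "contact", "ENGLISH"),
    # Second row is a made-up placeholder, not real library output.
    ("placeholder lm_text", "placeholder stat_text", "DUTCH"),
]

# Transpose the list of 3-tuples into three parallel lists.
lm_texts, stat_texts, langs = (list(column) for column in zip(*results))

print(langs)
```

Keeping the three outputs in separate lists can be convenient when `lm_text` and `stat_text` feed different downstream models.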