Commit 381d815

Add AI/search discoverability optimizations
- pyproject.toml: expand description, keywords (4→12), classifiers (8→14), add project URLs
- llms.txt: create AI-agent machine-readable index at repo root
- sct/__init__.py: add __version__ = "0.5.0"
- README.md: add PII/anonymization keywords to intro + llms.txt callout
1 parent 9e7c63b commit 381d815

4 files changed

Lines changed: 137 additions & 5 deletions

README.md

Lines changed: 3 additions & 1 deletion
@@ -11,9 +11,11 @@
 A comprehensive text cleaning and preprocessing pipeline for machine learning and NLP tasks.
 </div>

+> **Using an AI coding assistant?** This repo includes an [`llms.txt`](./llms.txt) with the full API surface, config reference, and Q&A — optimised for Claude, Cursor, Copilot, and ChatGPT.
+
 In the world of machine learning and natural language processing, clean and well-structured text data is crucial for building effective downstream models and managing token limits in language models.

-SqueakyCleanText simplifies the process by automatically addressing common text issues, ensuring your data is clean and well-structured with minimal effort on your part.
+SqueakyCleanText simplifies the process by automatically addressing common text issues — removing PII, anonymizing named entities (persons, organisations, locations), and ensuring your data is clean and well-structured for language models and classical ML pipelines with minimal effort on your part.

 ### Key Features

llms.txt

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
+# SqueakyCleanText
+
+> Production-ready text preprocessing and PII anonymization pipeline for Python 3.11+.
+> Removes noise (URLs, emails, phone numbers, stopwords), detects language automatically,
+> and runs ensemble NER (ONNX/PyTorch/GLiNER) for entity anonymization.
+> Published on PyPI as `squeakycleantext`.
+
+## Quick Start
+
+```
+pip install squeakycleantext
+```
+
+```python
+from sct import TextCleaner, TextCleanerConfig
+
+cleaner = TextCleaner(cfg=TextCleanerConfig(check_ner_process=True))
+lm_text, stat_text, lang = cleaner.process("Contact John at john@acme.com or +1-555-123-4567.")
+# lm_text: "Contact <PERSON> at <EMAIL> or <PHONE>."
+# stat_text: "contact"
+# lang: "ENGLISH"
+```
+
+## API Surface
+
+- `TextCleaner(cfg=TextCleanerConfig(...))` — main pipeline class
+- `cleaner.process(text: str) -> (lm_text, stat_text, language)`
+- `cleaner.process_batch(texts: List[str]) -> List[tuple]` — thread-parallel
+- `cleaner.aprocess_batch(texts) -> List[tuple]` — async (FastAPI/aiohttp)
+- `cleaner.warmup(languages=None)` — pre-load NER models at startup
+
+## TextCleanerConfig — Key Fields
+
+```
+check_ner_process=True # NER entity anonymization
+check_replace_urls=True # Replace URLs with <URL>
+check_replace_emails=True # Replace emails with <EMAIL>
+check_replace_phone_numbers=True # Replace phones with <PHONE>
+check_replace_dates=False # Replace dates with <DATE>
+check_fuzzy_replace_dates=False # Fuzzy misspelled month matching (requires [fuzzy])
+check_remove_stopwords=True # Language-aware stopword removal
+check_remove_punctuation=True # Punctuation removal (stat output)
+check_remove_emoji=False # Remove emoji characters
+check_statistical_model_processing=True # Generate stat_text output
+
+ner_backend='onnx' # 'onnx'|'torch'|'gliner'|'ensemble_onnx'|'ensemble_torch'
+ner_confidence_threshold=0.85 # Min confidence for entity tagging
+ner_batch_size=8 # Inference batch size (must be >= 1)
+positional_tags=('PER','LOC','ORG','MISC')
+
+language=None # Pin language (e.g. 'ENGLISH') or None for auto-detect
+extra_languages=() # Add languages: 'FRENCH', 'PORTUGUESE', 'ITALIAN'
+custom_stopwords=None # {LANG: frozenset({...})}
+custom_pipeline_steps=() # Tuple of (text: str) -> str callables
+```
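
The `custom_pipeline_steps` field in the config reference above is described as a tuple of `(text: str) -> str` callables. As a minimal sketch of what one such step might look like: `collapse_whitespace` is a hypothetical name chosen for illustration and is not part of SqueakyCleanText, and the commented-out wiring is an assumption based only on the signatures listed in this file.

```python
import re

# Hypothetical custom step: any callable that accepts and returns str
# matches the `(text: str) -> str` shape listed for custom_pipeline_steps.
_WHITESPACE_RUN = re.compile(r"\s+")


def collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


# Assumed wiring, per the config reference (needs the package installed):
# from sct import TextCleaner, TextCleanerConfig
# cfg = TextCleanerConfig(custom_pipeline_steps=(collapse_whitespace,))
# cleaner = TextCleaner(cfg=cfg)

print(collapse_whitespace("  Contact   <PERSON>\n at <EMAIL>  "))
# prints: Contact <PERSON> at <EMAIL>
```

Because the Q&A in this file states that custom steps run after all built-in steps, a step like this would presumably receive text in which replacement tokens such as `<EMAIL>` are already present.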
+
+## NER Backends
+
+```
+onnx (default) — ONNX Runtime, torch-free, ~3-5x faster than PyTorch — base install
+torch — PyTorch/Transformers pipeline — pip install squeakycleantext[torch]
+gliner — Zero-shot custom entities (PRODUCT, EVENT, SKILL) — pip install squeakycleantext[gliner]
+ensemble_onnx — ONNX + GLiNER voting — pip install squeakycleantext[gliner]
+ensemble_torch — Torch + GLiNER voting — pip install squeakycleantext[torch,gliner]
+```
+
+## Supported Languages
+
+```
+English — rhnfzl/xlm-roberta-large-conll03-english-onnx
+Dutch — rhnfzl/xlm-roberta-large-conll02-dutch-onnx
+German — rhnfzl/xlm-roberta-large-conll03-german-onnx
+Spanish — rhnfzl/xlm-roberta-large-conll02-spanish-onnx
+French / Portuguese / Italian — rhnfzl/wikineural-multilingual-ner-onnx (shared ONNX session)
+```
+
+## Common Q&A
+
+**Q: How do I anonymize PII in text?**
+A: Set `check_ner_process=True` (default). Returns entities replaced with `<PERSON>`, `<ORGANISATION>`, `<LOCATION>`.
+
+**Q: How do I process texts in a FastAPI route without blocking the event loop?**
+A: Use: `results = await cleaner.aprocess_batch(texts)`
+
+**Q: How do I pre-load models to avoid first-request latency?**
+A: Call `cleaner.warmup(['ENGLISH', 'DUTCH'])` during application startup.
+
+**Q: How do I add French/Portuguese/Italian support?**
+A: Pass `extra_languages=('FRENCH',)` in `TextCleanerConfig`. NER and detection both route via the multilingual model automatically.
+
+**Q: Can I add custom text transformation steps?**
+A: Yes — `custom_pipeline_steps=(my_fn,)` where `my_fn` accepts and returns `str`. Steps run after all built-in steps.
+
+**Q: What's the difference between lm_text and stat_text?**
+A: `lm_text` preserves sentence structure with replacement tokens (for LLM fine-tuning/inference). `stat_text` is lowercased, stopword-free, no punctuation (for TF-IDF, embeddings, classical ML).
+
+## Key Source Files
+
+```
+sct/sct.py — TextCleaner orchestrator, pipeline assembly
+sct/config.py — TextCleanerConfig frozen dataclass, all defaults
+sct/utils/ner.py — GeneralNER: ensemble NER, lazy loading, ONNX session sharing
+sct/utils/onnx_pipeline.py — ONNXNERPipeline: ONNX inference, BIO aggregation
+sct/utils/constants.py — Pre-compiled regexes (URL, email, phone, date, currency, etc.)
+sct/utils/stopwords.py — Language-aware stopword removal (O(1) set lookup)
+sct/utils/resources.py — Lingua language detector (lazy singleton)
+tests/test_sct.py — Full test suite (hypothesis, faker, pytest-timeout)
+```
+
+## Resources
+
+- GitHub: https://github.com/rhnfzl/SqueakyCleanText
+- PyPI: https://pypi.org/project/squeakycleantext/
+- Issues: https://github.com/rhnfzl/SqueakyCleanText/issues
+- Releases: https://github.com/rhnfzl/SqueakyCleanText/releases
+- License: MIT (Rehan Fazal, 2024)

pyproject.toml

Lines changed: 17 additions & 4 deletions
@@ -5,21 +5,31 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "SqueakyCleanText"
 version = "0.5.0"
-description = "A comprehensive text cleaning and preprocessing pipeline."
+description = "Text preprocessing & PII anonymization pipeline for NLP/ML: ONNX NER ensemble, language detection, stopword removal, and configurable token replacement."
 readme = "README.md"
 license = {text = "MIT"}
 authors = [{name = "Rehan Fazal"}]
 requires-python = ">=3.11"
-keywords = ["text cleaning", "text preprocessing", "NLP", "natural language processing"]
+keywords = [
+    "text cleaning", "text preprocessing", "NLP", "natural language processing",
+    "named entity recognition", "NER", "anonymization", "PII removal",
+    "data cleaning", "machine learning", "ONNX", "language detection",
+]
 classifiers = [
+    "Development Status :: 5 - Production/Stable",
+    "Intended Audience :: Developers",
+    "Intended Audience :: Science/Research",
+    "Topic :: Scientific/Engineering :: Artificial Intelligence",
+    "Topic :: Scientific/Engineering :: Information Analysis",
+    "Topic :: Text Processing :: Linguistic",
+    "Topic :: Software Development :: Libraries :: Python Modules",
+    "Natural Language :: English",
     "Programming Language :: Python :: 3",
     "Programming Language :: Python :: 3.11",
     "Programming Language :: Python :: 3.12",
     "Programming Language :: Python :: 3.13",
     "License :: OSI Approved :: MIT License",
     "Operating System :: OS Independent",
-    "Topic :: Software Development :: Libraries",
-    "Topic :: Text Processing",
 ]
 dependencies = [
     "lingua-language-detector>=2.0.2",
@@ -38,6 +48,9 @@ dependencies = [

 [project.urls]
 Homepage = "https://github.com/rhnfzl/SqueakyCleanText"
+Repository = "https://github.com/rhnfzl/SqueakyCleanText"
+Issues = "https://github.com/rhnfzl/SqueakyCleanText/issues"
+Changelog = "https://github.com/rhnfzl/SqueakyCleanText/releases"

 [project.optional-dependencies]
 gpu = [

sct/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -3,4 +3,5 @@
 from sct.config import TextCleanerConfig
 from sct.sct import TextCleaner

+__version__ = "0.5.0"
 __all__ = ["TextCleaner", "TextCleanerConfig"]
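
With this change the version string `"0.5.0"` appears both in `pyproject.toml` and in `sct/__init__.py`, so the two need to be bumped together. A common alternative, sketched here as an option and not as what this repository does, is to read the installed distribution's version through `importlib.metadata`. `resolve_version` and the fallback string are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(dist_name: str, fallback: str = "0.0.0+unknown") -> str:
    """Return the installed distribution's version, or a fallback.

    The fallback covers running from a source checkout that was never
    installed, where no distribution metadata exists.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return fallback


# Hypothetical use in sct/__init__.py, replacing the string literal:
# __version__ = resolve_version("SqueakyCleanText")
print(resolve_version("SqueakyCleanText"))
```

The trade-off is that the literal in the commit works without installed metadata and is trivially greppable, while the lookup keeps `pyproject.toml` as the single source of truth.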
