Add Rust badge, revise overview and features table for the v2.0 backend,
update project structure to include src/ and new test modules, lower
prerequisite to Python >= 3.10 (abi3), and mark completed v2.0 roadmap
items (core, fallback, manylinux wheels, chardetng, chunking, Rayon batch).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- TextSpitter is a lightweight Python library that extracts text from documents and source-code files with a single call. It normalises diverse input types — file paths, `BytesIO` streams, `SpooledTemporaryFile` objects, and raw `bytes` — into plain strings, making it ideal for pipelines that feed text into LLMs, search engines, or data-processing workflows.
+ TextSpitter is a Python library that extracts text from documents and source-code files with a single call. It normalises diverse input types — file paths, `BytesIO` streams, `SpooledTemporaryFile` objects, and raw `bytes` — into plain strings, making it ideal for pipelines that feed text into LLMs, search engines, or data-processing workflows.
+
+ As of **v2.0**, the processing core is written in Rust (via PyO3 + Maturin), delivering 10x–40x batch throughput improvements over the pure-Python v1 implementation. A transparent Python fallback is included for environments where the native extension is unavailable.
**Why TextSpitter?**
- 📄 **Multi-format extraction** — PDF (PyMuPDF + PyPDF fallback), DOCX, TXT, CSV, and 50+ programming-language file types.
- 🔌 **Stream-first API** — accepts file paths, `BytesIO`, `SpooledTemporaryFile`, or raw `bytes`; no temp files required.
+ - ⚡ **Rust-powered core** — encoding detection, Unicode normalisation, BPE token counting, and text chunking all run in native code with Rayon parallelism and GIL-released batch methods.
+ - 🐍 **Graceful fallback** — pure-Python mirror of every Rust class; `_RUST_AVAILABLE` flag lets callers detect which path is active.
- 🛠️ **Optional structured logging** — install `textspitter[logging]` to add `loguru`; falls back to stdlib `logging` transparently.
- 🖥️ **CLI included** — `uv tool install textspitter` gives you a `textspitter` command for quick one-off extractions.
- - 🚀 **Automated CI/CD** — GitHub Actions run the test matrix (Python 3.12–3.14) and publish docs to GitHub Pages on every push.
+ - 🚀 **Automated CI/CD** — GitHub Actions run the test matrix (Python 3.12–3.14) and publish multi-platform wheels (Linux, Windows, macOS) to PyPI on every release.
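
The stream-first bullet above describes accepting paths, in-memory streams, and raw `bytes` without temp files. As a rough illustration of that idea only, and not TextSpitter's actual code, the sketch below funnels each input type into a single `bytes` value before decoding. The names `read_all_bytes` and `to_text` are hypothetical, and the fixed UTF-8 decode stands in for the real encoding detection.

```python
import io
import tempfile
from pathlib import Path
from typing import BinaryIO, Union

# Hypothetical alias for the input kinds the README lists.
Source = Union[str, Path, bytes, BinaryIO]


def read_all_bytes(source: Source) -> bytes:
    """Return the full byte content of a path, raw bytes, or a binary stream."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()
    # File-like objects (BytesIO, SpooledTemporaryFile, ...): rewind when
    # possible so the whole stream is read, not just the unread tail.
    if hasattr(source, "seek"):
        source.seek(0)
    return source.read()


def to_text(source: Source, encoding: str = "utf-8") -> str:
    """Decode any supported source; undecodable bytes become U+FFFD."""
    return read_all_bytes(source).decode(encoding, errors="replace")


# Usage: the same call works for raw bytes and for both stream types.
from_bytes = to_text(b"hello")
from_stream = to_text(io.BytesIO("café".encode("utf-8")))
with tempfile.SpooledTemporaryFile() as spooled:
    spooled.write(b"spooled data")
    from_spooled = to_text(spooled)
```

Reducing every input to `bytes` first keeps the format-specific readers free of any input-type branching, which is one plausible reason for a dispatcher layer to sit in front of them.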
- | 🔩 |**Code Quality**| <ul><li>Strict PEP 8 / ruff linting with black formatting</li><li>Full type hints; ships a `py.typed` PEP 561 marker</li></ul> |
+ | ⚙️ |**Architecture**| <ul><li>Four-layer design: `TextSpitter` convenience function → `WordLoader` dispatcher → `FileExtractor` reader → Rust `_core` extension</li><li>Transparent Python fallback (`_fallback.py`) when the native extension is unavailable</li></ul> |
+ | 🦀 |**Rust Core**| <ul><li>`detect_encoding` — single-pass chardetng encoding detection with UTF-8 BOM handling</li><li>`TextNormalizer` — Unicode NFC/NFD/NFKC/NFKD, whitespace collapse, OCR artifact repair, header/footer stripping</li><li>`TokenCounter` — BPE counting via tiktoken-rs; `count_batch()` releases the GIL via Rayon</li><li>`TextChunker` / `Chunk` — token-aware chunking with table preservation and section detection</li></ul> |
+ | 🔩 |**Code Quality**| <ul><li>Strict PEP 8 / ruff linting with black formatting</li><li>Full type hints on both Python and Rust layers; ships a `py.typed` PEP 561 marker</li></ul> |
| 📄 |**Documentation**| <ul><li>API docs auto-published to GitHub Pages via pdoc</li><li>Quick-start guide, tutorial, use-case examples, and recipes</li></ul> |
- | 🔌 |**Integrations**| <ul><li>CI/CD with GitHub Actions (tests + docs + PyPI publish)</li><li>Package management via `uv`; installable via `pip` or `uv tool install`</li></ul> |
+ | 🔌 |**Integrations**| <ul><li>CI/CD with GitHub Actions (tests + docs + multi-platform PyPI publish via maturin-action)</li><li>Package management via `uv`; installable via `pip` or `uv tool install`</li></ul> |
| 🧩 |**Modularity**| <ul><li>Core `FileExtractor` separated from dispatch logic in `WordLoader`</li><li>Logging abstraction in `logger.py` isolates the optional `loguru` dependency</li></ul> |
- | 🧪 |**Testing**| <ul><li>~70 pytest tests covering all readers and input types</li><li>Dual-mode log capture fixture works with or without `loguru`</li></ul> |
+ | 🧪 |**Testing**| <ul><li>239 pytest tests covering all readers, Rust classes, and Python fallback paths</li><li>Dual-path test fixtures exercise both `_RUST_AVAILABLE=True` and `False` branches</li></ul> |
+ | ⚡️ |**Performance**| <ul><li>10x–40x batch throughput improvement over v1 via Rust + Rayon parallelism</li><li>GIL released on all `*_batch()` methods; Python threads unblocked during Rust work</li></ul> |