TextSpitter 2.0: Rust/PyO3 core (encoding, normalization, token counting, chunking) by fsecada01 · Pull Request #30 · fsecada01/TextSpitter

fsecada01 · 2026-06-14T00:26:03Z

Summary

Adds a Maturin-based Rust extension (TextSpitter._core) built with PyO3 0.21, replacing the pure-Python processing path with compiled implementations of four primitives:
- detect_encoding -- single-pass encoding detection via chardetng, replacing the old four-attempt decode loop
- TextNormalizer -- Unicode NFC/NFD/NFKC/NFKD normalization, whitespace collapse, OCR artifact repair, repeated-header stripping across \f-delimited pages
- TokenCounter -- BPE token counting via tiktoken-rs 0.5 with end/middle/smart truncation strategies; batch path releases GIL with Rayon
- TextChunker -- paragraph-aware chunking that respects max_tokens, preserves Markdown tables as atomic units, propagates section titles detected from all-caps headers; batch path releases GIL with Rayon
Adds TextSpitter/_fallback.py -- a pure-Python fallback that mirrors the full _core interface using the Python tiktoken package (when available) so callers never need to branch on _RUST_AVAILABLE
Exports _RUST_AVAILABLE: bool from TextSpitter.__init__ for conditional feature gating in downstream code
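
As a rough illustration of the import-with-fallback pattern described above, the sketch below tries a preferred module first and reports whether it loaded. The helper name `load_backend` is hypothetical and is not claimed to match the actual contents of TextSpitter/__init__.py; only the module names in the trailing comment come from this PR.

```python
import importlib
from types import ModuleType


def load_backend(primary: str, fallback: str) -> tuple[ModuleType, bool]:
    """Import `primary` if possible, otherwise `fallback`.

    Returns the imported module and a flag that is True only when the
    primary (compiled) module was used. Hypothetical helper for illustration.
    """
    try:
        return importlib.import_module(primary), True
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # compiled extension lands here as well.
        return importlib.import_module(fallback), False


# Assumed usage inside a package __init__ (module names taken from the PR text):
# _backend, _RUST_AVAILABLE = load_backend("TextSpitter._core", "TextSpitter._fallback")
```

Because both modules are expected to expose the same interface, callers can use the returned module directly and consult the flag only for optional feature gating.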

Test plan

pytest tests/test_detect_encoding.py -- utf-8, multi-byte, ASCII, Windows-1252, large-buffer stress (Rust + fallback)
pytest tests/test_normalizer.py -- Unicode forms, whitespace collapse, OCR repair, header stripping, batch, idempotency (Rust + fallback)
pytest tests/test_token_counter.py -- exact counts, batch, truncate end/middle/smart, parallel stress 500 texts (Rust + fallback)
pytest tests/test_chunker.py -- field correctness, max_tokens enforcement, oversized-table metadata, section titles, batch, parallel stress 50 texts (Rust + fallback)
pytest tests/test_rust_integration.py -- encode->detect->decode roundtrip, normalize->chunk pipeline, token-count consistency, large 50-section document, OCR repair + chunking, Rust<->fallback interface parity
pytest (full suite, 239 tests) -- no regressions in existing PDF/DOCX/TXT extraction paths

All 239 tests pass locally on Python 3.13 / Windows 11.

Notes

The compiled extension is TextSpitter/_core.pyd (abi3-py310 wheel); added to .gitignore alongside target/ and *.pdb
Switched [build-system] from setuptools to maturin>=1.5,<2.0; pyproject.toml now contains [tool.maturin] config with module-name = "TextSpitter._core"
Added tiktoken to dev dependencies for the fallback path validation tests
DocETL glue (make_parsing_tool, make_chunking_parser) returns list[str] as required by the parsing_tools interface; metadata does not travel through that interface

Generated with Claude Code

Introduces a Maturin-based Rust extension (_core.pyd) that accelerates encoding detection, text normalization, token counting, and token-aware chunking, while keeping a pure-Python fallback path (_fallback.py) for environments where the compiled extension is unavailable.

Rust crates: pyo3 0.21 (Bound API), chardetng, tiktoken-rs 0.5, rayon, unicode-normalization, regex. All batch methods release the GIL via py.allow_threads so CPython threads stay unblocked during heavy work.

Python layer changes:
- __init__.py: try-import _core, fall back to _fallback, export _RUST_AVAILABLE flag alongside Chunk, TextChunker, TextNormalizer, TokenCounter, detect_encoding
- core.py: uses detect_encoding() instead of the old four-attempt loop
- _fallback.py: tiktoken-backed fallback with model validation at construction time; word-based middle-truncation when tiktoken absent
- pyproject.toml: switched build-backend to maturin; added tiktoken to dev deps; added [tool.maturin] config; expanded .gitignore for Rust

Tests added (239 total, all green):
- test_detect_encoding.py -- both Rust and fallback paths
- test_normalizer.py -- Unicode forms, whitespace, OCR, headers
- test_token_counter.py -- count, batch, truncate, parallel stress
- test_chunker.py -- field correctness, max_tokens, tables, section titles, batch, parallel stress
- test_rust_integration.py -- end-to-end pipeline and interface parity

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
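
The word-based middle-truncation mentioned for the tiktoken-less fallback could look roughly like the sketch below. The PR does not say how the budget is split between head and tail or whether a marker is inserted, so the even split, the `marker` parameter, and the function name `truncate_middle_words` are all assumptions made for illustration.

```python
def truncate_middle_words(text: str, max_words: int, marker: str = "...") -> str:
    """Keep the first and last words of `text`, dropping the middle.

    Whitespace-separated words stand in for tokens (an approximation).
    The head receives the extra word when `max_words` is odd. The marker
    is not counted against the budget. All of this is an assumed policy.
    """
    words = text.split()
    if len(words) <= max_words:
        return text  # already within budget; return unchanged
    if max_words <= 0:
        return ""
    head = (max_words + 1) // 2
    tail = max_words - head
    # Guard tail == 0: words[-0:] would return the whole list.
    tail_words = words[-tail:] if tail else []
    return " ".join(words[:head] + [marker] + tail_words)
```

For example, with a budget of four words the text "a b c d e f g" keeps "a b" and "f g" around the marker.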

Correctness fixes:
- token.rs: implement a distinct truncate_smart strategy (2:1 head/tail weighting); it was previously identical to the middle strategy
- chunk.rs: always flush on max_tokens overflow; remove the min_tokens guard that silently allowed chunks to exceed the hard cap
- chunk.rs: fix find_table_end CRLF handling by advancing by the actual terminator byte count (2 for CRLF, 1 for LF) instead of always +1
- encoding.rs: detect a UTF-8 BOM before running chardetng so decoding uses utf-8-sig (strips the BOM) rather than utf-8 (preserves it)
- normalize.rs: seed strip_headers_footers candidates from all pages, not just page 0, so running headers are still stripped from documents with a cover page

Fallback path fixes (_fallback.py):
- detect_encoding: try cp1252 before latin-1 so Windows smart-quote bytes (0x80-0x9F) decode to printable chars, not C1 control chars
- chunk(): oversized paragraphs now set metadata={"oversized": True} to match the Rust interface contract
- chunk(): use a re.split capturing group to track actual separator lengths; fixes char_start/char_end drift after gaps of 3+ newlines
- count/truncate/chunk._count: pass allowed_special="all" to tiktoken encode() to match Rust encode_with_special_tokens behaviour

core.py:
- _decode_bytes chain updated to (utf-8, cp1252, latin-1) so text/CSV files correctly handle Windows-encoded content

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
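
To make the encoding-order fixes above concrete, here is a minimal sketch of a decode chain that checks for a UTF-8 BOM first and then tries utf-8, cp1252, and latin-1 in that order. It combines ideas from the encoding.rs and core.py notes into one hypothetical function (`decode_bytes`, returning the text and the codec used); the real _decode_bytes signature and return type are not shown in this PR and may differ.

```python
import codecs


def decode_bytes(data: bytes) -> tuple[str, str]:
    """Decode `data`, returning (text, codec_name). Illustrative only."""
    candidates = ["utf-8", "cp1252"]
    if data.startswith(codecs.BOM_UTF8):
        # utf-8-sig strips the BOM; plain utf-8 would keep it as U+FEFF.
        candidates.insert(0, "utf-8-sig")
    for codec in candidates:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so it cannot fail. It is
    # tried last because it turns 0x80-0x9F into C1 control characters.
    return data.decode("latin-1"), "latin-1"
```

Trying cp1252 before latin-1 matters because latin-1 never raises, so anything placed after it in the chain would be unreachable.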

fsecada01 marked this pull request as ready for review June 14, 2026 00:30

fsecada01 self-assigned this Jun 14, 2026

fsecada01 added the enhancement (New feature or request), v2.0 (TextSpitter v2.0 Rust backend), rust (Rust implementation work), and testing (Test coverage and quality) labels Jun 14, 2026

fsecada01 merged commit e8b25b7 into main Jun 14, 2026
4 checks passed

fsecada01 merged 2 commits into main from feature/rust-core-v2
