TextSpitter 2.0: Rust/PyO3 core (encoding, normalization, token counting, chunking) #30
Merged
Introduces a Maturin-based Rust extension (_core.pyd) that accelerates
encoding detection, text normalization, token counting, and token-aware
chunking, while keeping a pure-Python fallback path (_fallback.py) for
environments where the compiled extension is unavailable.
Rust crates: pyo3 0.21 (Bound API), chardetng, tiktoken-rs 0.5, rayon,
unicode-normalization, regex. All batch methods release the GIL via
py.allow_threads so CPython threads stay unblocked during heavy work.
Python layer changes:
- __init__.py: try-import _core, fall back to _fallback, export
_RUST_AVAILABLE flag alongside Chunk, TextChunker, TextNormalizer,
TokenCounter, detect_encoding
- core.py: uses detect_encoding() instead of the old four-attempt loop
- _fallback.py: tiktoken-backed fallback with model validation at
construction time; word-based middle-truncation when tiktoken absent
- pyproject.toml: switched build-backend to maturin; added tiktoken to
dev deps; added [tool.maturin] config; expanded .gitignore for Rust
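The try-import behaviour described for __init__.py can be sketched roughly as below. This is an illustration only, not the package's actual code: `load_backend` is a hypothetical helper, a deliberately missing module name stands in for the compiled `_core`, and the standard-library `json` module stands in for `_fallback` so the snippet runs anywhere.

```python
import importlib
from types import ModuleType


def load_backend(primary: str, fallback: str) -> tuple[ModuleType, bool]:
    """Return (module, used_primary). Hypothetical helper for illustration."""
    try:
        return importlib.import_module(primary), True
    except ImportError:
        # The compiled extension is not importable (for example, no wheel
        # was built for this platform), so use the pure-Python module.
        return importlib.import_module(fallback), False


# Stand-ins: a module name that does not exist plays the compiled core,
# and the stdlib json module plays the pure-Python fallback.
backend, rust_available = load_backend("_no_such_compiled_core", "json")
print(backend.__name__, rust_available)
```

Returning the flag next to the module mirrors the idea of exporting _RUST_AVAILABLE: callers use one interface and only consult the flag when they need to gate a feature.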
Tests added (239 total, all green):
- test_detect_encoding.py -- both Rust and fallback paths
- test_normalizer.py -- Unicode forms, whitespace, OCR, headers
- test_token_counter.py -- count, batch, truncate, parallel stress
- test_chunker.py -- field correctness, max_tokens, tables,
section titles, batch, parallel stress
- test_rust_integration.py -- end-to-end pipeline and interface parity
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Correctness fixes:
- token.rs: implement distinct truncate_smart strategy (2:1 head/tail
weighting) — was identical to middle strategy
- chunk.rs: always flush on max_tokens overflow; remove min_tokens guard
that silently allowed chunks to exceed the hard cap
- chunk.rs: fix find_table_end CRLF handling — advance by actual
terminator byte count (2 for CRLF, 1 for LF) instead of always +1
- encoding.rs: detect UTF-8 BOM before chardetng so decode uses
utf-8-sig (strips BOM) rather than utf-8 (preserves it)
- normalize.rs: seed strip_headers_footers candidates from all pages,
not just page 0, so cover-page documents strip running headers
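To show why a 2:1 head/tail weighting differs from an even middle split, here is a small sketch that works on token ids only. The function names, the integer rounding, and the absence of any elision marker are assumptions for illustration; the behaviour in token.rs may differ in those details.

```python
def _head_tail(tokens: list[int], head: int, tail: int) -> list[int]:
    # tokens[-0:] would return the whole list, so guard the empty-tail case.
    return tokens[:head] + (tokens[-tail:] if tail else [])


def truncate_middle(tokens: list[int], max_tokens: int) -> list[int]:
    """Drop the middle, keeping an even head/tail split."""
    if max_tokens <= 0:
        return []
    if len(tokens) <= max_tokens:
        return list(tokens)
    tail = max_tokens // 2
    return _head_tail(tokens, max_tokens - tail, tail)


def truncate_smart(tokens: list[int], max_tokens: int) -> list[int]:
    """Drop the middle, keeping about two head tokens per tail token (assumed 2:1)."""
    if max_tokens <= 0:
        return []
    if len(tokens) <= max_tokens:
        return list(tokens)
    tail = max_tokens // 3          # one third of the budget for the tail
    return _head_tail(tokens, max_tokens - tail, tail)


ids = list(range(12))
print(truncate_middle(ids, 6))  # [0, 1, 2, 9, 10, 11]
print(truncate_smart(ids, 6))   # [0, 1, 2, 3, 10, 11]
```

With the same budget the two strategies now return different token sets, which is the property the fix restores.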
Fallback path fixes (_fallback.py):
- detect_encoding: try cp1252 before latin-1 so Windows smart-quote
bytes (0x80-0x9F) decode to printable chars, not C1 control chars
- chunk(): oversized paragraphs now set metadata={"oversized": True}
to match the Rust interface contract
- chunk(): use re.split capturing group to track actual separator
lengths; fixes char_start/char_end drift after 3+-newline gaps
- count/truncate/chunk._count: pass allowed_special="all" to tiktoken
encode() to match Rust encode_with_special_tokens behaviour
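The separator-tracking fix can be illustrated with a short sketch. It assumes paragraphs are separated by runs of two or more newlines; `paragraph_spans` is a hypothetical name, and the real chunk() also handles token budgets, tables, and section titles that are left out here.

```python
import re


def paragraph_spans(text: str) -> list[tuple[int, int, str]]:
    """Return (char_start, char_end, paragraph) using real separator lengths."""
    spans = []
    pos = 0
    # With a capturing group, re.split keeps the separators: paragraphs sit
    # at even indices and the matched newline runs at odd indices.
    parts = re.split(r"(\n{2,})", text)
    for i, part in enumerate(parts):
        if i % 2 == 0 and part:
            spans.append((pos, pos + len(part), part))
        # Advance by the actual length, whether paragraph or separator.
        pos += len(part)
    return spans


text = "alpha\n\nbeta\n\n\n\ngamma"
for start, end, para in paragraph_spans(text):
    print(start, end, repr(para))
```

Assuming every gap is exactly two characters would place "gamma" at offset 13 instead of 15, which is the kind of drift the commit describes after gaps of three or more newlines.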
core.py:
- _decode_bytes chain updated to (utf-8, cp1252, latin-1) so text/CSV
files correctly handle Windows-encoded content
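A minimal sketch of such a decode chain follows. The name `decode_chain` and its return shape are assumptions for illustration and are not the signature of _decode_bytes in core.py.

```python
def decode_chain(
    data: bytes,
    encodings: tuple[str, ...] = ("utf-8", "cp1252", "latin-1"),
) -> tuple[str, str]:
    """Return (text, encoding_used), trying each encoding in order."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Not reachable with the default chain: latin-1 maps all 256 byte values.
    raise ValueError("no encoding in the chain could decode the input")


print(decode_chain("café".encode("utf-8")))  # valid UTF-8 wins first
print(decode_chain(b"\x93quoted\x94"))       # smart quotes via cp1252
print(decode_chain(b"\x81"))                 # undefined in cp1252, so latin-1
```

Order matters: latin-1 never fails, so it has to come last, and placing cp1252 ahead of it lets bytes in the 0x80-0x9F range decode to printable characters whenever cp1252 defines them.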
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- Adds a Maturin-based Rust extension (TextSpitter._core) built with PyO3 0.21, replacing the pure-Python processing path with compiled implementations of four primitives:
  - detect_encoding -- single-pass encoding detection via chardetng, replacing the old four-attempt decode loop
  - TextNormalizer -- Unicode NFC/NFD/NFKC/NFKD normalization, whitespace collapse, OCR artifact repair, repeated-header stripping across \f-delimited pages
  - TokenCounter -- BPE token counting via tiktoken-rs 0.5 with end/middle/smart truncation strategies; batch path releases GIL with Rayon
  - TextChunker -- paragraph-aware chunking that respects max_tokens, preserves Markdown tables as atomic units, propagates section titles detected from all-caps headers; batch path releases GIL with Rayon
- Adds TextSpitter/_fallback.py -- a pure-Python fallback that mirrors the full _core interface using the Python tiktoken package (when available) so callers never need to branch on _RUST_AVAILABLE
- Exports _RUST_AVAILABLE: bool from TextSpitter.__init__ for conditional feature gating in downstream code
Test plan
- pytest tests/test_detect_encoding.py -- utf-8, multi-byte, ASCII, Windows-1252, large-buffer stress (Rust + fallback)
- pytest tests/test_normalizer.py -- Unicode forms, whitespace collapse, OCR repair, header stripping, batch, idempotency (Rust + fallback)
- pytest tests/test_token_counter.py -- exact counts, batch, truncate end/middle/smart, parallel stress 500 texts (Rust + fallback)
- pytest tests/test_chunker.py -- field correctness, max_tokens enforcement, oversized-table metadata, section titles, batch, parallel stress 50 texts (Rust + fallback)
- pytest tests/test_rust_integration.py -- encode->detect->decode roundtrip, normalize->chunk pipeline, token-count consistency, large 50-section document, OCR repair + chunking, Rust<->fallback interface parity
- pytest (full suite, 239 tests) -- no regressions in existing PDF/DOCX/TXT extraction paths
All 239 tests pass locally on Python 3.13 / Windows 11.
Notes
- TextSpitter/_core.pyd (abi3-py310 wheel); added to .gitignore alongside target/ and *.pdb
- Switched [build-system] from setuptools to maturin>=1.5,<2.0; pyproject.toml now contains [tool.maturin] config with module-name = "TextSpitter._core"
- Added tiktoken to dev dependencies for the fallback path validation tests
- [...] (make_parsing_tool, make_chunking_parser) returns list[str] as required by the parsing_tools interface; metadata does not travel through that interface
Generated with Claude Code