
TextSpitter 2.0: Rust/PyO3 core (encoding, normalization, token counting, chunking) #30

Merged
fsecada01 merged 2 commits into main from feature/rust-core-v2 on Jun 14, 2026

@fsecada01

Owner

Summary

  • Adds a Maturin-based Rust extension (TextSpitter._core) built with PyO3 0.21, replacing the pure-Python processing path with compiled implementations of four primitives:

    • detect_encoding -- single-pass encoding detection via chardetng, replacing the old four-attempt decode loop
    • TextNormalizer -- Unicode NFC/NFD/NFKC/NFKD normalization, whitespace collapse, OCR artifact repair, repeated-header stripping across \f-delimited pages
    • TokenCounter -- BPE token counting via tiktoken-rs 0.5 with end/middle/smart truncation strategies; the batch path releases the GIL and parallelizes with Rayon
    • TextChunker -- paragraph-aware chunking that respects max_tokens, preserves Markdown tables as atomic units, and propagates section titles detected from all-caps headers; the batch path releases the GIL and parallelizes with Rayon
  • Adds TextSpitter/_fallback.py -- a pure-Python fallback that mirrors the full _core interface using the Python tiktoken package (when available) so callers never need to branch on _RUST_AVAILABLE

  • Exports _RUST_AVAILABLE: bool from TextSpitter.__init__ for conditional feature gating in downstream code
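
The import-time fallback described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual `__init__.py`: `load_backend` is a hypothetical helper, and the module names passed to it are placeholders chosen so the snippet runs anywhere.

```python
import importlib
from types import ModuleType
from typing import Tuple


def load_backend(preferred: str, fallback: str) -> Tuple[ModuleType, bool]:
    """Import `preferred`; if that raises ImportError, import `fallback`.

    Returns the loaded module and a flag that is True only when the
    preferred (compiled) backend was the one that loaded.
    """
    try:
        return importlib.import_module(preferred), True
    except ImportError:
        return importlib.import_module(fallback), False


# Placeholder names (assumptions): a module that does not exist stands in
# for a missing compiled extension, and the stdlib `json` module stands in
# for the pure-Python fallback.
backend, rust_available = load_backend("no_such_compiled_core", "json")
```

Because both backends are expected to expose the same names, callers can use `backend` directly and consult the flag only when they need to gate a feature.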

Test plan

  • pytest tests/test_detect_encoding.py -- utf-8, multi-byte, ASCII, Windows-1252, large-buffer stress (Rust + fallback)
  • pytest tests/test_normalizer.py -- Unicode forms, whitespace collapse, OCR repair, header stripping, batch, idempotency (Rust + fallback)
  • pytest tests/test_token_counter.py -- exact counts, batch, truncate end/middle/smart, parallel stress 500 texts (Rust + fallback)
  • pytest tests/test_chunker.py -- field correctness, max_tokens enforcement, oversized-table metadata, section titles, batch, parallel stress 50 texts (Rust + fallback)
  • pytest tests/test_rust_integration.py -- encode->detect->decode roundtrip, normalize->chunk pipeline, token-count consistency, large 50-section document, OCR repair + chunking, Rust<->fallback interface parity
  • pytest (full suite, 239 tests) -- no regressions in existing PDF/DOCX/TXT extraction paths

All 239 tests pass locally on Python 3.13 / Windows 11.

Notes

  • The compiled extension is TextSpitter/_core.pyd (abi3-py310 wheel); added to .gitignore alongside target/ and *.pdb
  • Switched [build-system] from setuptools to maturin>=1.5,<2.0; pyproject.toml now contains [tool.maturin] config with module-name = "TextSpitter._core"
  • Added tiktoken to dev dependencies for the fallback path validation tests
  • DocETL glue (make_parsing_tool, make_chunking_parser) returns list[str] as required by the parsing_tools interface; metadata does not travel through that interface

Generated with Claude Code

Introduces a Maturin-based Rust extension (_core.pyd) that accelerates
encoding detection, text normalization, token counting, and token-aware
chunking, while keeping a pure-Python fallback path (_fallback.py) for
environments where the compiled extension is unavailable.

Rust crates: pyo3 0.21 (Bound API), chardetng, tiktoken-rs 0.5, rayon,
unicode-normalization, regex. All batch methods release the GIL via
py.allow_threads so CPython threads stay unblocked during heavy work.

Python layer changes:
- __init__.py: try-import _core, fall back to _fallback, export
  _RUST_AVAILABLE flag alongside Chunk, TextChunker, TextNormalizer,
  TokenCounter, detect_encoding
- core.py: uses detect_encoding() instead of the old four-attempt loop
- _fallback.py: tiktoken-backed fallback with model validation at
  construction time; word-based middle-truncation when tiktoken absent
- pyproject.toml: switched build-backend to maturin; added tiktoken to
  dev deps; added [tool.maturin] config; expanded .gitignore for Rust

Tests added (239 total, all green):
- test_detect_encoding.py   -- both Rust and fallback paths
- test_normalizer.py        -- Unicode forms, whitespace, OCR, headers
- test_token_counter.py     -- count, batch, truncate, parallel stress
- test_chunker.py           -- field correctness, max_tokens, tables,
                               section titles, batch, parallel stress
- test_rust_integration.py  -- end-to-end pipeline and interface parity

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@fsecada01 marked this pull request as ready for review June 14, 2026 00:30
Correctness fixes:
- token.rs: implement distinct truncate_smart strategy (2:1 head/tail
  weighting) — was identical to middle strategy
- chunk.rs: always flush on max_tokens overflow; remove min_tokens guard
  that silently allowed chunks to exceed the hard cap
- chunk.rs: fix find_table_end CRLF handling — advance by actual
  terminator byte count (2 for CRLF, 1 for LF) instead of always +1
- encoding.rs: detect UTF-8 BOM before chardetng so decode uses
  utf-8-sig (strips BOM) rather than utf-8 (preserves it)
- normalize.rs: seed strip_headers_footers candidates from all pages,
  not just page 0, so cover-page documents strip running headers
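
The BOM fix in encoding.rs can be illustrated in Python. This is a sketch of the idea only, not the Rust implementation: `bom_codec` is a hypothetical helper, and the real code hands non-BOM input to chardetng rather than defaulting to utf-8.

```python
import codecs
from typing import Optional


def bom_codec(data: bytes) -> Optional[str]:
    """Return "utf-8-sig" when `data` starts with the UTF-8 BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return None


raw = codecs.BOM_UTF8 + "héllo".encode("utf-8")

# Check for the BOM first; plain utf-8 here is a stand-in (assumption)
# for whatever the statistical detector would report.
codec = bom_codec(raw) or "utf-8"
text = raw.decode(codec)              # BOM stripped by utf-8-sig
with_bom = raw.decode("utf-8")        # BOM kept as a leading U+FEFF
```

Decoding with plain utf-8 leaves U+FEFF at the start of the string, which is why the check has to run before the general-purpose detector.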

Fallback path fixes (_fallback.py):
- detect_encoding: try cp1252 before latin-1 so Windows smart-quote
  bytes (0x80-0x9F) decode to printable chars, not C1 control chars
- chunk(): oversized paragraphs now set metadata={oversized: True}
  to match the Rust interface contract
- chunk(): use re.split capturing group to track actual separator
  lengths; fixes char_start/char_end drift after 3+-newline gaps
- count/truncate/chunk._count: pass allowed_special=all to tiktoken
  encode() to match Rust encode_with_special_tokens behaviour
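
The separator-tracking fix for char_start/char_end can be sketched as below. This is an assumed reconstruction of the technique, not the code in _fallback.py: the function name and the `\n{2,}` paragraph separator are illustrative choices.

```python
import re
from typing import List, Tuple


def split_paragraphs_with_offsets(text: str) -> List[Tuple[str, int, int]]:
    """Split on runs of 2+ newlines, returning (paragraph, start, end).

    The capturing group makes re.split return the separators too, so the
    running position advances by each separator's real length instead of
    an assumed fixed width.
    """
    parts = re.split(r"(\n{2,})", text)
    spans = []
    pos = 0
    for index, part in enumerate(parts):
        # Even indices are paragraphs, odd indices are captured separators.
        if index % 2 == 0 and part:
            spans.append((part, pos, pos + len(part)))
        pos += len(part)
    return spans


sample = "a\n\n\nb\n\nc"
spans = split_paragraphs_with_offsets(sample)
```

With the three-newline gap after `a`, an implementation that assumed a two-character separator would report `b` one character early; here each offset slices back to its paragraph.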

core.py:
- _decode_bytes chain updated to (utf-8, cp1252, latin-1) so text/CSV
  files correctly handle Windows-encoded content
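
A decode chain in that order might look like the sketch below. The function name and return shape are assumptions for illustration, not the actual `_decode_bytes` signature in core.py.

```python
from typing import Sequence, Tuple


def decode_with_chain(
    data: bytes,
    encodings: Sequence[str] = ("utf-8", "cp1252", "latin-1"),
) -> Tuple[str, str]:
    """Try each encoding in order; return (text, encoding_that_worked)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte, so this is only reachable with a custom chain.
    raise ValueError("none of the encodings could decode the data")


# 0x93 / 0x94 are Windows-1252 curly quotes; latin-1 would map them to
# C1 control characters instead.
quoted = decode_with_chain(b"\x93hi\x94")
```

Trying cp1252 before latin-1 matters because latin-1 accepts any byte sequence, so it must come last or it would mask every later candidate.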

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@fsecada01 self-assigned this Jun 14, 2026
@fsecada01 added the enhancement (New feature or request), v2.0 (TextSpitter v2.0 Rust backend), rust (Rust implementation work), and testing (Test coverage and quality) labels Jun 14, 2026
@fsecada01 merged commit e8b25b7 into main Jun 14, 2026
4 checks passed