Optimize: add Cython extensions for is_thai_char, is_thai, count_thai (2.2x–11.0x) #1394
chanitnan0jr wants to merge 20 commits into PyThaiNLP:dev
Provide compiled C extensions for is_thai_char, is_thai, count_thai (pythainlp._ext._thai_fast) and remove_tonemark (pythainlp._ext._normalize_fast). The extensions are loaded at import time with a pure-Python fallback when the compiled modules are absent, so the change is backward-compatible and does not affect builds without a C compiler. The remove_tonemark implementation filters tone marks directly in UTF-8 byte space using typed memory views, avoiding per-character Python object allocation. Benchmarks on CPython 3.12 show speedups of 2.2x (is_thai_char), 6.8x (is_thai), 10.4x (count_thai), and 1.6x (remove_tonemark) over the corresponding pure-Python implementations.
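The byte-space filtering described above can be illustrated in plain Python. This is only a sketch, not the PR's Cython code: the function name `remove_tonemark_bytes` is hypothetical, and it assumes the tone marks are exactly U+0E48 to U+0E4B, which encode in UTF-8 as the three-byte sequences E0 B9 88 through E0 B9 8B.

```python
def remove_tonemark_bytes(text: str) -> str:
    """Drop Thai tone marks by scanning the UTF-8 encoding of ``text``.

    Assumption: tone marks are U+0E48..U+0E4B (bytes E0 B9 88..8B).
    """
    data = text.encode("utf-8")
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        # 0xE0 only ever appears as a lead byte in valid UTF-8, so this
        # three-byte match can only start on a character boundary.
        if (
            i + 2 < n
            and data[i] == 0xE0
            and data[i + 1] == 0xB9
            and 0x88 <= data[i + 2] <= 0x8B
        ):
            i += 3  # skip the whole tone-mark sequence
            continue
        out.append(data[i])
        i += 1
    return out.decode("utf-8")


# U+0E19 (no nu) + U+0E49 (mai tho) + U+0E33 (sara am)
sample = "\u0e19\u0e49\u0e33"
result = remove_tonemark_bytes(sample)
```

The compiled version presumably gains its speed from doing this scan over a typed buffer without creating a Python object per character; the pure-Python loop here only shows the logic.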
- Remove unused imports: os, statistics
- Replace bare except Exception: pass with except OSError in _get_cpu_model
- Annotate func_py, func_cy, func as Callable[..., object] instead of object to satisfy type checker and remove # type: ignore[operator] suppressions
- Remove unused label parameter from profile_function and its call sites
Extract the repeated load_tests function body into a shared factory make_load_tests() in tests/_noauto_loader.py. Each noauto __init__.py now declares only its test_packages list and calls make_load_tests(). Eliminates 5 identical 15-line blocks (noauto_cython, noauto_network, noauto_onnx, noauto_tensorflow, noauto_torch).
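A loader factory of this kind might look roughly like the sketch below. The real `make_load_tests()` in tests/_noauto_loader.py is not shown on this page, so the signature and body here are assumptions; only the `load_tests(loader, standard_tests, pattern)` protocol is standard unittest behavior. The module name `demo_noauto_tests` is made up for the demonstration.

```python
import sys
import types
import unittest
from collections.abc import Callable


def make_load_tests(test_packages: list) -> Callable[..., unittest.TestSuite]:
    """Return a unittest ``load_tests`` hook that loads the named modules."""

    def load_tests(loader, standard_tests, pattern):
        suite = unittest.TestSuite()
        for name in test_packages:
            suite.addTests(loader.loadTestsFromName(name))
        return suite

    return load_tests


# Demonstration with a throwaway in-memory module (hypothetical name).
class _DemoTest(unittest.TestCase):
    def test_ok(self):
        self.assertTrue(True)


_demo_module = types.ModuleType("demo_noauto_tests")
_demo_module._DemoTest = _DemoTest
sys.modules["demo_noauto_tests"] = _demo_module

demo_load_tests = make_load_tests(["demo_noauto_tests"])
demo_suite = demo_load_tests(unittest.defaultTestLoader, unittest.TestSuite(), None)
```

In a package `__init__.py`, the hook would be bound to the name `load_tests` so that unittest discovery picks it up.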
Pull request overview
This PR introduces optional Cython extensions for hot-path Thai utility functions to improve performance while keeping pure-Python fallbacks, and adds supporting tests/benchmarks plus minor test-suite refactoring.
Changes:
- Add Cython extensions (and type stubs) for is_thai_char, is_thai, count_thai, with runtime fallback wiring in pythainlp.util.thai.
- Add Cython-focused correctness/performance tests and a reproducible benchmark + cProfile evidence script.
- Deduplicate noauto_* unittest load_tests boilerplate via a shared loader factory.
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pythainlp/_ext/_thai_fast.pyx | New Cython implementations for Thai character utilities (auto-loaded). |
| pythainlp/_ext/_thai_fast.pyi | Type stubs for the _thai_fast extension. |
| pythainlp/_ext/_normalize_fast.pyx | New Cython normalization functions (intended for direct import/testing). |
| pythainlp/_ext/_normalize_fast.pyi | Type stubs for the _normalize_fast extension. |
| pythainlp/_ext/__init__.py | Marks _ext as the optional extensions package. |
| pythainlp/util/thai.py | Keeps _py_* references and conditionally overrides with Cython fast-paths. |
| pythainlp/util/normalize.py | Simplifies remove_tonemark loop and preserves _py_* references (no auto-override). |
| tests/noauto_cython/testn_fast_functions.py | New correctness + performance assertions for the fast functions. |
| tests/noauto_cython/__init__.py | Registers the new noauto-cython test module via shared loader. |
| tests/_noauto_loader.py | New shared make_load_tests() factory for noauto suites. |
| tests/noauto_torch/__init__.py | Switches to shared noauto loader. |
| tests/noauto_tensorflow/__init__.py | Switches to shared noauto loader. |
| tests/noauto_onnx/__init__.py | Switches to shared noauto loader. |
| tests/noauto_network/__init__.py | Switches to shared noauto loader. |
| scripts/bench_full_evidence.py | New end-to-end benchmark + cProfile evidence script. |
| pyproject.toml | Adds hatch-cython/Cython build configuration and mypy override for _ext. |
| .gitignore | Ignores generated C files under pythainlp/**. |
    Unlike the pure-Python implementation (which raises ``TypeError``
    for empty or multi-character strings via ``ord()``), this
    implementation returns ``False`` for any input whose length is
    not exactly 1.
    """
    if len(ch) != 1:
        return False
is_thai_char() no longer matches the pure-Python behavior for invalid input lengths. The Python implementation calls ord(ch) and raises TypeError for ch == "" or multi-character strings, but this Cython version returns False when len(ch) != 1. Because pythainlp.util.thai auto-overrides to the Cython implementation when available, this becomes a user-visible API behavior change. Consider raising TypeError with the same message as ord() when len(ch) != 1 (slow path), so the optimized path stays compatible for valid inputs.
Suggested change
    -    Unlike the pure-Python implementation (which raises ``TypeError``
    -    for empty or multi-character strings via ``ord()``), this
    -    implementation returns ``False`` for any input whose length is
    -    not exactly 1.
    -    """
    -    if len(ch) != 1:
    -        return False
    +    This function mirrors the pure-Python implementation: a
    +    :class:`TypeError` is raised (with the same message as
    +    :func:`ord`) if ``ch`` is an empty string or has length other
    +    than 1.
    +    """
    +    if len(ch) != 1:
    +        raise TypeError(
    +            "ord() expected a character, but string of length %d found"
    +            % len(ch)
    +        )
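The difference between the two behaviors can be reproduced in plain Python. The sketch below is illustrative only: the function names are hypothetical, and the Thai range 0x0E00 to 0x0E7F (the Unicode Thai block) is an assumption about what the library checks.

```python
# Assumed bounds: the Unicode Thai block.
_TH_FIRST = 0x0E00
_TH_LAST = 0x0E7F


def is_thai_char_lenient(ch: str) -> bool:
    """Behaves like the first Cython draft: wrong lengths give False."""
    if len(ch) != 1:
        return False
    return _TH_FIRST <= ord(ch) <= _TH_LAST


def is_thai_char_strict(ch: str) -> bool:
    """Behaves like the suggested change: wrong lengths raise TypeError."""
    if len(ch) != 1:
        raise TypeError(
            "ord() expected a character, but string of length %d found"
            % len(ch)
        )
    return _TH_FIRST <= ord(ch) <= _TH_LAST


def raises_type_error(func, arg) -> bool:
    """Report whether ``func(arg)`` raises TypeError."""
    try:
        func(arg)
    except TypeError:
        return True
    return False
```

One trade-off of the explicit `raise` is that the message text is hard-coded, so it matches `ord()` only as long as the interpreter's own wording stays the same.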
pythainlp/util/thai.py
Outdated
            count_thai,  # noqa: F811
            is_thai,  # noqa: F811
            is_thai_char,  # noqa: F811
        )
    except ImportError:
        pass
This override makes the Cython implementation the public is_thai_char/is_thai/count_thai at runtime. Since _thai_fast.is_thai_char() currently returns False for empty/multi-character strings (instead of raising TypeError like the pure-Python ord(ch) implementation), importing the extension changes user-visible behavior depending on whether the extension is built. To keep backward compatibility, either keep a small Python wrapper that preserves the original exception behavior, or update the Cython function to raise the same TypeError as ord() for invalid lengths.
Suggested change
    -        count_thai,  # noqa: F811
    -        is_thai,  # noqa: F811
    -        is_thai_char,  # noqa: F811
    -    )
    -except ImportError:
    -    pass
    +        count_thai as _fast_count_thai,
    +        is_thai as _fast_is_thai,
    +        is_thai_char as _fast_is_thai_char,
    +    )
    +except ImportError:
    +    pass
    +else:
    +    # Use fast implementations for is_thai and count_thai directly.
    +    # For is_thai_char, wrap the fast implementation so that we preserve
    +    # the original TypeError behavior of the pure-Python version (which
    +    # relies on ord(ch)) for invalid input lengths.
    +    count_thai = _fast_count_thai  # noqa: F811
    +    is_thai = _fast_is_thai  # noqa: F811
    +    def is_thai_char(ch: str) -> bool:  # noqa: F811
    +        # ord(ch) will raise the same TypeError as the original implementation
    +        # for empty strings or strings of length != 1, preserving behavior.
    +        _ = ord(ch)
    +        return _fast_is_thai_char(ch)
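The wrapper idea in this suggestion can be tried without a compiled extension. In the sketch below, `_fast_is_thai_char` is a pure-Python stand-in that imitates the lenient behavior attributed to the Cython function; it is not the real extension, and the 0x0E00 to 0x0E7F range is an assumption.

```python
def _fast_is_thai_char(ch: str) -> bool:
    # Stand-in for the compiled function: returns False for bad lengths.
    return len(ch) == 1 and 0x0E00 <= ord(ch) <= 0x0E7F


def is_thai_char(ch: str) -> bool:
    # Let ord() do the validation so the exception type and message are
    # whatever the running interpreter produces for invalid input.
    ord(ch)
    return _fast_is_thai_char(ch)


def error_message(func, arg):
    """Return the TypeError message raised by ``func(arg)``, or None."""
    try:
        func(arg)
    except TypeError as exc:
        return str(exc)
    return None
```

Because the wrapper calls `ord()` itself, its error text tracks the interpreter automatically, at the cost of one extra Python-level call per character.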
pythainlp/_ext/_normalize_fast.pyx
Outdated
    pythainlp.util.normalize and are loaded as transparent replacements when the
    Cython extension is available.
The module docstring says these normalization functions “are loaded as transparent replacements when the Cython extension is available”, but pythainlp.util.normalize explicitly does not auto-load Cython overrides (it only keeps _py_* references). Please update the docstring to match the actual behavior to avoid misleading users.
Suggested change
    -pythainlp.util.normalize and are loaded as transparent replacements when the
    -Cython extension is available.
    +pythainlp.util.normalize and can be used as faster drop-in replacements
    +when explicitly imported.
    def _speedup(self, py_func, cy_func, arg: str, n: int = 5000) -> float:
        py_time = timeit.timeit(lambda: py_func(arg), number=n)
        cy_time = timeit.timeit(lambda: cy_func(arg), number=n)
        return py_time / cy_time

    def test_is_thai_char_faster(self) -> None:
        from pythainlp.util.thai import _py_is_thai_char as py_is_thai_char

        sample = "ก"
        speedup = self._speedup(py_is_thai_char, fast_is_thai_char, sample)
        self.assertGreater(
            speedup,
            1.2,
            f"is_thai_char speedup {speedup:.1f}x is less than 1.2x",
        )
These tests assert a minimum speedup (1.2×) in a unit-test suite. Performance assertions tend to be flaky across CI runners / CPU governors / debug builds, and can fail even when the optimization is correct (especially for short inputs or noisy environments). Consider moving speed checks to a benchmark-only script (like scripts/bench_full_evidence.py), or relaxing them to non-failing diagnostics (e.g., log/skip when below threshold) so correctness tests remain stable.
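A non-failing diagnostic along the lines suggested here could be sketched as follows. The helper names `measure_speedup` and `report_speedup` are hypothetical, and the two functions being compared are stand-ins chosen only so the example runs without the compiled extension.

```python
import timeit


def measure_speedup(baseline, candidate, arg, number: int = 2000) -> float:
    """Return baseline_time / candidate_time for ``number`` calls each."""
    base_time = timeit.timeit(lambda: baseline(arg), number=number)
    cand_time = timeit.timeit(lambda: candidate(arg), number=number)
    return base_time / cand_time if cand_time > 0 else float("inf")


def report_speedup(name: str, speedup: float, threshold: float = 1.2) -> str:
    """Format a diagnostic line; being below threshold is not an error."""
    status = "ok" if speedup >= threshold else "below threshold, not a failure"
    return f"{name}: {speedup:.2f}x ({status})"


# Stand-in pair: an explicit Python loop versus the built-in str.count.
def count_a_loop(text: str) -> int:
    total = 0
    for ch in text:
        if ch == "a":
            total += 1
    return total


def count_a_builtin(text: str) -> int:
    return text.count("a")


speedup = measure_speedup(count_a_loop, count_a_builtin, "banana" * 50)
line = report_speedup("count_a", speedup)
```

A test could log `line` or skip when the ratio is low, which keeps noisy runners from turning a timing wobble into a red build.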
It looks interesting. I'm thinking of doing this for the newmm tokenizer too. What do you think? @bact
A valuable contribution. Thanks @chanitnan0jr

Reading the reviews from Copilot, I think we need to be explicit about the expected input and output, so the Cython implementation can follow the expectation: for example, how to deal with a wrong type or a wrong range. Either raise TypeError/ValueError, or try to resolve it in a sensible and consistent way (return 0, None, an empty string, an empty list, etc.). This is a design choice.

This also needs a lift in the publication workflow. If Cython is implemented, we may need to distribute binary wheels for each platform too. It needs some work, but once done it should just run (it will use more resources as well, so we have to plan). The multi-platform publication can be done with cibuildwheel.
The Cython is_thai_char returns False for empty/multi-character strings, but the pure-Python version raises TypeError (via ord()) for any input whose length != 1. Because thai.py auto-overrides to the Cython path, this was a user-visible API behavior change. Fix: import all three Cython functions under _fast_* aliases, then explicitly assign count_thai and is_thai as module-level overrides. For is_thai_char, wrap with a Python function that calls ord(ch) first — ord() raises TypeError with the same message as the original for any invalid-length input — then delegates to _fast_is_thai_char for valid single-character inputs. Ref: PyThaiNLP#1394 (comment)
Move the count_thai, is_thai, and is_thai_char overrides from the try block into the else clause so that the assignments only execute when the import succeeds, which is the idiomatic Python pattern for this structure. No behavior change; purely a structural improvement per maintainer review at: PyThaiNLP#1394 (comment)
Replace PEP 604 union syntax (Callable[..., object] | None) with Optional[Callable[..., object]] from typing, which is supported from Python 3.9 — the project minimum version for PyThaiNLP 5.x. Ref: PyThaiNLP#1394 (comment)
…-only usage
The module docstring said functions are "loaded as transparent replacements when the Cython extension is available", but normalize.py intentionally does not auto-load Cython overrides; callers must import them directly. Update to: "can be used as faster drop-in replacements when explicitly imported." Ref: PyThaiNLP#1394 (comment)
FastFunctionPerformanceTest asserted a minimum speedup (1.2x) which fails on CI runners, CPU governors, or debug builds even when the optimization is correct. Correctness is already covered by existing test_*_matches_python tests in FastThaiCorrectnessTest and FastNormalizeCorrectnessTest. Remove the entire FastFunctionPerformanceTest class and the timeit import. Performance evidence is reproducible via the dedicated scripts/bench_full_evidence.py benchmark script. Ref: PyThaiNLP#1394 (comment)
Rename the inline wrapper def to _is_thai_char_fast (a new name), then assign is_thai_char = _is_thai_char_fast with noqa: F811. The def itself no longer shadows the earlier function definition, eliminating the "function already defined" lint error.
Thank you for the thorough review; the feedback genuinely improved the PR. Here's a summary of what was updated:

Backward compatibility
is_thai_char now preserves the original TypeError behavior (empty string / multi-character input) via a Python wrapper in thai.py that performs ord(ch) validation before calling the Cython path. Error message and exception type are identical to the pure-Python version. (38f289e5)

Test stability
Removed the 1.2× speedup assertions; they were flaky across CI runners and CPU governors. Correctness is still strictly verified against the pure-Python baseline. Performance evidence lives in scripts/bench_full_evidence.py where it belongs. (1fa09706)

Compatibility & docs
Replaced PEP 604 | union syntax with typing.Optional in the benchmark script for Python 3.9 support. (1d4e811e)

@wannaphong I agree that Cython optimizations for newmm are the logical next step. I'm looking forward to exploring that in a follow-up PR after this foundational work is merged.

@bact I fully agree that cibuildwheel is the right path forward for binary distribution. Keeping the build optional for now ensures a stable transition for the current workflow, but I'm ready to assist with the automated publishing pipeline whenever the team is ready to move in that direction.

PYTHONPATH=. python3 scripts/bench_full_evidence.py)
Environment
typing.Callable is deprecated since Python 3.9 (PEP 585). Split into collections.abc.Callable (for the type) and typing.Optional (still needed for Optional[...] syntax on Python 3.9). Ref: https://peps.python.org/pep-0585/
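The import split described in this commit looks roughly like the following on Python 3.9. The function `call_or_default` is a hypothetical example written for this illustration; it is not taken from the benchmark script.

```python
from collections.abc import Callable  # PEP 585: preferred home for Callable
from typing import Optional  # still needed on 3.9, where "X | None" is unavailable


def call_or_default(
    func: Optional[Callable[..., object]] = None, *args: object
) -> object:
    """Call ``func`` with ``args`` if given, otherwise return None."""
    if func is None:
        return None
    return func(*args)


length = call_or_default(len, "abc")
nothing = call_or_default()
```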
F811 (Redefinition of unused name from import) only fires when a name from an import statement is redefined. The assignments in the else block (count_thai = _fast_count_thai, etc.) redefine names originally created by def statements, not imports — so F811 never fires on these lines. The unused noqa directives were triggering RUF100 (unused noqa directive), causing the ruff CI check to fail.
Three additional fixes pushed:
Regarding the remaining CI test failures ( The root cause appears to be a |
- Add `build_wheels` job in `pypi-publish.yml` to build OS matrices over cibuildwheel.
- Split `twine check` validation logic correctly across platforms using multi-line.
- Downgrade GitHub Action versions to safe latest variables to correct CI errors.
- Document and establish `pyproject.toml` parameters for Linux, macOS, and Windows.
- Condense the wheel test command cross-compatible for PowerShell.
- Pin `actions/checkout`, `actions/setup-python`, `actions/upload-artifact`, `actions/download-artifact`
- Pin `pypa/cibuildwheel` and `pypa/gh-action-pypi-publish`
- Resolves Security Hotspot githubactions:S7637 by preventing unverified mutable tag attacks.
- Keep readable version tags as inline comments for maintainability.
…tch-cython from compiling pure Python scripts
Update: Correction on the CI failures

I need to correct my previous comment and apologize for the confusion. It turns out the test failures weren't a pre-existing bug after all, but rather a configuration issue with the new build setup in this PR.

What happened:

The fix:

I have verified the fix locally and the tests are passing 100%. The CI should be fully green now alongside the wheel builds.

Again, I sincerely apologize for jumping to conclusions earlier about the bug. Thank you for your patience while I sorted this out.

cc @bact @wannaphong, ready for your review whenever you have time. Thanks!
…parameters to optimize NLP tokenizer loops



Description
This PR adds optional Cython extensions for three hot-path utility
functions, achieving 2.2x–11.0x speedups while maintaining full
backward compatibility.
When Cython is unavailable, the library falls back to the existing
pure-Python implementations automatically.
Changes
New files
- pythainlp/_ext/_thai_fast.pyx: is_thai_char, is_thai, count_thai (auto-loaded)
- pythainlp/_ext/_normalize_fast.pyx: remove_tonemark, remove_dup_spaces (reference only, not auto-loaded; see rationale below)
- pythainlp/_ext/*.pyi
- pythainlp/_ext/__init__.py
- tests/noauto_cython/testn_fast_functions.py
- scripts/bench_full_evidence.py
Modified files
- pyproject.toml
- pythainlp/util/thai.py: is_thai_char, is_thai, count_thai
- pythainlp/util/normalize.py: _py_* references saved before any future override
- tests/noauto_cython/__init__.py
- .gitignore: pythainlp/**/*.c (Cython-generated C files)
Motivation
is_thai_char, is_thai, and count_thai are called millions of times
in text-processing pipelines. cProfile shows count_thai alone accounts
for 99.97% of cumulative time (127.3 s / 127.4 s) in
character-counting workloads, making it the primary bottleneck.
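For orientation, a per-character loop of the kind being optimized might look like the sketch below. It is an approximation written for this illustration, not the library source: the name `count_thai_py`, the default ignore set, and the 0x0E00 to 0x0E7F range are all assumptions.

```python
import string

# Assumed default: skip whitespace, digits, and ASCII punctuation.
_IGNORE_CHARS = string.whitespace + string.digits + string.punctuation


def count_thai_py(text: str, ignore_chars: str = _IGNORE_CHARS) -> float:
    """Return the percentage of Thai characters among non-ignored ones."""
    if not text:
        return 0.0
    num_thai = 0
    num_ignored = 0
    for ch in text:
        # Each iteration does a Python-level membership test and an
        # ord() call, which is where interpreter overhead accumulates.
        if ch in ignore_chars:
            num_ignored += 1
        elif 0x0E00 <= ord(ch) <= 0x0E7F:
            num_thai += 1
    counted = len(text) - num_ignored
    if counted == 0:
        return 0.0
    return num_thai / counted * 100
```

Every character passes through at least one interpreted comparison, so the cost grows linearly with input size at Python speed.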
Methodology
_thai_fast.pyx: what makes it faster
- ord(ch) → Python int comparison is replaced by Py_UCS4 → C-level uint32_t comparison
- for loop compiled to machine code
- ch in ignore_chars (Python str) is replaced by c in ignore_chars (Cython C-level)
- Py_ssize_t throughout
- boundscheck=False, wraparound=False
Why _normalize_fast.pyx is not auto-loaded
Python's str.replace() operates on CPython's internal string buffer
in pure C without any encode/decode round-trip. The Cython
implementation encodes to UTF-8, filters at the byte level, then
decodes, adding overhead that varies significantly with tone-mark
density. Rather than shipping a function that is faster on some inputs
and slower on others, we keep it in _ext for direct import and leave
the decision to callers who have profiled their specific workloads.
The _py_* references in normalize.py ensure the pure-Python
baseline remains importable alongside any future override.
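A caller who wants to profile their own workload could start from something like this sketch. It times a `str.replace`-based baseline on inputs with and without tone marks; the function names are hypothetical, the tone-mark set U+0E48 to U+0E4B is an assumption, and no timing result is implied.

```python
import timeit

# Assumed tone-mark set: mai ek, mai tho, mai tri, mai chattawa.
TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"


def remove_tonemark_replace(text: str) -> str:
    """Baseline: one str.replace pass per tone mark."""
    for mark in TONE_MARKS:
        text = text.replace(mark, "")
    return text


def make_sample(n_syllables: int, with_tone: bool) -> str:
    """Build a synthetic input with maximal or zero tone-mark density."""
    syllable = "\u0e19\u0e49\u0e33" if with_tone else "\u0e19\u0e33"
    return syllable * n_syllables


def time_on(text: str, number: int = 200) -> float:
    return timeit.timeit(lambda: remove_tonemark_replace(text), number=number)


timings = {
    "no tone marks": time_on(make_sample(1000, with_tone=False)),
    "dense tone marks": time_on(make_sample(1000, with_tone=True)),
}
```

Swapping a compiled candidate into `time_on` and comparing the two dictionaries on representative text is one way to decide whether the direct import is worth it for a given pipeline.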
Benchmark Evidence
Environment
Dataset
Real Thai prose constructed from Thai Wikipedia-style text:
Results
is_thai_char (single character check, 1 M calls):
is_thai (scales with text length):
count_thai (speedup scales with data size):
cProfile hotspot analysis
Before: count_thai (Python), 100 K calls × ~6 K chars:
count_thai consumes 99.97% of total time (127.33 s / 127.36 s).
After: count_thai (Cython), same workload:
Bottleneck completely eliminated. The Cython function is invisible to
cProfile: all 100 K calls and inner loops execute entirely in compiled C.
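A hotspot measurement of this kind can be reproduced on a small scale with the standard library. The sketch below is not the PR's scripts/bench_full_evidence.py; `count_thai_chars` is a simplified stand-in workload and `profile_calls` is a hypothetical helper.

```python
import cProfile
import io
import pstats


def count_thai_chars(text: str) -> int:
    """Stand-in workload: count characters in the assumed Thai range."""
    total = 0
    for ch in text:
        if 0x0E00 <= ord(ch) <= 0x0E7F:
            total += 1
    return total


def profile_calls(func, arg, calls: int = 200) -> str:
    """Profile repeated calls and return the top cumulative entries."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(calls):
        func(arg)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)
    return stream.getvalue()


report = profile_calls(count_thai_chars, "\u0e01" * 500)
```

Printing `report` shows the Python-level function near the top of the cumulative column, which is the pattern the "before" measurement describes.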
Testing
- python -m unittest tests.noauto_cython)
- pip install -e . --no-build-isolation succeeds in a clean venv
- PYTHONPATH=. python3 scripts/bench_full_evidence.py
Checklist
- --line-length=79)
- .pyi) provided for mypy strict mode
- optional = true: installations without a C compiler are unaffected
- _py_* references saved so Python baselines remain importable
FAQ Technical Considerations