
Optimize: add Cython extensions for is_thai_char, is_thai, count_thai (2.2x–11.0x)#1394

Open
chanitnan0jr wants to merge 20 commits into PyThaiNLP:dev from chanitnan0jr:dev

Conversation

@chanitnan0jr

@chanitnan0jr chanitnan0jr commented Apr 2, 2026

Description

This PR adds optional Cython extensions for three hot-path utility
functions, achieving 2.2x–11.0x speedups while maintaining full
backward compatibility.

When Cython is unavailable, the library automatically falls back to the
existing pure-Python implementations.

Changes

New files

| File | Purpose |
|------|---------|
| pythainlp/_ext/_thai_fast.pyx | Cython: is_thai_char, is_thai, count_thai — auto-loaded |
| pythainlp/_ext/_normalize_fast.pyx | Cython: remove_tonemark, remove_dup_spaces — reference only, not auto-loaded (see rationale below) |
| pythainlp/_ext/*.pyi | Type stubs for mypy strict mode |
| pythainlp/_ext/__init__.py | Package marker |
| tests/noauto_cython/testn_fast_functions.py | 15 correctness + performance tests |
| scripts/bench_full_evidence.py | Reproducible benchmark script |

Modified files

| File | What changed |
|------|--------------|
| pyproject.toml | hatch-cython hook config, mypy overrides, setuptools flat-layout fix |
| pythainlp/util/thai.py | Fallback import for is_thai_char, is_thai, count_thai |
| pythainlp/util/normalize.py | _py_* references saved before any future override |
| tests/noauto_cython/__init__.py | Register new test module |
| .gitignore | Exclude pythainlp/**/*.c (Cython-generated C files) |

Motivation

is_thai_char, is_thai, and count_thai are called millions of times
in text-processing pipelines. cProfile shows count_thai alone accounts
for 99.97% of cumulative time (127.3 s / 127.4 s) in
character-counting workloads, making it the primary bottleneck.
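A hotspot measurement of this kind can be reproduced with the standard library's cProfile. This is an illustrative harness only: count_thai here is a simple stand-in for the library function, and the workload sizes are arbitrary.

```python
import cProfile
import pstats

def count_thai(text: str) -> int:
    # Stand-in: count characters in the Thai Unicode block.
    return sum(1 for ch in text if 0x0E00 <= ord(ch) <= 0x0E7F)

text = "สวัสดีครับ" * 100  # ~1,000-character Thai sample

profiler = cProfile.Profile()
profiler.enable()
for _ in range(1_000):
    count_thai(text)
profiler.disable()

# Sort by cumulative time to surface the dominant callee.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(3)
```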


Methodology

_thai_fast.pyx — what makes it faster

| Technique | Python (before) | Cython (after) |
|-----------|-----------------|----------------|
| Character check | ord(ch) → Python int comparison | Py_UCS4 → C-level uint32_t comparison |
| Loop overhead | Python bytecode interpreter | C for loop compiled to machine code |
| Membership test | ch in ignore_chars (Python str) | c in ignore_chars (Cython C-level) |
| Size types | — | Py_ssize_t throughout |
| Directives | — | boundscheck=False, wraparound=False |

Why _normalize_fast.pyx is not auto-loaded

Python's str.replace() operates on CPython's internal string buffer
in pure C without any encode/decode round-trip. The Cython
implementation encodes to UTF-8, filters at the byte level, then
decodes — adding overhead that varies significantly with tone-mark
density. Rather than shipping a function that is faster on some inputs
and slower on others, we keep it in _ext for direct import and leave
the decision to callers who have profiled their specific workloads.
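The trade-off can be illustrated with two simplified remove_tonemark strategies. This is a sketch under the assumption that only the four Thai tone marks (U+0E48–U+0E4B) are stripped; it is not the project's actual implementation.

```python
# Thai tone marks: mai ek, mai tho, mai tri, mai chattawa.
TONEMARKS = "\u0e48\u0e49\u0e4a\u0e4b"

def remove_tonemark_replace(text: str) -> str:
    # str.replace works on CPython's internal string buffer in C:
    # no encode/decode round-trip at all.
    for mark in TONEMARKS:
        text = text.replace(mark, "")
    return text

def remove_tonemark_bytes(text: str) -> str:
    # The byte-level strategy: encode to UTF-8, drop each tone mark's
    # 3-byte sequence, decode back. The round-trip adds overhead that
    # grows with text size regardless of tone-mark density.
    data = text.encode("utf-8")
    for mark in TONEMARKS:
        data = data.replace(mark.encode("utf-8"), b"")
    return data.decode("utf-8")
```

Both functions return identical results; only the path through memory differs, which is why the second can lose to str.replace on some inputs.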

The _py_* references in normalize.py ensure the pure-Python
baseline remains importable alongside any future override.


Benchmark Evidence

Environment

OS           : Linux 6.8.0-100-generic
Architecture : x86_64
CPU          : AMD Ryzen 5 5600H with Radeon Graphics
Python       : 3.12.3 (GCC 13.3.0)
pythainlp    : 5.3.3
Cython ext   : loaded (compiled)

Dataset

Real Thai prose constructed from Thai Wikipedia-style text:

| Scale | Size | Description |
|-------|------|-------------|
| Short | 10 chars | Single greeting |
| Medium | ~295 chars | Paragraph (5× sentence) |
| Long | ~12,800 chars | Article (50× paragraph) |
| Huge | ~128,000 chars | Corpus batch (500× paragraph) |

Results

is_thai_char — single character check, 1 M calls:

| Input | Python | Cython | Speedup |
|-------|--------|--------|---------|
| 1 M calls | 0.0899 s | 0.0418 s | 2.2× |

is_thai — scales with text length:

| Input | Python | Cython | Speedup |
|-------|--------|--------|---------|
| 10 chars (500 K calls) | 0.2561 s | 0.0432 s | 5.9× |
| ~310 chars (100 K calls) | 0.1328 s | 0.0187 s | 7.1× |
| ~6 K chars (10 K calls) | 0.0174 s | 0.0023 s | 7.6× |
| ~60 K chars (1 K calls) | 0.0017 s | 0.0002 s | 7.4× |

count_thai — speedup scales with data size:

| Input | Python | Cython | Speedup |
|-------|--------|--------|---------|
| 10 chars (500 K calls) | 0.3422 s | 0.0456 s | 7.5× |
| ~310 chars (50 K calls) | 0.7042 s | 0.0746 s | 9.4× |
| ~6 K chars (5 K calls) | 3.4278 s | 0.3378 s | 10.1× |
| ~60 K chars (500 calls) | 3.7015 s | 0.3358 s | 11.0× |

Key insight: Speedup scales with input length because Python
interpreter overhead grows linearly while the Cython C loop overhead
is constant.
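The scaling claim can be sanity-checked with timeit by holding the total number of characters constant while varying the input length. This is an illustrative harness using a pure-Python stand-in for count_thai; the sizes are arbitrary.

```python
import timeit

def count_thai(text: str) -> int:
    return sum(1 for ch in text if 0x0E00 <= ord(ch) <= 0x0E7F)

TOTAL = 1_000_000  # total characters processed at every scale

for n in (10, 1_000, 100_000):
    text = "ก" * n
    calls = TOTAL // n  # fewer calls for longer inputs
    t = timeit.timeit(lambda: count_thai(text), number=calls)
    # With per-call overhead amortized over longer inputs, the
    # time per character should shrink as n grows.
    print(f"{n:>7} chars x {calls:>6} calls: {t:.4f} s")
```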

cProfile hotspot analysis

Before: count_thai (Python), 100 K calls × ~6 K chars:

300,001 function calls in 127.360 seconds

ncalls  tottime  percall  cumtime  percall  filename
100000  127.330  0.001    127.360  0.001    thai.py:181(count_thai)
100000    0.018  0.000      0.018  0.000    builtins.len
100000    0.012  0.000      0.012  0.000    builtins.isinstance

count_thai consumes 99.97% of total time (127.33 s / 127.36 s).

After: count_thai (Cython), same workload:

1 function call in 0.000 seconds

Bottleneck completely eliminated. The Cython function is invisible to
cProfile — all 100 K calls and inner loops execute entirely in compiled C.


Testing

  • 15/15 Cython-specific tests pass (python -m unittest tests.noauto_cython)
  • 209 core tests pass, zero regressions
  • pip install -e . --no-build-isolation succeeds in a clean venv
  • Graceful fallback verified (all functions work without Cython)
  • Benchmark reproducible: PYTHONPATH=. python3 scripts/bench_full_evidence.py

Checklist

  • PEP 8, Black (--line-length=79)
  • Meaningful identifier names
  • All source files end with one empty line
  • ruff passes with zero errors
  • Type stubs (.pyi) provided for mypy strict mode
  • optional = true — installations without a C compiler are unaffected
  • _py_* references saved so Python baselines remain importable

FAQ Technical Considerations

  • Why Cython rather than doing this in nlpo3 (Rust)?
    • The character-utility functions are fundamental primitives, so they should live alongside the core library, sparing ordinary users the burden of managing an external optional dependency.
    • Low maintenance barrier: Cython (.pyx) code is structurally very close to Python, so most maintainers and contributors can read and maintain it more easily than if we switched to Rust.
    • FFI overhead reduction: these functions are small but called very frequently (high-frequency, low-logic). Cython runs on the same CPython runtime, so it avoids the cross-language (FFI) call overhead of calling into an external library.
    • Domain focus: nlpo3 currently concentrates on large-scale tokenization, while this addition targets character-level utilities that nlpo3 does not yet cover.
  • Will adding a C extension break pip install on Windows?
    • Graceful fallback: no, because we set optional = true in pyproject.toml (via the hatch-cython build hook).
    • Zero-risk installation: if the user's environment has no C compiler or build tooling, the system automatically falls back to the pure-Python implementation. Users are unaffected, apart from getting the original speed.
  • Does this increase the maintenance burden?
    • Mirror implementation: the Cython code is written as a 1:1 mirror of the Python version, which makes correctness review straightforward.
    • Type safety: .pyi type stubs are provided to support static type checking (mypy/pyright) and so IDEs keep displaying signatures correctly.
    • Unified testing: a single test suite verifies both versions, ensuring results never diverge from the original (zero regressions).
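The unified-testing idea above can be sketched as a cross-check that runs both implementations over shared inputs. Everything here is illustrative: in the real suite the second function would be the compiled Cython build, while this sketch reuses the baseline so it stays self-contained.

```python
def py_count_thai(text: str) -> int:
    # Pure-Python baseline (stand-in for the library function).
    return sum(1 for ch in text if 0x0E00 <= ord(ch) <= 0x0E7F)

# Hypothetical compiled counterpart; the baseline is reused here so
# the sketch runs without the extension.
cy_count_thai = py_count_thai

def check_mirror() -> bool:
    # Both versions must agree on every case, including edge cases.
    cases = ["", "สวัสดี", "hello", "ไทย123", "ก" * 1000]
    for text in cases:
        assert py_count_thai(text) == cy_count_thai(text), repr(text)
    return True
```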

Provide compiled C extensions for is_thai_char, is_thai, count_thai
(pythainlp._ext._thai_fast) and remove_tonemark
(pythainlp._ext._normalize_fast).  The extensions are loaded at import
time with a pure-Python fallback when the compiled modules are absent,
so the change is backward-compatible and does not affect builds without
a C compiler.

The remove_tonemark implementation filters tone marks directly in
UTF-8 byte space using typed memory views, avoiding per-character
Python object allocation.  Benchmarks on CPython 3.12 show speedups of
2.2x (is_thai_char), 6.8x (is_thai), 10.4x (count_thai), and 1.6x
(remove_tonemark) over the corresponding pure-Python implementations.
- Remove unused imports: os, statistics
- Replace bare except Exception: pass with except OSError in _get_cpu_model
- Annotate func_py, func_cy, func as Callable[..., object] instead of object
  to satisfy type checker and remove # type: ignore[operator] suppressions
- Remove unused label parameter from profile_function and its call sites
Extract the repeated load_tests function body into a shared factory
make_load_tests() in tests/_noauto_loader.py. Each noauto __init__.py
now declares only its test_packages list and calls make_load_tests().

Eliminates 5 identical 15-line blocks (noauto_cython, noauto_network,
noauto_onnx, noauto_tensorflow, noauto_torch).
@bact bact added the enhancement enhance functionalities label Apr 2, 2026
@bact bact added this to the 6.0 milestone Apr 2, 2026
@bact bact added this to PyThaiNLP Apr 2, 2026
@bact bact moved this to In progress in PyThaiNLP Apr 2, 2026
@bact bact requested a review from Copilot April 2, 2026 12:36
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces optional Cython extensions for hot-path Thai utility functions to improve performance while keeping pure-Python fallbacks, and adds supporting tests/benchmarks plus minor test-suite refactoring.

Changes:

  • Add Cython extensions (and type stubs) for is_thai_char, is_thai, count_thai, with runtime fallback wiring in pythainlp.util.thai.
  • Add Cython-focused correctness/performance tests and a reproducible benchmark + cProfile evidence script.
  • Deduplicate noauto_* unittest load_tests boilerplate via a shared loader factory.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
|------|-------------|
| pythainlp/_ext/_thai_fast.pyx | New Cython implementations for Thai character utilities (auto-loaded). |
| pythainlp/_ext/_thai_fast.pyi | Type stubs for the _thai_fast extension. |
| pythainlp/_ext/_normalize_fast.pyx | New Cython normalization functions (intended for direct import/testing). |
| pythainlp/_ext/_normalize_fast.pyi | Type stubs for the _normalize_fast extension. |
| pythainlp/_ext/__init__.py | Marks _ext as the optional extensions package. |
| pythainlp/util/thai.py | Keeps _py_* references and conditionally overrides with Cython fast-paths. |
| pythainlp/util/normalize.py | Simplifies remove_tonemark loop and preserves _py_* references (no auto-override). |
| tests/noauto_cython/testn_fast_functions.py | New correctness + performance assertions for the fast functions. |
| tests/noauto_cython/__init__.py | Registers the new noauto-cython test module via shared loader. |
| tests/_noauto_loader.py | New shared make_load_tests() factory for noauto suites. |
| tests/noauto_torch/__init__.py | Switches to shared noauto loader. |
| tests/noauto_tensorflow/__init__.py | Switches to shared noauto loader. |
| tests/noauto_onnx/__init__.py | Switches to shared noauto loader. |
| tests/noauto_network/__init__.py | Switches to shared noauto loader. |
| scripts/bench_full_evidence.py | New end-to-end benchmark + cProfile evidence script. |
| pyproject.toml | Adds hatch-cython/Cython build configuration and mypy override for _ext. |
| .gitignore | Ignores generated C files under pythainlp/**. |

Comment on lines +33 to +39
    Unlike the pure-Python implementation (which raises ``TypeError``
    for empty or multi-character strings via ``ord()``), this
    implementation returns ``False`` for any input whose length is
    not exactly 1.
    """
    if len(ch) != 1:
        return False

Copilot AI Apr 2, 2026


is_thai_char() no longer matches the pure-Python behavior for invalid input lengths. The Python implementation calls ord(ch) and raises TypeError for ch == "" or multi-character strings, but this Cython version returns False when len(ch) != 1. Because pythainlp.util.thai auto-overrides to the Cython implementation when available, this becomes a user-visible API behavior change. Consider raising TypeError with the same message as ord() when len(ch) != 1 (slow path), so the optimized path stays compatible for valid inputs.

Suggested change
    Unlike the pure-Python implementation (which raises ``TypeError``
    for empty or multi-character strings via ``ord()``), this
    implementation returns ``False`` for any input whose length is
    not exactly 1.
    """
    if len(ch) != 1:
        return False
    This function mirrors the pure-Python implementation: a
    :class:`TypeError` is raised (with the same message as
    :func:`ord`) if ``ch`` is an empty string or has length other
    than 1.
    """
    if len(ch) != 1:
        raise TypeError(
            "ord() expected a character, but string of length %d found"
            % len(ch)
        )

Comment on lines +424 to +429
        count_thai,  # noqa: F811
        is_thai,  # noqa: F811
        is_thai_char,  # noqa: F811
    )
except ImportError:
    pass

Copilot AI Apr 2, 2026


This override makes the Cython implementation the public is_thai_char/is_thai/count_thai at runtime. Since _thai_fast.is_thai_char() currently returns False for empty/multi-character strings (instead of raising TypeError like the pure-Python ord(ch) implementation), importing the extension changes user-visible behavior depending on whether the extension is built. To keep backward compatibility, either keep a small Python wrapper that preserves the original exception behavior, or update the Cython function to raise the same TypeError as ord() for invalid lengths.

Suggested change
        count_thai,  # noqa: F811
        is_thai,  # noqa: F811
        is_thai_char,  # noqa: F811
    )
except ImportError:
    pass
        count_thai as _fast_count_thai,
        is_thai as _fast_is_thai,
        is_thai_char as _fast_is_thai_char,
    )
except ImportError:
    pass
else:
    # Use fast implementations for is_thai and count_thai directly.
    # For is_thai_char, wrap the fast implementation so that we preserve
    # the original TypeError behavior of the pure-Python version (which
    # relies on ord(ch)) for invalid input lengths.
    count_thai = _fast_count_thai  # noqa: F811
    is_thai = _fast_is_thai  # noqa: F811

    def is_thai_char(ch: str) -> bool:  # noqa: F811
        # ord(ch) will raise the same TypeError as the original
        # implementation for empty strings or strings of length != 1,
        # preserving behavior.
        _ = ord(ch)
        return _fast_is_thai_char(ch)

Comment on lines +10 to +11
pythainlp.util.normalize and are loaded as transparent replacements when the
Cython extension is available.

Copilot AI Apr 2, 2026


The module docstring says these normalization functions “are loaded as transparent replacements when the Cython extension is available”, but pythainlp.util.normalize explicitly does not auto-load Cython overrides (it only keeps _py_* references). Please update the docstring to match the actual behavior to avoid misleading users.

Suggested change
pythainlp.util.normalize and are loaded as transparent replacements when the
Cython extension is available.
pythainlp.util.normalize and can be used as faster drop-in replacements
when explicitly imported.

Comment on lines +223 to +237
    def _speedup(self, py_func, cy_func, arg: str, n: int = 5000) -> float:
        py_time = timeit.timeit(lambda: py_func(arg), number=n)
        cy_time = timeit.timeit(lambda: cy_func(arg), number=n)
        return py_time / cy_time

    def test_is_thai_char_faster(self) -> None:
        from pythainlp.util.thai import _py_is_thai_char as py_is_thai_char

        sample = "ก"
        speedup = self._speedup(py_is_thai_char, fast_is_thai_char, sample)
        self.assertGreater(
            speedup,
            1.2,
            f"is_thai_char speedup {speedup:.1f}x is less than 1.2x",
        )

Copilot AI Apr 2, 2026


These tests assert a minimum speedup (1.2×) in a unit-test suite. Performance assertions tend to be flaky across CI runners / CPU governors / debug builds, and can fail even when the optimization is correct (especially for short inputs or noisy environments). Consider moving speed checks to a benchmark-only script (like scripts/bench_full_evidence.py), or relaxing them to non-failing diagnostics (e.g., log/skip when below threshold) so correctness tests remain stable.

@wannaphong
Member

It looks interesting. I'm thinking of doing this for the newmm tokenizer too. What do you think? @bact

@bact
Member

bact commented Apr 2, 2026

It looks interesting. I'm thinking of doing this for the newmm tokenizer too. What do you think? @bact

A valuable contribution. Thanks @chanitnan0jr

Reading the reviews from Copilot, I think we need to be explicit about the expected input and output -- so the Cython implementation can follow the expectation.

Like how to deal with a wrong type or a wrong range: raise TypeError/ValueError, or try to resolve it in a sensible and consistent way (treat it as 0 or None or an empty string or an empty list, etc.).

This is a design choice.

This also needs a lift in the publication workflow.
Currently, our pypi-publish.yml only publishes a regular (non-binary) wheel:
https://pypi.org/project/pythainlp/#files

If Cython is implemented, we may need to distribute binary wheels for each platform too. That needs some work, but once done it should just run (it will use more resources as well, so we have to plan). The multi-platform publication can be done by cibuildwheel.
See the example from nlpo3-python here: https://github.com/PyThaiNLP/nlpo3/blob/main/.github/workflows/build-python-wheels.yml

The Cython is_thai_char returns False for empty/multi-character strings,
but the pure-Python version raises TypeError (via ord()) for any input
whose length != 1. Because thai.py auto-overrides to the Cython path,
this was a user-visible API behavior change.

Fix: import all three Cython functions under _fast_* aliases, then
explicitly assign count_thai and is_thai as module-level overrides.
For is_thai_char, wrap with a Python function that calls ord(ch) first —
ord() raises TypeError with the same message as the original for any
invalid-length input — then delegates to _fast_is_thai_char for valid
single-character inputs.

Ref: PyThaiNLP#1394 (comment)
Move the count_thai, is_thai, and is_thai_char overrides from the try
block into the else clause so that the assignments only execute when the
import succeeds, which is the idiomatic Python pattern for this structure.

No behavior change; purely a structural improvement per maintainer
review at:
PyThaiNLP#1394 (comment)
Replace PEP 604 union syntax (Callable[..., object] | None) with
Optional[Callable[..., object]] from typing, which is supported from
Python 3.9 — the project minimum version for PyThaiNLP 5.x.

Ref: PyThaiNLP#1394 (comment)
…-only usage

The module docstring said functions are "loaded as transparent replacements
when the Cython extension is available", but normalize.py intentionally does
not auto-load Cython overrides; callers must import them directly.

Update to: "can be used as faster drop-in replacements when explicitly imported."

Ref: PyThaiNLP#1394 (comment)
FastFunctionPerformanceTest asserted a minimum speedup (1.2x), which can
fail on CI runners, under varying CPU governors, or on debug builds even
when the optimization is correct. Correctness is already covered by the
existing test_*_matches_python tests in FastThaiCorrectnessTest and
FastNormalizeCorrectnessTest.

Remove the entire FastFunctionPerformanceTest class and the timeit
import. Performance evidence is reproducible via the dedicated
scripts/bench_full_evidence.py benchmark script.

Ref: PyThaiNLP#1394 (comment)
Rename the inline wrapper def to _is_thai_char_fast (a new name),
then assign is_thai_char = _is_thai_char_fast with noqa: F811.
The def itself no longer shadows the earlier function definition,
eliminating the "function already defined" lint error.
@chanitnan0jr
Author

Thank you for the thorough review; the feedback genuinely improved the PR. Here's a summary of what was updated:

Backward compatibility

is_thai_char now preserves the original TypeError behavior (empty string / multi-character input) via a Python wrapper in thai.py that runs ord(ch) validation before delegating to the Cython path. The error message and exception type are identical to the pure-Python version. (38f289e5)
Refactored the Cython override block to use try/except/else for a cleaner and more explicit structure. (844643b0)

Test stability

Removed the 1.2× speedup assertions — they were flaky across CI runners and CPU governors. Correctness is still strictly verified against the pure-Python baseline. Performance evidence lives in scripts/bench_full_evidence.py where it belongs. (1fa09706)

Compatibility & docs

Replaced PEP 604 | union syntax with typing.Optional in the benchmark script for Python 3.9 support. (1d4e811e)
Corrected the _normalize_fast.pyx docstring to reflect that it is for explicit import only. (7e92cb70)
On future work

@wannaphong I agree that Cython optimizations for newmm are the logical next step. I'm looking forward to exploring that in a follow-up PR after this foundational work is merged.

@bact I fully agree that cibuildwheel is the right path forward for binary distribution. Keeping the build optional for now ensures a stable transition for the current workflow, and I'm ready to assist with the automated publishing pipeline whenever the team is ready to move in that direction.

Benchmark Evidence (reproducible via PYTHONPATH=. python3 scripts/bench_full_evidence.py)

Environment
OS : Linux 6.8.0-100-generic (x86_64)
CPU : AMD Ryzen 5 5600H with Radeon Graphics
Python : 3.12.3 (GCC 13.3.0)
pythainlp : 5.3.3
Cython ext : loaded (compiled)

is_thai_char — 1 M calls, single character:

| Python (s) | Cython (s) | Speedup |
|------------|------------|---------|
| 0.0729 | 0.0390 | 1.9× |

is_thai — scales with input length:

| Input | Python (s) | Cython (s) | Speedup |
|-------|------------|------------|---------|
| 10 chars (500 K calls) | 0.1884 | 0.0446 | 4.2× |
| ~310 chars (100 K calls) | 0.1057 | 0.0183 | 5.8× |
| ~12.8 K chars (10 K calls) | 0.0147 | 0.0023 | 6.5× |
| ~128 K chars (1 K calls) | 0.0014 | 0.0002 | 6.2× |

count_thai — speedup scales with data size:

| Input | Python (s) | Cython (s) | Speedup |
|-------|------------|------------|---------|
| 10 chars (500 K calls) | 0.2235 | 0.0481 | 4.6× |
| ~310 chars (50 K calls) | 0.5659 | 0.0725 | 7.8× |
| ~12.8 K chars (5 K calls) | 2.8116 | 0.3190 | 8.8× |
| ~128 K chars (500 calls) | 2.8303 | 0.3096 | 9.1× |

remove_tonemark — not auto-loaded (by design):

| Input | Python (s) | Cython (s) | Speedup |
|-------|------------|------------|---------|
| 25 chars | 0.1785 | 0.4536 | 0.4× |
| ~7.6 K chars | 0.0897 | 0.2321 | 0.4× |

remove_tonemark is intentionally excluded from auto-loading. CPython's str.replace() operates directly on the internal string buffer in C with no encode/decode round-trip, while the Cython version adds UTF-8 encoding overhead — making it slower on this workload. It remains available for explicit import by callers who have profiled their specific use case.

typing.Callable is deprecated since Python 3.9 (PEP 585).
Split into collections.abc.Callable (for the type) and typing.Optional
(still needed for Optional[...] syntax on Python 3.9).

Ref: https://peps.python.org/pep-0585/
F811 (Redefinition of unused name from import) only fires when a name
from an import statement is redefined. The assignments in the else block
(count_thai = _fast_count_thai, etc.) redefine names originally created
by def statements, not imports — so F811 never fires on these lines.

The unused noqa directives were triggering RUF100 (unused noqa
directive), causing the ruff CI check to fail.
@chanitnan0jr
Author

Three additional fixes pushed:

| Commit | Fix | File | Detail |
|--------|-----|------|--------|
| 6650c391 | PEP 585 | bench_full_evidence.py | Replaced from typing import Callable with from collections.abc import Callable |
| d2711e0b | ruff RUF100 | pythainlp/util/thai.py | Removed # noqa: F811 from assignment lines in else block (F811 only applies to imports, not assignments) |
| 4b2b7c88 | isort + mypy | thai.py, _noauto_loader.py | Split grouped _thai_fast import into three lines; added return type to make_load_tests |

Regarding the remaining CI test failures (test_tag, test_cli, testc_parse): I have confirmed that these are pre-existing issues and were not introduced by this PR. These same failures are present on commit 0207d400, the point where this branch diverged.

The root cause appears to be a TypeError: Expected dict, got collections.defaultdict in _tag_perceptron.py, which remains outside the scope of this performance optimization PR.

- Add `build_wheels` job in `pypi-publish.yml` to build OS matrices over cibuildwheel.
- Split `twine check` validation logic correctly across platforms using multi-line.
- Downgrade GitHub Action versions to safe latest variables to correct CI errors.
- Document and establish `pyproject.toml` parameters for Linux, macOS, and Windows.
- Condense the wheel test command cross-compatible for PowerShell.
- Pin `actions/checkout`, `actions/setup-python`, `actions/upload-artifact`, `actions/download-artifact`
- Pin `pypa/cibuildwheel` and `pypa/gh-action-pypi-publish`
- Resolves Security Hotspot githubactions:S7637 by preventing unverified mutable tag attacks.
- Keep readable version tags as inline comments for maintainability.
…tch-cython from compiling pure Python scripts
@chanitnan0jr
Author

Update: Correction on the CI failures

I need to correct my previous comment and apologize for the confusion. It turns out the test failures weren't a pre-existing bug after all, but rather a configuration issue with the new build setup in this PR.

What happened:
The hatch-cython plugin was accidentally compiling all pure Python (.py) files into C extensions. Cython is very strict about type hints, so when it compiled the existing code and encountered a defaultdict where a dict was expected (in _tag_perceptron.py), it threw a TypeError.

The fix:
The compile_py = false flag in pyproject.toml was placed in the wrong section. I've moved it to the correct [tool.hatch.build.hooks.cython.options] block. Now it only targets the intended .pyx files.

I have verified the fix locally and the tests are passing 100%. The CI should be fully green now alongside the wheel builds.

Again, I sincerely apologize for jumping to conclusions earlier about the bug. Thank you for your patience while I sorted this out.

cc @bact @wannaphong, ready for your review whenever you have time. Thanks!

@coveralls

coveralls commented Apr 5, 2026

Coverage Status

coverage: 66.656% (+0.02%) from 66.633% — chanitnan0jr:dev into PyThaiNLP:dev

@sonarqubecloud

sonarqubecloud bot commented Apr 6, 2026


Labels

enhancement enhance functionalities

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

5 participants