
Migrate deepcut tokenizer from TensorFlow to built-in ONNX inference#1372

Open
Copilot wants to merge 2 commits into dev from copilot/remove-lekcut-insert-deepcut-onnx

Conversation

Contributor

Copilot AI commented Mar 29, 2026

What do these changes do

Replaces the external TensorFlow-based deepcut package dependency with a built-in ONNX inference engine. The DeepCut ONNX model (ported from LEKCut) is now bundled directly with PyThaiNLP, removing the need for TensorFlow and reducing dependencies.

What was wrong

The pythainlp.tokenize.deepcut module depended on the external deepcut package, which requires TensorFlow (~1–2 GB), making it a heavy optional dependency. The project already uses onnxruntime for other models (e.g., thai2rom, sefr_cut), so the TensorFlow dependency was unnecessary.

How this fixes it

  • pythainlp/corpus/deepcut.onnx (new): Bundles the DeepCut ONNX model (2.1 MB) ported from LEKCut directly in the package corpus directory.
  • pythainlp/corpus/default_db.json: Adds a deepcut_onnx entry so get_corpus_path() resolves the bundled model file.
  • pythainlp/tokenize/deepcut.py: Completely rewritten with direct ONNX inference using onnxruntime and numpy. Implements character/type feature encoding from the original DeepCut model. The segment() API is unchanged; custom_dict is kept for backward compatibility.
  • pyproject.toml: Removes deepcut (TF package) from noauto-tensorflow deps, full deps, and mypy overrides.
  • Tests: Deepcut test cases moved from tests/noauto_tensorflow/ to tests/noauto_onnx/, reflecting the new ONNX-only dependency.
  • CHANGELOG.md: Change documented under [Unreleased].
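The character/type feature encoding mentioned above can be sketched roughly as follows. This is an illustrative sketch only: the vocabulary tables, character-type rules, window size, and ONNX input names here are assumptions for the example, not the values actually shipped with PyThaiNLP.

```python
# Illustrative sketch of DeepCut-style character-window feature encoding.
# CHAR_INDEX, TYPE_INDEX, char_type_id rules, and the window size are all
# assumptions for illustration, not the shipped implementation.
import numpy as np

CHAR_INDEX = {c: i + 2 for i, c in enumerate("กขคงจฉชเแา ")}  # 0 = pad, 1 = unknown
TYPE_INDEX = {"consonant": 1, "vowel": 2, "space": 3, "other": 4}

def char_type_id(ch: str) -> int:
    """Map a character to a coarse type id (illustrative rules only)."""
    if ch == " ":
        return TYPE_INDEX["space"]
    if ch in "เแาิีุู":
        return TYPE_INDEX["vowel"]
    if "ก" <= ch <= "ฮ":
        return TYPE_INDEX["consonant"]
    return TYPE_INDEX["other"]

def encode(text: str, window: int = 10):
    """Build one (2*window+1)-wide feature row per character position."""
    char_ids = np.array([CHAR_INDEX.get(c, 1) for c in text], dtype=np.int64)
    type_ids = np.array([char_type_id(c) for c in text], dtype=np.int64)
    pad = np.zeros(window, dtype=np.int64)

    def windows(ids: np.ndarray) -> np.ndarray:
        # Pad both ends so every position gets a full context window.
        padded = np.concatenate([pad, ids, pad])
        return np.stack(
            [padded[i : i + 2 * window + 1] for i in range(len(text))]
        )

    return windows(char_ids), windows(type_ids)

# Inference would then feed these features to the bundled model, e.g.
# (input names are hypothetical):
#   sess = onnxruntime.InferenceSession(get_corpus_path("deepcut_onnx"))
#   probs = sess.run(None, {"char": char_feats, "type": type_feats})[0]
# and cut the text wherever the boundary probability exceeds a threshold.
```

The actual onnxruntime call is left in comments because the real input/output names belong to the bundled model; the runnable part shows only the sliding-window feature construction.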

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit tests


@wannaphong
Member

@copilot Try again

@wannaphong wannaphong added this to the 6.0 milestone Mar 29, 2026
- Bundle deepcut.onnx model (from LEKCut) in pythainlp/corpus/
- Add deepcut_onnx entry to default_db.json
- Rewrite pythainlp/tokenize/deepcut.py with ONNX inference;
  no longer depends on the TF-based deepcut package
- Move deepcut tests from noauto_tensorflow/ to noauto_onnx/
- Remove deepcut from pyproject.toml TF deps and mypy overrides
- Update CHANGELOG.md

Agent-Logs-Url: https://github.com/PyThaiNLP/pythainlp/sessions/ff10e57e-b105-4263-9f34-12b2e67cd4ab

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 29, 2026

@copilot Try again

Done in e6ca3fa. Here's what was implemented:

  • pythainlp/tokenize/deepcut.py: Completely rewritten with built-in ONNX inference (onnxruntime + numpy). No longer requires the TensorFlow-based deepcut package. Character/type feature encoding and model inference are all implemented directly.
  • pythainlp/corpus/deepcut.onnx: The DeepCut ONNX model (2.1 MB, ported from LEKCut) is now bundled with PyThaiNLP.
  • pythainlp/corpus/default_db.json: Added deepcut_onnx entry so get_corpus_path() resolves the bundled model.
  • pyproject.toml: Removed deepcut (TF package) from noauto-tensorflow deps, full deps, and mypy overrides.
  • Tests: Deepcut tests moved from tests/noauto_tensorflow/ to tests/noauto_onnx/.
  • CHANGELOG.md: Change documented under [Unreleased].

The segment() API is unchanged; custom_dict is kept for backward compatibility.
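Since segment() keeps its original contract (text in, list of tokens out), the last step of turning per-character boundary predictions into tokens can be illustrated like this. The helper name and the boundary convention (True at positions that start a new token) are illustrative assumptions, not the module's own code:

```python
# Illustrative helper: convert per-character boundary predictions into tokens.
# The convention assumed here (boundaries[i] is True when position i starts a
# new token) is for illustration only.
def boundaries_to_tokens(text, boundaries):
    tokens = []
    start = 0
    for i in range(1, len(text)):
        if boundaries[i]:            # a new token starts at position i
            tokens.append(text[start:i])
            start = i
    if text:
        tokens.append(text[start:])  # flush the final token
    return tokens

# "ตัดคำไทย" with boundaries at positions 0, 3, and 5
# segments into ตัด | คำ | ไทย.
print(boundaries_to_tokens(
    "ตัดคำไทย",
    [True, False, False, True, False, True, False, False],
))
```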

Copilot AI changed the title [WIP] Remove lekcut and insert deepcut onnx model to pythainlp Migrate deepcut tokenizer from TensorFlow to built-in ONNX inference Mar 29, 2026
@bact
Member

bact commented Mar 31, 2026

Alternatively, we can call the DeepcutTokenizer in nlpo3

@wannaphong
Member

Alternatively, we can call the DeepcutTokenizer in nlpo3

Yes, but I think it should be called from Python, because we already use onnxruntime in pythainlp.

@wannaphong wannaphong marked this pull request as ready for review April 5, 2026 14:50

Successfully merging this pull request may close these issues.

Remove lekcut and insert deepcut onnx model to pythainlp

3 participants