Migrate deepcut tokenizer from TensorFlow to built-in ONNX inference #1372
Open
Conversation
Member
@copilot Try again
- Bundle deepcut.onnx model (from LEKCut) in pythainlp/corpus/
- Add deepcut_onnx entry to default_db.json
- Rewrite pythainlp/tokenize/deepcut.py with ONNX inference; no longer depends on the TF-based deepcut package
- Move deepcut tests from noauto_tensorflow/ to noauto_onnx/
- Remove deepcut from pyproject.toml TF deps and mypy overrides
- Update CHANGELOG.md

Agent-Logs-Url: https://github.com/PyThaiNLP/pythainlp/sessions/ff10e57e-b105-4263-9f34-12b2e67cd4ab
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
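For context, the user-facing call path this migration targets is unchanged; a minimal usage sketch, assuming the existing word_tokenize API and the "deepcut" engine name that this PR keeps:

```python
# Minimal usage sketch: the public tokenizer API stays the same, only the
# backend moves from the TF-based deepcut package to bundled ONNX inference.
# (Illustrative only -- confirm behaviour against the rewritten module.)
from pythainlp.tokenize import word_tokenize

text = "ทดสอบการตัดคำภาษาไทย"  # "test Thai word segmentation"
tokens = word_tokenize(text, engine="deepcut")
print(tokens)
```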
Contributor
Author
Done in e6ca3fa. Here's what was implemented:
Copilot AI changed the title from [WIP] Remove lekcut and insert deepcut onnx model to pythainlp to Migrate deepcut tokenizer from TensorFlow to built-in ONNX inference on Mar 29, 2026
Member
Alternatively, we can call the DeepcutTokenizer in nlpo3
Member
Yes, but I think it should be called from Python, because we already use onnxruntime in PyThaiNLP.
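To make the "call it from Python" option concrete: since PyThaiNLP already ships onnxruntime, loading the bundled model would look roughly like the sketch below. The deepcut_onnx corpus key is from this PR's description; the rest is an illustrative assumption.

```python
# Hedged sketch: load the bundled DeepCut ONNX model with onnxruntime,
# resolving its path through PyThaiNLP's corpus registry.
import onnxruntime as ort
from pythainlp.corpus import get_corpus_path

model_path = get_corpus_path("deepcut_onnx")  # entry added in default_db.json
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Input/output names depend on the exported model; inspect them at runtime.
print([i.name for i in session.get_inputs()], [o.name for o in session.get_outputs()])
```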



What do these changes do
Replaces the external TensorFlow-based deepcut package dependency with a built-in ONNX inference engine. The DeepCut ONNX model (ported from LEKCut) is now bundled directly with PyThaiNLP, removing the need for TensorFlow and reducing dependencies.

What was wrong
The pythainlp.tokenize.deepcut module depended on the external deepcut package, which requires TensorFlow (~1–2 GB), making it a heavy optional dependency. The project already uses onnxruntime for other models (e.g., thai2rom, sefr_cut), so the TensorFlow dependency was unnecessary.

How this fixes it
- pythainlp/corpus/deepcut.onnx (new): Bundles the DeepCut ONNX model (2.1 MB), ported from LEKCut, directly in the package corpus directory.
- pythainlp/corpus/default_db.json: Adds a deepcut_onnx entry so get_corpus_path() resolves the bundled model file.
- pythainlp/tokenize/deepcut.py: Completely rewritten with direct ONNX inference using onnxruntime and numpy. Implements the character/type feature encoding from the original DeepCut model. The segment() API is unchanged; custom_dict is kept for backward compatibility. (A rough sketch of this flow follows at the end of this description.)
- pyproject.toml: Removes deepcut (the TF package) from the noauto-tensorflow deps, the full deps, and the mypy overrides.
- Tests moved from tests/noauto_tensorflow/ to tests/noauto_onnx/, reflecting the new ONNX-only dependency.
- CHANGELOG.md: Change documented under [Unreleased].

Your checklist for this pull request
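As referenced in the list above, here is a rough sketch of the ONNX-backed segmentation flow: per-character features go into the model, which returns a probability that each character starts a new word, and a threshold yields the cut points. The feature layout, input shape, and helper name below are illustrative placeholders, not the actual implementation in pythainlp/tokenize/deepcut.py.

```python
# Illustrative sketch only: the real feature encoding (character IDs plus
# character-type IDs over a context window) lives in pythainlp/tokenize/deepcut.py.
import numpy as np
import onnxruntime as ort
from pythainlp.corpus import get_corpus_path

_session = ort.InferenceSession(
    get_corpus_path("deepcut_onnx"),
    providers=["CPUExecutionProvider"],
)


def segment_sketch(text: str) -> list[str]:
    if not text:
        return []
    # Placeholder features: one row per character; the shape is assumed, not real.
    features = np.zeros((len(text), 21), dtype=np.float32)
    input_name = _session.get_inputs()[0].name
    probs = _session.run(None, {input_name: features})[0].reshape(-1)

    # A probability above 0.5 marks a character as the start of a new word.
    words, start = [], 0
    for i in range(1, len(text)):
        if probs[i] > 0.5:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words
```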