
Migrate deepcut tokenizer from TensorFlow to built-in ONNX inference#1372

Open
Copilot wants to merge 2 commits into dev from copilot/remove-lekcut-insert-deepcut-onnx

Conversation

Contributor

Copilot AI commented Mar 29, 2026

What do these changes do

Replaces the external TensorFlow-based deepcut package dependency with a built-in ONNX inference engine. The DeepCut ONNX model (ported from LEKCut) is now bundled directly with PyThaiNLP, removing the need for TensorFlow and reducing dependencies.

What was wrong

The pythainlp.tokenize.deepcut module depended on the external deepcut package, which requires TensorFlow (~1–2 GB), making it a heavy optional dependency. The project already uses onnxruntime for other models (e.g., thai2rom, sefr_cut), so the TensorFlow dependency was unnecessary.

How this fixes it

  • pythainlp/corpus/deepcut.onnx (new): Bundles the DeepCut ONNX model (2.1 MB) ported from LEKCut directly in the package corpus directory.
  • pythainlp/corpus/default_db.json: Adds a deepcut_onnx entry so get_corpus_path() resolves the bundled model file.
  • pythainlp/tokenize/deepcut.py: Completely rewritten with direct ONNX inference using onnxruntime and numpy. Implements character/type feature encoding from the original DeepCut model. The segment() API is unchanged; custom_dict is kept for backward compatibility.
  • pyproject.toml: Removes deepcut (TF package) from noauto-tensorflow deps, full deps, and mypy overrides.
  • Tests: Deepcut test cases moved from tests/noauto_tensorflow/ to tests/noauto_onnx/, reflecting the new ONNX-only dependency.
  • CHANGELOG.md: Change documented under [Unreleased].
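The character/type feature encoding mentioned above can be sketched roughly as follows. This is an illustrative sketch only: the vocabulary tables, character-type rules, window size, and ONNX input names here are assumptions for the example, not the values actually shipped with PyThaiNLP.

```python
# Illustrative sketch of DeepCut-style character-window feature encoding.
# CHAR_INDEX, TYPE_INDEX, char_type_id rules, and the window size are all
# assumptions for illustration, not the shipped implementation.
import numpy as np

CHAR_INDEX = {c: i + 2 for i, c in enumerate("กขคงจฉชเแา ")}  # 0 = pad, 1 = unknown
TYPE_INDEX = {"consonant": 1, "vowel": 2, "space": 3, "other": 4}

def char_type_id(ch: str) -> int:
    """Map a character to a coarse type id (illustrative rules only)."""
    if ch == " ":
        return TYPE_INDEX["space"]
    if ch in "เแาิีุู":
        return TYPE_INDEX["vowel"]
    if "ก" <= ch <= "ฮ":
        return TYPE_INDEX["consonant"]
    return TYPE_INDEX["other"]

def encode(text: str, window: int = 10):
    """Build one (2*window+1)-wide feature row per character position."""
    char_ids = np.array([CHAR_INDEX.get(c, 1) for c in text], dtype=np.int64)
    type_ids = np.array([char_type_id(c) for c in text], dtype=np.int64)
    pad = np.zeros(window, dtype=np.int64)

    def windows(ids: np.ndarray) -> np.ndarray:
        # Pad both ends so every position gets a full context window.
        padded = np.concatenate([pad, ids, pad])
        return np.stack(
            [padded[i : i + 2 * window + 1] for i in range(len(text))]
        )

    return windows(char_ids), windows(type_ids)

# Inference would then feed these features to the bundled model, e.g.
# (input names are hypothetical):
#   sess = onnxruntime.InferenceSession(get_corpus_path("deepcut_onnx"))
#   probs = sess.run(None, {"char": char_feats, "type": type_feats})[0]
# and cut the text wherever the boundary probability exceeds a threshold.
```

The actual onnxruntime call is left in comments because the real input/output names belong to the bundled model; the runnable part shows only the sliding-window feature construction.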

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit tests


@wannaphong
Member

@copilot Try again

@wannaphong wannaphong added this to the 6.0 milestone Mar 29, 2026
- Bundle deepcut.onnx model (from LEKCut) in pythainlp/corpus/
- Add deepcut_onnx entry to default_db.json
- Rewrite pythainlp/tokenize/deepcut.py with ONNX inference;
  no longer depends on the TF-based deepcut package
- Move deepcut tests from noauto_tensorflow/ to noauto_onnx/
- Remove deepcut from pyproject.toml TF deps and mypy overrides
- Update CHANGELOG.md

Agent-Logs-Url: https://github.com/PyThaiNLP/pythainlp/sessions/ff10e57e-b105-4263-9f34-12b2e67cd4ab

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 29, 2026

@copilot Try again

Done in e6ca3fa. Here's what was implemented:

  • pythainlp/tokenize/deepcut.py: Completely rewritten with built-in ONNX inference (onnxruntime + numpy). No longer requires the TensorFlow-based deepcut package. Character/type feature encoding and model inference are all implemented directly.
  • pythainlp/corpus/deepcut.onnx: The DeepCut ONNX model (2.1 MB, ported from LEKCut) is now bundled with PyThaiNLP.
  • pythainlp/corpus/default_db.json: Added deepcut_onnx entry so get_corpus_path() resolves the bundled model.
  • pyproject.toml: Removed deepcut (TF package) from noauto-tensorflow deps, full deps, and mypy overrides.
  • Tests: Deepcut tests moved from tests/noauto_tensorflow/ to tests/noauto_onnx/.
  • CHANGELOG.md: Change documented under [Unreleased].

The segment() API is unchanged; custom_dict is kept for backward compatibility.
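Since segment() keeps its original contract (text in, list of tokens out), the last step of turning per-character boundary predictions into tokens can be illustrated like this. The helper name and the boundary convention (True at positions that start a new token) are illustrative assumptions, not the module's own code:

```python
# Illustrative helper: convert per-character boundary predictions into tokens.
# The convention assumed here (boundaries[i] is True when position i starts a
# new token) is for illustration only.
def boundaries_to_tokens(text, boundaries):
    tokens = []
    start = 0
    for i in range(1, len(text)):
        if boundaries[i]:            # a new token starts at position i
            tokens.append(text[start:i])
            start = i
    if text:
        tokens.append(text[start:])  # flush the final token
    return tokens

# "ตัดคำไทย" with boundaries at positions 0, 3, and 5
# segments into ตัด | คำ | ไทย.
print(boundaries_to_tokens(
    "ตัดคำไทย",
    [True, False, False, True, False, True, False, False],
))
```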

Copilot AI changed the title [WIP] Remove lekcut and insert deepcut onnx model to pythainlp Migrate deepcut tokenizer from TensorFlow to built-in ONNX inference Mar 29, 2026
@bact
Member

bact commented Mar 31, 2026

Alternatively, we can call the DeepcutTokenizer in nlpo3

@wannaphong
Member

Alternatively, we can call the DeepcutTokenizer in nlpo3

Yes, but I think it should be called from Python, because we already use onnxruntime in pythainlp.

@wannaphong wannaphong marked this pull request as ready for review April 5, 2026 14:50

Successfully merging this pull request may close these issues.

Remove lekcut and insert deepcut onnx model to pythainlp

3 participants