07 Apr 10:13

wannaphong

e64a9cf

LEKCut 1.0 Released! Latest

LEKCut

LEKCut (เล็ก คัด) is a Thai tokenization library that ports deep learning word segmentation models to ONNX.

Install

pip install lekcut

How to use

from lekcut import word_tokenize

# DeepCut model (default)
word_tokenize("ทดสอบการตัดคำ")
# output: ['ทดสอบ', 'การ', 'ตัด', 'คำ']

# AttaCut syllable + character model
word_tokenize("ทดสอบการตัดคำ", model="attacut-sc")
# output: ['ทดสอบ', 'การ', 'ตัด', 'คำ']

# AttaCut character-only model
word_tokenize("ทดสอบการตัดคำ", model="attacut-c")
# output: ['ทดสอบ', 'การ', 'ตัด', 'คำ']

# OSKut model
word_tokenize("เบียร์ยูไม่อร่อย", model="oskut")
# output: ['เบียร์', 'ยู', 'ไม่', 'อ', 'ร่อย']

# OSKut with a specific engine
word_tokenize("เบียร์ยูไม่อร่อย", model="oskut", engine="tnhc")
# output: ['เบียร์', 'ยู', 'ไม่', 'อร่อย']

# SEFR_CUT model
word_tokenize("เบียร์ยูไม่อร่อย", model="sefr-tnhc")
# output: ['เบียร์', 'ยู', 'ไม่', 'อร่อย']

API

word_tokenize(
    text: str,
    model: str = "deepcut",
    path: str = "default",
    providers: List[str] = None,
    engine: str = "ws",
    k: int = 1,
) -> List[str]

Parameters:

text: Text to tokenize
model: Model to use. Options: "deepcut" (default), "attacut-sc", "attacut-c", "oskut", "sefr-best", "sefr-tnhc", "sefr-ws1000"
path: Path to custom model file (default: "default", applies to deepcut and attacut-* models)
providers: List of ONNX Runtime execution providers (default: None, which uses the default CPU provider)
engine: OSKut engine variant (applies to "oskut" model only). Options: "ws" (default), "ws-augment-60p", "tnhc", "scads", "tl-deepcut-ws", "tl-deepcut-tnhc", "deepcut"
k: Percentage of characters to refine for OSKut (applies to "oskut" model only). The special default value of 1 is a sentinel that lets OSKut automatically select an appropriate percentage based on the engine. Pass any integer from 2 to 100 to override.
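
The option lists above can be checked before a call reaches the tokenizer. The following is a minimal sketch, not part of LEKCut: `build_kwargs` is a hypothetical helper, and the allowed values are copied from the parameter descriptions above.

```python
from typing import Any, Dict

# Option lists copied from the parameter descriptions above.
MODELS = (
    "deepcut", "attacut-sc", "attacut-c", "oskut",
    "sefr-best", "sefr-tnhc", "sefr-ws1000",
)
OSKUT_ENGINES = (
    "ws", "ws-augment-60p", "tnhc", "scads",
    "tl-deepcut-ws", "tl-deepcut-tnhc", "deepcut",
)


def build_kwargs(model: str = "deepcut", engine: str = "ws", k: int = 1) -> Dict[str, Any]:
    """Hypothetical helper: validate options and return keyword arguments.

    engine and k are only included for the "oskut" model, since the
    documentation says they apply to that model only.
    """
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}; expected one of {MODELS}")
    kwargs: Dict[str, Any] = {"model": model}
    if model == "oskut":
        if engine not in OSKUT_ENGINES:
            raise ValueError(f"unknown engine {engine!r}; expected one of {OSKUT_ENGINES}")
        # k == 1 is the sentinel for automatic selection; 2..100 overrides it.
        if not 1 <= k <= 100:
            raise ValueError("k must be 1 (automatic) or an integer from 2 to 100")
        kwargs.update(engine=engine, k=k)
    return kwargs
```

With LEKCut installed, the result could be passed on as `word_tokenize(text, **build_kwargs("oskut", engine="tnhc"))`.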

GPU Support

LEKCut supports GPU acceleration through ONNX Runtime execution providers. To use GPU acceleration:

Install ONNX Runtime with GPU support:
```
pip install onnxruntime-gpu
```

Use the providers parameter to specify GPU execution:

from lekcut import word_tokenize

# Use CUDA GPU
result = word_tokenize("ทดสอบการตัดคำ", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])

# Use TensorRT (if available)
result = word_tokenize("ทดสอบการตัดคำ", providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'])

Available Execution Providers:

CPUExecutionProvider - Default CPU execution
CUDAExecutionProvider - NVIDIA CUDA GPU acceleration
TensorrtExecutionProvider - NVIDIA TensorRT optimization
DmlExecutionProvider - DirectML for Windows GPU
And more (see ONNX Runtime documentation)

Note: The providers are tried in order, and the first available one will be used. Always include CPUExecutionProvider as a fallback.
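
To avoid hard-coding a provider list that may not match the machine, the list can be built from what ONNX Runtime reports. This is a sketch under assumptions: `pick_providers` and `available_providers` are illustrative names, not LEKCut functions; `onnxruntime.get_available_providers()` is the ONNX Runtime call that lists the providers in the installed build.

```python
from typing import List, Sequence

CPU = "CPUExecutionProvider"


def pick_providers(preferred: Sequence[str], available: Sequence[str]) -> List[str]:
    """Keep the preferred providers that are available, in order, with CPU last."""
    chosen = [p for p in preferred if p in available and p != CPU]
    chosen.append(CPU)  # always keep the CPU fallback at the end
    return chosen


def available_providers() -> List[str]:
    """Ask ONNX Runtime which providers this build supports (CPU only if not installed)."""
    try:
        import onnxruntime as ort
    except ImportError:
        return [CPU]
    return list(ort.get_available_providers())
```

The result can then be passed as `providers=pick_providers(["TensorrtExecutionProvider", "CUDAExecutionProvider"], available_providers())`.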

Model

deepcut - We ported the deepcut model from tensorflow.keras to ONNX. The model and code come from Deepcut's GitHub. The model is here.
attacut-sc - We ported the AttaCut syllable + character model from PyTorch to ONNX. The model and code come from AttaCut's GitHub. Requires the ssg package for syllable tokenization.
attacut-c - We ported the AttaCut character-only model from PyTorch to ONNX. The model and code come from AttaCut's GitHub.
oskut - We ported the OSKut (Out-of-domain Stacked Cut) stacked ensemble models from TensorFlow/Keras to ONNX. The model and code come from OSKut's GitHub. Requires the pyahocorasick package. Supports multiple engines: ws (default), ws-augment-60p, tnhc, scads, tl-deepcut-ws, tl-deepcut-tnhc, deepcut.
SEFR_CUT - We ported the SEFR CUT (Stacked Ensemble Filter and Refine for Word Segmentation) model from PyTorch to ONNX. The model and code come from SEFR_CUT's GitHub. Available models: "sefr-best", "sefr-tnhc", "sefr-ws1000".
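
Because attacut-sc and oskut need extra packages, it can help to check for them before selecting a model. The sketch below is illustrative only: `missing_modules` is a hypothetical helper, and the import names (`ssg` for the ssg package, `ahocorasick` for pyahocorasick) are assumptions about how those packages are imported.

```python
import importlib.util
from typing import Dict, List, Optional

# Assumed import names for the extra packages named in the model list above.
EXTRA_MODULES: Dict[str, List[str]] = {
    "attacut-sc": ["ssg"],
    "oskut": ["ahocorasick"],
}


def missing_modules(model: str, extras: Optional[Dict[str, List[str]]] = None) -> List[str]:
    """Return the top-level modules a model needs that cannot be found."""
    table = EXTRA_MODULES if extras is None else extras
    return [
        name
        for name in table.get(model, [])
        if importlib.util.find_spec(name) is None
    ]
```

An empty list means nothing is missing; for example, `missing_modules("oskut")` returns `["ahocorasick"]` when pyahocorasick is not installed.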

Load custom model

If you have trained a custom model with deepcut or another model that LEKCut supports, you can load it by passing its path to word_tokenize after porting the model.

How to train a custom model on your own dataset with deepcut - Notebook (you need to update deepcut/train.py before training the model)
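
Loading a custom model comes down to passing the `path` parameter documented in the API section. The following sketch assumes the ported model is a single file on disk; `tokenize_with_custom_model` is a hypothetical wrapper, and the tokenizer is passed in as an argument (in practice `lekcut.word_tokenize`) so the file check runs before any model is loaded.

```python
from pathlib import Path
from typing import Callable, List


def tokenize_with_custom_model(
    text: str,
    model_path: str,
    tokenizer: Callable[..., List[str]],
    model: str = "deepcut",
) -> List[str]:
    """Hypothetical wrapper: check that the model file exists, then tokenize.

    tokenizer is expected to accept text, model= and path=, as
    word_tokenize does according to the API section.
    """
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(f"custom model not found: {path}")
    return tokenizer(text, model=model, path=str(path))
```

With LEKCut installed, a call might look like `tokenize_with_custom_model("ทดสอบการตัดคำ", "my_model.onnx", word_tokenize)`, where `my_model.onnx` is a placeholder file name.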

What's Changed

Add GPU support via ONNX Runtime execution providers by @Copilot in #3
Add AttaCut model to ONNX by @Copilot in #4
Add SEFR CUT (ONNX-based) tokenizer by @Copilot in #8
Add OSKut model support to LEKCut (ONNX-based, no TensorFlow dependency) by @Copilot in #6
Add unittest suite and GitHub Actions CI workflow by @Copilot in #10
Fix build error: include requirements.txt in sdist via MANIFEST.in by @Copilot in #12

New Contributors

@Copilot made their first contribution in #3

Full Changelog: v0.1...v1.0.0

07 Apr 09:25

wannaphong

v1.0.0-beta1

31b4adb

v1.0.0-beta1 Pre-release

What's Changed

Add GPU support via ONNX Runtime execution providers by @Copilot in #3
Add AttaCut model to ONNX by @Copilot in #4
Add SEFR CUT (ONNX-based) tokenizer by @Copilot in #8
Add OSKut model support to LEKCut (ONNX-based, no TensorFlow dependency) by @Copilot in #6
Add unittest suite and GitHub Actions CI workflow by @Copilot in #10
Fix build error: include requirements.txt in sdist via MANIFEST.in by @Copilot in #12

New Contributors

@Copilot made their first contribution in #3

Full Changelog: v0.1...v1.0.0-beta1

28 Oct 16:16

wannaphong

v0.1

4e52838

LEKCut 0.1 Released!

LEKCut

LEKCut (เล็ก คัด) is a Thai tokenization library that ports deep learning word segmentation models to ONNX.

Install

pip install lekcut

How to use

from lekcut import word_tokenize
word_tokenize("ทดสอบการตัดคำ")
# output: ['ทดสอบ', 'การ', 'ตัด', 'คำ']

API

word_tokenize(text: str, model: str="deepcut", path: str="default") -> List[str]

Model

deepcut - We ported the deepcut model from tensorflow.keras to ONNX. The model and code come from Deepcut's GitHub. The model is here.

Load custom model

If you have trained a custom model with deepcut or another model that LEKCut supports, you can load it by passing its path to word_tokenize after porting the model.

How to train a custom model on your own dataset with deepcut - Notebook (you need to update deepcut/train.py before training the model)

How to port a model?

See notebooks/

Releases: PyThaiNLP/LEKCut