This module implements a Chinese Word Segmentation system using the Transformer architecture for sequence labeling.
Chinese word segmentation is a fundamental NLP task where continuous character sequences are split into meaningful words. This is challenging because Chinese text has no explicit word boundaries (spaces).
We use the B/E/S/M tagging scheme:
| Tag | Meaning | Example |
|---|---|---|
| B | Beginning of a word | 中[B] in "中国" |
| E | End of a word | 国[E] in "中国" |
| S | Single-character word | 的[S] |
| M | Middle of a word | 民[M] in "人民币" |
Example:
```
Input:  我 爱 中 国
Tags:   S  S  B  E
Output: 我 / 爱 / 中国
```
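Training targets follow directly from segmented text. A small illustrative helper (not part of the repo) that maps a list of words to per-character tags:

```python
def words_to_tags(words):
    """Convert segmented words into per-character B/M/E/S tags."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first char is B, last is E, anything between is M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

print(words_to_tags(["我", "爱", "中国"]))  # ['S', 'S', 'B', 'E']
print(words_to_tags(["人民币"]))            # ['B', 'M', 'E']
```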
The training data is located in `./dataset/`:
- `train.tags.en-zh.en` - training source text
- `train.tags.zh-en.zh` - training target text (with segmentation labels)
- `test.tags.en-zh.en` - test source
- `test.tags.zh-en.zh` - test target
```bash
# Preprocess the data
python prepro.py
```
This creates vocabulary files in the `./preprocessed/` directory.
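A minimal sketch of what a vocabulary-building step like this might do. The function name, special tokens, and `min_count` parameter are illustrative assumptions, not the repo's actual `prepro.py` code:

```python
from collections import Counter

def build_vocab(lines, min_count=1):
    """Count characters and build index maps, with special tokens first
    (illustrative; token names are assumptions)."""
    counter = Counter(ch for line in lines for ch in line.strip())
    vocab = ["<PAD>", "<UNK>", "<S>", "</S>"]
    vocab += [ch for ch, n in counter.most_common() if n >= min_count]
    char2idx = {ch: i for i, ch in enumerate(vocab)}
    idx2char = {i: ch for ch, i in char2idx.items()}
    return char2idx, idx2char

char2idx, idx2char = build_vocab(["我爱中国"])
```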
```bash
python train.py
```
Training logs and checkpoints will be saved to `seq2seq_model_dir/`.
```bash
python eval.py
```
| Metric | Score |
|---|---|
| BLEU | ~80 |
The model treats word segmentation as a sequence-to-sequence problem:
```
Input characters → Transformer encoder → Transformer decoder → B/E/S/M tags
```
- Encoder: Encodes input character sequence
- Decoder: Generates corresponding segmentation tags
- Attention: Both self-attention and cross-attention mechanisms
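At inference time, the decoder's tag sequence still has to be turned back into a segmentation. A minimal illustrative post-processing helper (not part of the repo):

```python
def tags_to_words(chars, tags):
    """Recombine characters into words from predicted B/M/E/S tags."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            if buf:               # flush any unfinished word defensively
                words.append(buf)
                buf = ""
            words.append(ch)
        elif tag == "B":
            if buf:
                words.append(buf)
            buf = ch
        else:                     # M or E: extend the current word
            buf += ch
            if tag == "E":
                words.append(buf)
                buf = ""
    if buf:
        words.append(buf)
    return words

print(tags_to_words(list("我爱中国"), ["S", "S", "B", "E"]))  # ['我', '爱', '中国']
```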
```
transformer_jieba/
├── dataset/
│   ├── train.tags.en-zh.en   # Training source
│   ├── train.tags.zh-en.zh   # Training target
│   ├── test.tags.en-zh.en    # Test source
│   ├── test.tags.zh-en.zh    # Test target
│   └── train.txt             # Raw training data
├── data_load.py              # Data loading utilities
├── data_pre.py               # Data preprocessing
├── eval.py                   # Model evaluation
├── modules.py                # Transformer building blocks
├── prepro.py                 # Vocabulary preprocessing
├── train.py                  # Training script
└── README.md                 # This file
```
| Parameter | Value |
|---|---|
| Max Sequence Length | 100 |
| Hidden Units | 512 |
| Encoder/Decoder Blocks | 5 |
| Attention Heads | 8 |
| Dropout Rate | 0.1 |
| Learning Rate | 0.0001 |
| Batch Size | 32 |
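The table above can be collected into a single config object. The sketch below is illustrative; the field names are assumptions and may differ from the repo's actual hyperparameter file:

```python
from dataclasses import dataclass

@dataclass
class Hyperparams:
    """Model hyperparameters from the table above (field names assumed)."""
    maxlen: int = 100          # max sequence length
    hidden_units: int = 512
    num_blocks: int = 5        # encoder/decoder blocks
    num_heads: int = 8
    dropout_rate: float = 0.1
    lr: float = 1e-4
    batch_size: int = 32

hp = Hyperparams()
```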
- Attention Is All You Need (Vaswani et al., 2017)
- Chinese word segmentation overview: ACL Anthology