🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
- Features
- Prerequisites
- Installation
- Configuration
- Usage
- Development
- Testing
- Contributing
- How to get help
- Terms of use
- Tibetan word tokenization using statistical and rule-based methods
- Support for both classical and modern Tibetan
- Integration with PyBo (Python Buddhist)
- Customizable segmentation rules
- Dictionary-based word lookup
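
To illustrate what dictionary-based word lookup means for Tibetan, here is a minimal sketch of greedy longest-match segmentation over syllables. It is not Botok's implementation: the names `split_syllables`, `longest_match`, and the two-entry `LEXICON` are hypothetical, and the sketch assumes the input contains only syllables separated by the tsek (་), with no spaces or shad (།).

```python
import re

TSEK = "\u0f0b"  # Tibetan syllable delimiter (་)
# One syllable = a run of non-tsek characters plus an optional trailing tsek.
SYLLABLE = re.compile(f"[^{TSEK}]+{TSEK}?")


def split_syllables(text):
    """Split text into syllables, keeping each trailing tsek attached."""
    return SYLLABLE.findall(text)


def longest_match(text, lexicon, max_len=4):
    """Greedy segmentation: at each position, take the longest run of
    syllables (up to max_len) found in the lexicon; fall back to a
    single syllable when nothing matches."""
    syllables = split_syllables(text)
    words = []
    i = 0
    while i < len(syllables):
        longest = min(max_len, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Toy lexicon for illustration only (hypothetical entries, stored with tsek).
LEXICON = {"བོད་ཡིག་", "དཔེ་ཆ་"}

text = "བོད་ཡིག་གི་དཔེ་ཆ་"
words = longest_match(text, LEXICON)
print(words)
```

With this toy lexicon the five syllables group into three units: two lexicon entries and one unmatched single syllable between them. A real tokenizer would also need to handle punctuation, affixed particles, and ambiguous matches, which this sketch leaves out.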
- Python 3.8+
- pip
# Clone the repository
git clone https://github.com/OpenPecha/Botok.git
cd Botok
# Install dependencies
pip install -r requirements.txt
# Install the package
pip install -e .

Botok can be configured via:
- Environment variables
- YAML configuration files in config/
- Python API
from botok import WordTokenizer
# Create a tokenizer instance
tokenizer = WordTokenizer()
# Tokenize Tibetan text
text = "བོད་ཡིག་གི་དཔེ་ཆ་"
tokens = tokenizer.tokenize(text)
print(tokens)

# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Lint
flake8 botok/

pytest tests/

- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
Please read CONTRIBUTING.md for details.
- File an issue.
- Join our Discord.
Botok is licensed under the Apache-2.0 License.
