OpenPecha

Botok

|Apache-2.0| |Python| |NLP|

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python

Features

  • Tibetan word tokenization using statistical and rule-based methods
  • Support for both classical and modern Tibetan
  • Integration with PyBo (Python Buddhist)
  • Customizable segmentation rules
  • Dictionary-based word lookup
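
To make the idea of dictionary-based word lookup concrete, here is a minimal sketch of greedy longest-match segmentation over Tibetan syllables. It is an illustration only and not Botok's actual implementation: the helper names (`split_syllables`, `longest_match`), the `max_len` limit, and the two-entry lexicon are all assumptions made for this example.

```python
# Illustrative only: greedy longest-match segmentation over Tibetan syllables.
# This is NOT Botok's implementation; all names and the lexicon are hypothetical.

TSEK = "\u0f0b"  # Tibetan syllable delimiter (tsek)


def split_syllables(text):
    """Split text on the tsek delimiter and drop empty pieces."""
    return [syl for syl in text.split(TSEK) if syl]


def longest_match(syllables, lexicon, max_len=4):
    """Group syllables into the longest tsek-joined entries found in lexicon.

    A single syllable that matches nothing is emitted as its own word.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = TSEK.join(syllables[i:i + size])
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon (hypothetical): entries are tsek-joined syllables, no trailing tsek.
LEXICON = {
    TSEK.join(["བོད", "ཡིག"]),
    TSEK.join(["དཔེ", "ཆ"]),
}

if __name__ == "__main__":
    text = TSEK.join(["བོད", "ཡིག", "གི", "དཔེ", "ཆ"]) + TSEK
    for word in longest_match(split_syllables(text), LEXICON):
        print(word)
```

With this toy lexicon, the five syllables of the sample text are grouped into three words: the two dictionary entries and the single unmatched syllable between them. A production tokenizer would typically use a trie rather than repeated set lookups and would also handle affixes and punctuation.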

Prerequisites

  • Python 3.8+
  • pip

Installation

# Clone the repository
git clone https://github.com/OpenPecha/Botok.git
cd Botok

# Install dependencies
pip install -r requirements.txt

# Install the package
pip install -e .

Configuration

Botok can be configured via:

  • Environment variables
  • YAML configuration files in config/
  • Python API
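
When settings can come from several sources, it helps to be explicit about which source wins. The sketch below shows one common precedence order (defaults, then file values, then environment variables, then explicit arguments). It does not describe how Botok itself resolves configuration: the setting names, the `BOTOK_` prefix, and the `resolve_settings` function are hypothetical and used only for illustration.

```python
import os

# All names below are hypothetical and exist only for this illustration;
# they are not documented Botok settings or environment variables.
DEFAULTS = {"dialect": "general", "split_affixes": "true"}
ENV_PREFIX = "BOTOK_"  # assumed prefix


def resolve_settings(file_settings=None, overrides=None, environ=None):
    """Merge settings: defaults < config file < environment < explicit arguments."""
    environ = os.environ if environ is None else environ
    settings = dict(DEFAULTS)
    # Values parsed from a configuration file (e.g. YAML) come next.
    settings.update(file_settings or {})
    # Environment variables override file values for known keys only.
    for key in DEFAULTS:
        env_key = ENV_PREFIX + key.upper()
        if env_key in environ:
            settings[key] = environ[env_key]
    # Arguments passed directly in Python code win over everything else.
    settings.update(overrides or {})
    return settings


if __name__ == "__main__":
    print(resolve_settings(file_settings={"dialect": "custom"}))
```

Passing `environ` explicitly keeps the function easy to test without modifying the real process environment.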

Usage

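from botok import WordTokenizer

# Create a tokenizer instance (with the default configuration, the first
# run may download a dialect pack)
tokenizer = WordTokenizer()

# Tokenize Tibetan text
text = "བོད་ཡིག་གི་དཔེ་ཆ་"
tokens = tokenizer.tokenize(text)

# Each token is an object; its surface form is in the .text attribute
for token in tokens:
    print(token.text)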

Development

# Install dev dependencies (quotes keep shells such as zsh from expanding the brackets)
pip install -e ".[dev]"

# Run tests
pytest

# Lint
flake8 botok/

Testing

pytest tests/

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Open a Pull Request

Please read CONTRIBUTING.md for details.

How to get help

Terms of use

Botok is licensed under the Apache-2.0 License.