Merged
209 changes: 176 additions & 33 deletions README.md
@@ -1,42 +1,167 @@
# spaCy-PyThaiNLP

[![PyPI version](https://img.shields.io/pypi/v/spacy-pythainlp.svg)](https://pypi.org/project/spacy-pythainlp/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)

This package wraps the [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp) library to add Thai language support for [spaCy](https://spacy.io/).

## Features

- Word segmentation (tokenization)
- Part-of-speech tagging
- Named entity recognition (NER)
- Sentence segmentation
- Dependency parsing
- Word vectors

## Table of Contents

- [Installation](#installation)
- [Quick Start](#quick-start)
- [Usage Examples](#usage-examples)
- [Basic Sentence Segmentation](#basic-sentence-segmentation)
- [Part-of-Speech Tagging](#part-of-speech-tagging)
- [Named Entity Recognition](#named-entity-recognition)
- [Dependency Parsing](#dependency-parsing)
- [Word Vectors](#word-vectors)
- [Configuration](#configuration)
- [License](#license)

## Installation

### Prerequisites

- Python 3.9 or higher
- spaCy 3.0 or higher
- PyThaiNLP 3.1.0 or higher

### Install via pip

```bash
pip install spacy-pythainlp
```
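
To confirm that an environment meets the prerequisites listed above, a small standard-library check can help. This is only a sketch: `parse_version`, `meets_minimum`, and `check_prerequisites` are hypothetical helper names that are not part of spacy-pythainlp, and `spacy` and `pythainlp` are assumed to be the PyPI distribution names of the two dependencies.

```python
import sys
from importlib import metadata

# Minimum versions taken from the prerequisites list; the keys are the
# assumed PyPI distribution names.
MINIMUM_VERSIONS = {"spacy": (3, 0), "pythainlp": (3, 1, 0)}


def parse_version(text):
    """Turn '3.1.0' into (3, 1, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare version tuples after padding the shorter one with zeros."""
    width = max(len(installed), len(minimum))
    pad = lambda v: v + (0,) * (width - len(v))
    return pad(installed) >= pad(minimum)


def check_prerequisites():
    """Return {name: True/False/None}; None means the package is not installed."""
    report = {"python": sys.version_info[:2] >= (3, 9)}
    for name, minimum in MINIMUM_VERSIONS.items():
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            report[name] = None
            continue
        report[name] = meets_minimum(installed, minimum)
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        status = "missing" if ok is None else "ok" if ok else "too old"
        print(f"{name}: {status}")
```

A `missing` entry is expected before the first install, since `pip install spacy-pythainlp` should pull in its declared requirements.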

## Quick Start

```python
import spacy
import spacy_pythainlp.core

# Create a blank Thai language model
nlp = spacy.blank("th")

# Add the PyThaiNLP pipeline component
nlp.add_pipe("pythainlp")

# Process text
doc = nlp("ผมเป็นคนไทย แต่มะลิอยากไปโรงเรียนส่วนผมจะไปไหน ผมอยากไปเที่ยว")

# Access sentences
for sent in doc.sents:
    print(sent)
# Output:
# ผมเป็นคนไทย แต่มะลิอยากไปโรงเรียนส่วนผมจะไปไหน
# ผมอยากไปเที่ยว
```

## Usage Examples

### Basic Sentence Segmentation

```python
import spacy
import spacy_pythainlp.core

nlp = spacy.blank("th")
nlp.add_pipe("pythainlp")

doc = nlp("ผมเป็นคนไทย แต่มะลิอยากไปโรงเรียนส่วนผมจะไปไหน ผมอยากไปเที่ยว")

# Get sentences
sentences = list(doc.sents)
print(f"Number of sentences: {len(sentences)}")
for i, sent in enumerate(sentences, 1):
    print(f"Sentence {i}: {sent.text}")
```

### Part-of-Speech Tagging

```python
import spacy
import spacy_pythainlp.core

nlp = spacy.blank("th")
nlp.add_pipe("pythainlp", config={"pos": True})

doc = nlp("ผมเป็นคนไทย")

# Print tokens with POS tags
for token in doc:
    print(f"{token.text}: {token.pos_}")
```

### Named Entity Recognition

```python
import spacy
import spacy_pythainlp.core

nlp = spacy.blank("th")
nlp.add_pipe("pythainlp", config={"ner": True})

doc = nlp("วันที่ 15 กันยายน 2564 ทดสอบระบบที่กรุงเทพ")

# Print named entities
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
```

### Dependency Parsing

```python
import spacy
import spacy_pythainlp.core

nlp = spacy.blank("th")
nlp.add_pipe("pythainlp", config={"dependency_parsing": True})

doc = nlp("ผมเป็นคนไทย")

# Print dependency relations
for token in doc:
    print(f"{token.text}: {token.dep_} <- {token.head.text}")
```

### Word Vectors

```python
import spacy
import spacy_pythainlp.core

nlp = spacy.blank("th")
nlp.add_pipe("pythainlp", config={"word_vector": True, "word_vector_model": "thai2fit_wv"})

doc = nlp("แมว สุนัข")

# Access word vectors
for token in doc:
    print(f"{token.text}: vector shape = {token.vector.shape}")

# Calculate similarity
token1 = doc[0] # แมว
token2 = doc[1] # สุนัข
print(f"Similarity: {token1.similarity(token2)}")
```
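
By default, spaCy's `Token.similarity` is the cosine similarity of the two tokens' vectors. The standard-library sketch below shows that computation on made-up three-dimensional vectors. The numbers are illustrative only and are not real `thai2fit_wv` embeddings, and `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two token vectors (not real embeddings).
vec_cat = [0.2, 0.7, 0.1]
vec_dog = [0.3, 0.6, 0.2]
print(f"Similarity: {cosine_similarity(vec_cat, vec_dog):.3f}")
```

Values near 1 mean the vectors point in almost the same direction. A token with no vector (all zeros) carries no similarity information, which is why the sketch returns 0 in that case.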

## Configuration

You can customize the PyThaiNLP pipeline component by passing a configuration dictionary to `nlp.add_pipe()`:

```python
nlp.add_pipe(
    "pythainlp",
    config={
        "pos_engine": "perceptron",
        "pos": True,
@@ -56,21 +181,39 @@ nlp.add_pipe(
)
```

### Configuration Options

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `tokenize` | `bool` | `False` | Enable/disable word tokenization (spaCy uses PyThaiNLP's newmm by default) |
| `tokenize_engine` | `str` | `"newmm"` | Tokenization engine. [See options](https://pythainlp.github.io/docs/3.1/api/tokenize.html#pythainlp.tokenize.word_tokenize) |
| `sent` | `bool` | `True` | Enable/disable sentence segmentation |
| `sent_engine` | `str` | `"crfcut"` | Sentence tokenizer engine. [See options](https://pythainlp.github.io/docs/3.1/api/tokenize.html#pythainlp.tokenize.sent_tokenize) |
| `pos` | `bool` | `True` | Enable/disable part-of-speech tagging |
| `pos_engine` | `str` | `"perceptron"` | POS tagging engine. [See options](https://pythainlp.github.io/docs/3.1/api/tag.html#pythainlp.tag.pos_tag) |
| `pos_corpus` | `str` | `"orchid_ud"` | Corpus for POS tagging |
| `ner` | `bool` | `True` | Enable/disable named entity recognition |
| `ner_engine` | `str` | `"thainer"` | NER engine. [See options](https://pythainlp.github.io/docs/3.1/api/tag.html#pythainlp.tag.NER) |
| `dependency_parsing` | `bool` | `False` | Enable/disable dependency parsing |
| `dependency_parsing_engine` | `str` | `"esupar"` | Dependency parsing engine. [See options](https://pythainlp.github.io/docs/3.1/api/parse.html#pythainlp.parse.dependency_parsing) |
| `dependency_parsing_model` | `str` | `None` | Dependency parsing model. [See options](https://pythainlp.github.io/docs/3.1/api/parse.html#pythainlp.parse.dependency_parsing) |
| `word_vector` | `bool` | `True` | Enable/disable word vectors |
| `word_vector_model` | `str` | `"thai2fit_wv"` | Word vector model. [See options](https://pythainlp.github.io/docs/3.1/api/word_vector.html#pythainlp.word_vector.WordVector) |

**Important Notes:**
- When `dependency_parsing` is enabled, word segmentation and sentence segmentation are automatically disabled to use the tokenization from the dependency parser.
- All configuration options are optional and have sensible defaults.
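
As a rough illustration of how overrides combine with the defaults in the table, the sketch below merges a user-supplied dictionary over a copy of those defaults and then applies the dependency-parsing note. It is a standalone model of the documented behavior, not the package's actual implementation: `DEFAULTS` simply restates the table, and `resolve_config` is a hypothetical helper.

```python
# Defaults copied from the configuration table.
DEFAULTS = {
    "tokenize": False,
    "tokenize_engine": "newmm",
    "sent": True,
    "sent_engine": "crfcut",
    "pos": True,
    "pos_engine": "perceptron",
    "pos_corpus": "orchid_ud",
    "ner": True,
    "ner_engine": "thainer",
    "dependency_parsing": False,
    "dependency_parsing_engine": "esupar",
    "dependency_parsing_model": None,
    "word_vector": True,
    "word_vector_model": "thai2fit_wv",
}


def resolve_config(overrides=None):
    """Merge overrides over DEFAULTS and model the dependency-parsing note."""
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise KeyError(f"unknown option(s): {', '.join(unknown)}")
    config = {**DEFAULTS, **overrides}
    if config["dependency_parsing"]:
        # Per the note: the parser supplies its own tokens and sentences.
        config["tokenize"] = False
        config["sent"] = False
    return config


effective = resolve_config({"ner": False, "dependency_parsing": True})
print(effective["ner"], effective["sent"], effective["tokenize"])
```

Only the keys that differ from the defaults need to be passed. In this model, a misspelled key raises an error instead of being silently ignored.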

## Resources

- [PyThaiNLP Documentation](https://pythainlp.github.io/)
- [spaCy Documentation](https://spacy.io/)
- [GitHub Repository](https://github.com/PyThaiNLP/spaCy-PyThaiNLP)
- [Issue Tracker](https://github.com/PyThaiNLP/spaCy-PyThaiNLP/issues)

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

## License

2 changes: 1 addition & 1 deletion setup.py
@@ -21,7 +21,7 @@
     author_email="wannaphong@yahoo.com",
     url="https://github.com/PyThaiNLP/spaCy-PyThaiNLP",
     packages=["spacy_pythainlp"],
-    python_requires=">=3.7",
+    python_requires=">=3.9",
     include_package_data=True,
     install_requires=requirements,
     license="Apache Software License 2.0",