Commit c3c41b3

Merge pull request #7 from PyThaiNLP/copilot/improve-readme-file
Enhance README with comprehensive examples and structured documentation
2 parents da571f6 + 023d7b0 commit c3c41b3

2 files changed

Lines changed: 177 additions & 34 deletions

README.md

Lines changed: 176 additions & 33 deletions
Original file line number | Diff line number | Diff line change
@@ -1,42 +1,167 @@
11
# spaCy-PyThaiNLP
2-
This package wraps the PyThaiNLP library to add support Thai for spaCy.
2+
3+
[![PyPI version](https://img.shields.io/pypi/v/spacy-pythainlp.svg)](https://pypi.org/project/spacy-pythainlp/)
4+
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
5+
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
6+
7+
This package wraps the [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp) library to add Thai language support for [spaCy](https://spacy.io/).
8+
9+
## Features
310

411
**Support List**
5-
- Word segmentation
6-
- Part-of-speech Tagging
7-
- Named entity recognition
12+
- Word segmentation (tokenization)
13+
- Part-of-speech tagging
14+
- Named entity recognition (NER)
815
- Sentence segmentation
916
- Dependency parsing
10-
- Word vector
17+
- Word vectors
18+
19+
## Table of Contents
1120

21+
- [Installation](#installation)
22+
- [Quick Start](#quick-start)
23+
- [Usage Examples](#usage-examples)
24+
  - [Basic Sentence Segmentation](#basic-sentence-segmentation)
25+
  - [Part-of-Speech Tagging](#part-of-speech-tagging)
26+
  - [Named Entity Recognition](#named-entity-recognition)
27+
  - [Dependency Parsing](#dependency-parsing)
28+
  - [Word Vectors](#word-vectors)
29+
- [Configuration](#configuration)
30+
- [License](#license)
1231

13-
## Install
32+
## Installation
1433

15-
> pip install spacy-pythainlp
34+
### Prerequisites
1635

17-
## How to use
36+
- Python 3.9 or higher
37+
- spaCy 3.0 or higher
38+
- PyThaiNLP 3.1.0 or higher
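
The prerequisites above can be checked from Python before installing. The following is a minimal sketch using only the standard library. The helper names `parse_version` and `check_prerequisites` are illustrative and not part of spaCy-PyThaiNLP, and `spacy` and `pythainlp` are assumed to be the installed distribution names of the two libraries. The version parser is deliberately naive: it keeps only the leading numeric components.

```python
import re
import sys
from importlib import metadata

# Minimum versions taken from the prerequisites list above.
MINIMUMS = {"spacy": (3, 0), "pythainlp": (3, 1, 0)}


def parse_version(text):
    """Turn '3.1.0' or '3.7.2.dev1' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_prerequisites():
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append("Python 3.9 or higher is required")
    for name, minimum in MINIMUMS.items():
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if installed < minimum:
            wanted = ".".join(map(str, minimum))
            problems.append(f"{name} {wanted} or higher is required")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

An empty list means the environment satisfies all three requirements; otherwise each entry names what is missing or too old.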
1839

40+
### Install via pip
41+
42+
```bash
43+
pip install spacy-pythainlp
44+
```
45+
46+
## Quick Start
1947

20-
**Example**
2148
```python
2249
import spacy
2350
import spacy_pythainlp.core
2451

52+
# Create a blank Thai language model
2553
nlp = spacy.blank("th")
26-
# Segment the Doc into sentences
27-
nlp.add_pipe(
28-
    "pythainlp",
29-
)
3054

31-
data=nlp("ผมเป็นคนไทย แต่มะลิอยากไปโรงเรียนส่วนผมจะไปไหน ผมอยากไปเที่ยว")
32-
print(list(list(data.sents)))
33-
# output: [ผมเป็นคนไทย แต่มะลิอยากไปโรงเรียนส่วนผมจะไปไหน , ผมอยากไปเที่ยว]
55+
# Add the PyThaiNLP pipeline component
56+
nlp.add_pipe("pythainlp")
57+
58+
# Process text
59+
doc = nlp("ผมเป็นคนไทย แต่มะลิอยากไปโรงเรียนส่วนผมจะไปไหน ผมอยากไปเที่ยว")
60+
61+
# Access sentences
62+
for sent in doc.sents:
63+
    print(sent)
64+
# Output:
65+
# ผมเป็นคนไทย แต่มะลิอยากไปโรงเรียนส่วนผมจะไปไหน
66+
# ผมอยากไปเที่ยว
67+
```
68+
69+
## Usage Examples
70+
71+
### Basic Sentence Segmentation
72+
73+
```python
74+
import spacy
75+
import spacy_pythainlp.core
76+
77+
nlp = spacy.blank("th")
78+
nlp.add_pipe("pythainlp")
79+
80+
doc = nlp("ผมเป็นคนไทย แต่มะลิอยากไปโรงเรียนส่วนผมจะไปไหน ผมอยากไปเที่ยว")
81+
82+
# Get sentences
83+
sentences = list(doc.sents)
84+
print(f"Number of sentences: {len(sentences)}")
85+
for i, sent in enumerate(sentences, 1):
86+
    print(f"Sentence {i}: {sent.text}")
87+
```
88+
89+
### Part-of-Speech Tagging
90+
91+
```python
92+
import spacy
93+
import spacy_pythainlp.core
94+
95+
nlp = spacy.blank("th")
96+
nlp.add_pipe("pythainlp", config={"pos": True})
97+
98+
doc = nlp("ผมเป็นคนไทย")
99+
100+
# Print tokens with POS tags
101+
for token in doc:
102+
    print(f"{token.text}: {token.pos_}")
103+
```
104+
105+
### Named Entity Recognition
106+
107+
```python
108+
import spacy
109+
import spacy_pythainlp.core
110+
111+
nlp = spacy.blank("th")
112+
nlp.add_pipe("pythainlp", config={"ner": True})
113+
114+
doc = nlp("วันที่ 15 กันยายน 2564 ทดสอบระบบที่กรุงเทพ")
115+
116+
# Print named entities
117+
for ent in doc.ents:
118+
    print(f"{ent.text}: {ent.label_}")
34119
```
35120

36-
You can config the setting in the nlp.add_pipe.
121+
### Dependency Parsing
122+
123+
```python
124+
import spacy
125+
import spacy_pythainlp.core
126+
127+
nlp = spacy.blank("th")
128+
nlp.add_pipe("pythainlp", config={"dependency_parsing": True})
129+
130+
doc = nlp("ผมเป็นคนไทย")
131+
132+
# Print dependency relations
133+
for token in doc:
134+
    print(f"{token.text}: {token.dep_} <- {token.head.text}")
135+
```
136+
137+
### Word Vectors
138+
139+
```python
140+
import spacy
141+
import spacy_pythainlp.core
142+
143+
nlp = spacy.blank("th")
144+
nlp.add_pipe("pythainlp", config={"word_vector": True, "word_vector_model": "thai2fit_wv"})
145+
146+
doc = nlp("แมว สุนัข")
147+
148+
# Access word vectors
149+
for token in doc:
150+
    print(f"{token.text}: vector shape = {token.vector.shape}")
151+
152+
# Calculate similarity
153+
token1 = doc[0] # แมว
154+
token2 = doc[-1]  # สุนัข (last token; the space in the input may be a token of its own)
155+
print(f"Similarity: {token1.similarity(token2)}")
156+
```
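
spaCy's `Token.similarity` defaults to the cosine similarity of the two token vectors. As a rough illustration of what that number means, here is a dependency-free sketch. The function name `cosine_similarity` and the toy three-dimensional vectors are made up for this example; they are not real `thai2fit_wv` embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat the similarity as 0.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for word embeddings (hypothetical values).
cat = [1.0, 2.0, 0.0]
dog = [2.0, 4.0, 0.0]
bird = [0.0, 0.0, 3.0]

print(cosine_similarity(cat, dog))
print(cosine_similarity(cat, bird))
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0, which is why related words are expected to score higher than unrelated ones.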
157+
158+
## Configuration
159+
160+
You can customize the PyThaiNLP pipeline component by passing a configuration dictionary to `nlp.add_pipe()`:
161+
37162
```python
38163
nlp.add_pipe(
39-
    "pythainlp",
164+
    "pythainlp",
40165
    config={
41166
        "pos_engine": "perceptron",
42167
        "pos": True,
@@ -56,21 +181,39 @@ nlp.add_pipe(
56181
)
57182
```
58183

59-
- tokenize: Bool (True or False) to change the word tokenize. (the default spaCy is newmm of PyThaiNLP)
60-
- tokenize_engine: The tokenize engine. You can read more: [Options for engine](https://pythainlp.github.io/docs/3.1/api/tokenize.html#pythainlp.tokenize.word_tokenize)
61-
- sent: Bool (True or False) to turn on the sentence tokenizer.
62-
- sent_engine: The sentence tokenizer engine. You can read more: [Options for engine](https://pythainlp.github.io/docs/3.1/api/tokenize.html#pythainlp.tokenize.sent_tokenize)
63-
- pos: Bool (True or False) to turn on the part-of-speech.
64-
- pos_engine: The part-of-speech engine. You can read more: [Options for engine](https://pythainlp.github.io/docs/3.1/api/tag.html#pythainlp.tag.pos_tag)
65-
- ner: Bool (True or False) to turn on the NER.
66-
- ner_engine: The NER engine. You can read more: [Options for engine](https://pythainlp.github.io/docs/3.1/api/tag.html#pythainlp.tag.NER)
67-
- dependency_parsing: Bool (True or False) to turn on the Dependency parsing.
68-
- dependency_parsing_engine: The Dependency parsing engine. You can read more: [Options for engine](https://pythainlp.github.io/docs/3.1/api/parse.html#pythainlp.parse.dependency_parsing)
69-
- dependency_parsing_model: The Dependency parsing model. You can read more: [Options for model](https://pythainlp.github.io/docs/3.1/api/parse.html#pythainlp.parse.dependency_parsing)
70-
- word_vector: Bool (True or False) to turn on the word vector.
71-
- word_vector_model: The word vector model. You can read more: [Options for model](https://pythainlp.github.io/docs/3.1/api/word_vector.html#pythainlp.word_vector.WordVector)
72-
73-
**Note: If you turn on Dependency parsing, word segmentation and sentence segmentation are turn off to use word segmentation and sentence segmentation from Dependency parsing.**
184+
### Configuration Options
185+
186+
| Parameter | Type | Default | Description |
187+
|-----------|------|---------|-------------|
188+
| `tokenize` | `bool` | `False` | Enable/disable word tokenization by this component (spaCy's built-in Thai tokenizer already uses PyThaiNLP's newmm by default) |
189+
| `tokenize_engine` | `str` | `"newmm"` | Tokenization engine. [See options](https://pythainlp.github.io/docs/3.1/api/tokenize.html#pythainlp.tokenize.word_tokenize) |
190+
| `sent` | `bool` | `True` | Enable/disable sentence segmentation |
191+
| `sent_engine` | `str` | `"crfcut"` | Sentence tokenizer engine. [See options](https://pythainlp.github.io/docs/3.1/api/tokenize.html#pythainlp.tokenize.sent_tokenize) |
192+
| `pos` | `bool` | `True` | Enable/disable part-of-speech tagging |
193+
| `pos_engine` | `str` | `"perceptron"` | POS tagging engine. [See options](https://pythainlp.github.io/docs/3.1/api/tag.html#pythainlp.tag.pos_tag) |
194+
| `pos_corpus` | `str` | `"orchid_ud"` | Corpus for POS tagging |
195+
| `ner` | `bool` | `True` | Enable/disable named entity recognition |
196+
| `ner_engine` | `str` | `"thainer"` | NER engine. [See options](https://pythainlp.github.io/docs/3.1/api/tag.html#pythainlp.tag.NER) |
197+
| `dependency_parsing` | `bool` | `False` | Enable/disable dependency parsing |
198+
| `dependency_parsing_engine` | `str` | `"esupar"` | Dependency parsing engine. [See options](https://pythainlp.github.io/docs/3.1/api/parse.html#pythainlp.parse.dependency_parsing) |
199+
| `dependency_parsing_model` | `str` | `None` | Dependency parsing model. [See options](https://pythainlp.github.io/docs/3.1/api/parse.html#pythainlp.parse.dependency_parsing) |
200+
| `word_vector` | `bool` | `True` | Enable/disable word vectors |
201+
| `word_vector_model` | `str` | `"thai2fit_wv"` | Word vector model. [See options](https://pythainlp.github.io/docs/3.1/api/word_vector.html#pythainlp.word_vector.WordVector) |
202+
203+
**Important Notes:**
204+
- When `dependency_parsing` is enabled, word segmentation and sentence segmentation are disabled automatically; the tokens and sentence boundaries produced by the dependency parser are used instead.
205+
- All configuration options are optional and have sensible defaults.
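
The interaction described in the first note can be pictured as a small config-resolution step. The sketch below is not the package's actual implementation: `DEFAULTS` simply copies the Default column of the table above, and `resolve_config` is a hypothetical helper that merges user settings over those defaults and then applies the dependency-parsing rule.

```python
# Defaults copied from the table above (not read from the package itself).
DEFAULTS = {
    "tokenize": False,
    "tokenize_engine": "newmm",
    "sent": True,
    "sent_engine": "crfcut",
    "pos": True,
    "pos_engine": "perceptron",
    "pos_corpus": "orchid_ud",
    "ner": True,
    "ner_engine": "thainer",
    "dependency_parsing": False,
    "dependency_parsing_engine": "esupar",
    "dependency_parsing_model": None,
    "word_vector": True,
    "word_vector_model": "thai2fit_wv",
}


def resolve_config(user_config=None):
    """Merge user settings over the defaults and apply the override rule."""
    user_config = user_config or {}
    unknown = set(user_config) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown option(s): {sorted(unknown)}")
    config = {**DEFAULTS, **user_config}
    if config["dependency_parsing"]:
        # The dependency parser supplies its own tokens and sentences.
        config["tokenize"] = False
        config["sent"] = False
    return config


effective = resolve_config({"dependency_parsing": True, "tokenize": True})
print(effective["tokenize"], effective["sent"])  # False False
```

Rejecting unknown keys is a design choice of this sketch; it makes typos such as `"dependency_parse"` fail loudly instead of being silently ignored.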
206+
207+
## Resources
208+
209+
- [PyThaiNLP Documentation](https://pythainlp.github.io/)
210+
- [spaCy Documentation](https://spacy.io/)
211+
- [GitHub Repository](https://github.com/PyThaiNLP/spaCy-PyThaiNLP)
212+
- [Issue Tracker](https://github.com/PyThaiNLP/spaCy-PyThaiNLP/issues)
213+
214+
## Contributing
215+
216+
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
74217

75218
## License
76219

setup.py

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -21,7 +21,7 @@
2121
author_email="wannaphong@yahoo.com",
2222
url="https://github.com/PyThaiNLP/spaCy-PyThaiNLP",
2323
packages=["spacy_pythainlp"],
24-
python_requires=">=3.7",
24+
python_requires=">=3.9",
2525
include_package_data=True,
2626
install_requires=requirements,
2727
license="Apache Software License 2.0",
