Merge pull request #7 from PyThaiNLP/copilot/improve-readme-file

wannaphong · web-flow · commit c3c41b309952 · 2026-02-05T16:14:24.000+07:00
Enhance README with comprehensive examples and structured documentation
diff --git a/README.md b/README.md
@@ -1,42 +1,167 @@
 # spaCy-PyThaiNLP
-This package wraps the PyThaiNLP library to add support Thai for spaCy.
+
+[![PyPI version](https://img.shields.io/pypi/v/spacy-pythainlp.svg)](https://pypi.org/project/spacy-pythainlp/)
+[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
+
+This package wraps the [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp) library to add Thai language support for [spaCy](https://spacy.io/).
+
+## Features
 
 **Support List**
-- Word segmentation
-- Part-of-speech Tagging
-- Named entity recognition
+- Word segmentation (tokenization)
+- Part-of-speech tagging
+- Named entity recognition (NER)
 - Sentence segmentation
 - Dependency parsing
-- Word vector
+- Word vectors
+
+## Table of Contents
 
+- [Installation](#installation)
+- [Quick Start](#quick-start)
+- [Usage Examples](#usage-examples)
+  - [Basic Sentence Segmentation](#basic-sentence-segmentation)
+  - [Part-of-Speech Tagging](#part-of-speech-tagging)
+  - [Named Entity Recognition](#named-entity-recognition)
+  - [Dependency Parsing](#dependency-parsing)
+  - [Word Vectors](#word-vectors)
+- [Configuration](#configuration)
+- [License](#license)
 
-## Install
+## Installation
 
-> pip install spacy-pythainlp
+### Prerequisites
 
-## How to use
+- Python 3.9 or higher
+- spaCy 3.0 or higher
+- PyThaiNLP 3.1.0 or higher
 
+### Install via pip
+
+```bash
+pip install spacy-pythainlp
+```
+
+## Quick Start
 
-**Example**
 ```python
 import spacy
 import spacy_pythainlp.core
 
+# Create a blank Thai language model
 nlp = spacy.blank("th")
-# Segment the Doc into sentences
-nlp.add_pipe(
-   "pythainlp", 
-)
 
-data=nlp("ผมเป็นคนไทย   แต่มะลิอยากไปโรงเรียนส่วนผมจะไปไหน  ผมอยากไปเที่ยว")
-print(list(list(data.sents)))
-# output: [ผมเป็นคนไทย   แต่มะลิอยากไปโรงเรียนส่วนผมจะไปไหน  , ผมอยากไปเที่ยว]
+# Add the PyThaiNLP pipeline component
+nlp.add_pipe("pythainlp")
+
+# Process text
+doc = nlp("ผมเป็นคนไทย แต่มะลิอยากไปโรงเรียนส่วนผมจะไปไหน ผมอยากไปเที่ยว")
+
+# Access sentences
+for sent in doc.sents:
+    print(sent)
+# Output:
+# ผมเป็นคนไทย แต่มะลิอยากไปโรงเรียนส่วนผมจะไปไหน
+# ผมอยากไปเที่ยว
+```
+
+## Usage Examples
+
+### Basic Sentence Segmentation
+
+```python
+import spacy
+import spacy_pythainlp.core
+
+nlp = spacy.blank("th")
+nlp.add_pipe("pythainlp")
+
+doc = nlp("ผมเป็นคนไทย แต่มะลิอยากไปโรงเรียนส่วนผมจะไปไหน ผมอยากไปเที่ยว")
+
+# Get sentences
+sentences = list(doc.sents)
+print(f"Number of sentences: {len(sentences)}")
+for i, sent in enumerate(sentences, 1):
+    print(f"Sentence {i}: {sent.text}")
+```
+
+### Part-of-Speech Tagging
+
+```python
+import spacy
+import spacy_pythainlp.core
+
+nlp = spacy.blank("th")
+nlp.add_pipe("pythainlp", config={"pos": True})
+
+doc = nlp("ผมเป็นคนไทย")
+
+# Print tokens with POS tags
+for token in doc:
+    print(f"{token.text}: {token.pos_}")
+```
+
+### Named Entity Recognition
+
+```python
+import spacy
+import spacy_pythainlp.core
+
+nlp = spacy.blank("th")
+nlp.add_pipe("pythainlp", config={"ner": True})
+
+doc = nlp("วันที่ 15 กันยายน 2564 ทดสอบระบบที่กรุงเทพ")
+
+# Print named entities
+for ent in doc.ents:
+    print(f"{ent.text}: {ent.label_}")
 ```
 
-You can config the setting in the nlp.add_pipe.
+### Dependency Parsing
+
+```python
+import spacy
+import spacy_pythainlp.core
+
+nlp = spacy.blank("th")
+nlp.add_pipe("pythainlp", config={"dependency_parsing": True})
+
+doc = nlp("ผมเป็นคนไทย")
+
+# Print dependency relations
+for token in doc:
+    print(f"{token.text}: {token.dep_} <- {token.head.text}")
+```
+
+### Word Vectors
+
+```python
+import spacy
+import spacy_pythainlp.core
+
+nlp = spacy.blank("th")
+nlp.add_pipe("pythainlp", config={"word_vector": True, "word_vector_model": "thai2fit_wv"})
+
+doc = nlp("แมว สุนัข")
+
+# Access word vectors
+for token in doc:
+    print(f"{token.text}: vector shape = {token.vector.shape}")
+
+# Calculate similarity
+token1 = doc[0]  # แมว
+token2 = doc[1]  # สุนัข
+print(f"Similarity: {token1.similarity(token2)}")
+```
+
+## Configuration
+
+You can customize the PyThaiNLP pipeline component by passing a configuration dictionary to `nlp.add_pipe()`:
+
 ```python
 nlp.add_pipe(
-    "pythainlp", 
+    "pythainlp",
     config={
         "pos_engine": "perceptron",
         "pos": True,
@@ -56,21 +181,39 @@ nlp.add_pipe(
 )
 ```
 
-- tokenize: Bool (True or False) to change the word tokenize. (the default spaCy is newmm of PyThaiNLP)
-- tokenize_engine: The tokenize engine. You can read more: [Options for engine](https://pythainlp.github.io/docs/3.1/api/tokenize.html#pythainlp.tokenize.word_tokenize)
-- sent: Bool (True or False) to turn on the sentence tokenizer.
-- sent_engine: The sentence tokenizer engine. You can read more: [Options for engine](https://pythainlp.github.io/docs/3.1/api/tokenize.html#pythainlp.tokenize.sent_tokenize)
-- pos:  Bool (True or False) to turn on the part-of-speech.
-- pos_engine: The part-of-speech engine. You can read more: [Options for engine](https://pythainlp.github.io/docs/3.1/api/tag.html#pythainlp.tag.pos_tag)
-- ner: Bool (True or False) to turn on the NER.
-- ner_engine: The NER engine. You can read more: [Options for engine](https://pythainlp.github.io/docs/3.1/api/tag.html#pythainlp.tag.NER)
-- dependency_parsing: Bool (True or False) to turn on the Dependency parsing.
-- dependency_parsing_engine: The Dependency parsing engine. You can read more: [Options for engine](https://pythainlp.github.io/docs/3.1/api/parse.html#pythainlp.parse.dependency_parsing)
-- dependency_parsing_model: The Dependency parsing model. You can read more: [Options for model](https://pythainlp.github.io/docs/3.1/api/parse.html#pythainlp.parse.dependency_parsing)
-- word_vector: Bool (True or False) to turn on the word vector.
-- word_vector_model: The word vector model. You can read more: [Options for model](https://pythainlp.github.io/docs/3.1/api/word_vector.html#pythainlp.word_vector.WordVector)
-
-**Note: If you turn on Dependency parsing, word segmentation and sentence segmentation are turn off to use word segmentation and sentence segmentation from Dependency parsing.**
+### Configuration Options
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `tokenize` | `bool` | `False` | Re-tokenize with `tokenize_engine` instead of spaCy's built-in Thai tokenizer (which already uses PyThaiNLP's newmm) |
+| `tokenize_engine` | `str` | `"newmm"` | Tokenization engine. [See options](https://pythainlp.github.io/docs/3.1/api/tokenize.html#pythainlp.tokenize.word_tokenize) |
+| `sent` | `bool` | `True` | Enable/disable sentence segmentation |
+| `sent_engine` | `str` | `"crfcut"` | Sentence tokenizer engine. [See options](https://pythainlp.github.io/docs/3.1/api/tokenize.html#pythainlp.tokenize.sent_tokenize) |
+| `pos` | `bool` | `True` | Enable/disable part-of-speech tagging |
+| `pos_engine` | `str` | `"perceptron"` | POS tagging engine. [See options](https://pythainlp.github.io/docs/3.1/api/tag.html#pythainlp.tag.pos_tag) |
+| `pos_corpus` | `str` | `"orchid_ud"` | Corpus for POS tagging |
+| `ner` | `bool` | `True` | Enable/disable named entity recognition |
+| `ner_engine` | `str` | `"thainer"` | NER engine. [See options](https://pythainlp.github.io/docs/3.1/api/tag.html#pythainlp.tag.NER) |
+| `dependency_parsing` | `bool` | `False` | Enable/disable dependency parsing |
+| `dependency_parsing_engine` | `str` | `"esupar"` | Dependency parsing engine. [See options](https://pythainlp.github.io/docs/3.1/api/parse.html#pythainlp.parse.dependency_parsing) |
+| `dependency_parsing_model` | `str` or `None` | `None` | Dependency parsing model. [See options](https://pythainlp.github.io/docs/3.1/api/parse.html#pythainlp.parse.dependency_parsing) |
+| `word_vector` | `bool` | `True` | Enable/disable word vectors |
+| `word_vector_model` | `str` | `"thai2fit_wv"` | Word vector model. [See options](https://pythainlp.github.io/docs/3.1/api/word_vector.html#pythainlp.word_vector.WordVector) |
+
+**Important Notes:**
+- When `dependency_parsing` is enabled, the separate word and sentence segmentation steps are switched off automatically, and the dependency parser's own token and sentence boundaries are used instead.
+- Every configuration option is optional; any option you omit falls back to the default listed in the table above.
+
+## Resources
+
+- [PyThaiNLP Documentation](https://pythainlp.github.io/)
+- [spaCy Documentation](https://spacy.io/)
+- [GitHub Repository](https://github.com/PyThaiNLP/spaCy-PyThaiNLP)
+- [Issue Tracker](https://github.com/PyThaiNLP/spaCy-PyThaiNLP/issues)
+
+## Contributing
+
+Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
 
 ## License
 
diff --git a/setup.py b/setup.py
@@ -21,7 +21,7 @@
     author_email="wannaphong@yahoo.com",
     url="https://github.com/PyThaiNLP/spaCy-PyThaiNLP",
     packages=["spacy_pythainlp"],
-    python_requires=">=3.7",
+    python_requires=">=3.9",
     include_package_data=True,
     install_requires=requirements,
     license="Apache Software License 2.0",
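
Review note on the README's Configuration Options table: the sketch below is one way to read the documented behaviour, namely that omitted options fall back to the listed defaults and that enabling `dependency_parsing` switches off separate word and sentence segmentation. It uses plain dictionaries only. `DEFAULTS` is copied from the table, while `resolve_config` and its rejection of unknown keys are hypothetical and are not part of spacy-pythainlp or spaCy.

```python
# Illustrative only: mirrors the README's defaults table, not the package internals.
DEFAULTS = {
    "tokenize": False,
    "tokenize_engine": "newmm",
    "sent": True,
    "sent_engine": "crfcut",
    "pos": True,
    "pos_engine": "perceptron",
    "pos_corpus": "orchid_ud",
    "ner": True,
    "ner_engine": "thainer",
    "dependency_parsing": False,
    "dependency_parsing_engine": "esupar",
    "dependency_parsing_model": None,
    "word_vector": True,
    "word_vector_model": "thai2fit_wv",
}


def resolve_config(overrides=None):
    """Hypothetical helper: overlay user overrides on the documented defaults."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Assumption for this sketch: unknown option names are treated as errors.
        raise KeyError(f"Unknown option(s): {sorted(unknown)}")
    config = {**DEFAULTS, **overrides}
    # Documented rule: the dependency parser supplies its own word and
    # sentence boundaries, so separate segmentation is switched off.
    if config["dependency_parsing"]:
        config["tokenize"] = False
        config["sent"] = False
    return config


if __name__ == "__main__":
    effective = resolve_config({"dependency_parsing": True, "ner": False})
    for key in ("tokenize", "sent", "ner", "dependency_parsing"):
        print(f"{key}: {effective[key]}")
```

In real use the partial dictionary would simply be passed as `config=` to `nlp.add_pipe("pythainlp", ...)`, as the README examples do; the helper here only makes the fallback and override rules explicit.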