
Add ThaiG2P v3 ONNX engine to transliterate #1399

Draft
Copilot wants to merge 7 commits into dev from copilot/add-thai-g2p-v3-to-pythainlp

Conversation

Contributor

Copilot AI commented Apr 4, 2026

ThaiG2P v3 is a character-level Transformer G2P model exported to ONNX, trained on improved data relative to v2 and shipped with a smaller model footprint.

What do these changes do

  • pythainlp/corpus/thaig2p_v3_encoder.onnx — Bundled encoder ONNX model
  • pythainlp/corpus/thaig2p_v3_decoder.onnx — Bundled decoder ONNX model
  • pythainlp/corpus/thaig2p_v3_vocab.json — Bundled vocabulary (character-to-index mapping)
  • pythainlp/corpus/default_db.json — Added entries for the three bundled model files
  • pythainlp/transliterate/thaig2p_v3.py — New module: ONNX inference via onnxruntime with greedy decoder loop; loads model files from pythainlp/corpus via get_corpus_path()
  • pythainlp/transliterate/core.py — Added thaig2p_v3 engine dispatch + docstring entry
  • tests/extra/testx_transliterate.py — Added thaig2p_v3 test cases alongside existing v2/umt5 tests
  • docs/api/transliterate.rst — Documented the new engine
  • CHANGELOG.md — Logged under [Unreleased]
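The greedy decoder loop in thaig2p_v3.py operates on integer token IDs rather than characters. A minimal sketch of the input-encoding step, assuming the <SOS>=1, <EOS>=2, <UNK>=3 convention used in the upstream example quoted later in this thread (the vocabulary here is a toy stand-in, not the bundled thaig2p_v3_vocab.json):

```python
SOS, EOS, UNK = 1, 2, 3  # special-token IDs, assumed from the upstream example


def encode(text: str, char2idx: dict) -> list:
    """Wrap character indices in <SOS>/<EOS>; unknown characters map to <UNK>."""
    return [SOS] + [char2idx.get(c, UNK) for c in text] + [EOS]


# Toy vocabulary stand-in; "?" is not in the vocab, so it maps to <UNK> (3).
print(encode("ab?", {"a": 10, "b": 11}))  # [1, 10, 11, 3, 2]
```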

Usage

from pythainlp.transliterate import transliterate

transliterate("สวัสดี", engine="thaig2p_v3")
# '/sa˨˩.wat̚˨˩.diː˧/'

transliterate("ภาษาไทย", engine="thaig2p_v3")
# '/pʰaː˧.saː˩˩˦.tʰaj˧/'

What was wrong

ThaiG2P v3 existed upstream (wannaphong/thai-g2p-v3) but was not integrated into PyThaiNLP.

How this fixes it

Adds thaig2p_v3 as a first-class transliterate() engine. Unlike v2 (HuggingFace Transformers pipeline), v3 uses onnxruntime directly, matching the ONNX-first design of the upstream model. The three model files (thaig2p_v3_encoder.onnx, thaig2p_v3_decoder.onnx, thaig2p_v3_vocab.json) are bundled in pythainlp/corpus/ and registered in default_db.json, consistent with how thai2rom_onnx is packaged. The module loads them via get_corpus_path() with no custom download logic required.
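The registration in default_db.json presumably follows the existing corpus-entry pattern used for other bundled files. The sketch below is illustrative only; the key and field names are guesses, not the actual schema:

```json
{
  "thaig2p_v3_encoder": { "file_name": "thaig2p_v3_encoder.onnx", "version": "1.0" },
  "thaig2p_v3_decoder": { "file_name": "thaig2p_v3_decoder.onnx", "version": "1.0" },
  "thaig2p_v3_vocab":   { "file_name": "thaig2p_v3_vocab.json",   "version": "1.0" }
}
```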

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit tests

Copilot AI linked an issue Apr 4, 2026 that may be closed by this pull request
@wannaphong
Member

@copilot Example:

import onnxruntime as ort
import numpy as np
import json

# Load Vocab
with open("thai2ipa_vocab.json", "r", encoding="utf-8") as f:
    vocab = json.load(f)
char2idx = vocab["input_char2idx"]
idx2char = {int(k): v for k, v in vocab["target_idx2char"].items()}

# Load ONNX Sessions
encoder_sess = ort.InferenceSession("thai2ipa_encoder.onnx", providers=['CPUExecutionProvider'])
decoder_sess = ort.InferenceSession("thai2ipa_decoder.onnx", providers=['CPUExecutionProvider'])

def transliterate_onnx(text, max_len=50):
    # 1. Prepare Input
    src = [1] + [char2idx.get(c, 3) for c in text] + [2]  # <SOS>, chars, <EOS>
    src_tensor = np.array([src], dtype=np.int64)
    
    # 2. Run Encoder
    enc_outputs = encoder_sess.run(None, {"src": src_tensor})
    memory, src_pad_mask = enc_outputs[0], enc_outputs[1]
    
    # 3. Autoregressive Decoder Loop (Greedy Search)
    trg_indexes = [1] # Start with <SOS>
    
    for _ in range(max_len):
        trg_tensor = np.array([trg_indexes], dtype=np.int64)
        
        # Run Decoder
        dec_outputs = decoder_sess.run(None, {
            "trg": trg_tensor,
            "memory": memory,
            "src_pad_mask": src_pad_mask
        })
        
        # dec_outputs[0] shape: (Batch, Trg_Len, Vocab_Size)
        # Get the highest probability token for the LAST generated step
        next_token_logits = dec_outputs[0][0, -1, :]
        next_token = int(np.argmax(next_token_logits))
        
        if next_token == 2: # <EOS> reached
            break
            
        trg_indexes.append(next_token)
        
    # 4. Decode to String
    result = "".join([idx2char[idx] for idx in trg_indexes[1:]])
    return result

print(transliterate_onnx("สวัสดี"))
print(transliterate_onnx("สวัสดีค่ะ"))
print(transliterate_onnx("คน"))
print(transliterate_onnx("ดี"))
print(transliterate_onnx("วรรณพงษ์"))
print(transliterate_onnx("ราชกิจ"))
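The autoregressive loop above can be exercised without onnxruntime or the model files by swapping the decoder session for a scripted stand-in. A minimal sketch of the same greedy structure (the <SOS>=1/<EOS>=2 convention and the argmax-per-step logic follow the example above; the scripted "model" is hypothetical):

```python
SOS, EOS = 1, 2  # special-token IDs, as in the example above


def greedy_decode(step_fn, max_len=50):
    """Greedy autoregressive loop: extend the target one token at a time,
    stopping at <EOS> or after max_len steps. step_fn stands in for the
    decoder_sess.run(...) + argmax step in the real inference code."""
    trg = [SOS]
    for _ in range(max_len):
        next_token = step_fn(trg)
        if next_token == EOS:
            break
        trg.append(next_token)
    return trg[1:]  # drop <SOS>


# Scripted "decoder": emits 5, 6, 7, then <EOS>, regardless of input.
script = [5, 6, 7, EOS]
out = greedy_decode(lambda trg: script[len(trg) - 1])
print(out)  # [5, 6, 7]
```

The real module replaces step_fn with a call to the ONNX decoder session on the growing target tensor, exactly as in the loop above.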

@wannaphong
Member

@copilot Include model files to pythainlp/corpus

Copilot AI changed the title from "[WIP] Add ThaiG2P v3 model file to PyThaiNLP" to "Add ThaiG2P v3 ONNX engine to transliterate" on Apr 4, 2026
Copilot AI requested a review from wannaphong April 4, 2026 13:27
Contributor Author

Copilot AI commented Apr 4, 2026

@copilot Include model files to pythainlp/corpus

Done in commit 7749340. The three model files (thaig2p_v3_encoder.onnx, thaig2p_v3_decoder.onnx, thaig2p_v3_vocab.json) are now bundled in pythainlp/corpus/ with entries added to default_db.json. The thaig2p_v3.py module now uses get_corpus_path() to load them, consistent with how thai2rom_onnx works.

@coveralls

coveralls commented Apr 4, 2026

Coverage Status

coverage: 66.26% (-0.4%) from 66.633%
when pulling a21b0f9 on copilot/add-thai-g2p-v3-to-pythainlp
into 3152a36 on dev.

@sonarqubecloud

sonarqubecloud bot commented Apr 4, 2026

@wannaphong wannaphong added this to the 6.0 milestone Apr 4, 2026


Development

Successfully merging this pull request may close these issues.

Add thai-g2p-v3 to PyThaiNLP

3 participants