
Add ThaiG2P v3 ONNX engine to transliterate #1399

Draft
Copilot wants to merge 7 commits into dev from copilot/add-thai-g2p-v3-to-pythainlp

Conversation

Contributor

Copilot AI commented Apr 4, 2026

ThaiG2P v3 is a character-level Transformer G2P model exported to ONNX, trained on improved data relative to v2 and shipped with a smaller model footprint.

What do these changes do

  • pythainlp/corpus/thaig2p_v3_encoder.onnx — Bundled encoder ONNX model
  • pythainlp/corpus/thaig2p_v3_decoder.onnx — Bundled decoder ONNX model
  • pythainlp/corpus/thaig2p_v3_vocab.json — Bundled vocabulary (character-to-index mapping)
  • pythainlp/corpus/default_db.json — Added entries for the three bundled model files
  • pythainlp/transliterate/thaig2p_v3.py — New module: ONNX inference via onnxruntime with greedy decoder loop; loads model files from pythainlp/corpus via get_corpus_path()
  • pythainlp/transliterate/core.py — Added thaig2p_v3 engine dispatch + docstring entry
  • tests/extra/testx_transliterate.py — Added thaig2p_v3 test cases alongside existing v2/umt5 tests
  • docs/api/transliterate.rst — Documented the new engine
  • CHANGELOG.md — Logged under [Unreleased]
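The greedy decoder loop in thaig2p_v3.py operates on integer token IDs rather than characters. A minimal sketch of the input-encoding step, assuming the <SOS>=1, <EOS>=2, <UNK>=3 convention used in the upstream example quoted later in this thread (the vocabulary here is a toy stand-in, not the bundled thaig2p_v3_vocab.json):

```python
SOS, EOS, UNK = 1, 2, 3  # special-token IDs, assumed from the upstream example


def encode(text: str, char2idx: dict) -> list:
    """Wrap character indices in <SOS>/<EOS>; unknown characters map to <UNK>."""
    return [SOS] + [char2idx.get(c, UNK) for c in text] + [EOS]


# Toy vocabulary stand-in; "?" is not in the vocab, so it maps to <UNK> (3).
print(encode("ab?", {"a": 10, "b": 11}))  # [1, 10, 11, 3, 2]
```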

Usage

from pythainlp.transliterate import transliterate

transliterate("สวัสดี", engine="thaig2p_v3")
# '/sa˨˩.wat̚˨˩.diː˧/'

transliterate("ภาษาไทย", engine="thaig2p_v3")
# '/pʰaː˧.saː˩˩˦.tʰaj˧/'

What was wrong

ThaiG2P v3 existed upstream (wannaphong/thai-g2p-v3) but was not integrated into PyThaiNLP.

How this fixes it

Adds thaig2p_v3 as a first-class transliterate() engine. Unlike v2 (HuggingFace Transformers pipeline), v3 uses onnxruntime directly, matching the ONNX-first design of the upstream model. The three model files (thaig2p_v3_encoder.onnx, thaig2p_v3_decoder.onnx, thaig2p_v3_vocab.json) are bundled in pythainlp/corpus/ and registered in default_db.json, consistent with how thai2rom_onnx is packaged. The module loads them via get_corpus_path() with no custom download logic required.
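The registration in default_db.json presumably follows the existing corpus-entry pattern used for other bundled files. The sketch below is illustrative only; the key and field names are guesses, not the actual schema:

```json
{
  "thaig2p_v3_encoder": { "file_name": "thaig2p_v3_encoder.onnx", "version": "1.0" },
  "thaig2p_v3_decoder": { "file_name": "thaig2p_v3_decoder.onnx", "version": "1.0" },
  "thaig2p_v3_vocab":   { "file_name": "thaig2p_v3_vocab.json",   "version": "1.0" }
}
```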

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit tests

Copilot AI linked an issue Apr 4, 2026 that may be closed by this pull request
@wannaphong
Member

@copilot Example:

import onnxruntime as ort
import numpy as np
import json

# Load Vocab
with open("thai2ipa_vocab.json", "r", encoding="utf-8") as f:
    vocab = json.load(f)
char2idx = vocab["input_char2idx"]
idx2char = {int(k): v for k, v in vocab["target_idx2char"].items()}

# Load ONNX Sessions
encoder_sess = ort.InferenceSession("thai2ipa_encoder.onnx", providers=['CPUExecutionProvider'])
decoder_sess = ort.InferenceSession("thai2ipa_decoder.onnx", providers=['CPUExecutionProvider'])

def transliterate_onnx(text, max_len=50):
    # 1. Prepare Input
    src = [1] + [char2idx.get(c, 3) for c in text] + [2]  # <SOS>, chars, <EOS>
    src_tensor = np.array([src], dtype=np.int64)
    
    # 2. Run Encoder
    enc_outputs = encoder_sess.run(None, {"src": src_tensor})
    memory, src_pad_mask = enc_outputs[0], enc_outputs[1]
    
    # 3. Autoregressive Decoder Loop (Greedy Search)
    trg_indexes = [1] # Start with <SOS>
    
    for _ in range(max_len):
        trg_tensor = np.array([trg_indexes], dtype=np.int64)
        
        # Run Decoder
        dec_outputs = decoder_sess.run(None, {
            "trg": trg_tensor,
            "memory": memory,
            "src_pad_mask": src_pad_mask
        })
        
        # dec_outputs[0] shape: (Batch, Trg_Len, Vocab_Size)
        # Get the highest probability token for the LAST generated step
        next_token_logits = dec_outputs[0][0, -1, :]
        next_token = int(np.argmax(next_token_logits))
        
        if next_token == 2: # <EOS> reached
            break
            
        trg_indexes.append(next_token)
        
    # 4. Decode to String
    result = "".join([idx2char[idx] for idx in trg_indexes[1:]])
    return result

print(transliterate_onnx("สวัสดี"))
print(transliterate_onnx("สวัสดีค่ะ"))
print(transliterate_onnx("คน"))
print(transliterate_onnx("ดี"))
print(transliterate_onnx("วรรณพงษ์"))
print(transliterate_onnx("ราชกิจ"))
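The autoregressive loop above can be exercised without onnxruntime or the model files by swapping the decoder session for a scripted stand-in. A minimal sketch of the same greedy structure (the <SOS>=1/<EOS>=2 convention and the argmax-per-step logic follow the example above; the scripted "model" is hypothetical):

```python
SOS, EOS = 1, 2  # special-token IDs, as in the example above


def greedy_decode(step_fn, max_len=50):
    """Greedy autoregressive loop: extend the target one token at a time,
    stopping at <EOS> or after max_len steps. step_fn stands in for the
    decoder_sess.run(...) + argmax step in the real inference code."""
    trg = [SOS]
    for _ in range(max_len):
        next_token = step_fn(trg)
        if next_token == EOS:
            break
        trg.append(next_token)
    return trg[1:]  # drop <SOS>


# Scripted "decoder": emits 5, 6, 7, then <EOS>, regardless of input.
script = [5, 6, 7, EOS]
out = greedy_decode(lambda trg: script[len(trg) - 1])
print(out)  # [5, 6, 7]
```

The real module replaces step_fn with a call to the ONNX decoder session on the growing target tensor, exactly as in the loop above.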

@wannaphong
Member

@copilot Include model files to pythainlp/corpus

Copilot AI changed the title from "[WIP] Add ThaiG2P v3 model file to PyThaiNLP" to "Add ThaiG2P v3 ONNX engine to transliterate" on Apr 4, 2026
Copilot AI requested a review from wannaphong April 4, 2026 13:27
Contributor Author

Copilot AI commented Apr 4, 2026

@copilot Include model files to pythainlp/corpus

Done in commit 7749340. The three model files (thaig2p_v3_encoder.onnx, thaig2p_v3_decoder.onnx, thaig2p_v3_vocab.json) are now bundled in pythainlp/corpus/ with entries added to default_db.json. The thaig2p_v3.py module now uses get_corpus_path() to load them, consistent with how thai2rom_onnx works.

@coveralls

coveralls commented Apr 4, 2026

Coverage Status

coverage: 66.26% (-0.4%) from 66.633%
when pulling a21b0f9 on copilot/add-thai-g2p-v3-to-pythainlp
into 3152a36 on dev.

@sonarqubecloud

sonarqubecloud bot commented Apr 4, 2026

@wannaphong wannaphong added this to the 6.0 milestone Apr 4, 2026


Development

Successfully merging this pull request may close these issues.

Add thai-g2p-v3 to PyThaiNLP

3 participants