| SPDX-FileCopyrightText | 2025-2026 PyThaiNLP Project |
|---|---|
| SPDX-FileType | DOCUMENTATION |
| SPDX-License-Identifier | CC0-1.0 |
All notable changes to this project are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Full release notes: https://github.com/PyThaiNLP/pythainlp/releases
- Commit history: https://github.com/PyThaiNLP/pythainlp/compare/v5.3.3...v5.3.4
- Improve guardrails in `check_sara()` and `nighit()`
5.3.4 - 2026-04-02
- Value range check fixes (#1374, #1379, #1382).
- "1001" -> "หนึ่งพันเอ็ด" rule (#1386).
- Build WSD Trie after populating dictionary (#1388).
- Doctests across all modules (#1392).
5.3.3 - 2026-03-26
Security fixes and thai2rom_onnx bug fixes.
- `EntitySpan` `TypedDict` to allow type checking of tagged entity members (#1363). Migration notes:

  ```python
  # Before (plain dict)
  from pythainlp.tag.thai_nner import get_top_level_entities

  entities = [
      {"text": ["ห้า"], "span": [7, 9], "entity_type": "cardinal"},
      {"text": ["ห้า", "โมง"], "span": [7, 11], "entity_type": "time"},
      {"text": ["โมง"], "span": [9, 11], "entity_type": "unit"},
  ]
  top_entities = get_top_level_entities(entities)

  # After (TypedDict)
  from pythainlp.tag.named_entity import EntitySpan
  from pythainlp.tag.thai_nner import get_top_level_entities

  entities = [
      EntitySpan(text=["ห้า"], span=[7, 9], entity_type="cardinal"),
      EntitySpan(text=["ห้า", "โมง"], span=[7, 11], entity_type="time"),
      EntitySpan(text=["โมง"], span=[9, 11], entity_type="unit"),
  ]
  top_entities = get_top_level_entities(entities)
  ```
- thai2rom_onnx: fix ONNX encoder model and fix inference bugs (#1349)
- wordnet: fix AttributeError (#1354)
- Replace `os.path.join` with `safe_path_join` throughout the codebase to prevent path manipulation vulnerabilities (CWE-22) (#1369)
5.3.2 - 2026-03-19
This release focuses on security improvements related to path traversal and on renaming functions to conform to PEP 8 and follow NLTK conventions. The old function names remain accessible, but migrating to the new names is recommended because the old names will be removed in a future version.
- `pythainlp.chunk` module: canonical home for chunking/phrase-structure parsing, following the NLTK `nltk.chunk` naming convention.
The following names are deprecated and will be removed in 6.0 (#1339):
- `pythainlp.util.isthaichar()`: use `pythainlp.util.is_thai_char()`.
- `pythainlp.util.isthai()`: use `pythainlp.util.is_thai()`.
- `pythainlp.util.countthai()`: use `pythainlp.util.count_thai()`.
- `pythainlp.tag.crfchunk.CRFchunk`: use `pythainlp.chunk.CRFChunkParser`.
- `pythainlp.tag.chunk_parse()`: use `pythainlp.chunk.chunk_parse()`.
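
For projects that wrap these functions and want to follow the same rename pattern, the sketch below shows one common way to keep an old name working while steering callers to the new one. The function bodies are simplified stand-ins written for this example and are an assumption, not PyThaiNLP's actual implementation.

```python
import warnings


def is_thai_char(ch: str) -> bool:
    """Simplified stand-in: True if ch is one character in the Thai Unicode block."""
    return len(ch) == 1 and "\u0e00" <= ch <= "\u0e7f"


def isthaichar(ch: str) -> bool:
    """Old name kept as a thin alias that warns and delegates (illustrative)."""
    warnings.warn(
        "isthaichar() is deprecated; use is_thai_char() instead.",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this alias
    )
    return is_thai_char(ch)
```

Running a test suite with `python -W error::DeprecationWarning` is one way to surface remaining uses of old names before 6.0.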
- Prevent path traversal: validate that paths stay within their expected base directory (#1342)
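
The containment check described in this entry can be sketched roughly as follows. `resolve_within_base` is a hypothetical helper named for this example only; it is not the project's `safe_path_join`, whose signature is not documented here.

```python
from pathlib import Path


def resolve_within_base(base_dir: str, relative_path: str) -> Path:
    """Join relative_path onto base_dir and refuse results outside base_dir.

    Hypothetical helper for illustration only.
    """
    base = Path(base_dir).resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (base / relative_path).resolve()
    try:
        # relative_to() raises ValueError when candidate is not under base.
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"path escapes base directory: {relative_path!r}") from None
    return candidate
```

Resolving both paths before comparing them matters: a plain string-prefix check on the unresolved path would accept inputs such as `corpus/../../outside.txt`.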
5.3.1 - 2026-03-14
This release focuses on security issues related to corpus file loading.
- thai2fit: Use JSON model instead of pickle (#1325)
- Defensive corpus loading: validate fields before processing (#1327)
- w2p: Use npz model instead of pickle (#1328)
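
As background for these entries: unpickling untrusted data can execute arbitrary code, whereas parsing JSON only builds plain data structures. A minimal sketch of loading a JSON file defensively is shown below; the field names in `REQUIRED_FIELDS` and the function `load_model_metadata` are hypothetical and do not reflect the actual thai2fit or corpus file layout.

```python
import json
from typing import Any

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"name": str, "version": str, "weights": list}


def load_model_metadata(raw: str) -> dict[str, Any]:
    """Parse JSON text and validate its fields before any further processing."""
    data = json.loads(raw)  # builds dicts, lists, strings, numbers; runs no code
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"field {field!r} must be {expected_type.__name__}")
    return data
```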
5.3.0 - 2026-03-10
This release focuses on stability and performance, featuring optimized memory efficiency, better read-only environment support, and standardized error messaging. We’ve expanded our test suite to include Python 3.14 and broadened type hint support for a better developer experience.
The minimum requirement is now Python 3.9.
- Tapsai et al. 2020 soundex (#1175)
- Thai profanity detection (#1183)
- Qwen3-0.6B language model (#1217)
- Thai-NNER integration with top-level entity filtering (#1221)
- `pythainlp.braille` module for Thai braille conversion (#1287)
- BLEU, ROUGE, WER, and CER metrics to `pythainlp.benchmarks` (#1295)
- Attaparse engine to dependency parser (`dependency_parsing`, `engine="attaparse"`) (#1303)
- `pythainlp.is_offline_mode()` helper function; use `PYTHAINLP_OFFLINE=1` to disable automatic corpus downloads (#1306)
- Thai consonant cluster detection (`check_khuap_klam`) (#1308)
- `pythainlp.is_read_only_mode()` helper function; use `PYTHAINLP_READ_ONLY=1` to prevent all write operations (#1317)
- Optimized for performance (#1182, #1237, #1320)
- Lazy load dictionaries to reduce memory usage (#1186)
- Migrate configurations to `pyproject.toml` (#1188, #1190, #1226, #1239)
- Update type hints; use Python 3.9 features (#1189, #1190, #1232, #1262, #1263, #1264, #1274, etc.)
- Make package zip-safe (#1212)
- Ensure thread-safety for tokenizers (#1213)
- Replace TNC word frequency dataset with Phupha filtered by ORST words (#1284)
- Reorganize "noauto" test suite by dependency groups (torch, tensorflow, onnx, cython, network) (#1290)
- `get_corpus_path()` now respects `PYTHAINLP_OFFLINE` env var (follows `HF_HUB_OFFLINE` convention from Hugging Face): raises `FileNotFoundError` if the corpus is not cached locally when the var is set; auto-downloads otherwise (#1306)
- Callers raise `FileNotFoundError` with download instructions when a corpus path cannot be resolved (#1306)
- Migrate build backend to `hatchling` (#1311)
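
The offline behaviour described for `get_corpus_path()` can be pictured with the sketch below. It is an illustration under stated assumptions, not the library's code: `offline_mode_enabled`, `resolve_corpus_path`, and the `download` callback are hypothetical names, and treating only the literal value `"1"` as "set" is an assumption.

```python
import os
from pathlib import Path
from typing import Callable


def offline_mode_enabled() -> bool:
    """Assumption: only the literal value "1" turns offline mode on."""
    return os.environ.get("PYTHAINLP_OFFLINE", "") == "1"


def resolve_corpus_path(
    name: str, cache_dir: Path, download: Callable[[str, Path], Path]
) -> Path:
    """Return the cached file if present; otherwise download unless offline."""
    cached = cache_dir / name
    if cached.exists():
        return cached
    if offline_mode_enabled():
        # Fail loudly with instructions instead of touching the network.
        raise FileNotFoundError(
            f"corpus {name!r} is not cached in {cache_dir} and "
            "PYTHAINLP_OFFLINE is set; download it first or unset the variable"
        )
    return download(name, cache_dir)
```

A cached corpus is returned in either mode, so offline mode only changes what happens on a cache miss.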
- `PYTHAINLP_DATA_DIR` env var; use `PYTHAINLP_DATA` instead (follows `NLTK_DATA` convention from NLTK). `PYTHAINLP_DATA_DIR` will be removed in a future version (#1306)
- `PYTHAINLP_READ_MODE` env var; use `PYTHAINLP_READ_ONLY` instead. `PYTHAINLP_READ_MODE` will be removed in a future version (#1317)
- Duplicated entries in Volubilis dictionary (#1200)
- Star imports (#1207)
- `requests` dependency (#1211)
- `pythainlp.util.is_native_thai` (deprecated since v5.0); use `pythainlp.morpheme.is_native_thai` instead (#1315)
- `royin` romanization: Consonant cluster boundary (#1172)
- `check_marttra()`: Final consonant classification (#1173)
- Base dependencies (#1185)
- `tltk` transliteration: Kho Khon alphabet issue (#1187)
- Fix `tone_detector` and `sound_syllable` bugs (#1197)
- `normalize()`: Remove spaces before tone marks and non-base characters (#1222)
- Suppress Gensim duplicate-word warnings when loading word2vec binary files (#1316)
- `db.json`: created lazily only when a corpus is first downloaded (#1317)
- `newmm` tokenization: Exponential-time explosion when text has many ambiguous breaking points (#1319)
- `Trie`: Reduce memory usage and faster TCC boundary lookups (#1323)
- Prevent path traversal and symlink attacks in archive extraction (#1225)
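
A rough sketch of the kind of check this entry refers to is given below, using the standard-library `tarfile` module. `safe_extract` is a hypothetical name, and rejecting every link entry outright is a simplifying assumption made for this example; the project's actual policy may differ.

```python
import tarfile
from pathlib import Path


def safe_extract(archive: tarfile.TarFile, dest_dir: str) -> None:
    """Extract archive into dest_dir only if every entry stays inside it."""
    dest = Path(dest_dir).resolve()
    for member in archive.getmembers():
        # Symlinks and hard links can redirect later writes outside dest.
        if member.issym() or member.islnk():
            raise ValueError(f"link entries are not allowed: {member.name}")
        target = (dest / member.name).resolve()
        try:
            target.relative_to(dest)
        except ValueError:
            raise ValueError(f"entry escapes destination: {member.name}") from None
    # All entries were validated above, so extraction happens only now.
    archive.extractall(dest)
```

Validating every member before extracting anything avoids leaving a partially extracted archive behind when a later entry is rejected.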
5.2.0 - 2025-12-20
- `pythainlp.translate.word_translate` (#1102)
- Word spelling correction using Char2Vec (#1075)
- Thailand ancient currency converter (#1113)
- B-K/umt5-thai-g2p-v2-0.5k model (#1140)
- budoux integration (#1161)
- Docker Compose file for convenience (#1132)
- Update Dockerfile (#1049)
- ConceptNet integration (#1103)
- Connectivity of CLI commands (#1154)
- Docker build failure (#1132)
5.1.2 - 2025-05-09
- Romanize docs; keep space (#1110)
5.1.1 - 2025-03-31
- Refactor `thai_consonants_all` to use set in `syllable.py` (#1087)
- `ThaiTransliterator`: select 1D CPU int64 tensor device (#1089)
5.1.0 - 2025-02-25
- Thai Discourse Treebank POS tag (#910)
- Thai Universal Dependency Treebank POS tag (#916)
- Thai G2P v2 grapheme-to-phoneme model (#923)
- Support for list of strings as input to `sent_tokenize()` (#927)
- `pythainlp.tools.safe_print` to handle `UnicodeEncodeError` on console (#969)
- Thai Solar Date to Thai Lunar Date conversion (#998)
- Thai pangram text (#1045)
- `clause_tokenize` (#1024)
- `collate()` to consider tone mark in ordering (#926)
- `nlpo3.load_dict()` not printing error when unsuccessful (#979)
5.0.5 - 2024-12-14
- Add `clause_tokenize` deprecation warnings (#1026)
- `maiyamok()` expanding the wrong word (#962)
5.0.4 - 2024-06-02
- `pythainlp.util.maiyamok` not duplicating words when more than one Maiyamok is used (#917)
5.0.3 - 2024-05-12
- Empty string added when using `word_tokenize` with `join_broken_num=True` (#912)
5.0.2 - 2024-04-03
- `crfcut`: ensure splitting of sentences using terminal punctuation (#905)
5.0.1 - 2024-02-10
- Delay calling `syllable_tokenize` to avoid `pycrfsuite` import error (#901)