SPDX-FileCopyrightText 2025-2026 PyThaiNLP Project
SPDX-FileType DOCUMENTATION
SPDX-License-Identifier CC0-1.0

Changelog

All notable changes to this project are documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

Changed

  • Improve guardrails in check_sara() and nighit()

5.3.4 - 2026-04-02

Fixed

  • Value range check fixes (#1374, #1379, #1382).
  • "1001" -> "หนึ่งพันเอ็ด" rule (#1386).
  • Build WSD Trie after populating dictionary (#1388).
  • Doctests across all modules (#1392).
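
Illustration of the "1001" rule above: in Thai number reading, a final digit 1 that follows higher-order digits is read "เอ็ด" rather than "หนึ่ง". The following is a minimal standalone sketch of that rule for 0 to 999,999. The function name thai_number_words is hypothetical, and this is not the library's implementation.

```python
# Stand-in tables for the sketch; not taken from the PyThaiNLP source.
DIGITS = ["ศูนย์", "หนึ่ง", "สอง", "สาม", "สี่", "ห้า", "หก", "เจ็ด", "แปด", "เก้า"]
PLACES = ["", "สิบ", "ร้อย", "พัน", "หมื่น", "แสน"]


def thai_number_words(n: int) -> str:
    """Spell out 0..999999 in Thai words (hypothetical helper)."""
    if not 0 <= n <= 999_999:
        raise ValueError("this sketch only supports 0..999999")
    if n == 0:
        return DIGITS[0]
    digits = str(n)
    words = []
    for i, ch in enumerate(digits):
        d = int(ch)
        place = len(digits) - 1 - i  # 0 = units, 1 = tens, ...
        if d == 0:
            continue  # zeros are silent
        if place == 0 and d == 1 and len(digits) > 1:
            words.append("เอ็ด")  # trailing 1 after higher digits
        elif place == 1 and d == 1:
            words.append("สิบ")  # 10, not "one ten"
        elif place == 1 and d == 2:
            words.append("ยี่สิบ")  # 20 uses a special form
        else:
            words.append(DIGITS[d] + PLACES[place])
    return "".join(words)


print(thai_number_words(1001))
```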

5.3.3 - 2026-03-26

Security fixes and thai2rom_onnx bug fixes.

Added

  • EntitySpan TypedDict to allow type checking of tagged entity members (#1363).

    Migration notes:

    # Before (plain dict)
    from pythainlp.tag.thai_nner import get_top_level_entities
    entities = [
        {"text": ["ห้า"], "span": [7, 9], "entity_type": "cardinal"},
        {"text": ["ห้า", "โมง"], "span": [7, 11], "entity_type": "time"},
        {"text": ["โมง"], "span": [9, 11], "entity_type": "unit"},
    ]
    top_entities = get_top_level_entities(entities)
    
    # After (TypedDict)
    from pythainlp.tag.named_entity import EntitySpan
    from pythainlp.tag.thai_nner import get_top_level_entities
    entities = [
        EntitySpan(text=["ห้า"], span=[7, 9], entity_type="cardinal"),
        EntitySpan(text=["ห้า", "โมง"], span=[7, 11], entity_type="time"),
        EntitySpan(text=["โมง"], span=[9, 11], entity_type="unit"),
    ]
    top_entities = get_top_level_entities(entities)

Fixed

  • thai2rom_onnx: fix ONNX encoder model and fix inference bugs (#1349)
  • wordnet: fix AttributeError (#1354)

Security

  • Replace os.path.join with safe_path_join throughout the codebase to prevent path manipulation vulnerabilities (CWE-22) (#1369)

5.3.2 - 2026-03-19

This release focuses on security improvements related to path traversal and on renaming functions to conform to PEP 8 and follow NLTK conventions. The old function names are still accessible, but migrating to the new names is recommended because the old names will be removed in a future version.

Added

  • pythainlp.chunk module: canonical home for chunking/phrase-structure parsing, following the NLTK nltk.chunk naming convention.

Deprecated

The following names are deprecated and will be removed in 6.0 (#1339):

  • pythainlp.util.isthaichar(): use pythainlp.util.is_thai_char().
  • pythainlp.util.isthai(): use pythainlp.util.is_thai().
  • pythainlp.util.countthai(): use pythainlp.util.count_thai().
  • pythainlp.tag.crfchunk.CRFchunk: use pythainlp.chunk.CRFChunkParser.
  • pythainlp.tag.chunk_parse(): use pythainlp.chunk.chunk_parse().
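
Migration notes: the sketch below shows how a deprecated alias of this kind typically behaves and how to surface its warning while migrating. Both functions are stand-ins defined here for illustration, and the warning text is an assumption. They are not imported from pythainlp.

```python
import warnings


def is_thai_char(ch: str) -> bool:
    """Stand-in for the new name: True if ch is in the Thai Unicode block."""
    return len(ch) == 1 and "\u0e00" <= ch <= "\u0e7f"


def isthaichar(ch: str) -> bool:
    """Stand-in for the old name: warns, then forwards to the new name."""
    warnings.warn(
        "isthaichar() is deprecated; use is_thai_char() instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return is_thai_char(ch)


# Record warnings so that remaining calls to old names become visible.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_result = isthaichar("ก")

new_result = is_thai_char("ก")
print(old_result, new_result, [w.category.__name__ for w in caught])
```

Running a test suite with `python -W error::DeprecationWarning` turns such warnings into errors, which helps locate call sites that still use an old name.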

Security

  • Prevent path traversal: validate that paths stay within their expected base directory (#1342)
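
A minimal sketch of this kind of containment check, using only the standard library. The helper name join_within_base and the example directory are hypothetical. The project's own helper may differ in name, signature, and behavior.

```python
import os


def join_within_base(base_dir: str, *parts: str) -> str:
    """Join parts onto base_dir and refuse results outside base_dir."""
    base = os.path.realpath(base_dir)
    candidate = os.path.realpath(os.path.join(base, *parts))
    # commonpath equals base only when candidate is base or lies below it.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"path escapes base directory: {candidate!r}")
    return candidate


base = "/tmp/pythainlp-data-example"  # hypothetical; need not exist
inside = join_within_base(base, "corpus", "words.txt")
print(inside)

try:
    join_within_base(base, "..", "secret.txt")
except ValueError as err:
    print("rejected:", err)
```

Resolving with os.path.realpath before comparing means that ".." segments and existing symlinks are taken into account, which a plain string-prefix check would miss.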

5.3.1 - 2026-03-14

This release focuses on security issues related to corpus file loading.

Security

  • thai2fit: Use JSON model instead of pickle (#1325)
  • Defensive corpus loading: validate fields before processing (#1327)
  • w2p: Use npz model instead of pickle (#1328)
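
Background for the items above: unpickling data from an untrusted source can execute arbitrary code, whereas JSON carries plain data only. The sketch below loads a model-like JSON document and validates its fields before use. The layout (a mapping from names to lists of numbers) and the function name load_model_json are assumptions for illustration, not the actual thai2fit or w2p file format.

```python
import json


def load_model_json(text: str) -> dict[str, list[float]]:
    """Parse a JSON object of name -> list of numbers, validating each field."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("model file must contain a JSON object")
    model: dict[str, list[float]] = {}
    for key, values in data.items():
        if not isinstance(values, list):
            raise ValueError(f"field {key!r} must be a list")
        for v in values:
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(v, bool) or not isinstance(v, (int, float)):
                raise ValueError(f"field {key!r} contains a non-number")
        model[key] = [float(v) for v in values]
    return model


model = load_model_json('{"bias": [0.5], "weights": [1, 2.5, -3]}')
print(model)
```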

5.3.0 - 2026-03-10

This release focuses on stability and performance, featuring optimized memory efficiency, better read-only environment support, and standardized error messaging. We’ve expanded our test suite to include Python 3.14 and broadened type hint support for a better developer experience.

The minimum requirement is now Python 3.9.

Added

  • Tapsai et al. 2020 soundex (#1175)
  • Thai profanity detection (#1183)
  • Qwen3-0.6B language model (#1217)
  • Thai-NNER integration with top-level entity filtering (#1221)
  • pythainlp.braille module for Thai braille conversion (#1287)
  • BLEU, ROUGE, WER, and CER metrics to pythainlp.benchmarks (#1295)
  • Attaparse engine to dependency parser (dependency_parsing, engine="attaparse") (#1303)
  • pythainlp.is_offline_mode() helper function; use PYTHAINLP_OFFLINE=1 to disable automatic corpus downloads (#1306)
  • Thai consonant cluster detection (check_khuap_klam) (#1308)
  • pythainlp.is_read_only_mode() helper function; use PYTHAINLP_READ_ONLY=1 to prevent all write operations (#1317)

Changed

  • Optimized for performance (#1182, #1237, #1320)
  • Lazy load dictionaries to reduce memory usage (#1186)
  • Migrate configurations to pyproject.toml (#1188, #1190, #1226, #1239)
  • Update type hints; use Python 3.9 features (#1189, #1190, #1232, #1262, #1263, #1264, #1274, etc.)
  • Make package zip-safe (#1212)
  • Ensure thread-safety for tokenizers (#1213)
  • Replace TNC word frequency dataset with Phupha filtered by ORST words (#1284)
  • Reorganize "noauto" test suite by dependency groups (torch, tensorflow, onnx, cython, network) (#1290)
  • get_corpus_path() now respects PYTHAINLP_OFFLINE env var (follows HF_HUB_OFFLINE convention from Hugging Face): raises FileNotFoundError if the corpus is not cached locally when the var is set; auto-downloads otherwise (#1306)
  • Callers raise FileNotFoundError with download instructions when a corpus path cannot be resolved (#1306)
  • Migrate build backend to hatchling (#1311)
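
The offline behavior described above for get_corpus_path() can be pictured with the following standalone sketch. The names is_offline and resolve_corpus, the cached mapping, and the download callback are all hypothetical stand-ins. The sketch also assumes that only the value "1" enables offline mode, which the library may handle differently.

```python
import os
from typing import Callable, Mapping, Optional


def is_offline(env: Optional[Mapping[str, str]] = None) -> bool:
    """Assumption: offline mode is on only when PYTHAINLP_OFFLINE is '1'."""
    env = os.environ if env is None else env
    return env.get("PYTHAINLP_OFFLINE", "") == "1"


def resolve_corpus(
    name: str,
    cached: Mapping[str, str],
    download: Callable[[str], str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return a local path, downloading only when offline mode is off."""
    if name in cached:
        return cached[name]
    if is_offline(env):
        raise FileNotFoundError(
            f"Corpus {name!r} is not cached locally and PYTHAINLP_OFFLINE "
            "is set; download it first or unset the variable."
        )
    return download(name)


cached = {"words_th": "/data/words_th.txt"}  # stand-in for the local cache


def fake_download(name: str) -> str:
    return f"/data/{name}.bin"  # pretend the download succeeded


print(resolve_corpus("words_th", cached, fake_download, env={"PYTHAINLP_OFFLINE": "1"}))
print(resolve_corpus("thai2fit", cached, fake_download, env={}))
```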

Deprecated

  • PYTHAINLP_DATA_DIR env var; use PYTHAINLP_DATA instead (follows the NLTK_DATA convention from NLTK). PYTHAINLP_DATA_DIR will be removed in a future version (#1306)
  • PYTHAINLP_READ_MODE env var; use PYTHAINLP_READ_ONLY instead. PYTHAINLP_READ_MODE will be removed in a future version (#1317)

Removed

  • Duplicated entries in Volubilis dictionary (#1200)
  • Star imports (#1207)
  • requests dependency (#1211)
  • pythainlp.util.is_native_thai (deprecated since v5.0); use pythainlp.morpheme.is_native_thai instead (#1315)

Fixed

  • royin romanization: Consonant cluster boundary (#1172)
  • check_marttra(): Final consonant classification (#1173)
  • Base dependencies (#1185)
  • tltk transliteration: Kho Khon alphabet issue (#1187)
  • Fix tone_detector and sound_syllable bugs (#1197)
  • normalize(): Remove spaces before tone marks and non-base characters (#1222)
  • Suppress Gensim duplicate-word warnings when loading word2vec binary files (#1316)
  • db.json: created lazily only when a corpus is first downloaded (#1317)
  • newmm tokenization: Exponential-time explosion when text has many ambiguous breaking points (#1319)
  • Trie: Reduce memory usage and speed up TCC boundary lookups (#1323)

Security

  • Prevent path traversal and symlink attacks in archive extraction (#1225)
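
A minimal sketch of the kind of pre-extraction check this entry refers to, using the standard tarfile module. The helper names and the destination directory are hypothetical, and rejecting every link member is an assumption made for simplicity rather than a description of the project's actual policy.

```python
import io
import os
import tarfile


def unsafe_tar_members(tar: tarfile.TarFile, dest_dir: str) -> list[str]:
    """List members that would escape dest_dir or that are links."""
    dest = os.path.realpath(dest_dir)
    unsafe = []
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(dest, member.name))
        escapes = os.path.commonpath([dest, target]) != dest
        if escapes or member.issym() or member.islnk():
            unsafe.append(member.name)
    return unsafe


def build_demo_archive() -> bytes:
    """Build a small in-memory archive with one safe and two unsafe members."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        tar.addfile(tarfile.TarInfo("corpus/words.txt"))  # empty regular file
        tar.addfile(tarfile.TarInfo("../outside.txt"))  # traversal attempt
        link = tarfile.TarInfo("corpus/link")
        link.type = tarfile.SYMTYPE
        link.linkname = "/etc/passwd"  # symlink pointing outside
        tar.addfile(link)
    return buf.getvalue()


with tarfile.open(fileobj=io.BytesIO(build_demo_archive()), mode="r") as tar:
    flagged = unsafe_tar_members(tar, "/tmp/pythainlp-extract-example")

print(flagged)
```

Python 3.12 and later also provide built-in tarfile extraction filters (PEP 706) that cover similar cases.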

5.2.0 - 2025-12-20

Added

  • pythainlp.translate.word_translate (#1102)
  • Word spelling correction using Char2Vec (#1075)
  • Thailand ancient currency converter (#1113)
  • B-K/umt5-thai-g2p-v2-0.5k model (#1140)
  • budoux integration (#1161)
  • Docker Compose file for convenience (#1132)

Changed

  • Update Dockerfile (#1049)

Removed

  • ConceptNet integration (#1103)

Fixed

  • Connectivity of CLI commands (#1154)
  • Docker build failure (#1132)

5.1.2 - 2025-05-09

Changed

  • Romanize docs; keep space (#1110)

5.1.1 - 2025-03-31

Changed

  • Refactor thai_consonants_all to use set in syllable.py (#1087)
  • ThaiTransliterator: select 1D CPU int64 tensor device (#1089)

5.1.0 - 2025-02-25

Added

  • Thai Discourse Treebank POS tag (#910)
  • Thai Universal Dependency Treebank POS tag (#916)
  • Thai G2P v2 grapheme-to-phoneme model (#923)
  • Support for list of strings as input to sent_tokenize() (#927)
  • pythainlp.tools.safe_print to handle UnicodeEncodeError on console (#969)
  • Thai Solar Date to Thai Lunar Date conversion (#998)
  • Thai pangram text (#1045)

Removed

  • clause_tokenize (#1024)

Fixed

  • collate() to consider tone mark in ordering (#926)
  • nlpo3.load_dict() not printing error when unsuccessful (#979)

5.0.5 - 2024-12-14

Changed

  • Add clause_tokenize deprecation warnings (#1026)

Fixed

  • maiyamok() expanding the wrong word (#962)

5.0.4 - 2024-06-02

Fixed

  • pythainlp.util.maiyamok not duplicating words when more than one Maiyamok is used (#917)

5.0.3 - 2024-05-12

Fixed

  • Empty string added when using word_tokenize with join_broken_num=True (#912)

5.0.2 - 2024-04-03

Fixed

  • crfcut: ensure splitting of sentences using terminal punctuation (#905)

5.0.1 - 2024-02-10

Fixed

  • Delay calling syllable_tokenize to avoid pycrfsuite import error (#901)

5.0.0 - 2024-02-10