
bug: word_detokenize and sent_tokenize crash on edge-case list inputs #1373

@phoneee

Description

Three edge-case crashes in pythainlp.tokenize public functions:

  1. word_detokenize(["สวัสดี", "", "ครับ"]) → IndexError — accesses w[0] on an empty-string token (line 71)
  2. word_detokenize([]) → IndexError — accesses segments[0] on an empty list (line 55)
  3. sent_tokenize(["สวัสดี", 123]) → uncaught TypeError — the code catches ValueError, but str.join() raises TypeError (line 491)

Expected results

word_detokenize(["สวัสดี", "", "ครับ"])  # → "สวัสดีครับ"
word_detokenize([])                       # → ""
sent_tokenize(["สวัสดี", 123], engine="whitespace+newline")  # → []

Current results

word_detokenize(["สวัสดี", "", "ครับ"])  # IndexError: string index out of range
word_detokenize([])                       # IndexError: list index out of range
sent_tokenize(["สวัสดี", 123], engine="whitespace+newline")  # TypeError (uncaught)

Steps to reproduce

from pythainlp.tokenize import word_detokenize, sent_tokenize

word_detokenize(["สวัสดี", "", "ครับ"])     # crash
word_detokenize([])                          # crash
sent_tokenize(["สวัสดี", 123], engine="whitespace+newline")  # crash

PyThaiNLP version

5.3.3

Python version

3.13

Operating system and version

macOS

Possible solution

  • Line 55: Add if not segments: return "" guard
  • Line 71: Add if not w: continue before w[0]
  • Line 491: Change except ValueError to except TypeError (restores original from commit 2a95070, broken by 5bbf410)
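
The three fixes above can be sketched as follows. These are simplified stand-ins, not the actual core.py implementations — `safe_sentence_split` is a hypothetical name, and the plain concatenation here omits the library's real detokenization logic; only the guard pattern is the point:

```python
def word_detokenize(segments):
    """Simplified sketch: concatenate tokens, with the two proposed guards."""
    if not segments:        # line-55 fix: empty list in, empty string out
        return ""
    out = ""
    for w in segments:
        if not w:           # line-71 fix: skip empty tokens before touching w[0]
            continue
        out += w
    return out


def safe_sentence_split(tokens):
    """Sketch of the line-491 fix: catch TypeError, not ValueError."""
    try:
        text = " ".join(tokens)
    except TypeError:       # non-string element such as 123
        return []
    return text.split()
```

With these guards, the expected results in this report hold: the empty-string token is skipped, the empty list yields `""`, and the mixed-type list yields `[]` instead of an uncaught TypeError.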

Files

  • pythainlp/tokenize/core.py (lines 55, 71, 491)
