Skip to content

fix: guard nighit, check_sara, check_marttra against empty input#1377

Open
phoneee wants to merge 2 commits intoPyThaiNLP:devfrom
phoneee:fix/empty-string-guards
Open

fix: guard nighit, check_sara, check_marttra against empty input#1377
phoneee wants to merge 2 commits intoPyThaiNLP:devfrom
phoneee:fix/empty-string-guards

Conversation

@phoneee
Copy link
Copy Markdown
Contributor

@phoneee phoneee commented Mar 29, 2026

What do these changes do

Fix check_sara, check_marttra, and nighit crashing on empty string or vowel-only input

Fixes #1376

  • Passed code styles and structures
  • Passed code linting checks and unit test

nighit() crashed with IndexError when w2 had no consonants.
check_sara("") and check_marttra("") crashed accessing word[-1].
Add input validation with clear error messages.
@bact bact added the bug bugs in the library label Mar 29, 2026
@bact bact added this to PyThaiNLP Mar 29, 2026
@bact bact moved this to In progress in PyThaiNLP Mar 29, 2026
@bact bact requested a review from Copilot March 29, 2026 20:41
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Guards three Thai-language public helpers (KhaveeVerifier.check_sara, KhaveeVerifier.check_marttra, and morpheme.nighit) against previously crashing edge-cases (empty input / no-consonant input), addressing issue #1376.

Changes:

  • Add early empty-input returns to check_sara and check_marttra.
  • Make nighit raise a ValueError when w2 contains no Thai consonants (instead of crashing).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File Description
pythainlp/morpheme/word_formation.py Adds a safe guard for nighit() when w2 has no Thai consonants.
pythainlp/khavee/core.py Adds empty-input guards to avoid IndexError in check_sara() and check_marttra().

Comment on lines +57 to +58
if not word:
return ""
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new empty-input behaviors (returning "" for check_sara/check_marttra) are not covered by existing khavee tests. Add unit tests that exercise empty-string input to prevent regressions back to IndexError.

Copilot uses AI. Check for mistakes.
Comment on lines +41 to +46
consonants_in_w2 = [i for i in list_w2 if i in set(thai_consonants)]
if not consonants_in_w2:
raise ValueError(
f"w2 '{w2}' contains no Thai consonants."
)
consonant_start = consonants_in_w2[0]
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new branch (raising ValueError when w2 contains no Thai consonants) isn’t covered by the existing nighit tests. Add a unit test for a vowel-only (or empty) w2 to ensure it raises ValueError instead of regressing to IndexError.

Copilot uses AI. Check for mistakes.

# In case of การันย์
if "์" in word[-1]:
word = word[:-2]
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After stripping a trailing karun (word = word[:-2]), word can become empty (e.g. input like "ก์"). The function later indexes word[-1] (e.g. in the ออ handling), which will still raise IndexError. Consider re-checking if not word immediately after the karun-strip (and returning an appropriate value) to fully guard against empty-after-normalization cases.

Suggested change
word = word[:-2]
word = word[:-2]
# After removing the karun, the word may become empty (e.g. "ก์")
if not word:
return ""

Copilot uses AI. Check for mistakes.
countoa = 0

if not word:
return ""
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning an empty string for empty input is inconsistent with the existing error signaling in this API (e.g. returning "Can't find Sara in this word" when no vowel is detected). Consider either returning the same "Can't find…" string for empty input as well, or raising a ValueError, so callers don't need to special-case "".

Suggested change
return ""
return "Can't find Sara in this word"

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error/warning message should go to standard error/warning raises, not the return value.

word = remove_tonemark(word)

if not word:
return ""
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning an empty string for empty input is inconsistent with the existing error signaling in this API (e.g. returning "Can't find Marttra in this word" for unclassified words). Consider returning the same "Can't find…" string or raising a ValueError to avoid an ambiguous "" result.

Suggested change
return ""
return "Can't find Marttra in this word"

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error/warning message should go to standard error/warning raises, not the return value.

newword.append(list_w1[0])
newword.append("ั")
consonant_start = [i for i in list_w2 if i in set(thai_consonants)][0]
consonants_in_w2 = [i for i in list_w2 if i in set(thai_consonants)]
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set(thai_consonants) is being constructed inside the list comprehension, which recreates the set for each element of list_w2. Build the set once (e.g. at module scope or once per function call) and reuse it for membership checks.

Copilot uses AI. Check for mistakes.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug bugs in the library

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

bug: check_sara, check_marttra, nighit crash on empty or vowel-only input

3 participants