fix: guard nighit, check_sara, check_marttra against empty input#1377
fix: guard nighit, check_sara, check_marttra against empty input#1377phoneee wants to merge 2 commits intoPyThaiNLP:devfrom
Conversation
nighit() crashed with IndexError when w2 had no consonants.
check_sara("") and check_marttra("") crashed accessing word[-1].
Add input validation with clear error messages.
There was a problem hiding this comment.
Pull request overview
Guards three Thai-language public helpers (KhaveeVerifier.check_sara, KhaveeVerifier.check_marttra, and morpheme.nighit) against previously crashing edge-cases (empty input / no-consonant input), addressing issue #1376.
Changes:
- Add early empty-input returns to
check_saraandcheck_marttra. - Make
nighitraise aValueErrorwhenw2contains no Thai consonants (instead of crashing).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
pythainlp/morpheme/word_formation.py |
Adds a safe guard for nighit() when w2 has no Thai consonants. |
pythainlp/khavee/core.py |
Adds empty-input guards to avoid IndexError in check_sara() and check_marttra(). |
| if not word: | ||
| return "" |
There was a problem hiding this comment.
These new empty-input behaviors (returning "" for check_sara/check_marttra) are not covered by existing khavee tests. Add unit tests that exercise empty-string input to prevent regressions back to IndexError.
| consonants_in_w2 = [i for i in list_w2 if i in set(thai_consonants)] | ||
| if not consonants_in_w2: | ||
| raise ValueError( | ||
| f"w2 '{w2}' contains no Thai consonants." | ||
| ) | ||
| consonant_start = consonants_in_w2[0] |
There was a problem hiding this comment.
This new branch (raising ValueError when w2 contains no Thai consonants) isn’t covered by the existing nighit tests. Add a unit test for a vowel-only (or empty) w2 to ensure it raises ValueError instead of regressing to IndexError.
|
|
||
| # In case of การันย์ | ||
| if "์" in word[-1]: | ||
| word = word[:-2] |
There was a problem hiding this comment.
After stripping a trailing karun (word = word[:-2]), word can become empty (e.g. input like "ก์"). The function later indexes word[-1] (e.g. in the ออ handling), which will still raise IndexError. Consider re-checking if not word immediately after the karun-strip (and returning an appropriate value) to fully guard against empty-after-normalization cases.
| word = word[:-2] | |
| word = word[:-2] | |
| # After removing the karun, the word may become empty (e.g. "ก์") | |
| if not word: | |
| return "" |
| countoa = 0 | ||
|
|
||
| if not word: | ||
| return "" |
There was a problem hiding this comment.
Returning an empty string for empty input is inconsistent with the existing error signaling in this API (e.g. returning "Can't find Sara in this word" when no vowel is detected). Consider either returning the same "Can't find…" string for empty input as well, or raising a ValueError, so callers don't need to special-case "".
| return "" | |
| return "Can't find Sara in this word" |
There was a problem hiding this comment.
Error/warning message should go to standard error/warning raises, not the return value.
| word = remove_tonemark(word) | ||
|
|
||
| if not word: | ||
| return "" |
There was a problem hiding this comment.
Returning an empty string for empty input is inconsistent with the existing error signaling in this API (e.g. returning "Can't find Marttra in this word" for unclassified words). Consider returning the same "Can't find…" string or raising a ValueError to avoid an ambiguous "" result.
| return "" | |
| return "Can't find Marttra in this word" |
There was a problem hiding this comment.
Error/warning message should go to standard error/warning raises, not the return value.
| newword.append(list_w1[0]) | ||
| newword.append("ั") | ||
| consonant_start = [i for i in list_w2 if i in set(thai_consonants)][0] | ||
| consonants_in_w2 = [i for i in list_w2 if i in set(thai_consonants)] |
There was a problem hiding this comment.
set(thai_consonants) is being constructed inside the list comprehension, which recreates the set for each element of list_w2. Build the set once (e.g. at module scope or once per function call) and reuse it for membership checks.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|



What do these changes do
Fix
check_sara,check_marttra, andnighitcrashing on empty string or vowel-only inputFixes #1376