Description
Two algorithmic bugs in metasound() verified against the original paper:
Snae & Brückner. (2009). Issues in Informing Science and Information Technology, Vol. 6, pp. 497–515.
http://iisit.org/Vol6/IISITv6p497-515Snae620.pdf
Bug 1: Truncation before space removal (line 82)
After removing karan (์) by replacing the affected characters with spaces, the algorithm truncates to the requested length before filtering out those spaces. The placeholder spaces therefore count toward the length limit and push real consonants out of the result.
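The ordering problem can be illustrated with a minimal sketch. This is not the PyThaiNLP source: the helper names (`strip_karan`, `truncate_then_filter`, `filter_then_truncate`) are hypothetical, and the input is assumed to be already reduced to consonants plus the karan mark. It only models how the order of the two steps changes which consonants survive.

```python
# Hypothetical, simplified model of the karan-removal and truncation steps.
# It is NOT the library code; it only demonstrates the order of operations.
THANTHAKHAT = "\u0e4c"  # the karan mark (์)


def strip_karan(chars):
    """Replace each karan mark and the consonant before it with a space."""
    out = list(chars)
    for i, ch in enumerate(out):
        if ch == THANTHAKHAT:
            if i > 0:
                out[i - 1] = " "
            out[i] = " "
    return out


def truncate_then_filter(chars, length):
    """Order reported as buggy: placeholder spaces use up length slots."""
    kept = strip_karan(chars)[:length]
    return [ch for ch in kept if ch != " "]


def filter_then_truncate(chars, length):
    """Proposed order: drop placeholder spaces, then apply the limit."""
    kept = [ch for ch in strip_karan(chars) if ch != " "]
    return kept[:length]


# Assumed input: the consonants and karan of the first example in this report.
consonants = list("สรรค์พล")
buggy = truncate_then_filter(consonants, 4)
fixed = filter_then_truncate(consonants, 4)
```

With a limit of 4, the first ordering keeps only ส ร ร because the fourth slot is taken by a placeholder, while the second keeps ส ร ร พ, which is consistent with the expected code 'ส553' in the reproduction steps.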
Bug 2: 4 consonants missing from _C2 (line 19)
The paper's table on page 507 defines group 2 (D sound) with 18 consonants: จ ฉ ช ซ ฌ ฎ ฏ ฐ ฑ ฒ ด ต ถ ท ธ ศ ษ ส. The implementation has only 14; it is missing ฏ, ฑ, ถ, and ธ. The paper's PHP source code in the Appendix (pp. 513–514) confirms all 18.
Also: duplicate ข in prayut_and_somchaip._C2 (line 21). No functional impact.
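A small check along these lines could guard both problems (missing members and duplicates). It is a sketch under stated assumptions: `PAPER_C2` is copied from the list quoted above, `current_c2` is a reconstruction (the paper's set minus the four reported consonants) and not the actual constant from the library, and the helper names are hypothetical.

```python
from collections import Counter

# Group 2 (D sound) as quoted in this report from the paper's table.
PAPER_C2 = "จฉชซฌฎฏฐฑฒดตถทธศษส"
# The four consonants this report says are absent from the implementation.
REPORTED_MISSING = "ฏฑถธ"

# ASSUMPTION: stand-in for the current constant, rebuilt from the report,
# not read from the library source.
current_c2 = "".join(ch for ch in PAPER_C2 if ch not in REPORTED_MISSING)


def missing_from(reference, candidate):
    """Characters of `reference` that do not appear in `candidate`."""
    return [ch for ch in reference if ch not in candidate]


def duplicates_in(group):
    """Characters that occur more than once in `group`."""
    return [ch for ch, count in Counter(group).items() if count > 1]


print(len(PAPER_C2), len(current_c2))
print(missing_from(PAPER_C2, current_c2))
```

In a real test, `current_c2` would be replaced by the module's own group constant, and `duplicates_in` could be run over every group to catch cases like the repeated ข.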
Note on ห, อ, ฮ: Mapping these three consonants to code 0 is correct per the paper: the 8-group scheme has no /h/ or glottal class, so only 41 of the 44 consonants are classified. However, Complete Soundex (Tapsai et al., 2020) addresses this limitation with 27 initial consonant groups. For users needing more precise matching, complete_soundex is the better choice.
Steps to reproduce
from pythainlp.soundex import metasound
# Bug 1: truncation order
metasound("สรรค์พล", 4)  # returns 'ส550'; expected 'ส553'
metasound("รักษ์นา", 4)  # returns 'ร100'; expected 'ร150'
# Bug 2: missing consonants
metasound("กถน", 4)  # returns 'ก050': ถ maps to '0' but should map to '2'
metasound("กธน", 4)  # returns 'ก050': ธ maps to '0' but should map to '2'
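The cases above could be collected into a small regression helper. This is only a sketch: `EXPECTED` and `failing_cases` are hypothetical names, and the two 'ก250' values are derived from the statement that ถ and ธ should map to '2', not quoted from the paper.

```python
# Expected codes for length 4, taken from (or derived from) this report.
EXPECTED = {
    "สรรค์พล": "ส553",
    "รักษ์นา": "ร150",
    "กถน": "ก250",  # derived: ถ should encode as '2'
    "กธน": "ก250",  # derived: ธ should encode as '2'
}


def failing_cases(encode, length=4):
    """Return {word: (actual, expected)} for every case that does not match.

    `encode` is any callable with the signature encode(text, length).
    """
    failures = {}
    for word, want in EXPECTED.items():
        got = encode(word, length)
        if got != want:
            failures[word] = (got, want)
    return failures


# Usage against the library (requires PyThaiNLP to be installed):
#   from pythainlp.soundex import metasound
#   print(failing_cases(metasound))
```

An empty result would indicate that both bugs are fixed for these inputs.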
PyThaiNLP version
5.3.3
References