bug: metasound missing 4 consonants from _C2 per paper; truncation before space removal #1383

@phoneee

Description

Two algorithmic bugs in metasound() verified against the original paper:

Snae & Brückner. (2009). Issues in Informing Science and Information Technology, Vol. 6, pp. 497–515.
http://iisit.org/Vol6/IISITv6p497-515Snae620.pdf

Bug 1: Truncation before space removal (line 82)

Karan removal replaces the thanthakhat mark (์) and the consonant it silences with spaces. The algorithm then truncates to the requested length BEFORE filtering those spaces, so the spaces occupy slots in the length limit and push real consonants out.
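The ordering problem can be reproduced with a small, self-contained sketch. This is not the library's code: `strip_karan`, `truncate_then_filter`, and `filter_then_truncate` are hypothetical names, and the input is assumed to contain only consonants and the thanthakhat mark (no vowels or tone marks).

```python
THANTHAKHAT = "\u0e4c"  # the karan mark (์)


def strip_karan(chars):
    """Blank out each thanthakhat and the consonant before it with spaces."""
    out = list(chars)
    for i, ch in enumerate(out):
        if ch == THANTHAKHAT:
            if i > 0:
                out[i - 1] = " "
            out[i] = " "
    return out


def truncate_then_filter(chars, length):
    """Order described in this bug: spaces still count toward the limit."""
    return [c for c in strip_karan(chars)[:length] if c != " "]


def filter_then_truncate(chars, length):
    """Proposed order: drop the spaces first, then apply the limit."""
    return [c for c in strip_karan(chars) if c != " "][:length]


# "สรรค์พล" consists only of consonants plus one thanthakhat.
chars = list("สรรค์พล")
print(truncate_then_filter(chars, 4))  # only 3 consonants survive
print(filter_then_truncate(chars, 4))  # 4 consonants survive, including พ
```

With truncation first, the two spaces left by ค์ use up the fourth slot, so พ never reaches the encoding step. That matches the 'ส550' versus 'ส553' difference in the reproduction steps.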

Bug 2: 4 consonants missing from _C2 (line 19)

The paper's table on page 507 defines group 2 (D sound) with 18 consonants: จ ฉ ช ซ ฌ ฎ ฏ ฐ ฑ ฒ ด ต ถ ท ธ ศ ษ ส. The implementation has only 14; ฏ, ฑ, ถ, and ธ are missing. The paper's PHP source code in the Appendix (pp. 513–514) confirms all 18.
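A possible shape for the fix is sketched below. The constant names and the `group2_code` helper are hypothetical, and the 14-character string is an assumption reconstructed from the counts in this report (14 present plus the 4 named as missing), not copied from the library source.

```python
# Assumed current contents of group 2 (14 consonants); reconstructed, not
# copied from the library.
IMPLEMENTED_C2 = "จฉชซฌฎฐฒดตทศษส"
# The 4 consonants this report identifies as missing.
MISSING_C2 = "ฏฑถธ"

# Proposed group 2: all 18 consonants.
PROPOSED_C2 = IMPLEMENTED_C2 + MISSING_C2


def group2_code(ch, group=PROPOSED_C2):
    """Return '2' if ch belongs to the given D-sound group, else '0'."""
    return "2" if ch in group else "0"


# Sanity checks: 18 distinct consonants, and ถ / ธ now map to '2'.
assert len(PROPOSED_C2) == 18
assert len(set(PROPOSED_C2)) == len(PROPOSED_C2)
print(group2_code("ถ"), group2_code("ธ"))
print(group2_code("ถ", IMPLEMENTED_C2))  # '0' under the assumed current group
```

The uniqueness assertion also guards against the kind of duplicate entry noted for the other module.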

Also: duplicate ข in prayut_and_somchaip._C2 (line 21). No functional impact.

Note on ห, อ, ฮ: Mapping these to code 0 is correct per the paper, because the 8-group scheme has no /h/ or glottal class (only 41 of 44 consonants are classified). Complete Soundex (Tapsai et al., 2020) addresses this limitation with 27 initial consonant groups, so complete_soundex is the better choice for users who need more precise matching.

Steps to reproduce

from pythainlp.soundex import metasound

# Bug 1: truncation order
metasound("สรรค์พล", 4)   # 'ส550' — wrong, should be 'ส553'
metasound("รักษ์นา", 4)   # 'ร100' — wrong, should be 'ร150'

# Bug 2: missing consonants
metasound("กถน", 4)       # 'ก050' — ถ maps to '0', should be '2'
metasound("กธน", 4)       # 'ก050' — ธ maps to '0', should be '2'

PyThaiNLP version

5.3.3
