Add Kannada (kn-IN) G2P support for TTS #15582

jasro23 wants to merge 2 commits into NVIDIA-NeMo:main from
Conversation
- Add KannadaG2p class with hybrid dictionary + rule-based IPA conversion
- Add Kannada grapheme and IPA character sets to ipa_lexicon.py
- Add kn-IN locale support with punctuation handling
- Include lexicon with 4264 Kannada words
- Add test script with assertions for validation

The G2P module handles:
- All Kannada vowels, consonants, matras (dependent vowels)
- Virama (halant), anusvara, visarga
- Anusvara place assimilation based on following consonant

Signed-off-by: Jason Roche <jas.tech23@gmail.com>
Signed-off-by: jasro23 <jasro23@users.noreply.github.com>
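The anusvara place-assimilation rule listed above can be sketched in a few lines. This is a simplified stand-in, not the PR's actual implementation; the consonant sets, fallback, and function name are all illustrative:

```python
# Simplified sketch of anusvara (ಂ) place assimilation: the nasal takes
# the place of articulation of the following consonant. The sets and the
# fallback here are illustrative, not the PR's actual rule tables.
VELARS = set('kgɡ')
LABIALS = set('pb')
DENTALS = set('td')

def assimilate_anusvara(next_ipa: str) -> str:
    if next_ipa in VELARS:
        return 'ŋ'   # e.g. ಅಂಕ -> aŋka
    if next_ipa in LABIALS:
        return 'm'
    if next_ipa in DENTALS:
        return 'n'
    return 'm'       # common fallback before other sounds

print(assimilate_anusvara('k'))  # → ŋ
```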
@jasro23 Can you also add support for the Telugu language?

@annagirimokshith I am not too familiar with Telugu.
```python
# limitations under the License.

import pathlib
import re

from nemo.collections.common.tokenizers.text_to_speech.ipa_lexicon import (
    GRAPHEME_CHARACTER_SETS,
    get_grapheme_character_set,
    get_ipa_punctuation_list,
)
```
Pull request overview
Adds Kannada (kn-IN) grapheme-to-phoneme (G2P) support for NeMo TTS, including a new Kannada IPA G2P implementation, locale character sets/punctuation, a pronunciation dictionary, and unit tests.
Changes:
- Introduce `KannadaG2p` with hybrid dictionary + rule-based IPA conversion.
- Add `kn-IN` grapheme and IPA character sets plus locale punctuation handling.
- Add a Kannada pronunciation lexicon and basic unit tests validating G2P outputs.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `nemo/collections/tts/g2p/models/kn_in_ipa.py` | New Kannada G2P implementation (dictionary + rule-based). |
| `nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py` | Adds kn-IN locale support, including grapheme/IPA sets and punctuation. |
| `scripts/tts_dataset_files/kn_IN/kn_IN_nv260318.dict` | New Kannada pronunciation dictionary (~4.3K entries). |
| `tests/collections/common/tokenizers/text_to_speech/test_tts_tokenizers.py` | Adds unit tests for Kannada G2P behavior. |
```python
# Handle digits (pass through or convert)
if char.isdigit():
    phonemes.append(char)
    i += 1
    continue

# Handle Kannada digits
kannada_digits = '೦೧೨೩೪೫೬೭೮೯'
if char in kannada_digits:
    # Convert to Arabic numeral
    phonemes.append(str(kannada_digits.index(char)))
    i += 1
    continue
```
Kannada digits (೦-೯) will never reach the Kannada-digit conversion block because str.isdigit() is true for Kannada digits, so the earlier if char.isdigit(): phonemes.append(char) branch consumes them. This causes Kannada digits to be returned unchanged instead of being mapped to Arabic numerals as intended. Consider checking for Kannada digits before isdigit(), or restricting the isdigit() branch to ASCII digits only (e.g., char.isascii() and char.isdigit()).
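The suggested ordering fix can be sketched as a small self-contained function (the function name is hypothetical, not the PR's code):

```python
# Check the Kannada digit set before the generic digit branch, because
# str.isdigit() is also True for '೦'-'೯'. Function name is illustrative,
# not the PR's actual code.
KANNADA_DIGITS = '೦೧೨೩೪೫೬೭೮೯'

def convert_digit(char: str):
    if char in KANNADA_DIGITS:
        return str(KANNADA_DIGITS.index(char))   # e.g. ೩ -> '3'
    if char.isascii() and char.isdigit():
        return char                              # ASCII digits pass through
    return None                                  # not a digit

print(convert_digit('೩'))  # → 3
```

Restricting the second branch with `char.isascii()` is what prevents the earlier branch from swallowing Kannada digits.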
```python
phoneme_dict = (
    self._parse_phoneme_dict(phoneme_dict, phoneme_prefix)
    if isinstance(phoneme_dict, (str, pathlib.Path))
    else phoneme_dict
)
```
When phoneme_dict is passed as a Python dict, entries like {word: ["namaskaːɾa"]} are kept as whole-string tokens, but the rule-based path emits per-character IPA tokens. This makes outputs inconsistent across dictionary vs OOV words and can break downstream tokenizers that expect single-symbol IPA tokens (e.g., 'ː' separate from 'a'). Consider normalizing dict-provided pronunciations by splitting each pronunciation string into a list of IPA symbols/characters (and applying phoneme_prefix) to match _parse_phoneme_dict behavior.
Suggested change:

```diff
-phoneme_dict = (
-    self._parse_phoneme_dict(phoneme_dict, phoneme_prefix)
-    if isinstance(phoneme_dict, (str, pathlib.Path))
-    else phoneme_dict
-)
+if isinstance(phoneme_dict, (str, pathlib.Path)):
+    phoneme_dict = self._parse_phoneme_dict(phoneme_dict, phoneme_prefix)
+else:
+    normalized_phoneme_dict = {}
+    for word, prons in phoneme_dict.items():
+        normalized_prons = []
+        for pron in prons:
+            if isinstance(pron, str):
+                normalized_prons.extend([phoneme_prefix + symbol for symbol in pron])
+            else:
+                normalized_prons.extend(pron)
+        normalized_phoneme_dict[word] = normalized_prons
+    phoneme_dict = normalized_phoneme_dict
```
```python
Args:
    phoneme_dict: Path to Kannada pronunciation dictionary file or a dict object.
        Format: word<whitespace>phonemes (space-separated IPA symbols)
    phoneme_prefix: Prefix to prepend to phoneme symbols to distinguish from graphemes.
        Default is "" (no prefix).
    ascii_letter_prefix: Prefix to prepend to ASCII letters for code-mixed text.
        Default is "" (no prefix).
    ascii_letter_case: Case for ASCII letters: "upper", "lower", or "mixed".
        Default is "lower".
    word_tokenize_func: Custom function for tokenizing text into words.
        Should return List[Tuple[Union[str, List[str]], bool]].
    apply_to_oov_word: Custom function to apply to out-of-vocabulary words.
        If None, rule-based G2P is used.
    mapping_file: Optional path to character mapping file.
```
The docstring states the dictionary format uses “space-separated IPA symbols”, but _parse_phoneme_dict() (and the provided .dict file) treat pronunciations as a single IPA string without spaces and then split into characters. Please align the documentation with the actual expected format to avoid confusing users providing custom dictionaries.
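The mismatch between the two formats can be shown with a self-contained sketch (the example entry is illustrative, mirroring the shipped `.dict` style as described in this comment):

```python
# The shipped .dict stores each pronunciation as a single IPA string that
# is then split per character; it is NOT a list of space-separated symbols.
# The example entry below is illustrative, not copied from the file.
line = "ಕನ್ನಡ kannaɖa"          # word<whitespace>pronunciation-string
word, pron = line.split(maxsplit=1)
symbols = list(pron)             # per-character split
print(symbols)                   # → ['k', 'a', 'n', 'n', 'a', 'ɖ', 'a']
```

Note this simple per-character split also separates length marks like `ː` into their own tokens, which is the behavior the earlier normalization comment relies on.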
```python
import re
import unicodedata
from collections import defaultdict
from typing import Dict, List, Optional, Union

from nemo.collections.common.tokenizers.text_to_speech.ipa_lexicon import (
    GRAPHEME_CHARACTER_SETS,
```
re and GRAPHEME_CHARACTER_SETS are imported but not used in this module. Removing unused imports will avoid lint failures and keep the file clean.
Suggested change:

```diff
-import re
 import unicodedata
 from collections import defaultdict
 from typing import Dict, List, Optional, Union

 from nemo.collections.common.tokenizers.text_to_speech.ipa_lexicon import (
-    GRAPHEME_CHARACTER_SETS,
```
```python
'್',  # Virama (Halant)
'ಽ',  # Avagraha
```
The kn-IN grapheme character set omits some characters that the new Kannada G2P code references/handles (e.g., nukta '಼') and common script punctuation (e.g., danda '।', double danda '॥'). If these are expected in input text, consider adding them to GRAPHEME_CHARACTER_SETS["kn-IN"] (similar to hi-IN) so downstream grapheme-based processing/tokenizers can recognize them.
Suggested change:

```diff
 '್',  # Virama (Halant)
+'಼',  # Nukta
 'ಽ',  # Avagraha
+# Script punctuation
+'।',  # Danda
+'॥',  # Double danda
```
```python
@pytest.mark.run_only_on('CPU')
@pytest.mark.unit
def test_kannada_g2p_rule_based(self):
    """Test Kannada G2P rule-based conversion."""
    g2p = KannadaG2p()

    # Test basic vowels
    assert "".join(g2p("ಅ")) == "a"
    assert "".join(g2p("ಆ")) == "aː"

    # Test consonant with inherent vowel
    assert "".join(g2p("ಕ")) == "ka"

    # Test consonant with virama (no vowel)
    assert "".join(g2p("ಕ್")) == "k"

    # Test consonant with matra
    assert "".join(g2p("ಕಾ")) == "kaː"
    assert "".join(g2p("ಕಿ")) == "ki"

    # Test word
    assert "".join(g2p("ಕನ್ನಡ")) == "kannaɖa"

@pytest.mark.run_only_on('CPU')
@pytest.mark.unit
def test_kannada_g2p_dictionary_based(self):
    """Test Kannada G2P with dictionary lookup."""
    g2p = KannadaG2p(phoneme_dict=self.PHONEME_DICT_KN)

    # Test dictionary lookup
    assert "".join(g2p("ನಮಸ್ಕಾರ")) == "namaskaːɾa"
    assert "".join(g2p("ಕನ್ನಡ")) == "kannaɖa"
    assert "".join(g2p("ಬೆಂಗಳೂರು")) == "beŋgaɭuːɾu"

@pytest.mark.run_only_on('CPU')
@pytest.mark.unit
def test_kannada_g2p_special_characters(self):
    """Test Kannada G2P with special characters."""
    g2p = KannadaG2p()

    # Test anusvara (nasal)
    result = "".join(g2p("ಅಂಕ"))
    assert "ŋ" in result  # Anusvara before velar should be ŋ

    # Test retroflex consonants
    assert "ɖ" in "".join(g2p("ಡ"))
    assert "ɳ" in "".join(g2p("ಣ"))

    # Test vocalic R
    assert "ɾɯ" in "".join(g2p("ಕೃ"))

@pytest.mark.run_only_on('CPU')
@pytest.mark.unit
def test_kannada_g2p_affricates(self):
    """Test Kannada G2P affricate handling."""
    g2p = KannadaG2p()

    # Test palatal affricates
    result = "".join(g2p("ಚ"))
    assert "tʃ" in result or ("t" in result and "ʃ" in result)

    result = "".join(g2p("ಜ"))
    assert "dʒ" in result or ("d" in result and "ʒ" in result)
```
The new Kannada G2P tests don’t cover Kannada digit handling (೦-೯) or punctuation passthrough, even though the implementation includes explicit logic for both. Adding assertions for these cases would prevent regressions (and would have caught the current Kannada-digit bug).
Important
The "Update branch" button must only be pressed on very rare occasions. An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do ?
Add Kannada (kn-IN) G2P support for TTS
Collection: [TTS]
Changelog
Usage
# Add a code snippet demonstrating how to use this

GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information