Add Arabic char tokenizer and Japanese-English katakana support #15614
Merged: XuesongYang merged 1 commit into NVIDIA-NeMo:main from quapham:arabic_chartokenizer_japanese_english on Apr 25, 2026.
Instead of persisting `charset_version`, just persist the `chars` argument.
Use `tokenizer_config._target_` to get the class.
^ Something like that, as a high-level pseudo-code guide.
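A rough sketch of what that suggestion could look like; the helper shape and the `DEFAULT_CHARS` attribute are assumptions for illustration, not actual NeMo code:

```python
# Illustrative sketch only; DEFAULT_CHARS is an assumed class attribute.
import importlib

def resolve_tokenizer(tokenizer_config):
    # _target_ holds the dotted import path of the tokenizer class.
    module_path, class_name = tokenizer_config._target_.rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), class_name)

    # Persist the chars argument itself; fall back to the class default
    # when an older checkpoint did not store it.
    chars = tokenizer_config.get("chars") or cls.DEFAULT_CHARS
    return cls(chars=chars)
```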
Thanks for bringing it up. I was actually weighing whether to persist `chars` or `charset_version`, that is, a long string versus a single integer, when implementing backward-compatibility support, and my final decision was `charset_version`. Here is why: persisting `chars` is certainly human-readable, but it would add long Unicode strings that make configs hard to read and diff. For example, the Hindi charset string is hundreds of Unicode codepoints, which would bloat every config and log dump, compared with a single integer `charset_version`. We should be fine with `charset_version` added to the .nemo file, plus a clear docstring. That said, we need to avoid scenarios like this that complicate the design in the long run.
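As a minimal sketch of the versioning scheme being described here (the table name and codepoints are placeholders, not the real charset from the PR), the tokenizer module freezes each released charset under an integer key, and configs only ever reference that integer:

```python
# Hypothetical sketch; the codepoints below are placeholders, not the
# real Hindi charset from the PR.
HINDI_CHARSETS = {
    1: "\u0905\u0906\u0907\u0908",  # v1: charset frozen when the tokenizer shipped
    # future revisions get new integer keys; old checkpoints keep working
}

def charset_for(version: int) -> str:
    """Look up the frozen charset for a persisted charset_version."""
    return HINDI_CHARSETS[version]
```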
On second thought, `charset_version` is still much better than `chars`, so we'd better keep this implementation. The change to the CI script (`L2_TTS_Fast_dev_runs_Magpietts_OnlineCFGDistillation.sh`) was just adding one line: `+model.text_tokenizers.hindi_chartokenizer.charset_version=1`. With `chars`, that line would be a 172-char Unicode blob. `chars` still works as an escape hatch: the `if chars is None:` guard means anyone who needs a truly custom charset can still pass `chars` directly, and `charset_version` is ignored.
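A minimal sketch of that escape hatch, assuming a constructor shaped roughly as described (placeholder names, reusing the `charset_for` helper sketched above):

```python
# Sketch only, not the actual PR code.
DEFAULT_CHARSET_VERSION = 1  # the PR keeps this at module level

class HindiCharTokenizer:
    def __init__(self, chars=None, charset_version=DEFAULT_CHARSET_VERSION):
        # An explicit chars argument bypasses the version table entirely,
        # and charset_version is ignored.
        if chars is None:
            chars = charset_for(charset_version)
        self.chars = chars
        self.charset_version = charset_version
```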
You're right, keeping the config cleaner is better if we can. If the change already covers this safely, I'm happy to rely on that.
I'm OK to merge this as is. But in the future, we need to refactor this default version to be part of the class, not a top-level global in the file.
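A sketch of that future refactor, using the same placeholder names as above: the default version moves from a module-level constant onto the class itself, so subclasses and variants can override it cleanly.

```python
# Sketch of the suggested refactor; placeholder names as before.
class HindiCharTokenizer:
    DEFAULT_CHARSET_VERSION = 1  # now a class attribute, not a module global

    def __init__(self, chars=None, charset_version=None):
        if charset_version is None:
            charset_version = self.DEFAULT_CHARSET_VERSION
        if chars is None:
            chars = charset_for(charset_version)
        self.chars = chars
        self.charset_version = charset_version
```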