fix: ROUGE-1 eval returns 0 for non-English languages (ASCII-only tokenizer) #6136
Open
tcconnally wants to merge 5 commits into
e275a87 to 6dff0a2
Collaborator
Hi @tcconnally, thank you for your contribution! We appreciate you taking the time to submit this pull request. Please fix the formatting errors.
The default RougeScorer tokenizer uses a regex that only matches ASCII letters and digits. For non-Latin scripts (Thai, Chinese, Japanese, etc.), this returns zero tokens, causing ROUGE scores of 0.0 even when the response matches the expected output exactly.

Added `_unicode_tokenize` function that uses the `re.UNICODE` flag and falls back to character-level tokenization for non-ASCII scripts.

Closes google#3111
- Replace function _unicode_tokenize with _UnicodeTokenizer class
implementing the tokenize() method expected by RougeScorer
- Move import re to module level
- Fix double-escaped regex patterns (`\\w` -> `\w`; remove unsupported `\p{P}`)
- Add return type annotation for tokenize() to satisfy mypy strict mode
- Fix RougeScorer constructor indentation
9beec74 to 98396a4
Author
Fixed the pre-commit formatting issue (pyink). Rebased on main.
The previous tokenizer had two defects:
- Its char-level fallback was unreachable: it split non-ASCII text on whitespace first, and scripts without spaces (Chinese, Japanese, Thai) yield a single token, so the `list(text)` fallback never ran. Two different CJK strings sharing characters scored 0.0 instead of getting partial credit.
- Passing a custom `tokenizer=` makes rouge-score ignore `use_stemmer`, so English stemming was silently dropped (e.g. "running" no longer matched "run").

Now ASCII-majority text is delegated to rouge-score's DefaultTokenizer (preserving Porter stemming and existing behavior exactly), and non-ASCII text keeps Latin/digit runs as words while splitting remaining word characters individually so partial overlap is scored.

Verified: Thai exact=1.0, CJK exact=1.0, CJK partial (你好世界 vs 你好朋友)=0.5, English stemming (running fast vs run fast)=1.0, Latin sanity matches default.
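The routing described in this commit message can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's `_UnicodeTokenizer`: the class name `HybridTokenizer`, the helper `_ascii_words` (a stand-in for rouge-score's DefaultTokenizer, without Porter stemming), the "more than half of non-space characters" reading of "ASCII-majority", and `unigram_f1` are all hypothetical.

```python
import re
from collections import Counter

# Latin/digit runs stay whole; any other single word character
# (excluding "_") becomes its own token.
_MIXED_TOKEN_RE = re.compile(r"[a-z0-9]+|[^\W_]")


def _ascii_words(text: str) -> list[str]:
    """Hypothetical stand-in for the library's default tokenizer (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class HybridTokenizer:
    """Illustrative tokenizer exposing the tokenize() method a scorer would call."""

    def tokenize(self, text: str) -> list[str]:
        chars = [c for c in text if not c.isspace()]
        if not chars:
            return []
        # Assumption: "ASCII-majority" means more than half of non-space chars.
        ascii_count = sum(c.isascii() for c in chars)
        if ascii_count * 2 > len(chars):
            return _ascii_words(text)
        return _MIXED_TOKEN_RE.findall(text.lower())


def unigram_f1(reference: list[str], candidate: list[str]) -> float:
    """ROUGE-1-style F-measure over token multisets."""
    overlap = sum((Counter(reference) & Counter(candidate)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


tok = HybridTokenizer()
print(tok.tokenize("你好世界 v2"))  # ['你', '好', '世', '界', 'v2']
print(unigram_f1(tok.tokenize("你好世界"), tok.tokenize("你好朋友")))  # 0.5
```

Because each CJK character is its own token, two strings that share half their characters get partial credit instead of 0.0, which is the behavior the commit message reports for 你好世界 vs 你好朋友.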
Author
Pushed a follow-up commit that hardens the tokenizer — I found two issues in the previous version while validating it against the
The updated Verified against the library:
Ready for another look, @wyf7107. Thanks for your patience.
Problem
When evaluating text in non-Latin scripts (Thai, Chinese, Japanese, Arabic, etc.), the v1 ROUGE-1 evaluator returns scores of 0.0 even when the response matches the expected output exactly.
Root cause: The `rouge_score` library's default tokenizer only keeps runs of ASCII letters and digits. Non-Latin characters produce zero tokens → ROUGE-1 score of 0.0 regardless of correctness.

Reproduction (from #3111)
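A minimal standard-library sketch of the failure mode (not the code from #3111). The ASCII-only pattern and the helper names `ascii_tokenize` and `unigram_f1` are assumptions chosen for illustration; they stand in for the library's tokenizer and scorer rather than reproducing them.

```python
import re
from collections import Counter

# Assumption: an ASCII-only pattern standing in for the default tokenizer.
ASCII_TOKEN_RE = re.compile(r"[a-z0-9]+")


def ascii_tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only runs of ASCII letters and digits."""
    return ASCII_TOKEN_RE.findall(text.lower())


def unigram_f1(reference: list[str], candidate: list[str]) -> float:
    """ROUGE-1-style F-measure; 0.0 when there is no token overlap."""
    overlap = sum((Counter(reference) & Counter(candidate)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


english = "the cat sat"
chinese = "你好世界"

print(ascii_tokenize(english))  # ['the', 'cat', 'sat']
print(ascii_tokenize(chinese))  # []

# Identical strings, but with zero tokens on both sides the score is 0.0.
print(unigram_f1(ascii_tokenize(chinese), ascii_tokenize(chinese)))  # 0.0
```

With no tokens on either side there is nothing to overlap, so an exact match in a non-Latin script scores the same as a completely wrong answer.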
Fix
Added `_unicode_tokenize` function that:
- uses the `re.UNICODE` flag for ASCII-majority text (preserves existing behavior)

Closes #3111