fix: ROUGE-1 eval returns 0 for non-English languages (ASCII-only tokenizer) by tcconnally · Pull Request #6136 · google/adk-python

tcconnally · 2026-06-15T21:21:00Z

Problem

When evaluating text in non-Latin scripts (Thai, Chinese, Japanese, Arabic, etc.), the v1 ROUGE-1 evaluator returns scores of 0.0 even when the response matches the expected output exactly.

Root cause: The rouge_score library's default tokenizer lowercases the text and keeps only runs of ASCII letters and digits ([a-z0-9]); everything else is treated as a separator. Non-Latin characters therefore produce zero tokens → ROUGE-1 score of 0.0 regardless of correctness.
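
As a minimal illustration of the failure mode, the sketch below uses a hypothetical `ascii_only_tokenize` helper (an assumption standing in for an ASCII-only tokenizer, not the rouge_score source) to show that Thai and Chinese input yield no tokens at all, which leaves ROUGE-1 with nothing to match.

```python
import re

# Hypothetical stand-in for an ASCII-only tokenizer: lowercase the text and
# keep only runs of a-z and 0-9. This is an illustration, not library code.
_ASCII_TOKEN_RE = re.compile(r"[a-z0-9]+")


def ascii_only_tokenize(text: str) -> list[str]:
    """Return lowercase ASCII alphanumeric runs; all other characters are dropped."""
    return _ASCII_TOKEN_RE.findall(text.lower())


print(ascii_only_tokenize("Hello, world 42"))  # ['hello', 'world', '42']
print(ascii_only_tokenize("สวัสดี"))  # []
print(ascii_only_tokenize("你好世界"))  # []
```

With an empty token list on both the reference and the response side, there are no shared unigrams, so the score is 0.0 even for an exact match.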

Reproduction (from #3111)

agent = Agent(
    model="gemini-2.5-flash",
    instruction='Reply with only the word "สวัสดี"',
)
# Agent responds "สวัสดี" → ROUGE-1 score: 0.0 (should be 1.0)

Fix

Added _unicode_tokenize function that:

- Uses re.UNICODE flag for ASCII-majority text (preserves existing behavior)
- Splits on Unicode whitespace/punctuation for non-ASCII text
- Falls back to character-level tokens for scripts without word boundaries (Chinese, Japanese)

Closes #3111

rohityan · 2026-06-17T18:31:37Z

Hi @tcconnally, thank you for your contribution! We appreciate you taking the time to submit this pull request. Please fix the formatting errors.

The default RougeScorer tokenizer keeps only runs of ASCII letters and digits ([a-z0-9]). For non-Latin scripts (Thai, Chinese, Japanese, etc.), this returns zero tokens, causing ROUGE scores of 0.0 even when the response matches the expected output exactly.

Added _unicode_tokenize function that uses re.UNICODE flag and falls back to character-level tokenization for non-ASCII scripts.

Closes google#3111

- Replace function _unicode_tokenize with _UnicodeTokenizer class implementing the tokenize() method expected by RougeScorer
- Move import re to module level
- Fix double-escaped regex patterns (\\w -> \w, remove unsupported \p{P})
- Add return type annotation for tokenize() to satisfy mypy strict mode
- Fix RougeScorer constructor indentation

tcconnally · 2026-06-17T18:40:46Z

Fixed the pre-commit formatting issue (pyink). Rebased on main.

The previous tokenizer had two defects:

- Its char-level fallback was unreachable: it split non-ASCII text on whitespace first, and scripts without spaces (Chinese, Japanese, Thai) yield a single token, so the `list(text)` fallback never ran. Two different CJK strings sharing characters scored 0.0 instead of getting partial credit.
- Passing a custom `tokenizer=` makes rouge-score ignore `use_stemmer`, so English stemming was silently dropped (e.g. "running" no longer matched "run").

Now ASCII-majority text is delegated to rouge-score's DefaultTokenizer (preserving Porter stemming and existing behavior exactly), and non-ASCII text keeps Latin/digit runs as words while splitting remaining word characters individually so partial overlap is scored.

Verified: Thai exact=1.0, CJK exact=1.0, CJK partial (你好世界 vs 你好朋友)=0.5, English stemming (running fast vs run fast)=1.0, Latin sanity matches default.

tcconnally · 2026-06-26T15:45:07Z

Pushed a follow-up commit that hardens the tokenizer — I found two issues in the previous version while validating it against the rouge-score library:

- The char-level fallback was unreachable. It split non-ASCII text on whitespace first, but scripts without spaces (Chinese, Japanese, Thai) collapse to a single token, so list(text) never ran. Two different CJK strings that share characters scored 0.0 instead of partial credit, contradicting the docstring's promise of character-level tokens.
- Stemming was silently dropped. Passing a custom tokenizer= makes rouge-score ignore use_stemmer, so existing English evals regressed ("running" no longer matched "run").
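
The first issue can be reproduced with a small sketch. The earlier commit's code is not shown in this thread, so `whitespace_first_tokenize` below is an assumed reconstruction of the control flow described above, not the actual implementation.

```python
def whitespace_first_tokenize(text: str) -> list[str]:
    """Assumed reconstruction of the flawed control flow (illustration only)."""
    tokens = text.split()  # no-arg split() splits on Unicode whitespace
    if tokens:
        # Any text containing a non-whitespace character returns here.
        return tokens
    # Reached only for empty or whitespace-only text, so it never helps CJK.
    return list(text)


reference = whitespace_first_tokenize("你好世界")
candidate = whitespace_first_tokenize("你好朋友")
print(reference)  # ['你好世界']
print(candidate)  # ['你好朋友']
# The two whole-string tokens differ, so there is no unigram overlap.
print(set(reference) & set(candidate))  # set()
```

Because each string collapses to one token, two strings that share half their characters still have nothing in common at the token level.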

The updated _UnicodeTokenizer now delegates ASCII-majority text to rouge-score's DefaultTokenizer (preserving Porter stemming and existing behavior exactly), and for non-ASCII text keeps Latin/digit runs as words while splitting remaining word characters individually.
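
The dispatch described above might look roughly like the following sketch. It is not the PR's code: the names `is_ascii_majority`, `simple_ascii_tokenize`, and `unicode_aware_tokenize`, the strict-majority threshold, and the regex are all assumptions made for illustration, and a plain ASCII tokenizer stands in for the delegate so the block runs without rouge-score installed.

```python
import re
from typing import Callable, List

# Runs of ASCII letters/digits stay whole; any other single word character
# (for example a CJK ideograph) becomes its own token. Assumed pattern.
_MIXED_TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^\W_]")


def is_ascii_majority(text: str) -> bool:
    """True when more than half of the non-space characters are ASCII (assumed rule)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return True
    ascii_count = sum(1 for c in chars if ord(c) < 128)
    return ascii_count * 2 > len(chars)


def simple_ascii_tokenize(text: str) -> List[str]:
    """Placeholder delegate for ASCII-majority text (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unicode_aware_tokenize(
    text: str,
    ascii_delegate: Callable[[str], List[str]] = simple_ascii_tokenize,
) -> List[str]:
    """Delegate ASCII-majority text; otherwise split non-ASCII word chars individually."""
    if is_ascii_majority(text):
        return ascii_delegate(text)
    return _MIXED_TOKEN_RE.findall(text.lower())


print(unicode_aware_tokenize("The cat sat"))  # ['the', 'cat', 'sat']
print(unicode_aware_tokenize("你好世界"))  # ['你', '好', '世', '界']
print(unicode_aware_tokenize("你好世界 v2"))  # ['你', '好', '世', '界', 'v2']
```

With rouge-score installed, the `tokenize` method of `DefaultTokenizer(use_stemmer=True)` from `rouge_score.tokenizers` could presumably be passed as `ascii_delegate` to keep Porter stemming for English text; that wiring is not exercised here.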

Verified against the library:

| Case | Before | After |
| --- | --- | --- |
| Thai exact (`สวัสดี`) | 1.0 | 1.0 |
| CJK exact (`你好世界`) | 0.0 | 1.0 |
| CJK partial (`你好世界` vs `你好朋友`, share 你好) | 0.0 | 0.5 |
| English stemming (`running fast` vs `run fast`) | 0.5 | 1.0 |
| Latin sanity (`the cat sat` vs `the cat`) | n/a | matches default tokenizer |
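
The 0.5 in the CJK partial row follows from the usual unigram arithmetic once each character is its own token. The sketch below assumes the reported score is the ROUGE-1 F-measure; `rouge1_f1` is a hypothetical helper written for this illustration, not the evaluator's code.

```python
from collections import Counter
from typing import List


def rouge1_f1(reference_tokens: List[str], candidate_tokens: List[str]) -> float:
    """Unigram F-measure: harmonic mean of precision and recall over token counts."""
    if not reference_tokens or not candidate_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(ref, cand) times.
    overlap = sum((Counter(reference_tokens) & Counter(candidate_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


# Character-level tokens: 你 and 好 are shared, so overlap = 2 of 4 on each side.
print(rouge1_f1(list("你好世界"), list("你好朋友")))  # 0.5
print(rouge1_f1(list("你好世界"), list("你好世界")))  # 1.0
```

Precision and recall are both 2/4, so their harmonic mean is 0.5.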

Ready for another look, @wyf7107 — thanks for your patience.

tcconnally force-pushed the fix/non-english-eval-rouge branch from e275a87 to 6dff0a2 on June 15, 2026 21:22

rohityan self-assigned this Jun 15, 2026

wyf7107 self-assigned this Jun 16, 2026

rohityan added the eval label ([Component] This issue is related to evaluation) on Jun 17, 2026

rohityan removed their assignment Jun 17, 2026

rohityan added the needs review label ([Status] The PR/issue is awaiting review from the maintainer) on Jun 17, 2026

tcconnally added 3 commits June 17, 2026 18:39

chore: apply pyink formatting

98396a4

tcconnally force-pushed the fix/non-english-eval-rouge branch from 9beec74 to 98396a4 on June 17, 2026 18:40

tcconnally added 2 commits June 25, 2026 08:47

Merge branch 'main' into fix/non-english-eval-rouge

d04be01

tcconnally wants to merge 5 commits into google:main from Perseus-Computing-LLC:fix/non-english-eval-rouge

3 participants