whisper : make voice_length() utf-8 aware for CJK #3915
Open
buxuku wants to merge 1 commit into
voice_length() weights each token by how long its text takes to say, which drives how a segment's time is shared between its tokens. It looped over raw bytes, so every CJK character (3 bytes) was counted ~3x and full-width punctuation never matched, skewing token timestamps for Chinese/Japanese. Decode one utf-8 code point at a time and give full-width ,。!? etc. the same weights as their ASCII counterparts. Pure-ASCII text is unaffected.
danbev (Member) approved these changes on Jun 30, 2026 and left a comment:
Just note that test-vad-full does not currently invoke voice_length.
I manually verified with lldb, using a Japanese audio sample and setting the language (-l ja).
    if (c < 0x80) {
        len = 1;
    } else if ((c >> 5) == 0x6) {
        cp = c & 0x1F; len = 2;
Member
Nit: Perhaps just one statement per line here, to be consistent with other parts of the codebase:
Suggested change
    - cp = c & 0x1F; len = 2;
    + cp = c & 0x1F;
    + len = 2;
And likewise for the lines below.
    bool ok = true;
    for (int k = 1; k < len; ++k) {
        const unsigned char cc = s[i + k];
        if ((cc & 0xC0) != 0x80) { ok = false; break; }
Member
Nit: Place if body on a new line and separate the statements.
        if ((cc & 0xC0) != 0x80) { ok = false; break; }
        cp = (cp << 6) | (cc & 0x3F);
    }
    if (!ok) { cp = c; len = 1; }
Member
Nit: If body on a new line and separate statements.
voice_length() estimates how long a token's text takes to say, and that estimate decides how a segment's time gets divided across its tokens. The loop walked the string one byte at a time, so every CJK character (3 bytes in UTF-8) counted as three units instead of one, and full-width punctuation never matched the ASCII cases. For Chinese/Japanese that skews the per-token timestamps.
This decodes one UTF-8 code point per step and treats full-width ,。!? etc. the same as their ASCII forms. ASCII-only text decodes to the same weights as before.
For example, "中文," scored 9.0 before (9 bytes, each counted as 1) and scores 4.0 now (中=1, 文=1, ,=2), the same as "ab,".
Came up while doing word-level subtitle timing for Chinese. Built on macOS and the existing tests/test-vad-full still passes.
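To make the 9.0 versus 4.0 arithmetic concrete, here is a minimal standalone sketch of the two behaviours. It is not the code in this PR: the function names (`code_point_weight`, `voice_length_bytes`, `voice_length_utf8`) are hypothetical, and only the comma weight (2) and the ordinary-character weight (1) come from the example above. The weights for spaces, sentence-ending punctuation and digits are assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Weight of a single code point. Comma = 2 and ordinary character = 1 match
// the worked example; all other values are assumed for illustration only.
static float code_point_weight(uint32_t cp) {
    if (cp == ' ') {
        return 0.01f; // assumed
    }
    if (cp == ',' || cp == 0xFF0C) { // ',' and full-width ','
        return 2.0f;
    }
    if (cp == '.' || cp == '!' || cp == '?' ||
        cp == 0x3002 || cp == 0xFF01 || cp == 0xFF1F) { // '。' '!' '?'
        return 3.0f; // assumed
    }
    if (cp >= '0' && cp <= '9') {
        return 3.0f; // assumed
    }
    return 1.0f;
}

// Old behaviour: every byte is weighted as if it were a character.
static float voice_length_bytes(const std::string & text) {
    float res = 0.0f;
    for (unsigned char c : text) {
        res += code_point_weight(c);
    }
    return res;
}

// New behaviour: decode one UTF-8 code point per step, then weight it.
static float voice_length_utf8(const std::string & text) {
    float res = 0.0f;
    size_t i = 0;
    while (i < text.size()) {
        const unsigned char c = text[i];
        uint32_t cp  = c;
        size_t   len = 1;
        if ((c >> 5) == 0x6) {         // 110xxxxx: 2-byte sequence
            cp  = c & 0x1F;
            len = 2;
        } else if ((c >> 4) == 0xE) {  // 1110xxxx: 3-byte sequence
            cp  = c & 0x0F;
            len = 3;
        } else if ((c >> 3) == 0x1E) { // 11110xxx: 4-byte sequence
            cp  = c & 0x07;
            len = 4;
        }
        // The sequence must fit in the string and consist of 10xxxxxx bytes.
        bool ok = i + len <= text.size();
        for (size_t k = 1; ok && k < len; ++k) {
            const unsigned char cc = text[i + k];
            if ((cc & 0xC0) != 0x80) {
                ok = false;
            } else {
                cp = (cp << 6) | (cc & 0x3F);
            }
        }
        if (!ok) {
            // Malformed or truncated input: fall back to a single byte.
            cp  = c;
            len = 1;
        }
        res += code_point_weight(cp);
        i   += len;
    }
    return res;
}
```

With these assumed weights, the byte-wise version gives 9.0 for the UTF-8 bytes of "中文," because none of the nine bytes matches an ASCII case, while the code-point version gives 4.0 for the same input and for "ab,". Bytes that do not form a valid sequence are counted one at a time, so malformed input never reads past the end of the string.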