
whisper : make voice_length() utf-8 aware for CJK #3915

Open
buxuku wants to merge 1 commit into ggml-org:master from buxuku:pr/voice-length-cjk


buxuku (Contributor) commented Jun 27, 2026

voice_length() estimates how long a token's text takes to say, and that estimate decides how a segment's time gets divided across its tokens. The loop walked the string one byte at a time, so every CJK character (3 bytes in UTF-8) counted as three units instead of one, and full-width punctuation never matched the ASCII cases. For Chinese/Japanese that skews the per-token timestamps.

This decodes one UTF-8 code point per step and treats full-width ,。!? etc. the same as their ASCII forms. ASCII-only text decodes to the same weights as before.

For example, "中文," scored 9.0 before (9 bytes at 1.0 each) and scores 4.0 now (中=1, 文=1, ,=2), the same as "ab,".
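The before/after scores in that example can be reproduced with a small standalone sketch. This is not the PR's code: `unit_weight`, `decode_cp`, `voice_length_bytes` and `voice_length_sketch` are hypothetical names, and only the weights for a comma (2) and an ordinary character (1) come from the description above. The remaining weights are assumptions chosen for illustration, and the decoder does not reject overlong encodings or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Weight of one unit. Only ','=2 and ordinary character=1 are stated in the PR;
// the space, sentence-punctuation and digit weights are assumptions.
static float unit_weight(uint32_t cp) {
    if (cp == U' ') {
        return 0.01f;
    }
    if (cp == U',' || cp == 0xFF0C) { // ASCII comma, full-width comma
        return 2.0f;
    }
    if (cp == U'.' || cp == 0x3002 || // period, ideographic full stop
        cp == U'!' || cp == 0xFF01 || // exclamation mark, full-width form
        cp == U'?' || cp == 0xFF1F) { // question mark, full-width form
        return 3.0f;
    }
    if (cp >= U'0' && cp <= U'9') {
        return 3.0f;
    }
    return 1.0f;
}

// Decode one code point starting at s[i]; len receives the bytes consumed.
// Malformed or truncated input falls back to the raw byte with len = 1.
static uint32_t decode_cp(const std::string & s, size_t i, size_t & len) {
    const unsigned char c = s[i];
    uint32_t cp = c;
    size_t n = 1;
    len = 1;
    if (c < 0x80) {
        return cp;
    } else if ((c >> 5) == 0x6) {
        cp = c & 0x1F;
        n = 2;
    } else if ((c >> 4) == 0xE) {
        cp = c & 0x0F;
        n = 3;
    } else if ((c >> 3) == 0x1E) {
        cp = c & 0x07;
        n = 4;
    } else {
        return c; // stray continuation byte or invalid lead byte
    }
    if (i + n > s.size()) {
        return c; // sequence runs past the end of the string
    }
    for (size_t k = 1; k < n; ++k) {
        const unsigned char cc = s[i + k];
        if ((cc & 0xC0) != 0x80) {
            return c;
        }
        cp = (cp << 6) | (cc & 0x3F);
    }
    len = n;
    return cp;
}

// Old behaviour: every byte is one unit.
static float voice_length_bytes(const std::string & text) {
    float res = 0.0f;
    for (const unsigned char c : text) {
        res += unit_weight(c);
    }
    return res;
}

// New behaviour: every decoded code point is one unit.
static float voice_length_sketch(const std::string & text) {
    float res = 0.0f;
    for (size_t i = 0; i < text.size(); ) {
        size_t len = 1;
        res += unit_weight(decode_cp(text, i, len));
        i += len;
    }
    return res;
}
```

With the bytes of "中文," written as `"\xE4\xB8\xAD\xE6\x96\x87\xEF\xBC\x8C"`, `voice_length_bytes` returns 9.0 and `voice_length_sketch` returns 4.0, while both return 4.0 for `"ab,"`.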

Came up while doing word-level subtitle timing for Chinese. Built on macOS and the existing tests/test-vad-full still passes.

voice_length() weights each token by how long its text takes to say, which drives
how a segment's time is shared between its tokens. It looped over raw bytes, so
every CJK character (3 bytes) was counted ~3x and full-width punctuation never
matched, skewing token timestamps for Chinese/Japanese.

Decode one utf-8 code point at a time and give full-width ,。!? etc. the same
weights as their ASCII counterparts. Pure-ASCII text is unaffected.

danbev (Member) left a comment

Note that test-vad-full does not currently invoke voice_length(), so it does not exercise this change.

I verified the change manually in lldb, using a Japanese audio sample with the language set to Japanese (-l ja).

Comment thread src/whisper.cpp
if (c < 0x80) {
    len = 1;
} else if ((c >> 5) == 0x6) {
    cp = c & 0x1F; len = 2;


Nit: Perhaps just one statement per line here, to be consistent with other parts of the codebase:

Suggested change
-     cp = c & 0x1F; len = 2;
+     cp = c & 0x1F;
+     len = 2;

And likewise for the lines below.

Comment thread src/whisper.cpp
bool ok = true;
for (int k = 1; k < len; ++k) {
    const unsigned char cc = s[i + k];
    if ((cc & 0xC0) != 0x80) { ok = false; break; }


Nit: Place the if body on a new line and separate the statements.

Comment thread src/whisper.cpp
    if ((cc & 0xC0) != 0x80) { ok = false; break; }
    cp = (cp << 6) | (cc & 0x3F);
}
if (!ok) { cp = c; len = 1; }


Nit: Same here: put the if body on a new line and separate the statements.
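One way the continuation-byte check could look with the review nits applied (one statement per line, if bodies on their own lines) is sketched below. `read_continuations` is a hypothetical helper, not code from the PR, and the explicit bounds check is an added assumption, since the surrounding lines of the diff are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: validate the len - 1 continuation bytes that follow the
// lead byte at s[i] and fold their 6 payload bits each into cp.
// cp must already hold the payload bits of the lead byte.
// Returns false on a truncated or malformed sequence; cp may then be partially
// updated, so the caller is expected to reset it (cp = c, len = 1).
static bool read_continuations(const std::string & s, size_t i, int len, uint32_t & cp) {
    if (i + (size_t) len > s.size()) {
        return false;
    }
    for (int k = 1; k < len; ++k) {
        const unsigned char cc = s[i + k];
        if ((cc & 0xC0) != 0x80) {
            return false;
        }
        cp = (cp << 6) | (cc & 0x3F);
    }
    return true;
}
```

Returning early takes the place of the `ok` flag and the `break`, which is a design choice of this sketch rather than something the review asked for.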
