whisper : make voice_length() utf-8 aware for CJK by buxuku · Pull Request #3915 · ggml-org/whisper.cpp

buxuku · 2026-06-27T13:32:12Z

voice_length() estimates how long a token's text takes to say, and that estimate decides how a segment's time gets divided across its tokens. The loop walked the string one byte at a time, so every CJK character (3 bytes in UTF-8) counted as three units instead of one, and full-width punctuation never matched the ASCII cases. For Chinese/Japanese that skews the per-token timestamps.
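To make the over-counting concrete, here is a minimal sketch of a byte-wise loop of the kind described above. The name `voice_length_bytes` is hypothetical, and the weights for space, period, `!`, `?` and digits are assumptions; the example in this description only confirms comma = 2 and ordinary characters = 1.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the byte-wise loop: every byte is weighted on its own.
// Weights other than ',' (2.0) and the default (1.0) are assumptions.
static float voice_length_bytes(const std::string & text) {
    float res = 0.0f;

    for (char c : text) {
        if (c == ' ') {
            res += 0.01f;
        } else if (c == ',') {
            res += 2.00f;
        } else if (c == '.' || c == '!' || c == '?') {
            res += 3.00f;
        } else if (c >= '0' && c <= '9') {
            res += 3.00f;
        } else {
            // Every byte of a multi-byte UTF-8 sequence lands here,
            // so a 3-byte CJK character contributes 3.0 instead of 1.0.
            res += 1.00f;
        }
    }

    return res;
}
```

With this loop, the 9-byte string `"\xE4\xB8\xAD\xE6\x96\x87\xEF\xBC\x8C"` (the UTF-8 encoding of 中文，) sums to 9.0, because none of its bytes matches an ASCII case.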

This decodes one UTF-8 code point per step and treats full-width ，。！？ etc. the same as their ASCII forms. ASCII-only text decodes to the same weights as before.

For example, "中文，" scored 9.0 before (one unit per byte, 9 bytes) and scores 4.0 now (中=1, 文=1, ，=2), the same as "ab,".
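The code-point-wise approach can be sketched roughly as follows. This is not the exact patch: `voice_length_utf8` and `code_point_weight` are hypothetical names, the ASCII weights other than comma are assumptions, and only the four full-width marks named above are mapped (the real change may cover more).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Weight of one decoded code point (hypothetical helper).
// Full-width punctuation shares the weight of its ASCII counterpart.
static float code_point_weight(uint32_t cp) {
    if (cp == ' ') {
        return 0.01f;
    }
    if (cp == ',' || cp == 0xFF0C) {                  // , and ，
        return 2.00f;
    }
    if (cp == '.' || cp == 0x3002 ||                  // . and 。
        cp == '!' || cp == 0xFF01 ||                  // ! and ！
        cp == '?' || cp == 0xFF1F) {                  // ? and ？
        return 3.00f;
    }
    if (cp >= '0' && cp <= '9') {
        return 3.00f;
    }
    return 1.00f;
}

static float voice_length_utf8(const std::string & s) {
    float res = 0.0f;
    const size_t n = s.size();

    size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        uint32_t cp  = c;   // default: treat the byte as a single unit
        size_t   len = 1;

        if ((c >> 5) == 0x6) {          // 110xxxxx: 2-byte sequence
            cp  = c & 0x1F;
            len = 2;
        } else if ((c >> 4) == 0xE) {   // 1110xxxx: 3-byte sequence (most CJK)
            cp  = c & 0x0F;
            len = 3;
        } else if ((c >> 3) == 0x1E) {  // 11110xxx: 4-byte sequence
            cp  = c & 0x07;
            len = 4;
        }

        if (len > 1) {
            // The sequence must fit in the string and consist of 10xxxxxx bytes.
            bool ok = i + len <= n;
            for (size_t k = 1; ok && k < len; ++k) {
                const unsigned char cc = static_cast<unsigned char>(s[i + k]);
                if ((cc & 0xC0) != 0x80) {
                    ok = false;
                } else {
                    cp = (cp << 6) | (cc & 0x3F);
                }
            }
            if (!ok) {
                // Malformed input: fall back to consuming a single byte.
                cp  = c;
                len = 1;
            }
        }

        res += code_point_weight(cp);
        i   += len;
    }

    return res;
}
```

Falling back to a single byte on malformed input keeps the loop advancing and gives truncated or invalid sequences the same per-byte weight they had before, so only well-formed multi-byte text changes its score.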

Came up while doing word-level subtitle timing for Chinese. Built on macOS and the existing tests/test-vad-full still passes.

voice_length() weights each token by how long its text takes to say, which drives how a segment's time is shared between its tokens. It looped over raw bytes, so every CJK character (3 bytes) was counted ~3x and full-width punctuation never matched, skewing token timestamps for Chinese/Japanese. Decode one utf-8 code point at a time and give full-width ，。！？ etc. the same weights as their ASCII counterparts. Pure-ASCII text is unaffected.

danbev

Just a note: test-vad-full does not currently invoke voice_length(), so it does not exercise this change.

I verified the change manually with lldb, using a Japanese audio sample and setting the language (-l ja).

danbev · 2026-06-30T06:07:01Z

+        if (c < 0x80) {
+            len = 1;
+        } else if ((c >> 5) == 0x6) {
+            cp = c & 0x1F; len = 2;


Nit: Perhaps just one statement per line here, to be consistent with other parts of the codebase:

Suggested change

-            cp = c & 0x1F; len = 2;
+            cp = c & 0x1F;
+            len = 2;

And likewise for the lines below.

danbev · 2026-06-30T06:28:49Z

+            bool ok = true;
+            for (int k = 1; k < len; ++k) {
+                const unsigned char cc = s[i + k];
+                if ((cc & 0xC0) != 0x80) { ok = false; break; }


Nit: Place if body on a new line and separate the statements.

danbev · 2026-06-30T06:29:08Z

+                if ((cc & 0xC0) != 0x80) { ok = false; break; }
+                cp = (cp << 6) | (cc & 0x3F);
+            }
+            if (!ok) { cp = c; len = 1; }


Nit: Same here, place the if body on a new line and separate the statements.

danbev approved these changes Jun 30, 2026

buxuku wants to merge 1 commit into ggml-org:master from buxuku:pr/voice-length-cjk
