Commit 909307c
whisper : make voice_length() utf-8 aware for CJK (#3915)
* whisper : make voice_length() utf-8 aware for CJK
voice_length() weights each token by how long its text takes to say, which drives
how a segment's time is shared between its tokens. It looped over raw bytes, so
every CJK character (3 bytes) was counted ~3x and full-width punctuation never
matched, skewing token timestamps for Chinese/Japanese.
Decode one utf-8 code point at a time and give full-width ，。！？ etc. the same
weights as their ASCII counterparts. Pure-ASCII text is unaffected.
* whisper : one statement per line in voice_length()
---------
Co-authored-by: linxiaodong <calm.lin@wukongsch.com>
1 parent 0874de3 commit 909307c
1 file changed
Lines changed: 76 additions & 14 deletions
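The commit message describes a byte-wise loop being replaced by a code-point-wise loop. The following is a minimal sketch of that idea, not the code from this commit. The function names (`next_code_point`, `weight_of`, `voice_length_bytes`, `voice_length_utf8`) are hypothetical, the numeric weights are assumed for illustration, and only the four full-width marks named in the message are mapped.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode one utf-8 code point starting at s[i] and advance i past it.
// Malformed or truncated input is consumed one byte at a time as U+FFFD.
static uint32_t next_code_point(const std::string & s, size_t & i) {
    const unsigned char c0 = static_cast<unsigned char>(s[i]);
    size_t   len = 1;
    uint32_t cp  = c0;
    if (c0 < 0x80) {
        len = 1;
    } else if ((c0 & 0xE0) == 0xC0) {
        len = 2;
        cp  = c0 & 0x1F;
    } else if ((c0 & 0xF0) == 0xE0) {
        len = 3;
        cp  = c0 & 0x0F;
    } else if ((c0 & 0xF8) == 0xF0) {
        len = 4;
        cp  = c0 & 0x07;
    } else {
        i += 1;          // stray continuation byte or invalid lead byte
        return 0xFFFD;
    }
    if (i + len > s.size()) {
        i += 1;          // sequence runs past the end of the string
        return 0xFFFD;
    }
    for (size_t k = 1; k < len; ++k) {
        const unsigned char ck = static_cast<unsigned char>(s[i + k]);
        if ((ck & 0xC0) != 0x80) {
            i += 1;      // expected a continuation byte
            return 0xFFFD;
        }
        cp = (cp << 6) | (ck & 0x3F);
    }
    i += len;
    return cp;
}

// Assumed weights (illustrative only): pause-like marks weigh more than letters.
static float weight_of(uint32_t cp) {
    if (cp == 0x20) {
        return 0.01f;    // space
    }
    if (cp == 0x2C || cp == 0xFF0C) {
        return 2.00f;    // ',' and full-width comma
    }
    if (cp == 0x2E || cp == 0x21 || cp == 0x3F ||
        cp == 0x3002 || cp == 0xFF01 || cp == 0xFF1F) {
        return 3.00f;    // '.', '!', '?' and their CJK / full-width forms
    }
    if (cp >= 0x30 && cp <= 0x39) {
        return 3.00f;    // ASCII digits
    }
    return 1.00f;        // everything else, including each CJK character
}

// Byte-wise variant: every byte of a multi-byte character counts as one unit.
static float voice_length_bytes(const std::string & text) {
    float res = 0.0f;
    for (char c : text) {
        res += weight_of(static_cast<unsigned char>(c));
    }
    return res;
}

// Code-point-wise variant: one weight per decoded character.
static float voice_length_utf8(const std::string & text) {
    float  res = 0.0f;
    size_t i   = 0;
    while (i < text.size()) {
        res += weight_of(next_code_point(text, i));
    }
    return res;
}
```

Under these assumed weights, the six-character string 你好，世界。 (18 bytes) scores 18 with the byte-wise loop, because no byte of a multi-byte sequence matches an ASCII punctuation mark, and 9 with the code-point loop. Pure-ASCII input goes through the same branches in both variants and gets the same result.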
@@ -8397,24 +8397,86 @@
(diff body not preserved; only the line numbers of the hunk survive)