Commit 27101c0
cli : merge tokens split across UTF-8 boundaries in JSON output (#3751)
* cli : merge tokens split across UTF-8 boundaries in JSON output
When a multi-byte UTF-8 codepoint (most commonly a CJK character, 3 bytes)
is split across multiple whisper tokens, the -ojf/--output-json-full
writer emitted each token's partial bytes as its own JSON string, producing
invalid UTF-8 that chokes downstream parsers.
Merge adjacent tokens in output_json whenever the accumulated text still
ends on an incomplete UTF-8 sequence. The merged entry keeps the first
token's id/p/t_dtw and extends t1 to the last absorbed token, which
matches how segment text is assembled elsewhere.
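
The merge rule described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code from this commit: the `Token` struct, `ends_with_incomplete_utf8`, and `merge_split_tokens` are hypothetical names, and only the fields `id`, `p`, `t_dtw`, and `t1` are taken from the message (`t0` and `text` are assumed).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical token record. id/p/t_dtw/t1 are named in the commit message;
// t0 and text are assumptions made for this sketch.
struct Token {
    int         id;
    float       p;
    int64_t     t_dtw;
    int64_t     t0;
    int64_t     t1;
    std::string text; // raw bytes, possibly a partial UTF-8 sequence
};

// Returns true when `s` ends in the middle of a multi-byte UTF-8 sequence.
static bool ends_with_incomplete_utf8(const std::string & s) {
    size_t i    = s.size();
    size_t cont = 0;
    // Walk back over at most 3 trailing continuation bytes (10xxxxxx).
    while (i > 0 && cont < 3 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++cont;
    }
    if (i == 0) {
        return false; // empty, or no lead byte at all: appending bytes cannot fix it
    }
    const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    size_t need = 0;
    if ((lead & 0x80) == 0x00) {
        need = 1; // ASCII
    } else if ((lead & 0xE0) == 0xC0) {
        need = 2;
    } else if ((lead & 0xF0) == 0xE0) {
        need = 3; // e.g. most CJK characters
    } else if ((lead & 0xF8) == 0xF0) {
        need = 4;
    } else {
        return false; // stray continuation byte or invalid lead byte
    }
    return cont + 1 < need;
}

// Merges adjacent tokens while the accumulated text still ends mid-codepoint.
// The merged entry keeps the first token's id/p/t_dtw and takes t1 from the
// last absorbed token.
static std::vector<Token> merge_split_tokens(const std::vector<Token> & in) {
    std::vector<Token> out;
    bool pending = false; // true while out.back().text ends on a partial sequence
    for (const Token & tok : in) {
        if (pending) {
            Token & acc = out.back();
            acc.text += tok.text;
            acc.t1    = tok.t1;
        } else {
            out.push_back(tok);
        }
        pending = ends_with_incomplete_utf8(out.back().text);
    }
    return out;
}
```

In this sketch a trailing partial sequence at the very end of the input is left as-is, since there is no further token to absorb; how the actual change handles that case is not stated in the message.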
Refs #1798
* fix: address review — add braces for consistency, use full issue URL
- Add braces to if/else chain for codebase consistency
- Use full URL for issue #1798 reference
Review: @danbev
---------
Co-authored-by: texasich <texasich@users.noreply.github.com>
Co-authored-by: texasich <texasich@gmail.com>
1 parent e0fd1f6 commit 27101c0
1 file changed
Lines changed: 71 additions & 9 deletions
Diff (line content not preserved; hunk ranges only):
@@ -31,6 +31,39 @@ 33 lines added (new lines 34-66)
@@ -738,18 +771,47 @@ 9 lines removed, 38 lines added