Commit 9c53491
fix(tokenizer): BPE decode double-UTF-8 for direct byte codepoints
decode_bpe_token was emitting raw UTF-8 bytes for codepoints U+0080–U+00FF
instead of reversing GPT-2's byte-to-unicode mapping. Result: any accented
char (or emoji/CJK routed through byte fallback) on Llama-3/Qwen-style BPE
came out double-encoded.
Before: "café" bytes 63 61 66 c3 a9 → output 63 61 66 c3 83 c2 a9 ("café")
After: "café" bytes 63 61 66 c3 a9 → output 63 61 66 c3 a9 ("café")
Scope: direct-byte codepoints U+00A1–U+00AC and U+00AE–U+00FF (GPT-2's
"direct" mapping set). Indirect codepoints U+0100 and above were already
handled via indirect_to_byte.
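For reference, the mapping being reversed is GPT-2's byte-to-unicode table. A minimal sketch of the correct decode path follows; the function names (`bytes_to_unicode`, `decode_token`) follow the reference GPT-2 tokenizer and are illustrative, not this repo's actual `decode_bpe_token` API:

```python
def bytes_to_unicode():
    """GPT-2's byte-to-unicode table: "direct" bytes keep their own
    codepoint; all remaining bytes are remapped to U+0100 and above."""
    bs = (list(range(ord("!"), ord("~") + 1))              # printable ASCII
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))  # U+00A1-U+00AC
          + list(range(ord("\u00ae"), ord("\u00ff") + 1))) # U+00AE-U+00FF
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # indirect: shifted past U+00FF
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# Decoding must invert the table per character, not UTF-8-encode the
# token string -- that encode step is exactly the double-encoding bug.
UNICODE_TO_BYTE = {c: b for b, c in bytes_to_unicode().items()}

def decode_token(token: str) -> bytes:
    # Each char of a BPE token names exactly one original byte.
    return bytes(UNICODE_TO_BYTE[ch] for ch in token)

token = "caf\u00c3\u00a9"  # how "café" (bytes 63 61 66 c3 a9) sits in the vocab
assert decode_token(token) == b"caf\xc3\xa9"            # correct UTF-8 "café"
assert token.encode("utf-8") == b"caf\xc3\x83\xc2\xa9"  # the old, buggy output
```

Since 0xC3 and 0xA9 both fall in the direct ranges, the vocab stores them as the chars U+00C3 and U+00A9; UTF-8-encoding those chars expands each to two bytes, which is where the extra c3 83 / c2 a9 came from.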
Discovered via refparity R5's A/B test on Llama-3.2-1B: the default path
emitted 'cafÃ©'. Regression suite 15/15 PASS unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Parent: 6975522
2 files changed: 26 additions, 1 deletion
(diff table not captured: first file, 18 lines added at new lines 6–23, context at lines 3–5 and 24–26)
(diff table not captured: second file, 7 lines added at new lines 1089–1095 and line 1090 rewritten as line 1097, context at lines 1086–1088 and 1098–1100)