Commit 9c53491

unamedkr and claude committed
fix(tokenizer): BPE decode double-UTF-8 for direct byte codepoints
decode_bpe_token was emitting raw UTF-8 bytes for codepoints
U+0080–U+00FF instead of reversing GPT-2's byte-to-unicode mapping.
Result: any accented char (or emoji/CJK routed through byte fallback)
on Llama-3/Qwen-style BPE came out double-encoded.

Before: "café" bytes 63 61 66 c3 a9 → output 63 61 66 c3 83 c2 a9 ("cafÃ©")
After:  "café" bytes 63 61 66 c3 a9 → output 63 61 66 c3 a9 ("café")

Scope: direct-byte codepoints U+00A1-U+00AC and U+00AE-U+00FF (GPT-2's
"direct" mapping set). Indirect codepoints U+0100+ were already handled
via indirect_to_byte.

Discovered via refparity R5's A/B test on Llama-3.2-1B, where the
default path emitted "cafÃ©". Regression suite 15/15 PASS unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 6975522 commit 9c53491

2 files changed: 26 additions & 1 deletion


.claude/state.md

Lines changed: 18 additions & 0 deletions
```diff
@@ -3,6 +3,24 @@
 **Last updated**: 2026-04-21 (Phase 1 refparity ★)
 **Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.
 
+## ★ Phase 1 R6 — BPE decode double-UTF-8 bug FIXED (2026-04-21) ★
+
+`src/engine/tq_tokenizer.c:1089-1093`: decode_bpe_token for codepoints
+U+0080-U+00FF was emitting raw UTF-8 bytes (c3 83 for 'Ã') instead of
+reversing GPT-2's byte-to-unicode mapping.
+
+Before: "café" (bytes 63 61 66 c3 a9) → engine emitted 63 61 66 **c3 83 c2 a9**
+After: "café" (bytes 63 61 66 c3 a9) → engine emits 63 61 66 c3 a9 ✓
+
+Any Llama-3 / Qwen-style BPE output containing accented chars, non-ASCII
+punctuation, or emoji (via byte fallback) was getting silently double-encoded.
+Discovered via R5's A/B test surfacing the 'cafÃ©' artifact.
+
+Scope: direct byte codepoints U+00A1-U+00AC and U+00AE-U+00FF (GPT-2's
+"direct" byte mapping). Indirect bytes at U+0100+ were already handled.
+
+Regression: 12/12 PASS (+3 Metal tier) → 15/15 PASS unchanged.
+
 ## Phase 1 R5 — TQ_NO_Q4=1 quality/speed tradeoff — NOT flipping default (2026-04-21)
 
 Cross-model A/B on "Once upon a time" (short) vs "Once upon a time in a
```
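To make the before/after byte sequences concrete, here is a minimal standalone C sketch (illustrative only, not engine code) of the two decode strategies for the 'é' in "café": the naive path re-encodes the vocab codepoints U+00C3/U+00A9 back into UTF-8 and produces the four-byte 'Ã©' artifact, while reversing the mapping restores the original two bytes.

```c
/* Illustrative only, not engine code: the two decode strategies for
 * the 'é' (bytes c3 a9) in "café".  GPT-2's direct mapping stores a
 * raw byte b from the direct set as codepoint U+00b in the vocab. */
#include <stdio.h>

int main(void) {
    const unsigned char raw[] = {0xC3, 0xA9}; /* UTF-8 for 'é' */

    printf("naive: ");
    for (int i = 0; i < 2; i++) {
        unsigned cp = raw[i]; /* vocab codepoint == raw byte (direct set) */
        /* Bug: re-encoding cp as UTF-8 emits two bytes per codepoint. */
        printf("%02x %02x ", 0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    }
    printf(" -> c3 83 c2 a9 ('Ã©', double-encoded)\n");

    printf("fixed: ");
    for (int i = 0; i < 2; i++) {
        unsigned cp = raw[i];
        /* Fix: reverse the mapping and emit the raw byte itself. */
        printf("%02x ", cp);
    }
    printf("       -> c3 a9 ('é')\n");
    return 0;
}
```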

src/engine/tq_tokenizer.c

Lines changed: 8 additions & 1 deletion
```diff
@@ -1086,8 +1086,15 @@ static const char* decode_bpe_token(const char* piece) {
             decode_buf[out++] = (char)p[0];
             decode_buf[out++] = (char)p[1];
         }
+    } else if ((cp >= 0xA1 && cp <= 0xAC) || (cp >= 0xAE && cp <= 0xFF)) {
+        /* GPT-2 direct-byte mapping: codepoints U+00A1-U+00AC and
+         * U+00AE-U+00FF represent raw bytes of the same value. The
+         * BPE vocab stores these as UTF-8 (e.g. 'Ã' for byte 0xC3),
+         * so emit the raw byte to reconstruct the intended UTF-8
+         * character (e.g. 'Ã'+'©' → bytes 0xC3 0xA9 = 'é'). */
+        decode_buf[out++] = (char)(unsigned char)cp;
     } else {
-        /* Regular 2-byte UTF-8 char (e.g., accented letters) */
+        /* Regular 2-byte UTF-8 char (rare in GPT-2-style BPE) */
         decode_buf[out++] = (char)p[0];
         decode_buf[out++] = (char)p[1];
     }
```
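For context on why the patch tests exactly those two ranges, here is a hedged sketch of GPT-2's byte-to-unicode table and its reverse, mirroring bytes_to_unicode() from OpenAI's encoder.py; the table names (byte_to_cp, cp_to_byte) are hypothetical and not the engine's.

```c
/* Sketch of GPT-2's byte<->unicode mapping (table names hypothetical).
 * Direct bytes map to codepoints of the same value; the remaining 68
 * bytes are assigned U+0100, U+0101, ... in ascending byte order. */
#include <stdio.h>

static int byte_to_cp[256];   /* raw byte -> vocab codepoint       */
static int cp_to_byte[0x144]; /* vocab codepoint -> raw byte or -1 */

/* Direct set: printable ASCII plus U+00A1-U+00AC and U+00AE-U+00FF. */
static int is_direct(int b) {
    return (b >= 0x21 && b <= 0x7E) ||
           (b >= 0xA1 && b <= 0xAC) ||
           (b >= 0xAE && b <= 0xFF);
}

static void build_tables(void) {
    int n = 0;
    for (int c = 0; c < 0x144; c++) cp_to_byte[c] = -1;
    for (int b = 0; b < 256; b++) {
        /* Direct bytes keep their value; the 68 others shift to U+0100+. */
        int cp = is_direct(b) ? b : 0x100 + n++;
        byte_to_cp[b] = cp;
        cp_to_byte[cp] = b;
    }
}

int main(void) {
    build_tables();
    /* 0xC3 is direct: identity both ways, so decode emits one raw byte. */
    printf("0xC3 -> U+%04X -> 0x%02X\n", byte_to_cp[0xC3],
           cp_to_byte[byte_to_cp[0xC3]]);
    /* 0x20 (space) is indirect: it becomes U+0120 ('Ġ') in the vocab. */
    printf("0x20 -> U+%04X -> 0x%02X\n", byte_to_cp[0x20],
           cp_to_byte[byte_to_cp[0x20]]);
    return 0;
}
```

Under this table, reversing a direct codepoint is the identity (emit cp as a single byte), which is what the new else-if branch does; indirect codepoints at U+0100 and above need a lookup, which per the commit message the engine already performs via indirect_to_byte.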
