Commit 9c53491

unamedkr and claude committed
fix(tokenizer): BPE decode double-UTF-8 for direct byte codepoints
decode_bpe_token was emitting raw UTF-8 bytes for codepoints
U+0080–U+00FF instead of reversing GPT-2's byte-to-unicode mapping.
Result: any accented char (or emoji/CJK routed through byte fallback)
on Llama-3/Qwen-style BPE came out double-encoded.

Before: "café" bytes 63 61 66 c3 a9 → output 63 61 66 c3 83 c2 a9 ("cafÃ©")
After:  "café" bytes 63 61 66 c3 a9 → output 63 61 66 c3 a9 ("café")

Scope: direct-byte codepoints U+00A1-U+00AC and U+00AE-U+00FF (GPT-2's
"direct" mapping set). Indirect codepoints U+0100+ were already handled
via indirect_to_byte.

Discovered via refparity R5's A/B test on Llama-3.2-1B, where the
default path emitted "cafÃ©". Regression suite 15/15 PASS unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 6975522 commit 9c53491

2 files changed: 26 additions & 1 deletion


.claude/state.md

Lines changed: 18 additions & 0 deletions
```diff
@@ -3,6 +3,24 @@
 **Last updated**: 2026-04-21 (Phase 1 refparity ★)
 **Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.
 
+## ★ Phase 1 R6 — BPE decode double-UTF-8 bug FIXED (2026-04-21) ★
+
+`src/engine/tq_tokenizer.c:1089-1093`: decode_bpe_token for codepoints
+U+0080-U+00FF was emitting raw UTF-8 bytes (c3 83 for 'Ã') instead of
+reversing GPT-2's byte-to-unicode mapping.
+
+Before: "café" (bytes 63 61 66 c3 a9) → engine emitted 63 61 66 **c3 83 c2 a9**
+After: "café" (bytes 63 61 66 c3 a9) → engine emits 63 61 66 c3 a9 ✓
+
+Any Llama-3 / Qwen-style BPE output containing accented chars, non-ASCII
+punctuation, or emoji (via byte fallback) was getting silently double-encoded.
+Discovered via R5's A/B test surfacing the 'cafÃ©' artifact.
+
+Scope: direct byte codepoints U+00A1-U+00AC and U+00AE-U+00FF (GPT-2's
+"direct" byte mapping). Indirect bytes at U+0100+ were already handled.
+
+Regression: 12/12 PASS (+3 Metal tier) → 15/15 PASS unchanged.
+
 ## Phase 1 R5 — TQ_NO_Q4=1 quality/speed tradeoff — NOT flipping default (2026-04-21)
 
 Cross-model A/B on "Once upon a time" (short) vs "Once upon a time in a
```
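To make the before/after byte sequences concrete, here is a minimal standalone C sketch (illustrative only, not engine code) of the two decode strategies for the 'é' in "café": the naive path re-encodes the vocab codepoints U+00C3/U+00A9 back into UTF-8 and produces the four-byte 'Ã©' artifact, while reversing the mapping restores the original two bytes.

```c
/* Illustrative only, not engine code: the two decode strategies for
 * the 'é' (bytes c3 a9) in "café".  GPT-2's direct mapping stores a
 * raw byte b from the direct set as codepoint U+00b in the vocab. */
#include <stdio.h>

int main(void) {
    const unsigned char raw[] = {0xC3, 0xA9}; /* UTF-8 for 'é' */

    printf("naive: ");
    for (int i = 0; i < 2; i++) {
        unsigned cp = raw[i]; /* vocab codepoint == raw byte (direct set) */
        /* Bug: re-encoding cp as UTF-8 emits two bytes per codepoint. */
        printf("%02x %02x ", 0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    }
    printf(" -> c3 83 c2 a9 ('Ã©', double-encoded)\n");

    printf("fixed: ");
    for (int i = 0; i < 2; i++) {
        unsigned cp = raw[i];
        /* Fix: reverse the mapping and emit the raw byte itself. */
        printf("%02x ", cp);
    }
    printf("       -> c3 a9 ('é')\n");
    return 0;
}
```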

src/engine/tq_tokenizer.c

Lines changed: 8 additions & 1 deletion
```diff
@@ -1086,8 +1086,15 @@ static const char* decode_bpe_token(const char* piece) {
             decode_buf[out++] = (char)p[0];
             decode_buf[out++] = (char)p[1];
         }
+    } else if ((cp >= 0xA1 && cp <= 0xAC) || (cp >= 0xAE && cp <= 0xFF)) {
+        /* GPT-2 direct-byte mapping: codepoints U+00A1-U+00AC and
+         * U+00AE-U+00FF represent raw bytes of the same value. The
+         * BPE vocab stores these as UTF-8 (e.g. 'Ã' for byte 0xC3),
+         * so emit the raw byte to reconstruct the intended UTF-8
+         * character (e.g. 'Ã'+'©' → bytes 0xC3 0xA9 = 'é'). */
+        decode_buf[out++] = (char)(unsigned char)cp;
     } else {
-        /* Regular 2-byte UTF-8 char (e.g., accented letters) */
+        /* Regular 2-byte UTF-8 char (rare in GPT-2-style BPE) */
         decode_buf[out++] = (char)p[0];
         decode_buf[out++] = (char)p[1];
     }
```
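For context on why the patch tests exactly those two ranges, here is a hedged sketch of GPT-2's byte-to-unicode table and its reverse, mirroring bytes_to_unicode() from OpenAI's encoder.py; the table names (byte_to_cp, cp_to_byte) are hypothetical and not the engine's.

```c
/* Sketch of GPT-2's byte<->unicode mapping (table names hypothetical).
 * Direct bytes map to codepoints of the same value; the remaining 68
 * bytes are assigned U+0100, U+0101, ... in ascending byte order. */
#include <stdio.h>

static int byte_to_cp[256];   /* raw byte -> vocab codepoint       */
static int cp_to_byte[0x144]; /* vocab codepoint -> raw byte or -1 */

/* Direct set: printable ASCII plus U+00A1-U+00AC and U+00AE-U+00FF. */
static int is_direct(int b) {
    return (b >= 0x21 && b <= 0x7E) ||
           (b >= 0xA1 && b <= 0xAC) ||
           (b >= 0xAE && b <= 0xFF);
}

static void build_tables(void) {
    int n = 0;
    for (int c = 0; c < 0x144; c++) cp_to_byte[c] = -1;
    for (int b = 0; b < 256; b++) {
        /* Direct bytes keep their value; the 68 others shift to U+0100+. */
        int cp = is_direct(b) ? b : 0x100 + n++;
        byte_to_cp[b] = cp;
        cp_to_byte[cp] = b;
    }
}

int main(void) {
    build_tables();
    /* 0xC3 is direct: identity both ways, so decode emits one raw byte. */
    printf("0xC3 -> U+%04X -> 0x%02X\n", byte_to_cp[0xC3],
           cp_to_byte[byte_to_cp[0xC3]]);
    /* 0x20 (space) is indirect: it becomes U+0120 ('Ġ') in the vocab. */
    printf("0x20 -> U+%04X -> 0x%02X\n", byte_to_cp[0x20],
           cp_to_byte[byte_to_cp[0x20]]);
    return 0;
}
```

Under this table, reversing a direct codepoint is the identity (emit cp as a single byte), which is what the new else-if branch does; indirect codepoints at U+0100 and above need a lookup, which per the commit message the engine already performs via indirect_to_byte.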
