Commit 12e4d94
fix(tokenizer): Qwen3.6 BOS = <|endoftext|> (248044), not <|im_start|>
ROOT CAUSE FOUND for Qwen3.6-27B Tier 3 forward-pass divergence.
Investigation chain:
1. basin_compat showed L0 element-level sign flip (ours +0.25 vs llama -0.29)
2. Pre-norm input also sign-flipped (ours +0.0035 vs inferred llama -0.003)
3. Embedding lookup itself diverged at supposed-same token
4. Token IDs traced via TQ_DEBUG_TOKENS env: ours=[248045, 9419] but
   GGUF bos_token_id metadata = 248044 (<|endoftext|>)
5. vocab[248044] = '<|endoftext|>', vocab[248045] = '<|im_start|>'
6. tq_encode str_lookup chain hits <|im_start|> first (id 248045)
   before <|endoftext|> (id 248044) is checked → wrong BOS
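Step 6 of the chain can be reproduced in isolation. The sketch below is a toy model of a first-match lookup chain, not the code in tq_tokenizer.c: the VocabEntry layout, toy_str_lookup, toy_first_match_bos, TOY_VOCAB and TOY_CHAIN are all hypothetical names, and only the two token ids come from this commit message.

```c
#include <stddef.h>
#include <string.h>

/* Toy vocabulary entry. The real tokenizer's data structures are not shown
 * in this commit, so this layout is an assumption for illustration only. */
typedef struct {
    const char *text;
    int id;
} VocabEntry;

/* Hypothetical stand-in for str_lookup: linear search, -1 if absent. */
static int toy_str_lookup(const VocabEntry *vocab, size_t n, const char *s) {
    for (size_t i = 0; i < n; i++) {
        if (strcmp(vocab[i].text, s) == 0) return vocab[i].id;
    }
    return -1;
}

/* Return the id of the FIRST candidate that exists in the vocabulary.
 * The order of `candidates` alone decides which token becomes BOS. */
static int toy_first_match_bos(const VocabEntry *vocab, size_t n,
                               const char *const *candidates, size_t n_cand) {
    for (size_t i = 0; i < n_cand; i++) {
        int id = toy_str_lookup(vocab, n, candidates[i]);
        if (id >= 0) return id;
    }
    return -1;
}

/* Two-entry vocabulary using the ids reported in this commit message. */
static const VocabEntry TOY_VOCAB[] = {
    { "<|endoftext|>", 248044 },
    { "<|im_start|>",  248045 },
};

/* Candidate order that reproduces the bug: <|im_start|> is tried first. */
static const char *const TOY_CHAIN[] = { "<|im_start|>", "<|endoftext|>" };
```

Calling toy_first_match_bos(TOY_VOCAB, 2, TOY_CHAIN, 2) returns 248045 even though the metadata names 248044 as BOS, because both tokens exist in the vocabulary and the chain never consults bos_token_id.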
Fix:
- src/engine/tq_tokenizer.c: append <|endoftext|> to the BOS str_lookup
  chain. It is checked AFTER <|im_start|>, which keeps priority for
  backward compat with smaller Qwen models that use <|im_start|> as
  functional BOS.
- src/engine/tq_generate.c: for the Qwen3.6 family (vocab > 240K), detect
  presence of <|endoftext|> and override prompt_tokens[0] to that id.
  Bypasses the str_lookup ordering issue without breaking Qwen3-0.6B,
  Qwen3.5-4B, etc. (which have smaller vocabs and use the older convention).
- src/engine/tq_transformer.c: enhanced [dn-trace] output to include
  attn_norm first3+last3 and pre-norm input for paired-diff debugging.
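A minimal sketch of the vocab-gated override described for tq_generate.c, under stated assumptions: toy_override_bos and TOY_LARGE_VOCAB_THRESHOLD are hypothetical names, the threshold value is read off the "vocab > 240K" wording above, and the caller is assumed to have already looked up the <|endoftext|> id (passing -1 if the token is absent). The actual code in the commit may differ.

```c
#include <stddef.h>

/* Threshold taken from the commit message ("vocab > 240K"). The exact
 * constant used in tq_generate.c is not shown, so this value is assumed. */
#define TOY_LARGE_VOCAB_THRESHOLD 240000

/* Hypothetical helper: overwrite prompt_tokens[0] with the <|endoftext|> id
 * when the vocabulary is large and the token exists (endoftext_id >= 0).
 * Returns 1 if the first token was changed, 0 otherwise. */
static int toy_override_bos(int *prompt_tokens, size_t n_tokens,
                            int vocab_size, int endoftext_id) {
    if (prompt_tokens == NULL || n_tokens == 0) return 0;
    if (vocab_size <= TOY_LARGE_VOCAB_THRESHOLD) return 0; /* older convention */
    if (endoftext_id < 0) return 0;                        /* token not in vocab */
    if (prompt_tokens[0] == endoftext_id) return 0;        /* already correct */
    prompt_tokens[0] = endoftext_id;
    return 1;
}
```

Gating on vocabulary size leaves the lookup chain order untouched for smaller models, so only the large-vocab family has its first prompt token rewritten.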
Verified after fix:
Tokens: [248044, 9419] ✓ BOS (pos=0) matches llama
L0 attn_norm pos=0 (BOS): first3 = [-0.2891, -0.6430, 0.4991]
llama row 0 first3: [-0.2891, -0.6430, 0.4991] ✓ BIT-EXACT
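The first3 values above agree at the printed four decimals. A minimal sketch of checking the same first3+last3 window on raw floats, assuming both rows are available as plain arrays; toy_edge_max_abs_diff is a hypothetical helper and not part of the [dn-trace] code.

```c
#include <stddef.h>

/* Largest absolute difference over the first k and last k elements of two
 * rows of length n. A result of 0.0f means the compared elements are equal
 * in value. Hypothetical helper for paired-diff debugging. */
static float toy_edge_max_abs_diff(const float *a, const float *b,
                                   size_t n, size_t k) {
    float worst = 0.0f;
    if (k > n) k = n;
    for (size_t i = 0; i < n; i++) {
        if (i >= k && i < n - k) continue; /* skip the middle of the row */
        float d = a[i] - b[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

With k = 3 this covers the same six elements the trace prints; a sign flip like the +0.25 vs -0.29 seen before the fix shows up as a difference above 0.5.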
Remaining issue: pos=1 ("Hello", token id 9419) doesn't match llama,
which suggests llama tokenizes the start of the prompt with an implicit
space prefix ("ĠHello", id 21251). This is BPE pre-tokenizer behavior;
a separate fix is needed in the pre_tokenize_gpt2_bpe path and does not
block the BOS fix.
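For context on the "Ġ" in "ĠHello": in GPT-2 style byte-level BPE, each input byte is mapped to a printable Unicode code point before merges are applied, and the space byte 0x20 maps to U+0120 ("Ġ"). The sketch below shows only that mapping. The function names are hypothetical, and whether llama actually prepends a space for this model remains the hypothesis stated above, not something this code establishes.

```c
#include <stddef.h>

/* Map one byte to the code point used by GPT-2 style byte-level BPE.
 * Bytes in '!'..'~', 0xA1..0xAC and 0xAE..0xFF map to themselves; all other
 * bytes are assigned 256, 257, ... in increasing byte order. */
static unsigned toy_gpt2_byte_to_codepoint(unsigned char b) {
    unsigned remapped_below = 0; /* count of remapped bytes smaller than b */
    for (unsigned c = 0; c < 256; c++) {
        int keeps_value = (c >= 33 && c <= 126) ||
                          (c >= 161 && c <= 172) ||
                          (c >= 174 && c <= 255);
        if (c == b) return keeps_value ? c : 256u + remapped_below;
        if (!keeps_value) remapped_below++;
    }
    return 0; /* not reached: b is always in 0..255 */
}

/* Write the UTF-8 encoding of the mapped code points of src into dst.
 * Every mapped code point is below 0x800, so each needs at most 2 bytes;
 * dst must hold at least 2 * len + 1 bytes. Returns bytes written,
 * not counting the terminating NUL. */
static size_t toy_gpt2_byte_encode(const unsigned char *src, size_t len,
                                   char *dst) {
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned cp = toy_gpt2_byte_to_codepoint(src[i]);
        if (cp < 0x80) {
            dst[o++] = (char)cp;
        } else {
            dst[o++] = (char)(0xC0 | (cp >> 6));
            dst[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    dst[o] = '\0';
    return o;
}
```

Encoding the six bytes of " Hello" this way gives the UTF-8 bytes C4 A0 followed by "Hello", which is the string a byte-level vocabulary stores as "ĠHello". "Hello" without the leading space maps to itself, so the two are distinct vocabulary entries.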
Earlier verdict "Qwen3.6-27B is Tier 3, fundamental forward-pass bug"
was WRONG. The forward pass is correct; tokenization was the issue.
With BOS fix, L0 BOS row is bit-exact to llama. Real tier classification
requires re-running coh_bench after pre-tokenizer fix lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 0829285 commit 12e4d94
3 files changed
Lines changed: 53 additions & 2 deletions
Diff bodies not captured; per-file hunk ranges in page order:
| File | Original lines | New lines | Additions | Deletions |
|---|---|---|---|---|
| 1 | 356-365 | 356-406 | 41 | 0 |
| 2 | 1203-1215 | 1203-1218 | 4 | 1 |
| 3 | 702-708 | 702-715 | 8 | 1 |