Skip to content

Latest commit

 

History

History
85 lines (63 loc) · 3.55 KB

File metadata and controls

85 lines (63 loc) · 3.55 KB

BPE Root-Cause Fix — Before/After Proof (2026-04-20)

Pillar 1 R3 single-line fix to src/engine/tq_tokenizer.c:1442 eliminates the structural tokenization bug that caused every Qwen3 family coherence issue tracked across Rounds 26-50.

The fix

/* tq_tokenizer.c heap-based BPE merge loop */
  if (top.gen != gen[top.pos]) continue;
+ if (tokens[top.pos] < 0) continue;   // ★ missing dead-slot check
  int ri = next[top.pos];
  if (ri >= n_tokens || tokens[ri] < 0) continue;

Why it's the root cause: When a position dies as the RIGHT neighbor of a merge, tokens[P] = -1 but gen[P] is not bumped. Stale heap entries at position P pass the gen check, the code then overwrites tokens[P] with a new merge result, resurrecting a dead linked-list node and scrambling subsequent tokens.

Before/after token mismatch (Qwen3-0.6B, "Hello")

Tokens Decoded
HF reference (ground truth) [9707] "Hello"
Our engine BEFORE R3 [32713, 654] "Helll" (5 chars: H,e,l,l,l — 'o' replaced)
Our engine AFTER R3 [9707] "Hello"

Before/after model output (Qwen3.6-35B-A3B-UD-IQ4_XS)

Same 40-word prompt: "Write a Python function that computes the nth Fibonacci number using iterative dynamic programming. It should handle edge cases including negative numbers, zero, and very large inputs."

Output
BEFORE R3 UTF-8 garbage ("� Would you like to..."), or 5-token fragment then EOS
AFTER R3 Coherent Python code:
def fibonacci(n):
"""Return the nth Fibonacci number."""
if n < 0: raise ValueError("n must be non-negative")

Same 50-word prompt: "Once upon a time in a small village there lived a clever young programmer named Luna who was known throughout the kingdom for her extraordinary ability..."

Output
BEFORE R3 Char-doubling garbage ("quicck bbrrown")
AFTER R3 Full narrative: "The idea intrigued him so much that he decided to create his very own version of this classic game. He called it 'Hamster Run'..."

Cross-model impact

Model Prompt Before After
Qwen3-0.6B "Hello" "p('��..." garbage "Hello" token
Qwen3.6-35B IQ4_XS 40+ word code garbage perfect Python
Qwen3.6-35B Q5_K_M factual drift ≥25 tok clean EOS
Phi-3.5 Q4 "What is 2+2?" "I'm sorry but 'tti'..." "The sum of 2 and 2 is equal to four."
Phi-3.5 Q8 same same garbage same fix
Llama-3.2-3B long story PASS PASS (unaffected — different tokenizer quirk)

Regression suite

scripts/test_models.sh: 15/15 PASS after fix + expected-string update (Phi-3.5 "answer" → "sum" because model now gives actual math).

Methodology

  • Pillar 1 R1: Python HF reference env (Qwen3-0.6B FP32, torch 2.11)
  • Pillar 1 R2: HF per-layer hidden state dump tool
  • Pillar 1 R3: Token-level comparison revealed the bug before any hidden-state diff was needed.

The previous 30+ rounds (R26-R50) had assumed the tokenizer was correct (per R32 Mission C note "drift is Qwen-common, not tokenizer"). Reference diff methodology made the token mismatch undeniable in one print(t.encode("Hello")) call.

Lesson

Compare tokens first, then hidden states, then layer outputs. Don't "rule out" a suspect without actually comparing to ground truth.

Next

  • Pillar 1 complete
  • Pillar 2 (prefill speed for long docs) now unblocked
  • Pillar 3 (document Q&A / code review / agent workflows) now possible
  • v0.19.0 release with this as headline feature