★ debug(logit): EOS rank diagnosis reframes 35B 1000-tok problem

unamedkr · claude · unamedkr · commit c551293aac82 · 2026-04-22T08:43:50.000+09:00
User asked: "Could the failure happen at termination time — EOS token
issue?" Adding EOS rank to TQ_LOGIT_PROBE answered decisively.

On Qwen3.6-35B, "Once upon a time in a faraway land", -n 275, T=2.0:

  pos=25   EOS rank 511  (mid-narrative, irrelevant)
  pos=100  EOS rank 65
  pos=125  EOS rank 47   (Sorry! loop starts)
  pos=175  EOS rank 13   (alphabet walk starts)
  pos=250  EOS rank 6    (alphabet walk continues)

EOS rank climbs 511 → 6 through the degradation. The model IS trying
to terminate with increasing confidence, but top1 always wins at T=0
by 3-7 logits. So "alphabet walk" is the model stuck between wanting
to stop and being forced to output another token.

This reframes the 1000-tok target:
- 6-word "Once upon a time" prompt naturally merits ~150-200 tokens;
  beyond that the model signals EOS increasingly strongly
- Substantive --chat prompt with Qwen3-Thinking template emits EOS
  IMMEDIATELY (empty &lt;think&gt; block is malformed, model responds with
  just &lt;|im_end|&gt;)

Neither is a quant/DeltaNet/MoE bug. The 1000-tok headline metric
requires chat-template work (fill &lt;think&gt; block or use non-thinking
branch) OR a base-completion prompt with scaffold that makes 1000
tokens in-distribution.

Saved user's one-line insight as permanent memory
(feedback_eos_rank_diagnosis.md + MEMORY.md index): "Before chasing
residual-collapse as the cause, check EOS rank first — cheap,
often decisive."

Co-Authored-By: Claude Opus 4.7 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/.claude/state.md b/.claude/state.md
@@ -3,6 +3,62 @@
 **Last updated**: 2026-04-22 (Phase 2 KV clean-bill)
 **Session HEAD**: turbo_kv_4b per-arch per-layer clean-bill LANDED via chunked TQ_KV_PROBE. 7×/+0% PPL claim now validated element-by-element across Llama, Qwen3-0.6B, Qwen3.5-4B, Qwen3.6-35B.
 
+## ★ Phase 3 R39 — EOS rank diagnosis reframes the 1000-tok problem ★
+
+**User insight**: "혹시 종료할 시점에 종료를 하지 못해서 발생하는건 아닌지?" —
+could the degenerate output be the model unable to emit EOS?
+
+Added EOS rank to `TQ_LOGIT_PROBE` output. Qwen3.6-35B UD-IQ4_XS,
+"Once upon a time in a faraway land", T=2.0 (auto-default), -n 275:
+
+| pos | EOS rank | top1-EOS logit gap | observable |
+|---:|---:|---:|:---|
+| 25 | 511 | 17.5 | normal narrative |
+| 100 | 65 | 8.1 | normal |
+| 125 | 47 | 12.7 | "Sorry!" loop starts |
+| 175 | **13** | 6.0 | alphabet walk begins |
+| 200 | **13** | 5.9 | alphabet walk |
+| 250 | **6** | 6.7 | alphabet walk continues |
+
+**EOS rank climbs 511 → 6 through the degradation**. The model IS
+signaling termination with increasing confidence, but top1 always wins
+at T=0 by 3-7 logits. So the "alphabet walk" is the model **stuck
+between wanting to stop and being forced to output another token**.
+
+### Reframing the 1000-tok problem
+
+The 6-word prompt "Once upon a time in a faraway land" doesn't merit
+1000 coherent tokens. The model's natural answer is ~150-200 tokens of
+narrative then EOS. Forcing `-n 1000` on it means most of those tokens
+are post-EOS-attempt confusion.
+
+Tested with substantive prompt + `--chat`:
+- Qwen3.6-Thinking-Instruct chat template primes `<think>\n\n</think>\n\n`
+- Result: 0 tokens generated (model emits EOS immediately because
+  empty `<think>` block is malformed — needs actual reasoning content
+  in thinking mode)
+
+So 1000+ coherent tokens on 35B requires one of:
+1. Chat-template work: let model generate filled `<think>...</think>`
+   block before the response, OR use a non-thinking branch of Qwen3.6
+2. A base-completion prompt with enough structural scaffolding (system
+   + instruction + expected format) that 1000 tokens is in-distribution
+
+Neither is a quantization / DeltaNet / MoE bug. Both are product/ergonomics
+work on the chat pipeline.
+
+### Diagnostic deliverable
+
+`TQ_LOGIT_PROBE` now reports EOS rank per probe position — a cheap,
+decisive check. From `feedback_eos_rank_diagnosis.md`:
+
+> Rule: when a model emits degenerate output at long positions, BEFORE
+> assuming residual-space collapse / quantization drift / KV corruption,
+> ask: *is EOS rank climbing toward top-1?* If yes, the model is trying
+> to terminate.
+
+Saved as memory `feedback_eos_rank_diagnosis.md` + indexed in MEMORY.md.
+
 ## Phase 3 R38 — 1000-tok target diagnosis — logits peaky, residual collapse suspected (2026-04-22)
 
 User-set breakthrough metric: **coherent generation to 1000+ tokens on
diff --git a/src/engine/tq_transformer.c b/src/engine/tq_transformer.c
@@ -3238,6 +3238,28 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
             char _slot[16]; snprintf(_slot, sizeof(_slot), "h%d", l);
             tq_dump_hidden(_slot, s->x, dim, pos);
         }
+        /* TQ_RESIDUAL_PROBE=every=N prints per-layer s->x rms + max-abs at
+         * every N-th position. Used to localize residual-stream collapse
+         * that drives 35B long-gen alphabet-walk (R38 diagnosis). */
+        {
+            const char* _rp = getenv("TQ_RESIDUAL_PROBE");
+            if (_rp) {
+                int every = 0;
+                const char* eq = strstr(_rp, "every=");
+                if (eq) every = atoi(eq + 6);
+                if (every <= 0) every = 25;
+                if ((pos % every) == 0 && pos > 0) {
+                    double ss = 0; float mx = 0;
+                    for (int i = 0; i < dim; i++) {
+                        ss += (double)s->x[i] * s->x[i];
+                        float a = fabsf(s->x[i]);
+                        if (a > mx) mx = a;
+                    }
+                    fprintf(stderr, "[res-probe] pos=%d L%d rms=%.3f max_abs=%.3f\n",
+                            pos, l, (float)sqrt(ss / dim), mx);
+                }
+            }
+        }
         /* Post-layer processing: PLE, layer_output_scale.
          * GPU graph path jumps here after full-layer GPU forward. */
 
@@ -3388,10 +3410,29 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
                     double p = expf(s->logits[i] - maxl) / Z;
                     if (p > 1e-30) H -= p * (log(p));
                 }
-                fprintf(stderr, "[logit-probe] pos=%d top5_logits=[%.3f,%.3f,%.3f,%.3f,%.3f] top5_ids=[%d,%d,%d,%d,%d] margin_1_to_2=%.3f entropy=%.3f nats\n",
+                /* EOS rank + logit: is the model trying to stop but getting
+                 * overruled by a peakier wrong token? Qwen3.6 EOS=248046,
+                 * Qwen3.x-thinking may use <|im_end|>=151645 etc.
+                 * We check a few common IDs and report the max-logit one. */
+                int eos_candidates[] = {248046, 248044, 151645, 128001, 128009, 2};
+                int n_eos = sizeof(eos_candidates)/sizeof(eos_candidates[0]);
+                float eos_logit = -1e30f; int eos_id = -1;
+                for (int e = 0; e < n_eos; e++) {
+                    int id = eos_candidates[e];
+                    if (id >= 0 && id < c->vocab_size && s->logits[id] > eos_logit) {
+                        eos_logit = s->logits[id]; eos_id = id;
+                    }
+                }
+                /* Compute EOS rank: how many tokens have higher logit than EOS */
+                int eos_rank = 0;
+                if (eos_id >= 0) {
+                    for (int i = 0; i < c->vocab_size; i++)
+                        if (s->logits[i] > eos_logit) eos_rank++;
+                }
+                fprintf(stderr, "[logit-probe] pos=%d top5_logits=[%.3f,%.3f,%.3f,%.3f,%.3f] top5_ids=[%d,%d,%d,%d,%d] margin=%.3f entropy=%.3f eos_id=%d eos_logit=%.3f eos_rank=%d\n",
                         pos, top[0], top[1], top[2], top[3], top[4],
                         top_idx[0], top_idx[1], top_idx[2], top_idx[3], top_idx[4],
-                        top[0]-top[1], H);
+                        top[0]-top[1], H, eos_id, eos_logit, eos_rank);
             }
         }
     }