
Commit 562fa34

unamedkr and claude committed
debug(logit): TQ_LOGIT_PROBE — diagnoses 35B long-gen residual collapse
User set 1000+ tokens coherent on Qwen3.6-35B as the headline breakthrough
metric. Current ceiling (post-v0.28.0 T=2.0): ~170 coherent tokens with
Q5_K_M + T=2.0 (no rep-penalty); alphabet-walk failure mode beyond.

Added TQ_LOGIT_PROBE=every=N — prints top-5 logits, token IDs, margin, and
softmax entropy every N positions.

Diagnosis on 35B -n 300 T=2.0:

  pos=25  entropy=1.02 margin=0.94 (normal)
  pos=125 entropy=0.29 margin=2.98 (very peaky — "Sorry!" loop starts)
  pos=200 entropy=1.24 margin=0.52 top5_ids=[87,68,86,85,83] ← consec IDs
  pos=225 entropy=0.21 margin=3.29

Crucial insight: the logits are NOT flat at degradation positions. They are
extremely peaky — the model is CONFIDENTLY outputting bad tokens. At pos=200,
the top-5 token IDs are consecutive single-char BPE tokens (83-87 range).
The model's residual stream has collapsed into a "single-character subspace";
lm_head projects confidently onto adjacent byte-tokens.

Reframes the remaining problem: NOT logit-space, but residual-space collapse.
Driven by either KV attention focusing on recent repetitive tokens, DeltaNet
state saturating into a low-rank attractor, or cumulative numerical drift
through 40 layers × position.

Ablation sweep:

  k-window=256:    alphabet-walk → "Sorry?" repetition (less catastrophic)
  k-window=64:     too narrow, degrades faster
  delta-reset=100: "2020 dragon" loop at 125 (too aggressive)
  delta-reset=150: identical to default
  Q5_K_M + T=2.0 (no rep-pen): ~170 coherent tokens (current peak)
  rep-penalty 1.3: no effect on the math-loop (only T=2.0 breaks it)

No fix this round — the 170-tok ceiling stands. Next attack vectors noted in
state.md R38 entry: per-layer residual rms dump, attention-weight dump at
long positions, periodic "residual refresh" re-embedding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent f047447 commit 562fa34

3 files changed

Lines changed: 103 additions & 0 deletions


.claude/state.md

Lines changed: 61 additions & 0 deletions
@@ -3,6 +3,67 @@
**Last updated**: 2026-04-22 (Phase 2 KV clean-bill)

**Session HEAD**: turbo_kv_4b per-arch per-layer clean-bill LANDED via chunked TQ_KV_PROBE. 7×/+0% PPL claim now validated element-by-element across Llama, Qwen3-0.6B, Qwen3.5-4B, Qwen3.6-35B.

## Phase 3 R38 — 1000-tok target diagnosis — logits peaky, residual collapse suspected (2026-04-22)

User-set breakthrough metric: **coherent generation to 1000+ tokens on
Qwen3.6-35B**. Current ceiling:

| Config | Coherent tokens | Failure mode |
|---|---|---|
| IQ4_XS, T=1.0 (pre-v0.28) | ~110 | "It could do math!" loop at 117 |
| IQ4_XS, T=2.0 | ~150 | "Sorry!" mini-loop → alphabet walk from 200 |
| IQ4_XS, T=2.0, --k-window 256 | ~150 | "Sorry?" repetition (no alphabet walk) |
| IQ4_XS, T=2.0, delta-reset-100 | ~125 | "2020 dragon" loop (reset too aggressive) |
| Q5_K_M, T=2.0 | **~170** | Longer narrative, still hits alphabet walk ~220 |

Peak = ~170 coherent tokens with Q5_K_M + T=2.0. Far from 1000.

### Logit probe findings (R38)

Added `TQ_LOGIT_PROBE=every=N`. Measured at pos=25..250 on T=2.0:

```
pos=25  entropy=1.02 margin=0.94 (confident, normal)
pos=50  entropy=2.08 margin=0.02 (low conf)
pos=100 entropy=2.31 margin=0.25 (low conf)
pos=125 entropy=0.29 margin=2.98 (VERY peaky — "Sorry!" loop starts)
pos=200 entropy=1.24 margin=0.52 top5_ids=[87,68,86,85,83] ← ALPHABET RUN
pos=225 entropy=0.21 margin=3.29 top5_ids=[13607,2005,515,271,3260]
pos=250 entropy=0.47 margin=2.45
```

**NOT logit flattening** (entropy is LOW = model very confident). The
top-5 token IDs at pos=200 are CONSECUTIVE low IDs (83-87 range =
single-char BPE tokens). Model is confidently predicting "next alphabet
character" tokens one after another.

### New hypothesis

Residual stream **collapses into a narrow subspace** at long positions.
The lm_head projection from that subspace happens to maximize on low-ID
single-character tokens. Possible drivers:

- KV attention output dominated by recent "Sorry!" / repetitive tokens
  → biases residual toward their value-projection direction
- DeltaNet state saturates into a low-rank attractor
- Small per-position numerical drift compounds through 40 layers × pos

### What this session CANNOT fix without deeper work

- 1000-tok coherence requires preventing the residual-collapse
  mechanism itself. Options for future rounds:
  1. Per-layer residual rms dump (confirm the magnitude/direction of collapse)
  2. Attention-weight dump at long positions (if attention focuses pathologically on a narrow slice)
  3. Periodic "residual refresh" (re-embed the generation so far at intervals, like kv-reset but softer)
  4. Try a different base model (Qwen3-Next-80B if 16GB swap allows, or wait for a smaller DeltaNet+MoE variant)

Landed `TQ_LOGIT_PROBE` infrastructure for future investigation.
Diagnostic-forward round, no fix this time — the 170-tok ceiling
stands post-session.

## ★ META-INSIGHTS (distilled from 2026-04-21→22 35-round session, keep for future sessions) ★

### Five durable patterns

docs/env_vars.md

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ here is opt-in; defaults are the tested production path.
| `TQ_MOE_ROUTE_TEMP` | `1.0` (auto-flipped to `2.0` on qwen35moe arch at load time) | Softmax temperature on top-K expert routing. **`2.0` extends Qwen3.6-35B coherence from 117 → 200+ tokens** on the "Once upon a time" drift-trigger prompt (measured R26). Auto-detected for qwen35moe at model load (see also `TQ_NO_MOE_TEMP_AUTO`). Other arch default stays `1.0` (identity). Trade: slightly less decisive routing = slightly broader expert mix, but top-K set unchanged. `"Paris"` factual probe still correct at T=2.0 |
| `TQ_NO_MOE_TEMP_AUTO` | off | Disable the qwen35moe auto-default flip. Use if you want the prior baseline T=1.0 behavior on Qwen3.6-35B |
| `TQ_KV_PROBE` | off | Dump per-layer K quantization roundtrip stats (rms, MSE, cosine) at positions 0/25/50/100/200. Measured cos ≥ 0.994 uniformly across Llama-3.x (head_dim=64), Qwen3-0.6B (128), Qwen3.5-4B (256), Qwen3.6-35B (256). No arch drift, no position drift. The probe chunks calls into TQ_BK-sized blocks to match how production handles head_dim > TQ_BK (R34 fix) |
| `TQ_LOGIT_PROBE` | off | `every=N` prints top-5 logits, their token IDs, margin top1-top2, and softmax entropy at every N-th position (default N=25). Used R38 to diagnose 35B long-gen alphabet-walk: at pos=200+, top-5 token IDs become consecutive (e.g. [83,86,87,85]) with very peaky logits (entropy ~0.2 nats, margin ~3). Not logit flattening — the residual stream collapses into a "single-character subspace" that the lm_head projects onto adjacent byte tokens |

## Quality / correctness

src/engine/tq_transformer.c

Lines changed: 41 additions & 0 deletions
@@ -3355,6 +3355,47 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
  if (g_tq_profile_enabled) g_profile.lmhead_ns += tq_now_ns() - _tp_lm;
  tq_dump_hidden("logits", s->logits, c->vocab_size, pos);

  /* TQ_LOGIT_PROBE=every=N — print top-5 logits + entropy at every N-th
   * position. Detects logit-collapse / flattening at long positions
   * (suspected cause of alphabet-walk degradation on Qwen3.6-35B beyond
   * the 117-tok cliff). */
  {
    const char* _lp = getenv("TQ_LOGIT_PROBE");
    if (_lp) {
      int every = 0;
      const char* eq = strstr(_lp, "every=");
      if (eq) every = atoi(eq + 6);
      if (every <= 0) every = 25;
      if ((pos % every) == 0 || pos < 4) {
        /* Selection-insert the 5 largest logits with their token IDs */
        float top[5]; int top_idx[5];
        for (int k = 0; k < 5; k++) { top[k] = -1e30f; top_idx[k] = -1; }
        for (int i = 0; i < c->vocab_size; i++) {
          float v = s->logits[i];
          for (int k = 0; k < 5; k++) {
            if (v > top[k]) {
              for (int j = 4; j > k; j--) { top[j] = top[j-1]; top_idx[j] = top_idx[j-1]; }
              top[k] = v; top_idx[k] = i;
              break;
            }
          }
        }
        /* Softmax entropy (nats), max-subtracted for stability */
        float maxl = top[0];
        double Z = 0, H = 0;
        for (int i = 0; i < c->vocab_size; i++) Z += expf(s->logits[i] - maxl);
        for (int i = 0; i < c->vocab_size; i++) {
          double p = expf(s->logits[i] - maxl) / Z;
          if (p > 1e-30) H -= p * log(p);
        }
        fprintf(stderr, "[logit-probe] pos=%d top5_logits=[%.3f,%.3f,%.3f,%.3f,%.3f] top5_ids=[%d,%d,%d,%d,%d] margin_1_to_2=%.3f entropy=%.3f nats\n",
                pos, top[0], top[1], top[2], top[3], top[4],
                top_idx[0], top_idx[1], top_idx[2], top_idx[3], top_idx[4],
                top[0]-top[1], H);
      }
    }
  }

  if (pos <= 1 && getenv("TQ_DEBUG")) {
    /* Print top-5 logits for debugging */
    fprintf(stderr, "[DEBUG] pos=%d logits[0:8] = ", pos);
