
Commit 562fa34

unamedkr and claude committed
debug(logit): TQ_LOGIT_PROBE — diagnoses 35B long-gen residual collapse
User set 1000+ tokens coherent on Qwen3.6-35B as the headline breakthrough
metric. Current ceiling (post-v0.28.0 T=2.0): ~170 coherent tokens with
Q5_K_M + T=2.0 (no rep-penalty); alphabet-walk failure mode beyond.

Added TQ_LOGIT_PROBE=every=N — prints top-5 logits, token IDs, margin, and
softmax entropy every N positions.

Diagnosis on 35B -n 300 T=2.0:

  pos=25  entropy=1.02 margin=0.94 (normal)
  pos=125 entropy=0.29 margin=2.98 (very peaky — "Sorry!" loop starts)
  pos=200 entropy=1.24 margin=0.52 top5_ids=[87,68,86,85,83] ← consec IDs
  pos=225 entropy=0.21 margin=3.29

Crucial insight: the logits are NOT flat at degradation positions. They are
extremely peaky — the model is CONFIDENTLY outputting bad tokens. At pos=200,
the top-5 token IDs are consecutive single-char BPE tokens (83-87 range).
The model's residual stream has collapsed into a "single-character subspace";
lm_head projects confidently onto adjacent byte-tokens.

Reframes the remaining problem: NOT logit-space, but residual-space collapse.
Driven by either KV attention focusing on recent repetitive tokens, DeltaNet
state saturating into a low-rank attractor, or cumulative numerical drift
through 40 layers × position.

Ablation sweep:

  k-window=256:    alphabet-walk → "Sorry?" repetition (less catastrophic)
  k-window=64:     too narrow, degrades faster
  delta-reset=100: "2020 dragon" loop at 125 (too aggressive)
  delta-reset=150: identical to default
  Q5_K_M + T=2.0 (no rep-pen): ~170 coherent tokens (current peak)
  rep-penalty 1.3: no effect on the math-loop (only T=2.0 breaks it)

No fix this round — the 170-tok ceiling stands. Next attack vectors noted in
state.md R38 entry: per-layer residual rms dump, attention-weight dump at
long positions, periodic "residual refresh" re-embedding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent f047447 commit 562fa34

3 files changed

Lines changed: 103 additions & 0 deletions


.claude/state.md

Lines changed: 61 additions & 0 deletions
@@ -3,6 +3,67 @@
**Last updated**: 2026-04-22 (Phase 2 KV clean-bill)

**Session HEAD**: turbo_kv_4b per-arch per-layer clean-bill LANDED via chunked TQ_KV_PROBE. 7×/+0% PPL claim now validated element-by-element across Llama, Qwen3-0.6B, Qwen3.5-4B, Qwen3.6-35B.

## Phase 3 R38 — 1000-tok target diagnosis — logits peaky, residual collapse suspected (2026-04-22)

User-set breakthrough metric: **coherent generation to 1000+ tokens on
Qwen3.6-35B**. Current ceiling:

| Config | Coherent tokens | Failure mode |
|---|---|---|
| IQ4_XS, T=1.0 (pre-v0.28) | ~110 | "It could do math!" loop at 117 |
| IQ4_XS, T=2.0 | ~150 | "Sorry!" mini-loop → alphabet walk from 200 |
| IQ4_XS, T=2.0, --k-window 256 | ~150 | "Sorry?" repetition (no alphabet walk) |
| IQ4_XS, T=2.0, delta-reset-100 | ~125 | "2020 dragon" loop (reset too aggressive) |
| Q5_K_M, T=2.0 | **~170** | Longer narrative, still hits alphabet walk ~220 |

Peak = ~170 coherent tokens with Q5_K_M + T=2.0. Far from 1000.

### Logit probe findings (R38)

Added `TQ_LOGIT_PROBE=every=N`. Measured at pos=25..250 on T=2.0:

```
pos=25  entropy=1.02 margin=0.94 (confident, normal)
pos=50  entropy=2.08 margin=0.02 (low conf)
pos=100 entropy=2.31 margin=0.25 (low conf)
pos=125 entropy=0.29 margin=2.98 (VERY peaky — "Sorry!" loop starts)
pos=200 entropy=1.24 margin=0.52 top5_ids=[87,68,86,85,83] ← ALPHABET RUN
pos=225 entropy=0.21 margin=3.29 top5_ids=[13607,2005,515,271,3260]
pos=250 entropy=0.47 margin=2.45
```

**NOT logit flattening** (entropy is LOW = model very confident). The
top-5 token IDs at pos=200 are CONSECUTIVE low IDs (83-87 range =
single-char BPE tokens). Model is confidently predicting "next alphabet
character" tokens one after another.

### New hypothesis

Residual stream **collapses into a narrow subspace** at long positions.
The lm_head projection from that subspace happens to maximize on low-ID
single-character tokens. Possible drivers:

- KV attention output dominated by recent "Sorry!" / repetitive tokens
  → biases residual toward their value-projection direction
- DeltaNet state saturates into a low-rank attractor
- Small per-position numerical drift compounds through 40 layers × pos

### What this session CANNOT fix without deeper work

- 1000-tok coherence requires preventing the residual-collapse
  mechanism itself. Options for future rounds:
  1. Per-layer residual rms dump (confirm the magnitude/direction of collapse)
  2. Attention-weight dump at long positions (if attention focuses pathologically on a narrow slice)
  3. Periodic "residual refresh" (re-embed the generation so far at intervals, like kv-reset but softer)
  4. Try a different base model (Qwen3-Next-80B if 16GB swap allows, or wait for a smaller DeltaNet+MoE variant)

Landed `TQ_LOGIT_PROBE` infrastructure for future investigation.
Diagnostic-forward round, no fix this time — the 170-tok ceiling
stands post-session.

## ★ META-INSIGHTS (distilled from 2026-04-21→22 35-round session, keep for future sessions) ★

### Five durable patterns

docs/env_vars.md

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ here is opt-in; defaults are the tested production path.
| `TQ_MOE_ROUTE_TEMP` | `1.0` (auto-flipped to `2.0` on qwen35moe arch at load time) | Softmax temperature on top-K expert routing. **`2.0` extends Qwen3.6-35B coherence from 117 → 200+ tokens** on the "Once upon a time" drift-trigger prompt (measured R26). Auto-detected for qwen35moe at model load (see also `TQ_NO_MOE_TEMP_AUTO`). Other arch default stays `1.0` (identity). Trade: slightly less decisive routing = slightly broader expert mix, but top-K set unchanged. `"Paris"` factual probe still correct at T=2.0 |
| `TQ_NO_MOE_TEMP_AUTO` | off | Disable the qwen35moe auto-default flip. Use if you want the prior baseline T=1.0 behavior on Qwen3.6-35B |
| `TQ_KV_PROBE` | off | Dump per-layer K quantization roundtrip stats (rms, MSE, cosine) at positions 0/25/50/100/200. Measured cos ≥ 0.994 uniformly across Llama-3.x (head_dim=64), Qwen3-0.6B (128), Qwen3.5-4B (256), Qwen3.6-35B (256). No arch drift, no position drift. The probe chunks calls into TQ_BK-sized blocks to match how production handles head_dim > TQ_BK (R34 fix) |
| `TQ_LOGIT_PROBE` | off | `every=N` prints top-5 logits, their token IDs, margin top1-top2, and softmax entropy at every N-th position (default N=25). Used R38 to diagnose 35B long-gen alphabet-walk: at pos=200+, top-5 token IDs become consecutive (e.g. [83,86,87,85]) with very peaky logits (entropy ~0.2 nats, margin ~3). Not logit flattening — the residual stream collapses into a "single-character subspace" that the lm_head projects onto adjacent byte tokens |

## Quality / correctness

src/engine/tq_transformer.c

Lines changed: 41 additions & 0 deletions
@@ -3355,6 +3355,47 @@ float* tq_forward(tq_model_t* model, tq_state_t* s, int token, int pos) {
  if (g_tq_profile_enabled) g_profile.lmhead_ns += tq_now_ns() - _tp_lm;
  tq_dump_hidden("logits", s->logits, c->vocab_size, pos);

  /* TQ_LOGIT_PROBE=every=N — print top-5 logits + entropy at every N-th
   * position. Detects logit-collapse / flattening at long positions
   * (suspected cause of alphabet-walk degradation on Qwen3.6-35B beyond
   * the 117-tok cliff). */
  {
    const char* _lp = getenv("TQ_LOGIT_PROBE");
    if (_lp) {
      int every = 0;
      const char* eq = strstr(_lp, "every=");
      if (eq) every = atoi(eq + 6);
      if (every <= 0) every = 25;
      if ((pos % every) == 0 || pos < 4) {
        /* Selection-insert the 5 largest logits with their token IDs */
        float top[5]; int top_idx[5];
        for (int k = 0; k < 5; k++) { top[k] = -1e30f; top_idx[k] = -1; }
        for (int i = 0; i < c->vocab_size; i++) {
          float v = s->logits[i];
          for (int k = 0; k < 5; k++) {
            if (v > top[k]) {
              for (int j = 4; j > k; j--) { top[j] = top[j-1]; top_idx[j] = top_idx[j-1]; }
              top[k] = v; top_idx[k] = i;
              break;
            }
          }
        }
        /* Softmax entropy (nats), max-subtracted for stability */
        float maxl = top[0];
        double Z = 0, H = 0;
        for (int i = 0; i < c->vocab_size; i++) Z += expf(s->logits[i] - maxl);
        for (int i = 0; i < c->vocab_size; i++) {
          double p = expf(s->logits[i] - maxl) / Z;
          if (p > 1e-30) H -= p * log(p);
        }
        fprintf(stderr, "[logit-probe] pos=%d top5_logits=[%.3f,%.3f,%.3f,%.3f,%.3f] top5_ids=[%d,%d,%d,%d,%d] margin_1_to_2=%.3f entropy=%.3f nats\n",
                pos, top[0], top[1], top[2], top[3], top[4],
                top_idx[0], top_idx[1], top_idx[2], top_idx[3], top_idx[4],
                top[0]-top[1], H);
      }
    }
  }

  if (pos <= 1 && getenv("TQ_DEBUG")) {
    /* Print top-5 logits for debugging */
    fprintf(stderr, "[DEBUG] pos=%d logits[0:8] = ", pos);
