# Qwen3.6-35B 117-token Cliff BREAK — MoE Router Softmax Temperature (2026-04-22)

A single one-line env flag — `TQ_MOE_ROUTE_TEMP=2.0` — eliminates the
"It could do math! It could do math!" repetition cliff that capped
Qwen3.6-35B-A3B coherent generation at 117 tokens across 40+ prior
debug rounds.

## The fix

The step-3 softmax in `src/engine/tq_moe.c::tq_moe_route`:

```diff
  float sum_exp = 0.0f;
+ static float route_temp = 0.0f;
+ if (route_temp == 0.0f) {
+     const char* s = getenv("TQ_MOE_ROUTE_TEMP");
+     route_temp = (s && atof(s) > 0.0f) ? (float)atof(s) : 1.0f;
+ }
+ float inv_temp = 1.0f / route_temp;
  for (int k = 0; k < num_active; k++) {
      if (out_expert_ids[k] < 0) { out_expert_weights[k] = 0.0f; continue; }
-     float e = expf(logits[out_expert_ids[k]] - max_val);
+     float e = expf((logits[out_expert_ids[k]] - max_val) * inv_temp);
      out_expert_weights[k] = e;
      sum_exp += e;
  }
```

## Temperature sweep (Qwen3.6-35B-A3B-UD-IQ4_XS, T=0)

Prompt: `"Once upon a time in a faraway land"`, `-n 200`.

| TEMP | Coherent tokens | Loop content | Continuation |
|---:|---:|:---|:---|
| 1.0 (default) | ~95 | "It could do math!" | Alex/ENIAC story collapses at 117 |
| 1.5 | ~75 | "and everything went wrong!" | Cliff earlier — peakier in some heads |
| 1.8 | ~90 | "And that's why we have the Internet!" | Still within the trap |
| **2.0** | **~150 coherent** | none detected | Alex + sad tree story, full -n budget |
| **2.5** | **~150 coherent** | none detected | Alex + magic-leaves story, full -n budget |
| 3.0 | ~95 | "The sun would rise too!" | Over-flat — wrong expert mix |

Sweet spot: **T=2.0 to 2.5**. Outside that band the cliff returns:
it arrives earlier below 2.0, and a different repetition trap appears
above 2.5.

## Why it works (causal story)

1. Each MoE token selects top-K=8 experts out of 256. The softmax
   output weights determine how much each of the 8 contributes.
2. At default T=1.0 the softmax gets **peaky at long positions**: one
   or two experts take 60-80% of the mass (measured in R25; L4 hit a
   0.812 top-1 share at token 100 on this prompt).
3. DeltaNet's recurrent state carries semantic state through the
   decode. When MoE routing concentrates on a narrow expert set, that
   set's bias projection feeds back into the residual stream
   repetitively, the DeltaNet state self-reinforces, and the resulting
   **positive feedback loop** locks onto a repeating phrase.
4. T=2.0 spreads the softmax output: the top-1 share drops, competing
   experts contribute more, and no single expert's bias dominates the
   residual → the loop can't form.

The 4B dense-hybrid model (Qwen3.5-4B, DeltaNet + dense FFN, no MoE)
does NOT drift on the same prompt — R24 isolated this. That confirms
the drift is a MoE-specific pathology, not DeltaNet's fault.

## What T=2.0 does NOT fix

- Tail quality degrades from ~150 to 300 tokens into character-level
  noise (alphabet-walking "'a'b'c'd'e") on longer `-n 500` runs;
  probably quantization error compounding with DeltaNet state
  accumulation.
- A specific "Sorry!" mini-loop appears around 170 tokens at T=2.0.
  It doesn't trigger the engine's rep-loop detector but is
  human-visible.

So: T=2.0 **breaks the hard 117-tok cliff** and recovers ~50 additional
coherent tokens. Full essay-length generation still needs more work.

## Safety

- `"The capital of France is"` → `"Paris."` (correct) at T=2.0
- `bash scripts/test_models.sh` → **23/23 PASS** with T=2.0
  (15 coherence + 8 BPE-stale-entry + 3 BPE-UTF-8 direct-byte, no diff)

## Recommended user config

Best Qwen3.6-35B recipe on a 16 GB Mac today:

```bash
TQ_MOE_ROUTE_TEMP=2.0 \
  ./build/quant models/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf \
  -p "<your prompt>" -n 200 -T 0 --rep-penalty 1.3
```

Combine with the `Q5_K_M` GGUF for best quality (200-tok coherent
range) and `--rep-penalty 1.3` as belt-and-suspenders.

## The arc

- R1-R19: "Drift is DeltaNet state" → R19's single-layer reset
  bisection proves this is NOT the cause
- R24: the 4B dense hybrid works fine → drift is MoE-specific
- R25: MoE probe → L4 single-expert collapse at long positions
- **R26**: softmax temperature ablation → **cliff broken at T=2.0**

Total investigation: 26 rounds. The actual fix: 5 lines of C.

See also:
- `docs/env_vars.md` — `TQ_MOE_ROUTE_TEMP` row
- `docs/supported_models_tier.md` — 35B recipe updated
- `.claude/state.md` — R16-R26 reasoning chain