Commit 88ed094

unamedkr and claude committed
★★★ state: R24 breakthrough — drift is MoE×DeltaNet, NOT DeltaNet alone
Qwen3.5-4B (DeltaNet + dense FFN, no MoE) on the EXACT 35B drift-trigger prompt
"Once upon a time in a faraway land" -n 200 T=0:
→ 200 coherent tokens about Lily the explorer, Wizard Wigglesworth, math puzzles
  (5×3=15), multiple story beats, NO repetition loop.

35B (DeltaNet + MoE 256-expert K=8) on the same prompt:
→ 117 tokens → "It could do math! It could do math!" loop.

All prior rounds R16-R19 assumed DeltaNet state was the sole drift cause. WRONG.
DeltaNet works fine without MoE. The 117-tok cliff emerges from the *interaction*:
DeltaNet carries the "math math" semantic state, MoE top-K routing locks onto
experts that amplify it, positive feedback loop.

Memory task #192 (MoE router softmax sanity at long positions) now the leading
investigation.

Next: instrument top-K entropy + expert histogram at positions 50/100/115/120 on
the 35B drift prompt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 61f7ac0 commit 88ed094

1 file changed

Lines changed: 30 additions & 0 deletions

.claude/state.md

@@ -3,6 +3,36 @@
**Last updated**: 2026-04-21 (Phase 1 refparity ★)
**Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.

## ★★★ Phase 1 R24 — Drift is MoE×DeltaNet interaction, NOT DeltaNet alone (2026-04-21) ★★★
Ran Qwen3.5-4B Q4_K_M (dense FFN + DeltaNet hybrid, **no MoE**) on the
exact drift-trigger prompt "Once upon a time in a faraway land" -n 200:

```
…Lily the explorer met Wizard Wigglesworth who challenges her with
math puzzles. She solves 5×3=15, reasons aloud, continues confidently
through multiple story beats. 200 tokens, zero repetition loop.
```
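
A minimal sketch of how "zero repetition loop" could be checked mechanically rather than by eye: scan the generated token IDs for the earliest position where a short block repeats back-to-back. Everything here is hypothetical (the function name `find_loop_onset`, the `max_period` and `min_repeats` thresholds, and the assumption that the engine can dump token IDs); it is not part of the engine or of tools/refparity/.

```python
from typing import Optional, Sequence


def find_loop_onset(
    tokens: Sequence[int],
    max_period: int = 16,   # longest repeating block to look for (assumed)
    min_repeats: int = 3,   # back-to-back copies required to call it a loop (assumed)
) -> Optional[int]:
    """Return the earliest index where a block of 1..max_period tokens
    repeats at least min_repeats times in a row, or None if no such loop."""
    n = len(tokens)
    for start in range(n):
        for period in range(1, max_period + 1):
            end = start + period * min_repeats
            if end > n:
                break  # longer periods would run past the end as well
            block = list(tokens[start:start + period])
            # Compare every following copy against the first block.
            if all(
                list(tokens[start + k * period:start + (k + 1) * period]) == block
                for k in range(1, min_repeats)
            ):
                return start
    return None
```

On a coherent run this returns `None`; on a looping run it returns the index where the loop starts, which can then be compared with the reported cliff position. Legitimate short repeats in prose can also trigger it, so the thresholds would need tuning.
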
35B (DeltaNet + MoE) hits the "It could do math!" loop at token 117. 4B
(DeltaNet + dense FFN) stays coherent for the full 200 tokens on the same
prompt.

**All prior rounds (R16-R19) assumed DeltaNet state was the sole cause.
This is wrong.** DeltaNet works fine without MoE. The 117-token cliff
emerges from the *interaction* between MoE routing and DeltaNet's
persistent state, not from either in isolation.

**New hypothesis**: MoE top-K expert routing becomes pathological at
long positions (either collapsing to a stuck subset of experts, or
routing on a DeltaNet-state-driven signal that locks in a loop). The
DeltaNet state holds the "math math math" semantic; MoE keeps selecting
the experts that most agree with that signal, which closes a positive
feedback loop.

Memory task #192 ("MoE router weight softmax sanity at long positions")
becomes the leading follow-up. Concrete next step: instrument the
router's top-K entropy and expert-selection histogram at positions
50/100/115/120 on the 35B drift-trigger prompt.

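
The instrumentation step above could be prototyped offline along these lines. This is a sketch under stated assumptions: that per-position router logits for one MoE layer can be dumped from the engine, and that routing is softmax followed by top-K with renormalised weights (K=8 over 256 experts, per the commit message). The names `topk_route`, `entropy_nats`, and `router_report` are hypothetical, and the logits in the demo are synthetic placeholders, not measurements.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def topk_route(logits: Sequence[float], k: int) -> Tuple[List[int], List[float]]:
    """Pick the k highest-scoring experts and return (indices, weights).

    Renormalising the top-k softmax weights is equivalent to a softmax over
    the k selected logits, so the full-softmax denominator is never needed.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    order = sorted(range(len(logits)), key=lambda i: exps[i], reverse=True)[:k]
    total = sum(exps[i] for i in order)
    return order, [exps[i] / total for i in order]


def entropy_nats(weights: Sequence[float]) -> float:
    """Shannon entropy of a weight vector, in nats."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def router_report(
    logits_by_pos: Dict[int, Sequence[float]], k: int = 8
) -> Tuple[Dict[int, float], Counter]:
    """Per-position top-k entropy plus an expert-selection histogram."""
    entropy: Dict[int, float] = {}
    hist: Counter = Counter()
    for pos, logits in sorted(logits_by_pos.items()):
        experts, weights = topk_route(logits, k)
        entropy[pos] = entropy_nats(weights)
        hist.update(experts)  # count how often each expert is selected
    return entropy, hist


if __name__ == "__main__":
    # Synthetic inputs only: a flat router and a router dominated by one expert.
    n_experts = 256
    flat = [0.0] * n_experts
    peaked = [30.0 if i == 3 else 0.0 for i in range(n_experts)]
    entropy, hist = router_report({50: flat, 120: peaked}, k=8)
    print(entropy)
    print(hist.most_common(3))
```

A top-K entropy near ln(8) ≈ 2.08 nats means the eight selected experts share weight evenly; a value near 0 means one expert dominates. A histogram that concentrates on the same few experts at positions 115 and 120 but not at 50 would be consistent with the stuck-subset variant of the hypothesis.
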
## Phase 1 R19 — Single-layer reset is not enough — drift is distributed (2026-04-21)

Added `TQ_DELTA_RESET_LAYER=N` env to bisect which DeltaNet layer drives
