
Commit 51768b2

unamedkr and claude committed
★★★ research: R41 Qwen3-Next architecture — correct hypothesis from refs + web
User called out that I'd been relying on empirical ablation without understanding the model. Switched to research mode. Key findings (from refs/llama.cpp/src/models/qwen35moe.cpp + qwen3next.cpp + vLLM blog + Qwen papers + Gated Attention NeurIPS 2025):

1. Qwen3-Next = 75% DeltaNet + 25% attention (3:1), not 1:1. For the 40-layer 35B, only 10 attention layers.
2. Gated Attention (NeurIPS 2025, Qwen team paper) is THE long-context stabilizer — head-wise sigmoid(gate) on SDPA output REPLACES attention sinks. If our engine doesn't apply this gate correctly on those 10 attention layers, long-gen drift is architecturally predicted. A single Q projection outputs 2× dim (Q + gate); post-attn multiplies by sigmoid(gate). See qwen35moe.cpp:156, 186-189.
3. Instruct vs Thinking are DIFFERENT checkpoints requiring DIFFERENT chat templates. Instruct must NOT be primed with <think>. Ours always primes an empty <think>\n\n</think>\n\n — potentially OOD if our 35B is Instruct.
4. Gated-DeltaNet's known failure modes (ICLR 2025 paper): α saturation at ~1.0, compression bottleneck of the fixed-size state.
5. DRY sampler (oobabooga PR #5677) is the community standard for hybrid loop-collapse — we don't have it.

New hypotheses, ranked:
- H1: attn_output_gate missing/buggy in our self_attn_forward
- H2: chat template mismatch for the Instruct variant
- H3: DeltaNet α saturation (R26-29 attacked it, not verified beyond 200 tok)

Plan: audit attn_output_gate line-by-line vs qwen35moe.cpp:129-189 (R42), fix the confirmed bug (R43), port the DRY sampler as a safety net (R44).

Methodology lesson: reference > introspection. Research is cheaper than 15+ empirical rounds that leave us uncertain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent f148bde commit 51768b2

1 file changed

Lines changed: 79 additions & 0 deletions

File tree

.claude/state.md

@@ -3,6 +3,85 @@
**Last updated**: 2026-04-22 (Phase 2 KV clean-bill)
**Session HEAD**: turbo_kv_4b per-arch per-layer clean-bill LANDED via chunked TQ_KV_PROBE. 7×/+0% PPL claim now validated element-by-element across Llama, Qwen3-0.6B, Qwen3.5-4B, Qwen3.6-35B.

## ★★★ Phase 3 R41 — ARCHITECTURE RESEARCH BREAKTHROUGH (2026-04-22) ★★★

User callout (correct): "relying too much on experiments without properly
understanding the model's characteristics". Switched to deep research
(refs/ + web) before more experiments. Documented findings below.

### What we MISUNDERSTOOD about Qwen3.6-35B-A3B / Qwen3-Next

1. **3:1 hybrid ratio, not 1:1**. 75% DeltaNet + 25% attention layers.
   For 40-layer 35B: only **10 attention layers**, 30 DeltaNet. Attention
   is the MINORITY. (Source: vLLM blog, Qwen tech reports.)
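
   A minimal sketch of what the 3:1 split implies for the 40-layer model, as a
   hedged illustration: only the 30/10 count comes from the sources above; the
   "one attention layer per 4-layer block" placement is an assumption to check
   against the GGUF / llama.cpp reference, not a confirmed layout.

   ```python
   # Hypothetical layer layout for the 40-layer 35B (placement is an assumption;
   # only the 30/10 split is sourced).
   N_LAYERS = 40
   layer_kind = ["attention" if (i + 1) % 4 == 0 else "deltanet"
                 for i in range(N_LAYERS)]
   assert layer_kind.count("attention") == 10   # attention is the minority path
   assert layer_kind.count("deltanet") == 30
   ```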

2. **Gated Attention (NeurIPS 2025, Qwen team)** is THE key stabilizer.
   Head-wise `sigmoid(Wg·x)` applied elementwise to SDPA output — replaces
   attention sinks, enables 1M-ctx training. If our engine doesn't apply
   this gate correctly, attention layers lack sink-mitigation → long-gen
   drift is PREDICTED by the architecture.
   - Qwen single Q projection outputs 2× dim: `Q` + `gate` split along dim 0
   - Post-attn: `attn_out = attn_out × sigmoid(gate)`
   - Our config has `c->attn_output_gate` field — need to verify impl works
   - Refs: `refs/llama.cpp/src/models/qwen35moe.cpp:156, 186-189`
   - Refs: `refs/llama.cpp/src/models/qwen3next.cpp:165-172`
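
   A minimal single-head numpy sketch of the gating as we understand it from
   those refs (shapes, head handling, and softmax details are simplified; the
   two parts to audit in our `self_attn_forward` are the 2×-dim fused Q
   projection and the `sigmoid(gate)` multiply on the SDPA output):

   ```python
   import numpy as np

   def sigmoid(x):
       return 1.0 / (1.0 + np.exp(-x))

   def gated_attn(x, Wqg, Wk, Wv, Wo):
       """x: (T, d_model). Wqg projects to 2*d: fused query + output gate."""
       d = Wk.shape[1]
       qg = x @ Wqg                        # (T, 2d)
       q, gate = qg[:, :d], qg[:, d:]      # split into Q and the output gate
       k, v = x @ Wk, x @ Wv
       scores = (q @ k.T) / np.sqrt(d)
       scores += np.triu(np.full((len(x), len(x)), -np.inf), k=1)  # causal mask
       probs = np.exp(scores - scores.max(-1, keepdims=True))
       probs /= probs.sum(-1, keepdims=True)
       attn = probs @ v
       attn = attn * sigmoid(gate)         # the post-SDPA gate under test (H1)
       return attn @ Wo
   ```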

3. **Instruct vs Thinking are DIFFERENT checkpoints** with DIFFERENT
   expected chat templates:
   - Instruct: NO `<think>` priming; HF card says "does not generate
     `<think></think>` blocks"
   - Thinking: DOES use `<think>\n` open
   - **Our default primes empty `<think>\n\n</think>\n\n`** — if our
     35B GGUF is actually the Instruct variant, we're feeding OOD input
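
   A small illustration of the priming difference (the ChatML-style wrapper is
   an approximation of the Qwen template family; verify against the
   chat_template actually stored in the GGUF metadata rather than trusting the
   strings below):

   ```python
   # Hypothetical assistant-turn priming variants (verify against the GGUF).
   prefix = ("<|im_start|>user\nWrite a short essay on tides.<|im_end|>\n"
             "<|im_start|>assistant\n")

   thinking_prime = prefix + "<think>\n"                 # Thinking checkpoint
   instruct_prime = prefix                               # Instruct: no <think> priming
   our_current    = prefix + "<think>\n\n</think>\n\n"   # what we send today (H2: OOD for Instruct)
   ```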

4. **qwen35moe uses IMRoPE (multi-section); qwen3next uses standard RoPE**.
   We need to verify which one our 35B is, and that our dispatch is correct.

5. **Known Gated-DeltaNet failure modes** (ICLR 2025 paper):
   - Fixed-size recurrent state is a compression bottleneck
   - α (decay) near 1.0 causes state saturation — numerical accumulation
     in β/α is "known-fragile"
   - Hybridization with full-attention layers is the REMEDY the paper
     explicitly proposes. Qwen3-Next adopts this.
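
   A toy numpy illustration of the saturation concern (the update below is a
   deliberately simplified stand-in for the gated delta rule, not our kernel):
   with per-step decay α pinned near 1.0 the fixed-size state stops forgetting
   and its magnitude keeps accumulating with position, while a slightly smaller
   α keeps it bounded.

   ```python
   import numpy as np

   rng = np.random.default_rng(0)
   d = 64
   for alpha in (0.999999, 0.98):              # near-saturated vs healthy decay
       S = np.zeros((d, d))                    # fixed-size recurrent state
       for t in range(2000):                   # well past the ~200 tok we verified
           k = rng.standard_normal(d)
           k /= np.linalg.norm(k)
           v = rng.standard_normal(d)
           S = alpha * S + np.outer(v, k)      # simplified decayed rank-1 update
       print(f"alpha={alpha}: ||S||_F = {np.linalg.norm(S):.1f}")
   ```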

6. **DRY sampler** (Dynamic N-gram repetition penalty, oobabooga PR #5677)
   is the community-standard mitigation for loop-collapse in hybrid archs.
   llama.cpp has it, our engine doesn't.
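
   A minimal sketch of the DRY idea, not a port of the llama.cpp code
   (parameter names mirror the commonly used defaults; the R44 port should
   follow the reference implementation): if appending a candidate token would
   extend a repeat of the current context tail of length n ≥ allowed_length,
   subtract multiplier · base^(n − allowed_length) from that token's logit.

   ```python
   def dry_penalty(tokens, candidate, multiplier=0.8, base=1.75, allowed_length=2):
       """Simplified DRY: penalty to subtract from `candidate`'s logit."""
       longest = 0
       for i in range(len(tokens)):
           if tokens[i] != candidate:
               continue
           # length of the match between the context tail and the run ending at i-1
           n = 0
           while n < i and tokens[i - 1 - n] == tokens[len(tokens) - 1 - n]:
               n += 1
           longest = max(longest, n)
       if longest < allowed_length:
           return 0.0
       return multiplier * base ** (longest - allowed_length)
   ```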

### New hypotheses (architecture-grounded, not empirical guessing)

Ranked by strength of evidence:

**H1 (★ highest prior)**: our `attn_output_gate` implementation is
missing or buggy on the 10 attention layers → no sink mitigation →
long-gen attention drift → the "Sorry!" / alphabet-walk attractors we've
been seeing in R26/R38/R40 without root-cause explanation.

**H2**: our GGUF is the Instruct variant (non-thinking), but our chat
template primes empty `<think>\n\n</think>\n\n` → out-of-distribution
input → trained attractors engaged incorrectly.

**H3**: DeltaNet α saturation in specific heads at specific positions.
Rounds 26-29 already attacked this with exact expf, but we didn't verify
α stays bounded at pos 500+.
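
A hypothetical instrumentation sketch for closing that gap (the array name and
shape are placeholders, not our actual symbols): record the per-head decay
during a long generation and flag anything that pins to 1.0 past position 500.

```python
import numpy as np

def find_saturated_alpha(alpha_log, start_pos=500, tol=1e-4):
    """alpha_log: (positions, layers, heads) per-step decay α recorded from the
    DeltaNet kernel (placeholder; wire it to the real forward pass).
    Returns (pos, layer, head) rows where α has effectively saturated at 1.0."""
    tail = np.asarray(alpha_log)[start_pos:]
    hits = np.argwhere(tail >= 1.0 - tol)
    if hits.size:
        hits[:, 0] += start_pos           # convert back to absolute positions
    return hits
```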

### Plan (not-more-experiments-first, research-ground-all-changes)

- R42: GGUF metadata check (Instruct vs Thinking) + audit
  `attn_output_gate` in our `self_attn_forward` — line-by-line vs
  qwen35moe.cpp:129-189
- R43: if H1 or H2 is confirmed, fix and re-measure 1000-tok target
- R44: port DRY sampler from llama.cpp as belt-and-suspenders
- R45: final validation

### Methodology lesson saved

This is the third confirmation of the meta-insight we added to
`MEMORY.md`: **reference > introspection**. BPE and MoE bugs fell in
2-3 rounds once we had a reference to diff against. 35B long-gen stayed
mysterious for 15+ rounds because we had no architectural reference
handy. Research makes experiments targeted.
84+

## Phase 3 R40 — Meaningful prompt + thinking-mode still hits NEW attractor (2026-04-22)

User follow-up: "Shouldn't we also be generating meaningful questions so that the model produces meaningful long sentences?"
