
Commit d287f1a

unamedkr and claude committed
research(r42): R41 hypotheses cross-validated — all known fixes already applied
Systematic check of six architecture-grounded hypotheses against
refs/llama.cpp/src/models/qwen35moe.cpp + GGUF metadata:

✓ attn_output_gate layout & sigmoid match per-head interleaved
✓ chat template correctly handles Thinking + Instruct via enable_thinking
✓ QK-norm OFF empirically beats ON (force_on crashes at 80 tok)
✓ RMSNorm uses raw `w` in both our impl and ggml_build_norm
✓ Qwen3.6 rope is NEOX (dimension_sections=[0]), not IMRoPE
✓ partial_rotary_factor=0.25 hardcoded for hybrid arch

All six patchable candidates are already correctly implemented. What
remains is architectural, not fixable at our layer: Gated DeltaNet
α-saturation (the ICLR 2025 paper's noted fragility). When trained
a_log values are very negative, decay ≈ 1 and those heads have no
per-step state decay. Over 1000 steps × 30 DeltaNet layers,
quantization + FP-summation noise compounds geometrically. The
paper's remedy is hybridization with attention — Qwen3-Next does 25%.
But that compensation depends on attention being numerically perfect;
under Q4/Q5 weights + quantized KV, it isn't. Long-gen drift on
quantized 35B is architecturally predicted.

Path forward — what we CAN ship: R43 ports the DRY sampler
(llama.cpp has it; we don't), a pattern-level rep penalty. It
directly addresses the Sorry!/requirements loops via sampling-time
intervention. It does NOT fix residual-collapse or α-saturation, but
it breaks the repetition attractors at the only layer we control.

Methodology lesson: research validated the session's empirical
decisions (QK-norm OFF was correct, TEMP=2.0 was correct). The
1000-tok target itself may be out of reach without FP32 inference or
upstream arch changes — honest ceiling documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 51768b2 commit d287f1a

1 file changed


File tree

.claude/state.md

Lines changed: 43 additions & 0 deletions
@@ -3,6 +3,49 @@
**Last updated**: 2026-04-22 (Phase 2 KV clean-bill)
**Session HEAD**: turbo_kv_4b per-arch per-layer clean-bill LANDED via chunked TQ_KV_PROBE. 7×/+0% PPL claim now validated element-by-element across Llama, Qwen3-0.6B, Qwen3.5-4B, Qwen3.6-35B.


## Phase 3 R42 — Cross-validation of H1-H3 hypotheses from R41 (2026-04-22)

All six architecture-grounded hypotheses from R41 cross-verified
against refs/llama.cpp and GGUF metadata:

| Hypothesis | Check method | Result |
|---|---|---|
| attn_output_gate wrong | layout vs qwen35moe.cpp:129-190 | CORRECT (per-head interleaved matches) |
| chat template OOD | `enable_thinking` Jinja branches | CORRECT (both modes supported) |
| QK-norm should be ON | empirical A/B (force on vs off) | OFF is better (on = 80-tok crash, off = 170 tok) |
| "1+w" zero-centered norm | ggml_build_norm source | raw `w` matches our impl |
| IMRoPE (multi-section) | GGUF dimension_sections | `[0]` = NEOX, not IMRoPE |
| partial_rotary missing | grep partial_rotary_factor | 0.25 hardcoded for hybrid arch |
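
To make the last two RoPE rows concrete, here is a minimal sketch
(not our implementation) of NEOX-style rotation with
`partial_rotary_factor=0.25`; head_dim=128 is an assumption for
illustration, so only the first 32 dims of each head rotate:

```python
import numpy as np

def neox_partial_rope(x, pos, base=10000.0, partial_rotary_factor=0.25):
    """Sketch: NEOX RoPE over only the first quarter of one head.

    x: (head_dim,) vector for a single head at one position.
    Dims beyond rotary_dim pass through unrotated -- the
    partial_rotary_factor=0.25 behavior noted in the table above.
    """
    head_dim = x.shape[-1]
    rot = int(head_dim * partial_rotary_factor)        # 128 -> 32
    half = rot // 2
    inv_freq = base ** (-2.0 * np.arange(half) / rot)  # theta_i = base^(-2i/rot)
    angle = pos * inv_freq
    cos, sin = np.cos(angle), np.sin(angle)
    # NEOX pairs dim i with dim i + rot/2 (not adjacent pairs, GPT-J style)
    x1, x2, rest = x[:half], x[half:rot], x[rot:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos,
                           rest])

q = np.random.randn(128)
assert np.allclose(neox_partial_rope(q, pos=0), q)  # position 0 is identity
```
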
**All known easily-patchable candidates already correctly implemented.**
### What remains — architectural

**DeltaNet α saturation** (ICLR 2025 Gated DeltaNet paper, §3 fragility):
When `a_log` weights train to very negative values, `-exp(a_log) ≈ 0`,
so `gate ≈ 0` and `decay = exp(gate) ≈ 1`. Those heads have NO
per-step state decay. Over 1000 generation steps × 30 DeltaNet
layers, any numerical noise (quantization, FP summation order)
compounds geometrically. The paper's own remedy for this is
hybridization with some full-attention — which Qwen3-Next adopts
(25%). But the compensation depends on the attention layers being
NUMERICALLY PERFECT. Under Q4/Q5 weight quantization + a quantized KV
cache, they aren't. So long-gen drift on quantized Qwen3.6-35B is a
predictable consequence of the architecture's known fragility meeting
our quantization stack.
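
A back-of-the-envelope check of the saturation claim. `a_log = -8` is
an assumed trained value; the `decay = exp(-exp(a_log))` form is the
simplified chain above with dt/headwise terms dropped:

```python
import math

a_log = -8.0                         # assumed very-negative trained value
gate = -math.exp(a_log)              # ≈ -0.000335, so gate ≈ 0
decay = math.exp(gate)               # ≈ 0.999665: effectively no decay
print(f"decay = {decay:.6f}")

eps = 1e-3                           # assumed per-step quantization noise
for d, label in [(0.99, "healthy head (decay=0.99)"),
                 (decay, "saturated head")]:
    err = 0.0
    for _ in range(1000):            # 1000 generation steps
        err = d * err + eps          # each step injects noise, damped by d
    print(f"{label}: accumulated error ≈ {err:.3f}")
```

The healthy head's error stays bounded near eps/(1-d) = 0.1; the
saturated head accumulates roughly linearly (~0.85 here, ~8× worse),
and that is before the 30-layer stacking described above.
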
**Implication**: the 1000-tok target may require either:
1. DRY sampler (external — pattern-level rep penalty, llama.cpp has it)
2. FP16/FP32 inference (not feasible on 16 GB Mac for 35B)
3. A smaller hybrid variant with less long-memory head budget
4. Upstream architectural changes we can't make

R43 plan: port DRY sampler. It's the only mitigation we can ship
that directly addresses pattern-level loops (Sorry/requirements).
Does NOT fix residual-collapse or α-saturation but does break
repetition attractors at sampling time — empirically effective on
hybrid models per community reports.
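
A minimal sketch of the DRY mechanism to port. The parameters
`multiplier`, `base`, `allowed_length` follow the common DRY
convention; this simplified O(n²) scan omits llama.cpp's sequence
breakers and its optimized suffix matching:

```python
def dry_penalty(logits, context, multiplier=0.8, base=1.75, allowed_length=2):
    """Sketch: penalize tokens that would extend a repeated suffix.

    logits: mutable array indexed by token id; context: token ids so far.
    If context[:i] ends with the same suffix the full context ends with,
    then context[i] is the token that historically continued that suffix;
    penalize it exponentially in the matched suffix length.
    """
    n = len(context)
    for i in range(1, n):
        match = 0  # length of common suffix of context[:i] and context
        while match < i and context[i - 1 - match] == context[n - 1 - match]:
            match += 1
        if match >= allowed_length:
            tok = context[i]
            logits[tok] -= multiplier * base ** (match - allowed_length)
    return logits
```

At the common defaults (multiplier 0.8, base 1.75), a suffix already
repeated for 10 tokens draws a logit penalty of 0.8 × 1.75^8 ≈ 70,
effectively vetoing the next token of the loop.
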
## ★★★ Phase 3 R41 — ARCHITECTURE RESEARCH BREAKTHROUGH (2026-04-22) ★★★

User callout (correct): "without properly understanding the model's characteristics
