Commit d287f1a
research(r42): R41 hypotheses cross-validated — all known fixes already applied
Systematic check of six architecture-grounded hypotheses against
refs/llama.cpp/src/models/qwen35moe.cpp + GGUF metadata:
✓ attn_output_gate: per-head interleaved layout & sigmoid application match
✓ chat template correctly handles Thinking + Instruct via enable_thinking
✓ QK-norm OFF empirically beats ON (force_on crashes at 80 tok)
✓ RMSNorm uses raw `w` in both our impl and ggml_build_norm (sketch below)
✓ Qwen3.6 rope is NEOX (dimension_sections=[0]), not IMRoPE
✓ partial_rotary_factor=0.25 hardcoded for hybrid arch
All six patchable candidates already correctly implemented.
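
For reference, a minimal sketch of what "raw `w`" means in the RMSNorm check above, contrasted with the (1 + w) offset variant some architectures use. Illustrative C++ only; the function name, shapes, and eps value are assumptions, not code from either implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative RMSNorm over one hidden vector.
// Raw-w form (what our impl and ggml_build_norm apply):
//     y[i] = w[i] * x[i] / rms(x)
// The offset variant some architectures use instead:
//     y[i] = (1.0f + w[i]) * x[i] / rms(x)
std::vector<float> rmsnorm_raw_w(const std::vector<float>& x,
                                 const std::vector<float>& w,
                                 float eps = 1e-6f) {
    double sum_sq = 0.0;
    for (float v : x) sum_sq += (double)v * v;
    const float inv_rms = 1.0f / std::sqrt((float)(sum_sq / x.size()) + eps);

    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = w[i] * x[i] * inv_rms;  // raw w, no (1 + w) offset
    }
    return y;
}
```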
What remains is architectural, not fixable at our layer:
Gated DeltaNet α-saturation (the fragility noted in the ICLR 2025 paper):
when the trained a_log values are very negative, decay ≈ 1 and those
heads have no per-step state decay. Over 1000 steps × 30 DeltaNet
layers, quantization + FP-summation noise compounds geometrically.
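
A minimal numeric sketch of that saturation, assuming the Mamba2/GDN-style parameterization decay = exp(-exp(a_log) * dt); the dt value and the exact parameter names are assumptions for illustration, not values read from the checkpoint.

```cpp
#include <cmath>
#include <cstdio>

// Assumed per-head decay parameterization (illustration only):
//   A     = -exp(a_log)
//   decay = exp(A * dt)
// For very negative a_log, exp(a_log) ~ 0, so decay ~ 1.0: the head's
// recurrent state is never attenuated, and any quantization / summation
// error injected into it is carried forward essentially unchanged.
int main() {
    const float dt = 1.0f;  // assumed per-step gate value, for illustration
    const float a_logs[] = {0.0f, -2.0f, -6.0f, -12.0f};
    for (float a_log : a_logs) {
        const float decay = std::exp(-std::exp(a_log) * dt);
        // e.g. a_log = -12 gives decay ~ 0.99999386 per step
        std::printf("a_log = %6.1f  ->  per-step decay = %.8f\n", a_log, decay);
    }
    // Error injected at step t still contributes decay^(1000 - t) of its
    // magnitude at step 1000; for decay ~ 1 the state effectively never forgets.
    return 0;
}
```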
The paper's remedy is hybridization with attention; Qwen3-Next
uses attention for 25% of layers. But that compensation depends on
attention being numerically perfect. Under Q4/Q5 weights + quantized
KV, it isn't. Long-gen drift on quantized 35B is architecturally
predicted.
Path forward — what we CAN ship:
R43: port DRY sampler (llama.cpp has it; we don't). Pattern-level
rep penalty. Directly addresses the Sorry!/requirements loops via
sampling-time intervention. Does NOT fix residual-collapse or
α-saturation but breaks the repetition attractors at the only
layer we control.
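
A minimal sketch of the DRY mechanism intended for R43, assuming the usual multiplier * base^(match_len - allowed_len) penalty form; the parameter defaults, logits interface, and O(n²) suffix scan below are placeholders rather than the llama.cpp port (which uses a more efficient matching scheme).

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Pattern-level repetition penalty in the spirit of DRY ("don't repeat yourself").
// For each candidate token z, find the longest suffix S of ctx such that S
// followed by z already occurs earlier in ctx; if |S| >= allowed_len, subtract
// multiplier * base^(|S| - allowed_len) from z's logit.
void dry_penalize(std::vector<float>& logits,
                  const std::vector<int32_t>& ctx,
                  float multiplier  = 0.8f,    // placeholder defaults
                  float base        = 1.75f,
                  int   allowed_len = 2) {
    const int n = (int)ctx.size();
    if (n < 2) return;

    // match_len[z] = longest already-seen suffix that emitting z would extend.
    std::unordered_map<int32_t, int> match_len;
    for (int i = 0; i < n - 1; ++i) {
        if (ctx[i] != ctx[n - 1]) continue;   // earlier occurrence of the last token
        int len = 1;
        while (i - len >= 0 && len < n - 1 && ctx[i - len] == ctx[n - 1 - len]) ++len;
        const int32_t next = ctx[i + 1];      // token that followed it last time
        auto it = match_len.find(next);
        if (it == match_len.end() || it->second < len) match_len[next] = len;
    }

    for (const auto& [tok, len] : match_len) {
        if (len < allowed_len || tok < 0 || tok >= (int)logits.size()) continue;
        logits[tok] -= multiplier * std::pow(base, (float)(len - allowed_len));
    }
}
```

Applied once per decode step on the raw logits, before temperature and top-k/top-p, so it only reshapes the distribution where a pattern-level repetition would be extended.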
Methodology lesson: research validated the session's empirical
decisions (QK-norm OFF was correct, TEMP=2.0 was correct). The
1000-tok target itself may be out of reach without FP32 inference
or upstream arch changes — honest ceiling documented.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>