
Commit b8291c3

unamedkr and claude committed
state: R44 — 1000-tok hunt summary, 4B vs 35B 5x coherent-window gap measured
R43 shared-expert Q4-skip fix: +20% on short prompts (170→204 tok). Real landed quality gain.

R44 measurement with the same 63-tok prefill prompt:
- Qwen3.5-4B dense hybrid: 347 coherent gen tokens (+ 63 prefill = 410 total)
- Qwen3.6-35B MoE hybrid: 65 coherent gen tokens (+ 63 prefill = 128 total)

4B is 5× better. Both have DeltaNet; only the 35B has MoE. The remaining degradation is MoE-internal and is not fixed by: KV cache quant off, rep-penalty, k-window, router temp sweep (1.0-2.5 all fail differently), TQ_NO_Q4, or any easily patchable path we've audited.

Still-unvetted candidates: routed-expert runtime dispatch accumulator, MoE output aggregation order (8 experts × 40 layers × 500 tokens of summation drift), DeltaNet a_log in Q4_K form, LM head Q8_0 matmul at long positions.

1000-tok coherent NOT achieved this session, but there is a concrete direction: run llama.cpp on the same 35B + prompt to establish the absolute achievable ceiling, then surgically target the residual MoE precision gap. DRY sampler as an external safety net.

Methodology: the user's Q4-commonness pushback was the key to R43's fix. The same approach will be needed to find the remaining MoE bug — compare directly against a reference, don't speculate from symptoms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d0508a8 commit b8291c3

1 file changed

Lines changed: 75 additions & 0 deletions

File tree

.claude/state.md

@@ -3,6 +3,81 @@
**Last updated**: 2026-04-22 (Phase 2 KV clean-bill)

**Session HEAD**: turbo_kv_4b per-arch per-layer clean-bill LANDED via chunked TQ_KV_PROBE. 7×/+0% PPL claim now validated element-by-element across Llama, Qwen3-0.6B, Qwen3.5-4B, Qwen3.6-35B.

## Phase 3 R44 — 1000-tok hunt: engine-gap confirmed, shared-expert fix +20%, remaining gap is MoE-internal (2026-04-22)
User pushback ("Q4 is common — why fail?") correctly identified an engine bug, not an architecture limit. R43 fix: disable shared-expert double-quant for the qwen35moe arch. +20% coherent window on short prompts (170→204 tok).
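One plausible shape of that skip rule, as a minimal sketch: a name-based tensor filter in the conversion path that keeps shared-expert weights out of the Q4 pass. The arch string, tensor-name suffix, and function name are assumptions for illustration, not the engine's actual identifiers.

```c
/* Hypothetical illustration of the R43-style fix: when converting a
 * qwen35moe checkpoint, leave shared-expert FFN tensors out of the Q4 pass
 * so they are not quantized a second time. All identifiers are assumptions,
 * not the engine's real names. */
#include <stdbool.h>
#include <string.h>

static bool skip_q4_for_tensor(const char *arch, const char *tensor_name) {
    if (strcmp(arch, "qwen35moe") != 0)
        return false;                          /* rule only applies to the MoE hybrid arch */
    /* shared-expert weights keep their source precision */
    return strstr(tensor_name, "shexp") != NULL;
}
```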
R44 investigation: what else is engine-specific?
### Quantitative 4B-vs-35B gap measurement (63-tok prefill, n=500, same prompt)
| Model | Post-fixes coherent gen | Total position |
|---|---:|---:|
| Qwen3.5-4B (dense hybrid, no MoE) | 347 tokens | 410 |
| Qwen3.6-35B UD-IQ4_XS (MoE hybrid) | ~65 tokens | 128 |
| Qwen3.6-35B UD-Q5_K_M (MoE hybrid) | ~65 tokens | ~128 |
| **Ratio** | **4B is 5× better** | |
Both hybrids have DeltaNet. Only 35B has MoE. Both hit a loop eventually, but 35B hits it MUCH sooner. Some MoE-specific precision loss remains.
### TEMP sweep on long prefill (revealing distinct attractors)
| TEMP | behavior |
|---:|:---|
| 1.0 | "3 4 5 6 7 8..." number-sequence crash in 10 tok |
| 1.5 | "A dragon. A dragon" loop at 66 tok |
| 2.0 | `**The End.**` — model emits EOS (prompt feels complete) |
| 2.5 | "898989" corruption at 58 tok |
Each TEMP reveals a different failure mode. No temperature fully fixes the 35B MoE + long-prefill situation.
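For reference, the knob being swept is a softmax temperature: the logits are divided by TEMP before the softmax, so higher values flatten the distribution (whether the target is the router's expert logits or sampling logits, the operation is the same), which is why each setting falls into a different attractor. A generic sketch, not the engine's own code:

```c
/* Generic temperature-scaled softmax over a logit vector. This is only what
 * a TEMP sweep varies; identifiers are illustrative, not the engine's. */
#include <math.h>

static void softmax_with_temp(const float *logits, float *probs, int n, float temp) {
    float max_l = logits[0];
    for (int i = 1; i < n; i++)
        if (logits[i] > max_l) max_l = logits[i];   /* subtract max for numerical stability */
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        probs[i] = expf((logits[i] - max_l) / temp); /* TEMP > 1 flattens, TEMP < 1 sharpens */
        sum += probs[i];
    }
    for (int i = 0; i < n; i++)
        probs[i] /= sum;
}
```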
### What's NOT the cause (eliminated)
- KV cache quantization (`-k none` same output)
- TQ_NO_Q4=1 conversion bypass (only catches non-MoE weights)
- Router softmax temp (swept, none fix)
- rep-penalty (no effect on MoE loops)
- k-window (no effect)
- Shared expert double-quant (fixed in R43, +20% but not full)
- Attention output gate (matches llama.cpp exactly)
- chat template (both Thinking and Instruct modes handled)
- QK-norm OFF for hybrid (empirically beats ON)
- partial_rotary (hardcoded 0.25 for hybrid)
- attention layers' Q8_0 conversion (auto-skipped)
### What's still suspect
The engine has 20+ code paths that could each hide a subtle precision difference vs llama.cpp. Candidates still unvetted:

- Routed-expert runtime dispatch (mmap on-the-fly GGUF dequant — should match llama.cpp, but accumulator precision details are unchecked)
- MoE output aggregation: `output[i] += weight[k] * expert_out[k]`, per expert then per layer. FP32 throughout, but 8 experts × 40 layers × 500 tokens leaves room for summation-order drift (see the accumulation sketch after this list)
- DeltaNet β, α computation with the a_log weight in Q4_K form
- LM head logits computation at long positions (Q8_0 matmul accumulator)
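If summation-order drift in the expert mix is a real contributor, a cheap probe is to accumulate the same output element twice, once along the current plain-FP32 path and once with compensated (Kahan) summation, and log the divergence against token position. A minimal sketch under that assumption; identifiers are illustrative, not the engine's actual ones:

```c
/* Accumulate one element of the MoE expert mix two ways and return the
 * absolute difference, so drift can be logged per token and layer.
 * expert_out[k] stands for the k-th routed expert's output for this element. */
#include <math.h>

static float moe_mix_drift(const float *expert_out, const float *weight, int n_expert_used) {
    /* current path: plain FP32 accumulation, order-dependent */
    float plain = 0.0f;
    for (int k = 0; k < n_expert_used; k++)
        plain += weight[k] * expert_out[k];

    /* reference: Kahan-compensated accumulation of the same terms */
    float acc = 0.0f, comp = 0.0f;
    for (int k = 0; k < n_expert_used; k++) {
        float y = weight[k] * expert_out[k] - comp;
        float t = acc + y;
        comp = (t - acc) - y;                  /* recovers the low-order bits lost in t */
        acc = t;
    }
    return fabsf(plain - acc);
}
```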
### Honest status
1000-tok coherent on 35B NOT achieved this session. Moved from 117 (pre-v0.28.0) to 204 (post-R43). Real progress, but closing the remaining gap will take more sessions of hunting for engine-specific precision paths on the MoE side.
### Next-session attack vectors
1. Run llama.cpp on the same 35B + same prompt — establish an absolute reference for what is achievable at this weight. If llama.cpp gets 1000+ coherent, the gap IS engine-specific and surgery continues. If llama.cpp also hits ~300, this quant+arch combo has an inherent limit.
2. Port DRY sampler — non-negotiable community-standard mitigation (see the sketch after this list).
3. Bisect the MoE routed-expert path: add a probe that dumps per-token expert output RMS and compares it to the 4B dense FFN RMS trajectory (see the probe sketch after this list).
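On item 2, what DRY adds: it finds the longest suffix of the generated context that has already occurred earlier immediately followed by the candidate token, and penalizes that candidate exponentially in the match length. A simplified, quadratic sketch of the idea (no sequence breakers or lookback cap, and not the reference implementation):

```c
/* Simplified DRY-style penalty: if extending the context with token `cand`
 * would continue a sequence already seen, reduce its logit by
 * multiplier * base^(match_len - allowed_len). Naive O(n^2) scan; real
 * samplers also honor sequence breakers and a lookback window. */
#include <math.h>

static void dry_penalize(float *logits, const int *hist, int n_hist, int cand,
                         float multiplier, float base, int allowed_len) {
    int best = 0;
    for (int j = 0; j < n_hist; j++) {
        if (hist[j] != cand) continue;          /* cand would repeat the token at position j...   */
        int len = 0;                             /* ...so measure how long the suffix match before it is */
        while (len < j && len < n_hist &&
               hist[j - 1 - len] == hist[n_hist - 1 - len])
            len++;
        if (len > best) best = len;
    }
    if (best >= allowed_len)
        logits[cand] -= multiplier * powf(base, (float)(best - allowed_len));
}
```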
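On item 3, the probe can be as small as one function hooked after each layer's FFN/MoE output, emitting a CSV row per (token, layer) so the 35B trajectory can be overlaid on the 4B dense run. A sketch assuming an FP32 activation buffer; the hook point and log format are placeholders:

```c
/* Dump the RMS of one layer's FFN/MoE output for one token position.
 * One CSV row per (token, layer); hook point and format are placeholders. */
#include <math.h>
#include <stdio.h>

static void probe_ffn_rms(FILE *log, int token_pos, int layer,
                          const float *ffn_out, int n_embd) {
    double sumsq = 0.0;
    for (int i = 0; i < n_embd; i++)
        sumsq += (double) ffn_out[i] * ffn_out[i];
    fprintf(log, "%d,%d,%.6f\n", token_pos, layer, sqrt(sumsq / (double) n_embd));
}
```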
## Phase 3 R42 — Cross-validation of H1-H3 hypotheses from R41 (2026-04-22)

All six architecture-grounded hypotheses from R41 cross-verified
