Commit d0508a8
★★ fix(moe): skip shared-expert Q4 double-quant on qwen35moe — long-gen +20%
User pushback: "Q4 quantization is commonly used — why does ours fail?"
Correct callout. Our failure relative to llama.cpp is specific to our
engine, not an interaction between Q4 and the model architecture.
Hunt: tq_model.c:4210 unconditionally re-quantized shared-expert weights
from GGUF-native (Q6_K/Q4_K) → FP32 → our internal per-32-absmax Q4.
This double-quant path is exactly the case described in memory note
feedback_double_quant_q8_to_q4 ("never recompress downward in bpw").
The main weight conversion path was already skipped for Qwen3.6 (Q8_0
attn fix in ea01222), but the shared-expert conversion below the skip
label always ran unconditionally.
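To make the precision argument concrete, here is a minimal sketch of a per-32 absmax round trip applied twice, first at a finer grid and then at 4 bits. It is not the code in tq_model.c: the function names, the symmetric `absmax / qmax` scaling, and the use of `qmax = 31` as a stand-in for a higher-bpw source format are all assumptions for illustration (real Q6_K/Q4_K blocks use a more elaborate super-block layout).

```c
#include <assert.h>
#include <stddef.h>

#define TQ_SKETCH_BLOCK 32  /* hypothetical block size, matching "per-32" */

/* Quantize then dequantize n floats in blocks of 32 using a symmetric
   absmax scheme: scale = absmax / qmax, codes clamped to [-qmax, qmax].
   Illustrative only; not the engine's actual storage format. */
static void absmax_roundtrip(const float *in, float *out, size_t n, int qmax)
{
    assert(n % TQ_SKETCH_BLOCK == 0 && qmax > 0);
    for (size_t b = 0; b < n; b += TQ_SKETCH_BLOCK) {
        float absmax = 0.0f;
        for (size_t i = 0; i < TQ_SKETCH_BLOCK; i++) {
            float a = in[b + i] < 0.0f ? -in[b + i] : in[b + i];
            if (a > absmax) absmax = a;
        }
        if (absmax == 0.0f) {               /* all-zero block: nothing to scale */
            for (size_t i = 0; i < TQ_SKETCH_BLOCK; i++) out[b + i] = 0.0f;
            continue;
        }
        float scale = absmax / (float)qmax;
        for (size_t i = 0; i < TQ_SKETCH_BLOCK; i++) {
            float x = in[b + i] / scale;
            int q = (int)(x >= 0.0f ? x + 0.5f : x - 0.5f); /* round half away from zero */
            if (q > qmax) q = qmax;
            if (q < -qmax) q = -qmax;
            out[b + i] = (float)q * scale;
        }
    }
}

/* Largest absolute element-wise difference between two arrays. */
static float max_abs_err(const float *a, const float *b, size_t n)
{
    float worst = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = a[i] - b[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

Calling `absmax_roundtrip(w, a, 32, 31)` followed by `absmax_roundtrip(a, b, 32, 7)` and comparing `max_abs_err(w, a, 32)` with `max_abs_err(w, b, 32)` shows the second, coarser stage adding its own rounding error on top of the first, which is the loss the skipped conversion avoids.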
Shared experts run on EVERY decoded token (they're "shared" = always
active, in contrast with top-K routed experts). So every token absorbed
the double-quant precision loss on 40 layers × 3 matmuls per layer,
i.e. 120 matmuls per token.
Fix: auto-skip the conversion for the qwen35moe arch (delta_n_heads > 0
detector, same pattern as auto-serial and auto-moe-temp). Other MoE
models (Gemma 4, etc.) keep the prior behavior. To opt out and restore
the old conversion, set TQ_FORCE_Q4_SHARED=1.
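The decision described above can be sketched as a small predicate. This is a hedged illustration, not the code in the commit: `model_cfg_t`, `env_flag_set`, and `should_requantize_shared_experts` are hypothetical names, and only the `delta_n_heads > 0` detector and the `TQ_FORCE_Q4_SHARED=1` opt-out come from the commit message. The `TQ_NO_Q4_SHARED` switch used in the measurement is deliberately not modeled, since its precedence is not stated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical subset of the model config; only this field is used here. */
typedef struct {
    int delta_n_heads;  /* > 0 is taken as the qwen35moe detector */
} model_cfg_t;

/* True when the environment variable is set to exactly "1". */
static bool env_flag_set(const char *name)
{
    const char *v = getenv(name);
    return v != NULL && strcmp(v, "1") == 0;
}

/* Decide whether shared-expert weights should be re-quantized to the
   internal Q4 format. Sketch of the described policy:
     - TQ_FORCE_Q4_SHARED=1 restores the old unconditional conversion;
     - otherwise qwen35moe (delta_n_heads > 0) skips it;
     - every other architecture keeps the prior behavior (convert). */
static bool should_requantize_shared_experts(const model_cfg_t *cfg)
{
    assert(cfg != NULL);
    if (env_flag_set("TQ_FORCE_Q4_SHARED")) {
        return true;
    }
    bool is_qwen35moe = cfg->delta_n_heads > 0;
    return !is_qwen35moe;
}
```

Checking the override first keeps the opt-out usable for A/B comparisons on the same model file without touching the detector.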
Measured on Qwen3.6-35B UD-IQ4_XS, prompt "Once upon a time in a faraway land",
-n 300 T=0 (auto-serial + auto-moe-temp already on):
  Baseline (with Q4 conversion): ~170 coherent tokens, then "Sorry!" alphabet walk
  With TQ_NO_Q4_SHARED=1:        204 coherent tokens, new content (Alex +
                                 magical door + flying creatures),
                                 then soft "had had" loop at 204
+20% coherent window (170 → 204). The content is a qualitatively
different story, no longer stuck in the Sorry!/sad-tree attractor that
the double-quant path produced. The remaining "had had" loop is a
separate attractor (likely a different root cause: routing or attention
precision at long positions).
Speed within noise (2.6-3.5 t/s both ways). Factual probes correct
("Paris, a city renowned for its iconic landmarks"). 23/23 regression
PASS.
Methodology: this is exactly what the user's pushback demanded. Q4
quantization of Qwen3.6 works fine in other engines; our ENGINE had an
unconditional shared-expert re-quant path that other engines don't.
Research + user intuition combined > empirical ablation alone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d287f1a commit d0508a8
1 file changed
Lines changed: 13 additions & 1 deletion
Diff hunk (line content not preserved): old lines 4206-4213 become new lines 4206-4225.
Added: new lines 4209-4213 and 4215-4222. Removed: old line 4210.