
Commit d0508a8

unamedkr and claude committed
★★ fix(moe): skip shared-expert Q4 double-quant on qwen35moe — long-gen +20%
User pushback: "Q4 quantization is commonly used — why does ours fail?" Correct callout. Our failure vs llama.cpp is engine-specific, not a Q4/architecture interaction.

Hunt: tq_model.c:4210 unconditionally re-quantized shared-expert weights from GGUF-native (Q6_K/Q4_K) → FP32 → our internal per-32-absmax Q4. This double-quant path precisely matches the memory note feedback_double_quant_q8_to_q4 ("never recompress downward in bpw"). The main weight-conversion path is already skipped for Qwen3.6 (the Q8_0 attn fix in ea01222), but the shared-expert conversion below the skip label was ALWAYS unconditional.

Shared experts run on EVERY decoded token ("shared" = always active, in contrast to the top-K routed experts). So every token absorbed the double-quant precision loss across 40 layers × 3 matmuls per layer.

Fix: auto-skip for the qwen35moe arch (delta_n_heads > 0 detector, same pattern as auto-serial and auto-moe-temp). Other MoE archs (Gemma 4 etc.) keep the prior behavior. Opt-out: TQ_FORCE_Q4_SHARED=1.

Measured on Qwen3.6-35B UD-IQ4_XS, prompt "Once upon a time in a faraway land", -n 300, T=0 (auto-serial + auto-moe-temp already on):
- Baseline (with Q4 conversion): ~170 coherent tokens, then the "Sorry!" alphabet walk.
- With TQ_NO_Q4_SHARED=1: 204 coherent tokens of new content (Alex + magical door + flying creatures), then a soft "had had" loop at token 204. That is a +20% coherent window.

The content is a qualitatively different story — no longer stuck in the "Sorry!"/sad-tree attractor that the double-quant path produced. The remaining "had had" loop is a separate attractor (likely a different root cause — routing or attention precision at long positions). Speed is within noise (2.6-3.5 t/s both ways). Factual probes are correct ("Paris, a city renowned for its iconic landmarks"). 23/23 regression PASS.

Methodology: this is exactly what the user's pushback demanded. Q4 quantization on Qwen3.6 works fine universally — our ENGINE had an unconditional shared-expert re-quant path that other engines don't. Research + user intuition combined > empirical ablation alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
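The "double quant" at the center of this fix is easiest to see numerically. The sketch below is illustrative only, not the engine's code: a plain signed 6-bit absmax round-trip stands in for GGUF-native Q6_K, and a 32-wide signed 4-bit absmax round-trip stands in for the internal per-32-absmax Q4 format (the exact integer ranges and block layouts are assumptions). It shows why re-quantizing already-quantized weights downward in bpw compounds the error, the "never recompress downward in bpw" rule from the memory note.

/* Illustrative sketch only; block layout and integer ranges are assumed,
 * not taken from the engine. Compile with -lm. */
#include <math.h>
#include <stdio.h>

#define BLOCK 32

/* Round-trip one block of n floats through a signed absmax quantizer with
 * integer range [-max_q, max_q]; writes the dequantized values to out. */
static void q_roundtrip(const float *in, float *out, int n, int max_q)
{
    float absmax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(in[i]);
        if (a > absmax) absmax = a;
    }
    float scale = absmax / (float)max_q;
    for (int i = 0; i < n; i++) {
        long q = (scale > 0.0f) ? lroundf(in[i] / scale) : 0;
        if (q >  max_q) q =  max_q;
        if (q < -max_q) q = -max_q;
        out[i] = (float)q * scale;
    }
}

int main(void)
{
    float w[BLOCK], q6[BLOCK], q6q4[BLOCK];
    for (int i = 0; i < BLOCK; i++)              /* synthetic weights */
        w[i] = sinf(0.37f * (float)i) * 0.05f;

    q_roundtrip(w,  q6,   BLOCK, 31);  /* 6-bit stand-in for GGUF-native   */
    q_roundtrip(q6, q6q4, BLOCK, 7);   /* re-quantize downward to 4-bit    */

    double e_native = 0.0, e_double = 0.0;
    for (int i = 0; i < BLOCK; i++) {
        e_native += (w[i] - q6[i])   * (w[i] - q6[i]);
        e_double += (w[i] - q6q4[i]) * (w[i] - q6q4[i]);
    }
    printf("MSE, GGUF-native only:      %g\n", e_native / BLOCK);
    printf("MSE, native -> FP32 -> Q4:  %g\n", e_double / BLOCK);
    return 0;
}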
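Why shared experts specifically: in this MoE family the shared expert bypasses the router, so its three matmuls (gate/up/down) execute for every decoded token on every layer, while routed experts run only when selected in the top-K. The schematic below shows that structural difference; all names, shapes, and the activation choice are illustrative, not the engine's actual API.

/* Schematic MoE FFN layer, not the engine's code. */
#include <math.h>
#include <stdlib.h>

typedef struct {
    const float *w_gate, *w_up, *w_down;   /* per-expert FFN weights */
} expert_t;

static void matvec(const float *w, const float *x, float *out, int rows, int cols)
{
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int c = 0; c < cols; c++) acc += w[r * cols + c] * x[c];
        out[r] = acc;
    }
}

/* y += weight * down( SiLU(gate(x)) * up(x) )  -- three matmuls per expert */
static void expert_ffn_accumulate(const expert_t *e, const float *x, float *y,
                                  int dim, int inter, float weight)
{
    float *g = malloc(sizeof(float) * (size_t)inter);
    float *u = malloc(sizeof(float) * (size_t)inter);
    float *d = malloc(sizeof(float) * (size_t)dim);
    matvec(e->w_gate, x, g, inter, dim);
    matvec(e->w_up,   x, u, inter, dim);
    for (int i = 0; i < inter; i++)
        g[i] = (g[i] / (1.0f + expf(-g[i]))) * u[i];
    matvec(e->w_down, g, d, dim, inter);
    for (int i = 0; i < dim; i++) y[i] += weight * d[i];
    free(g); free(u); free(d);
}

/* One MoE FFN layer: the shared expert is unconditional (this is the path
 * whose weights were being double-quantized); routed experts contribute
 * only when chosen by the router. */
void moe_ffn_forward(const expert_t *shared, const expert_t *routed,
                     const int *topk_idx, const float *topk_w, int k,
                     const float *x, float *y, int dim, int inter)
{
    expert_ffn_accumulate(shared, x, y, dim, inter, 1.0f);        /* every token */
    for (int i = 0; i < k; i++)                                   /* top-K only  */
        expert_ffn_accumulate(&routed[topk_idx[i]], x, y, dim, inter, topk_w[i]);
}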
1 parent d287f1a · commit d0508a8

1 file changed

Lines changed: 13 additions & 1 deletion

File tree

src/engine/tq_model.c

@@ -4206,8 +4206,20 @@ skip_q4_conversion: ;
      * layer stay in Q4 form, keeping memory under ~2 GB.
      *
      * Shared experts are always active, so convert them at load time.
+     *
+     * R43 diagnostic: Qwen3.6-35B long-gen drift suspected to come
+     * from shared-expert double-quant. TQ_NO_Q4_SHARED=1 skips this
+     * conversion — forward path uses on-the-fly GGUF dequant instead
+     * (like routed experts), matching llama.cpp behavior.
      * ============================================================ */
-    if (c->is_moe) {
+    /* R43 auto-default: for qwen35moe hybrid (DeltaNet + MoE + shared
+     * experts), the double-quant (GGUF Q6_K/Q4_K → FP32 → internal
+     * per-32-absmax Q4) on shared experts measurably degrades long-gen
+     * coherence (170 → 204 tok on drift-trigger prompt). Skip by
+     * default for this arch; keep for other MoE (Gemma 4 etc). Opt-out
+     * with TQ_FORCE_Q4_SHARED=1. */
+    int _auto_skip_shared_q4 = (c->delta_n_heads > 0 && !getenv("TQ_FORCE_Q4_SHARED"));
+    if (c->is_moe && !getenv("TQ_NO_Q4_SHARED") && !_auto_skip_shared_q4) {
         int shared_inter = c->shared_expert_intermediate_dim;
         if (shared_inter == 0) shared_inter = c->expert_intermediate_dim;
