
Commit d0508a8

unamedkr and claude committed
★★ fix(moe): skip shared-expert Q4 double-quant on qwen35moe — long-gen +20%
User pushback: "Q4 quantization is commonly used — why does ours fail?" Correct callout. Our failure vs llama.cpp is engine-specific, not a Q4/architecture interaction.

Hunt: tq_model.c:4210 unconditionally re-quantized shared-expert weights from GGUF-native (Q6_K/Q4_K) → FP32 → our internal per-32-absmax Q4. This double-quant path precisely matches the memory note feedback_double_quant_q8_to_q4 ("never recompress downward in bpw"). The main weight-conversion path is already skipped for Qwen3.6 (the Q8_0 attn fix in ea01222), but the shared-expert conversion below the skip label was ALWAYS unconditional.

Shared experts run on EVERY decoded token ("shared" = always active, in contrast to the top-K routed experts). So every token absorbed the double-quant precision loss across 40 layers × 3 matmuls per layer.

Fix: auto-skip for the qwen35moe arch (delta_n_heads > 0 detector, same pattern as auto-serial and auto-moe-temp). Other MoE archs (Gemma 4 etc.) keep the prior behavior. Opt-out: TQ_FORCE_Q4_SHARED=1.

Measured on Qwen3.6-35B UD-IQ4_XS, prompt "Once upon a time in a faraway land", -n 300, T=0 (auto-serial + auto-moe-temp already on):
- Baseline (with Q4 conversion): ~170 coherent tokens, then the "Sorry!" alphabet walk.
- With TQ_NO_Q4_SHARED=1: 204 coherent tokens of new content (Alex + magical door + flying creatures), then a soft "had had" loop at token 204. That is a +20% coherent window.

The content is a qualitatively different story — no longer stuck in the "Sorry!"/sad-tree attractor that the double-quant path produced. The remaining "had had" loop is a separate attractor (likely a different root cause — routing or attention precision at long positions). Speed is within noise (2.6-3.5 t/s both ways). Factual probes are correct ("Paris, a city renowned for its iconic landmarks"). 23/23 regression PASS.

Methodology: this is exactly what the user's pushback demanded. Q4 quantization on Qwen3.6 works fine universally — our ENGINE had an unconditional shared-expert re-quant path that other engines don't. Research + user intuition combined > empirical ablation alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
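The "double quant" at the center of this fix is easiest to see numerically. The sketch below is illustrative only, not the engine's code: a plain signed 6-bit absmax round-trip stands in for GGUF-native Q6_K, and a 32-wide signed 4-bit absmax round-trip stands in for the internal per-32-absmax Q4 format (the exact integer ranges and block layouts are assumptions). It shows why re-quantizing already-quantized weights downward in bpw compounds the error, the "never recompress downward in bpw" rule from the memory note.

/* Illustrative sketch only; block layout and integer ranges are assumed,
 * not taken from the engine. Compile with -lm. */
#include <math.h>
#include <stdio.h>

#define BLOCK 32

/* Round-trip one block of n floats through a signed absmax quantizer with
 * integer range [-max_q, max_q]; writes the dequantized values to out. */
static void q_roundtrip(const float *in, float *out, int n, int max_q)
{
    float absmax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(in[i]);
        if (a > absmax) absmax = a;
    }
    float scale = absmax / (float)max_q;
    for (int i = 0; i < n; i++) {
        long q = (scale > 0.0f) ? lroundf(in[i] / scale) : 0;
        if (q >  max_q) q =  max_q;
        if (q < -max_q) q = -max_q;
        out[i] = (float)q * scale;
    }
}

int main(void)
{
    float w[BLOCK], q6[BLOCK], q6q4[BLOCK];
    for (int i = 0; i < BLOCK; i++)              /* synthetic weights */
        w[i] = sinf(0.37f * (float)i) * 0.05f;

    q_roundtrip(w,  q6,   BLOCK, 31);  /* 6-bit stand-in for GGUF-native   */
    q_roundtrip(q6, q6q4, BLOCK, 7);   /* re-quantize downward to 4-bit    */

    double e_native = 0.0, e_double = 0.0;
    for (int i = 0; i < BLOCK; i++) {
        e_native += (w[i] - q6[i])   * (w[i] - q6[i]);
        e_double += (w[i] - q6q4[i]) * (w[i] - q6q4[i]);
    }
    printf("MSE, GGUF-native only:      %g\n", e_native / BLOCK);
    printf("MSE, native -> FP32 -> Q4:  %g\n", e_double / BLOCK);
    return 0;
}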
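Why shared experts specifically: in this MoE family the shared expert bypasses the router, so its three matmuls (gate/up/down) execute for every decoded token on every layer, while routed experts run only when selected in the top-K. The schematic below shows that structural difference; all names, shapes, and the activation choice are illustrative, not the engine's actual API.

/* Schematic MoE FFN layer, not the engine's code. */
#include <math.h>
#include <stdlib.h>

typedef struct {
    const float *w_gate, *w_up, *w_down;   /* per-expert FFN weights */
} expert_t;

static void matvec(const float *w, const float *x, float *out, int rows, int cols)
{
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int c = 0; c < cols; c++) acc += w[r * cols + c] * x[c];
        out[r] = acc;
    }
}

/* y += weight * down( SiLU(gate(x)) * up(x) )  -- three matmuls per expert */
static void expert_ffn_accumulate(const expert_t *e, const float *x, float *y,
                                  int dim, int inter, float weight)
{
    float *g = malloc(sizeof(float) * (size_t)inter);
    float *u = malloc(sizeof(float) * (size_t)inter);
    float *d = malloc(sizeof(float) * (size_t)dim);
    matvec(e->w_gate, x, g, inter, dim);
    matvec(e->w_up,   x, u, inter, dim);
    for (int i = 0; i < inter; i++)
        g[i] = (g[i] / (1.0f + expf(-g[i]))) * u[i];
    matvec(e->w_down, g, d, dim, inter);
    for (int i = 0; i < dim; i++) y[i] += weight * d[i];
    free(g); free(u); free(d);
}

/* One MoE FFN layer: the shared expert is unconditional (this is the path
 * whose weights were being double-quantized); routed experts contribute
 * only when chosen by the router. */
void moe_ffn_forward(const expert_t *shared, const expert_t *routed,
                     const int *topk_idx, const float *topk_w, int k,
                     const float *x, float *y, int dim, int inter)
{
    expert_ffn_accumulate(shared, x, y, dim, inter, 1.0f);        /* every token */
    for (int i = 0; i < k; i++)                                   /* top-K only  */
        expert_ffn_accumulate(&routed[topk_idx[i]], x, y, dim, inter, topk_w[i]);
}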
1 parent d287f1a · commit d0508a8

1 file changed

Lines changed: 13 additions & 1 deletion

File tree

src/engine/tq_model.c

@@ -4206,8 +4206,20 @@ skip_q4_conversion: ;
      * layer stay in Q4 form, keeping memory under ~2 GB.
      *
      * Shared experts are always active, so convert them at load time.
+     *
+     * R43 diagnostic: Qwen3.6-35B long-gen drift suspected to come
+     * from shared-expert double-quant. TQ_NO_Q4_SHARED=1 skips this
+     * conversion — forward path uses on-the-fly GGUF dequant instead
+     * (like routed experts), matching llama.cpp behavior.
      * ============================================================ */
-    if (c->is_moe) {
+    /* R43 auto-default: for qwen35moe hybrid (DeltaNet + MoE + shared
+     * experts), the double-quant (GGUF Q6_K/Q4_K → FP32 → internal
+     * per-32-absmax Q4) on shared experts measurably degrades long-gen
+     * coherence (170 → 204 tok on drift-trigger prompt). Skip by
+     * default for this arch; keep for other MoE (Gemma 4 etc). Opt-out
+     * with TQ_FORCE_Q4_SHARED=1. */
+    int _auto_skip_shared_q4 = (c->delta_n_heads > 0 && !getenv("TQ_FORCE_Q4_SHARED"));
+    if (c->is_moe && !getenv("TQ_NO_Q4_SHARED") && !_auto_skip_shared_q4) {
         int shared_inter = c->shared_expert_intermediate_dim;
         if (shared_inter == 0) shared_inter = c->expert_intermediate_dim;
