fix(qwen35moe): contiguous top_k router ids when host-read (hybrid spark crash) (#481)

davide221 · claude · web-flow · commit d5bf8ba0dca2 · 2026-07-02T22:00:06.000+02:00
Regression from #472: ggml_argsort_top_k returns a strided VIEW into the full [n_expert, n_tokens] argsort. The hybrid hot/cold path reads the router outputs back with a raw packed ggml_backend_tensor_get (qwen35moe_backend.cpp prefill chunk readback), which silently yields garbage expert ids for every token after the first -> corrupted hot/cold dispatch -> CUDA illegal memory access on the first multi-token spark prefill. Decode reads a single row and was unaffected, which is why #472 validation (all-hot + decode) passed.

Scoped fix: build_qwen35moe_router(allow_fused_router=false) at the hybrid export site emits contiguous ggml_top_k ids. The in-graph FFN path keeps argsort_top_k and its CUDA topk-moe fusion (all-hot output verified byte-identical, hash 8cda68b0ca5eb797, fused and unfused).

Validated on lucebox2 (3090): spark repro 2x clean + coherent (crashed 100% before), all-hot long prefill byte-identical, unit suite 2050 assertions 0 failures.

Co-authored-by: Claude <noreply@anthropic.com>
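
Illustration only, not part of the change: a minimal standalone sketch of why a packed readback of a strided top-k view is correct for the first token and wrong for every later one. It does not call ggml. View2D, make_full_argsort, make_topk_view, read_packed and read_strided are hypothetical names, and the layout (n_expert ids stored contiguously per token) is an assumption taken from the [n_expert, n_tokens] shape described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a strided 2-D view (not a ggml type).
struct View2D {
    const int32_t * data;        // first element of the view
    int             cols;        // elements per row that belong to the view
    int             rows;        // number of rows (tokens)
    int             row_stride;  // elements between row starts in the parent
};

// Toy "full argsort": token t holds the values t*100 + 0 .. t*100 + n_expert-1,
// so every value reveals which row and column it came from.
std::vector<int32_t> make_full_argsort(int n_expert, int n_tokens) {
    std::vector<int32_t> full((size_t)n_expert * n_tokens);
    for (int t = 0; t < n_tokens; ++t) {
        for (int e = 0; e < n_expert; ++e) {
            full[(size_t)t * n_expert + e] = t * 100 + e;
        }
    }
    return full;
}

// Top-k as a view: the first n_used entries of each row, with the parent's stride.
View2D make_topk_view(const std::vector<int32_t> & full,
                      int n_expert, int n_used, int n_tokens) {
    assert(n_used <= n_expert);
    assert(full.size() == (size_t)n_expert * n_tokens);
    return View2D{full.data(), n_used, n_tokens, n_expert};
}

// Packed readback: one flat copy that assumes row_stride == cols.
std::vector<int32_t> read_packed(const View2D & v) {
    std::vector<int32_t> out((size_t)v.cols * v.rows);
    std::memcpy(out.data(), v.data, out.size() * sizeof(int32_t));
    return out;
}

// Stride-aware readback: copies each row from its real offset in the parent.
std::vector<int32_t> read_strided(const View2D & v) {
    std::vector<int32_t> out((size_t)v.cols * v.rows);
    for (int r = 0; r < v.rows; ++r) {
        std::memcpy(out.data() + (size_t)r * v.cols,
                    v.data + (size_t)r * v.row_stride,
                    (size_t)v.cols * sizeof(int32_t));
    }
    return out;
}
```

With n_expert = 8, n_used = 2 and n_tokens = 3, read_packed returns token 0's pair followed by the rest of token 0's row, while read_strided returns the first two entries of each token's row. A single-row readback is identical under both, which is consistent with decode being unaffected.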
diff --git a/server/src/qwen35/qwen35_target_graph.cpp b/server/src/qwen35/qwen35_target_graph.cpp
@@ -1417,7 +1417,13 @@ QwenLayerPrefnOutputs build_qwen35_layer_prefn(
     out.residual = cur;
     out.post = rms_norm_mul(ctx, cur, L.attn_post_norm, eps);
     if (w.is_moe) {
-        Qwen35MoeRouterOutputs router = build_qwen35moe_router(ctx, out.post, w, L);
+        // selected/weights are read back by the host (hybrid hot/cold expert
+        // compute), not consumed in-graph. argsort_top_k yields a strided view
+        // whose raw packed readback returns garbage ids for tokens > 0 (crash
+        // in expert dispatch on the first multi-token prefill); top_k is
+        // contiguous, and cheaper than a full argsort here.
+        Qwen35MoeRouterOutputs router = build_qwen35moe_router(
+            ctx, out.post, w, L, /*allow_fused_router=*/false);
         out.moe_selected = router.selected;
         out.moe_weights = router.weights;
     }
diff --git a/server/src/qwen35moe/qwen35moe_ffn.cpp b/server/src/qwen35moe/qwen35moe_ffn.cpp
@@ -11,7 +11,8 @@ Qwen35MoeRouterOutputs build_qwen35moe_router(
     ggml_context *        ctx,
     ggml_tensor *         cur,
     const TargetWeights & w,
-    const TargetLayer &   L) {
+    const TargetLayer &   L,
+    bool                  allow_fused_router) {
     const int n_tokens = (int)cur->ne[1];
     const int n_expert = w.n_expert;
     const int n_used   = w.n_expert_used;
@@ -35,7 +36,7 @@ Qwen35MoeRouterOutputs build_qwen35moe_router(
     // x30 MoE layers (the launch-bound decode gap vs llama, which uses argsort_top_k).
     // Same top-k selection -> bit-identical. DFLASH_NO_MOE_ROUTER_FUSE=1 = old path.
     static const bool router_fuse = (std::getenv("DFLASH_NO_MOE_ROUTER_FUSE") == nullptr);
-    ggml_tensor * selected = router_fuse
+    ggml_tensor * selected = (router_fuse && allow_fused_router)
         ? ggml_argsort_top_k(ctx, probs, n_used)
         : ggml_top_k(ctx, probs, n_used);
 
diff --git a/server/src/qwen35moe/qwen35moe_ffn.h b/server/src/qwen35moe/qwen35moe_ffn.h
@@ -15,7 +15,16 @@ Qwen35MoeRouterOutputs build_qwen35moe_router(
     ggml_context *        ctx,
     ggml_tensor *         cur,   // [hidden, n_tokens], post-attention normed
     const TargetWeights & w,
-    const TargetLayer &   L);
+    const TargetLayer &   L,
+    // Pass false when selected/weights are read back by the host instead of
+    // feeding an in-graph mul_mat_id: ggml_argsort_top_k returns a strided
+    // VIEW into the full [n_expert, n_tokens] argsort, and the hybrid raw
+    // readback (ggml_backend_tensor_get, packed [n_used x n_tokens]) then
+    // yields garbage expert ids for every token after the first (decode
+    // reads one row and is unaffected). ggml_top_k is contiguous by
+    // construction, and cheaper than a full argsort where the fusion
+    // cannot apply anyway.
+    bool                  allow_fused_router = true);
 
 ggml_tensor * build_qwen35moe_ffn(
     ggml_context *        ctx,