
Commit d5bf8ba

davide221 and claude authored
fix(qwen35moe): contiguous top_k router ids when host-read (hybrid spark crash) (#481)
Regression from #472: ggml_argsort_top_k returns a strided VIEW into the full [n_expert, n_tokens] argsort. The hybrid hot/cold path reads the router outputs back with a raw packed ggml_backend_tensor_get (qwen35moe_backend.cpp prefill chunk readback), which silently yields garbage expert ids for every token after the first -> corrupted hot/cold dispatch -> CUDA illegal memory access on the first multi-token spark prefill. Decode reads a single row and was unaffected, which is why #472 validation (all-hot + decode) passed.

Scoped fix: build_qwen35moe_router(allow_fused_router=false) at the hybrid export site emits contiguous ggml_top_k ids. The in-graph FFN path keeps argsort_top_k and its CUDA topk-moe fusion (all-hot output verified byte-identical, hash 8cda68b0ca5eb797, fused and unfused).

Validated on lucebox2 (3090): spark repro 2x clean + coherent (crashed 100% before), all-hot long prefill byte-identical, unit suite 2050 assertions 0 failures.

Co-authored-by: Claude <noreply@anthropic.com>
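The failure mode described above can be reproduced without ggml. The following is a minimal sketch, not the project's code: `View2D`, `read_packed`, `read_strided`, and `toy_argsort` are hypothetical names introduced for illustration, and the packed copy only models the behaviour the commit message attributes to the raw readback (copying `n_used * n_tokens` consecutive elements and ignoring the row stride).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a 2-D tensor view (not a ggml type).
// `data` points at the first element, `ne0` x `ne1` is the logical shape,
// and `row_stride` is the distance in elements between consecutive rows
// of the underlying buffer.
struct View2D {
    const int32_t * data;
    size_t ne0;        // elements per row (n_used)
    size_t ne1;        // number of rows (n_tokens)
    size_t row_stride; // n_expert for a top-k view into a full argsort
};

// Packed readback: copies ne0 * ne1 consecutive elements and ignores
// row_stride. Only correct when row_stride == ne0.
std::vector<int32_t> read_packed(const View2D & v) {
    std::vector<int32_t> out(v.ne0 * v.ne1);
    std::memcpy(out.data(), v.data, out.size() * sizeof(int32_t));
    return out;
}

// Stride-aware readback: copies each row from its real offset.
std::vector<int32_t> read_strided(const View2D & v) {
    assert(v.row_stride >= v.ne0);
    std::vector<int32_t> out(v.ne0 * v.ne1);
    for (size_t t = 0; t < v.ne1; ++t) {
        std::memcpy(out.data() + t * v.ne0,
                    v.data + t * v.row_stride,
                    v.ne0 * sizeof(int32_t));
    }
    return out;
}

// Toy full argsort with n_expert = 4 and n_tokens = 2 (made-up values).
std::vector<int32_t> toy_argsort() {
    return {3, 1, 0, 2,   // token 0: expert ids, best first
            2, 0, 3, 1};  // token 1: expert ids, best first
}
```

With `View2D top2{full.data(), 2, 2, 4}` over the toy buffer, `read_strided` returns `{3, 1, 2, 0}` while `read_packed` returns `{3, 1, 0, 2}`: the first token's ids agree and the second token's ids are really the tail of the first token's row. A single-row readback never crosses a row boundary, which is consistent with decode being unaffected.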
1 parent 13ac209 commit d5bf8ba

3 files changed

Lines changed: 20 additions & 4 deletions


server/src/qwen35/qwen35_target_graph.cpp

Lines changed: 7 additions & 1 deletion
@@ -1417,7 +1417,13 @@ QwenLayerPrefnOutputs build_qwen35_layer_prefn(
     out.residual = cur;
     out.post = rms_norm_mul(ctx, cur, L.attn_post_norm, eps);
     if (w.is_moe) {
-        Qwen35MoeRouterOutputs router = build_qwen35moe_router(ctx, out.post, w, L);
+        // selected/weights are read back by the host (hybrid hot/cold expert
+        // compute), not consumed in-graph. argsort_top_k yields a strided view
+        // whose raw packed readback returns garbage ids for tokens > 0 (crash
+        // in expert dispatch on the first multi-token prefill); top_k is
+        // contiguous, and cheaper than a full argsort here.
+        Qwen35MoeRouterOutputs router = build_qwen35moe_router(
+            ctx, out.post, w, L, /*allow_fused_router=*/false);
         out.moe_selected = router.selected;
         out.moe_weights = router.weights;
     }

server/src/qwen35moe/qwen35moe_ffn.cpp

Lines changed: 3 additions & 2 deletions
@@ -11,7 +11,8 @@ Qwen35MoeRouterOutputs build_qwen35moe_router(
     ggml_context * ctx,
     ggml_tensor * cur,
     const TargetWeights & w,
-    const TargetLayer & L) {
+    const TargetLayer & L,
+    bool allow_fused_router) {
     const int n_tokens = (int)cur->ne[1];
     const int n_expert = w.n_expert;
     const int n_used = w.n_expert_used;
@@ -35,7 +36,7 @@ Qwen35MoeRouterOutputs build_qwen35moe_router(
     // x30 MoE layers (the launch-bound decode gap vs llama, which uses argsort_top_k).
     // Same top-k selection -> bit-identical. DFLASH_NO_MOE_ROUTER_FUSE=1 = old path.
     static const bool router_fuse = (std::getenv("DFLASH_NO_MOE_ROUTER_FUSE") == nullptr);
-    ggml_tensor * selected = router_fuse
+    ggml_tensor * selected = (router_fuse && allow_fused_router)
         ? ggml_argsort_top_k(ctx, probs, n_used)
         : ggml_top_k(ctx, probs, n_used);
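The combined condition in the hunk above means the new parameter can only turn the fused path off, never force it on when the environment opt-out is set. A small sketch of that truth table follows; `RouterTopK` and `pick_router_topk` are hypothetical names used for illustration and do not exist in the repository.

```cpp
#include <cstdlib>

// Hypothetical labels for the two ops the router can emit.
enum class RouterTopK { ArgsortTopK, PlainTopK };

// Mirrors `(router_fuse && allow_fused_router)` from the diff:
// the fused argsort_top_k path is used only when the environment opt-out
// is absent AND the caller allows fusion.
RouterTopK pick_router_topk(bool env_opt_out_set, bool allow_fused_router) {
    const bool router_fuse = !env_opt_out_set;
    return (router_fuse && allow_fused_router) ? RouterTopK::ArgsortTopK
                                               : RouterTopK::PlainTopK;
}

// Reads the environment variable named in the diff. The real code caches
// this in a function-local static; this sketch queries it on every call.
bool env_opt_out() {
    return std::getenv("DFLASH_NO_MOE_ROUTER_FUSE") != nullptr;
}
```

Under these assumptions, the hybrid export site corresponds to `pick_router_topk(env_opt_out(), false)`, which always selects the plain top-k op, while callers relying on the default `true` keep the previous behaviour.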

server/src/qwen35moe/qwen35moe_ffn.h

Lines changed: 10 additions & 1 deletion
@@ -15,7 +15,16 @@ Qwen35MoeRouterOutputs build_qwen35moe_router(
     ggml_context * ctx,
     ggml_tensor * cur, // [hidden, n_tokens], post-attention normed
     const TargetWeights & w,
-    const TargetLayer & L);
+    const TargetLayer & L,
+    // Pass false when selected/weights are read back by the host instead of
+    // feeding an in-graph mul_mat_id: ggml_argsort_top_k returns a strided
+    // VIEW into the full [n_expert, n_tokens] argsort, and the hybrid raw
+    // readback (ggml_backend_tensor_get, packed [n_used x n_tokens]) then
+    // yields garbage expert ids for every token after the first (decode
+    // reads one row and is unaffected). ggml_top_k is contiguous by
+    // construction, and cheaper than a full argsort where the fusion
+    // cannot apply anyway.
+    bool allow_fused_router = true);

 ggml_tensor * build_qwen35moe_ffn(
     ggml_context * ctx,
