Match llama's qwen3moe graph: argsort_top_k lets ggml-cuda's topk-moe
fusion collapse softmax->topk->get_rows->norm into ~1 kernel, and
ggml_swiglu on the combined gate_up buffer drops the 2 ggml_cont copies
per layer (x30 MoE layers). Same selection -> bit-identical output.
Env-gated for A/B: DFLASH_NO_MOE_ROUTER_FUSE / DFLASH_NO_MOE_SWIGLU_FUSE.
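
For reference, a minimal CPU sketch of what the two fused paths compute
and how an env gate like the ones above might be read. This is not the
code in this commit: fusion_enabled, route_top_k, swiglu_combined, the
"unset, empty, or 0 keeps fusion on" semantics, the lower-index
tie-break, and the [gate | up] buffer layout are all assumptions made
for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <cstring>
#include <numeric>
#include <vector>

// Hypothetical gate: a fusion stays on unless its DFLASH_NO_* variable is
// set to something other than "" or "0" (assumed semantics).
static bool fusion_enabled(const char * disable_var) {
    const char * v = std::getenv(disable_var);
    return v == nullptr || v[0] == '\0' || std::strcmp(v, "0") == 0;
}

struct Routing {
    std::vector<int>   ids;     // selected expert indices, best first
    std::vector<float> weights; // renormalized to sum to 1
};

// Reference router: softmax -> top-k -> gather -> normalize.
// Requires 0 < k <= logits.size().
static Routing route_top_k(const std::vector<float> & logits, int k) {
    const int n = (int) logits.size();
    const float mx = *std::max_element(logits.begin(), logits.end());
    std::vector<float> p(n);
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) { p[i] = std::exp(logits[i] - mx); sum += p[i]; }
    for (float & x : p) x /= sum;

    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    // Ties break toward the lower index (an assumption of this sketch).
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
        [&](int a, int b) { return p[a] > p[b] || (p[a] == p[b] && a < b); });

    Routing r;
    float wsum = 0.0f;
    for (int j = 0; j < k; ++j) { r.ids.push_back(idx[j]); wsum += p[idx[j]]; }
    for (int j = 0; j < k; ++j) r.weights.push_back(p[idx[j]] / wsum);
    return r;
}

// Reference SwiGLU on a combined buffer laid out as [gate | up] (assumed):
// out[i] = silu(gate[i]) * up[i], reading both halves in place so no
// contiguous copy of either half is needed.
static std::vector<float> swiglu_combined(const std::vector<float> & gate_up) {
    const size_t n = gate_up.size() / 2;
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) {
        const float g = gate_up[i];
        out[i] = g / (1.0f + std::exp(-g)) * gate_up[n + i];
    }
    return out;
}
```

An A/B run would then check fusion_enabled("DFLASH_NO_MOE_ROUTER_FUSE")
and fusion_enabled("DFLASH_NO_MOE_SWIGLU_FUSE") once at graph build
time and pick the fused or unfused node sequence accordingly.
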
Perf-neutral at all-hot (113.1 vs 113.3 tok/s, within noise): the router
and swiglu ops are <3% each. The residual ~3% decode gap vs llama comes
from the launch-bound floor of the shared MoE GEMV (mul_mat_q, 58% at
16.7% occupancy), not from missed fusions. This lands graph-node parity
and removes "missed fusion" as a gap explanation.