qwen3.5-fp8-mi355x-sglang-disagg: bump image and disable dp-attn

chunfangamd · chunfangamd · commit 688ebe60f673 · 2026-05-23T22:31:13.000Z

Two YAML changes for this config row:

* image: lmsysorg/sglang-rocm:v0.5.11-rocm700-mi35x-20260511
        -> lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523
  Brings this entry onto the same rocm720 / mi35x lineage that every
  other mi355x sglang config in this file already uses. The image is
  also known to support disagg: it matches the rocm720 base of
  dsr1-fp4-mi355x-sglang-disagg.

* 8k1k row: prefill.dp-attn / decode.dp-attn  true -> false
  With --enable-dp-attention + --moe-a2a-backend mori, sglang
  auto-promotes moe_ep_size=tp_size=8 (log line: "MoRI MoE is
  enabled. The expert parallel size is adjusted to be the same as
  the tensor parallel size[8]"). is_deepep_class_backend() does NOT
  include MoRI, so num_shared_slots stays at the global value (1)
  rather than the per-rank num_fused_shared_experts*moe_ep_size = 8,
  and the assertion
      (num_experts - num_shared_slots) % self.moe_ep_size == 0
  in fused_moe_triton/layer.py fires for Qwen3.5 (512 routed +
  1 shared, ep=8): (512 - 1) % 8 = 7. Setting dp-attn=false leaves
  moe_ep_size=1, so (512 - 1) % 1 = 0 always.

  The 1k1k row was already at dp-attn=false; this aligns the 8k1k
  row with it. Comment block above each row records the dependency
  on the upstream sglang fix (add MoRI to is_deepep_class_backend()
  or reconcile shared-slot accounting); flip back once that lands.
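
The arithmetic behind the assertion can be reproduced with a small
standalone sketch. This is illustrative only and is not sglang's code:
the helper name shared_slot_remainder is hypothetical, the inputs are
the values quoted in this message, and the third case assumes a fix
would set num_shared_slots to the per-rank value
num_fused_shared_experts*moe_ep_size = 8.

```python
# Illustrative only: mirrors the divisibility check quoted above using the
# numbers from this commit message. It does not import or call sglang.


def shared_slot_remainder(num_experts: int, num_shared_slots: int, moe_ep_size: int) -> int:
    """Return (num_experts - num_shared_slots) % moe_ep_size.

    The quoted assertion requires this remainder to be 0.
    """
    return (num_experts - num_shared_slots) % moe_ep_size


if __name__ == "__main__":
    cases = [
        # (label, num_experts, num_shared_slots, moe_ep_size)
        ("dp-attn=true with MoRI (ep promoted to 8, slots stay 1)", 512, 1, 8),
        ("dp-attn=false (ep=1, slots=1)", 512, 1, 1),
        # Assumption: a fix would use the per-rank slot count 1 * 8.
        ("assumed fix (ep=8, slots=8)", 512, 8, 8),
    ]
    for label, num_experts, num_shared_slots, moe_ep_size in cases:
        r = shared_slot_remainder(num_experts, num_shared_slots, moe_ep_size)
        status = "ok" if r == 0 else "assertion fires"
        print(f"{label}: remainder={r} ({status})")
```

Only the first case leaves a non-zero remainder (7), which is the
failure this change works around.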

Together with the MoRI conn.py overlay (commit <SHA-1>), the CI
matrix for this entry passes:

  smoke benchmark, 1k1k 1P+1D TP=8/EP=1 dp-attn=false, conc 8..256:
    request_throughput  0.85 -> 7.64 req/s
    output_throughput    787 -> 7042 tok/s
  smoke benchmark, 8k1k same topology, conc 8..256:
    request_throughput  0.84 -> 7.09 req/s
    output_throughput    774 -> 6537 tok/s
    total_throughput   6884 -> 58818 tok/s
  accuracy (gsm8k 5-shot, conc=128, 8k1k):
    exact_match (strict)  0.978 +/- 0.004   PASS
    exact_match (flex)    0.978 +/- 0.004   PASS

(conc=512 stalls in a MoRI tail deadlock at high concurrency; that is
tracked separately and is distinct from the registration/state-type
bugs.)
diff --git a/.github/configs/amd-master.yaml b/.github/configs/amd-master.yaml
@@ -349,7 +349,7 @@ qwen3.5-fp4-mi355x-atom:
       - { tp: 4, conc-start: 4, conc-end: 16 }
 
 qwen3.5-fp8-mi355x-sglang-disagg:
-  image: lmsysorg/sglang-rocm:v0.5.11-rocm700-mi35x-20260511
+  image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523
   model: Qwen/Qwen3.5-397B-A17B-FP8
   model-prefix: qwen3.5
   runner: mi355x-disagg
@@ -362,7 +362,16 @@ qwen3.5-fp8-mi355x-sglang-disagg:
     - isl: 1024
       osl: 1024
       search-space:
-      # Matches qwen3.5-fp8-mi355x-sglang TP8/EP1 low-concurrency sweep
+      # 1P+1D TP8/EP1 low-concurrency sweep.
+      # dp-attn intentionally false (matches the 1k1k row): with
+      # --enable-dp-attention + --moe-a2a-backend mori, sglang auto-promotes
+      # moe_ep_size=tp_size=8, but is_deepep_class_backend() excludes MoRI,
+      # so num_shared_slots stays at the global value (1) and the
+      # (num_experts - num_shared_slots) % moe_ep_size assertion in
+      # fused_moe_triton/layer.py fires for Qwen3.5 (512 routed + 1 shared).
+      # Track upstream sglang for a fix; flip back to dp-attn=true once
+      # MoRI is added to is_deepep_class_backend() or shared-slot
+      # accounting is reconciled.
       - spec-decoding: "none"
         conc-list: [ 8, 16, 32, 64, 128, 256, 512 ]
         prefill:
@@ -384,21 +393,30 @@ qwen3.5-fp8-mi355x-sglang-disagg:
     - isl: 8192
       osl: 1024
       search-space:
-      # Matches qwen3.5-fp8-mi355x-sglang TP2/EP2 low-concurrency sweep
+      # 1P+1D TP8/EP1 low-concurrency sweep.
+      # dp-attn intentionally false (matches the 1k1k row): with
+      # --enable-dp-attention + --moe-a2a-backend mori, sglang auto-promotes
+      # moe_ep_size=tp_size=8, but is_deepep_class_backend() excludes MoRI,
+      # so num_shared_slots stays at the global value (1) and the
+      # (num_experts - num_shared_slots) % moe_ep_size assertion in
+      # fused_moe_triton/layer.py fires for Qwen3.5 (512 routed + 1 shared).
+      # Track upstream sglang for a fix; flip back to dp-attn=true once
+      # MoRI is added to is_deepep_class_backend() or shared-slot
+      # accounting is reconciled.
       - spec-decoding: "none"
         conc-list: [ 8, 16, 32, 64, 128, 256, 512 ]
         prefill:
           num-worker: 1
           tp: 8
           ep: 1
-          dp-attn: true
+          dp-attn: false
           additional-settings:
           - "PREFILL_NODES=1"
         decode:
           num-worker: 1
           tp: 8
           ep: 1
-          dp-attn: true
+          dp-attn: false
           additional-settings:
           - "DECODE_NODES=1"
           - "DECODE_MTP_SIZE=0"