
Commit 688ebe6

qwen3.5-fp8-mi355x-sglang-disagg: bump image and disable dp-attn
Two YAML changes for this config row:

* image: lmsysorg/sglang-rocm:v0.5.11-rocm700-mi35x-20260511
  -> lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523

  Brings this entry onto the same rocm720 / mi35x lineage that every other mi355x sglang config in this file already uses. The image is also proven to support disagg (it matches the rocm720 base of dsr1-fp4-mi355x-sglang-disagg).

* 8k1k row: prefill.dp-attn / decode.dp-attn true -> false

  With --enable-dp-attention + --moe-a2a-backend mori, sglang auto-promotes moe_ep_size=tp_size=8 (log line: "MoRI MoE is enabled. The expert parallel size is adjusted to be the same as the tensor parallel size[8]"). is_deepep_class_backend() does NOT include MoRI, so num_shared_slots stays at the global value (1) rather than the per-rank num_fused_shared_experts*moe_ep_size = 8, and the assertion

    (num_experts - num_shared_slots) % self.moe_ep_size == 0

  in fused_moe_triton/layer.py fires for Qwen3.5 (512 routed + 1 shared, ep=8): (512 - 1) % 8 = 7. Setting dp-attn=false leaves moe_ep_size=1, so (512 - 1) % 1 = 0 always, since any integer modulo 1 is 0. The 1k1k row was already at dp-attn=false; this aligns the 8k1k row with it.

  The comment block above each row records the dependency on the upstream sglang fix (add MoRI to is_deepep_class_backend() or reconcile shared-slot accounting); flip back once that lands.

Together with the MoRI conn.py overlay (commit <SHA-1>), the CI matrix for this entry passes:

  smoke benchmark, 1k1k, 1P+1D TP=8/EP=1 dp-attn=false, conc 8..256:
    request_throughput    0.85 -> 7.64    req/s
    output_throughput     787 -> 7042     tok/s
  smoke benchmark, 8k1k, same topology, conc 8..256:
    request_throughput    0.84 -> 7.09    req/s
    output_throughput     774 -> 6537     tok/s
    total_throughput      6884 -> 58818   tok/s
  accuracy (gsm8k 5-shot, conc=128, 8k1k):
    exact_match (strict)  0.978 +/- 0.004  PASS
    exact_match (flex)    0.978 +/- 0.004  PASS

(conc=512 stalls in MoRI's high-concurrency tail-deadlock; this is tracked separately and is distinct from the registration/state-type bugs.)
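The divisibility argument above can be replayed in a few lines of Python. This is a minimal sketch, not sglang's code: `num_shared_slots` and `ep_split_ok` are hypothetical helpers that restate the accounting described in the commit message (per-rank slots = num_fused_shared_experts * moe_ep_size for DeepEP-class backends, the global value otherwise), and `NUM_EXPERTS = 512` is taken from the commit's own arithmetic rather than read from the model config.

```python
# Hypothetical re-statement of the check described in the commit message; the
# real code lives in sglang (fused_moe_triton/layer.py) and may differ.


def num_shared_slots(num_fused_shared_experts: int, moe_ep_size: int,
                     deepep_class: bool) -> int:
    """Per-rank slot count for DeepEP-class backends, else the global value."""
    if deepep_class:
        return num_fused_shared_experts * moe_ep_size
    return num_fused_shared_experts


def ep_split_ok(num_experts: int, shared_slots: int, moe_ep_size: int) -> bool:
    """Mirrors (num_experts - num_shared_slots) % moe_ep_size == 0."""
    return (num_experts - shared_slots) % moe_ep_size == 0


NUM_EXPERTS = 512   # value used in the commit's (512 - 1) % 8 arithmetic
FUSED_SHARED = 1    # Qwen3.5 has one shared expert

# (label, moe_ep_size, backend treated as DeepEP-class?)
CASES = [
    ("dp-attn=true, MoRI today", 8, False),
    ("dp-attn=true, MoRI treated as DeepEP-class", 8, True),
    ("dp-attn=false", 1, False),
]

for label, ep, deepep in CASES:
    slots = num_shared_slots(FUSED_SHARED, ep, deepep)
    rem = (NUM_EXPERTS - slots) % ep
    status = "ok" if ep_split_ok(NUM_EXPERTS, slots, ep) else "assertion fires"
    print(f"{label}: ({NUM_EXPERTS} - {slots}) % {ep} = {rem} -> {status}")
```

Under these assumptions the first case reproduces the remainder of 7 quoted in the commit, the second shows why adding MoRI to is_deepep_class_backend() would be expected to satisfy the assertion ((512 - 8) % 8 = 0), and the third shows why dp-attn=false sidesteps it entirely.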
1 parent 48e459b commit 688ebe6

1 file changed

Lines changed: 23 additions & 5 deletions


.github/configs/amd-master.yaml

```diff
@@ -349,7 +349,7 @@ qwen3.5-fp4-mi355x-atom:
 - { tp: 4, conc-start: 4, conc-end: 16 }
 
 qwen3.5-fp8-mi355x-sglang-disagg:
-image: lmsysorg/sglang-rocm:v0.5.11-rocm700-mi35x-20260511
+image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523
 model: Qwen/Qwen3.5-397B-A17B-FP8
 model-prefix: qwen3.5
 runner: mi355x-disagg
@@ -362,7 +362,16 @@ qwen3.5-fp8-mi355x-sglang-disagg:
 - isl: 1024
 osl: 1024
 search-space:
-# Matches qwen3.5-fp8-mi355x-sglang TP8/EP1 low-concurrency sweep
+# 1P+1D TP8/EP1 low-concurrency sweep.
+# dp-attn intentionally false (matches the 1k1k row): with
+# --enable-dp-attention + --moe-a2a-backend mori, sglang auto-promotes
+# moe_ep_size=tp_size=8, but is_deepep_class_backend() excludes MoRI,
+# so num_shared_slots stays at the global value (1) and the
+# (num_experts - num_shared_slots) % moe_ep_size assertion in
+# fused_moe_triton/layer.py fires for Qwen3.5 (512 routed + 1 shared).
+# Track upstream sglang for a fix; flip back to dp-attn=true once
+# MoRI is added to is_deepep_class_backend() or shared-slot
+# accounting is reconciled.
 - spec-decoding: "none"
 conc-list: [ 8, 16, 32, 64, 128, 256, 512 ]
 prefill:
@@ -384,21 +393,30 @@ qwen3.5-fp8-mi355x-sglang-disagg:
 - isl: 8192
 osl: 1024
 search-space:
-# Matches qwen3.5-fp8-mi355x-sglang TP2/EP2 low-concurrency sweep
+# 1P+1D TP8/EP1 low-concurrency sweep.
+# dp-attn intentionally false (matches the 1k1k row): with
+# --enable-dp-attention + --moe-a2a-backend mori, sglang auto-promotes
+# moe_ep_size=tp_size=8, but is_deepep_class_backend() excludes MoRI,
+# so num_shared_slots stays at the global value (1) and the
+# (num_experts - num_shared_slots) % moe_ep_size assertion in
+# fused_moe_triton/layer.py fires for Qwen3.5 (512 routed + 1 shared).
+# Track upstream sglang for a fix; flip back to dp-attn=true once
+# MoRI is added to is_deepep_class_backend() or shared-slot
+# accounting is reconciled.
 - spec-decoding: "none"
 conc-list: [ 8, 16, 32, 64, 128, 256, 512 ]
 prefill:
 num-worker: 1
 tp: 8
 ep: 1
-dp-attn: true
+dp-attn: false
 additional-settings:
 - "PREFILL_NODES=1"
 decode:
 num-worker: 1
 tp: 8
 ep: 1
-dp-attn: true
+dp-attn: false
 additional-settings:
 - "DECODE_NODES=1"
 - "DECODE_MTP_SIZE=0"
```
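As a hedged illustration of how the dp-attn flip in the 8k1k row changes the effective expert-parallel size, the following Python sketch lints row dicts that mirror the prefill/decode fields of amd-master.yaml. The helpers `effective_moe_ep_size` and `check_row` are hypothetical (not part of this repo or sglang). They assume the promotion rule from the commit message (dp-attn plus the MoRI a2a backend sets moe_ep_size to tp; otherwise moe_ep_size is taken to equal the row's ep) and the global num_shared_slots of 1. Rows are plain dicts rather than parsed YAML so the sketch has no dependencies.

```python
# Hypothetical pre-flight check for amd-master.yaml rows; not part of the repo.
# Assumption: with dp-attn and the MoRI a2a backend, sglang sets
# moe_ep_size = tp (per the commit's log line); otherwise moe_ep_size = ep.

NUM_EXPERTS = 512      # value used in the commit's arithmetic
NUM_SHARED_SLOTS = 1   # global value used when the backend is not DeepEP-class


def effective_moe_ep_size(worker: dict, a2a_backend: str) -> int:
    """Return the expert-parallel size sglang is assumed to end up with."""
    if worker["dp-attn"] and a2a_backend == "mori":
        return worker["tp"]
    return worker["ep"]


def check_row(name: str, row: dict, a2a_backend: str = "mori") -> list:
    """List every role whose shared-slot divisibility check would fail."""
    problems = []
    for role in ("prefill", "decode"):
        ep = effective_moe_ep_size(row[role], a2a_backend)
        rem = (NUM_EXPERTS - NUM_SHARED_SLOTS) % ep
        if rem != 0:
            problems.append(
                f"{name}.{role}: ({NUM_EXPERTS} - {NUM_SHARED_SLOTS}) % {ep} = {rem}"
            )
    return problems


# Field names mirror the 8k1k row in the diff, before and after this commit.
before = {
    "prefill": {"tp": 8, "ep": 1, "dp-attn": True},
    "decode": {"tp": 8, "ep": 1, "dp-attn": True},
}
after = {
    "prefill": {"tp": 8, "ep": 1, "dp-attn": False},
    "decode": {"tp": 8, "ep": 1, "dp-attn": False},
}

print(check_row("8k1k (before)", before))
print(check_row("8k1k (after)", after))
```

Under these assumptions the "before" row reports a remainder of 7 for both prefill and decode, while the "after" row reports no problems, matching the reasoning in the commit message.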
