glm5-fp8-mi355x-sglang-disagg: bump to v0.5.12.post1 image and patch DSA state-index path

chunfangamd · chunfangamd · commit c4e397de9c4c · 2026-05-24T08:52:07.000Z
amd-master.yaml - Image: rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0402 -> lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523 (matches qwen3.5-fp8-mi355x-sglang-disagg; the older 0.5.9 image is no longer the reference build for hybrid-attention disagg models on MI355X.) - Scenarios: collapse the four legacy "top/middle/bottom/small-scale" search-spaces per ISL into a single 1P+1D TP=8 EP=1 dp-attn=false entry with the standard conc-list [8, 16, 32, 64, 128, 256, 512] for both 1k1k and 8k1k. dp-attn=false avoids the fused_moe_triton/layer.py:209 shared-slot assertion that --enable-dp-attention + --moe-a2a-backend mori triggers for GLM-5 (256 routed + 1 shared expert; (256-1) % 8 = 7 != 0). The collapsed layout mirrors the qwen3.5-fp8-mi355x-sglang-disagg shape so the same CI matrix-expansion logic applies to both. patches/mori_conn.py - Add patch #4: rank + length normalization in MoriKVReceiver._send_swa_dsa_state, immediately before the group_concurrent_contiguous call. For GLM-5 (single DSA component), upstream hands dst_state_indices as a 2-D (1, N) array while src_state_indices is 1-D length 1; the existing [:common_len] slice operates only on the outer axis, leaving the rank mismatched. np.diff then produces (1, N-1) vs (0,), which can't broadcast and crashes with "operands could not be broadcast together with shapes (1,12) (0,)". The fix ravels both indices to 1-D and re-truncates to common length so np.diff outputs compatible 1-D arrays. One-shot log gates the warning to once per receiver class. - Verified end-to-end: glm5-fp8-mi355x-sglang-disagg gsm8k flexible-extract = 0.9704 +/- 0.0047 glm5-fp8-mi355x-sglang-disagg gsm8k strict-match = 0.9712 +/- 0.0046 qwen3.5-fp8-mi355x-sglang-disagg gsm8k (regression) = 0.9780 +/- 0.004 Patch #4 fires zero times on the Qwen3.5 Mamba path (it lives inside _send_swa_dsa_state, never called for Mamba); patches #1-#3 behavior is unchanged. patches/README.md - Document patch #4 alongside the existing three. Cross-link the full bug analysis at scripts/sglang_disagg/docs_glm5/01-bug-analysis.md and the gsm8k verification at scripts/sglang_disagg/docs_glm5/02-fix-and-verification.md.
diff --git a/.github/configs/amd-master.yaml b/.github/configs/amd-master.yaml
@@ -481,7 +481,7 @@ glm5-fp8-mi355x-sglang-mtp:
       - { tp: 8, conc-start: 4, conc-end: 8, spec-decoding: mtp }
 
 glm5-fp8-mi355x-sglang-disagg:
-  image: rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0402
+  image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523
   model: zai-org/GLM-5-FP8
   model-prefix: glm5
   runner: mi355x-disagg
@@ -494,75 +494,15 @@ glm5-fp8-mi355x-sglang-disagg:
     - isl: 1024
       osl: 1024
       search-space:
-      # "Top of curve" (1 prefill worker at TP8 and 1 decode worker at DEP16 across 2 nodes)
       - spec-decoding: "none"
-        conc-list: [ 1024, 2048 ]
+        conc-list: [ 8, 16, 32, 64, 128, 256, 512 ]
         prefill:
           num-worker: 1
           tp: 8
           ep: 1
           dp-attn: false
           additional-settings:
           - "PREFILL_NODES=1"
-        decode:
-          num-worker: 1
-          tp: 8
-          ep: 8
-          dp-attn: true
-          additional-settings:
-          - "DECODE_NODES=2"
-          - "DECODE_MTP_SIZE=0"
-
-      # "Middle of curve" (1 prefill worker at TP8 and 2 decode workers at DEP8)
-      - spec-decoding: "none"
-        conc-list: [ 1536, 1024, 512 ]
-        prefill:
-          num-worker: 1
-          tp: 8
-          ep: 1
-          dp-attn: false
-          additional-settings:
-          - "PREFILL_NODES=1"
-        decode:
-          num-worker: 2
-          tp: 8
-          ep: 8
-          dp-attn: true
-          additional-settings:
-          - "DECODE_NODES=2"
-          - "DECODE_MTP_SIZE=0"
-
-      # "Bottom of curve" (1 prefill worker at TP8 and 2 decode workers at TP8)
-      - spec-decoding: "none"
-        conc-list: [ 256, 128, 64, 32, 16, 8, 4 ]
-        prefill:
-          num-worker: 1
-          tp: 8
-          ep: 1
-          dp-attn: false
-          additional-settings:
-          - "PREFILL_NODES=1"
-
-        decode:
-          num-worker: 2
-          tp: 8
-          ep: 1
-          dp-attn: false
-          additional-settings:
-          - "DECODE_NODES=2"
-          - "DECODE_MTP_SIZE=0"
-
-      # "Small scale" (1 prefill worker at TP4 and 1 decode worker at TP8)
-      - spec-decoding: "none"
-        conc-list: [ 64, 32, 16, 8, 4, 2, 1 ]
-        prefill:
-          num-worker: 1
-          tp: 4
-          ep: 1
-          dp-attn: false
-          additional-settings:
-          - "PREFILL_NODES=1"
-
         decode:
           num-worker: 1
           tp: 8
@@ -575,56 +515,15 @@ glm5-fp8-mi355x-sglang-disagg:
     - isl: 8192
       osl: 1024
       search-space:
-      # "Top of curve" (2 prefill workers at DEP8 and 1 decode worker at DEP8)
-      - spec-decoding: "none"
-        conc-list: [ 1024, 2048 ]
-        prefill:
-          num-worker: 2
-          tp: 8
-          ep: 8
-          dp-attn: true
-          additional-settings:
-          - "PREFILL_NODES=2"
-        decode:
-          num-worker: 1
-          tp: 8
-          ep: 8
-          dp-attn: true
-          additional-settings:
-          - "DECODE_NODES=1"
-          - "DECODE_MTP_SIZE=0"
-
-      # "Bottom of curve" (1 prefill worker at TP8 and 2 decode workers at TP8)
       - spec-decoding: "none"
-        conc-list: [ 256, 128, 64, 32, 16, 8, 4 ]
+        conc-list: [ 8, 16, 32, 64, 128, 256, 512 ]
         prefill:
           num-worker: 1
           tp: 8
           ep: 1
           dp-attn: false
           additional-settings:
           - "PREFILL_NODES=1"
-
-        decode:
-          num-worker: 2
-          tp: 8
-          ep: 1
-          dp-attn: false
-          additional-settings:
-          - "DECODE_NODES=2"
-          - "DECODE_MTP_SIZE=0"
-
-      # "Small scale" (1 prefill worker at TP4 and 1 decode worker at TP8)
-      - spec-decoding: "none"
-        conc-list: [ 64, 32, 16, 8, 4, 2, 1 ]
-        prefill:
-          num-worker: 1
-          tp: 4
-          ep: 1
-          dp-attn: false
-          additional-settings:
-          - "PREFILL_NODES=1"
-
         decode:
           num-worker: 1
           tp: 8
diff --git a/benchmarks/multi_node/amd_utils/patches/README.md b/benchmarks/multi_node/amd_utils/patches/README.md
@@ -20,8 +20,9 @@ Overlays
 Source: forked from the file shipped in
 `lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523`
 (sglang [v0.5.12.post1](https://github.com/sgl-project/sglang/tree/v0.5.12.post1)).
-Three logical edits, all confined to `MoriKVReceiver.send_state` and
-`MoriKVReceiver._register_kv_args`:
+Four logical edits, all confined to `MoriKVReceiver.send_state`,
+`MoriKVReceiver._register_kv_args`, and
+`MoriKVReceiver._send_swa_dsa_state`:
 
 1. **Sender flatten** — handle the framework's nested
    `state_item_lens: List[List[int]]` instead of crashing in the
@@ -37,11 +38,23 @@ Three logical edits, all confined to `MoriKVReceiver.send_state` and
    `send_state`, so the existing per-tensor index arithmetic
    (`state_item_lens[i]`) and length checks
    (`len(state_item_lens) == len(state_mem_descs)`) keep working.
+4. **DSA index rank+length normalization** — inside
+   `_send_swa_dsa_state`, before the `group_concurrent_contiguous`
+   call, ravel both `src_state_indices` and `dst_state_indices` to 1-D
+   and re-truncate to common length. Upstream's existing truncation
+   only slices the outer axis, leaving 2-D `(1, N)` arrays unchanged
+   and triggering an `np.diff` broadcasting error
+   (`shapes (1,12) (0,)`) for GLM-5 (single-DSA-component) prefill
+   traffic. See
+   `scripts/sglang_disagg/docs_glm5/01-bug-analysis.md` for the full
+   write-up.
 
 Verified passing GSM8K = 0.978 ± 0.004 on Qwen3.5-397B-A17B-FP8 1P+1D
 TP=8 dp-attn=false (matches and slightly exceeds upstream
 [PR #22665](https://github.com/sgl-project/sglang/pull/22665)'s
-reported 0.970 GSM8K on the bf16 baseline).
+reported 0.970 GSM8K on the bf16 baseline). GLM-5 (DSA) verification
+in progress under
+`scripts/sglang_disagg/docs_glm5/02-fix-and-verification.md`.
 
 This is a stop-gap. The proper upstream fix is to migrate MoRI to the
 plural `state_types: List[StateType]` API (full design + diff in
diff --git a/benchmarks/multi_node/amd_utils/patches/mori_conn.py b/benchmarks/multi_node/amd_utils/patches/mori_conn.py
@@ -1148,6 +1148,37 @@ def _send_swa_dsa_state(
             src_state_indices = src_state_indices[:common_len]
             dst_state_indices = dst_state_indices[:common_len]
 
+        # ── BEGIN PATCH #4: rank + length normalization ──────────────────
+        # Bug: for DSA single-component models (e.g. GLM-5), upstream may
+        # hand us `dst_state_indices` as a 2-D array of shape (1, N) or
+        # as a List[int]/List[np.ndarray]. The earlier `[:common_len]`
+        # slice operates only on the outer axis, so a (1, 13) input stays
+        # (1, 13). `group_concurrent_contiguous` then runs `np.diff` on
+        # arrays of mismatched rank ((1, N-1) vs (0,)) and crashes with
+        # "operands could not be broadcast together with shapes (1,12) (0,)".
+        # Flatten both sides to 1-D and re-align lengths so np.diff produces
+        # 1-D arrays of equal length.
+        src_state_indices = np.asarray(src_state_indices).ravel()
+        dst_state_indices = np.asarray(dst_state_indices).ravel()
+        if len(src_state_indices) != len(dst_state_indices):
+            new_common = min(len(src_state_indices), len(dst_state_indices))
+            if not getattr(self.__class__, "_logged_dsa_index_flatten", False):
+                try:
+                    logger.warning(
+                        "[mori-patch] DSA state-indices ravel/realign for %s: "
+                        "src=%d dst=%d -> common=%d (one-shot log)",
+                        state_type,
+                        len(src_state_indices),
+                        len(dst_state_indices),
+                        new_common,
+                    )
+                except Exception:
+                    pass
+                self.__class__._logged_dsa_index_flatten = True
+            src_state_indices = src_state_indices[:new_common]
+            dst_state_indices = dst_state_indices[:new_common]
+        # ── END PATCH #4 ─────────────────────────────────────────────────
+
         # Group contiguous indices and issue per-tensor transfers
         grouped_plan = GroupedIndexPlan.from_groups(
             *group_concurrent_contiguous(src_state_indices, dst_state_indices)