Commit 466be35

ChangLiu0709, cliu1004@amd.com, cursoragent, chunfangamd, and claude[bot] authored
[MoRI short term temp patch] GLM-5 FP8 MI355X SGLang disaggregated (#1572)
* Add GLM-5 FP8 MI355X SGLang disaggregated benchmark (PR-2).

  Introduce glm5-fp8-mi355x-sglang-disagg CI config, model server flags,
  launch script, setup_deps.sh image patches, and GLM-5 env tuning for
  MoRI PD disaggregation on MI355X.

  Co-authored-by: Cursor <cursoragent@cursor.com>
  Co-authored-by: chunfangamd <chun.fang@amd.com>
  Co-authored-by: ChangLiu0709 <cliu1004@amd.com>

* Update benchmarks/multi_node/amd_utils/setup_deps.sh

  Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
  Co-authored-by: chunfangamd <chun.fang@amd.com>
  Co-authored-by: ChangLiu0709 <cliu1004@amd.com>

* fix: add FRAMEWORK to check_env_vars, fix NODELIST variable name

  - Add missing FRAMEWORK to check_env_vars list to match sister
    sglang-disagg scripts (dsr1_fp8, dsr1_fp4)
  - Rename NODE_LIST to NODELIST (quoted) to match the convention used by
    kimik2.5/minimaxm2.5 vllm-disagg sisters

  Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Co-authored-by: chunfangamd <chun.fang@amd.com>
  Co-authored-by: ChangLiu0709 <cliu1004@amd.com>

* [Klaud Cold] GLM-5 disagg: port MoRI conn.py overlay to fix PD startup crash (#1578)

  sglang v0.5.12.post1 ships an unmigrated MoRI PD-disagg backend (legacy
  singular `state_type` + flat-int wire format) that crashes
  hybrid-attention models at PD-disagg startup. PR #1572's CI run
  (actions/runs/26544255929/job/78192785656) hit this exact failure:

      File ".../disaggregation/mori/conn.py", line 1424, in <genexpr>
        struct.pack("I", item_len)
      struct.error: required argument is not an integer

  Root cause: KVArgs.state_item_lens is List[List[int]] for any model with
  state_types: List[StateType] (Qwen3.5-MoE, GLM-5 NSA, etc.), but MoRI's
  `_register_kv_args` iterates and packs each element with naked
  `struct.pack("I", item_len)`, expecting flat List[int]. Two further sites
  in `send_state` and `_send_swa_dsa_state` have the same legacy-API
  assumption.

  This PR ports the validated overlay from Chun Fang's working branch
  (chun-chang/sglang-disagg-qwen3.5, commits 48e459b + c4e397d) and stacks
  it on top of PR #1572:

  - benchmarks/multi_node/amd_utils/patches/mori_conn.py (new, 1665 lines)
    Drop-in replacement for sglang v0.5.12.post1's
    disaggregation/mori/conn.py with four conservative patches:
    1. Sender flatten — handle nested state_item_lens
    2. state_type plural-API fallback (matches Mooncake/NIXL)
    3. Consumer normalize state_item_lens at send_state entry
    4. SWA/DSA rank+length normalize before group_concurrent_contiguous
       (fixes GLM-5 DSA single-component np.diff broadcast crash)
  - benchmarks/multi_node/amd_utils/patches/README.md (new)
    Bug analysis, when-to-use table, opt-out knob documentation.
  - benchmarks/multi_node/amd_utils/job.slurm (+25)
    Auto-bind-mount the overlay when DOCKER_IMAGE_NAME contains
    "v0.5.12.post1". Opt-out via MORI_CONN_PATCH=skip. Appends to
    ${EXTRA_DOCKER_MOUNTS:-} so callers can still inject other mounts.
  - .github/configs/amd-master.yaml (+1/-1)
    Image bump v0.5.12-...-20260517 → v0.5.12.post1-...-20260523 to unlock
    the auto-apply gate (matches chun-chang lineage).
  - perf-changelog.yaml (+2/-1)
    Document the image bump and overlay rationale.

  Validated on chun-chang at the same image tag:

      glm5-fp8-mi355x-sglang-disagg GSM8K strict-match = 0.9712 ± 0.0046
      glm5-fp8-mi355x-sglang-disagg GSM8K flexible-extract = 0.9704 ± 0.0047

  Stop-gap until sglang migrates MoRI to the plural state_types API that
  Mooncake (mooncake/conn.py:912) and NIXL (nixl/conn.py:1381) already use.
  Tracks sgl-project/sglang#21886 and #22665.

  Co-authored-by: chunfangamd <chun.fang@amd.com>
  Co-authored-by: ChangLiu0709 <cliu1004@amd.com>

---------

Co-authored-by: cliu1004@amd.com <cliu1004@amd.com@mia1-p01-g18.mia.tensorwave.lan>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: chunfangamd <chun.fang@amd.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
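
The `struct.error` quoted in the commit message can be reproduced with the Python standard library alone. The sketch below is not the code carried in `mori_conn.py`: `flatten_item_lens` and `pack_item_lens` are hypothetical helper names, and the lengths are made-up values. It only illustrates why a nested `state_item_lens` breaks a bare `struct.pack("I", ...)` and how flattening first leaves flat callers unaffected.

```python
import struct
from typing import List, Sequence, Union

# Either a flat list of ints or a list of per-component lists of ints.
Nested = Sequence[Union[int, Sequence[int]]]


def flatten_item_lens(item_lens: Nested) -> List[int]:
    """Hypothetical helper: accept List[int] or List[List[int]], return List[int]."""
    flat: List[int] = []
    for entry in item_lens:
        if isinstance(entry, (list, tuple)):
            flat.extend(int(x) for x in entry)
        else:
            flat.append(int(entry))
    return flat


def pack_item_lens(item_lens: Nested) -> bytes:
    """Pack every length as a native unsigned int after flattening."""
    return b"".join(struct.pack("I", n) for n in flatten_item_lens(item_lens))


nested = [[4096, 128], [512]]  # illustrative values only

# Legacy assumption: every element is an int, so a nested list raises.
try:
    b"".join(struct.pack("I", item_len) for item_len in nested)
except struct.error as exc:
    print("legacy packing fails:", exc)

# Flattening first works and gives the same bytes as the already-flat input.
assert pack_item_lens(nested) == pack_item_lens([4096, 128, 512])
assert len(pack_item_lens(nested)) == 3 * struct.calcsize("I")
```

Because an already-flat list passes through `flatten_item_lens` unchanged, the same call site can serve both the legacy and the nested layout.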
1 parent 7ae613c commit 466be35

10 files changed

Lines changed: 2102 additions & 21 deletions

.github/configs/amd-master.yaml

Lines changed: 55 additions & 0 deletions
@@ -553,6 +553,61 @@ glm5-fp8-mi355x-sglang-mtp:
       - { tp: 4, conc-start: 4, conc-end: 128, spec-decoding: mtp }
       - { tp: 8, conc-start: 4, conc-end: 8, spec-decoding: mtp }
 
+glm5-fp8-mi355x-sglang-disagg:
+  image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523
+  model: zai-org/GLM-5-FP8
+  model-prefix: glm5
+  runner: mi355x-disagg
+  precision: fp8
+  framework: sglang-disagg
+  multinode: true
+  disagg: true
+  scenarios:
+    fixed-seq-len:
+    - isl: 1024
+      osl: 1024
+      search-space:
+      # 1P+1D TP8/EP1 CI smoke sweep (aligned with glm5-fp8-mi355x-sglang conc range)
+      - spec-decoding: "none"
+        conc-list: [ 8, 16, 32, 64, 128, 256, 512 ]
+        prefill:
+          num-worker: 1
+          tp: 8
+          ep: 1
+          dp-attn: false
+          additional-settings:
+          - "PREFILL_NODES=1"
+        decode:
+          num-worker: 1
+          tp: 8
+          ep: 1
+          dp-attn: false
+          additional-settings:
+          - "DECODE_NODES=1"
+          - "DECODE_MTP_SIZE=0"
+
+    - isl: 8192
+      osl: 1024
+      search-space:
+      # 1P+1D TP8/EP1 CI smoke sweep; dp-attn false (NSA / MoRI path)
+      - spec-decoding: "none"
+        conc-list: [ 8, 16, 32, 64, 128, 256, 512 ]
+        prefill:
+          num-worker: 1
+          tp: 8
+          ep: 1
+          dp-attn: false
+          additional-settings:
+          - "PREFILL_NODES=1"
+        decode:
+          num-worker: 1
+          tp: 8
+          ep: 1
+          dp-attn: false
+          additional-settings:
+          - "DECODE_NODES=1"
+          - "DECODE_MTP_SIZE=0"
+
 glm5-fp8-mi355x-atom:
   image: rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post
   model: zai-org/GLM-5-FP8
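
As a rough size check on the sweep this config adds, the sketch below counts benchmark points and GPUs. It assumes that each (scenario, concurrency) pair becomes one benchmark point and that each role uses `num-worker × tp` GPUs; neither rule is stated in this commit, and the dictionary layout and function names are illustrative rather than the CI's actual schema.

```python
from typing import Dict, List, Tuple

CONC_LIST = [8, 16, 32, 64, 128, 256, 512]

# Values copied from the glm5-fp8-mi355x-sglang-disagg entry; layout is hypothetical.
SCENARIOS: List[Dict] = [
    {"isl": 1024, "osl": 1024, "conc_list": CONC_LIST},
    {"isl": 8192, "osl": 1024, "conc_list": CONC_LIST},
]
ROLES: Dict[str, Dict[str, int]] = {
    "prefill": {"num_worker": 1, "tp": 8},
    "decode": {"num_worker": 1, "tp": 8},
}


def benchmark_points(scenarios: List[Dict]) -> List[Tuple[int, int, int]]:
    """One (isl, osl, concurrency) tuple per scenario and concurrency value."""
    return [(s["isl"], s["osl"], c) for s in scenarios for c in s["conc_list"]]


def total_gpus(roles: Dict[str, Dict[str, int]]) -> int:
    """Assumed GPU count: workers times tensor-parallel size, summed over roles."""
    return sum(r["num_worker"] * r["tp"] for r in roles.values())


points = benchmark_points(SCENARIOS)
assert len(points) == 14        # 2 scenarios x 7 concurrency values
assert total_gpus(ROLES) == 16  # 1P (TP8) + 1D (TP8)
```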

benchmarks/multi_node/amd_utils/env.sh

Lines changed: 7 additions & 0 deletions
@@ -140,6 +140,13 @@ else
     export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=3600
     export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=3600
 
+    # GLM-5: uses NSA (not MLA), needs fused-decode-MLA disabled + fast loading
+    if [[ "$MODEL_NAME" == "GLM-5-FP8" ]]; then
+        export SGLANG_ROCM_FUSED_DECODE_MLA=0
+        export ROCM_QUICK_REDUCE_QUANTIZATION=INT4
+        export SAFETENSORS_FAST_GPU=1
+    fi
+
     # Disable allocating memory in one pass
     export MORI_SHMEM_MODE=ISOLATION

benchmarks/multi_node/amd_utils/job.slurm

Lines changed: 25 additions & 0 deletions
@@ -55,6 +55,30 @@ echo "Runfile set: $RUN_FILE"
 # $(pwd) is amd_utils/ (the sbatch submit dir); go up 3 levels to reach the repo root.
 export DI_REPO_DIR=$(cd "$(pwd)/../../.." && pwd)
 
+# ── In-tree sglang patches: auto-apply for known-affected images ──────
+# sglang v0.5.12.post1 ships a known-broken MoRI PD-disaggregation
+# backend that crashes hybrid-attention models (GLM-5, Qwen3.5-MoE,
+# anything with state_types: List[StateType]) at startup. We carry an
+# in-tree overlay of mori/conn.py that fixes the wire format + the
+# legacy state_type fallback (see patches/README.md for the bug
+# analysis and patch detail).
+#
+# Auto-applied when the image tag contains "v0.5.12.post1", unless the
+# caller sets MORI_CONN_PATCH=skip. The overlay is appended to
+# ${EXTRA_DOCKER_MOUNTS:-} so callers can still inject other mounts.
+# Dedup guard avoids double-mounting if EXTRA_DOCKER_MOUNTS already
+# contains the target path (docker rejects duplicate destinations).
+_MORI_PATCH_FILE="$DI_REPO_DIR/benchmarks/multi_node/amd_utils/patches/mori_conn.py"
+_MORI_PATCH_TARGET="/sgl-workspace/sglang/python/sglang/srt/disaggregation/mori/conn.py"
+if [[ "${MORI_CONN_PATCH:-auto}" != "skip" ]] \
+   && [[ -f "$_MORI_PATCH_FILE" ]] \
+   && [[ "${DOCKER_IMAGE_NAME:-}" == *"v0.5.12.post1"* ]] \
+   && [[ "${EXTRA_DOCKER_MOUNTS:-}" != *"$_MORI_PATCH_TARGET"* ]]; then
+    EXTRA_DOCKER_MOUNTS="${EXTRA_DOCKER_MOUNTS:-} -v ${_MORI_PATCH_FILE}:${_MORI_PATCH_TARGET}:ro"
+    export EXTRA_DOCKER_MOUNTS
+    echo "[job.slurm] auto-applied MoRI conn.py overlay: ${_MORI_PATCH_FILE}"
+fi
+
 xP="${xP:-1}"
 yD="${yD:-1}"
 
@@ -465,6 +489,7 @@ fi
     -v /tmp:/run_logs \
     -v ${BENCHMARK_LOGS_DIR}:/benchmark_logs \
     -v ${DI_REPO_DIR}:${DOCKER_MOUNT_PATH} \
+    ${EXTRA_DOCKER_MOUNTS:-} \
     ${DOCKER_ENV_COMMON[*]} \
     ${DOCKER_ENV_ENGINE[*]} \
     --name \"$DOCKER_CONT_NAME\" \
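
The gate added to `job.slurm` combines four conditions: the opt-out knob, the presence of the overlay file, the image tag, and a duplicate-mount guard. The sketch below restates that logic in Python so it can be exercised off-cluster. It is an illustration, not part of the commit: `overlay_mount_args` and its parameters are hypothetical, and the real script does this in bash.

```python
import os
from typing import Mapping

PATCH_TARGET = "/sgl-workspace/sglang/python/sglang/srt/disaggregation/mori/conn.py"
AFFECTED_TAG = "v0.5.12.post1"


def overlay_mount_args(env: Mapping[str, str], patch_file: str, patch_exists: bool) -> str:
    """Return the EXTRA_DOCKER_MOUNTS value the gate would leave behind.

    `env.get(key) or default` mirrors bash's ${VAR:-default}, which treats
    an empty variable the same as an unset one.
    """
    mounts = env.get("EXTRA_DOCKER_MOUNTS") or ""
    should_apply = (
        (env.get("MORI_CONN_PATCH") or "auto") != "skip"         # opt-out knob
        and patch_exists                                         # [[ -f file ]]
        and AFFECTED_TAG in (env.get("DOCKER_IMAGE_NAME") or "")  # affected image
        and PATCH_TARGET not in mounts                            # dedup guard
    )
    if should_apply:
        # Append, so mounts injected by the caller are preserved.
        mounts = f"{mounts} -v {patch_file}:{PATCH_TARGET}:ro"
    return mounts


if __name__ == "__main__":
    candidate = "benchmarks/multi_node/amd_utils/patches/mori_conn.py"
    print(repr(overlay_mount_args(os.environ, candidate, os.path.isfile(candidate))))
```

Calling the function a second time with its own output in `EXTRA_DOCKER_MOUNTS` returns the value unchanged, which is the behaviour the dedup guard is meant to provide.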

benchmarks/multi_node/amd_utils/models.yaml

Lines changed: 31 additions & 0 deletions
@@ -192,6 +192,37 @@ Qwen3.5-397B-A17B-FP8:
       chunked_prefill_size: 262144
       cuda_graph_bs_range: "1-128"
 
+GLM-5-FP8:
+  base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --load-balance-method round_robin --disaggregation-transfer-backend mori --tool-call-parser glm47 --reasoning-parser glm45 --model-loader-extra-config '{\\\"enable_multithread_load\\\": true, \\\"num_threads\\\": 8}'"
+  mtp_flags: ""
+  dp_flags: "--moe-a2a-backend mori --enable-dp-attention --moe-dense-tp-size 1 --enable-dp-lm-head"
+  prefill:
+    mem_fraction_static: 0.8
+    disable_radix_cache: true
+    dp:
+      max_running_requests: 24
+      chunked_prefill_size: "MORI_MAX_DISPATCH_TOKENS_PREFILL * PREFILL_TP_SIZE"
+      cuda_graph_bs: "1 2 3"
+    no_dp:
+      max_running_requests: 128
+      chunked_prefill_size: 262144
+      cuda_graph_bs_range: "1-128"
+  decode:
+    mem_fraction_static: 0.85
+    prefill_round_robin_balance: true
+    dp:
+      max_running_requests: 4096
+      chunked_prefill_size: "MORI_MAX_DISPATCH_TOKENS_DECODE * DECODE_TP_SIZE"
+      cuda_graph_bs_range: "1-160"
+    ep_only:
+      max_running_requests: 256
+      chunked_prefill_size: 262144
+      cuda_graph_bs_range: "1-256"
+    no_dp:
+      max_running_requests: 128
+      chunked_prefill_size: 262144
+      cuda_graph_bs_range: "1-128"
+
 DeepSeek-R1-0528-MXFP4-Preview:
   base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --ep-dispatch-algorithm fake --load-balance-method round_robin --kv-cache-dtype fp8_e4m3 --attention-backend aiter --disaggregation-transfer-backend mori"
   mtp_flags: "--speculative-algorithm NEXTN --speculative-eagle-topk 1"

benchmarks/multi_node/amd_utils/patches/README.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
+# In-tree sglang patches for the MoRI PD-disagg path
+
+This directory carries small Python overlays that get bind-mounted over
+the upstream sglang source inside the docker container at runtime.
+They are needed because some sglang releases ship known bugs in the
+MoRI disaggregation backend that block our benchmark + accuracy
+configs.
+
+The mount is wired through the `EXTRA_DOCKER_MOUNTS` env var that
+`job.slurm` consumes (an opt-in `${EXTRA_DOCKER_MOUNTS:-}` after the
+existing `-v` block). The local-test driver scripts under
+`scripts/sglang_disagg/` pre-set this env var to the path of the
+relevant overlay; CI runners that need the patch can do the same.
+
+## `mori_conn.py`
+
+Overlays
+`/sgl-workspace/sglang/python/sglang/srt/disaggregation/mori/conn.py`.
+
+Source: forked from the file shipped in
+`lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523`
+(sglang [v0.5.12.post1](https://github.com/sgl-project/sglang/tree/v0.5.12.post1)).
+Four logical edits, all confined to `MoriKVReceiver.send_state`,
+`MoriKVReceiver._register_kv_args`, and
+`MoriKVReceiver._send_swa_dsa_state`:
+
+1. **Sender flatten** — handle the framework's nested
+   `state_item_lens: List[List[int]]` instead of crashing in the
+   naked `struct.pack("I", item_len)` (the legacy `List[int]`
+   assumption). Idempotent for legacy flat callers.
+2. **`state_type` legacy fallback** — when the legacy singular
+   `kv_args.state_type` is `'none'` but `state_mem_descs` is non-empty,
+   read `kv_args.state_types[0]` (the modern plural API that Mooncake
+   and NIXL already use). Routes `MAMBA → _send_mamba_state` and
+   `DSA/SWA → _send_swa_dsa_state` correctly.
+3. **Consumer normalization** — flatten `state_item_lens` and
+   `state_dim_per_tensor` to flat `List[int]` once at the entry of
+   `send_state`, so the existing per-tensor index arithmetic
+   (`state_item_lens[i]`) and length checks
+   (`len(state_item_lens) == len(state_mem_descs)`) keep working.
+4. **DSA index rank+length normalization** — inside
+   `_send_swa_dsa_state`, before the `group_concurrent_contiguous`
+   call, ravel both `src_state_indices` and `dst_state_indices` to 1-D
+   and re-truncate to common length. Upstream's existing truncation
+   only slices the outer axis, leaving 2-D `(1, N)` arrays unchanged
+   and triggering an `np.diff` broadcasting error
+   (`shapes (1,12) (0,)`) for GLM-5 (single-DSA-component) prefill
+   traffic. See
+   `scripts/sglang_disagg/docs_glm5/01-bug-analysis.md` for the full
+   write-up.
+
+Verified passing GSM8K = 0.978 ± 0.004 on Qwen3.5-397B-A17B-FP8 1P+1D
+TP=8 dp-attn=false (matches and slightly exceeds upstream
+[PR #22665](https://github.com/sgl-project/sglang/pull/22665)'s
+reported 0.970 GSM8K on the bf16 baseline). GLM-5 (DSA) verification
+in progress under
+`scripts/sglang_disagg/docs_glm5/02-fix-and-verification.md`.
+
+This is a stop-gap. The proper upstream fix is to migrate MoRI to the
+plural `state_types: List[StateType]` API (full design + diff in
+`scripts/sglang_disagg/docs/03-upstream-pr-proposal.md`).
+
+## How to enable
+
+```bash
+export EXTRA_DOCKER_MOUNTS="-v $DI_REPO_DIR/benchmarks/multi_node/amd_utils/patches/mori_conn.py:/sgl-workspace/sglang/python/sglang/srt/disaggregation/mori/conn.py:ro"
+```
+
+`$DI_REPO_DIR` is the InferenceX checkout root that `job.slurm`
+already mounts into the container at `/workspace`.
+
+When this env var is unset (CI default for runs that don't need the
+patch), `${EXTRA_DOCKER_MOUNTS:-}` expands to the empty string and
+container behavior is byte-identical to the unpatched path.
+
+## When to use which patch
+
+| Image / version | Need `mori_conn.py` overlay? |
+|---|---|
+| `lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523` | yes (Qwen3.5-MoE-FP8, GLM-5, any hybrid model on this image) |
+| `lmsysorg/sglang-rocm:v0.5.10.post1-rocm720-mi35x-*` (used by `dsr1-fp4-*-disagg`) | not validated; same code path likely affected — try with the overlay if you hit the same `struct.error` |
+| `rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-*` (used by `dsr1-fp8-*-disagg`, `glm5-*-disagg`) | predates [PR #22665](https://github.com/sgl-project/sglang/pull/22665); different code paths; **do not** apply this overlay |
+
+When upstream merges the proper fix (see
+`scripts/sglang_disagg/docs/03-upstream-pr-proposal.md`) and that
+fix lands in a published image, retire this overlay and the
+`EXTRA_DOCKER_MOUNTS` knob can stay (still useful for future patches).
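
The README's second edit (the `state_type` legacy fallback) can be sketched as follows. This is a minimal illustration, not the overlay's code: `FakeKVArgs`, `resolve_state_type`, and `pick_sender` are hypothetical names, and state types are modeled as lowercase strings here, whereas sglang uses its own `StateType` type. Only the two sender method names are taken from the README.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeKVArgs:
    """Hypothetical stand-in for sglang's KVArgs with only the fields used here."""
    state_type: str = "none"                                  # legacy singular field
    state_types: List[str] = field(default_factory=list)      # modern plural field
    state_mem_descs: List[object] = field(default_factory=list)


def resolve_state_type(kv_args: FakeKVArgs) -> str:
    """Prefer the legacy field; fall back to state_types[0] when state exists."""
    legacy = str(kv_args.state_type).lower()
    if legacy != "none":
        return legacy
    if kv_args.state_mem_descs and kv_args.state_types:
        return str(kv_args.state_types[0]).lower()
    return "none"


def pick_sender(kv_args: FakeKVArgs) -> str:
    """Name the sender the README says each state type should route to."""
    state_type = resolve_state_type(kv_args)
    if state_type == "mamba":
        return "_send_mamba_state"
    if state_type in ("dsa", "swa"):
        return "_send_swa_dsa_state"
    return "no state transfer"


# A model that only populates the plural API still reaches the DSA/SWA sender.
glm5_like = FakeKVArgs(state_types=["dsa"], state_mem_descs=[object()])
assert pick_sender(glm5_like) == "_send_swa_dsa_state"
```

The fallback only fires when `state_mem_descs` is non-empty, so callers with no state to transfer keep the legacy `'none'` behaviour.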
