Commit 998408b

fix MoRI dispatch corruption at the root: moriep.py overlay floors dispatch tokens to 256; remove harness env-clamp bandaid
Replaces the server_sglang.sh env clamp (a launch-script workaround) with a library-level fix: an overlay of sglang's moriep.py that floors num_max_dispatch_tokens_per_rank to 256 at its env read -- the single source of truth feeding both the kernel selection and the mori buffer-sizing arg.

Root cause (adversarially verified against mori/sglang source): for single-node EP8 the intra-node dispatch (DispatchIntraNodeBlock) writes received tokens into a per-PE buffer sized MaxNumTokensToRecv() = worldSize * maxNumInpTokenPerRank (dispatch_combine.hpp:126-136; max_total_recv_tokens defaults to 0 -> that fallback, and it is a cap not a floor). The harness derives maxNumInpTokenPerRank from max(CONC_LIST)/TP*(MTP+1), which collapses to 32 at conc-64/TP8/MTP3. The per-dest atomic counter then overruns the buffer; the only guard is assert(destTokId < MaxNumTokensToRecv()), compiled out under -DNDEBUG -> silent out-of-bounds writes -> output that decodes fine (high acceptance length) but is semantically garbage (gsm8k=0).

Delivery: patches/moriep.py is byte-identical to upstream v0.5.12.post1 (md5 ac626f5459...) plus a +22-line floor; auto-mounted by job.slurm via EXTRA_DOCKER_MOUNTS, gated on the v0.5.12.post1 image tag (same mechanism as the mori_conn.py overlay).

Empirically validated on MI355X (conc-64 DEP8+MTP3): dispatch 32 -> gsm8k 0.00, 64 -> 0.00 (one wavefront insufficient), 256 -> 0.94. Throughput unchanged; the corrupt run's ~3% edge was dropped work.

Upstream issues: sgl-project/sglang#27194, ROCm/mori#356.
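The arithmetic in the message can be checked with a minimal sketch. The helper names below are hypothetical (they are not sglang or MoRI APIs), and integer division is assumed for the harness formula; only the formula, the 256 floor, and the `worldSize * maxNumInpTokenPerRank` sizing come from the commit text.

```python
# Illustrative only: these helpers are hypothetical, not sglang or MoRI APIs.
DISPATCH_FLOOR = 256  # floor value stated in the commit message


def harness_dispatch_tokens(conc_list, tp, mtp):
    """Per-rank dispatch tokens as the harness derives them:
    max(CONC_LIST)/TP*(MTP+1), assuming integer division."""
    return max(conc_list) // tp * (mtp + 1)


def floored_dispatch_tokens(requested, floor=DISPATCH_FLOOR):
    """Apply a floor like the overlay's: never go below `floor` tokens per rank."""
    return max(requested, floor)


def recv_buffer_tokens(world_size, max_inp_tokens_per_rank):
    """Fallback sizing described in the commit:
    MaxNumTokensToRecv() = worldSize * maxNumInpTokenPerRank."""
    return world_size * max_inp_tokens_per_rank


requested = harness_dispatch_tokens([64], tp=8, mtp=3)  # 64 // 8 * 4 = 32
effective = floored_dispatch_tokens(requested)          # raised to 256

print(requested, recv_buffer_tokens(8, requested))  # 32 256
print(effective, recv_buffer_tokens(8, effective))  # 256 2048
```

Under these assumptions the floor leaves any request of 256 or more untouched, so higher-concurrency runs keep the value the harness computed.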
1 parent 272be18 commit 998408b

5 files changed

Lines changed: 1200 additions & 29 deletions

benchmarks/multi_node/amd_utils/job.slurm

Lines changed: 21 additions & 0 deletions
@@ -79,6 +79,27 @@ if [[ "${MORI_CONN_PATCH:-auto}" != "skip" ]] \
     echo "[job.slurm] auto-applied MoRI conn.py overlay: ${_MORI_PATCH_FILE}"
 fi

+# ── MoRI dispatch-buffer corruption fix: moriep.py overlay ────────────
+# sglang v0.5.12.post1 silently corrupts the MoRI EP dispatch path when the
+# per-rank dispatch buffer (num_max_dispatch_tokens_per_rank) is small: the
+# receive buffer is sized worldSize*maxNumInpTokenPerRank and the only overflow
+# guard is an assert() compiled out in release builds, so low concurrency
+# (e.g. conc-64 DEP8+MTP3 -> 32 tokens) yields out-of-bounds writes and gsm8k=0.
+# The overlay floors num_max_dispatch_tokens_per_rank to 256 at its env read
+# (the single source of truth for kernel selection + buffer sizing). The base
+# file is byte-identical to upstream v0.5.12.post1 (md5 ac626f5459...), so the
+# overlay is a +22-line diff. See patches/README.md and sgl-project/sglang#27194.
+_MORIEP_PATCH_FILE="$DI_REPO_DIR/benchmarks/multi_node/amd_utils/patches/moriep.py"
+_MORIEP_PATCH_TARGET="/sgl-workspace/sglang/python/sglang/srt/layers/moe/token_dispatcher/moriep.py"
+if [[ "${MORIEP_PATCH:-auto}" != "skip" ]] \
+   && [[ -f "$_MORIEP_PATCH_FILE" ]] \
+   && [[ "${DOCKER_IMAGE_NAME:-}" == *"v0.5.12.post1"* ]] \
+   && [[ "${EXTRA_DOCKER_MOUNTS:-}" != *"$_MORIEP_PATCH_TARGET"* ]]; then
+    EXTRA_DOCKER_MOUNTS="${EXTRA_DOCKER_MOUNTS:-} -v ${_MORIEP_PATCH_FILE}:${_MORIEP_PATCH_TARGET}:ro"
+    export EXTRA_DOCKER_MOUNTS
+    echo "[job.slurm] auto-applied MoRI moriep.py dispatch-floor overlay: ${_MORIEP_PATCH_FILE}"
+fi
+
 xP="${xP:-1}"
 yD="${yD:-1}"

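The script comment notes that the only overflow guard is an `assert()` compiled out in release builds. The following is a rough Python analogue of the difference between that and a real bounds guard; `RecvBuffer` and `reserve_slot` are hypothetical names for illustration and do not mirror MoRI's actual kernel code. Python's own `assert` is likewise stripped when the interpreter runs with `-O`, which is why the sketch uses an explicit check.

```python
# Hypothetical sketch: RecvBuffer and reserve_slot are illustrative names,
# not MoRI APIs. They model a fixed-capacity receive buffer whose slots are
# handed out by a monotonically increasing counter.
class RecvBuffer:
    def __init__(self, world_size, max_inp_tokens_per_rank):
        # Capacity follows the sizing described in the commit:
        # worldSize * maxNumInpTokenPerRank.
        self.capacity = world_size * max_inp_tokens_per_rank
        self.next_slot = 0  # stands in for the per-destination atomic counter

    def reserve_slot(self):
        slot = self.next_slot
        # Explicit check rather than `assert`, so it survives `python -O`
        # (the analogue of building C/C++ with -DNDEBUG).
        if slot >= self.capacity:
            raise OverflowError(
                f"slot {slot} exceeds receive capacity {self.capacity}"
            )
        self.next_slot += 1
        return slot


buf = RecvBuffer(world_size=8, max_inp_tokens_per_rank=32)
for _ in range(buf.capacity):
    buf.reserve_slot()  # fills all 256 slots

try:
    buf.reserve_slot()  # the 257th reservation is rejected, not written
except OverflowError as exc:
    print("rejected:", exc)
```

With a guard of this shape an undersized buffer fails loudly instead of producing silently corrupted output, which is the behaviour the upstream issues ask for.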
benchmarks/multi_node/amd_utils/patches/README.md

Lines changed: 39 additions & 0 deletions
@@ -60,6 +60,45 @@ This is a stop-gap. The proper upstream fix is to migrate MoRI to the
 plural `state_types: List[StateType]` API (full design + diff in
 `scripts/sglang_disagg/docs/03-upstream-pr-proposal.md`).

+## `moriep.py`
+
+Overlays
+`/sgl-workspace/sglang/python/sglang/srt/layers/moe/token_dispatcher/moriep.py`.
+
+Source: forked from `lmsysorg/sglang-rocm:v0.5.12.post1-*` (sglang
+[v0.5.12.post1](https://github.com/sgl-project/sglang/tree/v0.5.12.post1)).
+The base file is **byte-identical to the upstream tag**
+(`md5 ac626f5459a699f9ac953d9d8e71d861`); the overlay is a single
++22-line insertion in `MoriTokenDispatcher.__init__`.
+
+**Bug it fixes:** at low concurrency the MoRI EP dispatch path silently
+corrupts output (decodes fine, acceptance length stays high, but gsm8k
+drops to 0). The per-rank dispatch buffer
+`num_max_dispatch_tokens_per_rank` (→ mori `max_num_inp_token_per_rank`)
+is derived by the harness as `max(CONC_LIST)/TP*(MTP+1)`, which collapses
+at low conc (conc-64 / TP8 / MTP3 → `64/8*4 = 32`). MoRI sizes its
+receive buffer `MaxNumTokensToRecv() = worldSize * maxNumInpTokenPerRank`
+(`max_total_recv_tokens` defaults to 0 → that fallback, and it is a *cap*
+not a floor — `dispatch_combine.hpp:126-136`). The intra-node dispatch
+kernel's per-dest atomic counter then runs past that buffer; the only
+guard is `assert(destTokId < MaxNumTokensToRecv())`, compiled out under
+`-DNDEBUG`, so the result is silent out-of-bounds writes
+(`internode_v1.cpp` `DispatchIntraNodeBlock`).
+
+The overlay floors `num_max_dispatch_tokens_per_rank` to **256** right at
+its env read — the single source of truth that feeds both
+`get_ep_dispatch_configs()` (kernel selection) and the buffer-sizing
+arg. Empirically validated on MI355X (conc-64 DEP8+MTP3):
+dispatch `32 → gsm8k 0.00`, `64 → 0.00` (one wavefront is not enough),
+`256 → 0.94`.
+
+This is a stop-gap. The proper upstream fix is in MoRI: size the receive
+buffer from the routing fan-in and turn the compiled-out `assert` into a
+real bounds guard (see [ROCm/mori#356](https://github.com/ROCm/mori/issues/356)).
+The integration-level guard belongs in sglang's `moriep.py`
+([sgl-project/sglang#27194](https://github.com/sgl-project/sglang/issues/27194)) —
+this overlay is exactly that guard, pending upstream merge.
+
 ## How to enable

```bash
