Commit 998408b
fix MoRI dispatch corruption at the root: moriep.py overlay floors dispatch tokens to 256; remove harness env-clamp bandaid
Replaces the server_sglang.sh env clamp (a launch-script workaround) with a
library-level fix: an overlay of sglang's moriep.py that floors
num_max_dispatch_tokens_per_rank to 256 at its env read -- the single source of
truth feeding both the kernel selection and the mori buffer-sizing arg.
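A minimal sketch of the floor-at-env-read idea, not the actual overlay code: the environment variable name, the default value, and the helper name below are hypothetical placeholders. The only values taken from this commit are the floor of 256 and the fact that a single read feeds both consumers.

```python
import os

# Floor value stated in the commit message.
DISPATCH_TOKENS_FLOOR = 256


def read_dispatch_tokens(env=None,
                         var="NUM_MAX_DISPATCH_TOKENS_PER_RANK",  # hypothetical name
                         default=256):                            # hypothetical default
    """Read the per-rank dispatch token count once and clamp it from below."""
    env = os.environ if env is None else env
    raw = env.get(var)
    requested = int(raw) if raw is not None else default
    # Values below the floor are raised to it; larger values pass through.
    return max(requested, DISPATCH_TOKENS_FLOOR)


# One read, two consumers: both see the same floored value.
num_max_dispatch_tokens_per_rank = read_dispatch_tokens(
    {"NUM_MAX_DISPATCH_TOKENS_PER_RANK": "32"}
)
kernel_selection_input = num_max_dispatch_tokens_per_rank
buffer_sizing_input = num_max_dispatch_tokens_per_rank
```

Clamping at the read site rather than in the launch script means any caller that sets a smaller value still ends up with a consistent kernel choice and buffer size.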
Root cause (adversarially verified against mori/sglang source): for single-node
EP8 the intra-node dispatch (DispatchIntraNodeBlock) writes received tokens into
a per-PE buffer sized MaxNumTokensToRecv() = worldSize * maxNumInpTokenPerRank
(dispatch_combine.hpp:126-136; max_total_recv_tokens defaults to 0 -> that
fallback, and it is a cap not a floor). The harness derives maxNumInpTokenPerRank
from max(CONC_LIST)/TP*(MTP+1), which collapses to 32 at conc-64/TP8/MTP3. The
per-dest atomic counter then overruns the buffer; the only guard is
assert(destTokId < MaxNumTokensToRecv()), compiled out under -DNDEBUG -> silent
out-of-bounds writes -> output that decodes fine (high acceptance length) but is
semantically garbage (gsm8k=0).
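The arithmetic in the paragraph above can be reproduced with a short sketch. The function names are illustrative, the division is assumed to be integer division, and CONC_LIST is assumed to have 64 as its maximum; the formulas themselves are the ones quoted in the commit message.

```python
def harness_tokens_per_rank(conc_list, tp, mtp):
    """max(CONC_LIST)/TP*(MTP+1), assuming integer division."""
    return max(conc_list) // tp * (mtp + 1)


def max_num_tokens_to_recv(world_size, max_num_inp_token_per_rank):
    """Fallback receive-buffer size: worldSize * maxNumInpTokenPerRank."""
    return world_size * max_num_inp_token_per_rank


# conc-64 / TP8 / MTP3, as in the commit message.
per_rank = harness_tokens_per_rank([64], tp=8, mtp=3)

# Receive-buffer slots per PE for single-node EP8, without and with the floor.
slots_unfloored = max_num_tokens_to_recv(8, per_rank)
slots_floored = max_num_tokens_to_recv(8, max(per_rank, 256))

print(per_rank, slots_unfloored, slots_floored)  # 32 256 2048
```

The sketch only shows how small the derived sizes become; it does not model the dispatch kernel or the counter that overruns the buffer.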
Delivery: patches/moriep.py is byte-identical to upstream v0.5.12.post1
(md5 ac626f5459...) plus a +22-line floor; auto-mounted by job.slurm via
EXTRA_DOCKER_MOUNTS, gated on the v0.5.12.post1 image tag (same mechanism as the
mori_conn.py overlay). Empirically validated on MI355X (conc-64 DEP8+MTP3):
dispatch 32 -> gsm8k 0.00, 64 -> 0.00 (one wavefront insufficient), 256 -> 0.94.
Throughput unchanged; the corrupt run's ~3% edge was dropped work.
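One way to check the "upstream plus added lines only" property of an overlay is a pure-insertion diff. This is a sketch using Python's standard difflib, not the procedure used for this commit; the helper name is hypothetical, and the check is conservative because SequenceMatcher does not guarantee a minimal diff.

```python
import difflib


def added_lines_only(upstream_text, overlay_text):
    """Return the lines the overlay adds; raise if it changes or deletes any."""
    a = upstream_text.splitlines()
    b = overlay_text.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    added = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag != "insert":
            # "replace" or "delete" means an upstream line was altered.
            raise ValueError(f"overlay modifies upstream lines {i1 + 1}-{i2}")
        added.extend(b[j1:j2])
    return added


# Toy inputs standing in for the upstream file and the overlay.
upstream = "a\nb\nc"
overlay = "a\nb\nX\nY\nc"
print(added_lines_only(upstream, overlay))  # ['X', 'Y']
```

Applied to the real files, the length of the returned list could be compared with the +22 lines stated above.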
Upstream issues: sgl-project/sglang#27194, ROCm/mori#356.
1 parent 272be18 commit 998408b
5 files changed
Lines changed: 1200 additions & 29 deletions
File tree
- benchmarks/multi_node/amd_utils
- patches