Commit 272be18

raise MoRI dispatch-buffer floor to 256 (warpSize=64 proven insufficient)
The conc-64 run with the warpSize floor (64) still scored gsm8k=0.00 (run 26919517564), disproving the one-wavefront hypothesis. The per-rank dispatch buffer must hold the routing fan-in (a receiving rank takes tokens from all worldSize peers), not just one warp-chunk. Empirically on MI355X: dispatch=32 -> 0.00, dispatch=64 -> 0.00, dispatch>=256 -> 0.94. Clamp to the proven 256. Throughput is unchanged; the corrupt run's ~3% edge was dropped work, not real speed.
1 parent 04ceb08 commit 272be18

2 files changed

Lines changed: 25 additions & 15 deletions

benchmarks/multi_node/amd_utils/server_sglang.sh

Lines changed: 24 additions & 14 deletions
@@ -248,22 +248,32 @@ if [[ "$DECODE_MTP_SIZE" -gt 0 ]]; then
     MORI_MOE_MAX_INPUT_TOKENS_DECODE=$((MORI_MOE_MAX_INPUT_TOKENS_DECODE * (DECODE_MTP_SIZE + 1)))
 fi
 
-# ── MoRI dispatch-buffer warpSize floor ──────────────────────────────────────
+# ── MoRI dispatch-buffer minimum floor ──────────────────────────────────────
 # The MoRI All2All dispatch kernel (EpDispatchInterNodeV1Kernel / IntraNode)
-# places dispatched tokens in warpSize-aligned receive slots:
-# destTokId = flagSlotId * warpSize + laneId (laneId = 0..warpSize-1)
-# i.e. each warp writes up to warpSize=64 (CDNA3/4 wavefront) token slots per
-# chunk. The per-rank receive region is sized to maxNumInpTokenPerRank, which
-# the harness derives from max(CONC_LIST)/TP*(MTP+1). At low concurrency this
-# collapses below 64 (e.g. conc-64 / TP8 / MTP3 -> 64/8*4 = 32), so a single
-# warp-chunk overruns the 32-slot region -> silent out-of-bounds writes ->
-# semantically corrupt output (decodes fine, gsm8k = 0.0). Clamp to one
-# wavefront so the slot arithmetic is always in-bounds. This only raises the
-# value at low conc (high conc is naturally larger); it adds a few MB of
-# staging buffer but no compute, so real throughput is unchanged.
-MORI_DISPATCH_TOKENS_FLOOR=64
+# silently corrupts output when the per-rank dispatch buffer
+# (maxNumInpTokenPerRank = SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK) is too
+# small. The harness derives that value from max(CONC_LIST)/TP*(MTP+1), which
+# collapses at low concurrency (conc-64 / TP8 / MTP3 -> 64/8*4 = 32). Two things
+# break: (1) the kernel writes tokens in warpSize-aligned chunks
+# (destTokId = flagSlotId*warpSize + laneId, laneId 0..63), so a buffer < 64
+# can't even hold one wavefront; (2) a receiving rank takes tokens from all
+# `worldSize` peers, so the per-rank buffer must hold the routing fan-in, not
+# just the local token count. The result is out-of-bounds receive-slot writes
+# -> output that decodes fine (acceptance length stays high) but is semantically
+# garbage (gsm8k = 0.0).
+#
+# Empirically validated on MI355X (conc-64 DEP8+MTP3, this config):
+# dispatch=32 -> gsm8k 0.00 (run 26913235190)
+# dispatch=64 -> gsm8k 0.00 (run 26919517564) # warpSize alone insufficient
+# dispatch>=256 -> gsm8k 0.94 (run 26912330265)
+# So clamp to 256. This only raises the value at low conc (high conc is already
+# larger); it adds a few MB of staging buffer but no compute, so real throughput
+# is unchanged (the ~3% edge of the corrupt run was an artifact of dropped work).
+# NOTE: 128 is untested; the proper upstream fix sizes the buffer from the
+# routing fan-in rather than a flat constant.
+MORI_DISPATCH_TOKENS_FLOOR=256
 if [[ "$MORI_MAX_DISPATCH_TOKENS_DECODE" -lt "$MORI_DISPATCH_TOKENS_FLOOR" ]]; then
-    echo "[MoRI floor] DISPATCH_TOKENS=${MORI_MAX_DISPATCH_TOKENS_DECODE} < warpSize floor ${MORI_DISPATCH_TOKENS_FLOOR}; clamping to ${MORI_DISPATCH_TOKENS_FLOOR}"
+    echo "[MoRI floor] DISPATCH_TOKENS=${MORI_MAX_DISPATCH_TOKENS_DECODE} < floor ${MORI_DISPATCH_TOKENS_FLOOR}; clamping to ${MORI_DISPATCH_TOKENS_FLOOR}"
     MORI_MAX_DISPATCH_TOKENS_DECODE=$MORI_DISPATCH_TOKENS_FLOOR
 fi
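
The arithmetic in the comment above can be checked in isolation. What follows is a minimal sketch, not the harness code: the helper functions `derive_dispatch_tokens` and `clamp_to_floor` and the 1024-concurrency case are hypothetical names and values introduced for illustration, and only the formula max(CONC_LIST)/TP*(MTP+1) and the floor of 256 come from the diff.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. derive_dispatch_tokens and clamp_to_floor are
# hypothetical helpers, not functions from server_sglang.sh.

# Per-rank dispatch-token count as described in the diff comment:
# max(CONC_LIST) / TP * (MTP + 1), evaluated with integer arithmetic.
derive_dispatch_tokens() {
    local max_conc=$1 tp=$2 mtp=$3
    echo $(( max_conc / tp * (mtp + 1) ))
}

# Raise a value to the floor when it falls below it; larger values pass through.
clamp_to_floor() {
    local value=$1 floor=$2
    if [ "$value" -lt "$floor" ]; then
        echo "$floor"
    else
        echo "$value"
    fi
}

MORI_DISPATCH_TOKENS_FLOOR=256

low=$(derive_dispatch_tokens 64 8 3)      # conc-64 / TP8 / MTP3 from the diff
high=$(derive_dispatch_tokens 1024 8 3)   # assumed high-concurrency case

echo "low:  derived=${low} clamped=$(clamp_to_floor "$low" "$MORI_DISPATCH_TOKENS_FLOOR")"
echo "high: derived=${high} clamped=$(clamp_to_floor "$high" "$MORI_DISPATCH_TOKENS_FLOOR")"
```

For the conc-64 case the derived value is 32 and the clamp raises it to 256. The assumed 1024 case derives 512 and passes through unchanged, which matches the claim that the floor only takes effect at low concurrency.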

perf-changelog.yaml

Lines changed: 1 addition & 1 deletion
@@ -3455,5 +3455,5 @@
 - config-keys:
     - dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp
   description:
-    - "Throwaway: validate warpSize floor fix — clamp MORI_MAX_DISPATCH_TOKENS_DECODE >= 64 (CDNA3/4 wavefront) in server_sglang.sh. The MoRI All2All dispatch kernel writes warpSize-aligned receive slots (destTokId = flagSlotId*warpSize + laneId), so a per-rank buffer < 64 overruns its region -> silent corruption (conc-64/TP8/MTP3 -> 32 tokens -> gsm8k=0). If gsm8k recovers, 64 is the minimal correct floor (best perf vs the proven-but-larger 256)."
+    - "Fix MoRI dispatch-buffer corruption at low concurrency: clamp MORI_MAX_DISPATCH_TOKENS_DECODE >= 256 in server_sglang.sh. The harness sizes the per-rank All2All dispatch buffer from max(CONC_LIST)/TP*(MTP+1), which collapses to 32 at conc-64/TP8/MTP3 and silently corrupts the dispatch kernel's receive slots (decodes fine, gsm8k=0). Confirmed on MI355X: dispatch=32->0.00, dispatch=64->0.00 (warpSize alone insufficient), dispatch>=256->0.94. Throughput unchanged (the corrupt run's ~3% edge was dropped work)."
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1659
