raise MoRI dispatch-buffer floor to 256 (warpSize=64 proven insufficient)

Oseltamivir · Oseltamivir · commit 272be180b3a6 · 2026-06-03T16:50:52.000-07:00
The conc-64 run with the warpSize floor (64) still scored gsm8k=0.00
(run 26919517564), disproving the one-wavefront hypothesis. The per-rank
dispatch buffer must hold the routing fan-in (a receiving rank takes tokens
from all worldSize peers), not just one warp-chunk. Empirically on MI355X:
dispatch=32 -> 0.00, dispatch=64 -> 0.00, dispatch>=256 -> 0.94. Clamp to the
proven 256. Throughput is unchanged; the corrupt run's ~3% edge was dropped
work, not real speed.
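
For reference, the sizing arithmetic and the clamp can be sketched in
isolation. This is only an illustration, not the code in server_sglang.sh:
the function name derive_dispatch_tokens and its positional arguments are
hypothetical, and it assumes the harness uses plain integer arithmetic for
max(CONC_LIST)/TP*(MTP+1).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of server_sglang.sh): reproduces the sizing
# arithmetic max_conc / TP * (MTP + 1), then applies the floor.
derive_dispatch_tokens() {
    local max_conc=$1 tp=$2 mtp=$3 floor=$4
    # Integer division first, then scale by the MTP draft depth plus one.
    local tokens=$(( max_conc / tp * (mtp + 1) ))
    # Clamp upward only; values already at or above the floor pass through.
    if [[ "$tokens" -lt "$floor" ]]; then
        tokens=$floor
    fi
    echo "$tokens"
}

derive_dispatch_tokens 64 8 3 256     # conc-64 / TP8 / MTP3: raw 32, clamped to 256
derive_dispatch_tokens 1024 8 3 256   # raw 512 is above the floor, left unchanged
```

The clamp is one-directional, so high-concurrency configs keep their
naturally larger buffer and only the low-concurrency case changes.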
diff --git a/benchmarks/multi_node/amd_utils/server_sglang.sh b/benchmarks/multi_node/amd_utils/server_sglang.sh
@@ -248,22 +248,32 @@ if [[ "$DECODE_MTP_SIZE" -gt 0 ]]; then
     MORI_MOE_MAX_INPUT_TOKENS_DECODE=$((MORI_MOE_MAX_INPUT_TOKENS_DECODE * (DECODE_MTP_SIZE + 1)))
 fi
 
-# ── MoRI dispatch-buffer warpSize floor ──────────────────────────────────────
+# ── MoRI dispatch-buffer minimum floor ───────────────────────────────────────
 # The MoRI All2All dispatch kernel (EpDispatchInterNodeV1Kernel / IntraNode)
-# places dispatched tokens in warpSize-aligned receive slots:
-#     destTokId = flagSlotId * warpSize + laneId      (laneId = 0..warpSize-1)
-# i.e. each warp writes up to warpSize=64 (CDNA3/4 wavefront) token slots per
-# chunk. The per-rank receive region is sized to maxNumInpTokenPerRank, which
-# the harness derives from max(CONC_LIST)/TP*(MTP+1). At low concurrency this
-# collapses below 64 (e.g. conc-64 / TP8 / MTP3 -> 64/8*4 = 32), so a single
-# warp-chunk overruns the 32-slot region -> silent out-of-bounds writes ->
-# semantically corrupt output (decodes fine, gsm8k = 0.0). Clamp to one
-# wavefront so the slot arithmetic is always in-bounds. This only raises the
-# value at low conc (high conc is naturally larger); it adds a few MB of
-# staging buffer but no compute, so real throughput is unchanged.
-MORI_DISPATCH_TOKENS_FLOOR=64
+# silently corrupts output when the per-rank dispatch buffer
+# (maxNumInpTokenPerRank = SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK) is too
+# small. The harness derives that value from max(CONC_LIST)/TP*(MTP+1), which
+# collapses at low concurrency (conc-64 / TP8 / MTP3 -> 64/8*4 = 32). Two things
+# break: (1) the kernel writes tokens in warpSize-aligned chunks
+# (destTokId = flagSlotId*warpSize + laneId, laneId 0..63), so a buffer < 64
+# can't even hold one wavefront; (2) a receiving rank takes tokens from all
+# `worldSize` peers, so the per-rank buffer must hold the routing fan-in, not
+# just the local token count. The result is out-of-bounds receive-slot writes
+# -> output that decodes fine (acceptance length stays high) but is semantically
+# garbage (gsm8k = 0.0).
+#
+# Empirically validated on MI355X (conc-64 DEP8+MTP3, this config):
+#     dispatch=32 -> gsm8k 0.00   (run 26913235190)
+#     dispatch=64 -> gsm8k 0.00   (run 26919517564)  # warpSize alone insufficient
+#     dispatch>=256 -> gsm8k 0.94 (run 26912330265)
+# So clamp to 256. This only raises the value at low conc (high conc is already
+# larger); it adds a few MB of staging buffer but no compute, so real throughput
+# is unchanged (the ~3% edge of the corrupt run was an artifact of dropped work).
+# NOTE: 128 is untested; the proper upstream fix sizes the buffer from the
+# routing fan-in rather than a flat constant.
+MORI_DISPATCH_TOKENS_FLOOR=256
 if [[ "$MORI_MAX_DISPATCH_TOKENS_DECODE" -lt "$MORI_DISPATCH_TOKENS_FLOOR" ]]; then
-    echo "[MoRI floor] DISPATCH_TOKENS=${MORI_MAX_DISPATCH_TOKENS_DECODE} < warpSize floor ${MORI_DISPATCH_TOKENS_FLOOR}; clamping to ${MORI_DISPATCH_TOKENS_FLOOR}"
+    echo "[MoRI floor] DISPATCH_TOKENS=${MORI_MAX_DISPATCH_TOKENS_DECODE} < floor ${MORI_DISPATCH_TOKENS_FLOOR}; clamping to ${MORI_DISPATCH_TOKENS_FLOOR}"
     MORI_MAX_DISPATCH_TOKENS_DECODE=$MORI_DISPATCH_TOKENS_FLOOR
 fi
 
diff --git a/perf-changelog.yaml b/perf-changelog.yaml
@@ -3455,5 +3455,5 @@
 - config-keys:
     - dsr1-fp4-mi355x-sglang-disagg-8k1k-mtp
   description:
-    - "Throwaway: validate warpSize floor fix — clamp MORI_MAX_DISPATCH_TOKENS_DECODE >= 64 (CDNA3/4 wavefront) in server_sglang.sh. The MoRI All2All dispatch kernel writes warpSize-aligned receive slots (destTokId = flagSlotId*warpSize + laneId), so a per-rank buffer < 64 overruns its region -> silent corruption (conc-64/TP8/MTP3 -> 32 tokens -> gsm8k=0). If gsm8k recovers, 64 is the minimal correct floor (best perf vs the proven-but-larger 256)."
+    - "Fix MoRI dispatch-buffer corruption at low concurrency: clamp MORI_MAX_DISPATCH_TOKENS_DECODE >= 256 in server_sglang.sh. The harness sizes the per-rank All2All dispatch buffer from max(CONC_LIST)/TP*(MTP+1), which collapses to 32 at conc-64/TP8/MTP3 and silently corrupts the dispatch kernel's receive slots (decodes fine, gsm8k=0). Confirmed on MI355X: dispatch=32->0.00, dispatch=64->0.00 (warpSize alone insufficient), dispatch>=256->0.94. Throughput unchanged (the corrupt run's ~3% edge was dropped work)."
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1659