SemiAnalysisAI
diff --git a/benchmarks/multi_node/amd_utils/job.slurm b/benchmarks/multi_node/amd_utils/job.slurm
Lines changed: 0 additions & 21 deletions
diff --git a/benchmarks/multi_node/amd_utils/patches/README.md b/benchmarks/multi_node/amd_utils/patches/README.md
Lines changed: 29 additions & 15 deletions
diff --git a/benchmarks/multi_node/amd_utils/patches/apply_moriep_dispatch_floor.py b/benchmarks/multi_node/amd_utils/patches/apply_moriep_dispatch_floor.py
Lines changed: 127 additions & 0 deletions
@@ -79,27 +79,6 @@ if [[ "${MORI_CONN_PATCH:-auto}" != "skip" ]] \
     echo "[job.slurm] auto-applied MoRI conn.py overlay: ${_MORI_PATCH_FILE}"
 fi
 
-# ── MoRI dispatch-buffer corruption fix: moriep.py overlay ────────────
-# sglang v0.5.12.post1 silently corrupts the MoRI EP dispatch path when the
-# per-rank dispatch buffer (num_max_dispatch_tokens_per_rank) is small: the
-# receive buffer is sized worldSize*maxNumInpTokenPerRank and the only overflow
-# guard is an assert() compiled out in release builds, so low concurrency
-# (e.g. conc-64 DEP8+MTP3 -> 32 tokens) yields out-of-bounds writes and gsm8k=0.
-# The overlay floors num_max_dispatch_tokens_per_rank to 256 at its env read
-# (the single source of truth for kernel selection + buffer sizing). The base
-# file is byte-identical to upstream v0.5.12.post1 (md5 ac626f5459...), so the
-# overlay is a +22-line diff. See patches/README.md and sgl-project/sglang#27194.
-_MORIEP_PATCH_FILE="$DI_REPO_DIR/benchmarks/multi_node/amd_utils/patches/moriep.py"
-_MORIEP_PATCH_TARGET="/sgl-workspace/sglang/python/sglang/srt/layers/moe/token_dispatcher/moriep.py"
-if [[ "${MORIEP_PATCH:-auto}" != "skip" ]] \
-   && [[ -f "$_MORIEP_PATCH_FILE" ]] \
-   && [[ "${DOCKER_IMAGE_NAME:-}" == *"v0.5.12.post1"* ]] \
-   && [[ "${EXTRA_DOCKER_MOUNTS:-}" != *"$_MORIEP_PATCH_TARGET"* ]]; then
-    EXTRA_DOCKER_MOUNTS="${EXTRA_DOCKER_MOUNTS:-} -v ${_MORIEP_PATCH_FILE}:${_MORIEP_PATCH_TARGET}:ro"
-    export EXTRA_DOCKER_MOUNTS
-    echo "[job.slurm] auto-applied MoRI moriep.py dispatch-floor overlay: ${_MORIEP_PATCH_FILE}"
-fi
-
 xP="${xP:-1}"
 yD="${yD:-1}"
 
 
@@ -60,16 +60,26 @@ This is a stop-gap. The proper upstream fix is to migrate MoRI to the
 plural `state_types: List[StateType]` API (full design + diff in
 `scripts/sglang_disagg/docs/03-upstream-pr-proposal.md`).
 
-## `moriep.py`
-
-Overlays
-`/sgl-workspace/sglang/python/sglang/srt/layers/moe/token_dispatcher/moriep.py`.
-
-Source: forked from `lmsysorg/sglang-rocm:v0.5.12.post1-*` (sglang
-[v0.5.12.post1](https://github.com/sgl-project/sglang/tree/v0.5.12.post1)).
-The base file is **byte-identical to the upstream tag**
-(`md5 ac626f5459a699f9ac953d9d8e71d861`); the overlay is a single
-+22-line insertion in `MoriTokenDispatcher.__init__`.
+## `apply_moriep_dispatch_floor.py` (in-place patch, NOT a bind-mount overlay)
+
+This one is different from `mori_conn.py`: it is a **surgical in-place
+patch script**, not a full-file bind-mount overlay. It is run inside the
+container by `server_sglang.sh` (right after `env.sh`) and edits the
+installed
+`/sgl-workspace/sglang/.../token_dispatcher/moriep.py`
+in place, injecting a single floor after the dispatch-token env read.
+
+**Why not a bind-mount overlay (learned the hard way):** the
+`lmsysorg/sglang-rocm:v0.5.12.post1-*` image ships a **downstream-patched
+`moriep.py`** (class `MoriEPDispatcher`, with attrs such as
+`expert_mask_gpu`) that diverges from the upstream
+[v0.5.12.post1](https://github.com/sgl-project/sglang/tree/v0.5.12.post1)
+tag. A full-file overlay of the upstream file (even one byte-identical to
+the tag, `md5 ac626f5459...`) reverts the AMD additions and crashes the
+scheduler at init: `AttributeError: 'MoriEPDispatcher' object has no
+attribute 'expert_mask_gpu'`. The in-place patch touches only the
+dispatch-token read and preserves all downstream code, so it is robust to
+the vendor fork.
 
 **Bug it fixes:** at low concurrency the MoRI EP dispatch path silently
 corrupts output (decodes fine, acceptance length stays high, but gsm8k
@@ -85,19 +95,23 @@ guard is `assert(destTokId < MaxNumTokensToRecv())`, compiled out under
 `-DNDEBUG`, so the result is silent out-of-bounds writes
 (`internode_v1.cpp` `DispatchIntraNodeBlock`).
 
-The overlay floors `num_max_dispatch_tokens_per_rank` to **256** right at
+The patch floors `num_max_dispatch_tokens_per_rank` to **256** right at
 its env read — the single source of truth that feeds both
 `get_ep_dispatch_configs()` (kernel selection) and the buffer-sizing
-arg. Empirically validated on MI355X (conc-64 DEP8+MTP3):
-dispatch `32 → gsm8k 0.00`, `64 → 0.00` (one wavefront is not enough),
-`256 → 0.94`.
+arg. It is idempotent and fail-loud-but-non-fatal (a structure miss prints
+an error plus `[diag]` dumps of the relevant lines; the server proceeds).
+Empirically validated on MI355X (conc-64 DEP8+MTP3): dispatch `32 →
+gsm8k 0.00`, `64 → 0.00` (one wavefront is not enough), `256 → 0.94`.
 
 This is a stop-gap. The proper upstream fix is in MoRI: size the receive
 buffer from the routing fan-in and turn the compiled-out `assert` into a
 real bounds guard (see [ROCm/mori#356](https://github.com/ROCm/mori/issues/356)).
 The integration-level guard belongs in sglang's `moriep.py`
 ([sgl-project/sglang#27194](https://github.com/sgl-project/sglang/issues/27194)) —
-this overlay is exactly that guard, pending upstream merge.
+this patch is exactly that guard, pending upstream merge. No
+`EXTRA_DOCKER_MOUNTS` wiring is needed: `server_sglang.sh` applies the
+patch unconditionally, and the injected `max()` is a no-op at runtime
+when the value is already ≥256 (e.g. prefill, which uses 8192).
 
 ## How to enable
 
 
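The sizing arithmetic described in the README hunk above can be sketched in a few lines of Python. This is an illustration, not code from the PR: the function names are hypothetical, and reading the `4` in `64/8*4 = 32` as "MTP draft tokens + 1" is an assumption drawn from the `conc-64 DEP8+MTP3` example. Only the `worldSize * maxNumInpTokenPerRank` capacity formula and the floor of 256 come from the text.

```python
FLOOR = 256  # floor value stated in the README and the patch script


def dispatch_tokens_per_rank(concurrency: int, ep_size: int, mtp_draft_tokens: int) -> int:
    """Per-rank dispatch budget, mirroring the 64/8*4 = 32 example.

    Assumption: the factor 4 is (MTP draft tokens + 1) for MTP3.
    """
    return concurrency // ep_size * (mtp_draft_tokens + 1)


def recv_capacity(world_size: int, tokens_per_rank: int) -> int:
    """Receive-buffer size as described: worldSize * maxNumInpTokenPerRank."""
    return world_size * tokens_per_rank


def floored(tokens_per_rank: int, floor: int = FLOOR) -> int:
    """The guard the patch injects: never go below the floor."""
    return max(tokens_per_rank, floor)


raw = dispatch_tokens_per_rank(concurrency=64, ep_size=8, mtp_draft_tokens=3)
print(raw, recv_capacity(8, raw))                    # 32 256
print(floored(raw), recv_capacity(8, floored(raw)))  # 256 2048
print(floored(8192))                                 # 8192 (prefill value is unchanged)
```

With the floor in place, the decode-side receive buffer in this example grows from 256 to 2048 token slots, while the prefill value of 8192 passes through untouched.
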
@@ -0,0 +1,127 @@
+#!/usr/bin/env python3
+"""Surgically floor the MoRI per-rank dispatch buffer to >=256 in the installed
+sglang `moriep.py`, in place, inside the container.
+
+Why in-place (not a bind-mount overlay): the lmsysorg/sglang-rocm image ships a
+*downstream-patched* moriep.py (class `MoriEPDispatcher`, extra attrs such as
+`expert_mask_gpu`) that diverges from the upstream v0.5.12.post1 tag. A full-file
+overlay of the upstream file reverts those AMD additions and crashes the
+scheduler at init (`AttributeError: ... 'expert_mask_gpu'`). So we patch the
+image's own file and touch only the dispatch-token read.
+
+The bug being fixed: at low concurrency the per-rank dispatch buffer
+(num_max_dispatch_tokens_per_rank -> mori max_num_inp_token_per_rank) collapses
+(conc-64/TP8/MTP3 -> 64/8*4 = 32). MoRI sizes its receive buffer
+MaxNumTokensToRecv() = worldSize * maxNumInpTokenPerRank (dispatch_combine.hpp;
+max_total_recv_tokens defaults to 0 -> that fallback, and it is a cap not a
+floor). The intra-node dispatch kernel's per-dest atomic counter then overruns
+the buffer; the only guard is assert(destTokId < MaxNumTokensToRecv()) which is
+compiled out under -DNDEBUG -> silent out-of-bounds writes -> output that decodes
+fine (high acceptance length) but is semantically garbage (gsm8k=0).
+
+Empirically on MI355X (conc-64 DEP8+MTP3): dispatch 32 -> gsm8k 0.00,
+64 -> 0.00 (one wavefront insufficient), 256 -> 0.94. We floor to 256.
+
+Idempotent and fail-loud-but-non-fatal: a regex/structure miss prints a clear
+marker and the surrounding source (for diagnosis) but does not abort the server.
+
+Upstream: sgl-project/sglang#27194, ROCm/mori#356.
+"""
+import os
+import re
+import sys
+
+FLOOR = 256
+MARKER = "[InferenceX moriep dispatch floor]"
+TAG = "[moriep-floor]"
+
+
+def find_target():
+    try:
+        import sglang
+    except Exception as e:  # pragma: no cover
+        print(f"{TAG} ERROR: could not import sglang ({e}); NOT patched")
+        return None
+    path = os.path.join(
+        os.path.dirname(sglang.__file__),
+        "srt", "layers", "moe", "token_dispatcher", "moriep.py",
+    )
+    if not os.path.isfile(path):
+        print(f"{TAG} ERROR: moriep.py not found at {path}; NOT patched")
+        return None
+    return path
+
+
+def main():
+    path = find_target()
+    if path is None:
+        return 0  # non-fatal
+
+    with open(path) as f:
+        src = f.read()
+    lines = src.splitlines(keepends=True)
+
+    # Diagnostic: always show where the dispatch-token count is read/used so the
+    # CI log reveals the image's actual file shape even on a clean apply.
+    for i, l in enumerate(lines):
+        if "num_max_dispatch_tokens_per_rank" in l:
+            print(f"{TAG}[diag] {path}:{i + 1}: {l.rstrip()}")
+
+    if MARKER in src:
+        print(f"{TAG} already applied; skipping")
+        return 0
+
+    # Find the assignment that reads the env var, regardless of class name or
+    # formatting: `self.num_max_dispatch_tokens_per_rank = get_int_env_var(`.
+    start = None
+    for i, l in enumerate(lines):
+        if re.search(
+            r"self\.num_max_dispatch_tokens_per_rank\s*=\s*get_int_env_var\s*\(",
+            l,
+        ):
+            start = i
+            break
+    if start is None:
+        print(
+            f"{TAG} ERROR: dispatch-token env read not found in {path}; "
+            f"NOT patched (server will run UNPATCHED -> expect corruption at "
+            f"low conc). See [diag] lines above for the actual source shape."
+        )
+        return 0  # non-fatal: surface loudly but let the run proceed
+
+    # Walk forward to the end of the (possibly multi-line) call by balancing parens.
+    depth = 0
+    end = start
+    for j in range(start, len(lines)):
+        depth += lines[j].count("(") - lines[j].count(")")
+        if depth <= 0:
+            end = j
+            break
+
+    indent = re.match(r"\s*", lines[start]).group(0)
+    floor_block = (
+        f"{indent}# {MARKER} floor to {FLOOR} (warpSize/fan-in safe). MoRI recv buffer\n"
+        f"{indent}# is worldSize*maxNumInpTokenPerRank; values below {FLOOR} silently\n"
+        f"{indent}# corrupt the dispatch path (gsm8k=0). sgl#27194 / mori#356.\n"
+        f"{indent}self.num_max_dispatch_tokens_per_rank = max(\n"
+        f"{indent}    self.num_max_dispatch_tokens_per_rank, {FLOOR}\n"
+        f"{indent})\n"
+    )
+    lines.insert(end + 1, floor_block)
+
+    try:
+        with open(path, "w") as f:
+            f.write("".join(lines))
+    except OSError as e:
+        print(f"{TAG} ERROR: could not write {path} ({e}); NOT patched")
+        return 0
+
+    print(
+        f"{TAG} applied: floored num_max_dispatch_tokens_per_rank to >= {FLOOR} "
+        f"in {path} (after line {end + 1})"
+    )
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
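
The injection technique used by the script (regex match on the assignment, paren balancing to find the end of the call, then inserting a `max()` floor) can be exercised without a container by running it on an in-memory string. The sketch below is a simplified restatement, not the PR's code: `inject_floor`, the `Dispatcher` class, the stub `get_int_env_var`, and the env-var name `HYPOTHETICAL_ENV_NAME` are all illustrative assumptions.

```python
import re

FLOOR = 256
MARKER = "[InferenceX moriep dispatch floor]"
ASSIGN_RE = re.compile(
    r"self\.num_max_dispatch_tokens_per_rank\s*=\s*get_int_env_var\s*\("
)


def inject_floor(src: str) -> str:
    """Return src with a max() floor inserted after the env read, or src unchanged."""
    if MARKER in src:
        return src  # idempotent: already patched
    lines = src.splitlines(keepends=True)
    start = next((i for i, line in enumerate(lines) if ASSIGN_RE.search(line)), None)
    if start is None:
        return src  # structure miss: leave the source untouched
    # Balance parens to find the last line of the (possibly multi-line) call.
    depth, end = 0, None
    for j in range(start, len(lines)):
        depth += lines[j].count("(") - lines[j].count(")")
        if depth <= 0:
            end = j
            break
    if end is None:
        return src  # never balanced: do not insert mid-call
    if not lines[end].endswith("\n"):
        lines[end] += "\n"
    indent = re.match(r"\s*", lines[start]).group(0)
    block = (
        f"{indent}# {MARKER}\n"
        f"{indent}self.num_max_dispatch_tokens_per_rank = max(\n"
        f"{indent}    self.num_max_dispatch_tokens_per_rank, {FLOOR}\n"
        f"{indent})\n"
    )
    lines.insert(end + 1, block)
    return "".join(lines)


# Hypothetical stand-in for the installed file; the env-var name is made up.
SAMPLE = (
    "class Dispatcher:\n"
    "    def __init__(self):\n"
    "        self.num_max_dispatch_tokens_per_rank = get_int_env_var(\n"
    '            "HYPOTHETICAL_ENV_NAME", 128\n'
    "        )\n"
    "        self.other = 1\n"
)

patched = inject_floor(SAMPLE)
print(patched)
```

One difference from the script is deliberate: when the parentheses never balance, this sketch returns the source unchanged, whereas the script's `end = start` default would place the floor block directly after the opening line of the call.
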