58 changes: 58 additions & 0 deletions recipes/DeepSeek-R1-MXFP4-MI355X-Jons/PERFORMANCE_METRICS.md
# Performance Metrics — Submitted Results (Team Jons)

All numbers are from AMD's `dsr1_benchmark` harness. All submitted runs passed GSM8K validation (`gsm8k_metric ≥ 0.93`).

## Final leaderboard submissions

### conc=4 — 757.12 tok/s/GPU (Event `d2eb2378c2d540248005d9e1882a11b1`)
| Metric | Value |
|--------|------:|
| **Throughput per GPU** | **757.12 tok/s** |
| Total Token throughput | 6056.94 tok/s |
| Mean TPOT (ms) | 5.64 |
| Median TPOT (ms) | 6.07 |
| P99 TPOT (ms) | 7.22 |
| Mean TTFT (ms) | 267.59 |
| Median E2E (ms) | 6477.40 |
| Interactivity (tok/s/user) | 162.8 |
| GSM8K | 0.9356 ✓ |
| **Config** | TP=8 fp8 spec=3 level=3 cudagraph=[1,2,4,8] **DSR1-MXFP4-MTP-MoEFP4 model** |
| Baseline target | 1500 tok/s |

### conc=32 — 2351.06 tok/s/GPU (Event `474be027ba7c4ec992371ff5f50508f2`)
| Metric | Value |
|--------|------:|
| **Throughput per GPU** | **2351.06 tok/s** |
| Total Token throughput | 18808.52 tok/s |
| Mean TPOT (ms) | ~14.7 |
| Interactivity | 65.5 tok/s/user |
| GSM8K | 0.9393 ✓ |
| **Config** | TP=8 fp8 spec=3 + bigbatch + level=3 + wide cudagraph **(DSR1-MXFP4)** |
| Baseline target | 3900 tok/s |

### conc=128 — 3537.19 tok/s/GPU (May 8 submission)
| Metric | Value |
|--------|------:|
| **Throughput per GPU** | **3537.19 tok/s** |
| Total Token throughput | 28297.49 tok/s |
| Mean TPOT (ms) | 38.32 |
| Interactivity | 24.07 tok/s/user |
| GSM8K | 0.9348 ✓ |
| **Config** | TP=8 fp8 spec=3 + `max-num-batched-tokens=131072` + `max-num-seqs=256` **(DSR1-MXFP4)** |
| Baseline target | 6000 tok/s |
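
As a sanity check across these tables, interactivity is roughly the inverse of per-token latency: tok/s/user ≈ 1000 / TPOT(ms). For the conc=4 submission:

```bash
python3 -c 'print(f"{1000/6.07:.1f} tok/s/user")'   # ≈ 164.7 from median TPOT, vs the reported 162.8
```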

## Key finding: model matters at conc=4

We tested both available model variants on identical config:

| Model | Conc=4 peak | Median TPOT |
|-------|------------:|------------:|
| `DeepSeek-R1-0528-MXFP4` (376GB, 82 shards) | 736.80 | 6.40 ms |
| `DeepSeek-R1-0528-MXFP4-MTP-MoEFP4` (350GB, 76 shards) | **757.12** | **6.07 ms** |

At conc=32 and conc=128, the standard model was faster — the MoEFP4 variant only helps at conc=4 (small batches benefit from the FP4 MoE quant).

## Hardware
- 8× AMD MI355X (gfx950) per node
- ROCm 7 (in `rocm/atom` image)
- Inference engine: ATOM v0.1.2 (`rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2`)
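
To confirm the GPU target and device count inside the container, a minimal sketch (both tools ship with ROCm):

```bash
# Verify the gfx950 target and that all 8 MI355X devices are visible.
rocminfo | grep -m1 gfx950
rocm-smi --showproductname
```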
38 changes: 38 additions & 0 deletions recipes/DeepSeek-R1-MXFP4-MI355X-Jons/README.md
# AMD DSR1-MXFP4 Inference Optimization — Submission (Team Jons)

## Overview
Optimization of `DeepSeek-R1-0528-MXFP4` inference on 8× MI355X (gfx950) for AMD's competition benchmark at ISL=8192 / OSL=1024 across concurrency levels 4, 32, and 128.

## Final leaderboard standings (Team Jons)

| Conc | Throughput per GPU (tok/s) | Rank / score (out of 1000) | Target | % of target | Event ID |
|-----:|-------------------:|-------------------:|-------:|------------:|----------|
| 4 | **757.12** | T#3 / I#2 — 840 | 1500 | 50.5% | `d2eb2378c2d540248005d9e1882a11b1` |
| 32 | **2351.06** | T#1 / I#1 — 1000 | 3900 | 60.3% | `474be027ba7c4ec992371ff5f50508f2` |
| 128 | **3537.19** | T#1 / I#1 — 1000 | 6000 | 58.9% | (May 8 submission) |
| **Total** | — | **2840 / 3000** | — | — | — |

## Key technical contribution
**Discovered that the `DeepSeek-R1-0528-MXFP4-MTP-MoEFP4` model gives faster inference at conc=4 than the standard `DeepSeek-R1-0528-MXFP4` model.** Same architecture, but with the MoE weights separately FP4-quantized. Lower mean TPOT (5.64–5.82 ms vs 5.95–6.40 ms) translates into a higher per-GPU throughput peak: 757.12 tok/s vs 742 with the standard model.

This pushed our conc=4 leaderboard entry from 742 to 757 tok/s/GPU (about +2%, but enough to climb in throughput rank).

## Files

- `TECHNICAL_APPROACH.md` — what we changed and why
- `PERFORMANCE_METRICS.md` — throughput numbers + raw JSON
- `launchers/`
- `launch_atom_c4_level3.sh` — conc=4 with standard model
- `launch_atom_c4_level3_mtp_moefp4.sh` — **conc=4 with MoEFP4 model (BEST)**
- `launch_atom_tp8_spec3_bigbatch.sh` — conc=128 (TOP-1)
- `submit_c4_moefp4.sh` — submission script for c4 MoEFP4
- `run_dsr1_c4only_moefp4.sh` — c4 perf-test driver for MoEFP4
- `results/`
- `peak_c4_757_moefp4.json` — the 757.12 submission JSON
- `submit_c32_bb_level3_*.json.json` — 2351 c32 submission JSON
- `submit_bigbatch_c128_*.json.json` — 3537 c128 submission JSON
- `submit_tp8_fp8_level3_c4_*.json.json` — prior 711 baseline (superseded by 757)
- `prototypes/`
- `triton_mla_fp8_multi.py` — bonus: custom Triton fp8 MLA kernel (functionally correct, perf needs work)
- `TRITON_FP8_MLA_HANDOFF.md` — handoff doc
- `sglang_patches/deepseek_weight_loader.py` — partial SGLang MTP loader fixes
82 changes: 82 additions & 0 deletions recipes/DeepSeek-R1-MXFP4-MI355X-Jons/TECHNICAL_APPROACH.md
# Technical Approach

## Goal
Maximize `tput_per_gpu = total_token_throughput / 8` for `DeepSeek-R1-0528-MXFP4` on 8× MI355X (gfx950) against AMD's benchmark harness (`dsr1_benchmark perf`) at ISL=8192 / OSL=1024 for concurrency levels 4, 32, and 128. Must pass GSM8K accuracy ≥ 0.93.

## Final Stack (Submitted)

### conc=128: `tp8_spec3_bigbatch` — 3537.19 tok/s/GPU (TOP-1)
```
python3 -m atom.entrypoints.openai_server \
--model /share4/teamK/DeepSeek-R1-0528-MXFP4 \
--server-port 8888 -tp 8 \
--kv_cache_dtype fp8 \
--max-model-len 10240 \
--method mtp --num-speculative-tokens 3 \
--max-num-batched-tokens 131072 \
--max-num-seqs 256
```

### conc=4: `c4_level3` — 711.28 tok/s/GPU
```
python3 -m atom.entrypoints.openai_server \
--model /share4/teamK/DeepSeek-R1-0528-MXFP4 \
--server-port 8888 -tp 8 \
--kv_cache_dtype fp8 \
--max-model-len 10240 \
--method mtp --num-speculative-tokens 3 \
--level 3 \
--cudagraph-capture-sizes "[1,2,4,8]"
```
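
Before a full bench run, either server can be smoke-tested once it reports healthy. A minimal sketch (the `/health` endpoint is what our driver script polls; the completion payload is an assumption based on the OpenAI-compatible route):

```bash
# Readiness probe, then a tiny completion through the OpenAI-compatible API.
curl -fsS http://0.0.0.0:8888/health
curl -s http://0.0.0.0:8888/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "/share4/teamK/DeepSeek-R1-0528-MXFP4", "prompt": "2+2=", "max_tokens": 8}'
```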

## Knob Catalogue — What Helped vs What Hurt

### Helps at conc=128
- `--max-num-batched-tokens 131072` + `--max-num-seqs 256` — enables aggressive prefill batching for 128 concurrent users. **+~10% over vanilla.**
- MTP speculative decoding with `--num-speculative-tokens 3` — gives ~2.3 tokens/forward via MTP acceptance.
- `--method mtp` is the only speculative method supported in ATOM v0.1.2.

### Helps at conc=4
- `--level 3` + `--cudagraph-capture-sizes "[1,2,4,8]"` — tight cudagraph capture covering the small batch sizes seen at conc=4 with spec=3.
- `--num-speculative-tokens 3` (max for fp8 MLA path on this build — see Constraints).

### Confirmed Dead Knobs (regressions or crashes on this build)
| Knob | Effect | Reason |
|------|--------|--------|
| `--enable-dp-attention` | Tensor shape mismatch (20480 vs 16384) | v0.1.2 DP-attn bug |
| `--enable-expert-parallel` (without MoRI tuning) | MoRI symmetric heap OOM | Default heap = 2 GB |
| `--data-parallel-size > 1` | `recvBytes` / process group init failure | RCCL/MoRI conflict |
| `--enable_prefix_caching` | `NoneType.shape` per request | v0.1.2 bug |
| `--num-speculative-tokens ≥ 4` (fp8 KV) | C++ assert: `qo_len <= 4` | Hard cap in `asm_mla.cu:281` |
| `--kv_cache_dtype bf16` + `--num-speculative-tokens ≥ 4` | GSM8K = 0.05 (output broken) | DSR1 MTP head not trained for spec > 3 |
| `--kv_cache_dtype bf16` at TP=8 | ~25% throughput regression vs fp8 | 2× KV bandwidth |
| TP=4 with default cudagraph capture | GPU memory access fault (MoE) | TP=4 + batch=65 (GSM8K eval concurrency) outside captured graphs |
| AMD env stack alone (`HIP_FORCE_DEV_KERNARG=1`, `AITER_ENABLE_VSKIP=1`, `AMD_DIRECT_DISPATCH=1`, `GPU_MAX_HW_QUEUES=8`) | -40% at conc=128 | Co-tuned vars; need pairing with level=3/cudagraph |
| `--max-num-seqs > 256` | Crash | Session 012 confirmed |
| `--enforce-eager` | -77% | Cudagraph load-bearing |
| `--block-size 128` | -2% | Slight regression |

## Structural Ceiling — What We Could Not Solve

To exceed our submitted numbers we would need **higher speculative acceptance per forward step**. Two empirically-proven blockers prevent this on `rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2`:

1. **AITER fp8 MLA decode kernel hard-caps `qo_len ≤ 4`** (`/app/aiter-test/csrc/py_itfs_cu/asm_mla.cu:281`). The precompiled `.co` binaries in `/app/aiter-test/hsa/gfx950/mla/` only ship `qSeqLen ∈ {1, 2, 4}` for fp8, and no source `.s` files are present to rebuild for qSeqLen > 4. This caps MTP at spec=3 in fp8; a quick probe of the shipped binaries is sketched after this list.

2. **DSR1's single MTP layer (`num_nextn_predict_layers = 1`) was trained for spec=3**. Empirically tested: at TP=4 + bf16 + spec=4 we get GSM8K=0.0561 (broken). At spec=5: GSM8K=0.0508. Output collapses to random tokens above spec=3.
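
The kernel-cap claim in point 1 is easy to re-verify inside the image (a sketch using the paths cited above):

```bash
# Enumerate the precompiled fp8 MLA kernels: only qSeqLen 1/2/4 variants ship,
# and there are no .s sources to rebuild for larger qo_len.
ls /app/aiter-test/hsa/gfx950/mla/
grep -n "qo_len" /app/aiter-test/csrc/py_itfs_cu/asm_mla.cu | head
```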

We additionally investigated unlocking EAGLE/NEXTN via SGLang v0.5.9-rocm700-mi35x, which would allow tree speculation with higher acceptance. We found **at least 3 cascading bugs** in SGLang's MTP+TP=8+MXFP4 load path:
- `channel_quant_to_tensor_quant` shape mismatch (`fp8_utils.py:1035`)
- `quark_post_load_weights` UnboundLocalError on fp8 input (`quark/utils.py:214`)
- `apply_fp8_linear` receives tuple instead of tensor (`fp8_utils.py:1105`)

Partial patches are in `sglang_patches/deepseek_weight_loader.py`. A full fix is multi-day work, beyond the competition window.

## Prototype Work Product Beyond the Submission

We additionally built a **Triton fp8 MLA decode kernel** (`prototypes/triton_mla_fp8_multi.py`) that supports arbitrary `qo_len` up to 8, intended to bypass the AITER kernel cap. It passes GSM8K (0.9447 at qo_len=4, matching the ASM-kernel baseline) but is **~8× slower than the ASM kernel** because Triton lacks native fp8 dot-product support on AMD. With more time it could be optimized to be competitive; we include it for completeness.

## Methodology Notes
- All numbers are from `dsr1_benchmark perf` (the AMD-provided harness). Submissions used `dsr1_benchmark submit Jons`.
- Each run loads the model fresh, runs GSM8K validation, then runs the perf bench. Total ~12-15 minutes per run.
- Variance on c4_level3 has σ ≈ 14 tok/s/GPU around mean ~715. The 711 submission landed at the low end of variance; historical peak from this same config is 736 (May 11).
- Benchmark harness binary computes `tput_per_gpu = total_token_throughput / 8.0` (hardcoded — verified by `strings`).
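
The same division can be recomputed from any bench log to double-check a submission (a sketch assuming the harness's `Total Token throughput: <value>` summary line):

```bash
# Recompute tput_per_gpu from the harness summary line (line format assumed).
awk -F': *' '/Total Token throughput/ {printf "%.2f tok/s/GPU\n", $2 / 8.0}' bench.log
```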
24 changes: 24 additions & 0 deletions recipes/DeepSeek-R1-MXFP4-MI355X-Jons/launchers/launch_atom_c4_level3.sh
#!/bin/bash
# c4 + --level 3 (max compilation level) + tight cudagraph + small max-num-seqs.
# Hypothesis: torch.compile level 3 may help small-batch decode kernels.

export AITER_ROOT_DIR=/projects/teamK/aiter_cache
export HF_HOME=/projects/teamK/hf_home
export HF_MODULES_CACHE=/projects/teamK/hf_home/modules
export TRITON_CACHE_DIR=/projects/teamK/triton_cache
export TVM_FFI_CACHE_DIR=/projects/teamK/tvm_cache
export TMPDIR=/projects/teamK/tmp
export OMP_NUM_THREADS=1
export AMDGCN_USE_BUFFER_OPS=1
export VLLM_CACHE_ROOT=/projects/teamK/atom_cache
export HOME=/projects/teamK/home_atom

python3 -m atom.entrypoints.openai_server \
--model /share4/teamK/DeepSeek-R1-0528-MXFP4 \
--server-port 8888 -tp 8 \
--kv_cache_dtype fp8 \
--max-model-len 10240 \
--method mtp --num-speculative-tokens 3 \
--level 3 \
--cudagraph-capture-sizes "[1,2,4,8]" \
2>&1 | tee /projects/teamK/server_c4_level3.log
23 changes: 23 additions & 0 deletions recipes/DeepSeek-R1-MXFP4-MI355X-Jons/launchers/launch_atom_c4_level3_mtp_moefp4.sh
#!/bin/bash
# c4_level3 with DSR1-MXFP4-MTP-MoEFP4 model (untested with ATOM).
# Different weights from base DSR1-MXFP4. Smaller (350GB vs 376GB).
export AITER_ROOT_DIR=/projects/teamK/aiter_cache
export HF_HOME=/projects/teamK/hf_home
export HF_MODULES_CACHE=/projects/teamK/hf_home/modules
export TRITON_CACHE_DIR=/projects/teamK/triton_cache
export TVM_FFI_CACHE_DIR=/projects/teamK/tvm_cache
export TMPDIR=/projects/teamK/tmp
export OMP_NUM_THREADS=1
export AMDGCN_USE_BUFFER_OPS=1
export VLLM_CACHE_ROOT=/projects/teamK/atom_cache
export HOME=/projects/teamK/home_atom

python3 -m atom.entrypoints.openai_server \
--model /share4/teamK/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 \
--server-port 8888 -tp 8 \
--kv_cache_dtype fp8 \
--max-model-len 10240 \
--method mtp --num-speculative-tokens 3 \
--level 3 \
--cudagraph-capture-sizes "[1,2,4,8]" \
2>&1 | tee /projects/teamK/server_c4_mtp_moefp4.log
24 changes: 24 additions & 0 deletions recipes/DeepSeek-R1-MXFP4-MI355X-Jons/launchers/launch_atom_tp8_spec3_bigbatch.sh
#!/bin/bash
# tp8_spec3_vanilla + AMD env stack + larger max-num-batched-tokens.
# Hypothesis: bigger prefill batches = better TTFT amortization at conc=128.

export AITER_ROOT_DIR=/projects/teamK/aiter_cache
export HF_HOME=/projects/teamK/hf_home
export HF_MODULES_CACHE=/projects/teamK/hf_home/modules
export TRITON_CACHE_DIR=/projects/teamK/triton_cache
export TVM_FFI_CACHE_DIR=/projects/teamK/tvm_cache
export TMPDIR=/projects/teamK/tmp
export AMDGCN_USE_BUFFER_OPS=1
export VLLM_CACHE_ROOT=/projects/teamK/atom_cache
export HOME=/projects/teamK/home_atom
export OMP_NUM_THREADS=1

python3 -m atom.entrypoints.openai_server \
--model /share4/teamK/DeepSeek-R1-0528-MXFP4 \
--server-port 8888 -tp 8 \
--kv_cache_dtype fp8 \
--max-model-len 10240 \
--method mtp --num-speculative-tokens 3 \
--max-num-batched-tokens 131072 \
--max-num-seqs 256 \
2>&1 | tee /projects/teamK/server_tp8_spec3_bigbatch.log
83 changes: 83 additions & 0 deletions recipes/DeepSeek-R1-MXFP4-MI355X-Jons/run_dsr1_c4only_moefp4.sh
#!/bin/bash
# c4-only driver for MTP-MoEFP4 model — uses the correct model path in bench.
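# Usage example (expands to launch_atom_c4_level3_mtp_moefp4.sh under /projects/teamK/supreme-leader):
#   bash run_dsr1_c4only_moefp4.sh c4_level3_mtp_moefp4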
set -u
LAUNCHER="${1:?Usage: $0 <launcher_basename>}"
TS="$(date -u +%Y%m%dT%H%M%SZ)"
RUN_DIR="/projects/teamK/supreme-leader/runs/${TS}_c4_${LAUNCHER}"
LOG_DIR="${RUN_DIR}/logs"
mkdir -p "${LOG_DIR}"

LAUNCHER_FILE="/projects/teamK/supreme-leader/launch_atom_${LAUNCHER}.sh"
[ -f "${LAUNCHER_FILE}" ] || { echo "FATAL: launcher missing: ${LAUNCHER_FILE}"; exit 2; }

CONTAINER="atom-dsr1-dev"
IMAGE="rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2"
PORT=8888
DOCKER="/usr/local/bin/docker-teamK-unrestricted"

if ! "${DOCKER}" ps -a --format '{{.Names}}' | grep -q "^${CONTAINER}$"; then
DRI_FLAGS=()
for d in /dev/dri/*; do [ -c "$d" ] && DRI_FLAGS+=("--device=$d"); done
mkdir -p /projects/teamK/supreme-leader/dsr1_aiter_cache
chmod 777 /projects/teamK/supreme-leader/dsr1_aiter_cache
"${DOCKER}" run -d --name "${CONTAINER}" --ipc=host --shm-size=16g \
--device=/dev/kfd "${DRI_FLAGS[@]}" \
-v /share4:/share4 -v /projects/teamK:/projects/teamK -v /projects/teamK:/workspace \
-v /projects/teamK/supreme-leader/dsr1_aiter_cache:/root/.aiter \
-p ${PORT}:${PORT} \
"${IMAGE}" /bin/bash -c "sleep infinity"
fi

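# Expose our bench checkout at the fixed path the bench steps below cd into,
# and mark it safe for git inside the container.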
"${DOCKER}" exec "${CONTAINER}" bash -c '
mkdir -p /workspace/amdgpu_bounty_optimization
[ -e /workspace/amdgpu_bounty_optimization/dsr1-fp4-atom-mtp-mi355x ] || \
ln -sfn /workspace/supreme-leader/bench_atom /workspace/amdgpu_bounty_optimization/dsr1-fp4-atom-mtp-mi355x
git config --global --add safe.directory /workspace/supreme-leader/bench_atom 2>/dev/null || true
'

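# Stage a copy of the launcher in this run's directory and repoint its tee target
# at this run's server.log, so every run keeps its own server log.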
LN="$(basename "${LAUNCHER_FILE}")"
cp "${LAUNCHER_FILE}" "${RUN_DIR}/${LN}"
sed -i "s|tee /projects/teamK/server_.*\.log|tee /workspace/supreme-leader/runs/${TS}_c4_${LAUNCHER}/server.log|g" "${RUN_DIR}/${LN}"

echo "=== launching server"
"${DOCKER}" exec -d "${CONTAINER}" bash -c "
cd /workspace/amdgpu_bounty_optimization/dsr1-fp4-atom-mtp-mi355x
bash /workspace/supreme-leader/runs/${TS}_c4_${LAUNCHER}/${LN}
"

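# Poll /health for up to 20 minutes; if the server log shows a known-fatal
# signature and the atom process is already gone, tear down and exit early.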
SECONDS=0
HEALTHY=0
while [ $SECONDS -lt 1200 ]; do
curl -fsS "http://0.0.0.0:${PORT}/health" >/dev/null 2>&1 && { HEALTHY=1; break; }
if grep -qE "Out of symmetric heap|RuntimeError|proc died|All EngineCores shut down|memory access fault" "${RUN_DIR}/server.log" 2>/dev/null; then
if ! "${DOCKER}" exec "${CONTAINER}" pgrep -f atom.entrypoints >/dev/null 2>&1; then
echo "FATAL early. Last 20:"; tail -20 "${RUN_DIR}/server.log"
"${DOCKER}" rm -f "${CONTAINER}" >/dev/null 2>&1 || true
exit 6
fi
fi
sleep 15
[ $((SECONDS % 60)) -lt 15 ] && echo " [${SECONDS}s] waiting"
done
[ "${HEALTHY}" = "1" ] || { echo "FATAL: not healthy in 20m"; tail -30 "${RUN_DIR}/server.log"; exit 4; }
echo "=== healthy after ${SECONDS}s"

BENCH_LOG="${LOG_DIR}/bench_${LAUNCHER}_c4.log"
"${DOCKER}" exec "${CONTAINER}" bash -c "
cd /workspace/amdgpu_bounty_optimization/dsr1-fp4-atom-mtp-mi355x
export MODEL=/share4/teamK/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4
export PORT=${PORT}; export TP=8; export ISL=8192; export OSL=1024; export CONC=4
export RANDOM_RANGE_RATIO=1.0
export NUM_PROMPTS=40
export RESULT_FILENAME=c4_${LAUNCHER}_${TS}.json
export EP_SIZE=1; export DP_ATTENTION=0
export HF_HOME=/projects/teamK/hf_home
./dsr1_benchmark perf
" 2>&1 | tee "${BENCH_LOG}"

"${DOCKER}" exec "${CONTAINER}" bash -c "pkill -f 'atom.entrypoints.openai_server' || true"
sleep 5
"${DOCKER}" rm -f "${CONTAINER}" >/dev/null 2>&1 || true

echo "=== DONE ${LAUNCHER} c4"
grep -E "Total Token throughput|gsm8k_metric|Mean TPOT" "${BENCH_LOG}" 2>&1 | head -10