58 changes: 58 additions & 0 deletions recipes/DeepSeek-R1-MXFP4-MI355X-Jons/PERFORMANCE_METRICS.md
# Performance Metrics — Submitted Results (Team Jons)

All numbers are from AMD's `dsr1_benchmark` harness. All submitted runs passed GSM8K validation (`gsm8k_metric ≥ 0.93`).

## Final leaderboard submissions

### conc=4 — 757.12 tok/s/GPU (Event `d2eb2378c2d540248005d9e1882a11b1`)
| Metric | Value |
|--------|------:|
| **Throughput per GPU** | **757.12 tok/s** |
| Total Token throughput | 6056.94 tok/s |
| Mean TPOT (ms) | 5.64 |
| Median TPOT (ms) | 6.07 |
| P99 TPOT (ms) | 7.22 |
| Mean TTFT (ms) | 267.59 |
| Median E2E (ms) | 6477.40 |
| Interactivity (tok/s/user) | 162.8 |
| GSM8K | 0.9356 ✓ |
| **Config** | TP=8 fp8 spec=3 level=3 cudagraph=[1,2,4,8] **DSR1-MXFP4-MTP-MoEFP4 model** |
| Baseline target | 1500 tok/s |

### conc=32 — 2351.06 tok/s/GPU (Event `474be027ba7c4ec992371ff5f50508f2`)
| Metric | Value |
|--------|------:|
| **Throughput per GPU** | **2351.06 tok/s** |
| Total Token throughput | 18808.52 tok/s |
| Mean TPOT (ms) | ~14.7 |
| Interactivity | 65.5 tok/s/user |
| GSM8K | 0.9393 ✓ |
| **Config** | TP=8 fp8 spec=3 + bigbatch + level=3 + wide cudagraph **(DSR1-MXFP4)** |
| Baseline target | 3900 tok/s |

### conc=128 — 3537.19 tok/s/GPU (May 8 submission)
| Metric | Value |
|--------|------:|
| **Throughput per GPU** | **3537.19 tok/s** |
| Total Token throughput | 28297.49 tok/s |
| Mean TPOT (ms) | 38.32 |
| Interactivity | 24.07 tok/s/user |
| GSM8K | 0.9348 ✓ |
| **Config** | TP=8 fp8 spec=3 + `max-num-batched-tokens=131072` + `max-num-seqs=256` **(DSR1-MXFP4)** |
| Baseline target | 6000 tok/s |
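
As a sanity check across these tables, interactivity is roughly the inverse of per-token latency: tok/s/user ≈ 1000 / TPOT(ms). For the conc=4 submission:

```bash
python3 -c 'print(f"{1000/6.07:.1f} tok/s/user")'   # ≈ 164.7 from median TPOT, vs the reported 162.8
```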

## Key finding: model matters at conc=4

We tested both available model variants on identical config:

| Model | Conc=4 peak | Median TPOT |
|-------|------------:|------------:|
| `DeepSeek-R1-0528-MXFP4` (376GB, 82 shards) | 736.80 | 6.40 ms |
| `DeepSeek-R1-0528-MXFP4-MTP-MoEFP4` (350GB, 76 shards) | **757.12** | **6.07 ms** |

At conc=32 and conc=128, the standard model was faster — the MoEFP4 variant only helps at conc=4 (small batches benefit from the FP4 MoE quant).

## Hardware
- 8× AMD MI355X (gfx950) per node
- ROCm 7 (in `rocm/atom` image)
- Inference engine: ATOM v0.1.2 (`rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2`)
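
To confirm the GPU target and device count inside the container, a minimal sketch (both tools ship with ROCm):

```bash
# Verify the gfx950 target and that all 8 MI355X devices are visible.
rocminfo | grep -m1 gfx950
rocm-smi --showproductname
```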
38 changes: 38 additions & 0 deletions recipes/DeepSeek-R1-MXFP4-MI355X-Jons/README.md
# AMD DSR1-MXFP4 Inference Optimization — Submission (Team Jons)

## Overview
Optimization of `DeepSeek-R1-0528-MXFP4` inference on 8× MI355X (gfx950) for AMD's competition benchmark at ISL=8192 / OSL=1024 across concurrency levels 4, 32, and 128.

## Final leaderboard standings (Team Jons)

| Conc | Throughput per GPU (tok/s) | Rank / score (out of 1000) | Target | % of target | Event ID |
|-----:|-------------------:|-------------------:|-------:|------------:|----------|
| 4 | **757.12** | T#3 / I#2 — 840 | 1500 | 50.5% | `d2eb2378c2d540248005d9e1882a11b1` |
| 32 | **2351.06** | T#1 / I#1 — 1000 | 3900 | 60.3% | `474be027ba7c4ec992371ff5f50508f2` |
| 128 | **3537.19** | T#1 / I#1 — 1000 | 6000 | 58.9% | (May 8 submission) |
| **Total** | — | **2840 / 3000** | — | — | — |

## Key technical contribution
**Discovered that the `DeepSeek-R1-0528-MXFP4-MTP-MoEFP4` model gives faster inference at conc=4 than the standard `DeepSeek-R1-0528-MXFP4` model.** Same architecture, but with the MoE weights separately FP4-quantized. Lower mean TPOT (5.64–5.82 ms vs 5.95–6.40 ms) translates into a higher per-GPU throughput peak: 757.12 tok/s vs 742 with the standard model.

This pushed our conc=4 leaderboard entry from 742 to 757 tok/s/GPU (about +2%, but enough to climb in throughput rank).

## Files

- `TECHNICAL_APPROACH.md` — what we changed and why
- `PERFORMANCE_METRICS.md` — throughput numbers + raw JSON
- `launchers/`
- `launch_atom_c4_level3.sh` — conc=4 with standard model
- `launch_atom_c4_level3_mtp_moefp4.sh` — **conc=4 with MoEFP4 model (BEST)**
- `launch_atom_tp8_spec3_bigbatch.sh` — conc=128 (TOP-1)
- `submit_c4_moefp4.sh` — submission script for c4 MoEFP4
- `run_dsr1_c4only_moefp4.sh` — c4 perf-test driver for MoEFP4
- `results/`
- `peak_c4_757_moefp4.json` — the 757.12 submission JSON
- `submit_c32_bb_level3_*.json.json` — 2351 c32 submission JSON
- `submit_bigbatch_c128_*.json.json` — 3537 c128 submission JSON
- `submit_tp8_fp8_level3_c4_*.json.json` — prior 711 baseline (superseded by 757)
- `prototypes/`
- `triton_mla_fp8_multi.py` — bonus: custom Triton fp8 MLA kernel (functionally correct, perf needs work)
- `TRITON_FP8_MLA_HANDOFF.md` — handoff doc
- `sglang_patches/deepseek_weight_loader.py` — partial SGLang MTP loader fixes
82 changes: 82 additions & 0 deletions recipes/DeepSeek-R1-MXFP4-MI355X-Jons/TECHNICAL_APPROACH.md
# Technical Approach

## Goal
Maximize `tput_per_gpu = total_token_throughput / 8` for `DeepSeek-R1-0528-MXFP4` on 8× MI355X (gfx950) against AMD's benchmark harness (`dsr1_benchmark perf`) at ISL=8192 / OSL=1024 for concurrency levels 4, 32, and 128. Must pass GSM8K accuracy ≥ 0.93.

## Final Stack (Submitted)

### conc=128: `tp8_spec3_bigbatch` — 3537.19 tok/s/GPU (TOP-1)
```
python3 -m atom.entrypoints.openai_server \
--model /share4/teamK/DeepSeek-R1-0528-MXFP4 \
--server-port 8888 -tp 8 \
--kv_cache_dtype fp8 \
--max-model-len 10240 \
--method mtp --num-speculative-tokens 3 \
--max-num-batched-tokens 131072 \
--max-num-seqs 256
```

### conc=4: `c4_level3` — 711.28 tok/s/GPU
```
python3 -m atom.entrypoints.openai_server \
--model /share4/teamK/DeepSeek-R1-0528-MXFP4 \
--server-port 8888 -tp 8 \
--kv_cache_dtype fp8 \
--max-model-len 10240 \
--method mtp --num-speculative-tokens 3 \
--level 3 \
--cudagraph-capture-sizes "[1,2,4,8]"
```
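
Before a full bench run, either server can be smoke-tested once it reports healthy. A minimal sketch (the `/health` endpoint is what our driver script polls; the completion payload is an assumption based on the OpenAI-compatible route):

```bash
# Readiness probe, then a tiny completion through the OpenAI-compatible API.
curl -fsS http://0.0.0.0:8888/health
curl -s http://0.0.0.0:8888/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "/share4/teamK/DeepSeek-R1-0528-MXFP4", "prompt": "2+2=", "max_tokens": 8}'
```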

## Knob Catalogue — What Helped vs What Hurt

### Helps at conc=128
- `--max-num-batched-tokens 131072` + `--max-num-seqs 256` — enables aggressive prefill batching for 128 concurrent users. **+~10% over vanilla.**
- MTP speculative decoding with `--num-speculative-tokens 3` — gives ~2.3 tokens/forward via MTP acceptance.
- `--method mtp` is the only speculative method supported in ATOM v0.1.2.

### Helps at conc=4
- `--level 3` + `--cudagraph-capture-sizes "[1,2,4,8]"` — tight cudagraph capture covering the small batch sizes seen at conc=4 with spec=3.
- `--num-speculative-tokens 3` (max for fp8 MLA path on this build — see Constraints).

### Confirmed Dead Knobs (regressions or crashes on this build)
| Knob | Effect | Reason |
|------|--------|--------|
| `--enable-dp-attention` | Tensor shape mismatch (20480 vs 16384) | v0.1.2 DP-attn bug |
| `--enable-expert-parallel` (without MoRI tuning) | MoRI symmetric heap OOM | Default heap = 2 GB |
| `--data-parallel-size > 1` | `recvBytes` / process group init failure | RCCL/MoRI conflict |
| `--enable_prefix_caching` | `NoneType.shape` per request | v0.1.2 bug |
| `--num-speculative-tokens ≥ 4` (fp8 KV) | C++ assert: `qo_len <= 4` | Hard cap in `asm_mla.cu:281` |
| `--kv_cache_dtype bf16` + `--num-speculative-tokens ≥ 4` | GSM8K = 0.05 (output broken) | DSR1 MTP head not trained for spec > 3 |
| `--kv_cache_dtype bf16` at TP=8 | ~25% throughput regression vs fp8 | 2× KV bandwidth |
| TP=4 with default cudagraph capture | GPU memory access fault (MoE) | TP=4 + batch=65 (GSM8K eval concurrency) outside captured graphs |
| AMD env stack alone (`HIP_FORCE_DEV_KERNARG=1`, `AITER_ENABLE_VSKIP=1`, `AMD_DIRECT_DISPATCH=1`, `GPU_MAX_HW_QUEUES=8`) | -40% at conc=128 | Co-tuned vars; need pairing with level=3/cudagraph |
| `--max-num-seqs > 256` | Crash | Session 012 confirmed |
| `--enforce-eager` | -77% | Cudagraph load-bearing |
| `--block-size 128` | -2% | Slight regression |

## Structural Ceiling — What We Could Not Solve

To exceed our submitted numbers we would need **higher speculative acceptance per forward step**. Two empirically-proven blockers prevent this on `rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2`:

1. **AITER fp8 MLA decode kernel hard-caps `qo_len ≤ 4`** (`/app/aiter-test/csrc/py_itfs_cu/asm_mla.cu:281`). The precompiled `.co` binaries in `/app/aiter-test/hsa/gfx950/mla/` only ship `qSeqLen ∈ {1, 2, 4}` for fp8, and no source `.s` files are present to rebuild for qSeqLen > 4. This caps MTP at spec=3 in fp8; a quick probe of the shipped binaries is sketched after this list.

2. **DSR1's single MTP layer (`num_nextn_predict_layers = 1`) was trained for spec=3**. Empirically tested: at TP=4 + bf16 + spec=4 we get GSM8K=0.0561 (broken). At spec=5: GSM8K=0.0508. Output collapses to random tokens above spec=3.
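
The kernel-cap claim in point 1 is easy to re-verify inside the image (a sketch using the paths cited above):

```bash
# Enumerate the precompiled fp8 MLA kernels: only qSeqLen 1/2/4 variants ship,
# and there are no .s sources to rebuild for larger qo_len.
ls /app/aiter-test/hsa/gfx950/mla/
grep -n "qo_len" /app/aiter-test/csrc/py_itfs_cu/asm_mla.cu | head
```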

We additionally investigated unlocking EAGLE/NEXTN via SGLang v0.5.9-rocm700-mi35x, which would allow tree speculation with higher acceptance. We found **at least 3 cascading bugs** in SGLang's MTP+TP=8+MXFP4 load path:
- `channel_quant_to_tensor_quant` shape mismatch (`fp8_utils.py:1035`)
- `quark_post_load_weights` UnboundLocalError on fp8 input (`quark/utils.py:214`)
- `apply_fp8_linear` receives tuple instead of tensor (`fp8_utils.py:1105`)

Partial patches are in `sglang_patches/deepseek_weight_loader.py`. A full fix is multi-day work, beyond the competition window.

## Prototype Work Product Beyond the Submission

We additionally built a **Triton fp8 MLA decode kernel** (`prototypes/triton_mla_fp8_multi.py`) that supports arbitrary `qo_len` up to 8, intended to bypass the AITER kernel cap. It passes GSM8K (0.9447 at qo_len=4, matching the ASM-kernel baseline) but is **~8× slower than the ASM kernel** because Triton lacks native fp8 dot-product support on AMD. With more time it could be optimized to be competitive; we include it for completeness.

## Methodology Notes
- All numbers are from `dsr1_benchmark perf` (the AMD-provided harness). Submissions used `dsr1_benchmark submit Jons`.
- Each run loads the model fresh, runs GSM8K validation, then runs the perf bench. Total ~12-15 minutes per run.
- Variance on c4_level3 has σ ≈ 14 tok/s/GPU around mean ~715. The 711 submission landed at the low end of variance; historical peak from this same config is 736 (May 11).
- Benchmark harness binary computes `tput_per_gpu = total_token_throughput / 8.0` (hardcoded — verified by `strings`).
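
The same division can be recomputed from any bench log to double-check a submission (a sketch assuming the harness's `Total Token throughput: <value>` summary line):

```bash
# Recompute tput_per_gpu from the harness summary line (line format assumed).
awk -F': *' '/Total Token throughput/ {printf "%.2f tok/s/GPU\n", $2 / 8.0}' bench.log
```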
24 changes: 24 additions & 0 deletions recipes/DeepSeek-R1-MXFP4-MI355X-Jons/launchers/launch_atom_c4_level3.sh
#!/bin/bash
# c4 + --level 3 (max compilation level) + tight cudagraph + small max-num-seqs.
# Hypothesis: torch.compile level 3 may help small-batch decode kernels.

export AITER_ROOT_DIR=/projects/teamK/aiter_cache
export HF_HOME=/projects/teamK/hf_home
export HF_MODULES_CACHE=/projects/teamK/hf_home/modules
export TRITON_CACHE_DIR=/projects/teamK/triton_cache
export TVM_FFI_CACHE_DIR=/projects/teamK/tvm_cache
export TMPDIR=/projects/teamK/tmp
export OMP_NUM_THREADS=1
export AMDGCN_USE_BUFFER_OPS=1
export VLLM_CACHE_ROOT=/projects/teamK/atom_cache
export HOME=/projects/teamK/home_atom

python3 -m atom.entrypoints.openai_server \
--model /share4/teamK/DeepSeek-R1-0528-MXFP4 \
--server-port 8888 -tp 8 \
--kv_cache_dtype fp8 \
--max-model-len 10240 \
--method mtp --num-speculative-tokens 3 \
--level 3 \
--cudagraph-capture-sizes "[1,2,4,8]" \
2>&1 | tee /projects/teamK/server_c4_level3.log
23 changes: 23 additions & 0 deletions recipes/DeepSeek-R1-MXFP4-MI355X-Jons/launchers/launch_atom_c4_level3_mtp_moefp4.sh
#!/bin/bash
# c4_level3 with DSR1-MXFP4-MTP-MoEFP4 model (untested with ATOM).
# Different weights from base DSR1-MXFP4. Smaller (350GB vs 376GB).
export AITER_ROOT_DIR=/projects/teamK/aiter_cache
export HF_HOME=/projects/teamK/hf_home
export HF_MODULES_CACHE=/projects/teamK/hf_home/modules
export TRITON_CACHE_DIR=/projects/teamK/triton_cache
export TVM_FFI_CACHE_DIR=/projects/teamK/tvm_cache
export TMPDIR=/projects/teamK/tmp
export OMP_NUM_THREADS=1
export AMDGCN_USE_BUFFER_OPS=1
export VLLM_CACHE_ROOT=/projects/teamK/atom_cache
export HOME=/projects/teamK/home_atom

python3 -m atom.entrypoints.openai_server \
--model /share4/teamK/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 \
--server-port 8888 -tp 8 \
--kv_cache_dtype fp8 \
--max-model-len 10240 \
--method mtp --num-speculative-tokens 3 \
--level 3 \
--cudagraph-capture-sizes "[1,2,4,8]" \
2>&1 | tee /projects/teamK/server_c4_mtp_moefp4.log
24 changes: 24 additions & 0 deletions recipes/DeepSeek-R1-MXFP4-MI355X-Jons/launchers/launch_atom_tp8_spec3_bigbatch.sh
#!/bin/bash
# tp8_spec3_vanilla + AMD env stack + larger max-num-batched-tokens.
# Hypothesis: bigger prefill batches = better TTFT amortization at conc=128.

export AITER_ROOT_DIR=/projects/teamK/aiter_cache
export HF_HOME=/projects/teamK/hf_home
export HF_MODULES_CACHE=/projects/teamK/hf_home/modules
export TRITON_CACHE_DIR=/projects/teamK/triton_cache
export TVM_FFI_CACHE_DIR=/projects/teamK/tvm_cache
export TMPDIR=/projects/teamK/tmp
export AMDGCN_USE_BUFFER_OPS=1
export VLLM_CACHE_ROOT=/projects/teamK/atom_cache
export HOME=/projects/teamK/home_atom
export OMP_NUM_THREADS=1

python3 -m atom.entrypoints.openai_server \
--model /share4/teamK/DeepSeek-R1-0528-MXFP4 \
--server-port 8888 -tp 8 \
--kv_cache_dtype fp8 \
--max-model-len 10240 \
--method mtp --num-speculative-tokens 3 \
--max-num-batched-tokens 131072 \
--max-num-seqs 256 \
2>&1 | tee /projects/teamK/server_tp8_spec3_bigbatch.log
83 changes: 83 additions & 0 deletions recipes/DeepSeek-R1-MXFP4-MI355X-Jons/run_dsr1_c4only_moefp4.sh
#!/bin/bash
# c4-only driver for MTP-MoEFP4 model — uses the correct model path in bench.
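# Usage example (expands to launch_atom_c4_level3_mtp_moefp4.sh under /projects/teamK/supreme-leader):
#   bash run_dsr1_c4only_moefp4.sh c4_level3_mtp_moefp4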
set -u
LAUNCHER="${1:?Usage: $0 <launcher_basename>}"
TS="$(date -u +%Y%m%dT%H%M%SZ)"
RUN_DIR="/projects/teamK/supreme-leader/runs/${TS}_c4_${LAUNCHER}"
LOG_DIR="${RUN_DIR}/logs"
mkdir -p "${LOG_DIR}"

LAUNCHER_FILE="/projects/teamK/supreme-leader/launch_atom_${LAUNCHER}.sh"
[ -f "${LAUNCHER_FILE}" ] || { echo "FATAL: launcher missing: ${LAUNCHER_FILE}"; exit 2; }

CONTAINER="atom-dsr1-dev"
IMAGE="rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2"
PORT=8888
DOCKER="/usr/local/bin/docker-teamK-unrestricted"

if ! "${DOCKER}" ps -a --format '{{.Names}}' | grep -q "^${CONTAINER}$"; then
DRI_FLAGS=()
for d in /dev/dri/*; do [ -c "$d" ] && DRI_FLAGS+=("--device=$d"); done
mkdir -p /projects/teamK/supreme-leader/dsr1_aiter_cache
chmod 777 /projects/teamK/supreme-leader/dsr1_aiter_cache
"${DOCKER}" run -d --name "${CONTAINER}" --ipc=host --shm-size=16g \
--device=/dev/kfd "${DRI_FLAGS[@]}" \
-v /share4:/share4 -v /projects/teamK:/projects/teamK -v /projects/teamK:/workspace \
-v /projects/teamK/supreme-leader/dsr1_aiter_cache:/root/.aiter \
-p ${PORT}:${PORT} \
"${IMAGE}" /bin/bash -c "sleep infinity"
fi

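# Expose our bench checkout at the fixed path the bench steps below cd into,
# and mark it safe for git inside the container.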
"${DOCKER}" exec "${CONTAINER}" bash -c '
mkdir -p /workspace/amdgpu_bounty_optimization
[ -e /workspace/amdgpu_bounty_optimization/dsr1-fp4-atom-mtp-mi355x ] || \
ln -sfn /workspace/supreme-leader/bench_atom /workspace/amdgpu_bounty_optimization/dsr1-fp4-atom-mtp-mi355x
git config --global --add safe.directory /workspace/supreme-leader/bench_atom 2>/dev/null || true
'

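# Stage a copy of the launcher in this run's directory and repoint its tee target
# at this run's server.log, so every run keeps its own server log.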
LN="$(basename "${LAUNCHER_FILE}")"
cp "${LAUNCHER_FILE}" "${RUN_DIR}/${LN}"
sed -i "s|tee /projects/teamK/server_.*\.log|tee /workspace/supreme-leader/runs/${TS}_c4_${LAUNCHER}/server.log|g" "${RUN_DIR}/${LN}"

echo "=== launching server"
"${DOCKER}" exec -d "${CONTAINER}" bash -c "
cd /workspace/amdgpu_bounty_optimization/dsr1-fp4-atom-mtp-mi355x
bash /workspace/supreme-leader/runs/${TS}_c4_${LAUNCHER}/${LN}
"

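# Poll /health for up to 20 minutes; if the server log shows a known-fatal
# signature and the atom process is already gone, tear down and exit early.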
SECONDS=0
HEALTHY=0
while [ $SECONDS -lt 1200 ]; do
curl -fsS "http://0.0.0.0:${PORT}/health" >/dev/null 2>&1 && { HEALTHY=1; break; }
if grep -qE "Out of symmetric heap|RuntimeError|proc died|All EngineCores shut down|memory access fault" "${RUN_DIR}/server.log" 2>/dev/null; then
if ! "${DOCKER}" exec "${CONTAINER}" pgrep -f atom.entrypoints >/dev/null 2>&1; then
echo "FATAL early. Last 20:"; tail -20 "${RUN_DIR}/server.log"
"${DOCKER}" rm -f "${CONTAINER}" >/dev/null 2>&1 || true
exit 6
fi
fi
sleep 15
[ $((SECONDS % 60)) -lt 15 ] && echo " [${SECONDS}s] waiting"
done
[ "${HEALTHY}" = "1" ] || { echo "FATAL: not healthy in 20m"; tail -30 "${RUN_DIR}/server.log"; exit 4; }
echo "=== healthy after ${SECONDS}s"

BENCH_LOG="${LOG_DIR}/bench_${LAUNCHER}_c4.log"
"${DOCKER}" exec "${CONTAINER}" bash -c "
cd /workspace/amdgpu_bounty_optimization/dsr1-fp4-atom-mtp-mi355x
export MODEL=/share4/teamK/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4
export PORT=${PORT}; export TP=8; export ISL=8192; export OSL=1024; export CONC=4
export RANDOM_RANGE_RATIO=1.0
export NUM_PROMPTS=40
export RESULT_FILENAME=c4_${LAUNCHER}_${TS}.json
export EP_SIZE=1; export DP_ATTENTION=0
export HF_HOME=/projects/teamK/hf_home
./dsr1_benchmark perf
" 2>&1 | tee "${BENCH_LOG}"

"${DOCKER}" exec "${CONTAINER}" bash -c "pkill -f 'atom.entrypoints.openai_server' || true"
sleep 5
"${DOCKER}" rm -f "${CONTAINER}" >/dev/null 2>&1 || true

echo "=== DONE ${LAUNCHER} c4"
grep -E "Total Token throughput|gsm8k_metric|Mean TPOT" "${BENCH_LOG}" 2>&1 | head -10