Commit e8ffa83

ChangLiu0709 and claude committed
[AMD][ROCM] Add MI355X config: glm5-fp4-mi355x-sglang-mtp
- Add `glm5-fp4-mi355x-sglang-mtp` config to amd-master.yaml.
- Add benchmarks/single_node/glm5_fp4_mi355x_mtp.sh launch script.
- Image: lmsysorg/sglang-rocm:v0.5.10.post1-rocm700-mi35x-20260428
- Model: amd/GLM-5-MXFP4 (TP=8, FP4/quark quantization)
- EAGLE MTP speculative decoding: num-steps=3, eagle-topk=1, num-draft-tokens=4, behind SGLANG_ENABLE_SPEC_V2=1
- Search space: 1k1k and 8k1k, conc 4-64, spec-decoding=mtp
- Append perf-changelog.yaml entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent eab567a commit e8ffa83

3 files changed

Lines changed: 108 additions & 0 deletions

.github/configs/amd-master.yaml

Lines changed: 18 additions & 0 deletions
@@ -381,6 +381,24 @@ glm5-fp8-mi355x-sglang-mtp:
     search-space:
     - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: mtp }

+glm5-fp4-mi355x-sglang-mtp:
+  image: lmsysorg/sglang-rocm:v0.5.10.post1-rocm700-mi35x-20260428
+  model: amd/GLM-5-MXFP4
+  model-prefix: glm5
+  runner: mi355x
+  precision: fp4
+  framework: sglang
+  multinode: false
+  seq-len-configs:
+  - isl: 1024
+    osl: 1024
+    search-space:
+    - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: mtp }
+  - isl: 8192
+    osl: 1024
+    search-space:
+    - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: mtp }
+
 glm5-fp8-mi355x-atom:
   image: rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post
   model: zai-org/GLM-5-FP8
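
The `search-space` entries above give only a concurrency range, so it may help to see how such a range could turn into individual benchmark runs. The sketch below is illustrative: `expand_conc` is a hypothetical helper, and the doubling step is an assumption, since the sweep generator that reads `conc-start` and `conc-end` is not part of this commit. The `CONC * 10` prompt count does come from the launch script's `--num-prompts` argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand a conc-start/conc-end pair into sweep points.
# Assumption: concurrency doubles at each step. The real step rule lives in
# the repository's sweep generator, which this commit does not show.
expand_conc() {
    local conc=$1 end=$2 points=()
    while (( conc <= end )); do
        points+=("$conc")
        conc=$(( conc * 2 ))
    done
    echo "${points[*]}"
}

# The launch script sends CONC * 10 prompts per run (see --num-prompts).
for conc in $(expand_conc 4 64); do
    echo "conc=$conc num_prompts=$(( conc * 10 ))"
done
```

Under the doubling assumption, each of the two sequence-length configs (1k1k and 8k1k) would produce five runs at TP=8.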
benchmarks/single_node/glm5_fp4_mi355x_mtp.sh

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+#!/usr/bin/env bash
+
+source "$(dirname "$0")/../benchmark_lib.sh"
+
+check_env_vars \
+    MODEL \
+    TP \
+    CONC \
+    ISL \
+    OSL \
+    RANDOM_RANGE_RATIO \
+    RESULT_FILENAME
+
+if [[ -n "$SLURM_JOB_ID" ]]; then
+    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
+fi
+
+hf download "$MODEL"
+
+export SGLANG_ENABLE_SPEC_V2=1
+
+SERVER_LOG=/workspace/server.log
+PORT=${PORT:-8888}
+
+EVAL_CONTEXT_ARGS=""
+if [ "${EVAL_ONLY}" = "true" ]; then
+    setup_eval_context
+    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
+fi
+# Start GPU monitoring (power, temperature, clocks every second)
+start_gpu_monitor
+
+python3 -m sglang.launch_server \
+    --model-path $MODEL \
+    --host=0.0.0.0 \
+    --port $PORT \
+    --trust-remote-code \
+    --tp $TP \
+    --chunked-prefill-size 131072 \
+    --disable-radix-cache \
+    --mem-fraction-static 0.85 \
+    --model-loader-extra-config '{"enable_multithread_load": true}' \
+    --watchdog-timeout 1200 \
+    --reasoning-parser glm45 \
+    --tool-call-parser glm47 \
+    --speculative-algorithm EAGLE \
+    --speculative-num-steps 3 \
+    --speculative-eagle-topk 1 \
+    --speculative-num-draft-tokens 4 \
+    $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
+
+SERVER_PID=$!
+
+# Wait for server to be ready
+wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"
+
+run_benchmark_serving \
+    --model "$MODEL" \
+    --port "$PORT" \
+    --backend vllm \
+    --input-len "$ISL" \
+    --output-len "$OSL" \
+    --random-range-ratio "$RANDOM_RANGE_RATIO" \
+    --num-prompts "$((CONC * 10))" \
+    --max-concurrency "$CONC" \
+    --result-filename "$RESULT_FILENAME" \
+    --result-dir /workspace/ \
+    --use-chat-template
+
+# After throughput, run evaluation only if RUN_EVAL is true
+if [ "${RUN_EVAL}" = "true" ]; then
+    run_eval --framework lm-eval --port "$PORT"
+    append_lm_eval_summary
+fi
+
+# Stop GPU monitoring
+stop_gpu_monitor
+set +x
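
The launch script relies on `check_env_vars` from `benchmark_lib.sh`, which this commit does not include. As a rough illustration of what a guard of that kind might do, the sketch below defines a hypothetical stand-in named `require_env_vars`. The name, the error message, and the return-code behaviour are all assumptions and are not taken from the repository.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for check_env_vars from benchmark_lib.sh (not shown
# in this commit). Names every variable that is unset or empty on stderr and
# returns non-zero if any are missing.
require_env_vars() {
    local missing=0 name
    for name in "$@"; do
        # ${!name:-} is bash indirect expansion: the value of the variable
        # whose name is stored in $name, or empty if it is unset.
        if [[ -z "${!name:-}" ]]; then
            echo "missing required env var: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example values only; the harness normally exports these from the YAML config.
export MODEL=amd/GLM-5-MXFP4 TP=8 CONC=4 ISL=1024 OSL=1024
if require_env_vars MODEL TP CONC ISL OSL; then
    echo "environment ok"
fi
```

Checking every variable before returning, rather than stopping at the first gap, reports all missing settings in one pass. Whether the real helper behaves this way is not visible here.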

perf-changelog.yaml

Lines changed: 12 additions & 0 deletions
@@ -2069,3 +2069,15 @@
     - "Recipes cover 8k/1k aggregate TP8 low-latency conc=1, low-latency bridge 1P DEP8 + 4D TP8 no-offload conc=16/32/64, mid 1P/1D DEP8 MegaMOE conc=128, and high-throughput 2P/1D DEP8 MegaMOE conc=1024"
     - "All recipes enable FP4 indexer cache and speculative-config mtp with num_speculative_tokens=2"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1242
+
+- config-keys:
+    - glm5-fp4-mi355x-sglang-mtp
+  description:
+    - "Add GLM-5 MXFP4 MI355X SGLang MTP benchmark"
+    - "Image: lmsysorg/sglang-rocm:v0.5.10.post1-rocm700-mi35x-20260428"
+    - "Model: amd/GLM-5-MXFP4"
+    - "EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) behind SGLANG_ENABLE_SPEC_V2=1"
+    - "Image ships transformers with glm_moe_dsa support, so no extra pip install is needed (unlike glm5-fp8-mi355x-sglang)"
+    - "Configs: 1k1k and 8k1k, TP=8 conc 4-64 with spec-decoding=mtp"
+    - "Requires benchmark_serving.py tokenizer fix: https://github.com/SemiAnalysisAI/InferenceX/pull/1253"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1254
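
The three speculative decoding values in the changelog entry appear to be related rather than independent. The sketch below checks that relationship with plain shell arithmetic. It assumes that with `eagle-topk=1` the draft forms a single chain, so the verified sequence spans `num-steps + 1` positions. That reading is an assumption about SGLang's EAGLE settings and should be confirmed against the SGLang server argument documentation.

```shell
#!/usr/bin/env bash
# Values taken from the launch script's --speculative-* flags.
num_steps=3
eagle_topk=1
num_draft_tokens=4

# Assumption: with topk=1 the draft is a single chain, so the verified
# sequence covers the num_steps drafted tokens plus one more position.
expected_draft_tokens=$(( num_steps + 1 ))

if (( eagle_topk == 1 && num_draft_tokens != expected_draft_tokens )); then
    echo "warning: expected num-draft-tokens=$expected_draft_tokens" >&2
else
    echo "consistent: num-draft-tokens=$num_draft_tokens"
fi
```

With the values from this commit the check prints the consistent branch, which matches the `num-steps=3, num-draft-tokens=4` pairing in the config.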
