[ci] add Qwen3.5 Dense/MoE models accuracy validation and benchmark tests for atom-plugined sglang by wanzhenchn · Pull Request #700 · ROCm/ATOM

wanzhenchn · 2026-05-06T09:21:06Z

Motivation

Add Qwen3.5 coverage to ATOM SGLang CI / nightly accuracy / benchmark flows.
Align Qwen3.5 SGLang launch args with validated local commands.
Add Qwen3.5-397B-A17B-FP8 TP4 / TP8 benchmark cases.
Rotate scheduled benchmark groups to avoid running all benchmark cases every night.
Update recipes/atom_sglang/Qwen3_5.md with server, benchmark, and GSM8K commands.

ATOM SGLang CI / Nightly / Benchmark Scope

CI

Item	Value
Workflow	`.github/workflows/atom-sglang-test.yaml`
Trigger	PR to `main`, non-draft, non-closed
Purpose	PR-level SGLang GSM8K accuracy smoke validation

Model	Weight	Runner	TP	Threshold
DeepSeek-R1-FP8 TP4	`deepseek-ai/DeepSeek-R1-0528`	`linux-atom-mi35x-4`	4	`0.91`
DeepSeek-R1-FP4 TP4	`amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4`	`linux-atom-mi35x-4`	4	`0.91`
Qwen3.5-35B-A3B-FP8 TP2	`Qwen/Qwen3.5-35B-A3B-FP8`	`linux-atom-mi35x-4`	2	`0.76`

Nightly Accuracy

Item	Value
Workflow	`.github/workflows/atom-sglang-accuracy-validation.yaml`
Trigger	Nightly `18:00 UTC` / Beijing `02:00`, or manual dispatch
Task	`gsm8k`
Metric	`results.gsm8k["exact_match,flexible-extract"]`
Few-shot	`3`
LM Eval concurrency	`65`
LM Eval retries	`1`
SGLang ref	`v0.5.10`

Model	Weight	Runner	TP	Threshold
DeepSeek-R1-FP8 TP4	`deepseek-ai/DeepSeek-R1-0528`	`linux-atom-mi35x-4`	4	`0.91`
DeepSeek-R1-FP8 TP8	`deepseek-ai/DeepSeek-R1-0528`	`linux-atom-mi35x-8`	8	`0.93`
DeepSeek-R1-FP4 TP4	`amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4`	`linux-atom-mi35x-4`	4	`0.91`
DeepSeek-R1-FP4 TP8	`amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4`	`linux-atom-mi35x-8`	8	`0.93`
Qwen3.5-35B-A3B-FP8 TP2	`Qwen/Qwen3.5-35B-A3B-FP8`	`linux-atom-mi35x-4`	2	`0.76`
Qwen3.5-35B-A3B TP2	`Qwen/Qwen3.5-35B-A3B`	`linux-atom-mi35x-4`	2	`0.83`
Qwen3.5-397B-A17B-FP8 TP4	`Qwen/Qwen3.5-397B-A17B-FP8`	`linux-atom-mi35x-4`	4	`0.83`
Qwen3.5-397B-A17B-FP8 TP8	`Qwen/Qwen3.5-397B-A17B-FP8`	`linux-atom-mi35x-8`	8	`0.83`
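
The pass/fail decision implied by the Metric and Threshold columns above can be sketched as follows. This is a minimal sketch, assuming the `results` object from the lm_eval output has already been loaded into a dict; the helper names, the `>=` comparison, and the sample scores are hypothetical and are not taken from the workflow itself.

```python
# Key of the GSM8K metric named in the "Metric" row above.
METRIC_KEY = "exact_match,flexible-extract"


def gsm8k_score(results: dict) -> float:
    """Read the flexible-extract exact-match score from a results mapping."""
    return float(results["gsm8k"][METRIC_KEY])


def passes_threshold(results: dict, threshold: float) -> bool:
    """Assumption: a run passes when the score is at least the threshold."""
    return gsm8k_score(results) >= threshold


# Hypothetical sample scores, for illustration only (not measured results).
sample = {"gsm8k": {METRIC_KEY: 0.84, "exact_match,strict-match": 0.80}}

print(passes_threshold(sample, 0.83))  # True
print(passes_threshold(sample, 0.91))  # False
```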

Server Args

Model Family	Default Server Args	Extra Args	Env
DeepSeek	`--trust-remote-code --kv-cache-dtype fp8_e4m3 --mem-fraction-static 0.8 --page-size 1 --disable-radix-cache`	`--tensor-parallel-size <tp>`; EP case adds `--expert-parallel-size 8`	`AITER_QUICK_REDUCE_QUANTIZATION=INT4`, `SGLANG_AITER_FP8_PREFILL_ATTN=0`, `SGLANG_USE_AITER=1`, `ATOM_ENABLE_DS_QKNORM_QUANT_FUSION=1`
Qwen3.5	empty via `SGLANG_DEFAULT_SERVER_ARGS=`	`--tensor-parallel-size <tp> --mem-fraction-static 0.9 --reasoning-parser qwen3 --disable-radix-cache`	`SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models`, `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=0`
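
How an empty `SGLANG_DEFAULT_SERVER_ARGS=` could disable the defaults while an unset variable keeps them might look like the sketch below. It assumes the test script falls back to the DeepSeek defaults when the variable is missing; `build_server_args` and its parameters are hypothetical names, and the real logic in `.github/scripts/atom_sglang_test.sh` may differ.

```python
import shlex

# Default args from the DeepSeek row of the table above.
DEEPSEEK_DEFAULT_ARGS = (
    "--trust-remote-code --kv-cache-dtype fp8_e4m3 "
    "--mem-fraction-static 0.8 --page-size 1 --disable-radix-cache"
)


def build_server_args(env: dict, extra_args: str,
                      fallback: str = DEEPSEEK_DEFAULT_ARGS) -> list:
    """Combine default and extra server args into one argument list.

    A variable that is present but empty yields no default args; only a
    missing variable falls back (assumed behavior, not confirmed by the PR).
    """
    default_args = env.get("SGLANG_DEFAULT_SERVER_ARGS", fallback)
    return shlex.split(default_args) + shlex.split(extra_args)


# Qwen3.5: defaults emptied, so only the extra args remain.
qwen_args = build_server_args(
    {"SGLANG_DEFAULT_SERVER_ARGS": ""},
    "--tensor-parallel-size 2 --mem-fraction-static 0.9 "
    "--reasoning-parser qwen3 --disable-radix-cache",
)

# DeepSeek: variable unset, so the defaults are kept and TP is appended.
deepseek_args = build_server_args({}, "--tensor-parallel-size 4")
```

The distinction between "unset" and "set to empty" is the point of the sketch: a shell script would need `${VAR-default}` rather than `${VAR:-default}` to get the same effect.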

Nightly Benchmark

Item	Value
Workflow	`.github/workflows/atom-sglang-benchmark.yaml`
Trigger	Weekday `15:00 UTC` / Beijing `23:00`, or manual dispatch
Manual selection	Checkbox-based model selection
Param override	`param_lists`
Dashboard upload	`publish_to_dashboard`

Schedule Group	Beijing Day	Models	Default Case Count
`A-DEEPSEEK`	Monday / Wednesday	5 DeepSeek benchmark models	`5 × 10 = 50`
`B-QWEN35`	Tuesday / Thursday	2 Qwen3.5-397B benchmark models	`2 × 10 = 20`
`C-ALL`	Friday	All benchmark models	`7 × 10 = 70`
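
The weekday rotation in the table above can be sketched as a lookup on the Beijing-local weekday. This is an illustrative sketch only; `schedule_group` and `GROUP_BY_WEEKDAY` are hypothetical names, and the actual workflow may select the group differently (for example in a shell step).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

BEIJING = timezone(timedelta(hours=8))

# Monday=0 ... Sunday=6, following datetime.weekday().
GROUP_BY_WEEKDAY = {
    0: "A-DEEPSEEK",  # Monday
    1: "B-QWEN35",    # Tuesday
    2: "A-DEEPSEEK",  # Wednesday
    3: "B-QWEN35",    # Thursday
    4: "C-ALL",       # Friday
}


def schedule_group(now_utc: datetime) -> Optional[str]:
    """Return the benchmark group for the Beijing-local day, or None on weekends."""
    return GROUP_BY_WEEKDAY.get(now_utc.astimezone(BEIJING).weekday())


# 15:00 UTC on Monday 2026-05-11 is 23:00 Beijing time on the same Monday.
print(schedule_group(datetime(2026, 5, 11, 15, 0, tzinfo=timezone.utc)))  # A-DEEPSEEK
```

Because the trigger fires at 23:00 Beijing time, a run that starts more than an hour late would resolve to the next Beijing day, so a real implementation may prefer to key on the scheduled time rather than the wall clock.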

ISL	OSL	Concurrency	Random Range Ratio
1024	1024	`4, 8, 16, 32, 64`	`0.8`
8192	1024	`4, 8, 16, 32, 64`	`0.8`
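
The "× 10" factor in the Default Case Count column follows from crossing the two ISL/OSL pairs with the five concurrency levels. The sketch below reproduces that arithmetic; the dict field names are hypothetical and do not reflect the actual `param_lists` format.

```python
from itertools import product

# Values from the ISL / OSL / Concurrency table above.
SEQ_LENS = [(1024, 1024), (8192, 1024)]
CONCURRENCIES = [4, 8, 16, 32, 64]
RANDOM_RANGE_RATIO = 0.8


def default_cases() -> list:
    """Enumerate every (ISL, OSL) x concurrency combination for one model."""
    return [
        {"isl": isl, "osl": osl, "concurrency": conc,
         "random_range_ratio": RANDOM_RANGE_RATIO}
        for (isl, osl), conc in product(SEQ_LENS, CONCURRENCIES)
    ]


cases = default_cases()
print(len(cases))  # 10

# Model counts per schedule group, taken from the schedule table.
for group, n_models in {"A-DEEPSEEK": 5, "B-QWEN35": 2, "C-ALL": 7}.items():
    print(group, n_models * len(cases))  # 50, 20, 70
```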

Model	Weight	Runner	Serve Args
DeepSeek-R1-0528 FP8 TP8	`deepseek-ai/DeepSeek-R1-0528`	`atom-mi355-8gpu-aac-runner`	`--trust-remote-code --tensor-parallel-size 8`
DeepSeek-R1-0528 FP8 TP4	`deepseek-ai/DeepSeek-R1-0528`	`atom-mi355-8gpu-aac-runner`	`--trust-remote-code --tensor-parallel-size 4`
DeepSeek-R1-0528-MXFP4 FP4 TP8	`amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4`	`atom-mi355-8gpu-aac-runner`	`--trust-remote-code --tensor-parallel-size 8`
DeepSeek-R1-0528-MXFP4 FP4 TP4	`amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4`	`atom-mi355-8gpu-aac-runner`	`--trust-remote-code --tensor-parallel-size 4`
DeepSeek-R1-0528-MXFP4 FP4 TP8 EP8	`amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4`	`atom-mi355-8gpu-aac-runner`	`--trust-remote-code --tensor-parallel-size 8 --expert-parallel-size 8`
Qwen3.5-397B-A17B-FP8 TP4	`Qwen/Qwen3.5-397B-A17B-FP8`	`atom-mi355-8gpu-aac-runner`	`--tensor-parallel-size 4 --mem-fraction-static 0.9 --reasoning-parser qwen3 --disable-radix-cache`
Qwen3.5-397B-A17B-FP8 TP8	`Qwen/Qwen3.5-397B-A17B-FP8`	`atom-mi355-8gpu-aac-runner`	`--tensor-parallel-size 8 --mem-fraction-static 0.9 --reasoning-parser qwen3 --disable-radix-cache`

zhuyuhua-v · 2026-05-12T03:05:50Z

Since we are adding Qwen3.5-397B-A17B-FP8 TP4/TP8 to the benchmark, how about also adding these benchmark model configs to the nightly accuracy check, to ensure the accuracy of the benchmark cases? We need full case coverage in the nightly run.
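
One way to check the coverage requested here is to compare the benchmark models against the nightly accuracy models. The sketch below keys each entry on (weight, TP) using the tables in the PR description; the `uncovered` helper is hypothetical, and the real JSON schemas of `sglang_benchmark_models.json` and `sglang_models_accuracy.json` are not shown in this PR page, so the data is inlined rather than parsed.

```python
DS_FP8 = "deepseek-ai/DeepSeek-R1-0528"
DS_FP4 = "amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4"
QWEN_397B = "Qwen/Qwen3.5-397B-A17B-FP8"

# (weight, tensor-parallel size) pairs from the Nightly Accuracy table.
ACCURACY = {
    (DS_FP8, 4), (DS_FP8, 8), (DS_FP4, 4), (DS_FP4, 8),
    ("Qwen/Qwen3.5-35B-A3B-FP8", 2), ("Qwen/Qwen3.5-35B-A3B", 2),
    (QWEN_397B, 4), (QWEN_397B, 8),
}

# Pairs from the Nightly Benchmark table. Keyed this way, the TP8 EP8 case
# collapses onto FP4 TP8, since expert parallelism is not part of the key.
BENCHMARK = {
    (DS_FP8, 8), (DS_FP8, 4), (DS_FP4, 8), (DS_FP4, 4),
    (QWEN_397B, 4), (QWEN_397B, 8),
}


def uncovered(benchmark: set, accuracy: set) -> list:
    """Return benchmark configs that have no nightly accuracy counterpart."""
    return sorted(benchmark - accuracy)


print(uncovered(BENCHMARK, ACCURACY))  # []
```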


Copilot

Pull request overview

This PR expands ATOM’s SGLang CI and nightly pipelines to cover additional Qwen3.5 Dense/MoE models, aligning launch/accuracy/benchmark settings and adding scheduled benchmark rotation to reduce nightly load.

Changes:

Add Qwen3.5 model coverage to PR CI GSM8K smoke tests and nightly GSM8K accuracy validation.
Add Qwen3.5-397B benchmark model configs and implement weekday-based benchmark group rotation (DeepSeek / Qwen / All).
Add support for overriding default SGLang server args via SGLANG_DEFAULT_SERVER_ARGS in the shared SGLang test script.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


File	Description
recipes/atom_sglang/Qwen3_5.md	Updates example benchmark + GSM8K commands for Qwen3.5 workflows.
.github/workflows/atom-sglang-test.yaml	Adds Qwen3.5-35B-A3B-FP8 TP2 to PR-level SGLang GSM8K smoke CI.
.github/workflows/atom-sglang-benchmark.yaml	Adds Qwen3.5 benchmark toggles and rotates scheduled benchmark model groups by weekday.
.github/workflows/atom-sglang-accuracy-validation.yaml	Adds Qwen3.5 models to nightly accuracy matrix and to manual dispatch toggles.
.github/scripts/atom_sglang_test.sh	Introduces `SGLANG_DEFAULT_SERVER_ARGS` to control baseline server args per model family.
.github/benchmark/sglang_models_accuracy.json	Adds Qwen3.5 entries for dashboard accuracy thresholds/baselines.
.github/benchmark/sglang_benchmark_models.json	Adds Qwen3.5-397B benchmark models and nightly grouping metadata.

Comments suppressed due to low confidence (1)

recipes/atom_sglang/Qwen3_5.md:92

In the GSM8K lm_eval example, base_url=http://localhost:30000/... doesn’t match the server port shown earlier in this doc (--port 8000). To avoid confusion, the accuracy command should use the same port as the server launch command (or the server command should be updated to match).

lm_eval --model local-completions \
        --model_args model=${model_path},base_url=http://localhost:30000/v1/completions,num_concurrent=65,max_retries=1,tokenized_requests=False,trust_remote_code=True \
        --tasks gsm8k \
        --num_fewshot 3 \
        --trust_remote_code
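
The port mismatch described above can be avoided by deriving the `base_url` from the same port value used to launch the server. This is a minimal sketch; `lm_eval_model_args` and `SERVER_PORT` are hypothetical names, and the key/value pairs simply mirror the `--model_args` string quoted in the command.

```python
def lm_eval_model_args(model_path: str, port: int,
                       num_concurrent: int = 65, max_retries: int = 1) -> str:
    """Build the --model_args string so the URL always uses the server's port."""
    base_url = f"http://localhost:{port}/v1/completions"
    return ",".join([
        f"model={model_path}",
        f"base_url={base_url}",
        f"num_concurrent={num_concurrent}",
        f"max_retries={max_retries}",
        "tokenized_requests=False",
        "trust_remote_code=True",
    ])


# Hypothetical: one port value shared by the server launch and the eval command.
SERVER_PORT = 30000
args = lm_eval_model_args("Qwen/Qwen3.5-35B-A3B-FP8", SERVER_PORT)
print(args)
```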


 python3 -m sglang.bench_serving --backend sglang-oai-chat \
    --model ${model_path} \
    --base-url=http://127.0.0.1:30000 \
-    --max-concurrency 16 \ 
-    --num-prompts "$(( CONC * 5 ))" \ 
+    --max-concurrency 16 \
+    --num-prompts "$(( CONC * 5 ))" \


wanzhenchn requested review from Yuechguo, wuhuikx and zhuyuhua-v May 6, 2026 09:21

wanzhenchn force-pushed the ci/support_qwen3.5 branch 3 times, most recently from ea598c1 to 82c8443 Compare May 7, 2026 02:06

zufayu mentioned this pull request May 8, 2026

[Bug]: Qwen3.5-35B-A3B / 27B BF16 accuracy regression at TP4 / TP8 #719

Open

9 tasks

wanzhenchn force-pushed the ci/support_qwen3.5 branch 2 times, most recently from 24175af to 69f279f Compare May 8, 2026 10:05

wanzhenchn changed the title ~~[ci] add Qwen3.5 Dense/MoE models accuracy validation for atom-plugined sglang~~ [ci] add Qwen3.5 Dense/MoE models accuracy validation and benchmark tests for atom-plugined sglang May 8, 2026

wanzhenchn force-pushed the ci/support_qwen3.5 branch 6 times, most recently from ea782c5 to cff05ee Compare May 12, 2026 03:00

wanzhenchn added 2 commits May 12, 2026 15:22

[ci] add Qwen3.5 Dense/MoE models accuracy validation for atom-plugin…

b3432ed

…ed sglang

[ci][benchmark] add Qwen3.5-397B-A13B-FP8 TP4/TP8 benchmark case on M…

248a91a

…I35X

wanzhenchn force-pushed the ci/support_qwen3.5 branch from cff05ee to 248a91a Compare May 12, 2026 07:22

wanzhenchn and others added 4 commits May 12, 2026 07:26

[doc] fix qwen3.5 recipe for atom_sglang

6afd51f

update aiter whl download flow

5854747

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

Merge branch 'main' into ci/support_qwen3.5

b8acc66

Merge branch 'main' into ci/support_qwen3.5

4bfd7ab

zhuyuhua-v marked this pull request as draft May 13, 2026 07:22

zhuyuhua-v added 2 commits May 13, 2026 05:13

update auto benchmark and add more cases

d39fd50

Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>

Merge branch 'main' into ci/support_qwen3.5

2f62bad

zhuyuhua-v marked this pull request as ready for review May 13, 2026 10:20

Copilot AI review requested due to automatic review settings May 13, 2026 10:20

Copilot started reviewing on behalf of zhuyuhua-v May 13, 2026 10:21

Copilot AI reviewed May 13, 2026

wanzhenchn wants to merge 8 commits into ROCm:main from wanzhenchn:ci/support_qwen3.5

wanzhenchn commented May 6, 2026 • edited by zhuyuhua-v


3 participants
