[ci] add Qwen3.5 Dense/MoE models accuracy validation and benchmark tests for atom-plugined sglang #700

Open: wanzhenchn wants to merge 8 commits into ROCm:main from wanzhenchn:ci/support_qwen3.5

@wanzhenchn (Contributor) commented May 6, 2026

Motivation

  • Add Qwen3.5 coverage to ATOM SGLang CI / nightly accuracy / benchmark flows.
  • Align Qwen3.5 SGLang launch args with validated local commands.
  • Add Qwen3.5-397B-A17B-FP8 TP4 / TP8 benchmark cases.
  • Rotate scheduled benchmark groups to avoid running all benchmark cases every night.
  • Update recipes/atom_sglang/Qwen3_5.md with server, benchmark, and GSM8K commands.

ATOM SGLang CI / Nightly / Benchmark Scope

CI

| Item | Value |
|---|---|
| Workflow | .github/workflows/atom-sglang-test.yaml |
| Trigger | PR to main, non-draft, non-closed |
| Purpose | PR-level SGLang GSM8K accuracy smoke validation |

| Model | Weight | Runner | TP | Threshold |
|---|---|---|---|---|
| DeepSeek-R1-FP8 TP4 | deepseek-ai/DeepSeek-R1-0528 | linux-atom-mi35x-4 | 4 | 0.91 |
| DeepSeek-R1-FP4 TP4 | amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 | linux-atom-mi35x-4 | 4 | 0.91 |
| Qwen3.5-35B-A3B-FP8 TP2 | Qwen/Qwen3.5-35B-A3B-FP8 | linux-atom-mi35x-4 | 2 | 0.76 |

Nightly Accuracy

| Item | Value |
|---|---|
| Workflow | .github/workflows/atom-sglang-accuracy-validation.yaml |
| Trigger | Nightly 18:00 UTC / Beijing 02:00, or manual dispatch |
| Task | gsm8k |
| Metric | results.gsm8k["exact_match,flexible-extract"] |
| Few-shot | 3 |
| LM Eval concurrency | 65 |
| LM Eval retries | 1 |
| SGLang ref | v0.5.10 |

| Model | Weight | Runner | TP | Threshold |
|---|---|---|---|---|
| DeepSeek-R1-FP8 TP4 | deepseek-ai/DeepSeek-R1-0528 | linux-atom-mi35x-4 | 4 | 0.91 |
| DeepSeek-R1-FP8 TP8 | deepseek-ai/DeepSeek-R1-0528 | linux-atom-mi35x-8 | 8 | 0.93 |
| DeepSeek-R1-FP4 TP4 | amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 | linux-atom-mi35x-4 | 4 | 0.91 |
| DeepSeek-R1-FP4 TP8 | amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 | linux-atom-mi35x-8 | 8 | 0.93 |
| Qwen3.5-35B-A3B-FP8 TP2 | Qwen/Qwen3.5-35B-A3B-FP8 | linux-atom-mi35x-4 | 2 | 0.76 |
| Qwen3.5-35B-A3B TP2 | Qwen/Qwen3.5-35B-A3B | linux-atom-mi35x-4 | 2 | 0.83 |
| Qwen3.5-397B-A17B-FP8 TP4 | Qwen/Qwen3.5-397B-A17B-FP8 | linux-atom-mi35x-4 | 4 | 0.83 |
| Qwen3.5-397B-A17B-FP8 TP8 | Qwen/Qwen3.5-397B-A17B-FP8 | linux-atom-mi35x-8 | 8 | 0.83 |
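
Each row's threshold is compared against the GSM8K `exact_match,flexible-extract` score. A minimal sketch of such a gate follows. The helper name `check_threshold` is hypothetical, and treating a score equal to the threshold as a pass is an assumption, not something confirmed from the workflow. Extracting the score from the lm_eval results file is left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exit 0 when the score meets the threshold, 1 otherwise.
# Assumption: "score >= threshold" counts as a pass.
check_threshold() {
  local score="$1" threshold="$2"
  # awk does the floating-point comparison; "+ 0" forces numeric context.
  awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s + 0 >= t + 0) }'
}

# Example: an illustrative score checked against the 0.83 threshold.
if check_threshold "0.84" "0.83"; then
  echo "PASS"
else
  echo "FAIL"
fi
```

Using `awk` avoids a dependency on `bc` and keeps the comparison in one place, so the per-model threshold can be passed straight from the matrix entry.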

Server Args

| Model Family | Default Server Args | Extra Args | Env |
|---|---|---|---|
| DeepSeek | --trust-remote-code --kv-cache-dtype fp8_e4m3 --mem-fraction-static 0.8 --page-size 1 --disable-radix-cache | --tensor-parallel-size <tp>; EP case adds --expert-parallel-size 8 | AITER_QUICK_REDUCE_QUANTIZATION=INT4, SGLANG_AITER_FP8_PREFILL_ATTN=0, SGLANG_USE_AITER=1, ATOM_ENABLE_DS_QKNORM_QUANT_FUSION=1 |
| Qwen3.5 | empty via SGLANG_DEFAULT_SERVER_ARGS= | --tensor-parallel-size <tp> --mem-fraction-static 0.9 --reasoning-parser qwen3 --disable-radix-cache | SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models, ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=0 |
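
The Qwen3.5 row relies on `SGLANG_DEFAULT_SERVER_ARGS` being set but empty. A minimal sketch of how a script could tell "unset" apart from "set to empty" is below. This is an assumption about the mechanism, not the actual contents of atom_sglang_test.sh, and `BUILTIN_DEFAULT_ARGS` and `resolve_default_args` are hypothetical names.

```shell
#!/usr/bin/env bash
# Assumed built-in defaults, copied from the DeepSeek "Default Server Args" column.
BUILTIN_DEFAULT_ARGS="--trust-remote-code --kv-cache-dtype fp8_e4m3 --mem-fraction-static 0.8 --page-size 1 --disable-radix-cache"

# Hypothetical helper: print the default args to use for this run.
resolve_default_args() {
  # "${VAR-default}" (no colon) falls back only when VAR is unset,
  # so an explicitly empty value is kept as empty.
  printf '%s' "${SGLANG_DEFAULT_SERVER_ARGS-$BUILTIN_DEFAULT_ARGS}"
}

unset SGLANG_DEFAULT_SERVER_ARGS
echo "unset: [$(resolve_default_args)]"

SGLANG_DEFAULT_SERVER_ARGS=""
echo "empty: [$(resolve_default_args)]"
```

With `${VAR:-default}` instead, the empty value would be replaced by the built-in defaults, and the Qwen3.5 launch would pick up the DeepSeek-oriented flags.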

Nightly Benchmark

| Item | Value |
|---|---|
| Workflow | .github/workflows/atom-sglang-benchmark.yaml |
| Trigger | Weekday 15:00 UTC / Beijing 23:00, or manual dispatch |
| Manual selection | Checkbox-based model selection |
| Param override | param_lists |
| Dashboard upload | publish_to_dashboard |

| Schedule Group | Beijing Day | Models | Default Case Count |
|---|---|---|---|
| A-DEEPSEEK | Monday / Wednesday | 5 DeepSeek benchmark models | 5 × 10 = 50 |
| B-QWEN35 | Tuesday / Thursday | 2 Qwen3.5-397B benchmark models | 2 × 10 = 20 |
| C-ALL | Friday | All benchmark models | 7 × 10 = 70 |
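
The rotation can be sketched as a lookup from the Beijing weekday to a schedule group. The function name `group_for_weekday` is hypothetical and the workflow's real selection logic may differ. Because the schedule fires at 15:00 UTC, which is 23:00 Beijing time on the same calendar day, the UTC and Beijing weekdays coincide for scheduled runs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an ISO weekday (1 = Monday ... 7 = Sunday)
# to the schedule group named in the table above.
group_for_weekday() {
  case "$1" in
    1|3) echo "A-DEEPSEEK" ;;  # Monday / Wednesday
    2|4) echo "B-QWEN35" ;;    # Tuesday / Thursday
    5)   echo "C-ALL" ;;       # Friday
    *)   echo "" ;;            # weekend: no scheduled group
  esac
}

# date +%u prints the ISO weekday; TZ pins the calculation to Beijing time.
today="$(TZ=Asia/Shanghai date +%u)"
echo "weekday=${today} group=$(group_for_weekday "$today")"
```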

| ISL | OSL | Concurrency | Random Range Ratio |
|---|---|---|---|
| 1024 | 1024 | 4, 8, 16, 32, 64 | 0.8 |
| 8192 | 1024 | 4, 8, 16, 32, 64 | 0.8 |
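
The "× 10" in the default case counts follows from this table: 2 ISL/OSL shapes times 5 concurrency levels gives 10 cases per model. A small sketch that enumerates them is below. `list_cases` is a hypothetical name and the output layout is illustrative only, not the format of `param_lists`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one line per default benchmark case.
list_cases() {
  local shapes=("1024 1024" "8192 1024")  # "ISL OSL" pairs from the table
  local concs=(4 8 16 32 64)              # concurrency levels from the table
  local shape conc
  for shape in "${shapes[@]}"; do
    for conc in "${concs[@]}"; do
      # Fields: ISL OSL concurrency random-range-ratio
      echo "$shape $conc 0.8"
    done
  done
}

list_cases
echo "cases per model: $(list_cases | wc -l | tr -d ' ')"
```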

| Model | Weight | Runner | Serve Args |
|---|---|---|---|
| DeepSeek-R1-0528 FP8 TP8 | deepseek-ai/DeepSeek-R1-0528 | atom-mi355-8gpu-aac-runner | --trust-remote-code --tensor-parallel-size 8 |
| DeepSeek-R1-0528 FP8 TP4 | deepseek-ai/DeepSeek-R1-0528 | atom-mi355-8gpu-aac-runner | --trust-remote-code --tensor-parallel-size 4 |
| DeepSeek-R1-0528-MXFP4 FP4 TP8 | amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 | atom-mi355-8gpu-aac-runner | --trust-remote-code --tensor-parallel-size 8 |
| DeepSeek-R1-0528-MXFP4 FP4 TP4 | amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 | atom-mi355-8gpu-aac-runner | --trust-remote-code --tensor-parallel-size 4 |
| DeepSeek-R1-0528-MXFP4 FP4 TP8 EP8 | amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 | atom-mi355-8gpu-aac-runner | --trust-remote-code --tensor-parallel-size 8 --expert-parallel-size 8 |
| Qwen3.5-397B-A17B-FP8 TP4 | Qwen/Qwen3.5-397B-A17B-FP8 | atom-mi355-8gpu-aac-runner | --tensor-parallel-size 4 --mem-fraction-static 0.9 --reasoning-parser qwen3 --disable-radix-cache |
| Qwen3.5-397B-A17B-FP8 TP8 | Qwen/Qwen3.5-397B-A17B-FP8 | atom-mi355-8gpu-aac-runner | --tensor-parallel-size 8 --mem-fraction-static 0.9 --reasoning-parser qwen3 --disable-radix-cache |

wanzhenchn force-pushed the ci/support_qwen3.5 branch 3 times, most recently from ea598c1 to 82c8443 (May 7, 2026 02:06)
wanzhenchn force-pushed the ci/support_qwen3.5 branch 2 times, most recently from 24175af to 69f279f (May 8, 2026 10:05)
wanzhenchn changed the title from "[ci] add Qwen3.5 Dense/MoE models accuracy validation for atom-plugined sglang" to "[ci] add Qwen3.5 Dense/MoE models accuracy validation and benchmark tests for atom-plugined sglang" (May 8, 2026)
wanzhenchn force-pushed the ci/support_qwen3.5 branch 6 times, most recently from ea782c5 to cff05ee (May 12, 2026 03:00)
@zhuyuhua-v (Collaborator) commented

Since we are adding Qwen3.5-397B-A17B-FP8 TP4/TP8 for benchmarking, how about also adding these benchmark model configs to the nightly accuracy check, so that the accuracy of the benchmark cases is verified? We need to ensure full case coverage in the nightly run.

wanzhenchn force-pushed the ci/support_qwen3.5 branch from cff05ee to 248a91a (May 12, 2026 07:22)
zhuyuhua-v marked this pull request as draft (May 13, 2026 07:22)
zhuyuhua-v marked this pull request as ready for review (May 13, 2026 10:20)
Copilot AI review requested due to automatic review settings May 13, 2026 10:20
Copilot AI (Contributor) left a comment

Pull request overview

This PR expands ATOM’s SGLang CI and nightly pipelines to cover additional Qwen3.5 Dense/MoE models, aligning launch/accuracy/benchmark settings and adding scheduled benchmark rotation to reduce nightly load.

Changes:

  • Add Qwen3.5 model coverage to PR CI GSM8K smoke tests and nightly GSM8K accuracy validation.
  • Add Qwen3.5-397B benchmark model configs and implement weekday-based benchmark group rotation (DeepSeek / Qwen / All).
  • Add support for overriding default SGLang server args via SGLANG_DEFAULT_SERVER_ARGS in the shared SGLang test script.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| recipes/atom_sglang/Qwen3_5.md | Updates example benchmark + GSM8K commands for Qwen3.5 workflows. |
| .github/workflows/atom-sglang-test.yaml | Adds Qwen3.5-35B-A3B-FP8 TP2 to PR-level SGLang GSM8K smoke CI. |
| .github/workflows/atom-sglang-benchmark.yaml | Adds Qwen3.5 benchmark toggles and rotates scheduled benchmark model groups by weekday. |
| .github/workflows/atom-sglang-accuracy-validation.yaml | Adds Qwen3.5 models to nightly accuracy matrix and to manual dispatch toggles. |
| .github/scripts/atom_sglang_test.sh | Introduces SGLANG_DEFAULT_SERVER_ARGS to control baseline server args per model family. |
| .github/benchmark/sglang_models_accuracy.json | Adds Qwen3.5 entries for dashboard accuracy thresholds/baselines. |
| .github/benchmark/sglang_benchmark_models.json | Adds Qwen3.5-397B benchmark models and nightly grouping metadata. |
Comments suppressed due to low confidence (1)

recipes/atom_sglang/Qwen3_5.md:92

  • In the GSM8K lm_eval example, base_url=http://localhost:30000/... doesn’t match the server port shown earlier in this doc (--port 8000). To avoid confusion, the accuracy command should use the same port as the server launch command (or the server command should be updated to match).
lm_eval --model local-completions \
        --model_args model=${model_path},base_url=http://localhost:30000/v1/completions,num_concurrent=65,max_retries=1,tokenized_requests=False,trust_remote_code=True \
        --tasks gsm8k \
        --num_fewshot 3 \
        --trust_remote_code
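
One way to keep the server and client commands from drifting apart is to derive both from a single port variable. A minimal sketch is below. `SGLANG_PORT` and `completions_url` are hypothetical names, not taken from the recipe, and the default of 30000 is only the value used in the lm_eval example above.

```shell
#!/usr/bin/env bash
# Hypothetical shared setting; override it to match the server launch command.
SGLANG_PORT="${SGLANG_PORT:-30000}"

# Hypothetical helper: build the completions endpoint for a given port.
completions_url() {
  printf 'http://localhost:%s/v1/completions' "$1"
}

# Both commands then read the same value.
echo "server flag:  --port ${SGLANG_PORT}"
echo "lm_eval args: base_url=$(completions_url "$SGLANG_PORT")"
```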


Comment on lines 55 to +59
python3 -m sglang.bench_serving --backend sglang-oai-chat \
--model ${model_path} \
--base-url=http://127.0.0.1:30000 \
--max-concurrency 16 \
--num-prompts "$(( CONC * 5 ))" \