[NV] Update H100 Qwen3.5 SGLang agg config #1544
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
21edeff to 394e886
|
Claude finished @anish-shanbhag's task in 3m 54s.
Review of PR #1544
LGTM - no blocking issues found.
The PR cleanly restructures the H100 Qwen3.5 SGLang aggregate recipe to sweep across three parallelism tiers (TP8/EP1 for low concurrency, TP8/EP8 for mid, TP8/EP8+DP-attention for high). Verified:
|
LGTM, straightforward benchmark recipe tuning.
Overview
This PR tunes the Qwen3.5-397B-A17B-FP8 H100 SGLang aggregate recipe for both 1k/1k and 8k/1k sweeps. It replaces a single search-space row with three tiered rows (TP8/EP1 conc 1-8, TP8/EP8 conc 16-64, TP8/EP8 + DP-attention conc 128-256), updates the benchmark shell script to thread DP_ATTENTION/EP_SIZE through optional CLI args, adds per-concurrency --scheduler-recv-interval tuning, and bumps max-running-requests/chunked-prefill-size/mem-fraction-static back up from the conservative H200-derived defaults. A changelog entry is included.
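The "optional CLI args" pattern described above can be sketched roughly as follows. This is a minimal illustration, not the recipe's actual code: `EP_SIZE` and `DP_ATTENTION` are the variables named in this PR, but the helper name `build_parallel_args` is hypothetical, the flag spellings `--ep-size`, `--enable-dp-attention`, and `--dp-size` are assumed to match SGLang's CLI, and setting the DP size equal to TP (8) is an assumption.

```shell
# Hypothetical helper: prints optional parallelism flags, one token per line.
# Arguments: EP size (default 1), DP-attention toggle (default false),
# TP size (default 8, as in the PR's search space).
build_parallel_args() {
  ep_size="${1:-1}"
  dp_attention="${2:-false}"
  tp="${3:-8}"

  # Expert parallelism is only requested when EP_SIZE > 1.
  if [ "$ep_size" -gt 1 ]; then
    printf '%s\n' "--ep-size" "$ep_size"
  fi

  # Assumed flag spellings; DP size is assumed to equal TP here.
  if [ "$dp_attention" = "true" ]; then
    printf '%s\n' "--enable-dp-attention" "--dp-size" "$tp"
  fi
}
```

With `EP_SIZE=1` and `DP_ATTENTION=false` the helper prints nothing, so the launch command falls back to plain tensor parallelism; the emitted tokens would be appended to the server launch arguments.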
Security risks
None. This is a benchmark configuration and shell script change in a non-production, single-node benchmark recipe. No auth, crypto, networking, or input-handling surface area is touched.
Level of scrutiny
Low. This is benchmark recipe tuning isolated to one model/hardware combo. The script changes follow the established DP_ATTENTION env-var pattern used by other recipes (e.g. dsv4_fp4_b200_vllm.sh), the dp-attn search-space key is already supported by utils/matrix_logic/generate_sweep_configs.py, and the YAML edit is local to the qwen3.5-fp8-h100-sglang block.
Other factors
The new case "$CONC" covers every concurrency value the search-space will generate for the non-DP branches (1, 2, 4, 8, 16, 32, 64) and explicitly errors on unsupported values, so silent misconfiguration is unlikely. No bugs were flagged by the bug hunting system, and the prior commit 394e886 on main is on this same tuning track.
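A `case "$CONC"` block that fails loudly on unsupported values could look roughly like the sketch below. The function name `recv_interval_for_conc` is hypothetical, and the interval values (1 and 10) are placeholders chosen for illustration; the tuned values live in the recipe itself and are not reproduced here.

```shell
# Hypothetical mapping from concurrency to a --scheduler-recv-interval value.
# The numbers 1 and 10 are placeholders, NOT the tuned values from the PR.
recv_interval_for_conc() {
  case "$1" in
    1|2|4|8)
      echo 1
      ;;
    16|32|64)
      echo 10
      ;;
    *)
      # Refuse to guess: an unsupported concurrency is a hard error.
      echo "unsupported concurrency: $1" >&2
      return 1
      ;;
  esac
}
```

A caller would typically abort the benchmark when the function returns non-zero, which is what makes silent misconfiguration unlikely.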
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26198417664 |
394e886 to 4587b6e
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26610973668 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26611114296 |
|
/sweep test-config --config-files .github/configs/nvidia-master.yaml --runner-node-filter h100-cw --config-keys qwen3.5-fp8-h100-sglang --conc 4 --seq-lens 8k1k --no-evals |
|
@anish-shanbhag Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26614204211 |
213a1d2 to c171323
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26916564786 |
c171323 to b2e18a6
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26918589379 |
    --tokenizer-worker-num 6 \
    --mamba-ssm-dtype bfloat16 \
    --disable-radix-cache \
    --enable-symm-mem \
Thanks for the contribution! LGTM, except that this isn't in the SGLang docs yet.
@functionstackx here is the PR: sgl-project/sglang#27296
|
/reuse-sweep-run |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26975382615 |
Updates Qwen3.5-397B-A17B-FP8 H100 SGLang agg recipes with tuned configs for 1k/1k and 8k/1k.
Performance comparison
Source rows:
results_bmk artifacts from baseline 2026-05-18 CI run vs updated 2026-06-04 CI run. tok/s/user is median_intvty.
Apples-to-apples note: deltas compare non-MTP Qwen3.5 FP8 H100 SGLang rows only. The May artifact has non-MTP TP8/EP8 rows at conc 4, 8, 16, and 32 for both workloads; new-only concurrencies are marked n/a on the May side.
1k/1k matched geomean: +17.4% tok/s/gpu, +22.3% tok/s/user; 8k/1k matched geomean: +9.5% tok/s/gpu, +16.7% tok/s/user.
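For readers unfamiliar with the term, a "matched geomean" is presumably the geometric mean of the updated/baseline ratios over only those concurrencies present in both runs. The sketch below shows that calculation under this assumption; the function name `matched_geomean_pct` and the input layout (one "baseline updated" pair per line) are hypothetical and not taken from the repository's tooling.

```shell
# Hypothetical helper: reads "baseline updated" pairs on stdin, one matched
# concurrency per line, and prints the geometric-mean change in percent.
# Rows with a non-positive value are skipped, since a ratio is undefined there.
matched_geomean_pct() {
  awk '
    $1 > 0 && $2 > 0 { sum += log($2 / $1); n++ }
    END {
      if (n == 0) { print "n/a"; exit }
      printf "%+.1f%%\n", (exp(sum / n) - 1) * 100
    }'
}
```

It could be used as `printf '100 110\n200 230\n' | matched_geomean_pct`, with the numbers here being made-up throughput values rather than results from either CI run.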
Note
Low Risk
Benchmark and CI YAML tuning only; no production serving, auth, or application runtime paths.
Overview
Retunes the Qwen3.5-397B-A17B-FP8 aggregate SGLang benchmark on H100 for the 1k/1k and 8k/1k fixed-seq sweeps.
CI/config (qwen3.5-fp8-h100-sglang): replaces a single TP8/EP8, conc 4–32 grid with TP8/EP1, conc 1–8 and TP8/EP8, conc 16–256 for both scenarios.
Launch recipe (qwen3.5_fp8_h100.sh): drops H100-vs-H200 memory caveats; applies expert parallel only when EP_SIZE > 1; maps conc to --scheduler-recv-interval (with hard fail on unsupported conc); raises max-running-requests (64→256), chunked-prefill-size (8192→16384), mem-fraction-static (0.75→0.8); adds --enable-symm-mem and passes scheduler args into sglang.launch_server.
Documents the change in perf-changelog.yaml.
Reviewed by Cursor Bugbot for commit 50bf385. Bugbot is set up for automated code reviews on this repo.