[NV] Update H100 Qwen3.5 SGLang agg config by anish-shanbhag · Pull Request #1544 · SemiAnalysisAI/InferenceX

anish-shanbhag · 2026-05-21T00:32:20Z

Updates Qwen3.5-397B-A17B-FP8 H100 SGLang agg recipes with tuned configs for 1k/1k and 8k/1k.

Performance comparison

Source rows: results_bmk artifacts from baseline 2026-05-18 CI run vs updated 2026-06-04 CI run. tok/s/user is median_intvty.

Apples-to-apples note: deltas compare non-MTP Qwen3.5 FP8 H100 SGLang rows only. The May artifact has non-MTP TP8/EP8 rows at conc 4, 8, 16, and 32 for both workloads; new-only concurrencies are marked n/a on the May side.
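
To make the matched-row rule concrete, here is a minimal sketch that splits the new run's concurrencies into matched and new-only sets. The two lists are copied from the table in this PR; the helper names (`filter_concs`, `matched_concs`, `new_only_concs`) are hypothetical and not part of the repo.

```shell
# Concurrency values per run, as listed in the comparison table.
may_concs="4 8 16 32"
new_concs="1 2 4 8 16 32 64 128 256"

# Print the entries of list $1 whose membership in list $2 equals $3 (yes/no).
filter_concs() {
  out=""
  for c in $1; do
    case " $2 " in
      *" $c "*) found=yes ;;
      *)        found=no ;;
    esac
    if [ "$found" = "$3" ]; then out="$out $c"; fi
  done
  echo "${out# }"
}

# Rows that exist in both runs and can be compared directly.
matched_concs()  { filter_concs "$new_concs" "$may_concs" yes; }
# Rows only the new run has (shown as n/a on the May side).
new_only_concs() { filter_concs "$new_concs" "$may_concs" no; }

echo "matched:  $(matched_concs)"
echo "new-only: $(new_only_concs)"
```

Only the matched set feeds the delta columns and the geomeans; the new-only set is reported without a baseline.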

| ISL/OSL | Conc | May non-MTP tok/s/gpu | New non-MTP tok/s/gpu | Delta tok/s/gpu | May non-MTP tok/s/user | New non-MTP tok/s/user | Delta tok/s/user |
|---|---|---|---|---|---|---|---|
| 1k/1k | 1 | n/a | 38.9 | n/a | n/a | 165.7 | n/a |
| 1k/1k | 2 | n/a | 68.3 | n/a | n/a | 144.7 | n/a |
| 1k/1k | 4 | 100.3 | 113.0 | +12.6% | 106.3 | 119.9 | +12.7% |
| 1k/1k | 8 | 159.4 | 200.0 | +25.5% | 83.0 | 108.1 | +30.3% |
| 1k/1k | 16 | 245.2 | 255.8 | +4.4% | 62.7 | 66.5 | +6.0% |
| 1k/1k | 32 | 352.3 | 453.4 | +28.7% | 44.9 | 64.6 | +43.8% |
| 1k/1k | 64 | n/a | 734.4 | n/a | n/a | 52.2 | n/a |
| 1k/1k | 128 | n/a | 1055.5 | n/a | n/a | 39.2 | n/a |
| 1k/1k | 256 | n/a | 1316.4 | n/a | n/a | 28.3 | n/a |
| 8k/1k | 1 | n/a | 170.2 | n/a | n/a | 163.2 | n/a |
| 8k/1k | 2 | n/a | 297.5 | n/a | n/a | 142.1 | n/a |
| 8k/1k | 4 | 440.2 | 484.5 | +10.1% | 104.2 | 114.7 | +10.1% |
| 8k/1k | 8 | 676.7 | 737.5 | +9.0% | 80.6 | 89.4 | +10.9% |
| 8k/1k | 16 | 978.7 | 1030.5 | +5.3% | 56.5 | 60.4 | +6.9% |
| 8k/1k | 32 | 1338.6 | 1526.3 | +14.0% | 38.6 | 54.9 | +42.1% |
| 8k/1k | 64 | n/a | 2065.1 | n/a | n/a | 38.7 | n/a |
| 8k/1k | 128 | n/a | 2338.1 | n/a | n/a | 23.6 | n/a |
| 8k/1k | 256 | n/a | 2506.1 | n/a | n/a | 10.5 | n/a |

1k/1k matched geomean: +17.4% tok/s/gpu, +22.3% tok/s/user; 8k/1k matched geomean: +9.5% tok/s/gpu, +16.7% tok/s/user.
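
For anyone re-deriving the summary line, the matched geomean can be sketched as the geometric mean of the new/May ratios over the four matched concurrencies. The snippet below uses the rounded table values, so the last digit can differ slightly from figures computed on the raw artifacts; `geomean_delta` is a hypothetical helper name, and `awk` is assumed to be available.

```shell
# "may new" pairs for conc 4, 8, 16, 32, copied from the tok/s/gpu columns.
gpu_1k1k="100.3 113.0
159.4 200.0
245.2 255.8
352.3 453.4"

gpu_8k1k="440.2 484.5
676.7 737.5
978.7 1030.5
1338.6 1526.3"

# Geometric mean of new/may ratios, printed as a signed percent delta.
geomean_delta() {
  printf '%s\n' "$1" | awk '
    NF == 2 { sum += log($2 / $1); n++ }
    END     { printf "%+.1f%%\n", (exp(sum / n) - 1) * 100 }'
}

echo "1k/1k tok/s/gpu: $(geomean_delta "$gpu_1k1k")"
echo "8k/1k tok/s/gpu: $(geomean_delta "$gpu_8k1k")"
```

Averaging the logs rather than the percentages keeps one large gain (such as conc 32) from dominating the summary.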

Note

Low Risk
Benchmark and CI YAML tuning only; no production serving, auth, or application runtime paths.

Overview
Retunes the aggregate SGLang benchmark for Qwen3.5-397B-A17B-FP8 on H100 for the 1k/1k and 8k/1k fixed-sequence-length sweeps.

CI/config (qwen3.5-fp8-h100-sglang): replaces a single TP8/EP8, conc 4–32 grid with TP8/EP1, conc 1–8 and TP8/EP8, conc 16–256 for both scenarios.
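
As a rough illustration of the new grid, the sketch below enumerates the benchmark points the two tiers would produce. The tier boundaries come from the summary above; the power-of-two concurrency steps are read off the results table, and the output format and the `sweep_grid` name are made up for this example (the real grid lives in the YAML config).

```shell
# Emit one row per benchmark point: scenario, parallelism, concurrency.
sweep_grid() {
  for seq in 1k1k 8k1k; do
    # Low-concurrency tier: tensor parallel only (EP1).
    for conc in 1 2 4 8; do
      echo "$seq tp=8 ep=1 conc=$conc"
    done
    # Higher-concurrency tier: tensor plus expert parallel (EP8).
    for conc in 16 32 64 128 256; do
      echo "$seq tp=8 ep=8 conc=$conc"
    done
  done
}

sweep_grid
```

This yields nine points per scenario, compared with four (conc 4, 8, 16, 32) in the previous single-tier grid.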

Launch recipe (qwen3.5_fp8_h100.sh): drops H100-vs-H200 memory caveats; applies expert parallel only when EP_SIZE > 1; maps conc to --scheduler-recv-interval (with hard fail on unsupported conc); raises max-running-requests (64→256), chunked-prefill-size (8192→16384), mem-fraction-static (0.75→0.8); adds --enable-symm-mem and passes scheduler args into sglang.launch_server.
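
A minimal sketch of how the tuned values and the conditional expert-parallel argument might be assembled is shown below. The numeric values and `--enable-symm-mem` come from the summary above; the `--ep-size` spelling and the `build_server_args` helper are assumptions for illustration, not a copy of the actual script.

```shell
# Build the tuned server arguments; $1 is the expert-parallel size.
build_server_args() {
  ep_size="${1:-1}"
  args="--max-running-requests 256 --chunked-prefill-size 16384"
  args="$args --mem-fraction-static 0.8 --enable-symm-mem"
  # Only request expert parallelism when the size is greater than 1.
  if [ "$ep_size" -gt 1 ]; then
    args="$args --ep-size $ep_size"   # flag spelling is an assumption
  fi
  echo "$args"
}

echo "EP1: $(build_server_args 1)"
echo "EP8: $(build_server_args 8)"
```

Leaving the flag out entirely for EP1, instead of passing a size of 1, keeps the TP-only launch line identical to a plain tensor-parallel run.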

Documents the change in perf-changelog.yaml.

Reviewed by Cursor Bugbot for commit 50bf385. Bugbot is set up for automated code reviews on this repo.

github-actions · 2026-05-21T00:32:29Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that all GitHub Actions jobs fully pass after merging. Failures are often just flakes, and simply re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring that the re-run passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from their respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude · 2026-05-21T00:39:09Z

Claude finished @anish-shanbhag's task in 3m 54s.

Review of PR #1544

Read changed files and gather context
Validate master config and perf-changelog
Review benchmark script changes
Post review summary

LGTM - no blocking issues found

The PR cleanly restructures the H100 Qwen3.5 SGLang aggregate recipe to sweep across three parallelism tiers (TP8/EP1 for low concurrency, TP8/EP8 for mid, TP8/EP8+DP-attention for high). Verified:

nvidia-master.yaml modified and perf-changelog.yaml updated with entry appended at the end ✅
EP is conditionally enabled via EP_SIZE > 1 (correct SGLang pattern) ✅
DP-attention handling (--dp-size "$TP" --enable-dp-attention) is consistent with other scripts (e.g., dsv4_fp4_b300_sglang_mtp.sh) ✅
Server launch command properly formatted with line continuations ✅
Container image lmsysorg/sglang:v0.5.12-cu130 is publicly accessible on Docker Hub ✅
case statement for scheduler-recv-interval covers all concurrency values reachable from the config search-space, with a safe * fallback that exits on unexpected values ✅

claude

LGTM, straightforward benchmark recipe tuning.

Extended reasoning...

Overview

This PR tunes the Qwen3.5-397B-A17B-FP8 H100 SGLang aggregate recipe for both 1k/1k and 8k/1k sweeps. It replaces a single search-space row with three tiered rows (TP8/EP1 conc 1-8, TP8/EP8 conc 16-64, TP8/EP8 + DP-attention conc 128-256), updates the benchmark shell script to thread DP_ATTENTION/EP_SIZE through optional CLI args, adds per-concurrency --scheduler-recv-interval tuning, and bumps max-running-requests/chunked-prefill-size/mem-fraction-static back up from the conservative H200-derived defaults. A changelog entry is included.

Security risks

None. This is a benchmark configuration and shell script change in a non-production, single-node benchmark recipe. No auth, crypto, networking, or input-handling surface area is touched.

Level of scrutiny

Low. This is benchmark recipe tuning isolated to one model/hardware combo. The script changes follow the established DP_ATTENTION env-var pattern used by other recipes (e.g. dsv4_fp4_b200_vllm.sh), the dp-attn search-space key is already supported by utils/matrix_logic/generate_sweep_configs.py, and the YAML edit is local to the qwen3.5-fp8-h100-sglang block.

Other factors

The new case "$CONC" covers every concurrency value the search-space will generate for the non-DP branches (1, 2, 4, 8, 16, 32, 64) and explicitly errors on unsupported values, so silent misconfiguration is unlikely. No bugs were flagged by the bug hunting system, and the prior commit 394e886 on main is on this same tuning track.
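
The coverage argument above can be sketched as follows. The structure (a `case` on the concurrency with a failing `*` fallback) follows the review text, but the interval numbers are placeholders and not the PR's tuned values, and `recv_interval_for_conc` is a hypothetical function name.

```shell
# Map a concurrency to a --scheduler-recv-interval value.
recv_interval_for_conc() {
  case "$1" in
    1|2|4|8)   echo 1 ;;     # placeholder value, not from the PR
    16|32|64)  echo 10 ;;    # placeholder value, not from the PR
    *)
      echo "unsupported CONC=$1" >&2
      return 1               # a standalone script would exit 1 here
      ;;
  esac
}

# Every concurrency listed in the review should resolve to an interval.
for conc in 1 2 4 8 16 32 64; do
  echo "conc=$conc --scheduler-recv-interval $(recv_interval_for_conc "$conc")"
done
```

Failing loudly in the fallback branch means a new concurrency added to the search space surfaces as a CI error instead of silently running with an untuned interval.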

github-actions · 2026-05-21T02:36:28Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26198417664
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26198417664

github-actions · 2026-05-29T00:48:31Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26610973668
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26610973668

github-actions · 2026-05-29T01:36:48Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26611114296
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26611114296

github-actions · 2026-05-29T02:16:40Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26611114296
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26611114296

anish-shanbhag · 2026-05-29T02:23:59Z

/sweep test-config --config-files .github/configs/nvidia-master.yaml --runner-node-filter h100-cw --config-keys qwen3.5-fp8-h100-sglang --conc 4 --seq-lens 8k1k --no-evals

github-actions · 2026-05-29T02:24:07Z

@anish-shanbhag Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26614204211
Command: test-config --config-files .github/configs/nvidia-master.yaml --runner-node-filter h100-cw --config-keys qwen3.5-fp8-h100-sglang --conc 4 --seq-lens 8k1k --no-evals
Pinned ref: 213a1d2
Approval: not required (trusted collaborator).

github-actions · 2026-05-29T03:15:06Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26611114296
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26611114296

github-actions · 2026-05-29T03:27:35Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26611114296
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26611114296

github-actions · 2026-05-29T03:37:29Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26611114296
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26611114296

github-actions · 2026-06-03T21:59:51Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26611114296
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26611114296

github-actions · 2026-06-03T23:02:32Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26916564786
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26916564786

github-actions · 2026-06-03T23:45:46Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26918589379
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26918589379

github-actions · 2026-06-04T00:07:22Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26918589379
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26918589379

functionstackx · 2026-06-04T04:29:19Z

  --tokenizer-worker-num 6 \
  --mamba-ssm-dtype bfloat16 \
  --disable-radix-cache \
+  --enable-symm-mem \


Thanks for the contribution! LGTM, except that this flag isn't in the SGLang docs yet.

Ankur-singh · 2026-06-04

@functionstackx here is the PR: sgl-project/sglang#27296

functionstackx · 2026-06-04T19:44:35Z

/reuse-sweep-run

github-actions · 2026-06-04T19:45:22Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26975382615
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26975382615

github-project-automation Bot added this to InferenceMAX Board May 21, 2026

anish-shanbhag changed the title ~~Tune H100 Qwen SGLang Pareto recipe~~ [NV] Update H100 Qwen3.5 SGLang agg config May 21, 2026

anish-shanbhag force-pushed the codex/qwen35-h100-sglang-pareto-upstream branch 2 times, most recently from 21edeff to 394e886 Compare May 21, 2026 00:35

anish-shanbhag marked this pull request as ready for review May 21, 2026 00:38

anish-shanbhag requested a review from a team May 21, 2026 00:38

anish-shanbhag requested review from jgangani and kedarpotdar-nv as code owners May 21, 2026 00:38

anish-shanbhag added full-sweep-enabled NVIDIA labels May 21, 2026

claude Bot reviewed May 21, 2026

anish-shanbhag force-pushed the codex/qwen35-h100-sglang-pareto-upstream branch from 394e886 to 4587b6e Compare May 29, 2026 00:42

anish-shanbhag force-pushed the codex/qwen35-h100-sglang-pareto-upstream branch from 213a1d2 to c171323 Compare June 3, 2026 22:19

anish-shanbhag added 3 commits June 3, 2026 16:04

Tune H100 Qwen SGLang Pareto recipe

a248e16

Use TEP for Qwen H100 high concurrency

ec004e5

Simplify Qwen H100 TEP sweep config

b2e18a6

anish-shanbhag force-pushed the codex/qwen35-h100-sglang-pareto-upstream branch from c171323 to b2e18a6 Compare June 3, 2026 23:04

kedarpotdar-nv approved these changes Jun 4, 2026

functionstackx requested changes Jun 4, 2026

Merge branch 'main' into codex/qwen35-h100-sglang-pareto-upstream

50bf385

functionstackx merged commit ea4f575 into main Jun 4, 2026
4 of 6 checks passed

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 4, 2026

functionstackx deleted the codex/qwen35-h100-sglang-pareto-upstream branch June 4, 2026 19:45

claude Bot mentioned this pull request Jun 4, 2026

minimaxm2.5-fp8-h200-vllm: switch 8k/1k attention backend to FLASH_ATTN #1668

Merged
