Commit 4007906

ch-wan, claude, Fridge003, and Oseltamivir authored
(radixark sgl maintainer submission): Add DSV4 FP4 GB300 dynamo-sglang MTP disagg benchmarks (#1297)
* add mtp configs

* Add sbatch_directives to MTP recipes (root-cause fix)

  Without `cpus-per-task: 144` and `mem: 0`, slurm hands out 1 CPU and ~4 MB per task, and the dynamo cold source build (~500 rust crates) is OOM-killed before any worker comes up. Manifests as `Sweep failed (exit code: 137)` ~30 s after orchestrator start.

  Mirrors the block already present in the working main 8k1k recipes (e.g. disagg-gb300-1p1d-tp4-tp4-2-c1.yaml).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Change deepgemm flags

* Move MTP recipes up to 8k1k/ with -mtp filename suffix

  Mirrors the convention used elsewhere in the repo: per-config files at the same depth as their non-MTP siblings, distinguished only by the -mtp suffix. CONFIG_FILE references in nvidia-master.yaml updated accordingly.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix

* Drop custom_tokenizer from MTP recipes: incompatible with sa-bench

  sa-bench's calculate_metrics calls `tokenizer(text)` to count output tokens, but `sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer` doesn't implement __call__:

      TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable
        File "/srtctl-benchmarks/sa-bench/benchmark_serving.py", line 657
          num_tokens = len(tokenizer(output.text_chunks[i], ...).input_ids)

  This is the actual cause of the benchmark-task failures while eval-only tasks succeed (lm-eval doesn't go through this path).

  Removing custom_tokenizer falls back to AutoTokenizer.from_pretrained(/model). The chat_template is stored in the model's tokenizer_config.json, so `use_chat_template: true` continues to apply via the HF tokenizer (required for MTP correctness per AGENTS.md).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Pin srt-slurm to fork w/ SGLangDeepseekV4Tokenizer callable + restore custom_tokenizer

  NVIDIA/srt-slurm#144 adds __call__ / __getattr__ to SGLangDeepseekV4Tokenizer so sa-bench's calculate_metrics (benchmark_serving.py:657, `tokenizer(text).input_ids`) can count generated tokens for DSv4-Pro multi-node MTP runs without throwing ``TypeError: 'SGLangDeepseekV4Tokenizer' object is not callable``.

  Until that PR merges, pin gb300-cw's sglang launcher to ``ch-wan/srt-slurm @ c901ad38`` (the same fix), and restore ``custom_tokenizer: sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer`` in the 6 MTP recipes. ``use_chat_template: true`` is required by AGENTS.md for MTP correctness (EAGLE acceptance regresses on raw random tokens).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump sglang container to nightly-dev-cu13-20260508-2cf1a4ab (latest main)

  Pinned to the multi-arch image produced by sgl-project/sglang Build and Push Development Docker Images run #25574279419 (head_sha 2cf1a4ab, HEAD of sglang main). Replaces the older staging image lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev (May 7).

  The nightly-dev-cu13 image carries the full sglang main as of 2026-05-08 21:06 UTC, including upstream fixes since the May-7 staging snapshot. Multi-arch manifest covers amd64 + arm64, so it works on the gb300 (Grace) compute nodes.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Restore base dsv4-fp4-gb300-dynamo-sglang image to staging tag

  The previous commit accidentally bumped the non-MTP base entry's image too. The base 8k1k recipes still pin ``container: lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev``, and the launcher requires the matrix's ``image:`` to match the recipe's ``container:`` (it templates ``"${IMAGE}": ${SQUASH_FILE}`` into srtslurm.yaml). Mismatching them would break the base sweep.

  Only the dsv4-fp4-gb300-dynamo-sglang-mtp entry needs the nightly-dev-cu13 bump (paired with the MTP recipe ``container:`` field).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Pin MTP recipes to dynamo 81d0555e (matches working base recipes)

  The 6 MTP recipes were imported with dynamo hash 9d3c913d from the upstream srt-slurm fork, but the working non-MTP base recipes already on this branch use 81d0555ee23519cea80a42b4fe824e30368b7300, paired with the sglang nightly cu13 main builds.

  The 9d3c913d wheel is incompatible with sglang main 2cf1a4ab: the decode scheduler subprocess (rank 0) is SIGQUIT'd during sgl.Engine() init at dynamo.sglang.init_llm:77, surfacing as "Rank 0 scheduler died during initialization (exit code: -3)" in CI run 25580956722.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Explicitly disable CAR_V2 in multi-node decode MTP recipes

  The 4 multi-node decode MTP recipes had a comment saying SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 was "intentionally NOT set", but sglang main 2cf1a4ab defaults this on. CAR_V2 is single-node only, and on multi-node decode it silently fails to construct its backing ``self.obj``, then segfaults during cuda graph capture:

      AttributeError: 'CustomAllReduceV2' object has no attribute 'obj'
        at custom_all_reduce_v2.py:97 in capture()

  The scheduler is SIGQUIT'd, surfacing as "Rank 0 scheduler died during initialization (exit code: -3)" in dynamo's wrapper. Explicitly setting the env to "0" matches the intent of the pre-existing comment.

  Affects: dep4-dep8, dep4-dep16, 2p1d-dep4-dep8, 4p1d-dep4-dep8. Single-node decode recipes (1p1d-tp4-tp4, 1p6d-dep4-tp4) keep the default since CAR_V2 works in single-node.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Explicitly disable CAR_V2 in 8k1k base decode recipes too

  Apply the same explicit ``SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0"`` to the existing 8k1k base decode recipes that had only the ``intentionally NOT set`` comment. The MTP fix in 6d28994 proved the comment-only pattern is brittle: sglang main 2cf1a4ab defaults the env on, and CAR_V2 segfaults during cuda graph capture on multi-node decode. Make the disable explicit so a future image bump on the base sweep can't regress the same way.

  Affects 6 recipes: 1p1d-tp4-tp4-2-c1, 1p1d-dep4-dep16-5-c1024, 4p1d-dep4-dep16-8-c1024, 8p1d-dep4-dep16-12-c4096, 10p1d-dep4-dep16-14-c8192, 12p1d-dep4-dep12-15-c21504.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Set both old and new sglang thinking/reasoning env vars in MTP recipes

  sglang main 2cf1a4ab moved ``SGLANG_ENABLE_THINKING`` → ``SGLANG_DEFAULT_THINKING`` and ``SGLANG_REASONING_EFFORT`` → ``SGLANG_DSV4_REASONING_EFFORT``. The deprecation helper ``_print_deprecated_env`` (environ.py:642) only emits a warning; it does NOT propagate the value to the new name. So the old env vars were silently ignored: server defaulted to non-thinking mode with empty reasoning effort, dropping GSM8K accuracy from ~95% to ~40% (eval_results_all from run 25583345967: em_strict=0.4291 for 1p6d-dep4-tp4 conc=64, 0.4056 for 4p1d-dep4-dep8 conc=1024).

  Set both names in prefill_environment and decode_environment of all six MTP recipes:

  * old names: read by the sa-bench client tokenizer (sa_bench_tokenizers.sglang_deepseek_v4) for prompt-rendering parity with the server.
  * new names: read by the sglang server in 2cf1a4ab+.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Set tool-call-parser=deepseekv4 to enable DSV4 chat encoding (gsm8k regression fix)

  GSM8K accuracy on the latest sweep dropped from the expected ~95% to ~40% (em_strict=0.4291 for 1p6d-dep4-tp4 conc=64; 0.4056 for 4p1d-dep4-dep8 conc=1024; run 25583345967 eval_results_all). Inspecting samples_gsm8k_*.jsonl revealed every response was prefixed with junk like "Weapon:" / "Weaponized" / "We黑白颠倒", and the reasoning often answered a different question than what was asked, a classic symptom of a malformed chat-template prompt.

  Root cause in sglang main 2cf1a4ab (entrypoints/openai/serving_chat.py:296):

      def _resolve_chat_encoding_spec(self) -> Optional[str]:
          if self.tool_call_parser == "deepseekv4":
              return "dsv4"
          if self.tool_call_parser == "deepseekv32":
              return "dsv32"

  The dsv4 chat-encoding spec, which routes DSV4 prompts through ``encoding_dsv4.encode_messages`` with thinking-mode and reasoning-effort handling, only activates when ``--tool-call-parser deepseekv4`` is set. Without it the server falls back to the vanilla HF chat template (``apply_chat_template``), which doesn't know about DSV4's special tokens, ``<think>`` blocks, or the ``thinking_mode`` argument. The MTP recipes never set this flag, so ServerArgs reports ``tool_call_parser=None`` and the model receives a malformed prompt.

  Add ``tool-call-parser: deepseekv4`` to both prefill and decode ``sglang_config`` blocks in all 6 MTP recipes.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert CAR_V2 explicit-disable in non-MTP base 8k1k recipes

  Restore the 6 base recipes to their state on origin/main; the explicit ``SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0"`` was added defensively in 9c4c244, but the base sweep is happy on its current staging-dev image and shouldn't be touched in this PR.

  Reverts files:
    disagg-gb300-1p1d-tp4-tp4-2-c1.yaml
    disagg-gb300-1p1d-dep4-dep16-5-c1024.yaml
    disagg-gb300-4p1d-dep4-dep16-8-c1024.yaml
    disagg-gb300-8p1d-dep4-dep16-12-c4096.yaml
    disagg-gb300-10p1d-dep4-dep16-14-c8192.yaml
    disagg-gb300-12p1d-dep4-dep12-15-c21504.yaml

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Trim verbose comments and drop deprecated env var names in MTP recipes

  - Drop ``SGLANG_ENABLE_THINKING`` / ``SGLANG_REASONING_EFFORT`` (deprecated since sglang main 2cf1a4ab); keep only the new names ``SGLANG_DEFAULT_THINKING`` / ``SGLANG_DSV4_REASONING_EFFORT``.
  - Bump the srt-slurm fork pin to 51847632 so the sa-bench client tokenizer reads the new env names (with old names as fallback).
  - Trim multi-line block comments down to one-line tail comments for the CAR_V2 disable and ``tool-call-parser: deepseekv4`` flag.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert MTP recipes to staging-dev container (gsm8k accuracy fix)

  The bump to ``lmsysorg/sglang:nightly-dev-cu13-20260508-2cf1a4ab`` introduced an MTP-path accuracy regression: gsm8k em_strict dropped from the expected ~0.93 to ~0.42 (run 25585766931 eval_results_all shows 0.4200 for 4p1d-dep4-dep8 conc=1024). Local repro on the cluster: the failed 5-shot prompt sent through plain sglang chat completion returns the correct answer; through the dynamo+nightly pipeline it returns garbage prefixed with junk tokens.

  Restore the same staging-dev container the base ``dsv4-fp4-gb300-dynamo-sglang`` sweep already runs on. Drop the dependent flags that only existed because of the nightly bump:

  - container: nightly-dev-cu13-20260508-2cf1a4ab → sglang-staging:deepseek-v4-grace-blackwell-dev (matches the matrix entry's image)
  - ``tool-call-parser: deepseekv4`` removed (the chat-encoding-spec routing it gated on doesn't exist in staging-dev; HF chat_template handles DSV4 prompts directly via dynamo's native Rust formatter).
  - Env vars reverted to ``SGLANG_ENABLE_THINKING`` / ``SGLANG_REASONING_EFFORT`` (the names staging-dev recognizes).
  - nvidia-master.yaml MTP entry image updated to match.

  The dynamo hash, srt-slurm fork pin, sbatch_directives, and multi-node CAR_V2 disable all stay (still required).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump dynamo hash to 34d55a5 to fix DSV4 chat-template formatter

  Local repro on the cluster (job 2226, slurm-gb300-139-009) confirmed the regression is in dynamo's wrapper, not in sglang main:

  - sglang main (2cf1a4ab) standalone, same failed 5-shot: prompt_tokens=1128, answer=18 (correct).
  - sglang main + dynamo 81d0555e (CI): answer="Weapon:#### 16" (em_strict=0.42).

  The pinned dynamo at 81d0555e ships an older Rust DSV4 prompt formatter whose ``render()`` always calls ``encode_messages(...)``, which hardcodes ``reasoning_effort=None`` and ignores ``chat_template_kwargs`` entirely. That produces a prompt the model fails on under MTP.

  Dynamo PR #9322 (commit 34d55a5, "Deduplicate DeepSeek prompt encoders v3.2 and v4") rewrote ``render()`` to read ``reasoning_effort`` and ``drop_thinking`` from ``chat_template_args`` and plumb them into ``encode_messages_with_options``, fixing the DSV4 prompt rendering.

  Restore the changes the staging-dev revert had to undo:

  - container: nightly-dev-cu13-20260508-2cf1a4ab
  - tool-call-parser: deepseekv4 (gates the dsv4 chat-encoding spec)
  - SGLANG_DEFAULT_THINKING / SGLANG_DSV4_REASONING_EFFORT
  - dynamo.hash 81d0555e -> 34d55a5
  - nvidia-master.yaml MTP entry image

  CAR_V2 disable on multi-node decode and the srt-slurm fork pin remain.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump sglang container to nightly-dev-cu13-20260509-9ee83034

  Latest sglang main build (sgl-project/sglang Actions run 25586829316, head_sha 9ee83034, completed 2026-05-09 00:51 UTC). Pairs with the dynamo bump in 9b06113 (commit 34d55a5, PR #9322: DSV4 chat-template formatter rewrite).

  Updated all 6 MTP recipe ``container:`` fields and the ``dsv4-fp4-gb300-dynamo-sglang-mtp`` matrix entry's ``image:`` in nvidia-master.yaml.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Switch DSV4 MTP recipes to nixl KV transfer backend

  The mooncake backend has a KV-transfer bug that produces wrong gsm8k answers when prompts end on the `<think>` token (id 128821). Empirically: same input on monolithic sglang gives correct answer, mooncake-disagg gives wrong, nixl-disagg gives correct. Bug filed upstream; using nixl as workaround.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert "Switch DSV4 MTP recipes to nixl KV transfer backend"

  This reverts commit 3275282.

* Bump MTP recipes to sglang nightly with mooncake DSv4 fix

  Picks up sgl-project/sglang#24878 (merged as c7f674e4), which adds the missing dsv4 state_type branch to MooncakeKVManager.maybe_send_extra. Combined with the prior revert of #1297's nixl switch (commit daa6785), the mooncake backend now correctly transfers DSv4's flat heterogeneous state pool for both non-MTP and MTP runs.

  Validated on GB300 1P+1D: comp_with_think.json (the prompt ending on the literal `<think>` token that previously surfaced the corruption) now returns the correct gsm8k Janet answer (`#### 18`) on mooncake disagg, matching mono and the NIXL control. MTP sa-bench delivers ~136 tok/s output throughput (~1.7x non-MTP), confirming draft acceptance is working.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* gb300-cw: switch srt-slurm pin to NVIDIA/srt-slurm main (#144 merged)

  NVIDIA/srt-slurm#144 (``sa-bench: make SGLangDeepseekV4Tokenizer callable``) merged as 0cbc7eb4. Drop the ch-wan/srt-slurm fork pin that was only there while #144 was in review and pin to the upstream merge commit instead.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* gb300-cw: track NVIDIA/srt-slurm main instead of pinning a commit

  Now that #144 is merged, no longer need to pin a specific commit.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump MTP recipes to sglang nightly 20260510-2473659e

  Picks up sgl-project/sglang main commit 2473659e (built via upstream workflow run 25639473178).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: use shared gb300 dsv4 model path

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>
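For readers unfamiliar with the tokenizer failure described in the commit message, the following is a minimal sketch of the general pattern that `__call__` / `__getattr__` additions follow. It is not the code from NVIDIA/srt-slurm#144: `WhitespaceTokenizer` and `CallableTokenizer` are hypothetical names, and the toy `encode()` only stands in for a real tokenizer. The only detail taken from the message is that the caller expects `tokenizer(text).input_ids`.

```python
from types import SimpleNamespace


class WhitespaceTokenizer:
    """Hypothetical stand-in for a tokenizer that only exposes encode()."""

    vocab_name = "demo"

    def encode(self, text):
        # Toy encoding: one fake token id per whitespace-separated word.
        return list(range(len(text.split())))


class CallableTokenizer:
    """Hypothetical wrapper that adds __call__ and attribute delegation."""

    def __init__(self, inner):
        self._inner = inner

    def __call__(self, text, **kwargs):
        # Extra keyword arguments are accepted and ignored in this sketch.
        # The return value mimics the shape `tokenizer(text).input_ids` expects.
        return SimpleNamespace(input_ids=self._inner.encode(text))

    def __getattr__(self, name):
        # Only reached when normal lookup fails; guard against recursion
        # if _inner itself is missing, then forward to the wrapped object.
        if name == "_inner":
            raise AttributeError(name)
        return getattr(self._inner, name)


tok = CallableTokenizer(WhitespaceTokenizer())
num_tokens = len(tok("Janet sells duck eggs").input_ids)
```

Without `__call__`, the `tok(...)` expression raises the same kind of `TypeError: ... object is not callable` quoted in the message; `__getattr__` keeps the rest of the wrapped object's attributes reachable through the wrapper.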
1 parent b6278ae commit 4007906

9 files changed

Lines changed: 1013 additions & 2 deletions

.github/configs/nvidia-master.yaml

Lines changed: 107 additions & 0 deletions
@@ -8425,3 +8425,110 @@ dsv4-fp4-gb300-dynamo-sglang:
           tp: 12
           ep: 12
           dp-attn: true
+
+# MTP variant of dsv4-fp4-gb300-dynamo-sglang.
+dsv4-fp4-gb300-dynamo-sglang-mtp:
+  image: lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034
+  model: deepseek-ai/DeepSeek-V4-Pro
+  model-prefix: dsv4
+  runner: gb300-cw
+  precision: fp4
+  framework: dynamo-sglang
+  multinode: true
+  disagg: true
+  scenarios:
+    fixed-seq-len:
+    - isl: 8192
+      osl: 1024
+      search-space:
+      # Low-latency baseline: 1p1d-tp4-tp4. 2 nodes.
+      - spec-decoding: "mtp"
+        conc-list: [1]
+        prefill:
+          num-worker: 1
+          tp: 4
+          ep: 1
+          dp-attn: false
+          additional-settings:
+          - "CONFIG_FILE=recipes/sglang/deepseek-v4/8k1k/disagg-low-latency-1p1d-tp4-tp4-mtp.yaml"
+        decode:
+          num-worker: 1
+          tp: 4
+          ep: 1
+          dp-attn: false
+      # Low-latency 1p6d-dep4-tp4: 1P (DEP=4) + 6 TP=4 decode workers. 7 nodes.
+      # Recipe runs concurrencies=8x32x64; matrix tracks the max.
+      - spec-decoding: "mtp"
+        conc-list: [64]
+        prefill:
+          num-worker: 1
+          tp: 4
+          ep: 4
+          dp-attn: true
+          additional-settings:
+          - "CONFIG_FILE=recipes/sglang/deepseek-v4/8k1k/disagg-low-latency-1p6d-dep4-tp4-mtp.yaml"
+        decode:
+          num-worker: 6
+          tp: 4
+          ep: 1
+          dp-attn: false
+      # Mid curve 1p1d-dep4-dep8. 3 nodes.
+      - spec-decoding: "mtp"
+        conc-list: [256]
+        prefill:
+          num-worker: 1
+          tp: 4
+          ep: 4
+          dp-attn: true
+          additional-settings:
+          - "CONFIG_FILE=recipes/sglang/deepseek-v4/8k1k/disagg-mid-curve-1p1d-dep4-dep8-mtp.yaml"
+        decode:
+          num-worker: 1
+          tp: 8
+          ep: 8
+          dp-attn: true
+      # Mid curve 1p1d-dep4-dep16. 5 nodes.
+      - spec-decoding: "mtp"
+        conc-list: [256]
+        prefill:
+          num-worker: 1
+          tp: 4
+          ep: 4
+          dp-attn: true
+          additional-settings:
+          - "CONFIG_FILE=recipes/sglang/deepseek-v4/8k1k/disagg-mid-curve-1p1d-dep4-dep16-mtp.yaml"
+        decode:
+          num-worker: 1
+          tp: 16
+          ep: 16
+          dp-attn: true
+      # Mid curve 2p1d-dep4-dep8. 4 nodes.
+      - spec-decoding: "mtp"
+        conc-list: [512]
+        prefill:
+          num-worker: 2
+          tp: 4
+          ep: 4
+          dp-attn: true
+          additional-settings:
+          - "CONFIG_FILE=recipes/sglang/deepseek-v4/8k1k/disagg-mid-curve-2p1d-dep4-dep8-mtp.yaml"
+        decode:
+          num-worker: 1
+          tp: 8
+          ep: 8
+          dp-attn: true
+      # Mid curve 4p1d-dep4-dep8. 6 nodes.
+      - spec-decoding: "mtp"
+        conc-list: [1024]
+        prefill:
+          num-worker: 4
+          tp: 4
+          ep: 4
+          dp-attn: true
+          additional-settings:
+          - "CONFIG_FILE=recipes/sglang/deepseek-v4/8k1k/disagg-mid-curve-4p1d-dep4-dep8-mtp.yaml"
+        decode:
+          num-worker: 1
+          tp: 8
+          ep: 8
+          dp-attn: true
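The node counts quoted in the matrix comments (2, 7, 3, 5, 4, 6) can be reproduced with a short calculation. The sketch below assumes that each worker occupies `tp / gpus_per_node` whole nodes, that workers never share a node, and that `gpus_per_node` is 4 as in the MTP recipes in this commit. `nodes_needed` and `MATRIX` are illustrative names, not part of the repository.

```python
import math

GPUS_PER_NODE = 4  # assumption: taken from `gpus_per_node: 4` in the MTP recipes


def nodes_needed(prefill_workers, prefill_tp, decode_workers, decode_tp,
                 gpus_per_node=GPUS_PER_NODE):
    """Nodes used when every worker gets ceil(tp / gpus_per_node) dedicated nodes."""
    prefill = prefill_workers * math.ceil(prefill_tp / gpus_per_node)
    decode = decode_workers * math.ceil(decode_tp / gpus_per_node)
    return prefill + decode


# (prefill num-worker, prefill tp, decode num-worker, decode tp) per matrix entry.
MATRIX = {
    "1p1d-tp4-tp4": (1, 4, 1, 4),
    "1p6d-dep4-tp4": (1, 4, 6, 4),
    "1p1d-dep4-dep8": (1, 4, 1, 8),
    "1p1d-dep4-dep16": (1, 4, 1, 16),
    "2p1d-dep4-dep8": (2, 4, 1, 8),
    "4p1d-dep4-dep8": (4, 4, 1, 8),
}

node_counts = {name: nodes_needed(*shape) for name, shape in MATRIX.items()}
```

Under these assumptions the computed counts agree with the comments in the matrix, for example 1 prefill node plus 6 decode nodes for `1p6d-dep4-tp4`.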
Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
+name: "dsv4-pro-gb300-disagg-8k1k-low-latency-1p1d-tp4-tp4-mtp"
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 8
+
+dynamo:
+  hash: "34d55a596fb8d3d44daefe425ec1e303131f4d2c"
+  install: true
+
+model:
+  path: "deepseek-v4-pro"
+  container: "lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e"
+  precision: "mxfp4"
+
+sbatch_directives:
+  cpus-per-task: "144"
+  mem: "0"
+
+resources:
+  gpu_type: "gb300"
+  gpus_per_node: 4
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 1
+  decode_workers: 1
+
+backend:
+  type: sglang
+
+  prefill_environment:
+    PYTHONUNBUFFERED: "1"
+    SGLANG_RADIX_DISABLE_REUSE: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DEFAULT_THINKING: "1"
+    SGLANG_DSV4_REASONING_EFFORT: "max"
+    SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1"
+    SGLANG_OPT_USE_JIT_NORM: "1"
+    SGLANG_OPT_USE_JIT_INDEXER_METADATA: "1"
+    SGLANG_OPT_USE_TOPK_V2: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1"
+
+  decode_environment:
+    PYTHONUNBUFFERED: "1"
+    SGLANG_RADIX_DISABLE_REUSE: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DEFAULT_THINKING: "1"
+    SGLANG_DSV4_REASONING_EFFORT: "max"
+    SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1"
+    SGLANG_OPT_USE_JIT_NORM: "1"
+    SGLANG_OPT_USE_JIT_INDEXER_METADATA: "1"
+    SGLANG_OPT_USE_TOPK_V2: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1"
+    # SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2
+    # is single-node only and corrupts results in 2-node decode setups.
+
+  sglang_config:
+    prefill:
+      served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+      model-path: "/model/"
+      trust-remote-code: true
+      tool-call-parser: deepseekv4 # gates dsv4 chat-encoding spec.
+
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: mooncake
+
+      tensor-parallel-size: 4
+      data-parallel-size: 1
+      expert-parallel-size: 1
+
+      moe-runner-backend: "flashinfer_mxfp4"
+      disable-flashinfer-autotune: true
+
+      mem-fraction-static: 0.9
+      max-running-requests: 8
+      cuda-graph-max-bs: 8
+      chunked-prefill-size: 32768
+
+    decode:
+      served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+      model-path: "/model/"
+      trust-remote-code: true
+      tool-call-parser: deepseekv4 # gates dsv4 chat-encoding spec.
+
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: mooncake
+
+      tensor-parallel-size: 4
+      data-parallel-size: 1
+      expert-parallel-size: 1
+
+      moe-runner-backend: "flashinfer_mxfp4"
+      disable-flashinfer-autotune: true
+
+      speculative-algo: "EAGLE"
+      speculative-num-steps: 3
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 4
+
+      mem-fraction-static: 0.9
+      max-running-requests: 8
+      cuda-graph-max-bs: 8
+      swa-full-tokens-ratio: 0.1
+      context-length: 16384
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  random_range_ratio: 0.8
+  concurrencies: "1"
+  req_rate: "inf"
+  use_chat_template: true
+  custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer"
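The recipe above sets only the new names `SGLANG_DEFAULT_THINKING` and `SGLANG_DSV4_REASONING_EFFORT`. The commit message notes that the deprecation helper only warns and does not copy the old value across, and that the sa-bench client tokenizer was changed to read the new names with the old ones as fallback. A minimal sketch of that lookup order follows; `read_renamed_env` and `RENAMED` are hypothetical names and this is not the sa-bench or sglang implementation.

```python
import os

# New name -> deprecated name, as listed in the commit message.
RENAMED = {
    "SGLANG_DEFAULT_THINKING": "SGLANG_ENABLE_THINKING",
    "SGLANG_DSV4_REASONING_EFFORT": "SGLANG_REASONING_EFFORT",
}


def read_renamed_env(new_name, env=None, default=None):
    """Return the value under the new name, else the deprecated name, else default."""
    env = os.environ if env is None else env
    if new_name in env:
        return env[new_name]
    old_name = RENAMED.get(new_name)
    if old_name is not None and old_name in env:
        return env[old_name]
    return default


# A recipe that sets only the old names still resolves through the fallback.
old_only = {"SGLANG_ENABLE_THINKING": "1", "SGLANG_REASONING_EFFORT": "max"}
thinking = read_renamed_env("SGLANG_DEFAULT_THINKING", env=old_only)
effort = read_renamed_env("SGLANG_DSV4_REASONING_EFFORT", env=old_only)
```

A reader that checks only the new name would see nothing for `old_only`, which is the silent-default behaviour the commit message describes for the server.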
Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
+name: "dsv4-pro-gb300-disagg-8k1k-low-latency-1p6d-dep4-tp4-mtp"
+
+frontend:
+  type: dynamo
+  enable_multiple_frontends: true
+  num_additional_frontends: 8
+
+dynamo:
+  hash: "34d55a596fb8d3d44daefe425ec1e303131f4d2c"
+  install: true
+
+model:
+  path: "deepseek-v4-pro"
+  container: "lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e"
+  precision: "mxfp4"
+
+sbatch_directives:
+  cpus-per-task: "144"
+  mem: "0"
+
+resources:
+  gpu_type: "gb300"
+  gpus_per_node: 4
+  prefill_nodes: 1
+  prefill_workers: 1
+  decode_nodes: 6
+  decode_workers: 6
+
+backend:
+  type: sglang
+
+  prefill_environment:
+    PYTHONUNBUFFERED: "1"
+    SGLANG_RADIX_DISABLE_REUSE: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DEFAULT_THINKING: "1"
+    SGLANG_DSV4_REASONING_EFFORT: "max"
+    SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1"
+    SGLANG_OPT_USE_JIT_NORM: "1"
+    SGLANG_OPT_USE_JIT_INDEXER_METADATA: "1"
+    SGLANG_OPT_USE_TOPK_V2: "1"
+
+    SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN: "1"
+    SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1"
+    SGLANG_OPT_USE_FAST_MASK_EP: "1"
+    SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE: "1"
+    SGLANG_OPT_FIX_HASH_MEGA_MOE: "1"
+    SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK: "9216"
+    SGLANG_OPT_FIX_MEGA_MOE_MEMORY: "1"
+    SGLANG_OPT_FIX_NEXTN_MEGA_MOE: "1"
+    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "0"
+
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1"
+
+  decode_environment:
+    PYTHONUNBUFFERED: "1"
+    SGLANG_RADIX_DISABLE_REUSE: "1"
+    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
+    SGLANG_DEFAULT_THINKING: "1"
+    SGLANG_DSV4_REASONING_EFFORT: "max"
+    SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1"
+    SGLANG_OPT_USE_JIT_NORM: "1"
+    SGLANG_OPT_USE_JIT_INDEXER_METADATA: "1"
+    SGLANG_OPT_USE_TOPK_V2: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    MC_FORCE_MNNVL: "1"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1"
+    # SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2
+    # is single-node only and corrupts results in 2-node decode setups.
+
+  sglang_config:
+    prefill:
+      served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+      model-path: "/model/"
+      trust-remote-code: true
+      tool-call-parser: deepseekv4 # gates dsv4 chat-encoding spec.
+
+      disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: mooncake
+
+      tensor-parallel-size: 4
+      data-parallel-size: 4
+      expert-parallel-size: 4
+
+      enable-dp-attention: true
+      enable-dp-lm-head: true
+
+      moe-a2a-backend: "deepep"
+      deepep-config: '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}'
+
+      mem-fraction-static: 0.9
+      max-running-requests: 128
+      cuda-graph-max-bs: 128
+      chunked-prefill-size: 32768
+
+    decode:
+      served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+      model-path: "/model/"
+      trust-remote-code: true
+      tool-call-parser: deepseekv4 # gates dsv4 chat-encoding spec.
+
+      disaggregation-mode: "decode"
+      disaggregation-transfer-backend: mooncake
+
+      tensor-parallel-size: 4
+      data-parallel-size: 1
+      expert-parallel-size: 1
+
+      moe-runner-backend: "flashinfer_mxfp4"
+      disable-flashinfer-autotune: true
+
+      speculative-algo: "EAGLE"
+      speculative-num-steps: 3
+      speculative-eagle-topk: 1
+      speculative-num-draft-tokens: 4
+
+      mem-fraction-static: 0.9
+      max-running-requests: 128
+      cuda-graph-max-bs: 128
+      swa-full-tokens-ratio: 0.1
+      context-length: 16384
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  random_range_ratio: 0.8
+  concurrencies: "8x32x64"
+  req_rate: "inf"
+  use_chat_template: true
+  custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer"
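The commit message states that the launcher requires the matrix entry's `image:` to match each recipe's `container:`. The sketch below shows one way such a pre-flight comparison could look, using the literal values visible in this diff. `find_container_mismatches` is a hypothetical helper, not part of the launcher, and the values are hard-coded rather than parsed from the YAML files.

```python
def find_container_mismatches(matrix_image, recipe_containers):
    """Return {recipe name: container} for recipes whose container differs from the matrix image."""
    return {
        name: container
        for name, container in recipe_containers.items()
        if container != matrix_image
    }


# Values copied from the nvidia-master.yaml entry and the two recipes shown in this diff.
matrix_image = "lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034"
recipe_containers = {
    "dsv4-pro-gb300-disagg-8k1k-low-latency-1p1d-tp4-tp4-mtp":
        "lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e",
    "dsv4-pro-gb300-disagg-8k1k-low-latency-1p6d-dep4-tp4-mtp":
        "lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e",
}

mismatches = find_container_mismatches(matrix_image, recipe_containers)
for name, container in sorted(mismatches.items()):
    print(f"{name}: recipe container {container} != matrix image {matrix_image}")
```

With the values as they appear here, the matrix tag (`20260509-9ee83034`) and the recipe tag (`20260510-2473659e`) differ, so a strict equality check like this one reports both recipes. Whether the launcher tolerates that difference is not stated in the diff.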
