Add Step-3.7-Flash NVFP4 support: harness reasoning-effort + dual-Blackwell setup guide#27
Merged
Merged
Conversation
… levels
Step-3.7-Flash exposes low/medium/high reasoning levels via a chat-template
reasoning_effort variable. Add --reasoning-effort {low,medium,high} to harness.py
(sent as top-level chat_template_kwargs, recorded in the receipt) and an optional
trailing reasoning-effort arg to run_microbench.sh and smoke_test.sh.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Day-one (2026-05-28) 201B MoE VLM served under vLLM on a 2-GPU sm_120 Workstation Blackwell box with native NVFP4 + FP8 KV. No official 2x6000 recipe exists upstream (checked model card, StepFun GitHub, vLLM recipes), so this captures the working launch command and the four non-obvious flags: - --disable-custom-all-reduce: vLLM CUSTOM all-reduce deadlocks without GPU P2P (these Workstation cards have no NVLink); root cause of hangs at NCCL init / cudagraph capture / attention warmup. - --moe-backend cutlass: native FP4 on the MoE experts (auto -> Marlin dequant). Step-3.7's SWIGLUSTEP activation is only supported by the VLLM_CUTLASS and MARLIN FP4 kernels, not the FlashInfer ones. - no --enable-expert-parallel: VLLM_CUTLASS FP4 MoE shards experts via TP. - --max-model-len 262144 (native): harness max_tokens budget assumes it. Adds hardware-tests/ entry (README + findings), a launch-commands.md section + parser-table row, a main-README table row, and two claims. Microbench results (3 reasoning levels) to follow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aphs work) vllm bench serve battery on 2x RTX PRO 6000 @ 600W: single-stream decode ~99 tok/s (TPOT 10ms), ~1.5k tok/s aggregate output at 64-way concurrency (peak 2.1k). CUDA Graphs capture cleanly once --disable-custom-all-reduce is set (the capture 'hang' was the same custom-all-reduce bug) and are 4.7x faster single-stream / 1.8x higher batched throughput than eager. Recommended config now runs with CUDA Graphs on; --enforce-eager demoted to fallback. Adds throughput.md, updates README/findings/launch-commands, +1 claim. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…USTEP claim, add claim caveats Pre-merge audit (independent doc review + source/live verification): - B1: state the 1.8x cudagraph-vs-eager speedup on aggregate OUTPUT tok/s (530->963), not the in+out composite the repo retracted. - B2: drop undefined 'peak 2.1k' from headlines; define it (instantaneous peak within the conc=64 run) in a table footnote; lead with means. - S1: scope the SWIGLUSTEP evidence — triton_moe/deep_gemm_moe declare it but are non-FP4 kernels, not --moe-backend NVFP4 options. Verified the 8 NvFp4MoeBackend->experts mappings: only VLLM_CUTLASS and MARLIN support it. - S2: reconcile the 26 GiB KV figure (measured available KV, after activation + cudagraph reserve) vs the naive 88-58.6 budget. - S5: add caveats + promote_to_strong_when to all three hw.step37.* claims. - Nits: SWIGLUSTEP gloss at first mention, TTFT-queueing note, read-order wording, bump claims.yaml last_updated to 2026-05-28. Verified against the live serving instance: running args == documented command, 'Using VLLM_CUTLASS NvFp4 MoE backend', PYNCCL all-reduce, 0 Marlin warnings, GPU topo NODE (no NVLink). All numbers reconciled to raw vllm bench output. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…level-independent) The throughput battery ran at default reasoning_effort (unset), random data. Decode tok/s is a per-token rate, identical across low/medium/high; reasoning level changes tokens-per-task (end-to-end latency), measured by the microbench. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1d89bc0 to
c741308
Compare
…r notes Usability/citability audit pass — make the entry reproducible and citable to the house bar (qwen3.6-q8-fleet sibling): - README: add a copy-paste "Reproduce (zero -> same numbers)" section (weights download via curl, image pull + full digest, launch, smoke gate, bench); full image digest in env table; expanded read order. - throughput.md: give the exact "vllm bench serve" invocation (tokenizer path, dataset, lens, concurrency) so every row is regenerable; link bench-raw.txt. - bench-raw.txt (NEW): raw per-cell vllm-bench result lines + exact params; notes the eager-vs-cudagraph conc=1 cell-config difference honestly. - manifest.json (NEW): machine-readable provenance (hardware, image digest, model, serving config, throughput grid, smoke result, claim ids). - NOTES-FOR-REVIEWERS.md (NEW): limitations + promote-to-strong conditions. - findings.md: paste the actual nvidia-smi topo matrix (keystone evidence). - hardware-tests/README.md: add the bundle to the coverage table. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
9a31055 to
83f332c
Compare
…accuracy reconcile, script help) [P2] run_microbench.sh: fail fast when --reasoning-effort is set but the effort isn't in the label. Run names + the idempotent skip are keyed by label only, so sweeping low/medium/high under one label silently skipped later efforts as "already complete". Guard exits 2 with a corrected command. (Kept label-keyed naming rather than appending effort to run names, which would break the grade/summarize globs.) [P2] Reconcile smoke accuracy with the source of truth (manifest.json): the cudagraph config scored field accuracy 0.95 (19/20); the earlier eager-config smoke scored 1.0 (20/20). README and findings previously claimed 1.0 for the cudagraph config. Both now state 0.95 (cudagraph) / 1.0 (eager), note the 1-field delta is temp=0.3 variance, and point at manifest.json smoke_test. [P3] Update script help for the new optional reasoning-effort arg: usage lines and inline EOF help in run_microbench.sh and smoke_test.sh, plus smoke's printed "Next" command now carries the effort through (and reminds it must be in the label). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Enables and documents benchmarking StepFun Step-3.7-Flash NVFP4 (day-one, 2026-05-28; 201B MoE VLM, ~11B active, 256k ctx, 3 reasoning levels) through the MMBT microbench on a 2× RTX PRO 6000 Blackwell (sm_120, no NVLink) box.
This PR is configuration + harness enablement. Microbench results (low/medium/high reasoning) follow in a subsequent PR.
Harness change
harness.py: new--reasoning-effort {low,medium,high}— sent as top-levelchat_template_kwargs(Step-3.7's chat template injects"Reasoning: <effort>"into the system turn) and recorded in the receipt. No-op for models that don't read it.run_microbench.sh/smoke_test.sh: optional trailing reasoning-effort arg.The dual-Blackwell setup writeup
New
hardware-tests/step3.7-flash-nvfp4-dual-blackwell-2026-05-28/(README + findings) documents the working launch command and the four non-obvious flags it took to find — there is no official 2×6000 recipe anywhere (model card / StepFun GitHub / vLLM recipes all target 4–8 GPU servers):--disable-custom-all-reduce— the keystone. vLLM's CUSTOM all-reduce needs GPU P2P these Workstation cards lack over PCIe (no NVLink); without it the server deadlocks on the first TP collective (hangs at NCCL init / cudagraph capture / attention warmup — all one root cause).--moe-backend cutlass— native FP4 on the experts (defaultauto→ Marlin dequant). Step-3.7'sSWIGLUSTEPactivation is only supported by theVLLM_CUTLASS/MARLINFP4 kernels, not the FlashInfer ones — socutlassis the only native-FP4 option for this model.--enable-expert-parallel—VLLM_CUTLASSFP4 MoE shards experts via TP.--max-model-len 262144(native) — the harness max_tokens budget assumes it.Caveat documented:
--enforce-eageris currently on (cudagraph capture was the same all-reduce bug); throughput from this config is eager-qualified.Verification
Using 'VLLM_CUTLASS' NvFp4 MoE backend, no Marlin warning,['PYNCCL']all-reduce.Also updated
tooling/launch-commands.md: Step-3.7 launch section + parser-table row.README.md: hardware-tests table row.claims.yaml: two provisional claims (custom-all-reduce hang; native-FP4-requires-VLLM_CUTLASS).🤖 Generated with Claude Code