Add Step-3.7-Flash NVFP4 support: harness reasoning-effort + dual-Blackwell setup guide#27

Merged
Lightheartdevs merged 7 commits into main from add-step3p7-flash-nvfp4-microbench
May 29, 2026

@Lightheartdevs
Contributor

What

Enables and documents benchmarking StepFun Step-3.7-Flash NVFP4 (day-one, 2026-05-28; 201B MoE VLM, ~11B active, 256k ctx, 3 reasoning levels) through the MMBT microbench on a 2× RTX PRO 6000 Blackwell (sm_120, no NVLink) box.

This PR is configuration + harness enablement. Microbench results (low/medium/high reasoning) follow in a subsequent PR.

Harness change

  • harness.py: new --reasoning-effort {low,medium,high} — sent as top-level chat_template_kwargs (Step-3.7's chat template injects "Reasoning: <effort>" into the system turn) and recorded in the receipt. No-op for models that don't read it.
  • run_microbench.sh / smoke_test.sh: optional trailing reasoning-effort arg.
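
A minimal sketch of how the new flag could map onto the request body, assuming an OpenAI-style chat payload sent to vLLM. The helper names `build_chat_payload` and `parse_args` are hypothetical and are not taken from harness.py; only the `--reasoning-effort` choices, the top-level `chat_template_kwargs` key, and the `reasoning_effort` template variable come from this PR.

```python
import argparse

EFFORTS = ("low", "medium", "high")


def build_chat_payload(model, messages, reasoning_effort=None, max_tokens=1024):
    """Build a chat-completions request body (hypothetical helper)."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    if reasoning_effort is not None:
        if reasoning_effort not in EFFORTS:
            raise ValueError(f"unknown reasoning effort: {reasoning_effort!r}")
        # Sent at the top level of the body, next to "messages". A chat
        # template that never reads the variable simply ignores it.
        payload["chat_template_kwargs"] = {"reasoning_effort": reasoning_effort}
    return payload


def parse_args(argv=None):
    """Parse only the flag this PR adds; the real harness has more options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--reasoning-effort", choices=EFFORTS, default=None)
    return parser.parse_args(argv)
```

Leaving the key out entirely when the flag is unset keeps the request identical to the pre-PR behaviour for every other model.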

The dual-Blackwell setup writeup

New hardware-tests/step3.7-flash-nvfp4-dual-blackwell-2026-05-28/ (README + findings) documents the working launch command and the four non-obvious flags it took to reach it. No official 2×6000 recipe exists anywhere (the model card, StepFun GitHub, and vLLM recipes all target 4–8 GPU servers):

  1. --disable-custom-all-reduce: the keystone. vLLM's CUSTOM all-reduce needs GPU P2P, which these Workstation cards lack over PCIe (no NVLink); without the flag the server deadlocks on the first TP collective (the hangs at NCCL init, cudagraph capture, and attention warmup all share this one root cause).
  2. --moe-backend cutlass: native FP4 on the experts (default auto → Marlin dequant). Step-3.7's SWIGLUSTEP activation is only supported by the VLLM_CUTLASS/MARLIN FP4 kernels, not the FlashInfer ones, so cutlass is the only native-FP4 option for this model.
  3. no --enable-expert-parallel: VLLM_CUTLASS FP4 MoE shards experts via TP.
  4. --max-model-len 262144 (native): the harness max_tokens budget assumes it.

Caveat documented: --enforce-eager is currently on (cudagraph capture was the same all-reduce bug); throughput from this config is eager-qualified.
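
The four flags plus the eager caveat can be sketched as an argument list. This is not the documented launch command: the model path is a placeholder, `--tensor-parallel-size 2` is inferred from the two-GPU TP setup, `--moe-backend cutlass` is spelled as this PR writes it, and the remaining options from the real command (KV-cache dtype, parsers, and so on) are omitted. `build_vllm_serve_args` is a hypothetical helper.

```python
import shlex


def build_vllm_serve_args(model, enforce_eager=True):
    """Assemble the serve arguments discussed above (illustrative only)."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", "2",   # assumption: one TP rank per GPU
        "--disable-custom-all-reduce",   # no GPU P2P over PCIe on these cards
        "--moe-backend", "cutlass",      # native FP4 experts, as named in this PR
        "--max-model-len", "262144",     # native context length
    ]
    # Deliberately no --enable-expert-parallel: experts are sharded via TP.
    if enforce_eager:
        args.append("--enforce-eager")
    return args


def as_shell(args):
    """Render the argument list as a single copy-pasteable shell line."""
    return shlex.join(args)
```

Calling `as_shell(build_vllm_serve_args("/path/to/model"))` yields a one-line command; the path here is a stand-in, not the location used in the writeup.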

Verification

  • Logs: Using 'VLLM_CUTLASS' NvFp4 MoE backend, no Marlin warning, ['PYNCCL'] all-reduce.
  • MMBT smoke (structured extraction, reasoning=medium): PASS, 20/20 fields, accuracy 1.0.
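
The log checks above could be automated along these lines. The sketch assumes the server log is available as plain text and that the relevant lines contain the substrings quoted in this PR; the exact log format, and what counts as a "Marlin warning", are assumptions. `check_launch_log` is a hypothetical helper.

```python
def check_launch_log(log_text):
    """Return a pass/fail flag for each of the three log conditions."""
    lines = log_text.splitlines()
    return {
        # Expect a line such as: Using 'VLLM_CUTLASS' NvFp4 MoE backend
        "cutlass_fp4_moe": any(
            "VLLM_CUTLASS" in line and "NvFp4 MoE backend" in line
            for line in lines
        ),
        # Assumption: a Marlin fallback shows up as a warning mentioning Marlin.
        "no_marlin_warning": not any(
            "marlin" in line.lower() and "warning" in line.lower()
            for line in lines
        ),
        # Expect PYNCCL to be reported as the all-reduce path.
        "pynccl_all_reduce": any("PYNCCL" in line for line in lines),
    }
```

A launch would count as verified only when every value in the returned dictionary is true.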

Also updated

  • tooling/launch-commands.md: Step-3.7 launch section + parser-table row.
  • README.md: hardware-tests table row.
  • claims.yaml: two provisional claims (custom-all-reduce hang; native-FP4-requires-VLLM_CUTLASS).

🤖 Generated with Claude Code

User Name and others added 5 commits May 28, 2026 23:40
… levels

Step-3.7-Flash exposes low/medium/high reasoning levels via a chat-template
reasoning_effort variable. Add --reasoning-effort {low,medium,high} to harness.py
(sent as top-level chat_template_kwargs, recorded in the receipt) and an optional
trailing reasoning-effort arg to run_microbench.sh and smoke_test.sh.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Day-one (2026-05-28) 201B MoE VLM served under vLLM on a 2-GPU sm_120
Workstation Blackwell box with native NVFP4 + FP8 KV. No official 2x6000
recipe exists upstream (checked model card, StepFun GitHub, vLLM recipes),
so this captures the working launch command and the four non-obvious flags:

- --disable-custom-all-reduce: vLLM CUSTOM all-reduce deadlocks without
  GPU P2P (these Workstation cards have no NVLink); root cause of hangs at
  NCCL init / cudagraph capture / attention warmup.
- --moe-backend cutlass: native FP4 on the MoE experts (auto -> Marlin
  dequant). Step-3.7's SWIGLUSTEP activation is only supported by the
  VLLM_CUTLASS and MARLIN FP4 kernels, not the FlashInfer ones.
- no --enable-expert-parallel: VLLM_CUTLASS FP4 MoE shards experts via TP.
- --max-model-len 262144 (native): harness max_tokens budget assumes it.

Adds hardware-tests/ entry (README + findings), a launch-commands.md
section + parser-table row, a main-README table row, and two claims.
Microbench results (3 reasoning levels) to follow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aphs work)

vllm bench serve battery on 2x RTX PRO 6000 @ 600W: single-stream decode
~99 tok/s (TPOT 10ms), ~1.5k tok/s aggregate output at 64-way concurrency
(peak 2.1k). CUDA Graphs capture cleanly once --disable-custom-all-reduce is
set (the capture 'hang' was the same custom-all-reduce bug) and are 4.7x
faster single-stream / 1.8x higher batched throughput than eager. Recommended
config now runs with CUDA Graphs on; --enforce-eager demoted to fallback.

Adds throughput.md, updates README/findings/launch-commands, +1 claim.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…USTEP claim, add claim caveats

Pre-merge audit (independent doc review + source/live verification):
- B1: state the 1.8x cudagraph-vs-eager speedup on aggregate OUTPUT tok/s
  (530->963), not the in+out composite the repo retracted.
- B2: drop undefined 'peak 2.1k' from headlines; define it (instantaneous
  peak within the conc=64 run) in a table footnote; lead with means.
- S1: scope the SWIGLUSTEP evidence — triton_moe/deep_gemm_moe declare it but
  are non-FP4 kernels, not --moe-backend NVFP4 options. Verified the 8
  NvFp4MoeBackend->experts mappings: only VLLM_CUTLASS and MARLIN support it.
- S2: reconcile the 26 GiB KV figure (measured available KV, after activation
  + cudagraph reserve) vs the naive 88-58.6 budget.
- S5: add caveats + promote_to_strong_when to all three hw.step37.* claims.
- Nits: SWIGLUSTEP gloss at first mention, TTFT-queueing note, read-order
  wording, bump claims.yaml last_updated to 2026-05-28.

Verified against the live serving instance: running args == documented command,
'Using VLLM_CUTLASS NvFp4 MoE backend', PYNCCL all-reduce, 0 Marlin warnings,
GPU topo NODE (no NVLink). All numbers reconciled to raw vllm bench output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
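
The two reconciled figures in this audit (B1 and S2) can be re-derived with plain arithmetic. The inputs are the numbers quoted in the commit message; treating 88 and 58.6 as GiB, like the 26 GiB measurement, is an assumption made here for the comparison.

```python
def speedup(baseline_tok_s, improved_tok_s):
    """Ratio of improved to baseline aggregate output throughput."""
    return improved_tok_s / baseline_tok_s


# B1: cudagraph vs eager on aggregate output tok/s (530 -> 963).
cudagraph_vs_eager = speedup(530, 963)

# S2: naive budget (88 - 58.6) vs the measured available KV (26).
naive_kv_budget = 88 - 58.6
measured_kv = 26
# The gap is what the commit attributes to activation + cudagraph reserve.
reserve_gap = naive_kv_budget - measured_kv

print(f"speedup: {cudagraph_vs_eager:.2f}x")
print(f"naive KV budget: {naive_kv_budget:.1f}, gap to measured: {reserve_gap:.1f}")
```

The ratio rounds to the 1.8x stated in the headline, and the naive budget exceeds the measured figure by roughly 3.4 in the same units.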
…level-independent)

The throughput battery ran at default reasoning_effort (unset), random data.
Decode tok/s is a per-token rate, identical across low/medium/high; reasoning
level changes tokens-per-task (end-to-end latency), measured by the microbench.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
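
The level-independence argument can be illustrated numerically: the decode rate stays fixed while the token count per task varies. The ~99 tok/s rate and the 10 ms TPOT come from the throughput commit; the per-level token counts below are hypothetical placeholders, not measurements.

```python
def decode_rate_from_tpot(tpot_ms):
    """Tokens per second implied by a time-per-output-token in milliseconds."""
    return 1000.0 / tpot_ms


def decode_seconds(output_tokens, tok_per_s):
    """Decode time for one task at a fixed per-token rate (ignores TTFT)."""
    return output_tokens / tok_per_s


# Hypothetical tokens-per-task for each reasoning level (illustrative only).
HYPOTHETICAL_TOKENS = {"low": 500, "medium": 2000, "high": 8000}

rate = 99.0  # single-stream decode tok/s reported above
latency = {
    level: decode_seconds(tokens, rate)
    for level, tokens in HYPOTHETICAL_TOKENS.items()
}
```

A 10 ms TPOT implies 100 tok/s, in line with the reported ~99 tok/s once rounding of the TPOT is allowed for. The per-level latencies differ only because the token counts differ, which is the quantity the microbench measures.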
Lightheartdevs force-pushed the add-step3p7-flash-nvfp4-microbench branch from 1d89bc0 to c741308 (May 29, 2026 03:41)
…r notes

Usability/citability audit pass — make the entry reproducible and citable to the
house bar (qwen3.6-q8-fleet sibling):
- README: add a copy-paste "Reproduce (zero -> same numbers)" section (weights
  download via curl, image pull + full digest, launch, smoke gate, bench); full
  image digest in env table; expanded read order.
- throughput.md: give the exact "vllm bench serve" invocation (tokenizer path,
  dataset, lens, concurrency) so every row is regenerable; link bench-raw.txt.
- bench-raw.txt (NEW): raw per-cell vllm-bench result lines + exact params;
  notes the eager-vs-cudagraph conc=1 cell-config difference honestly.
- manifest.json (NEW): machine-readable provenance (hardware, image digest,
  model, serving config, throughput grid, smoke result, claim ids).
- NOTES-FOR-REVIEWERS.md (NEW): limitations + promote-to-strong conditions.
- findings.md: paste the actual nvidia-smi topo matrix (keystone evidence).
- hardware-tests/README.md: add the bundle to the coverage table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Lightheartdevs force-pushed the add-step3p7-flash-nvfp4-microbench branch from 9a31055 to 83f332c (May 29, 2026 03:51)
…accuracy reconcile, script help)

[P2] run_microbench.sh: fail fast when --reasoning-effort is set but the effort
isn't in the label. Run names + the idempotent skip are keyed by label only, so
sweeping low/medium/high under one label silently skipped later efforts as
"already complete". Guard exits 2 with a corrected command. (Kept label-keyed
naming rather than appending effort to run names, which would break the
grade/summarize globs.)
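
The guard logic, sketched in Python for illustration (the actual guard lives in run_microbench.sh). The function name, the message wording, and the suggested label suffix are hypothetical; only the rule (effort set but absent from the label means exit code 2) comes from the commit message.

```python
def effort_label_guard(label, effort=None):
    """Return (exit_code, hint): 0 when the run may proceed, 2 otherwise."""
    if effort is None or effort in label:
        return 0, ""
    # Run names and the idempotent skip are keyed by label only, so an effort
    # missing from the label would be skipped as "already complete".
    hint = (
        f"reasoning effort {effort!r} is not in label {label!r}; "
        f"use a label such as {label}-{effort}"
    )
    return 2, hint
```

For example, `effort_label_guard("step37-nvfp4", "low")` returns code 2 with a hint, while `effort_label_guard("step37-nvfp4-low", "low")` returns `(0, "")`.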

[P2] Reconcile smoke accuracy with the source of truth (manifest.json): the
cudagraph config scored field accuracy 0.95 (19/20); the earlier eager-config
smoke scored 1.0 (20/20). README and findings previously claimed 1.0 for the
cudagraph config. Both now state 0.95 (cudagraph) / 1.0 (eager), note the
1-field delta is temp=0.3 variance, and point at manifest.json smoke_test.

[P3] Update script help for the new optional reasoning-effort arg: usage lines
and inline EOF help in run_microbench.sh and smoke_test.sh, plus smoke's printed
"Next" command now carries the effort through (and reminds it must be in the label).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Lightheartdevs Lightheartdevs merged commit 251ed4b into main May 29, 2026
1 check passed