[NV] Add GitHub Action to collect SPEED-Bench AL matrix #1650
Push-button (workflow_dispatch) collection of the DeepSeek-V4-Pro SPEED-Bench acceptance-length matrix (thinking on/off x MTP 1-8) on self-hosted B300 runners, optionally opening a PR that updates benchmarks/speedbench-reference-al.yaml.

- benchmarks/single_node/dsv4_fp4_b300_vllm_speedbench_matrix.sh: per (thinking, MTP) cell, serve vLLM, run SPEED-Bench, derive AL from /metrics, and emit the YAML matrix. Serves from MODEL_PATH (the local pre-staged weights resolved by the launcher), falling back to MODEL for a standalone local run. Carries a temporary --chat-template-kwargs shim until vllm-project/vllm#44244 lands in the benchmark image (idempotent, applied only for thinking-on cells).
- runners/launch_b300-nv.sh: add opt-in BENCH_SCRIPT_OVERRIDE and SALLOC_TIME_LIMIT hooks; both default to the prior behavior.
- .github/workflows/speedbench-al.yml: workflow_dispatch entry point; MODEL is the HF id so the launcher resolves the staged MODEL_PATH.
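A minimal sketch of the opt-in defaults this commit describes, assuming plain `${VAR:-default}` expansion. The function names are hypothetical and exist only for illustration; the launcher and collector may inline these expansions differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; only the ${VAR:-default} expansions mirror the PR text.

# Prefer the launcher-resolved local weights, else fall back to MODEL.
resolve_serve_model() {
    printf '%s\n' "${MODEL_PATH:-$MODEL}"
}

# Slurm allocation time in minutes; 180 is the prior default.
resolve_time_limit() {
    printf '%s\n' "${SALLOC_TIME_LIMIT:-180}"
}

# With no override set, keep whatever script the launcher auto-selected.
resolve_bench_script() {
    local auto_selected="$1"
    printf '%s\n' "${BENCH_SCRIPT_OVERRIDE:-$auto_selected}"
}
```

Callers that set none of MODEL_PATH, SALLOC_TIME_LIMIT, or BENCH_SCRIPT_OVERRIDE get the fallback branch of every expansion, which is what keeps the hooks backward compatible.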
Make the workflow default to Option 1 (upload the AL matrix as an artifact for manual review/paste) rather than auto-opening a PR. The auto-PR path stays available as an opt-in (open-pr: true), but keeping it off by default avoids exposing a write-scoped PAT on the self-hosted runner and matches the repo's artifact-collection convention.
    # HF id; its basename (DeepSeek-V4-Pro) is in the launcher's STAGED_MODELS, so
    # the launcher resolves MODEL_PATH to the pre-staged local weights and mounts
    # them. The collector serves from MODEL_PATH (see SERVE_MODEL), so no download.
    MODEL: deepseek-ai/DeepSeek-V4-Pro
Can we please change the model to ${{ inputs.model }}? This requires adding a new input, which for now can default to deepseek-ai/DeepSeek-V4-Pro.
Remember that this change also requires dynamically setting model_prefix, exp_name, benchmark_script_override, and the artifact names. As a result, the Create PR step will change as well.
Made model_prefix a second input (default dsv4) instead of deriving it, to match how the repo already treats it as an explicit field (configs + launcher branch on it).
Address review:

- Model is now a workflow input (model + model-prefix, default deepseek-ai/DeepSeek-V4-Pro / dsv4). MODEL, MODEL_PREFIX, EXP_NAME, BENCH_SCRIPT_OVERRIDE, artifact names and the Create-PR branch/title/body are all derived from those inputs. The emitted YAML top-level key is now derived from the model (MODEL_KEY, defaults to the model basename lowercased).
- Move the collector to benchmarks/single_node/speedbench/dsv4_fp4_b300_vllm.sh and fix its benchmark_lib.sh source path (../ -> ../../) for the deeper dir.
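The MODEL_KEY default described above (model basename, lowercased) could be derived roughly as follows. This is a sketch under that one stated rule; `derive_model_key` is a hypothetical name, and the actual script may compute the key in a different way.

```shell
#!/usr/bin/env bash
# Hypothetical helper: an explicit MODEL_KEY wins; otherwise use the
# lowercased basename of the model id or path passed as $1.
derive_model_key() {
    local model="$1"
    local base="${model##*/}"    # strip everything up to the last '/'
    local lowered
    lowered="$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]')"
    printf '%s\n' "${MODEL_KEY:-$lowered}"
}
```

For the default input, `derive_model_key deepseek-ai/DeepSeek-V4-Pro` prints `deepseek-v4-pro` when MODEL_KEY is unset.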
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit d595d49.
        al="N/A"
    fi
    echo " -> thinking=$mode MTP=$mtp AL=$al (accepted=$delta_acc drafts=$delta_drf)"
    AL_RESULT["${mode}_${mtp}"]="$al"
Bench failure still records AL
High Severity
The collector uses set -uo pipefail without errexit and never checks whether vllm bench serve succeeded. After a failed or partial benchmark, it still diffs spec-decode metrics and may write numeric acceptance-length values into the golden YAML instead of failing the run or marking the cell unusable.
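One way to address this finding, sketched under assumptions: capture the benchmark's exit status and refuse to compute a numeric AL when it is non-zero. `compute_al` and its argument order are hypothetical; the formula (accepted / drafts + 1) is the one stated in the PR description.

```shell
#!/usr/bin/env bash
# compute_al ACCEPTED_DELTA DRAFTS_DELTA BENCH_RC   (hypothetical helper)
# Prints the acceptance length, or N/A with a non-zero status when the
# cell must not be written as a numeric value.
compute_al() {
    local accepted="$1" drafts="$2" bench_rc="$3"
    # A failed or partial benchmark makes the counter deltas meaningless.
    if [ "$bench_rc" -ne 0 ]; then
        echo "N/A"
        return 1
    fi
    # No drafts recorded: avoid dividing by zero.
    if ! awk -v d="$drafts" 'BEGIN { exit !(d + 0 > 0) }'; then
        echo "N/A"
        return 1
    fi
    awk -v a="$accepted" -v d="$drafts" 'BEGIN { printf "%.2f\n", a / d + 1 }'
}

# Intended call pattern (benchmark arguments elided on purpose):
#   vllm bench serve ... ; bench_rc=$?
#   al="$(compute_al "$delta_acc" "$delta_drf" "$bench_rc")" || cell_failed=1
```

Whether a failed cell should abort the whole run or only be marked N/A is a policy choice for the authors; the non-zero return lets the caller decide either way.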


Summary
Adds a push-button GitHub Action that produces the DeepSeek-V4-Pro SPEED-Bench acceptance-length (AL) matrix, thinking_on/off × MTP (num_speculative_tokens) 1–8, on the self-hosted B300 runners, and optionally opens a PR that updates benchmarks/speedbench-reference-al.yaml. This is the AL-distribution collection that the synthetic-acceptance MTP framework consumes as its golden reference.

Triggered manually via workflow_dispatch (MTP levels, thinking modes, category, output length, allocation time, and whether to auto-open a PR are all inputs).

What's in this PR
- benchmarks/single_node/dsv4_fp4_b300_vllm_speedbench_matrix.sh: for each (thinking, MTP) cell, start a vLLM server, run SPEED-Bench on one category, derive AL from /metrics (accepted_tokens / drafts + 1), and emit a YAML matrix identical in shape to benchmarks/speedbench-reference-al.yaml.
- runners/launch_b300-nv.sh: adds opt-in BENCH_SCRIPT_OVERRIDE (run a specific script instead of the auto-selected throughput benchmark) and SALLOC_TIME_LIMIT (raise the Slurm time limit; the 16 server starts need more than the 180-min default).
- .github/workflows/speedbench-al.yml: workflow_dispatch entry point that passes the matrix tunables into the launcher, uploads the matrix + server logs as artifacts, and optionally opens a PR updating the reference YAML.

How it fits together
The workflow only passes parameters and opens the PR; the launcher acquires the GPU node and enters the container; the collector runs the measurement. This reuses the existing single-node launcher path rather than duplicating the salloc/srun/enroot/mount logic.

Model path handling
The collector serves from SERVE_MODEL="${MODEL_PATH:-$MODEL}":

- In CI, the workflow sets MODEL to the HF id deepseek-ai/DeepSeek-V4-Pro; the launcher resolves MODEL_PATH to the pre-staged local weights (its basename is in STAGED_MODELS) and mounts them, so the collector serves locally with no download.
- In a standalone local run, MODEL_PATH is unset and MODEL is itself a local path, so the same script works unchanged.

Measurement config (for reviewers)
- Output length is 4096 tokens (--speed-bench-output-len 4096), exposed as the workflow output-len input. This is the recommended setting and is applied to every cell.
- --max-model-len 16384 is the server's total context budget (real SPEED-Bench prompt length + the 4096-token output), not the OSL. It is a workload constant for this benchmark (there is no ISL/OSL sweep here), which is why it is fixed rather than injected per-config like the throughput recipes.
- Category is coding; thinking-on cells use chat_template_kwargs = {"thinking": true, "reasoning_effort": "high"} to match the golden/production config.

Deliberate, documented exception: temporary --chat-template-kwargs shim

The collector contains a small monkeypatch shim (the apply_chat_template_kwargs_shim function) that patches vllm.benchmarks at runtime to add a real --chat-template-kwargs CLI option. This is non-typical for this repo (no other script patches a third-party library), so calling it out explicitly:

- speed_bench/CustomDataset pre-renders the chat template client-side without chat_template_kwargs and posts to /v1/completions, so thinking mode cannot be enabled via --extra-body or --default-chat-template-kwargs. The shim wires a proper --chat-template-kwargs through get_samples into CustomDataset.sample's apply_chat_template.
- The shim exit 1s the whole run if the patch fails rather than silently producing wrong (non-thinking) numbers.
- The shim is temporary: it goes away once vllm-project/vllm#44244 lands in the benchmark image, and the script carries a TODO for that removal.

This is the only part that does not look like the rest of the repo; it is a known trade-off, not an oversight.
Backward compatibility
Both launcher hooks are pure opt-in (${BENCH_SCRIPT_OVERRIDE:-}, ${SALLOC_TIME_LIMIT:-180}): existing callers that don't set them get exactly the previous behavior. This follows the repo's existing ${VAR:-default} switch pattern (EVAL_ONLY, RUN_EVAL, etc.).

Test plan
- workflow_dispatch with a trimmed matrix (mtp-list: "1", thinking-modes: "off", open-pr: false) to validate the full CI chain (model loads, dataset downloads, AL is computed, artifact uploads).
- [...] (thinking_on: 1.79, thinking_off: 1.92).
- Full matrix (mtp-list: "1 2 3 4 5 6 7 8", thinking-modes: "off on") with open-pr: true; review the auto-opened reference-YAML PR before merging.

Note
Medium Risk
Touches GPU CI/Slurm launch paths and patches vLLM in-container for thinking-mode benchmarks; golden reference YAML updates are manual-review gated but wrong AL values would affect downstream synthetic-acceptance tests.
Overview
Adds a manual GitHub Action (speedbench-al.yml) that runs on self-hosted B300 runners to collect a SPEED-Bench acceptance-length (AL) reference matrix (thinking_on/thinking_off × MTP num_speculative_tokens), upload speedbench-reference-al.yaml (and server logs), and optionally open a PR that updates benchmarks/speedbench-reference-al.yaml.

The new collector script benchmarks/single_node/speedbench/dsv4_fp4_b300_vllm.sh loops over each matrix cell: start vLLM with MTP speculative config, run SPEED-Bench, derive AL from spec-decode Prometheus metrics, and emit YAML matching the golden reference shape. It includes a temporary runtime patch to vLLM's benchmark CLI so thinking-on cells can pass chat_template_kwargs (until upstream #44244), plus stricter server/GPU cleanup between cells.

runners/launch_b300-nv.sh gains opt-in BENCH_SCRIPT_OVERRIDE (workflow points at the speedbench script instead of auto-picked throughput benches) and SALLOC_TIME_LIMIT (default 180 minutes unchanged) for long multi-server runs.
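To make the "derive AL from spec-decode Prometheus metrics" step concrete, here is a rough sketch that diffs two saved /metrics snapshots (before and after a cell) and applies accepted / drafts + 1. The two counter names are assumptions about what the server exposes, not taken from the PR, and both function names are hypothetical; adjust the names to match the real /metrics output.

```shell
#!/usr/bin/env bash
# Counter names are ASSUMPTIONS; override them to match the server's /metrics.
ACCEPTED_METRIC="${ACCEPTED_METRIC:-vllm:spec_decode_num_accepted_tokens_total}"
DRAFTS_METRIC="${DRAFTS_METRIC:-vllm:spec_decode_num_drafts_total}"

# sum_metric NAME: read Prometheus text exposition on stdin and sum every
# sample of counter NAME across its label sets.
sum_metric() {
    awk -v name="$1" '
        /^#/ { next }                          # skip HELP/TYPE comments
        {
            metric = $1
            sub(/[{].*/, "", metric)           # drop the {label="..."} part
            if (metric == name) total += $NF   # sample value is the last field
        }
        END { printf "%d\n", total }
    '
}

# acceptance_length BEFORE_FILE AFTER_FILE: AL over the interval between
# the two snapshots, or N/A when no drafts were recorded.
acceptance_length() {
    local before="$1" after="$2" acc drf
    acc=$(( $(sum_metric "$ACCEPTED_METRIC" < "$after") - $(sum_metric "$ACCEPTED_METRIC" < "$before") ))
    drf=$(( $(sum_metric "$DRAFTS_METRIC" < "$after") - $(sum_metric "$DRAFTS_METRIC" < "$before") ))
    if [ "$drf" -le 0 ]; then
        echo "N/A"
        return 1
    fi
    awk -v a="$acc" -v d="$drf" 'BEGIN { printf "%.2f\n", a / d + 1 }'
}
```

Diffing snapshots rather than reading absolute counters keeps each cell's AL independent of traffic the server handled before the benchmark started, such as warm-up requests.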