[NV] Add GitHub Action to collect SPEED-Bench AL matrix by qiching · Pull Request #1650 · SemiAnalysisAI/InferenceX

qiching · 2026-06-02T21:33:20Z

Summary

Adds a push-button GitHub Action that produces the DeepSeek-V4-Pro SPEED-Bench acceptance-length (AL) matrix — thinking_on/off × MTP (num_speculative_tokens) 1–8 — on the self-hosted B300 runners, and (optionally) opens a PR that updates benchmarks/speedbench-reference-al.yaml. This is the AL-distribution collection that the synthetic-acceptance MTP framework consumes as its golden reference.

Triggered manually via workflow_dispatch (MTP levels, thinking modes, category, output length, allocation time, and whether to auto-open a PR are all inputs).

What's in this PR

| File | Role |
| --- | --- |
| `benchmarks/single_node/dsv4_fp4_b300_vllm_speedbench_matrix.sh` | The AL collector. For each `(thinking, MTP)` cell: start a vLLM server, run SPEED-Bench on one category, derive AL from `/metrics` (`accepted_tokens / drafts + 1`), and emit a YAML matrix identical in shape to `benchmarks/speedbench-reference-al.yaml`. |
| `runners/launch_b300-nv.sh` | Two opt-in hooks (both default to prior behavior): `BENCH_SCRIPT_OVERRIDE` (run a specific script instead of the auto-selected throughput benchmark) and `SALLOC_TIME_LIMIT` (raise the Slurm time limit; the 16 server starts need more than the 180-min default). |
| `.github/workflows/speedbench-al.yml` | `workflow_dispatch` entry point: passes the matrix tunables into the launcher, uploads the matrix + server logs as artifacts, and optionally opens a PR updating the reference YAML. |
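
The AL derivation named in the collector row (`accepted_tokens / drafts + 1`) can be sketched as follows. This is a minimal illustration and not the collector's actual code: `sum_counter` and `compute_al` are hypothetical helper names, and the counter name passed to `sum_counter` is whatever the server's `/metrics` endpoint exposes (none is assumed here).

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from the PR.

# Sum every sample of one Prometheus counter from exposition text on stdin.
# Matches "name{labels} value" and "name value" lines; "# HELP" and "# TYPE"
# lines are skipped because they do not start with the counter name.
sum_counter() {
    NAME="$1" awk '
        index($0, ENVIRON["NAME"]) == 1 {
            rest = substr($0, length(ENVIRON["NAME"]) + 1)
            if (rest ~ /^[{ ]/) total += $NF
        }
        END { printf "%d\n", total }
    '
}

# AL = accepted_tokens / drafts + 1, computed from per-cell counter deltas.
# Prints N/A when no drafts were recorded, to avoid dividing by zero.
compute_al() {
    local accepted="$1" drafts="$2"
    if [ "$drafts" -le 0 ]; then
        echo "N/A"
        return 0
    fi
    awk -v a="$accepted" -v d="$drafts" 'BEGIN { printf "%.2f\n", a / d + 1 }'
}
```

With these made-up deltas, `compute_al 150 100` prints `2.50`. Taking the difference of each counter before and after a cell, rather than its absolute value, keeps one cell's traffic from leaking into the next if a server were reused.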

How it fits together

speedbench-al.yml  --(env: BENCH_SCRIPT_OVERRIDE, SALLOC_TIME_LIMIT, MTP_LIST, ...)-->
  runners/launch_b300-nv.sh  --(salloc + srun --export=ALL into the vLLM container)-->
    dsv4_fp4_b300_vllm_speedbench_matrix.sh  -->  speedbench-reference-al.yaml

The workflow only passes parameters and opens the PR; the launcher acquires the GPU node and enters the container; the collector runs the measurement. This reuses the existing single-node launcher path rather than duplicating the salloc/srun/enroot/mount logic.

Model path handling

The collector serves from SERVE_MODEL="${MODEL_PATH:-$MODEL}":

In CI, the workflow sets MODEL to the HF id deepseek-ai/DeepSeek-V4-Pro; the launcher resolves MODEL_PATH to the pre-staged local weights (its basename is in STAGED_MODELS) and mounts them, so the collector serves locally with no download.
For a standalone local run, MODEL_PATH is unset and MODEL is itself a local path, so the same script works unchanged.

Measurement config (for reviewers)

Max OSL = 4096 (--speed-bench-output-len 4096), exposed as the workflow output-len input. This is the recommended setting and is applied to every cell.
--max-model-len 16384 is the server's total context budget (real SPEED-Bench prompt length + the 4096-token output), not the OSL. It is a workload constant for this benchmark (there is no ISL/OSL sweep here), which is why it is fixed rather than injected per-config like the throughput recipes.
Category defaults to coding; thinking-on cells use chat_template_kwargs = {"thinking": true, "reasoning_effort": "high"} to match the golden/production config.
The reference AL matrix was measured with exactly this config, so the values the Action produces are directly comparable.
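
The relationship between the two limits above is plain arithmetic, sketched here with a hypothetical helper `prompt_budget`. It only illustrates how the 16384-token context budget splits once the 4096-token output is reserved, and is not part of the collector.

```shell
#!/usr/bin/env bash
# Sketch only: prompt_budget is a hypothetical helper.

# Tokens left for the prompt once the output length is reserved.
prompt_budget() {
    local max_model_len="$1" output_len="$2"
    if [ "$output_len" -ge "$max_model_len" ]; then
        echo "output length must be smaller than max model length" >&2
        return 1
    fi
    echo $(( max_model_len - output_len ))
}

prompt_budget 16384 4096   # prints 12288
```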

Deliberate, documented exception: temporary `--chat-template-kwargs` shim

The collector contains a small monkeypatch shim (the apply_chat_template_kwargs_shim function) that patches vllm.benchmarks at runtime to add a real --chat-template-kwargs CLI option. This is non-typical for this repo (no other script patches a third-party library), so calling it out explicitly:

Why it's needed: until vllm-project/vllm#44244 ships in the benchmark image, speed_bench/CustomDataset pre-renders the chat template client-side without chat_template_kwargs and posts to /v1/completions, so thinking mode cannot be enabled via --extra-body or --default-chat-template-kwargs. The shim wires a proper --chat-template-kwargs through get_samples into CustomDataset.sample's apply_chat_template.
Why it's safe: it is idempotent (guarded by a marker check, so re-running is a no-op), is applied only when a thinking-on cell is requested, asserts that its anchors match exactly, and aborts the whole run with exit 1 if the patch fails rather than silently producing wrong (non-thinking) numbers.
Lifecycle: delete the entire shim block once #44244 is released in the benchmark image. It is intentionally self-contained and marked TODO for that removal.
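
The safety properties listed above (marker-guarded idempotence, exact anchor match, hard failure) can be sketched generically as below. This is not the PR's `apply_chat_template_kwargs_shim`; `apply_marked_patch`, its arguments, and the marker text are all hypothetical, and the sketch inserts a single line rather than rewriting Python source the way the real shim does.

```shell
#!/usr/bin/env bash
# Sketch only: apply_marked_patch and its arguments are hypothetical.

# Insert ADDITION on the line after the single line containing ANCHOR.
# MARKER is appended to the inserted line so a re-run can detect it.
apply_marked_patch() {
    local file="$1" anchor="$2" addition="$3" marker="$4"
    local hits tmp

    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }

    # Idempotence guard: a previously applied patch leaves the marker behind.
    if grep -qF -- "$marker" "$file"; then
        return 0
    fi

    # Anchor assertion: refuse to patch unless exactly one line matches.
    hits=$(grep -cF -- "$anchor" "$file")
    if [ "$hits" -ne 1 ]; then
        echo "anchor matched $hits lines in $file, expected 1" >&2
        return 1
    fi

    # Write to a temp file first, then copy back so file permissions survive.
    tmp=$(mktemp) || return 1
    ANCHOR="$anchor" ADDITION="$addition" MARKER="$marker" awk '
        { print }
        index($0, ENVIRON["ANCHOR"]) > 0 {
            print ENVIRON["ADDITION"] "  " ENVIRON["MARKER"]
        }
    ' "$file" > "$tmp" && cat "$tmp" > "$file" && rm -f "$tmp"
}
```

A caller that wants the "fail the whole run" behavior would invoke it as `apply_marked_patch "$target" "$anchor" "$line" "# SHIM_APPLIED" || exit 1`, where all four values are placeholders.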

This is the only part that does not look like the rest of the repo; it is a known trade-off, not an oversight.

Backward compatibility

Both launcher hooks are pure opt-in (${BENCH_SCRIPT_OVERRIDE:-}, ${SALLOC_TIME_LIMIT:-180}) — existing callers that don't set them get exactly the previous behavior. This follows the repo's existing ${VAR:-default} switch pattern (EVAL_ONLY, RUN_EVAL, etc.).
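
The opt-in pattern described above can be sketched as follows. Only the two variable names and the 180-minute default come from this description; the function names are hypothetical, and the launcher may structure the check differently (for example with an explicit non-empty test on `${BENCH_SCRIPT_OVERRIDE:-}`).

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# Script to run: the override if set and non-empty, else the auto-selected one.
select_bench_script() {
    local auto_selected="$1"
    echo "${BENCH_SCRIPT_OVERRIDE:-$auto_selected}"
}

# Slurm time limit in minutes, falling back to the previous 180-minute value.
salloc_time_limit() {
    echo "${SALLOC_TIME_LIMIT:-180}"
}
```

Because `:-` treats an unset variable and an empty one the same way, a caller that exports neither variable gets the auto-selected script and the 180-minute limit.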

Test plan

Test via workflow_dispatch with a trimmed matrix (mtp-list: "1", thinking-modes: "off", open-pr: false) to validate the full CI chain (model loads, dataset downloads, AL is computed, artifact uploads).
Confirm the produced YAML matches the expected shape and that thinking-on/off level-1 AL values are sane (locally observed: thinking_on: 1.79, thinking_off: 1.92).
Full run (mtp-list: "1 2 3 4 5 6 7 8", thinking-modes: "off on") with open-pr: true; review the auto-opened reference-YAML PR before merging.

Note

Medium Risk
Touches GPU CI/Slurm launch paths and patches vLLM in-container for thinking-mode benchmarks; golden reference YAML updates are manual-review gated but wrong AL values would affect downstream synthetic-acceptance tests.

Overview
Adds a manual GitHub Action (speedbench-al.yml) that runs on self-hosted B300 runners to collect a SPEED-Bench acceptance-length (AL) reference matrix (thinking_on / thinking_off × MTP num_speculative_tokens), upload speedbench-reference-al.yaml (and server logs), and optionally open a PR that updates benchmarks/speedbench-reference-al.yaml.

The new collector script benchmarks/single_node/speedbench/dsv4_fp4_b300_vllm.sh loops each matrix cell: start vLLM with MTP speculative config, run SPEED-Bench, derive AL from spec-decode Prometheus metrics, and emit YAML matching the golden reference shape. It includes a temporary runtime patch to vLLM’s benchmark CLI so thinking-on cells can pass chat_template_kwargs (until upstream #44244), plus stricter server/GPU cleanup between cells.

runners/launch_b300-nv.sh gains opt-in BENCH_SCRIPT_OVERRIDE (workflow points at the speedbench script instead of auto-picked throughput benches) and SALLOC_TIME_LIMIT (default 180 minutes unchanged) for long multi-server runs.

Reviewed by Cursor Bugbot for commit d595d49.

Push-button (workflow_dispatch) collection of the DeepSeek-V4-Pro SPEED-Bench acceptance-length matrix (thinking on/off x MTP 1-8) on self-hosted B300 runners, optionally opening a PR that updates benchmarks/speedbench-reference-al.yaml.

- benchmarks/single_node/dsv4_fp4_b300_vllm_speedbench_matrix.sh: per (thinking, MTP) cell, serve vLLM, run SPEED-Bench, derive AL from /metrics, and emit the YAML matrix. Serves from MODEL_PATH (the local pre-staged weights resolved by the launcher), falling back to MODEL for a standalone local run. Carries a temporary --chat-template-kwargs shim until vllm-project/vllm#44244 lands in the benchmark image (idempotent, applied only for thinking-on cells).
- runners/launch_b300-nv.sh: add opt-in BENCH_SCRIPT_OVERRIDE and SALLOC_TIME_LIMIT hooks; both default to the prior behavior.
- .github/workflows/speedbench-al.yml: workflow_dispatch entry point; MODEL is the HF id so the launcher resolves the staged MODEL_PATH.

Make the workflow default to Option 1 (upload the AL matrix as an artifact for manual review/paste) rather than auto-opening a PR. The auto-PR path stays available as an opt-in (open-pr: true), but keeping it off by default avoids exposing a write-scoped PAT on the self-hosted runner and matches the repo's artifact-collection convention.

Ankur-singh · 2026-06-04T00:22:18Z

+  # HF id; its basename (DeepSeek-V4-Pro) is in the launcher's STAGED_MODELS, so
+  # the launcher resolves MODEL_PATH to the pre-staged local weights and mounts
+  # them. The collector serves from MODEL_PATH (see SERVE_MODEL), so no download.
+  MODEL: deepseek-ai/DeepSeek-V4-Pro


Can we please update the model to be ${{ inputs.model }}? This would require adding a new input; for now it can default to deepseek-ai/DeepSeek-V4-Pro.

Remember, this change will require that we also dynamically set the model_prefix, exp_name, benchmark_script_override, and the artifact names. As a result, the Create PR step will also change.

Made model_prefix a second input (default dsv4) instead of deriving it, to match how the repo already treats it as an explicit field (configs + launcher branch on it).

Address review:

- Model is now a workflow input (model + model-prefix, default deepseek-ai/DeepSeek-V4-Pro / dsv4). MODEL, MODEL_PREFIX, EXP_NAME, BENCH_SCRIPT_OVERRIDE, artifact names and the Create-PR branch/title/body are all derived from those inputs. The emitted YAML top-level key is now derived from the model (MODEL_KEY, defaults to the model basename lowercased).
- Move the collector to benchmarks/single_node/speedbench/dsv4_fp4_b300_vllm.sh and fix its benchmark_lib.sh source path (../ -> ../../) for the deeper dir.
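
The MODEL_KEY default described in this commit message (model basename, lowercased) can be sketched as below. `derive_model_key` is a hypothetical helper, not the collector's code; only the variable name MODEL_KEY and the lowercased-basename rule come from the message.

```shell
#!/usr/bin/env bash
# Sketch only: derive_model_key is a hypothetical helper.

# Honors an explicit MODEL_KEY; otherwise lowercases the model's basename.
derive_model_key() {
    local model="$1"
    if [ -n "${MODEL_KEY:-}" ]; then
        echo "$MODEL_KEY"
    else
        basename "$model" | tr '[:upper:]' '[:lower:]'
    fi
}

# Prints deepseek-v4-pro when MODEL_KEY is unset.
derive_model_key deepseek-ai/DeepSeek-V4-Pro
```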

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


cursor · 2026-06-04T22:22:37Z

+        al="N/A"
+    fi
+    echo "  -> thinking=$mode MTP=$mtp AL=$al (accepted=$delta_acc drafts=$delta_drf)"
+    AL_RESULT["${mode}_${mtp}"]="$al"


Bench failure still records AL

High Severity

The collector uses set -uo pipefail without errexit and never checks whether vllm bench serve succeeded. After a failed or partial benchmark, it still diffs spec-decode metrics and may write numeric acceptance-length values into the golden YAML instead of failing the run or marking the cell unusable.
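
One possible way to address this, not necessarily the fix adopted in the PR, is to gate the recorded value on the benchmark's exit status. The helper name `guard_al` and the sample value are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: guard_al is hypothetical and not taken from the PR.

# Prints the computed AL only when the benchmark command exited successfully;
# otherwise marks the cell as unusable.
guard_al() {
    local bench_rc="$1" computed_al="$2"
    if [ "$bench_rc" -ne 0 ]; then
        echo "N/A"
    else
        echo "$computed_al"
    fi
}

false                        # stand-in for a failed benchmark command
bench_rc=$?
guard_al "$bench_rc" "2.50"  # prints N/A
```

A stricter variant would exit non-zero instead of printing `N/A`, so that a partial run can never reach the step that writes the reference YAML.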

github-project-automation Bot added this to InferenceMAX Board Jun 2, 2026

xinli-sw mentioned this pull request Jun 2, 2026

[Tracking Issue] Synthetic Acceptance for MTP Benchmarks #1651

Open

3 tasks

qiching changed the title ~~Add GitHub Action to collect SPEED-Bench AL matrix~~ [NV] Add GitHub Action to collect SPEED-Bench AL matrix Jun 3, 2026

Ankur-singh reviewed Jun 4, 2026

qiching marked this pull request as ready for review June 4, 2026 22:21

qiching requested a review from a team June 4, 2026 22:21

claude Bot reviewed Jun 4, 2026

cursor Bot reviewed Jun 4, 2026

functionstackx mentioned this pull request Jun 5, 2026

Add SPEED-Bench reference synthetic AL values for DeepSeek-V4-Pro MTP 1-8 #1592

Open

qiching wants to merge 3 commits into SemiAnalysisAI:main from qiching:albecheng/speedbench-al-action
