[NV] Add GitHub Action to collect SPEED-Bench AL matrix #1650

Open
qiching wants to merge 3 commits into SemiAnalysisAI:main from qiching:albecheng/speedbench-al-action

@qiching qiching commented Jun 2, 2026

Summary

Adds a push-button GitHub Action that produces the DeepSeek-V4-Pro SPEED-Bench acceptance-length (AL) matrix (thinking_on/off × MTP num_speculative_tokens 1–8) on the self-hosted B300 runners, and (optionally) opens a PR that updates benchmarks/speedbench-reference-al.yaml. This is the AL-distribution collection that the synthetic-acceptance MTP framework consumes as its golden reference.

Triggered manually via workflow_dispatch (MTP levels, thinking modes, category, output length, allocation time, and whether to auto-open a PR are all inputs).

What's in this PR

| File | Role |
| --- | --- |
| benchmarks/single_node/dsv4_fp4_b300_vllm_speedbench_matrix.sh | The AL collector. For each (thinking, MTP) cell: start a vLLM server, run SPEED-Bench on one category, derive AL from /metrics (accepted_tokens / drafts + 1), and emit a YAML matrix identical in shape to benchmarks/speedbench-reference-al.yaml. |
| runners/launch_b300-nv.sh | Two opt-in hooks (both default to prior behavior): BENCH_SCRIPT_OVERRIDE (run a specific script instead of the auto-selected throughput benchmark) and SALLOC_TIME_LIMIT (raise the Slurm time limit; the 16 server starts need more than the 180-min default). |
| .github/workflows/speedbench-al.yml | workflow_dispatch entry point: passes the matrix tunables into the launcher, uploads the matrix + server logs as artifacts, and optionally opens a PR updating the reference YAML. |
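
For reviewers, the AL derivation in the collector row (accepted_tokens / drafts + 1) can be sketched as below. This is a minimal illustration, not the collector's code: the function name `compute_al` and its four positional arguments (counter values sampled before and after a benchmark run) are assumptions. The "N/A" fallback mirrors the `al="N/A"` branch visible in the review snippet further down.

```bash
#!/usr/bin/env bash
# Sketch only: compute_al and its argument order are hypothetical.
# Usage: compute_al ACCEPTED_BEFORE ACCEPTED_AFTER DRAFTS_BEFORE DRAFTS_AFTER
compute_al() {
  # awk handles both integer and float-formatted counter values.
  awk -v a0="$1" -v a1="$2" -v d0="$3" -v d1="$4" 'BEGIN {
    delta_acc = a1 - a0          # tokens accepted during the run
    delta_drf = d1 - d0          # draft steps issued during the run
    if (delta_drf <= 0) {        # no drafts observed: the cell is unusable
      print "N/A"
      exit
    }
    # AL = accepted / drafts + 1, printed to two decimals
    printf "%.2f\n", delta_acc / delta_drf + 1
  }'
}

# 920 accepted tokens over 1000 drafts
compute_al 1000 1920 500 1500
```

Diffing the counters around each run, rather than reading absolute values, keeps one cell's traffic from leaking into the next if a server is ever reused.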

How it fits together

speedbench-al.yml  --(env: BENCH_SCRIPT_OVERRIDE, SALLOC_TIME_LIMIT, MTP_LIST, ...)-->
  runners/launch_b300-nv.sh  --(salloc + srun --export=ALL into the vLLM container)-->
    dsv4_fp4_b300_vllm_speedbench_matrix.sh  -->  speedbench-reference-al.yaml

The workflow only passes parameters and opens the PR; the launcher acquires the GPU node and enters the container; the collector runs the measurement. This reuses the existing single-node launcher path rather than duplicating the salloc/srun/enroot/mount logic.

Model path handling

The collector serves from SERVE_MODEL="${MODEL_PATH:-$MODEL}":

  • In CI, the workflow sets MODEL to the HF id deepseek-ai/DeepSeek-V4-Pro; the launcher resolves MODEL_PATH to the pre-staged local weights (its basename is in STAGED_MODELS) and mounts them, so the collector serves locally with no download.
  • For a standalone local run, MODEL_PATH is unset and MODEL is itself a local path, so the same script works unchanged.
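
The fallback above relies on the shell's `${VAR:-default}` expansion. A minimal sketch of both cases follows; the wrapper function `resolve_serve_model` and the two local paths are hypothetical and only illustrate the expansion.

```bash
#!/usr/bin/env bash
# Sketch only: resolve_serve_model is a hypothetical wrapper around the
# SERVE_MODEL="${MODEL_PATH:-$MODEL}" expansion described above.
resolve_serve_model() {
  # MODEL_PATH wins when it is set and non-empty; otherwise MODEL is used.
  echo "${MODEL_PATH:-$MODEL}"
}

# CI-like case: the launcher has exported a staged path (path is made up).
( MODEL="deepseek-ai/DeepSeek-V4-Pro"; MODEL_PATH="/staged/DeepSeek-V4-Pro"; resolve_serve_model )

# Standalone case: MODEL_PATH is unset and MODEL is itself a local path.
( unset MODEL_PATH; MODEL="/data/models/DeepSeek-V4-Pro"; resolve_serve_model )
```

Note that `:-` also falls back when MODEL_PATH is set but empty, so an empty export from the launcher behaves like an unset one.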

Measurement config (for reviewers)

  • Max OSL = 4096 (--speed-bench-output-len 4096), exposed as the workflow output-len input. This is the recommended setting and is applied to every cell.
  • --max-model-len 16384 is the server's total context budget (real SPEED-Bench prompt length + the 4096-token output), not the OSL. It is a workload constant for this benchmark (there is no ISL/OSL sweep here), which is why it is fixed rather than injected per-config like the throughput recipes.
  • Category defaults to coding; thinking-on cells use chat_template_kwargs = {"thinking": true, "reasoning_effort": "high"} to match the golden/production config.
  • The reference AL matrix was measured with exactly this config, so the values the Action produces are directly comparable.
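
As a rough illustration of the per-mode setting in the list above, the helper below maps a thinking mode to the chat_template_kwargs JSON. The function name is hypothetical, and the empty result for thinking-off is an assumption: the description only specifies the thinking-on payload.

```bash
#!/usr/bin/env bash
# Sketch only: chat_template_kwargs_for is a hypothetical helper.
chat_template_kwargs_for() {
  case "$1" in
    on)
      # Payload quoted from the measurement config above.
      echo '{"thinking": true, "reasoning_effort": "high"}'
      ;;
    off)
      # Assumption: thinking-off cells pass no kwargs at all.
      echo ''
      ;;
    *)
      echo "unknown thinking mode: $1" >&2
      return 1
      ;;
  esac
}

chat_template_kwargs_for on
```

Rejecting unknown modes with a non-zero status keeps a typo in the thinking-modes input from silently running as a non-thinking cell.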

Deliberate, documented exception: temporary --chat-template-kwargs shim

The collector contains a small monkeypatch shim (the apply_chat_template_kwargs_shim function) that patches vllm.benchmarks at runtime to add a real --chat-template-kwargs CLI option. This is non-typical for this repo (no other script patches a third-party library), so calling it out explicitly:

  • Why it's needed: until vllm-project/vllm#44244 ships in the benchmark image, speed_bench/CustomDataset pre-renders the chat template client-side without chat_template_kwargs and posts to /v1/completions, so thinking mode cannot be enabled via --extra-body or --default-chat-template-kwargs. The shim wires a proper --chat-template-kwargs through get_samples into CustomDataset.sample's apply_chat_template.
  • Why it's safe: it is idempotent (guarded by a marker check, so re-running is a no-op), is applied only when a thinking-on cell is requested, asserts its anchors match exactly, and aborts the whole run with exit 1 if the patch fails rather than silently producing wrong (non-thinking) numbers.
  • Lifecycle: delete the entire shim block once #44244 is released in the benchmark image. It is intentionally self-contained and marked TODO for that removal.

This is the only part that does not look like the rest of the repo; it is a known trade-off, not an oversight.
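
The safety properties listed above (marker guard, exact anchor match, hard failure) follow a common patch-once pattern. The sketch below shows that pattern on a plain text file. It is not the actual shim, which patches vllm.benchmarks; the function name, marker string, and arguments are all hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: a generic marker-guarded, anchor-checked patch helper.
SHIM_MARKER="# shim-applied (hypothetical marker)"

# Usage: apply_shim_once TARGET_FILE ANCHOR_LINE NEW_LINE
# Inserts the marker and NEW_LINE after the single line equal to ANCHOR_LINE.
apply_shim_once() {
  local target="$1" anchor="$2" new_line="$3" hits tmp

  # Idempotence guard: a second call is a no-op.
  if grep -qF -- "$SHIM_MARKER" "$target"; then
    return 0
  fi

  # The anchor must match exactly one whole line, or the patch is refused.
  hits=$(grep -cxF -- "$anchor" "$target")
  if [ "$hits" -ne 1 ]; then
    echo "shim: expected exactly 1 anchor match, found $hits" >&2
    return 1
  fi

  # Write to a temp file first so a failed rewrite never truncates the target.
  tmp=$(mktemp) || return 1
  awk -v anchor="$anchor" -v marker="$SHIM_MARKER" -v new="$new_line" '
    { print }
    $0 == anchor { print marker; print new }
  ' "$target" > "$tmp" && mv "$tmp" "$target"
}
```

A caller would treat a non-zero return as fatal, for example `apply_shim_once "$file" "$anchor" "$line" || exit 1`, which matches the fail-loudly behavior described above.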

Backward compatibility

Both launcher hooks are pure opt-in (${BENCH_SCRIPT_OVERRIDE:-}, ${SALLOC_TIME_LIMIT:-180}) — existing callers that don't set them get exactly the previous behavior. This follows the repo's existing ${VAR:-default} switch pattern (EVAL_ONLY, RUN_EVAL, etc.).
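
A minimal sketch of the two opt-in switches follows. The helper names and the example script paths are hypothetical; only the variable names and the 180-minute default come from the description above.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; the expansions mirror
# ${BENCH_SCRIPT_OVERRIDE:-} and ${SALLOC_TIME_LIMIT:-180}.

# Usage: select_bench_script AUTO_SELECTED_SCRIPT
select_bench_script() {
  local auto_selected="$1"
  # The override wins only when it is set and non-empty.
  echo "${BENCH_SCRIPT_OVERRIDE:-$auto_selected}"
}

salloc_time_limit() {
  # Minutes; 180 is the prior default and stays in effect when unset.
  echo "${SALLOC_TIME_LIMIT:-180}"
}

# Existing caller: neither variable set, so behavior is unchanged.
( unset BENCH_SCRIPT_OVERRIDE SALLOC_TIME_LIMIT
  select_bench_script "benchmarks/auto_selected.sh"; salloc_time_limit )

# Opt-in caller (values are illustrative).
( BENCH_SCRIPT_OVERRIDE="benchmarks/custom.sh"; SALLOC_TIME_LIMIT=480
  select_bench_script "benchmarks/auto_selected.sh"; salloc_time_limit )
```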

Test plan

  • Test via workflow_dispatch with a trimmed matrix (mtp-list: "1", thinking-modes: "off", open-pr: false) to validate the full CI chain (model loads, dataset downloads, AL is computed, artifact uploads).
  • Confirm the produced YAML matches the expected shape and that thinking-on/off level-1 AL values are sane (locally observed: thinking_on: 1.79, thinking_off: 1.92).
  • Full run (mtp-list: "1 2 3 4 5 6 7 8", thinking-modes: "off on") with open-pr: true; review the auto-opened reference-YAML PR before merging.
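
To make the size of each test run concrete, the sketch below expands the two space-separated inputs into matrix cells. The function name is hypothetical; the `${mode}_${mtp}` key format follows the `AL_RESULT` indexing visible in the review snippet further down.

```bash
#!/usr/bin/env bash
# Sketch only: enumerate_cells is a hypothetical helper.
# Usage: enumerate_cells "THINKING_MODES" "MTP_LIST"  (both space-separated)
enumerate_cells() {
  local modes="$1" mtps="$2" mode mtp
  # Unquoted expansion is intentional: the inputs are space-separated lists.
  for mode in $modes; do
    for mtp in $mtps; do
      echo "${mode}_${mtp}"
    done
  done
}

# Trimmed matrix from the first test step: a single cell.
enumerate_cells "off" "1"

# Full matrix: 2 thinking modes x 8 MTP levels.
enumerate_cells "off on" "1 2 3 4 5 6 7 8"
```

The full matrix yields 16 cells, one server start each, which is why the description raises the Slurm time limit above the 180-minute default.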

Note

Medium Risk
Touches GPU CI/Slurm launch paths and patches vLLM in-container for thinking-mode benchmarks; golden reference YAML updates are manual-review gated but wrong AL values would affect downstream synthetic-acceptance tests.

Overview
Adds a manual GitHub Action (speedbench-al.yml) that runs on self-hosted B300 runners to collect a SPEED-Bench acceptance-length (AL) reference matrix (thinking_on / thinking_off × MTP num_speculative_tokens), upload speedbench-reference-al.yaml (and server logs), and optionally open a PR that updates benchmarks/speedbench-reference-al.yaml.

The new collector script benchmarks/single_node/speedbench/dsv4_fp4_b300_vllm.sh loops each matrix cell: start vLLM with MTP speculative config, run SPEED-Bench, derive AL from spec-decode Prometheus metrics, and emit YAML matching the golden reference shape. It includes a temporary runtime patch to vLLM’s benchmark CLI so thinking-on cells can pass chat_template_kwargs (until upstream #44244), plus stricter server/GPU cleanup between cells.

runners/launch_b300-nv.sh gains opt-in BENCH_SCRIPT_OVERRIDE (workflow points at the speedbench script instead of auto-picked throughput benches) and SALLOC_TIME_LIMIT (default 180 minutes unchanged) for long multi-server runs.

Reviewed by Cursor Bugbot for commit d595d49. Bugbot is set up for automated code reviews on this repo.

Push-button (workflow_dispatch) collection of the DeepSeek-V4-Pro
SPEED-Bench acceptance-length matrix (thinking on/off x MTP 1-8) on
self-hosted B300 runners, optionally opening a PR that updates
benchmarks/speedbench-reference-al.yaml.

- benchmarks/single_node/dsv4_fp4_b300_vllm_speedbench_matrix.sh:
  per (thinking, MTP) cell, serve vLLM, run SPEED-Bench, derive AL from
  /metrics, and emit the YAML matrix. Serves from MODEL_PATH (the local
  pre-staged weights resolved by the launcher), falling back to MODEL for
  a standalone local run. Carries a temporary --chat-template-kwargs shim
  until vllm-project/vllm#44244 lands in the benchmark image (idempotent,
  applied only for thinking-on cells).
- runners/launch_b300-nv.sh: add opt-in BENCH_SCRIPT_OVERRIDE and
  SALLOC_TIME_LIMIT hooks; both default to the prior behavior.
- .github/workflows/speedbench-al.yml: workflow_dispatch entry point;
  MODEL is the HF id so the launcher resolves the staged MODEL_PATH.
Make the workflow default to Option 1 (upload the AL matrix as an
artifact for manual review/paste) rather than auto-opening a PR. The
auto-PR path stays available as an opt-in (open-pr: true), but keeping
it off by default avoids exposing a write-scoped PAT on the self-hosted
runner and matches the repo's artifact-collection convention.
@qiching qiching changed the title from "Add GitHub Action to collect SPEED-Bench AL matrix" to "[NV] Add GitHub Action to collect SPEED-Bench AL matrix" Jun 3, 2026
Comment thread .github/workflows/speedbench-al.yml Outdated
# HF id; its basename (DeepSeek-V4-Pro) is in the launcher's STAGED_MODELS, so
# the launcher resolves MODEL_PATH to the pre-staged local weights and mounts
# them. The collector serves from MODEL_PATH (see SERVE_MODEL), so no download.
MODEL: deepseek-ai/DeepSeek-V4-Pro
Collaborator

Can we please update the model to be ${{ inputs.model }}? This would require adding a new input; for now it can default to deepseek-ai/DeepSeek-V4-Pro.

Remember, this change also requires dynamically setting model_prefix, exp_name, benchmark_script_override, and the artifact names. As a result, the Create PR step will also change.

Author

Made model_prefix a second input (default dsv4) instead of deriving it, to match how the repo already treats it as an explicit field (configs + launcher branch on it).

Comment thread benchmarks/single_node/speedbench/dsv4_fp4_b300_vllm.sh
Address review:
- Model is now a workflow input (model + model-prefix, default
  deepseek-ai/DeepSeek-V4-Pro / dsv4). MODEL, MODEL_PREFIX, EXP_NAME,
  BENCH_SCRIPT_OVERRIDE, artifact names and the Create-PR branch/title/body
  are all derived from those inputs. The emitted YAML top-level key is now
  derived from the model (MODEL_KEY, defaults to the model basename lowercased).
- Move the collector to benchmarks/single_node/speedbench/dsv4_fp4_b300_vllm.sh
  and fix its benchmark_lib.sh source path (../ -> ../../) for the deeper dir.
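
The MODEL_KEY default mentioned above (model basename, lowercased) might be derived along these lines. This is a sketch under that reading; the helper name `default_model_key` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: default_model_key is a hypothetical helper.
# Usage: default_model_key MODEL   (an HF id or a local path)
default_model_key() {
  # basename drops the org or directory prefix; tr lowercases the rest.
  basename "$1" | tr '[:upper:]' '[:lower:]'
}

MODEL="deepseek-ai/DeepSeek-V4-Pro"
# An explicit MODEL_KEY wins; otherwise derive it from MODEL.
MODEL_KEY="${MODEL_KEY:-$(default_model_key "$MODEL")}"
echo "$MODEL_KEY"
```

Because `basename` works on both `org/name` ids and filesystem paths, the same default applies to CI runs and standalone local runs.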
@qiching qiching marked this pull request as ready for review June 4, 2026 22:21
@qiching qiching requested a review from a team June 4, 2026 22:21
Contributor

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.


@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

al="N/A"
fi
echo " -> thinking=$mode MTP=$mtp AL=$al (accepted=$delta_acc drafts=$delta_drf)"
AL_RESULT["${mode}_${mtp}"]="$al"

Bench failure still records AL

High Severity

The collector uses set -uo pipefail without errexit and never checks whether vllm bench serve succeeded. After a failed or partial benchmark, it still diffs spec-decode metrics and may write numeric acceptance-length values into the golden YAML instead of failing the run or marking the cell unusable.
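
One possible shape for a fix is sketched below, assuming the collector captures the benchmark's exit status per cell. The function `record_cell`, the `FAILED_CELLS` counter, and the sample values are hypothetical; only the `AL_RESULT` array and its key format come from the snippet above. Associative arrays require bash 4 or newer.

```bash
#!/usr/bin/env bash
# Sketch only: record_cell and FAILED_CELLS are hypothetical names.
declare -A AL_RESULT
FAILED_CELLS=0

# Usage: record_cell KEY AL_VALUE BENCH_EXIT_CODE
record_cell() {
  local key="$1" al="$2" rc="$3"
  if [ "$rc" -ne 0 ]; then
    # A failed or partial benchmark never contributes a numeric AL.
    echo "bench failed for $key (exit $rc); marking cell unusable" >&2
    AL_RESULT["$key"]="N/A"
    FAILED_CELLS=$((FAILED_CELLS + 1))
    return 0
  fi
  AL_RESULT["$key"]="$al"
}

# true/false stand in for the benchmark command; AL values are illustrative.
true;  record_cell "off_1" "1.92" "$?"
false; record_cell "off_2" "9.99" "$?"
echo "off_1=${AL_RESULT[off_1]} off_2=${AL_RESULT[off_2]} failed=$FAILED_CELLS"
```

After the loop, a check such as `[ "$FAILED_CELLS" -eq 0 ] || exit 1` would fail the run instead of emitting a partially valid matrix. Whether to fail hard or keep the "N/A" cells is a policy choice for the PR author.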

2 participants