Update new fixed-AR-MTP CI workflow for kimik2.5_int4, kimik2.5_fp4, …#1633

Open
haic0 wants to merge 3 commits into main from haichen/Fixed-AR-MTP-benchmark

Collaborator

@haic0 haic0 commented Jun 1, 2026

[Summary] Implemented and validated CI support for the new Eagle3 and fixed-AR MTP benchmark paths.

For amd-master.yaml, added matrix coverage for:

  • kimik2.5-int4-mi355x-vllm-eagle3
  • kimik2.5-mxfp4-mi355x-vllm-eagle3
  • minimaxm2.5-fp8-mi355x-vllm-eagle3
  • kimik2.5-int4-mi355x-vllm-fixed-ar-mtp
  • kimik2.5-fp4-mi355x-vllm-fixed-ar-mtp

Added benchmark scripts for:

  • Kimi INT4 Eagle3
  • Kimi FP4/MXFP4 Eagle3
  • MiniMax FP8 Eagle3
  • Kimi INT4 fixed-AR MTP
  • Kimi FP4 fixed-AR MTP
  • MiniMax FP8 Eagle3 fixed-AR support

Updated CI workflow plumbing:

  • Added fixed-ar-mtp scenario support in matrix validation and generation.
  • Updated e2e-tests.yml to route fixed-AR MTP jobs through benchmark-tmpl.yml.
  • Updated benchmark-tmpl.yml to pass draft model, speculative token count, rejection method, and synthetic acceptance rates into benchmark scripts.
  • Updated launch_mi355x-amds.sh to resolve Eagle3 and fixed-AR MTP script names correctly.
  • Added mtp-fixed-ar-amd.yml for LiveCodeBench-based synthetic acceptance-rate generation.
  • Installed and ran actionlint; fixed workflow lint issues.
Validation completed:


Note

Medium Risk
Touches core sweep generation, workflow matrix routing, and privileged self-hosted GPU jobs; misconfiguration could spawn many expensive runs or wrong speculative configs, but changes are additive CI/benchmark plumbing without auth or data-path changes.

Overview
Adds Eagle3 and fixed-AR MTP benchmark coverage for Kimi K2.5 (INT4/MXFP4) and MiniMax M2.5 on MI355X, plus CI plumbing to run and parameterize those paths end-to-end.

Matrix & validation: amd-master.yaml gains new vLLM entries for Eagle3 fixed-seq-len sweeps and fixed-ar-mtp scenarios (draft model, synthetic acceptance rates, Eagle3 spec-decoding). generate_sweep_configs.py and validation.py introduce the fixed-ar-mtp scenario type, eagle3 as a spec-decoding mode, and pydantic models for synthetic MTP fields.

Workflows: benchmark-tmpl.yml passes draft model / speculative-token / rejection / synthetic-AR inputs into jobs. e2e-tests.yml splits fixed-ar-mtp matrix jobs into a dedicated sweep and includes them in result collection. New mtp-fixed-ar-amd.yml runs LiveCodeBench on ROCm vLLM, derives per-position acceptance rates from Prometheus metrics, and publishes synthetic AR artifacts.
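
The derivation of per-position acceptance rates mentioned above is not shown in this conversation. As a rough sketch of one plausible approach, assuming the workflow has already read two counters from the metrics endpoint (the number of drafts, and the number of accepted tokens at each draft position), the rates could be computed as below. The function name and the counter values are hypothetical and are not taken from mtp-fixed-ar-amd.yml.

```python
def per_position_acceptance_rates(num_drafts, accepted_per_position):
    """Return, for each draft position, the fraction of drafts accepted there.

    Assumption: both arguments are raw counter values already scraped from
    the metrics endpoint; the scraping itself is out of scope here.
    """
    if num_drafts <= 0:
        raise ValueError("num_drafts must be positive")
    return [round(count / num_drafts, 6) for count in accepted_per_position]


# Hypothetical counter values, not taken from a real run.
rates = per_position_acceptance_rates(1000, [780, 575, 410])
print(rates)
```

A list of this shape is what the matrix entries carry as synthetic acceptance rates, one value per speculative token.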

Runners & scripts: launch_mi355x-amds.sh maps eagle3 script suffixes and selects *_fixed_AR.sh / *_mtp_fixed_AR.sh when SCENARIO_TYPE=fixed-ar-mtp. New single-node bash benchmarks start vLLM with Eagle3 standard or synthetic rejection configs and run throughput (optional eval).

Reviewed by Cursor Bugbot for commit e68ee0a. Bugbot is set up for automated code reviews on this repo. Configure here.

…and minimaxm2.5_fp8 models

Signed-off-by: root <root@gbt350-odcdh5-wbb3.png-odc.dcgpu>
Contributor

github-actions Bot commented Jun 1, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Drop local benchmark outputs and logs from version control so the PR only contains CI workflow and benchmark script changes.

Co-authored-by: Cursor <cursoragent@cursor.com>
Collaborator Author

haic0 commented Jun 5, 2026

@functionstackx Hi Oren, since .github/workflows/mtp-fixed-ar-amd.yml is new and not yet merged to main, it generally won't appear as a selectable manual workflow on the Actions page. Could you merge this PR first so that I can run the dry-run test in the real GitHub CI workflow? Thanks so much!

@haic0 haic0 marked this pull request as ready for review June 5, 2026 02:22
@haic0 haic0 requested a review from a team June 5, 2026 02:22
Comment thread runners/launch_mi355x-amds.sh
Comment on lines +60 to +68

vllm serve "$MODEL" --port "$PORT" \
    --tensor-parallel-size="$TP" \
    --gpu-memory-utilization "${GPU_MEMORY_UTILIZATION:-0.90}" \
    --max-model-len "$MAX_MODEL_LEN" \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --max-num-seqs "$CONC" \
    --mm-encoder-tp-mode data \
Contributor

🔴 The new benchmarks/single_node/kimik2.5_int4_mi355x_vllm_mtp_fixed_AR.sh is missing flags that every sibling script added in this same PR uses: the vllm serve command omits --enable-expert-parallel, --long-prefill-token-threshold 8192, and --max-num-batched-tokens 16384, and the SPECULATIVE_CONFIG heredoc omits draft_tensor_parallel_size. The matrix entry kimik2.5-int4-mi355x-vllm-fixed-ar-mtp routes to this script, so the int4 fixed-AR run will use TP-only sharding, default batched-token limits, and a draft model sharded across all 8 GPUs (vLLM falls back to main TP), making its throughput non-comparable to the new int4 eagle3 baseline that is the point of fixed-AR — and the TP=8 draft sharding may OOM. Copy the matching block from kimik2.5_fp4_mi355x_vllm_mtp_fixed_AR.sh.

Extended reasoning...

What the bug is

benchmarks/single_node/kimik2.5_int4_mi355x_vllm_mtp_fixed_AR.sh differs from every other new script in this PR in two related ways:

1. Missing vllm serve flags (lines 60-68). The serve invocation is:

vllm serve "$MODEL" --port "$PORT" \
    --tensor-parallel-size="$TP" \
    --gpu-memory-utilization "${GPU_MEMORY_UTILIZATION:-0.90}" \
    --max-model-len "$MAX_MODEL_LEN" \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --max-num-seqs "$CONC" \
    --mm-encoder-tp-mode data \
    --speculative-config "$SPECULATIVE_CONFIG" > "$SERVER_LOG" 2>&1 &

It omits --enable-expert-parallel, --long-prefill-token-threshold 8192, and --max-num-batched-tokens 16384. The three direct comparison points all include them:

  • kimik2.5_fp4_mi355x_vllm_mtp_fixed_AR.sh (paired fp4 fixed-AR-mtp — same family, only precision differs) — HAS all three
  • kimik2.5_int4_mi355x_vllm_eagle3.sh (same int4 model, non-fixed-AR eagle3 baseline) — HAS all three
  • .github/workflows/mtp-fixed-ar-amd.yml defaults vllm-extra-args to exactly "--enable-expert-parallel --long-prefill-token-threshold 8192 --max-num-batched-tokens 16384" for Kimi-K2.5 — i.e., this PR itself declares these as the intended Kimi-K2.5 vLLM flags
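
A drift like this between sibling scripts could be caught mechanically. The following is a minimal sketch, not part of the PR: `missing_serve_flags` and `REQUIRED_FLAGS` are hypothetical names, and the check assumes a plain token scan of the script text is good enough (it does not parse shell).

```python
# Flags the review names as expected on every Kimi-K2.5 serve command.
REQUIRED_FLAGS = (
    "--enable-expert-parallel",
    "--long-prefill-token-threshold",
    "--max-num-batched-tokens",
)


def missing_serve_flags(script_text, required=REQUIRED_FLAGS):
    """Return the required flags that never appear as a token in the text.

    Tokens are split on whitespace; "--flag=value" is reduced to "--flag".
    """
    tokens = {token.split("=", 1)[0] for token in script_text.split()}
    return [flag for flag in required if flag not in tokens]


# The int4 fixed-AR serve command as quoted in the review.
INT4_SERVE = r'''
vllm serve "$MODEL" --port "$PORT" \
    --tensor-parallel-size="$TP" \
    --gpu-memory-utilization "${GPU_MEMORY_UTILIZATION:-0.90}" \
    --max-model-len "$MAX_MODEL_LEN" \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --max-num-seqs "$CONC" \
    --mm-encoder-tp-mode data \
    --speculative-config "$SPECULATIVE_CONFIG" > "$SERVER_LOG" 2>&1 &
'''

print(missing_serve_flags(INT4_SERVE))
```

Run against the quoted int4 command, the scan reports all three flags as missing.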

2. Missing draft_tensor_parallel_size in SPECULATIVE_CONFIG (lines 42-49). The heredoc emits:

print(json.dumps({
    "method": "eagle3",
    "model": os.environ["DRAFT_MODEL"],
    "num_speculative_tokens": int(os.environ["NUM_SPECULATIVE_TOKENS"]),
    "rejection_sample_method": os.environ["REJECTION_SAMPLE_METHOD"],
    "synthetic_acceptance_rates": json.loads(os.environ["SYNTHETIC_ACCEPTANCE_RATES"]),
}))

Every other new SPECULATIVE_CONFIG in this PR explicitly includes "draft_tensor_parallel_size": int(os.environ.get("DRAFT_TENSOR_PARALLEL_SIZE", "1")): the fp4 fixed-AR-mtp sibling (line 52), the int4 eagle3 baseline (line 43), the fp4 eagle3 baseline (line 46), and the minimax fp8 fixed-AR script (line 51). The int4 fixed-AR-mtp is the only one omitting it.

Why existing code does not prevent it

runners/launch_mi355x-amds.sh routes kimik2.5-int4-mi355x-vllm-fixed-ar-mtp to this exact script via SCRIPT_FIXED_AR_MTP (since SCENARIO_TYPE=fixed-ar-mtp and no *_eagle3_fixed_AR.sh exists for int4, it falls through to *_mtp_fixed_AR.sh). There is no validation step that catches missing serve flags, and the SPECULATIVE_CONFIG validator (validate_synthetic_acceptance_rates in utils/matrix_logic/validation.py) only checks rate-list length, not draft-TP presence.
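
To illustrate the validation gap, here is a hedged sketch of a stricter check over the already-built speculative config. `spec_config_problems` is a hypothetical helper and does not reproduce validate_synthetic_acceptance_rates; it only shows that a draft-TP presence check could sit next to the existing length check.

```python
def spec_config_problems(config):
    """Return human-readable problems found in a speculative-config dict."""
    problems = []
    rates = config.get("synthetic_acceptance_rates", [])
    expected = config.get("num_speculative_tokens")
    # Length check: one rate per speculative token.
    if len(rates) != expected:
        problems.append(f"expected {expected} acceptance rates, got {len(rates)}")
    # Range check: each rate is a probability.
    if any(not 0.0 <= rate <= 1.0 for rate in rates):
        problems.append("acceptance rates must lie in [0, 1]")
    # Presence check the review says is currently absent.
    if "draft_tensor_parallel_size" not in config:
        problems.append("draft_tensor_parallel_size is not set")
    return problems


# Config as the int4 fixed-AR script builds it, using the values quoted in the review.
int4_config = {
    "method": "eagle3",
    "model": "nvidia/Kimi-K2.5-Thinking-Eagle3",
    "num_speculative_tokens": 3,
    "rejection_sample_method": "synthetic",
    "synthetic_acceptance_rates": [0.778774, 0.57543, 0.412793],
}
print(spec_config_problems(int4_config))
```

With the int4 values, the length and range checks pass and only the missing draft TP is reported.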

Concrete step-by-step manifestation

  1. Workflow e2e-tests.yml resolves the matrix entry for kimik2.5-int4-mi355x-vllm-fixed-ar-mtp (TP=8, ISL=1024/8192, conc 4..64) → test-sweep-fixed-ar-mtp → benchmark-tmpl.yml.
  2. benchmark-tmpl.yml exports DRAFT_MODEL=nvidia/Kimi-K2.5-Thinking-Eagle3, NUM_SPECULATIVE_TOKENS=3, REJECTION_SAMPLE_METHOD=synthetic, SYNTHETIC_ACCEPTANCE_RATES=[0.778774,0.57543,0.412793].
  3. launch_mi355x-amds.sh picks kimik2.5_int4_mi355x_vllm_mtp_fixed_AR.sh.
  4. The script builds SPECULATIVE_CONFIG without draft_tensor_parallel_size. vLLM eagle3 then defaults the draft TP to the main TP (=8) — even though the sibling int4 eagle3 baseline explicitly pins draft TP=1. The 8-way-sharded Eagle3 draft either OOMs (the draft model is small and may not shard cleanly across 8 GPUs) or, if it survives, runs with a fundamentally different draft layout than the baseline.
  5. The serve command launches Kimi-K2.5 (a large MoE) at TP=8 without expert-parallel, with vLLM's default --max-num-batched-tokens (much smaller than 16384) and no long-prefill threshold. The matching int4 eagle3 baseline (kimik2.5_int4_mi355x_vllm_eagle3.sh) launches with EP + 16384 batched tokens.
  6. The throughput numbers produced by the fixed-AR run cannot be directly compared to the eagle3 baseline — but the entire purpose of the fixed-AR-mtp matrix entry added in this PR is exactly that side-by-side comparison (fixed-AR emulates a known acceptance rate; eagle3 runs the real draft; the speedup is the headline number).

Impact

  • Best case: the run completes, produces throughput numbers, and the speedup vs. the eagle3 baseline is silently meaningless (different EP setting, different max-num-batched-tokens, different draft TP).
  • Worst case: the TP=8 Eagle3 draft OOMs at server startup and the int4 fixed-AR-mtp matrix entry fails outright.

Either way, the headline measurement this PR is adding for int4 is broken.

Fix

Copy the missing block verbatim from the fp4 sibling. After --tensor-parallel-size="$TP" add:

--enable-expert-parallel \
--long-prefill-token-threshold 8192 \
--max-num-batched-tokens 16384 \

And inside the SPECULATIVE_CONFIG dict add:

"draft_tensor_parallel_size": int(os.environ.get("DRAFT_TENSOR_PARALLEL_SIZE", "1")),

This makes the int4 fixed-AR-mtp script identical in shape to its fp4 sibling and consistent with the int4 eagle3 baseline it is meant to be compared against.

Comment on lines +1 to +30
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    DRAFT_MODEL \
    NUM_SPECULATIVE_TOKENS \
    REJECTION_SAMPLE_METHOD \
    SYNTHETIC_ACCEPTANCE_RATES \
    TP \
    CONC \
    ISL \
    OSL \
    MAX_MODEL_LEN \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

# Set HIP_VISIBLE_DEVICES to match ROCR_VISIBLE_DEVICES for Ray compatibility in vLLM 0.14+
if [ -n "$ROCR_VISIBLE_DEVICES" ]; then
    export HIP_VISIBLE_DEVICES="$ROCR_VISIBLE_DEVICES"
fi

SERVER_LOG=/workspace/server.log
Contributor

🟡 The new benchmarks/single_node/minimaxm2.5_fp8_mi355x_vllm_eagle3_fixed_AR.sh script has no matching matrix entry in .github/configs/amd-master.yaml — there is no minimaxm2.5-fp8-mi355x-vllm-fixed-ar-mtp key, only kimik2.5 has fixed-ar-mtp scenarios. Since runners/launch_mi355x-amds.sh only selects this *_eagle3_fixed_AR.sh path when SCENARIO_TYPE=fixed-ar-mtp, the script can never be invoked via the existing pipeline. The PR description lists "MiniMax FP8 Eagle3 fixed-AR support" under added scripts — either add the matching matrix entry now or drop the script so it doesn't sit as dead code.

Extended reasoning...

What the bug is

The PR adds benchmarks/single_node/minimaxm2.5_fp8_mi355x_vllm_eagle3_fixed_AR.sh but does not add a corresponding matrix entry in .github/configs/amd-master.yaml. The PR description explicitly lists "MiniMax FP8 Eagle3 fixed-AR support" under Added benchmark scripts for:, but the analogous Added matrix coverage for: bullet only mentions the two kimik2.5 fixed-AR-MTP entries. The minimax script is orphaned.

How the dispatch resolves

In runners/launch_mi355x-amds.sh the script-selection block is:

SCRIPT_BASE="${EXP_NAME%%_*}_${PRECISION}_mi355x"
SCRIPT_FW="benchmarks/single_node/${SCENARIO_SUBDIR:-}${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh"
SCRIPT_FIXED_AR="benchmarks/single_node/${SCENARIO_SUBDIR:-}${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}_fixed_AR.sh"
SCRIPT_FIXED_AR_MTP="benchmarks/single_node/${SCENARIO_SUBDIR:-}${SCRIPT_BASE}_${FRAMEWORK}_mtp_fixed_AR.sh"
SCRIPT_FALLBACK="benchmarks/single_node/${SCENARIO_SUBDIR:-}${SCRIPT_BASE}${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh"
if [[ "$SCENARIO_TYPE" == "fixed-ar-mtp" && -f "$SCRIPT_FIXED_AR" ]]; then
    BENCHMARK_SCRIPT="$SCRIPT_FIXED_AR"
elif [[ "$SCENARIO_TYPE" == "fixed-ar-mtp" && -f "$SCRIPT_FIXED_AR_MTP" ]]; then
    BENCHMARK_SCRIPT="$SCRIPT_FIXED_AR_MTP"
...

SCRIPT_FIXED_AR resolves to minimaxm2.5_fp8_mi355x_vllm_eagle3_fixed_AR.sh only when:

  1. EXP_NAME starts with minimaxm2.5 (so SCRIPT_BASE = minimaxm2.5_fp8_mi355x)
  2. FRAMEWORK == vllm
  3. SPEC_DECODING == eagle3 (so SPEC_SUFFIX = _eagle3)
  4. SCENARIO_TYPE == fixed-ar-mtp

SCENARIO_TYPE is propagated into the benchmark workflow from the matrix entry's scenario-type field (see .github/workflows/e2e-tests.yml job test-sweep-fixed-ar-mtp, which filters to entries with scenario-type == 'fixed-ar-mtp' and passes them to benchmark-tmpl.yml). generate_sweep_configs.py only emits entries with that scenario-type when a master-config block has a scenarios.fixed-ar-mtp section.

Why no such matrix entry exists

A grep of .github/configs/amd-master.yaml for fixed-ar-mtp returns exactly two scenario blocks, both under kimik2.5:

  • kimik2.5-int4-mi355x-vllm-fixed-ar-mtp
  • kimik2.5-fp4-mi355x-vllm-fixed-ar-mtp

The three minimax entries in the file (minimaxm2.5-fp8-mi355x-vllm, -vllm-eagle3, -vllm-agentic) all use scenarios.fixed-seq-len or scenarios.agentic-coding — none uses fixed-ar-mtp. Critically, the new minimaxm2.5-fp8-mi355x-vllm-eagle3 entry (which presumably should have been the home of this script) only declares fixed-seq-len scenarios, so it routes to SCRIPT_FW (minimaxm2.5_fp8_mi355x_vllm_eagle3.sh) — the non-fixed-AR variant — not the new _fixed_AR.sh script.

Step-by-step proof of unreachability

  1. For the new script to run, some matrix entry must have scenario-type: fixed-ar-mtp AND model-prefix: minimaxm2.5.
  2. generate_sweep_configs.py emits scenario-type: fixed-ar-mtp only when iterating scenarios.fixed-ar-mtp of a config in amd-master.yaml (see fixed_ar_mtp_configs = scenarios.get(Fields.FIXED_AR_MTP.value, []) in generate_full_sweep).
  3. No minimax entry in amd-master.yaml has a scenarios.fixed-ar-mtp block.
  4. Therefore no generated matrix entry can ever pair model-prefix=minimaxm2.5 with scenario-type=fixed-ar-mtp.
  5. Therefore the runner-side resolver in launch_mi355x-amds.sh never sees SCENARIO_TYPE=fixed-ar-mtp for a minimax run, never picks SCRIPT_FIXED_AR, and the new script is never invoked.
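
The same argument can be turned into a small consistency check. The sketch below assumes the master config has been loaded into a dict whose entries carry `model-prefix`, `precision`, and `scenarios` keys, and that script names begin with `<model-prefix>_<precision>_`; `orphaned_fixed_ar_scripts` is a hypothetical helper, and the sample config is a hand-written stand-in rather than the real amd-master.yaml.

```python
def orphaned_fixed_ar_scripts(master_config, script_names):
    """Return *_fixed_AR.sh scripts with no fixed-ar-mtp matrix entry."""
    wired = {
        (entry["model-prefix"], entry["precision"])
        for entry in master_config.values()
        if "fixed-ar-mtp" in entry.get("scenarios", {})
    }
    orphans = []
    for name in script_names:
        if not name.endswith("_fixed_AR.sh"):
            continue
        model_prefix, precision = name.split("_")[:2]
        if (model_prefix, precision) not in wired:
            orphans.append(name)
    return orphans


# Hand-written stand-in for the relevant entries; scenario bodies are left empty.
sample_config = {
    "kimik2.5-int4-mi355x-vllm-fixed-ar-mtp": {
        "model-prefix": "kimik2.5", "precision": "int4",
        "scenarios": {"fixed-ar-mtp": []},
    },
    "kimik2.5-fp4-mi355x-vllm-fixed-ar-mtp": {
        "model-prefix": "kimik2.5", "precision": "fp4",
        "scenarios": {"fixed-ar-mtp": []},
    },
    "minimaxm2.5-fp8-mi355x-vllm-eagle3": {
        "model-prefix": "minimaxm2.5", "precision": "fp8",
        "scenarios": {"fixed-seq-len": []},
    },
}
scripts = [
    "kimik2.5_int4_mi355x_vllm_mtp_fixed_AR.sh",
    "kimik2.5_fp4_mi355x_vllm_mtp_fixed_AR.sh",
    "minimaxm2.5_fp8_mi355x_vllm_eagle3_fixed_AR.sh",
    "minimaxm2.5_fp8_mi355x_vllm_eagle3.sh",
]
print(orphaned_fixed_ar_scripts(sample_config, scripts))
```

Only the MiniMax fixed-AR script is reported, matching steps 1 to 5.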

The other workflow added in this PR — .github/workflows/mtp-fixed-ar-amd.yml — also cannot invoke this script: it runs vllm serve inline in workflow steps and never sources a benchmark script from benchmarks/single_node/.

Impact

No runtime breakage today: the existing CI still picks SCRIPT_FW for the minimax-eagle3 matrix entry and runs the non-fixed-AR sibling. The harm is misleading dead code in the merged tree:

  • The PR self-describes the feature as added ("MiniMax FP8 Eagle3 fixed-AR support") but the wiring is incomplete.
  • Future readers/editors will assume the file is wired up to CI and may modify it under that assumption — changes that will silently never run.
  • Diverges amd-master.yaml from the kimik2.5 fixed-AR-MTP pattern that is fully wired.

How to fix

Two options, pick one:

  1. Add the missing matrix entry — mirror kimik2.5-fp4-mi355x-vllm-fixed-ar-mtp but with model: MiniMaxAI/MiniMax-M2.5, precision: fp8, model-prefix: minimaxm2.5, and a scenarios.fixed-ar-mtp block (likely draft-model: thoughtworks/MiniMax-M2.5-Eagle3 per the override in mtp-fixed-ar-amd.yml, with appropriate num-speculative-tokens/synthetic-acceptance-rates). Then the existing dispatch picks up the new script as designed.
  2. Drop minimaxm2.5_fp8_mi355x_vllm_eagle3_fixed_AR.sh and re-add it in the follow-up PR that also adds the matrix entry. Avoids checking in unreachable code.

Resolve the MI355X launcher conflict by preserving upstream fixed_seq_len script layout while keeping Eagle3 and fixed-AR MTP script resolution for this PR.

Co-authored-by: Cursor <cursoragent@cursor.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 7d40518. Configure here.

@@ -0,0 +1,100 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"
Wrong benchmark_lib source path

High Severity

New Eagle3 and fixed-AR MTP benchmark scripts source benchmark_lib.sh via ../benchmark_lib.sh, which resolves under benchmarks/single_node/ where that file does not exist. Peer scripts in the same directory use ../../benchmark_lib.sh, so these jobs fail at startup when sourcing the library.
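
The path arithmetic behind this finding can be checked without running the scripts. The sketch assumes the new scripts sit one directory below benchmarks/single_node/ (the subdirectory and file name used here are illustrative guesses, not confirmed by the diff) and that `$(dirname "$0")` yields the script's own directory; `resolve_source` is a hypothetical helper.

```python
import posixpath


def resolve_source(script_path, relative):
    """Resolve a path the way `source "$(dirname "$0")/<relative>"` would."""
    script_dir = posixpath.dirname(script_path)
    return posixpath.normpath(posixpath.join(script_dir, relative))


# Illustrative location; the real directory layout is an assumption.
script = "benchmarks/single_node/fixed_seq_len/example_script.sh"
print(resolve_source(script, "../benchmark_lib.sh"))
print(resolve_source(script, "../../benchmark_lib.sh"))
```

Under that layout, the first form lands in benchmarks/single_node/ and the second in benchmarks/, which is consistent with the comment's description of where the peer scripts find the library.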

Additional Locations (2)
