Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions .github/configs/nvidia-master.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2884,7 +2884,7 @@ dsr1-fp8-h200-sglang-mtp:
# Uses the cu129 image. H200 has no FP4 path, so the FP4 indexer cache
# flag is omitted. Max-model-len is pinned at 800k per the recipe.
dsv4-fp8-h200-vllm:
image: vllm/vllm-openai:v0.21.0
image: vllm/vllm-openai:v0.22.0
Comment on lines 2884 to +2887
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The rationale comments above both dsv4-fp8-h200-vllm (lines 2883–2885) and dsv4-fp8-h200-vllm-mtp (lines 2907–2909) are stale and now contradict the image lines this PR is editing. The non-MTP block still says "Uses the cu129 image" but the entry pins the stock vllm/vllm-openai:v0.22.0 tag, and the MTP block claims it uses "the canonical v0.20.1 image" while the non-MTP entry above is "still on the deepseekv4-cu129 tag" — both factually wrong post-PR (MTP is v0.22.0, non-MTP is v0.22.0). Pre-existing staleness (the prior v0.21.0 bump already failed to refresh these), but since this PR edits both image lines, it would be a natural place to refresh the adjacent rationale. Severity: nit.

Extended reasoning...

What the bug is

This PR bumps both dsv4-fp8-h200-vllm (line 2887) and dsv4-fp8-h200-vllm-mtp (line 2911) from vllm/vllm-openai:v0.21.0 to vllm/vllm-openai:v0.22.0. Both entries have explanatory comment blocks immediately above them whose factual claims no longer match the image they document.

Comment #1 — above dsv4-fp8-h200-vllm (lines 2883–2885):

# Uses the cu129 image. H200 has no FP4 path, so the FP4 indexer cache
# flag is omitted. Max-model-len is pinned at 800k per the recipe.
dsv4-fp8-h200-vllm:
  image: vllm/vllm-openai:v0.22.0   # <-- not a cu129-tagged image

The comment says "Uses the cu129 image" — historically this referred to a custom deepseekv4-cu129 build. The pinned tag is now the stock public semver v0.22.0, so the rationale-for-image-choice no longer applies.

Comment #2 — above dsv4-fp8-h200-vllm-mtp (lines 2907–2909):

# MTP variant of dsv4-fp8-h200-vllm. Uses the canonical v0.20.1 image
# (the non-MTP entry above is still on the deepseekv4-cu129 tag) and adds
# --speculative-config '{"method":"mtp","num_speculative_tokens":2}'.
dsv4-fp8-h200-vllm-mtp:
  image: vllm/vllm-openai:v0.22.0

Both factual claims are wrong post-PR: (1) the MTP image is v0.22.0, not v0.20.1; (2) the non-MTP entry directly above is also v0.22.0, not deepseekv4-cu129.

Step-by-step proof

  1. Open .github/configs/nvidia-master.yaml after this PR is applied.
  2. Read lines 2883–2887. The comment claims "Uses the cu129 image"; the next non-comment line is image: vllm/vllm-openai:v0.22.0 — a stock public tag, not a cu129 build. Contradiction confirmed.
  3. Read lines 2907–2911. The comment claims "Uses the canonical v0.20.1 image" — but line 2911 reads image: vllm/vllm-openai:v0.22.0. Contradiction [NVIDIA] Add TRT-LLM 70B FP8 via slurm #1 confirmed.
  4. The same comment claims "the non-MTP entry above is still on the deepseekv4-cu129 tag" — but line 2887 (the non-MTP entry) reads image: vllm/vllm-openai:v0.22.0. Contradiction [NVIDIA] Add TRT 70B (FP8 and FP4) #2 confirmed.
  5. git log -p on this file shows PR [Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Update dsv4-fp8-h200-vllm (+mtp) vLLM image to v0.21.0 #1461 bumped these entries from the original tags to v0.21.0 without updating the comments; the staleness is therefore pre-existing. This PR widens the gap (v0.21.0 → v0.22.0) but is editing the exact image lines the comments describe, so it is the natural place to refresh the rationale in the same change.

Impact

No runtime impact — these are YAML comments only. The hazard is reviewer/maintainer confusion: someone reading the file to understand why a specific image was chosen will get a misleading answer, and the MTP block in particular invites a "wait, the non-MTP entry must still be on cu129" mistake during the next bump.

How to fix

Either (a) drop the now-obsolete provenance sentences entirely and keep only the still-accurate parts (no FP4 path on H200, max-model-len 800k, MTP adds --speculative-config), or (b) replace the stale lines with neutral rationale that doesn't pin to a specific tag — e.g. for the MTP block, just say "MTP variant of dsv4-fp8-h200-vllm; adds --speculative-config ...." Avoid embedding image tags or cross-references between entries in prose, since those go stale on every bump.

Severity

Nit / pre-existing. The staleness was introduced before this PR (the v0.21.0 bump in #1461 already mismatched the cu129/v0.20.1 prose), and the runtime behavior is unaffected. Flagging because the PR is directly editing both image lines and is in the natural position to refresh the adjacent rationale in the same change.

model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: h200
Expand All @@ -2908,7 +2908,7 @@ dsv4-fp8-h200-vllm:
# (the non-MTP entry above is still on the deepseekv4-cu129 tag) and adds
# --speculative-config '{"method":"mtp","num_speculative_tokens":2}'.
dsv4-fp8-h200-vllm-mtp:
image: vllm/vllm-openai:v0.21.0
image: vllm/vllm-openai:v0.22.0
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: h200
Expand Down
7 changes: 7 additions & 0 deletions perf-changelog.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3239,3 +3239,10 @@
- "Add --compilation-config '{\"mode\":3,\"cudagraph_mode\":\"PIECEWISE\"}' to vllm serve, mirroring `model.base_args` from the upstream recipe. `pass_config.fuse_minimax_qk_norm` from the recipe is intentionally omitted — it triggers an upstream NameError on ROCm because vllm/compilation/passes/pass_manager.py imports MiniMaxQKNormPass under `is_cuda()` (NVIDIA-only) while using it unconditionally"
- "Conditionally enable VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1 per (TP, EP, CONC) — on for shapes where the AITER ASM paged-attention kernel exists in the gfx942 heuristic table (TP=2 EP=1 CONC<=16, TP=8 EP=8 CONC<=64), off otherwise. Above the thresholds vllm/v1/attention/backends/rocm_aiter_fa.py routes decode through aiter pa_fwd_asm and crashes with `RuntimeError: get_heuristic_kernel: cannot get heuristic kernel!` for MiniMax-M2.5's attention shape (gqa=6 block_size=32 qTile=0); below them the ASM auto-dispatch is the perf win the recipe wants. Thresholds confirmed across 17 bench cells + 3 eval cells in PR #1594 sweep run 26692603804. Mirrors the per-shape toggle pattern in benchmarks/single_node/minimaxm2.5_fp8_mi355x.sh; can collapse to unconditional SHUFFLE=1 once AITER registers the missing kernel on gfx942"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1594

- config-keys:
- dsv4-fp8-h200-vllm
- dsv4-fp8-h200-vllm-mtp
description:
- "Update vLLM image from v0.21.0 to v0.22.0"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1597
Loading