2 changes: 1 addition & 1 deletion .github/configs/nvidia-master.yaml
@@ -2632,7 +2632,7 @@ kimik2.5-int4-h200-vllm-agentic:
- { tp: 8, offloading: cpu, conc-list: [6, 7, 8, 9, 10, 11, 12, 13, 14] }

kimik2.5-fp4-b200-vllm:
-  image: vllm/vllm-openai:v0.17.0
+  image: vllm/vllm-openai:v0.20.2

🟡 The PR title and description say the vLLM image is being bumped to v0.21.0, but the diff updates it to v0.20.2 in both nvidia-master.yaml (line 2503) and the perf-changelog.yaml entry. The mismatch is metadata-only. Please reconcile before merge: either update the title/description to say v0.20.2, or bump the YAML and changelog to v0.21.0 if that was the intended target.

The mismatch

The PR title is "Update kimik2.5-fp4-b200-vllm vLLM image to v0.21.0" and the description states "Updates the vLLM image tag for kimik2.5-fp4-b200-vllm from v0.17.0 to v0.21.0".

However, the actual diff tells a different story:

  • .github/configs/nvidia-master.yaml line 2503 changes from image: vllm/vllm-openai:v0.17.0 to image: vllm/vllm-openai:v0.20.2
  • The new perf-changelog.yaml entry describes the change as "Update vLLM image from v0.17.0 to v0.20.2"

So the YAML and the changelog are internally consistent at v0.20.2, while the PR title and description both claim v0.21.0.

Step-by-step proof

  1. Open the PR view on GitHub — the title reads Update kimik2.5-fp4-b200-vllm vLLM image to v0.21.0.
  2. Read the description — it says Updates the vLLM image tag for kimik2.5-fp4-b200-vllm from v0.17.0 to v0.21.0.
  3. Inspect the diff on .github/configs/nvidia-master.yaml:
    kimik2.5-fp4-b200-vllm:
    -  image: vllm/vllm-openai:v0.17.0
    +  image: vllm/vllm-openai:v0.20.2
  4. Inspect the diff on perf-changelog.yaml:
    - config-keys:
        - kimik2.5-fp4-b200-vllm
      description:
        - "Update vLLM image from v0.17.0 to v0.20.2"
  5. Compare: title/description say v0.21.0, but the actual deployable artifact and changelog say v0.20.2.
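
The comparison in step 5 can also be scripted. The following is a minimal sketch, not part of this repository: `extract_version` is a hypothetical helper, and the three sample strings are copied from the steps above rather than read from the PR or the files.

```shell
#!/bin/sh
# Sketch only: extract_version and the sample strings are illustrative
# assumptions; nothing here is read from the repo or the GitHub API.

# Print the last vX.Y.Z tag found in the argument (empty if none).
extract_version() {
  printf '%s\n' "$1" | grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | tail -n 1
}

pr_title='Update kimik2.5-fp4-b200-vllm vLLM image to v0.21.0'
yaml_line='image: vllm/vllm-openai:v0.20.2'
changelog_line='Update vLLM image from v0.17.0 to v0.20.2'

title_ver=$(extract_version "$pr_title")
yaml_ver=$(extract_version "$yaml_line")
changelog_ver=$(extract_version "$changelog_line")

if [ "$title_ver" = "$yaml_ver" ] && [ "$yaml_ver" = "$changelog_ver" ]; then
  echo "consistent: $yaml_ver"
else
  echo "mismatch: title=$title_ver yaml=$yaml_ver changelog=$changelog_ver"
fi
```

With the strings above this prints `mismatch: title=v0.21.0 yaml=v0.20.2 changelog=v0.20.2`. Taking the last tag on each line is a deliberate choice so that a "from X to Y" changelog description yields the target version Y.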

Impact

This is metadata-only: when this PR merges, the deployed image will be vllm/vllm-openai:v0.20.2 as written in the YAML, not v0.21.0. The PR title and description do not affect runtime behavior. However:

  • Reviewers reading the title/description in the GitHub UI will be misled about which version is being approved.
  • The merge commit message typically picks up the PR title, so anyone running git log later will find a permanent record of a v0.21.0 bump that did not happen.

How to fix

Reconcile by either:

  • Option A (most likely correct): Update the PR title to Update kimik2.5-fp4-b200-vllm vLLM image to v0.20.2 and update the description body to match, since the YAML and changelog both consistently say v0.20.2.
  • Option B: If v0.21.0 was actually the intended target, bump nvidia-master.yaml line 2503 to v0.21.0 and update the perf-changelog.yaml description accordingly.

Given that the YAML and the auto-generated changelog entry agree on v0.20.2, Option A appears to be the actual intent.

model: nvidia/Kimi-K2.5-NVFP4
model-prefix: kimik2.5
runner: b200
7 changes: 7 additions & 0 deletions benchmarks/single_node/kimik2.5_fp4_b200.sh
@@ -33,6 +33,13 @@ fi
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

# vLLM v0.20.2+'s CUDA-graph memory profiler pre-reserves ~57 GB/GPU upfront
# (~32% of total), which collides with --gpu-memory-utilization=0.90 and
# leaves negative space for the KV cache. Disable the profiler — our 0.90
# already leaves ~18 GB/GPU as safety net (same pattern as
# benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh).
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0

set -x
vllm serve $MODEL --host 0.0.0.0 --port $PORT \
--tensor-parallel-size=$TP \
7 changes: 7 additions & 0 deletions perf-changelog.yaml
@@ -2704,3 +2704,10 @@
description:
- "Update vLLM image from v0.19.0-cu130 (25d old) to v0.21.0"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1448

- config-keys:
- kimik2.5-fp4-b200-vllm
description:
- "Update vLLM image from v0.20.2 to v0.21.0"
- "Add VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 to disable aggressive CUDA-graph memory profiler that OOMs the KV cache"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1395
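
The memory figures quoted in the kimik2.5_fp4_b200.sh comment can be cross-checked with a little arithmetic. The sketch below uses only the approximate values from that comment (~57 GB reserved, ~32% of total, utilization 0.90). The variable names are illustrative, and the per-GPU total is derived from those figures rather than measured.

```shell
#!/bin/sh
# Cross-check of the figures in the script comment. Inputs are the comment's
# approximate values; total_gb is derived from them, not measured.
reserved_gb=57      # CUDA-graph profiler reservation per GPU
reserved_frac=0.32  # share of total memory that reservation represents
gpu_mem_util=0.90   # value passed as --gpu-memory-utilization

# Implied total memory per GPU: reservation divided by its share.
total_gb=$(awk -v r="$reserved_gb" -v f="$reserved_frac" \
  'BEGIN { printf "%.0f", r / f }')

# Memory left outside vLLM's budget at the given utilization.
headroom_gb=$(awk -v t="$total_gb" -v u="$gpu_mem_util" \
  'BEGIN { printf "%.0f", t * (1 - u) }')

# Budget remaining for weights, activations, and KV cache after the
# profiler's reservation is subtracted.
after_reserve_gb=$(awk -v t="$total_gb" -v u="$gpu_mem_util" -v r="$reserved_gb" \
  'BEGIN { printf "%.0f", t * u - r }')

echo "implied total per GPU: ~${total_gb} GB"
echo "headroom outside vLLM budget: ~${headroom_gb} GB"
echo "budget left after reservation: ~${after_reserve_gb} GB"
```

This prints roughly 178 GB total, 18 GB headroom, and 103 GB left after the reservation, which matches the "~18 GB/GPU" figure in the comment. Whether the remaining budget goes negative for the KV cache depends on weight and activation sizes, which the comment does not state.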