[NVIDIA] Upgrade vLLM to v0.11.2#273

Merged
cquil11 merged 9 commits into main from dev-pohanh-vllm-v0.11.2 on Dec 5, 2025
@ankursingh-nv
Contributor

@ankursingh-nv ankursingh-nv commented Dec 3, 2025

Updated configs:

  • Use FP8 kv-cache for GPT-OSS B200.
  • Remove "custom_ops" from compilation-config for GPT-OSS.
  • Remove "cudagraph_mode" from compilation-config for GPT-OSS.
  • Remove VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB env var for GPT-OSS.
  • Remove deprecated "--disable-log-requests" flag.
  • Rename "cuda-graph-sizes" flag.

Test sweep: https://github.com/InferenceMAX/InferenceMAX/actions/runs/19946962635
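
The config changes listed above can be pictured as a small rewrite of a `vllm serve` argument list and its environment. The following is only a sketch under stated assumptions, not the actual edit made to the benchmark scripts: `migrate_args` and `migrate_env` are hypothetical helper names, the replacement name for "cuda-graph-sizes" is not stated in this PR and is therefore passed in by the caller, and the sample argument values are illustrative.

```python
import json

# Items this PR removes, taken from the list above.
DROPPED_FLAGS = {"--disable-log-requests"}
DROPPED_COMPILATION_KEYS = {"custom_ops", "cudagraph_mode"}
DROPPED_ENV_VARS = {"VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB"}


def migrate_args(args, new_graph_sizes_flag, use_fp8_kv_cache=False):
    """Return a copy of a server argument list with the listed edits applied.

    new_graph_sizes_flag is supplied by the caller because the PR text
    does not say what "cuda-graph-sizes" was renamed to.
    """
    out = []
    i = 0
    while i < len(args):
        arg = args[i]
        if arg in DROPPED_FLAGS:
            i += 1  # boolean flag, no value to skip
        elif arg == "--cuda-graph-sizes":
            out.append(new_graph_sizes_flag)  # the value that follows is kept
            i += 1
        elif arg == "--compilation-config" and i + 1 < len(args):
            config = json.loads(args[i + 1])
            for key in DROPPED_COMPILATION_KEYS:
                config.pop(key, None)
            if config:  # omit the flag entirely if nothing is left
                out += [arg, json.dumps(config)]
            i += 2
        else:
            out.append(arg)
            i += 1
    # Assumption: FP8 KV cache is requested via --kv-cache-dtype fp8.
    if use_fp8_kv_cache and "--kv-cache-dtype" not in out:
        out += ["--kv-cache-dtype", "fp8"]
    return out


def migrate_env(env):
    """Return a copy of the environment without the removed variable."""
    return {k: v for k, v in env.items() if k not in DROPPED_ENV_VARS}


# Illustrative input only; these are not the values used in the repository.
old_args = [
    "--disable-log-requests",
    "--cuda-graph-sizes", "8",
    "--compilation-config",
    '{"custom_ops": ["none"], "cudagraph_mode": "PIECEWISE", "level": 3}',
]
new_args = migrate_args(old_args, "--PLACEHOLDER-NEW-FLAG", use_fp8_kv_cache=True)
print(new_args)
```

Keeping the renamed flag as a parameter avoids guessing at a name that the description does not give; the real scripts should be checked against the vLLM v0.11.2 CLI help.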

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
@ankursingh-nv ankursingh-nv requested a review from a team as a code owner December 3, 2025 17:31
@ankursingh-nv ankursingh-nv marked this pull request as draft December 3, 2025 17:41
@cquil11
Collaborator

cquil11 commented Dec 3, 2025

vLLM is releasing too fast @mgoin, we can't keep up

@cquil11 cquil11 marked this pull request as ready for review December 4, 2025 15:47
@cquil11 cquil11 marked this pull request as draft December 4, 2025 15:47
@cquil11 cquil11 temporarily deployed to fork-pr-validation December 4, 2025 15:48 — with GitHub Actions Inactive
@cquil11
Collaborator

cquil11 commented Dec 4, 2025

@ankursingh-nv is this ready to go once the checks pass?

@ankursingh-nv
Contributor Author

ankursingh-nv commented Dec 4, 2025

@cquil11 should be ready once we get a successful E2E run.

Currently, some H100 and H200 jobs are failing (see https://github.com/InferenceMAX/InferenceMAX/actions/runs/19904064308).

@cquil11
Collaborator

cquil11 commented Dec 4, 2025

@ankursingh-nv With vLLM 0.11.2, the vLLM server process seems to attempt to write to the host filesystem, so I added --container-writable to the CoreWeave runners as part of this PR.

@cquil11 cquil11 marked this pull request as ready for review December 4, 2025 19:46
@ankursingh-nv
Contributor Author

ankursingh-nv commented Dec 4, 2025

@cquil11 interesting.

@cquil11
Collaborator

cquil11 commented Dec 4, 2025

@cquil11 cquil11 left a comment

lgtm -- thank you!

Comment thread benchmarks/gptoss_fp4_h100_docker.sh
Comment thread benchmarks/gptoss_fp4_h100_slurm.sh
Comment thread benchmarks/gptoss_fp4_h200_slurm.sh
Comment thread runners/launch_h200-cw.sh
@cquil11
Collaborator

cquil11 commented Dec 5, 2025

@Oseltamivir + viz
they are using FP8 KV cache for GPT OSS now too haha

@cquil11 cquil11 merged commit 25506e8 into main Dec 5, 2025
@cquil11 cquil11 deleted the dev-pohanh-vllm-v0.11.2 branch December 5, 2025 21:49
@cquil11 cquil11 changed the title from "Upgrade vLLM to v0.11.2" to "[NVIDIA] Upgrade vLLM to v0.11.2" Apr 8, 2026
4 participants