
Commit 7aa0c95

Add tests/gpu_vllm (#1517)
### What does this PR do?

Type of change: new tests

This PR adds unit tests for vLLM fakequant, specifically testing code in `modelopt/torch/quantization/plugins/vllm.py`

### Testing

```
pytest tests/gpu_vllm/torch/quantization/test_vllm_dynamic_modules.py -sv
```

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
- Did you get Claude approval on this PR?: ✅

### Additional Information

## Summary by CodeRabbit

* **Tests**
  * Added comprehensive GPU vLLM test suite with end-to-end quantization checks and fixtures for TinyLlama, TinyQwen3-MoE, and DeepSeek V3; includes helpers to build tiny DeepSeek V3 models.
* **Chores**
  * Updated GPU CI to use explicit container image references, added a GPU-focused test session, and adjusted test-run setup for vLLM.
* **Documentation**
  * Documented new GPU test directory in contributing guide.

---------

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
1 parent ed0a4b1 commit 7aa0c95

8 files changed

Lines changed: 369 additions & 19 deletions

.github/workflows/_example_tests_runner.yml

Lines changed: 0 additions & 3 deletions
@@ -34,9 +34,6 @@ jobs:
     timeout-minutes: ${{ inputs.timeout_minutes }}
     container:
       image: ${{ inputs.docker_image }}
-      credentials:
-        username: $oauthtoken
-        password: ${{ secrets.NGC_API_KEY }}
       options: --shm-size=2gb # TRT-LLM tests on 2-GPU runner needs more shared memory
     env:
       PIP_CONSTRAINT: "" # Disable pip constraint for upgrading packages

.github/workflows/gpu_tests.yml

Lines changed: 15 additions & 8 deletions
@@ -29,6 +29,7 @@ jobs:
         tests/gpu/**
         tests/gpu_megatron/**
         tests/gpu_trtllm/**
+        tests/gpu_vllm/**
 
   gpu-tests:
     needs: [pr-gate]
@@ -39,25 +40,30 @@ jobs:
         include:
           - example: gpu
             timeout: 75
-            container_image: pytorch:26.04-py3
+            container_image: nvcr.io/nvidia/pytorch:26.04-py3
           - example: gpu_megatron
             timeout: 45
-            container_image: nemo:26.04
+            container_image: nvcr.io/nvidia/nemo:26.04
           - example: gpu_trtllm
             timeout: 30
-            container_image: tensorrt-llm/release:1.3.0rc16
+            container_image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc16
+          - example: gpu_vllm
+            timeout: 30
+            container_image: docker.io/vllm/vllm-openai:v0.20.0
     runs-on: ${{ startsWith(github.ref, 'refs/heads/pull-request/') && 'linux-amd64-gpu-rtxpro6000-latest-1' || 'linux-amd64-gpu-rtxpro6000-latest-2' }}
     timeout-minutes: ${{ matrix.timeout }}
     container:
-      image: nvcr.io/nvidia/${{ matrix.container_image }}
-      credentials:
-        username: $oauthtoken
-        password: ${{ secrets.NGC_API_KEY }}
+      image: ${{ matrix.container_image }}
     env:
       GIT_DEPTH: 1000 # For correct version
       PIP_CONSTRAINT: "" # Disable pip constraint for upgrading packages
       HF_TOKEN: ${{ secrets.HF_TOKEN }}
     steps:
+      - name: Install git
+        # The vllm container ships without git; needed for a real checkout (correct
+        # setuptools-scm version) and for the Codecov upload below.
+        if: matrix.example == 'gpu_vllm'
+        run: apt-get update && apt-get install -y git
       - uses: actions/checkout@v6
       - uses: nv-gha-runners/setup-proxy-cache@main
       - name: Setup environment variables
@@ -68,7 +74,8 @@ jobs:
           COVERAGE_PROCESS_START: ${{ github.workspace }}/pyproject.toml
           COVERAGE_FILE: ${{ github.workspace }}/.coverage
         run: |
-          python -m pip install nox && nox -s ${{ matrix.example }}
+          # Use `python3` (the vllm image has no `python` on PATH)
+          python3 -m pip install nox && nox -s ${{ matrix.example }}
       - name: Upload GPU coverage to Codecov
         uses: codecov/codecov-action@v5
         with:
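
The matrix change moves the `nvcr.io/nvidia/` prefix out of the `container.image` expression and into each `container_image` value, which is what lets the new `gpu_vllm` entry point at `docker.io`. A small sketch, using only the image strings from the diff above, checks that the three existing sessions still resolve to the same image; `registry_of` is an illustrative helper, not part of the workflow, and it assumes every reference is fully qualified with a registry host:

```python
# Matrix values copied from the diff; the prefix was previously hardcoded in `container.image`.
OLD_PREFIX = "nvcr.io/nvidia/"
old_matrix = {
    "gpu": "pytorch:26.04-py3",
    "gpu_megatron": "nemo:26.04",
    "gpu_trtllm": "tensorrt-llm/release:1.3.0rc16",
}
new_matrix = {
    "gpu": "nvcr.io/nvidia/pytorch:26.04-py3",
    "gpu_megatron": "nvcr.io/nvidia/nemo:26.04",
    "gpu_trtllm": "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc16",
    "gpu_vllm": "docker.io/vllm/vllm-openai:v0.20.0",
}


def registry_of(image_ref: str) -> str:
    """Return the registry host, i.e. everything before the first '/'.

    Assumes a fully qualified reference such as 'nvcr.io/nvidia/nemo:26.04'.
    """
    return image_ref.split("/", 1)[0]


# Existing sessions resolve to the same image as before the change.
unchanged = all(OLD_PREFIX + old_matrix[name] == new_matrix[name] for name in old_matrix)

# Map each session to the registry it now pulls from.
registries = {name: registry_of(ref) for name, ref in new_matrix.items()}

print(unchanged, registries)
```

Only `gpu_vllm` maps to a registry other than `nvcr.io`, consistent with the hunk also dropping the NGC `credentials` block from the container definition.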

CONTRIBUTING.md

Lines changed: 1 addition & 0 deletions
@@ -146,6 +146,7 @@ We use [pytest](https://docs.pytest.org/) for all tests. For any new features /
 - `tests/gpu`: Fast GPU-based unit tests for the core ModelOpt library. In most cases, they should not take more than a few seconds to run.
 - `tests/gpu_megatron`: Fast GPU-based unit tests for the core ModelOpt library for Megatron-Core features. In most cases, they should not take more than a few seconds to run.
 - `tests/gpu_trtllm`: Fast GPU-based unit tests for the core ModelOpt library for TensorRT-LLM features. In most cases, they should not take more than a few seconds to run.
+- `tests/gpu_vllm`: Fast GPU-based unit tests for the core ModelOpt library for vLLM features. In most cases, they should not take more than a few seconds to run.
 - `tests/examples`: Integration tests for ModelOpt examples. They should not take more than a few minutes to run. Please refer to [example test README](./tests/examples/README.md) for more details.
 
 For lightweight focused local validation, run `pytest` directly on the relevant test path. For example:

examples/vllm_serve/Dockerfile

Lines changed: 2 additions & 8 deletions
@@ -1,4 +1,4 @@
-FROM vllm/vllm-openai:v0.10.2
+FROM vllm/vllm-openai:v0.20.0
 
 # Set environment variables
 ENV PIP_NO_CACHE_DIR=off \
@@ -23,17 +23,11 @@ RUN cd Model-Optimizer && \
     pip install -e ".[all,dev-test]"
 
 # Llama4 requires this
-RUN pip install flash-attn==2.7.4.post1
+RUN pip install flash-attn==2.7.4.post1 --no-build-isolation
 
 # Pre-compile CUDA extensions to avoid compilation time during runtime
 RUN python3 -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()" || true
 
-# Install requirements from examples (excluding windows examples)
-RUN find Model-Optimizer/examples -name "requirements.txt" | grep -v "windows" | while read req_file; do \
-    echo "Installing from $req_file"; \
-    pip install -r "$req_file" || echo "Warning: Failed to install from $req_file"; \
-    done
-
 # Allow users to run without root
 RUN chmod -R 777 /workspace
 

noxfile.py

Lines changed: 8 additions & 0 deletions
@@ -138,6 +138,14 @@ def gpu_trtllm(session):
     session.run("python", "-m", "pytest", "tests/gpu_trtllm", *_cov_args())
 
 
+# Container: docker.io/vllm/vllm-openai (the published image ships vLLM + CUDA + torch).
+# Pin must stay in sync with examples/vllm_serve/Dockerfile.
+@nox.session(venv_backend="none")
+def gpu_vllm(session):
+    session.run("python3", "-m", "pip", "install", "-e", ".[hf,dev-test]")
+    session.run("python3", "-m", "pytest", "tests/gpu_vllm", *_cov_args())
+
+
 # Container: nvcr.io/nvidia/pytorch:26.01-py3 or later
 @nox.session(venv_backend="none")
 def regression(session):
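
The comment on `gpu_vllm` asks that the vLLM pin stay in sync with `examples/vllm_serve/Dockerfile`; in this commit both the Dockerfile `FROM` line and the `gpu_vllm` matrix entry use `v0.20.0`. The commit does not automate that check. One way it could be sketched, with hypothetical helper names (`dockerfile_base_tag`, `image_tag`, `pins_in_sync`) and inline strings standing in for the real files:

```python
import re


def dockerfile_base_tag(dockerfile_text: str) -> str:
    """Return the tag of the first tagged FROM line, e.g. 'v0.20.0'."""
    match = re.search(r"^FROM\s+\S+:(\S+)", dockerfile_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no tagged FROM line found")
    return match.group(1)


def image_tag(image_ref: str) -> str:
    """Return the tag of a reference such as 'docker.io/vllm/vllm-openai:v0.20.0'."""
    _, sep, tag = image_ref.rpartition(":")
    # A '/' after the last ':' means the colon belonged to a registry port, not a tag.
    if not sep or "/" in tag:
        raise ValueError(f"image reference has no tag: {image_ref!r}")
    return tag


def pins_in_sync(dockerfile_text: str, image_ref: str) -> bool:
    """True when the Dockerfile base image and the CI image use the same tag."""
    return dockerfile_base_tag(dockerfile_text) == image_tag(image_ref)


# Stand-ins for examples/vllm_serve/Dockerfile and the gpu_vllm matrix entry.
dockerfile = "FROM vllm/vllm-openai:v0.20.0\n\n# Set environment variables\n"
ci_image = "docker.io/vllm/vllm-openai:v0.20.0"

print(pins_in_sync(dockerfile, ci_image))
```

The comparison is on the tag only, since the Dockerfile omits the `docker.io/` registry prefix that the workflow spells out.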

tests/_test_utils/torch/transformers_models.py

Lines changed: 45 additions & 0 deletions
@@ -26,6 +26,7 @@
     AutoModelForQuestionAnswering,
     AutoTokenizer,
     BertConfig,
+    DeepseekV3Config,
     GptOssConfig,
     LlamaConfig,
     PreTrainedModel,
@@ -120,6 +121,50 @@ def create_tiny_qwen3_moe_dir(
     return qwen3_moe_dir
 
 
+##### DeepSeek V3 #####
+def get_tiny_deepseek_v3(**config_kwargs) -> PreTrainedModel:
+    set_seed(SEED)
+    kwargs = {
+        "dtype": torch.bfloat16,
+        "vocab_size": 128,
+        "hidden_size": 128,
+        "intermediate_size": 256,
+        "moe_intermediate_size": 64,
+        "num_hidden_layers": 2,
+        "num_attention_heads": 2,
+        "num_key_value_heads": 2,
+        "n_routed_experts": 4,
+        "num_experts_per_tok": 2,
+        "n_shared_experts": 1,
+        "first_k_dense_replace": 0,
+        "kv_lora_rank": 16,
+        "q_lora_rank": 32,
+        "qk_rope_head_dim": 16,
+        "qk_nope_head_dim": 16,
+        "v_head_dim": 16,
+        "max_position_embeddings": 128,
+        # Required so vLLM allocates ``gate.e_score_correction_bias`` (HF saves it unconditionally).
+        "topk_method": "noaux_tc",
+    }
+    kwargs.update(**config_kwargs)
+    cfg = DeepseekV3Config(**kwargs)
+    # Survive transformers versions that drop unknown kwargs from the dataclass.
+    cfg.topk_method = kwargs["topk_method"]
+    return AutoModelForCausalLM.from_config(cfg)
+
+
+def create_tiny_deepseek_v3_dir(
+    tmp_path: Path | str, with_tokenizer: bool = False, **config_kwargs
+) -> Path:
+    deepseek_dir = Path(tmp_path) / "tiny_deepseek_v3"
+    if with_tokenizer:
+        tokenizer = get_tiny_tokenizer()
+        tokenizer.save_pretrained(deepseek_dir)
+        config_kwargs["vocab_size"] = tokenizer.vocab_size
+    get_tiny_deepseek_v3(**config_kwargs).save_pretrained(deepseek_dir)
+    return deepseek_dir
+
+
 ##### GPT-OSS #####
 def get_tiny_gpt_oss(**config_kwargs) -> PreTrainedModel:
     set_seed(SEED)
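
`get_tiny_deepseek_v3` starts from a dict of tiny defaults and lets callers override individual fields through `**config_kwargs`; `create_tiny_deepseek_v3_dir` relies on this to set `vocab_size` from the tokenizer. The merge step can be sketched without transformers. `TINY_DEFAULTS` is a trimmed copy of the defaults above and `merge_config_kwargs` is an illustrative name, neither exists in the test utilities:

```python
# Trimmed copy of the defaults from the diff, for illustration only.
TINY_DEFAULTS = {
    "vocab_size": 128,
    "hidden_size": 128,
    "num_hidden_layers": 2,
    "n_routed_experts": 4,
    "topk_method": "noaux_tc",
}


def merge_config_kwargs(**overrides):
    """Return the defaults with caller overrides applied on top."""
    kwargs = dict(TINY_DEFAULTS)  # copy so the shared defaults are never mutated
    kwargs.update(overrides)
    return kwargs


# A caller that needs the vocabulary to match a tokenizer overrides one field;
# every other field keeps its tiny default.
merged = merge_config_kwargs(vocab_size=32000, num_hidden_layers=1)
print(merged["vocab_size"], merged["hidden_size"], merged["topk_method"])
```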

tests/gpu_vllm/conftest.py

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Set ``VLLM_ALLOW_INSECURE_SERIALIZATION=1`` before vLLM is imported so
+``LLM.collective_rpc(callable)`` can pickle worker callables. pytest loads
+conftests before sibling test modules, so this beats the top-level
+``from vllm import LLM`` in ``test_*.py``.
+"""
+
+import os
+
+os.environ.setdefault("VLLM_ALLOW_INSECURE_SERIALIZATION", "1")
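
The conftest uses `os.environ.setdefault` rather than a plain assignment, so the flag is only written when it is unset and a value exported by the caller is left alone. A minimal sketch of that behaviour, using a hypothetical variable name `DEMO_FLAG` (not something vLLM reads) so the real setting is not touched:

```python
import os

NAME = "DEMO_FLAG"  # hypothetical variable, used only for this illustration

# Case 1: the variable is unset, so setdefault writes and returns the default.
os.environ.pop(NAME, None)
first = os.environ.setdefault(NAME, "1")

# Case 2: the variable was already set (e.g. exported in the shell),
# so setdefault returns the existing value and leaves it unchanged.
os.environ[NAME] = "0"
second = os.environ.setdefault(NAME, "1")

print(first, second, os.environ[NAME])
```

In the conftest this means a developer who explicitly exports `VLLM_ALLOW_INSECURE_SERIALIZATION=0` keeps that value, and the tests depending on `LLM.collective_rpc(callable)` would then run without the flag enabled.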
