adding vllm unit tests by kinjalpatel27 · Pull Request #1517 · NVIDIA/Model-Optimizer

kinjalpatel27 · 2026-05-18T21:04:16Z

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A
Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

Tests
- Added comprehensive GPU test suite for vLLM integration with quantization validation
- Extended GPU tests to cover additional model architectures including DeepSeek V3
- Added test utilities supporting DeepSeek V3 model construction
Chores
- Updated GPU test workflow with improved container image configuration
- Added vLLM test session to test infrastructure

copy-pr-bot · 2026-05-18T21:04:20Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-05-18T21:04:23Z

📝 Walkthrough

Walkthrough

This PR adds end-to-end GPU testing for vLLM fakequant quantization integration. It updates the GPU test CI matrix to use full container image paths, adds a new nox session for running GPU tests, introduces DeepSeek V3 test model utilities, configures pytest for vLLM RPC serialization, and includes a comprehensive test suite that validates quantizer module coverage and static-amax invariants across multiple model architectures.

Changes

vLLM Fakequant Test Suite

Layer / File(s)	Summary
CI & Test Runner Setup `.github/workflows/gpu_tests.yml`, `noxfile.py`	GPU test matrix refactored to store full container image paths; job container uses `${{ matrix.container_image }}` directly; test install uses `python3 -m pip`; new `@nox.session(venv_backend="none")` session installs project with `hf,dev-test` extras and runs `tests/gpu_vllm` pytest suite.
Test Model Utilities `tests/_test_utils/torch/transformers_models.py`	Adds `DeepseekV3Config` import and introduces `get_tiny_deepseek_v3()` and `create_tiny_deepseek_v3_dir()` helpers to construct and save minimal DeepSeek V3 models with fixed router/expert settings and explicit `topk_method` persistence.
Pytest Environment & vLLM Configuration `tests/gpu_vllm/conftest.py`	New conftest module sets `VLLM_ALLOW_INSECURE_SERIALIZATION=1` at import time (before vLLM imports) to enable RPC-based callable serialization for vLLM worker processes.
Quantization Worker Framework & Fixtures `tests/gpu_vllm/torch/quantization/test_vllm_dynamic_modules.py` (core infra)	Core RPC worker function `_quantize_and_summarize()` runs calibration forward loop during `mtq.quantize()` and verifies quantized module coverage and static-amax buffers on enabled `TensorQuantizer`s; lifecycle helpers boot and cleanly shut down `vllm.LLM` instances; three module-scoped pytest fixtures create tiny model directories, boot engines with architecture-specific overrides, and manage cleanup; assertion helper enforces static-amax regression invariant.
Quantization Test Cases `tests/gpu_vllm/torch/quantization/test_vllm_dynamic_modules.py` (tests)	End-to-end tests for TinyLlama, TinyQwen3-MoE, and TinyDeepseek-V3 validate quantized parallel linear and attention module coverage, MoE presence where expected, and static-amax invariants; registry validation tests ensure vLLM quantizable classes have `QuantModuleRegistry` entries, including MLAAttention when available.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 38.89% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'adding vllm unit tests' is vague and generic, using non-descriptive terms that don't convey meaningful information about the specific changes.	Replace with a more specific title that captures the main change, such as 'Add GPU vLLM quantization tests with DeepSeek V3 support' or 'Integrate vLLM dynamic module quantization tests'.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	No security anti-patterns detected: no unsafe torch.load, numpy.load, trust_remote_code, eval/exec, # nosec, or risky dependencies in PR modifications.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch kinjal/vllm_fq_unit

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-05-18T21:17:41Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.87%. Comparing base (3f88f0d) to head (6bd7465).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1517      +/-   ##
==========================================
- Coverage   76.92%   74.87%   -2.05%     
==========================================
  Files         474      474              
  Lines       51503    52827    +1324     
==========================================
- Hits        39618    39556      -62     
- Misses      11885    13271    +1386

Flag	Coverage Δ
gpu	`59.75% <ø> (-0.59%)`	⬇️
regression	`15.21% <ø> (+0.07%)`	⬆️
unit	`52.63% <ø> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py`:
- Around line 16-17: Move the test suite file
tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py into
one of the approved GPU test folders (e.g., tests/gpu or the appropriate
tests/gpu_* target) so it complies with the repository test-directory policy,
then update any CI and nox configuration entries that reference the old path
(search for occurrences of "gpu_vllm_fakequant" or the filename
"test_vllm_dynamic_modules.py" in noxfiles and workflow YAMLs) to point to the
new location and ensure the test is included in the correct GPU test collection.
- Around line 352-353: Replace the broad "except Exception as exc" that
unconditionally calls pytest.skip with a targeted check that only skips for
known environment/capability failures and re-raises all other exceptions: catch
specific exception types (e.g., ImportError, OSError, RuntimeError) or inspect
exc's message for known indicators (CUDA/vLLM model download failures) and call
pytest.skip only in those matched cases; for any other exc, re-raise the
exception so real regressions in the vLLM/model loading/quantization path
surface. Ensure this logic surrounds the same try block that currently ends with
pytest.skip and continues to use pytest.skip(...) for the allowed cases.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f28e2a71-b5a2-446b-a077-041bd9668aed

📥 Commits

Reviewing files that changed from the base of the PR and between 3f88f0d and 0ebeee6.

📒 Files selected for processing (4)

.github/workflows/gpu_tests.yml
noxfile.py
tests/gpu_vllm_fakequant/conftest.py
tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py

coderabbitai · 2026-05-18T23:37:47Z

+"""End-to-end tests for the vLLM fakequant dynamic modules.
+


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Test location breaks the repository’s test-directory policy.

This suite is placed under tests/gpu_vllm_fakequant, but the policy requires GPU tests to live in the listed standard folders. Please move this suite under an allowed GPU test directory and update nox/workflow paths accordingly.

As per coding guidelines tests/**/*.py: Use pytest for all tests with tests organized into: tests/unit for fast CPU-based tests, tests/gpu for GPU-based tests, tests/gpu_megatron for Megatron-Core GPU tests, and tests/gpu_trtllm for TensorRT-LLM GPU tests.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py` around lines 16 - 17, Move the test suite file tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py into one of the approved GPU test folders (e.g., tests/gpu or the appropriate tests/gpu_* target) so it complies with the repository test-directory policy, then update any CI and nox configuration entries that reference the old path (search for occurrences of "gpu_vllm_fakequant" or the filename "test_vllm_dynamic_modules.py" in noxfiles and workflow YAMLs) to point to the new location and ensure the test is included in the correct GPU test collection.

@kinjalpatel27 can you add tests/gpu_vllm in CONTRIBUTING.md

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/gpu_tests.yml:
- Around line 55-56: The vLLM image tag in the workflow (the line containing
"container_image: docker.io/vllm/vllm-openai:v0.21.0") is out of sync with
examples/vllm_serve/Dockerfile which uses v0.10.2; update the workflow entry to
use the same tag as the Dockerfile (change the container_image value to
docker.io/vllm/vllm-openai:v0.10.2) so the comment "Keep in sync with
examples/vllm_serve/Dockerfile" is honored, or conversely update the Dockerfile
to v0.21.0 if you intend to standardize on the newer release—ensure both places
reference the identical image tag.
- Line 56: The workflow pins the OpenAI-server image to container_image:
docker.io/vllm/vllm-openai:v0.21.0 which may be vulnerable to
GHSA-3mwp-wvh9-7528; verify whether that v0.21.0 release includes the patch and
if not either update the tag to the patched vLLM image (reference the patched
release tag) or add mitigations: enforce input validation and a max "n"
parameter and/or enable rate-limiting in the vLLM server invocation;
specifically, locate the container_image: docker.io/vllm/vllm-openai:v0.21.0
entry and either change it to the patched image tag or add environment
variables/command-line args or ingress rate-limiting configuration to bound the
`n` parameter and throttle requests.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c96aa3eb-b472-440a-91c2-bf1b5151e78b

📥 Commits

Reviewing files that changed from the base of the PR and between 0ebeee6 and 3d4848a.

📒 Files selected for processing (3)

.github/workflows/gpu_tests.yml
noxfile.py
tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py

🚧 Files skipped from review as they are similar to previous changes (2)

noxfile.py
tests/gpu_vllm_fakequant/torch/quantization/test_vllm_dynamic_modules.py

coderabbitai · 2026-05-19T00:03:13Z

+            # Keep in sync with examples/vllm_serve/Dockerfile.
+            container_image: docker.io/vllm/vllm-openai:v0.21.0


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Description: Check if vLLM versions are in sync between workflow and Dockerfile # Extract version from this workflow file workflow_version=$(rg -o 'vllm/vllm-openai:(.+)' .github/workflows/gpu_tests.yml -r '$1' | head -n1) echo "Workflow version: $workflow_version" # Extract version from Dockerfile if it exists if [ -f examples/vllm_serve/Dockerfile ]; then dockerfile_version=$(rg -o 'vllm/vllm-openai:(.+)' examples/vllm_serve/Dockerfile -r '$1' | head -n1) echo "Dockerfile version: $dockerfile_version" if [ "$workflow_version" = "$dockerfile_version" ]; then echo "✓ Versions match" else echo "✗ Version mismatch detected" exit 1 fi else echo "⚠ examples/vllm_serve/Dockerfile not found" fi

Repository: NVIDIA/Model-Optimizer

Length of output: 148

Fix version mismatch: vLLM container image versions are out of sync.

The workflow specifies vllm/vllm-openai:v0.21.0, but examples/vllm_serve/Dockerfile uses v0.10.2. The comment at line 55 explicitly requires these to be kept in sync. Align both to the same version.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/gpu_tests.yml around lines 55 - 56, The vLLM image tag in the workflow (the line containing "container_image: docker.io/vllm/vllm-openai:v0.21.0") is out of sync with examples/vllm_serve/Dockerfile which uses v0.10.2; update the workflow entry to use the same tag as the Dockerfile (change the container_image value to docker.io/vllm/vllm-openai:v0.10.2) so the comment "Keep in sync with examples/vllm_serve/Dockerfile" is honored, or conversely update the Dockerfile to v0.21.0 if you intend to standardize on the newer release—ensure both places reference the identical image tag.

coderabbitai · 2026-05-19T00:03:13Z

GHSA-2pc9-4j83-qjmr

+          - example: gpu_vllm_fakequant
+            timeout: 30
+            # Keep in sync with examples/vllm_serve/Dockerfile.
+            container_image: docker.io/vllm/vllm-openai:v0.21.0


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

vllm-openai docker image v0.21.0 release date security advisories

💡 Result:

vllm-openai Docker image (vllm/vllm-openai) — v0.21.0 Release date - vLLM v0.21.0 was released on 2026-05-15. [1][2] Security advisories (known vLLM issues relevant to “vllm-openai”/the OpenAI-compatible server) - GHSA-2pc9-4j83-qjmr (RCE via auto_map dynamic module loading during model initialization): vLLM loads Hugging Face auto_map dynamic modules during model resolution without gating on trust_remote_code, enabling attacker-controlled Python code execution at server startup (before request handling). The advisory states affected versions are >=0.10.1,<0.14.0 and 0.14.0. [3] - GHSA-3mwp-wvh9-7528 (OOM DoS via unbounded n parameter in the OpenAI API server): unauthenticated attacker can send a request with an astronomically large n causing event-loop blocking and OOM by fan-out allocation in the OpenAI-compatible API server. [4] - GHSA-9hcf-v7m4-6m2j (server crash via invalid regex): a denial-of-service bug causes the vLLM server to crash if an invalid regex is provided while using structured output; patched in 0.9.0. [5] Notes on “v0.21.0” specifically - The v0.21.0 release notes emphasize multiple build/dependency changes and container provenance metadata, but the excerpted release page does not enumerate specific security fixes in that page view. [1] - Therefore, for security posture of v0.21.0 specifically, the primary authoritative check is whether your current vLLM version includes the fixes from the relevant GHSA advisories above, and whether any additional advisories were published after those pages.

Citations:

1: https://github.com/vllm-project/vllm/releases/tag/v0.21.0

2: https://pypi.org/project/vllm/

3: GHSA-2pc9-4j83-qjmr

4: GHSA-3mwp-wvh9-7528

5: GHSA-9hcf-v7m4-6m2j

Address potential OOM DoS vulnerability in vLLM OpenAI API server.

The Docker image v0.21.0 is valid and recently released (2026-05-15). However, vLLM has an outstanding security advisory (GHSA-3mwp-wvh9-7528) for an unauthenticated OOM denial-of-service vulnerability via unbounded n parameter in the OpenAI-compatible API server. Since the image docker.io/vllm/vllm-openai:v0.21.0 is specifically the OpenAI-compatible server variant, confirm whether v0.21.0 includes the patch for this vulnerability or ensure appropriate mitigations (rate limiting, input validation) are in place.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/gpu_tests.yml at line 56, The workflow pins the OpenAI-server image to container_image: docker.io/vllm/vllm-openai:v0.21.0 which may be vulnerable to GHSA-3mwp-wvh9-7528; verify whether that v0.21.0 release includes the patch and if not either update the tag to the patched vLLM image (reference the patched release tag) or add mitigations: enforce input validation and a max "n" parameter and/or enable rate-limiting in the vLLM server invocation; specifically, locate the container_image: docker.io/vllm/vllm-openai:v0.21.0 entry and either change it to the patched image tag or add environment variables/command-line args or ingress rate-limiting configuration to bound the `n` parameter and throttle requests.

kevalmorabia97 · 2026-05-19T09:03:00Z

can you keep folder name tests/gpu_vllm so its more generic than just fakequant

kevalmorabia97 · 2026-05-19T09:04:09Z

+import pytest
+import torch
+
+pytest.importorskip("vllm")


can you remove this. We should assume tests in this folder should only be run with vllm and fail otherwise instead of skipping

kevalmorabia97 · 2026-05-19T09:04:39Z

+@pytest.fixture(scope="session")
+def cuda_required():
+    if not torch.cuda.is_available():
+        pytest.skip("CUDA is required for vLLM fakequant tests")


This can be removed also. tests/gpu* can be assumed it should be run in cuda only

kevalmorabia97 · 2026-05-19T09:06:12Z

+pytest.importorskip("vllm")
+pytest.importorskip("transformers")


lets remove this also so tests dont get silently skipped if we accidentally change test container

kevalmorabia97 · 2026-05-19T09:06:46Z

+    import modelopt.torch.quantization as mtq
+    import modelopt.torch.quantization.plugins.vllm as vp
+    from modelopt.torch.quantization.nn import TensorQuantizer


can you move imports to the top

kevalmorabia97 · 2026-05-19T09:09:00Z

+    try:
+        from transformers import AutoModelForCausalLM, DeepseekV3Config
+    except ImportError:
+        pytest.skip("transformers in this environment does not ship DeepseekV3Config")


We use transformers>=4.56 so this should never raise an import error. Can you directly add import at the top of the file

kevalmorabia97 · 2026-05-19T09:09:23Z

+    _assert_quantizer_amax_is_static(summary)
+
+
+def _try_make_tiny_deepseek_v3_dir(tmp_path):


can you move to the test_utils folder with rest of tiny transformers models?

kevalmorabia97 · 2026-05-19T09:10:04Z

+    Pure registry check — no GPU / engine boot — so it runs even when the
+    heavier fixtures are skipped.
+    """
+    from modelopt.torch.quantization.nn import QuantModuleRegistry


imports to the top of the file

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.github/workflows/gpu_tests.yml (1)
25-33: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Expand the PR gate to cover the new vLLM lane’s actual inputs.

Right now any_changed will stay false for changes in tests/_test_utils/** or examples/vllm_serve/Dockerfile, even though this lane now depends on the shared tiny-model builders and the workflow explicitly says the image pin must stay in sync with that Dockerfile. That lets helper regressions or image drift skip gpu_vllm entirely.
Suggested patch
       files: |
         .github/workflows/gpu_tests.yml
+        examples/vllm_serve/Dockerfile
         modelopt/**
         noxfile.py
         pyproject.toml
+        tests/_test_utils/**
         tests/gpu/**
         tests/gpu_megatron/**
         tests/gpu_trtllm/**
         tests/gpu_vllm/**
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/gpu_tests.yml around lines 25 - 33, The PR gate for the
gpu_vllm lane is missing inputs so any_changed stays false for changes to shared
tiny-model builders and the vLLM Dockerfile; update the files list in
.github/workflows/gpu_tests.yml (the block that populates any_changed) to
include tests/_test_utils/** and examples/vllm_serve/Dockerfile (and any other
tiny-model builder paths if separate) so changes to those files will trigger the
gpu_vllm lane and keep the image pin in sync with the Dockerfile referenced by
the gpu_vllm job.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/gpu_vllm/torch/quantization/test_vllm_dynamic_modules.py`:
- Around line 353-355: The test unconditionally asserts VllmMLAAttention exists
but that symbol can be None on builds without MLA; update the
test_registry_has_mla_attention function to first check if VllmMLAAttention is
None and call pytest.skip("MLA not available") (or return) when absent,
otherwise perform the existing assertions against QuantModuleRegistry; reference
the VllmMLAAttention symbol and the QuantModuleRegistry in the change.

---

Outside diff comments:
In @.github/workflows/gpu_tests.yml:
- Around line 25-33: The PR gate for the gpu_vllm lane is missing inputs so
any_changed stays false for changes to shared tiny-model builders and the vLLM
Dockerfile; update the files list in .github/workflows/gpu_tests.yml (the block
that populates any_changed) to include tests/_test_utils/** and
examples/vllm_serve/Dockerfile (and any other tiny-model builder paths if
separate) so changes to those files will trigger the gpu_vllm lane and keep the
image pin in sync with the Dockerfile referenced by the gpu_vllm job.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 94d4592d-e642-4fab-aa1f-b84f904dafaa

📥 Commits

Reviewing files that changed from the base of the PR and between 3d4848a and 6bd7465.

📒 Files selected for processing (5)

.github/workflows/gpu_tests.yml
noxfile.py
tests/_test_utils/torch/transformers_models.py
tests/gpu_vllm/conftest.py
tests/gpu_vllm/torch/quantization/test_vllm_dynamic_modules.py

coderabbitai · 2026-05-20T01:13:31Z

+def test_registry_has_mla_attention():
+    assert VllmMLAAttention is not None
+    assert VllmMLAAttention in QuantModuleRegistry


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n -C3 '\bVllmMLAAttention\b' modelopt tests/gpu_vllm/torch/quantization/test_vllm_dynamic_modules.py

Repository: NVIDIA/Model-Optimizer

Length of output: 3579

Skip this test when MLA support is absent.

The plugin treats VllmMLAAttention as optional—it falls back to None when the vLLM build doesn't expose MLAAttention (see modelopt/torch/quantization/plugins/vllm.py lines 79–82). The rest of this file and the plugin already check is not None before using it, but this test unconditionally asserts the symbol exists, which will fail on builds without MLA support.

Suggested patch

def test_registry_has_mla_attention(): - assert VllmMLAAttention is not None + if VllmMLAAttention is None: + pytest.skip("This vLLM build does not expose MLAAttention.") assert VllmMLAAttention in QuantModuleRegistry

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/gpu_vllm/torch/quantization/test_vllm_dynamic_modules.py` around lines 353 - 355, The test unconditionally asserts VllmMLAAttention exists but that symbol can be None on builds without MLA; update the test_registry_has_mla_attention function to first check if VllmMLAAttention is None and call pytest.skip("MLA not available") (or return) when absent, otherwise perform the existing assertions against QuantModuleRegistry; reference the VllmMLAAttention symbol and the QuantModuleRegistry in the change.

kinjalpatel27 added 2 commits May 18, 2026 23:05

adding vllm unit tests

1cae4d7

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>

minor

0ebeee6

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>

kinjalpatel27 force-pushed the kinjal/vllm_fq_unit branch from 7b708cb to 0ebeee6 Compare May 18, 2026 23:05

kinjalpatel27 marked this pull request as ready for review May 18, 2026 23:33

kinjalpatel27 requested a review from a team as a code owner May 18, 2026 23:33

kinjalpatel27 requested a review from kevalmorabia97 May 18, 2026 23:33

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

fixing failures

3d4848a

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

kinjalpatel27 marked this pull request as draft May 19, 2026 00:24

kevalmorabia97 reviewed May 19, 2026

View reviewed changes

cleanup

6bd7465

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>

kinjalpatel27 marked this pull request as ready for review May 20, 2026 01:06

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

		# Keep in sync with examples/vllm_serve/Dockerfile.
		container_image: docker.io/vllm/vllm-openai:v0.21.0

		pytest.importorskip("vllm")
		pytest.importorskip("transformers")

		_assert_quantizer_amax_is_static(summary)


		def _try_make_tiny_deepseek_v3_dir(tmp_path):

Conversation

kinjalpatel27 commented May 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Release Notes

Uh oh!

copy-pr-bot Bot commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

codecov Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

kevalmorabia97 May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kinjalpatel27 commented May 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 18, 2026 •

edited

Loading

codecov Bot commented May 18, 2026 •

edited

Loading

coderabbitai Bot May 18, 2026 •

edited

Loading

kevalmorabia97 May 19, 2026 •

edited

Loading