
vLLM fakequant fold weight_quantizer for megatron export #1246

Merged
kinjalpatel27 merged 10 commits into main from kinjal/vllm_fq_megatron_weight_fold
Apr 16, 2026

Conversation

@kinjalpatel27
Contributor

@kinjalpatel27 kinjalpatel27 commented Apr 13, 2026

What does this PR do?

Type of change: Bug fix

During Megatron→vLLM fakequant export (export_mcore_gpt_to_hf_vllm_fq), the weight_quantizer is now applied as fake-quantization (quantize + dequantize) directly into the exported weight tensor, and its amax is no longer saved to quantizer_state.pth. On reload, if weight_quantizer keys are absent from the checkpoint (because they were folded at export time), the corresponding quantizer modules are disabled.
This change is especially useful when weight_quantizer amax values are not synced across experts: folding lets each expert keep its own amax in the exported weights, for better accuracy.
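As a toy sketch of what "folding" means here (illustrative only, not the modelopt implementation): fake quantization replaces each weight with its quantize-then-dequantize image, so after folding the exported tensor already carries each expert's own amax and nothing needs to be saved for reload.

```python
def fake_quant(w, amax, num_bits=8):
    """Quantize then dequantize: the result stays float but is snapped
    to the low-bit grid implied by this expert's own amax."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = amax / qmax
    q = max(-qmax - 1, min(qmax, round(w / scale)))
    return q * scale

# Each expert keeps its own (unsynced) amax; folding bakes it into the
# exported weight, so no weight_quantizer amax is written to
# quantizer_state.pth.
expert_weights = [0.5, -1.2]
expert_amax = [1.0, 2.0]
folded_weights = [fake_quant(w, a) for w, a in zip(expert_weights, expert_amax)]
```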

Usage

# Unchanged — export API is the same
export_mcore_gpt_to_hf_vllm_fq(model, pretrained_model_name_or_path=..., export_dir=...)

Testing

Step 1 — Quantize (run from Megatron-LM examples/post_training/modelopt):

HF_MODEL_CKPT=<path/to/hf/weights> MLM_MODEL_SAVE=<quant-ckpt-name> \
bash quantize.sh <hf-model-id> NVFP4_DEFAULT_CFG

Step 2 — Export for vLLM fakequant:

MLM_EXTRA_ARGS=--export-vllm-fq \
HF_MODEL_CKPT=<path/to/hf/weights> \
MLM_MODEL_CKPT=<quant-ckpt-name> \
EXPORT_DIR=<export-dir> \
bash export.sh <hf-model-id>

Step 3 — Serve (run from examples/vllm_serve):

QUANT_CFG=NVFP4_DEFAULT_CFG \
QUANT_FILE_PATH=<export-dir>/quantizer_state.pth \
python3 vllm_serve_fakequant.py <export-dir> \
    -tp 1 --served-model-name <model-name> \
    --host 0.0.0.0 --port 8000 \
    --trust-remote-code --enforce-eager \
    --disable-custom-all-reduce \
    --gpu-memory-utilization 0.8

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: N/A

Additional Information

Summary by CodeRabbit

  • Bug Fixes

    • Better handling when loading checkpoints: missing weight-quantizer entries are validated and corresponding modules are disabled to avoid load failures.
  • Improvements

    • Export now folds enabled weight quantizers into exported weights when present and omits internal weight-quantizer tensors from the exported state to produce cleaner exports.

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Apr 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Changes modify quantizer weight handling in vLLM export and reload workflows. The export plugin now folds enabled weight quantizers into exported weights and skips exporting weight-quantizer tensors. The reload utility validates missing quantizer keys and disables corresponding modules for weight quantizers to prevent folding during checkpoint loading. One whitespace-only change was added.

Changes

Cohort / File(s) / Summary:

Whitespace Formatting (examples/vllm_serve/fakequant_worker.py):
Added a blank line after the trust_remote_code computation; no functional changes.

Quantizer State Validation & Module Disabling (examples/vllm_serve/vllm_reload_utils.py):
Removed the global quantizer-key-count check and the warnings import. Missing checkpoint keys are now validated: a missing key containing weight_quantizer is permitted only if it can be mapped to a module path; such module paths are recorded and module.disable() is called on modules that have a disable() method, preventing weight folding during checkpoint loading. Other missing quantizer keys still raise ValueError.

Weight Quantizer Folding & Export (modelopt/torch/export/plugins/vllm_fakequant_megatron.py):
When exporting weights, if module.weight_quantizer exists and is enabled, fake quantization is applied to the weight under torch.no_grad() (temporarily moving quantizer buffers to a CUDA device if needed), the result is converted to the target dtype, and the weight is moved to CPU for export. Quantizer state items whose names include weight_quantizer are skipped; other quantizer state is still exported.
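A minimal sketch of the folding path described above (assuming a TensorQuantizer-like callable exposing an `is_enabled` flag; `DummyQuantizer` below is a stand-in, not the modelopt class):

```python
import types

import torch

def fold_weight_quantizer(module, export_dtype=torch.bfloat16):
    """Bake fake quantization into the exported weight, then move it
    to CPU. Falls back to the plain weight when no enabled quantizer
    is present."""
    weight = module.weight.detach()
    quantizer = getattr(module, "weight_quantizer", None)
    if quantizer is not None and getattr(quantizer, "is_enabled", False):
        # The real code lifts quantizer buffers to CUDA when available;
        # this sketch just runs wherever a device is available.
        device = "cuda" if torch.cuda.is_available() else weight.device
        with torch.no_grad():
            weight = quantizer(weight.to(device))
    return weight.to(export_dtype).cpu()

class DummyQuantizer:
    """Stand-in fake quantizer: snap weights to a 0.5-spaced grid."""
    is_enabled = True
    def __call__(self, w):
        return torch.round(w * 2) / 2

module = types.SimpleNamespace(
    weight=torch.tensor([0.3, -1.2]), weight_quantizer=DummyQuantizer()
)
folded = fold_weight_quantizer(module)
```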

Sequence Diagram(s)

sequenceDiagram
    participant Exporter
    participant Module
    participant Quantizer
    participant CPU
    participant CUDA
    Exporter->>Module: inspect weight, check weight_quantizer
    alt weight_quantizer exists and enabled
        Exporter->>Quantizer: move quantizer buffers -> CUDA (if weight on CPU and CUDA available)
        Exporter->>Quantizer: apply quantizer(weight) under no_grad
        Quantizer-->>Exporter: quantized weight
        Exporter->>Quantizer: restore quantizer buffers to original devices
        Exporter->>CPU: move quantized weight -> CPU and store for export
    else no enabled weight_quantizer
        Exporter->>CPU: move module.weight -> CPU and store for export
    end
    Exporter->>Exporter: emit quantizer state excluding items with "weight_quantizer" in name
Loading
sequenceDiagram
    participant Loader
    participant Checkpoint
    participant Model
    participant Module
    Loader->>Checkpoint: read checkpoint quantizer keys
    Loader->>Model: iterate model.named_modules()
    alt missing checkpoint key contains "weight_quantizer" and maps to module path
        Loader->>Module: call module.disable() if available
    else missing key unrelated or cannot map
        Loader-->>Loader: raise ValueError
    end
    Loader->>Model: continue loading remaining state
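In pure-Python form, the reload rule the diagram describes might look like this sketch (`ToyQuantizer` is a stand-in for a quantizer module with a disable() method; the real utility walks `model.named_modules()`):

```python
class ToyQuantizer:
    """Stand-in for a quantizer module exposing disable()."""
    def __init__(self):
        self.enabled = True
    def disable(self):
        self.enabled = False

def handle_missing_keys(missing_keys, named_modules):
    """Missing weight_quantizer state means the weights were folded at
    export: map each key to its module path and disable the module.
    Any other missing quantizer key is a real error."""
    wq_paths = set()
    for key in missing_keys:
        if "weight_quantizer" not in key:
            raise ValueError(f"Missing quantizer state key: {key}")
        # e.g. "layers.0.weight_quantizer._amax" -> "layers.0.weight_quantizer"
        prefix, _, _ = key.rpartition("weight_quantizer")
        wq_paths.add(prefix + "weight_quantizer")
    for name, module in named_modules.items():
        if name in wq_paths and hasattr(module, "disable"):
            module.disable()

quantizer = ToyQuantizer()
handle_missing_keys(
    ["layers.0.weight_quantizer._amax"],
    {"layers.0.weight_quantizer": quantizer},
)
```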

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'vLLM fakequant fold weight_quantizer for megatron export' directly and specifically describes the main change: implementing weight_quantizer folding during the Megatron-to-vLLM fake-quantization export process, which is the core objective addressed by changes across all three modified files.
Security Anti-Patterns ✅ Passed All security-sensitive code patterns adhere to SECURITY.md guidelines: torch.load() uses weights_only=True, trust_remote_code defaults to False and is configurable, no unsafe patterns detected.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Contributor

github-actions bot commented Apr 13, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-16 16:44 UTC

@codecov

codecov bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.56%. Comparing base (07ae8e7) to head (5b7eaf0).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...pt/torch/export/plugins/vllm_fakequant_megatron.py 88.88% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1246      +/-   ##
==========================================
+ Coverage   75.61%   76.56%   +0.94%     
==========================================
  Files         459      459              
  Lines       48597    48613      +16     
==========================================
+ Hits        36747    37220     +473     
+ Misses      11850    11393     -457     
Flag Coverage Δ
examples 41.38% <0.00%> (+11.54%) ⬆️
gpu 59.96% <88.88%> (-0.54%) ⬇️
unit 52.20% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
examples/vllm_serve/vllm_reload_utils.py (2)

462-472: Consider filtering checkpoint count symmetrically for clearer comparison.

The comparison at line 467 compares checkpoint_quant_count (all quantizer keys) against model_non_wq_quant_count (excluding weight_quantizer). This works correctly when the checkpoint has no weight_quantizer keys (expected behavior), but the asymmetry could cause a spurious warning if a checkpoint unexpectedly contains weight_quantizer entries.

For symmetry and clarity, consider filtering the checkpoint count the same way:

♻️ Suggested change for symmetric comparison
-    checkpoint_quant_count = len(checkpoint_quant_keys)
-    
-    # Ensure counts match (excluding weight quantizer keys, which may be absent when weights
-    # were pre-folded at export)
-    model_non_wq_quant_count = sum(1 for k in model_quant_keys if "weight_quantizer" not in k)
-    if checkpoint_quant_count != model_non_wq_quant_count:
+    # Ensure counts match (excluding weight quantizer keys, which may be absent when weights
+    # were pre-folded at export)
+    checkpoint_non_wq_quant_count = sum(
+        1 for k in checkpoint_quant_keys if "weight_quantizer" not in k
+    )
+    model_non_wq_quant_count = sum(1 for k in model_quant_keys if "weight_quantizer" not in k)
+    if checkpoint_non_wq_quant_count != model_non_wq_quant_count:
         warnings.warn(
-            f"Mismatch in quantizer state key counts: checkpoint has {checkpoint_quant_count} "
+            f"Mismatch in quantizer state key counts: checkpoint has {checkpoint_non_wq_quant_count} "
             f"quant keys but model has {model_non_wq_quant_count} non-weight quantizer state keys. "
             f"This can happen if the model is using PP."
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/vllm_serve/vllm_reload_utils.py` around lines 462 - 472, The
comparison currently uses checkpoint_quant_count (len(checkpoint_quant_keys)) vs
model_non_wq_quant_count (sum over model_quant_keys excluding
"weight_quantizer"), which is asymmetric; change the checkpoint side to filter
out weight quantizer keys the same way (e.g., compute
checkpoint_non_wq_quant_count = sum(1 for k in checkpoint_quant_keys if
"weight_quantizer" not in k) and compare that to model_non_wq_quant_count) so
the warning only fires when non-weight quantizer key counts truly differ; update
references to checkpoint_quant_count in the surrounding logic to use the new
filtered count.

463-463: Remove trailing whitespace.

Line 463 appears to have trailing whitespace. Ruff will flag this.

🧹 Fix
-    
+
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/vllm_serve/vllm_reload_utils.py` at line 463, Remove the trailing
whitespace flagged by Ruff in the vllm_reload_utils.py module by deleting the
extra spaces at the end of the blank line around the reload utilities area (the
stray whitespace on that empty line); ensure the line contains no trailing
spaces so the file passes linting.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@examples/vllm_serve/vllm_reload_utils.py`:
- Around line 462-472: The comparison currently uses checkpoint_quant_count
(len(checkpoint_quant_keys)) vs model_non_wq_quant_count (sum over
model_quant_keys excluding "weight_quantizer"), which is asymmetric; change the
checkpoint side to filter out weight quantizer keys the same way (e.g., compute
checkpoint_non_wq_quant_count = sum(1 for k in checkpoint_quant_keys if
"weight_quantizer" not in k) and compare that to model_non_wq_quant_count) so
the warning only fires when non-weight quantizer key counts truly differ; update
references to checkpoint_quant_count in the surrounding logic to use the new
filtered count.
- Line 463: Remove the trailing whitespace flagged by Ruff in the
vllm_reload_utils.py module by deleting the extra spaces at the end of the blank
line around the reload utilities area (the stray whitespace on that empty line);
ensure the line contains no trailing spaces so the file passes linting.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 4d77ebdd-19dd-4f19-b343-71baeb996154

📥 Commits

Reviewing files that changed from the base of the PR and between b53afa6 and 8c0af46.

📒 Files selected for processing (1)
  • examples/vllm_serve/vllm_reload_utils.py

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/vllm_serve/vllm_reload_utils.py`:
- Around line 440-463: The loop that looks for weight_quantizer module suffixes
can silently ignore keys like "model.weight_quantizer_v2._amax" because wq_i may
be None; update the logic in the block processing model_quant_keys (the code
that builds missing_wq_module_paths using wq_i) to not silently skip when wq_i
is None: either raise a ValueError naming the offending key or emit a clear
warning via logging (include the key and explain it didn’t match the expected
"weight_quantizer" module suffix), then proceed to add valid module paths to
missing_wq_module_paths and continue to the later loop that calls
module.disable() on modules found by model.named_modules().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 4625d7a6-f374-4790-ab93-d358d6aa3a27

📥 Commits

Reviewing files that changed from the base of the PR and between 8c0af46 and 6016cfe.

📒 Files selected for processing (2)
  • examples/vllm_serve/vllm_reload_utils.py
  • modelopt/torch/export/plugins/vllm_fakequant_megatron.py

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@kinjalpatel27 kinjalpatel27 marked this pull request as ready for review April 14, 2026 23:42
@kinjalpatel27 kinjalpatel27 requested review from a team as code owners April 14, 2026 23:42
@kinjalpatel27 kinjalpatel27 requested a review from cjluo-nv April 14, 2026 23:42
@kinjalpatel27 kinjalpatel27 changed the title from "Kinjal/vllm fq megatron weight fold" to "vLLM fakequant fold weight_quantizer for megatron export" Apr 14, 2026
@kinjalpatel27 kinjalpatel27 requested a review from jenchen13 April 15, 2026 20:19
Contributor

@meenchen meenchen left a comment


1. Correctness — Stricter Missing Key Handling May Break PP

[QUESTION] Removed count-mismatch warning may regress pipeline parallelism
The old code warned when checkpoint and model quantizer key counts didn't match (e.g., PP splits) and proceeded. The new code raises ValueError for any missing non-weight-quantizer key. If PP causes missing input_quantizer keys in the checkpoint (model has more layers than this rank's checkpoint), this would now error where it previously warned. Has this been tested with PP > 1?

2. Correctness — module.disable() without type check

[SUGGESTION] No isinstance guard before calling disable()

for name, module in model.named_modules():
    if name in missing_wq_module_paths and hasattr(module, "disable"):
        module.disable()

The hasattr(module, "disable") check is a reasonable duck-type guard, but if any non-TensorQuantizer module at that path happens to have a disable() method, it would be called unintentionally. Consider adding isinstance(module, TensorQuantizer) for safety.
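A sketch of the suggested guard (`TensorQuantizerStandIn` stands in for modelopt's TensorQuantizer class; the real import path is omitted here on purpose):

```python
class TensorQuantizerStandIn:
    """Toy stand-in for the real TensorQuantizer class."""
    def __init__(self):
        self.enabled = True
    def disable(self):
        self.enabled = False

class UnrelatedModule:
    """A module that happens to expose disable() but must not be touched."""
    def disable(self):
        raise RuntimeError("disabled the wrong module")

def disable_missing_wq(named_modules, missing_paths):
    # isinstance guard instead of bare hasattr duck typing
    for name, module in named_modules.items():
        if name in missing_paths and isinstance(module, TensorQuantizerStandIn):
            module.disable()

wq = TensorQuantizerStandIn()
disable_missing_wq(
    {"a.weight_quantizer": wq, "b.weight_quantizer": UnrelatedModule()},
    {"a.weight_quantizer", "b.weight_quantizer"},
)
```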

3. Design — Device juggling in _get_quantized_state

[SUGGESTION] Temporary CUDA lift for quantizer buffers is fragile
When weight is on CPU, buffers are moved to CUDA, the quantizer runs, then buffers are restored. If the quantizer forward internally creates new buffers or registers state, those won't be moved back. The finally block only restores buffers captured before the forward. Consider a comment noting this assumption, or use quantizer.to(quant_device) / quantizer.to(orig_device) for a cleaner round-trip.
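The suggested round-trip could be sketched like this (assuming the quantizer is an `nn.Module` whose buffers move with `.to()`; `ClampQuantizer` is illustrative, not the modelopt implementation):

```python
import torch

class ClampQuantizer(torch.nn.Module):
    """Toy quantizer whose state lives in a registered buffer, so
    Module.to() moves it together with any buffers created later."""
    def __init__(self):
        super().__init__()
        self.register_buffer("amax", torch.tensor(1.0))
    def forward(self, w):
        return w.clamp(-self.amax, self.amax)

def run_quantizer_on_device(quantizer, weight, device):
    """Move the whole quantizer module for the forward, then restore
    it, instead of juggling individual buffers."""
    orig_device = weight.device
    try:
        quantizer.to(device)
        with torch.no_grad():
            return quantizer(weight.to(device)).to(orig_device)
    finally:
        quantizer.to(orig_device)

out = run_quantizer_on_device(ClampQuantizer(), torch.tensor([0.5, 2.0]), "cpu")
```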

@realAsma realAsma requested a review from mxinO April 15, 2026 23:34
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@kinjalpatel27
Contributor Author

  1. Correctness — Stricter Missing Key Handling May Break PP
    [QUESTION] Removed count-mismatch warning may regress pipeline parallelism
    The old code warned when checkpoint and model quantizer key counts didn't match (e.g., PP splits) and proceeded. The new code raises ValueError for any missing non-weight-quantizer key. If PP causes missing input_quantizer keys in the checkpoint (model has more layers than this rank's checkpoint), this would now error where it previously warned. Has this been tested with PP > 1?

Addressed it in 3e2ec17. Verified it works with PP>1

  2. Correctness — module.disable() without type check
    [SUGGESTION] No isinstance guard before calling disable()

for name, module in model.named_modules():
    if name in missing_wq_module_paths and hasattr(module, "disable"):
        module.disable()
The hasattr(module, "disable") check is a reasonable duck-type guard, but if any non-TensorQuantizer module at that path happens to have a disable() method, it would be called unintentionally. Consider adding isinstance(module, TensorQuantizer) for safety.

Addressed it in 3e2ec17

  3. Design — Device juggling in _get_quantized_state
    [SUGGESTION] Temporary CUDA lift for quantizer buffers is fragile
    When weight is on CPU, buffers are moved to CUDA, the quantizer runs, then buffers are restored. If the quantizer forward internally creates new buffers or registers state, those won't be moved back. The finally block only restores buffers captured before the forward. Consider a comment noting this assumption, or use quantizer.to(quant_device) / quantizer.to(orig_device) for a cleaner round-trip.

Addressed it in 3e2ec17

@kinjalpatel27 kinjalpatel27 requested a review from meenchen April 16, 2026 00:36
kinjalpatel27 and others added 2 commits April 16, 2026 01:02
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@mxinO
Contributor

mxinO commented Apr 16, 2026

The loading-logic changes seem mainly to be about disabling weight quantizers, which is a common case. A cleaner way may be to directly add an env var LOADING_FOLDED_CKPT in fakequant_worker.py and, when it is set, append a {"quantizer_name": "*weight_quantizer", "enable": False} rule to the quant config, so all this handling can be avoided.
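The suggestion might look like the following sketch (the env var name comes from the comment; the config layout shown is illustrative, not the exact modelopt schema):

```python
import os

def apply_folded_ckpt_override(quant_cfg: dict) -> dict:
    """When LOADING_FOLDED_CKPT is set, append a wildcard rule that
    disables every weight quantizer before reload."""
    if os.environ.get("LOADING_FOLDED_CKPT") == "1":
        quant_cfg = dict(quant_cfg)  # avoid mutating the caller's config
        quant_cfg["*weight_quantizer"] = {"enable": False}
    return quant_cfg

os.environ["LOADING_FOLDED_CKPT"] = "1"
cfg = apply_folded_ckpt_override({"*input_quantizer": {"num_bits": 8}})
```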

@kinjalpatel27
Contributor Author

The loading logic changes seem mainly for disabling weight quantizers, disabling weight quantizers seem an often used case, a cleaner way may be directly add an env LOADING_FOLDED_CKPT in fakequant_worker.py, and append a {"quantizer_name": "*weight_quantizer", "enable": False} to the quant config if set, so all these handling can be avoided.
I think adding _union_quantizer_keys_across_ranks is helpful regardless, since it avoids the warning when PP > 1. The main difference is that weight_quantizer is now disabled if its keys are missing from the checkpoint. I wanted to avoid one more environment variable that the user has to manage. Eventually, we will want to add support such that the user only has to pass the checkpoint directory (and maybe the quantizer state file) and we can reload the exported checkpoint.

@kinjalpatel27 kinjalpatel27 merged commit f238d93 into main Apr 16, 2026
57 of 61 checks passed
@kinjalpatel27 kinjalpatel27 deleted the kinjal/vllm_fq_megatron_weight_fold branch April 16, 2026 16:42