
vLLM fakequant fold weight_quantizer for megatron export #1246

Merged
kinjalpatel27 merged 10 commits into main from kinjal/vllm_fq_megatron_weight_fold
Apr 16, 2026

Conversation

@kinjalpatel27
Contributor

@kinjalpatel27 kinjalpatel27 commented Apr 13, 2026

What does this PR do?

Type of change: Bug fix

During Megatron→vLLM fakequant export (export_mcore_gpt_to_hf_vllm_fq), the weight_quantizer is now applied as fake-quantization (quantize + dequantize) directly into the exported weight tensor, and its amax is no longer saved to quantizer_state.pth. On reload, if weight_quantizer keys are absent from the checkpoint (because they were folded at export time), the corresponding quantizer modules are disabled.
This change is especially useful when weight_quantizer amax values are not synced across experts: folding lets each expert keep its own amax in the exported weights, for better accuracy.
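As a toy sketch of what "folding" means here (illustrative only, not the modelopt implementation): fake quantization replaces each weight with its quantize-then-dequantize image, so after folding the exported tensor already carries each expert's own amax and nothing needs to be saved for reload.

```python
def fake_quant(w, amax, num_bits=8):
    """Quantize then dequantize: the result stays float but is snapped
    to the low-bit grid implied by this expert's own amax."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = amax / qmax
    q = max(-qmax - 1, min(qmax, round(w / scale)))
    return q * scale

# Each expert keeps its own (unsynced) amax; folding bakes it into the
# exported weight, so no weight_quantizer amax is written to
# quantizer_state.pth.
expert_weights = [0.5, -1.2]
expert_amax = [1.0, 2.0]
folded_weights = [fake_quant(w, a) for w, a in zip(expert_weights, expert_amax)]
```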

Usage

# Unchanged — export API is the same
export_mcore_gpt_to_hf_vllm_fq(model, pretrained_model_name_or_path=..., export_dir=...)

Testing

Step 1 — Quantize (run from Megatron-LM examples/post_training/modelopt):

HF_MODEL_CKPT=<path/to/hf/weights> MLM_MODEL_SAVE=<quant-ckpt-name> \
bash quantize.sh <hf-model-id> NVFP4_DEFAULT_CFG

Step 2 — Export for vLLM fakequant:

MLM_EXTRA_ARGS=--export-vllm-fq \
HF_MODEL_CKPT=<path/to/hf/weights> \
MLM_MODEL_CKPT=<quant-ckpt-name> \
EXPORT_DIR=<export-dir> \
bash export.sh <hf-model-id>

Step 3 — Serve (run from examples/vllm_serve):

QUANT_CFG=NVFP4_DEFAULT_CFG \
QUANT_FILE_PATH=<export-dir>/quantizer_state.pth \
python3 vllm_serve_fakequant.py <export-dir> \
    -tp 1 --served-model-name <model-name> \
    --host 0.0.0.0 --port 8000 \
    --trust-remote-code --enforce-eager \
    --disable-custom-all-reduce \
    --gpu-memory-utilization 0.8

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: N/A

Additional Information

Summary by CodeRabbit

  • Bug Fixes

    • Better handling when loading checkpoints: missing weight-quantizer entries are validated and corresponding modules are disabled to avoid load failures.
  • Improvements

    • Export now folds enabled weight quantizers into exported weights when present and omits internal weight-quantizer tensors from the exported state to produce cleaner exports.

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Apr 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Changes modify quantizer weight handling in vLLM export and reload workflows. The export plugin now folds enabled weight quantizers into exported weights and skips exporting weight-quantizer tensors. The reload utility validates missing quantizer keys and disables corresponding modules for weight quantizers to prevent folding during checkpoint loading. One whitespace-only change was added.

Changes

Cohort / File(s) / Summary:

Whitespace Formatting (examples/vllm_serve/fakequant_worker.py):
Added a blank line after the trust_remote_code computation; no functional changes.

Quantizer State Validation & Module Disabling (examples/vllm_serve/vllm_reload_utils.py):
Removed the global quantizer-key-count check and the warnings import. Missing checkpoint keys are now validated: a missing key containing weight_quantizer is permitted only if it can be mapped to a module path; such module paths are recorded and module.disable() is called on modules that have a disable() method, preventing weight folding during checkpoint loading. Other missing quantizer keys still raise ValueError.

Weight Quantizer Folding & Export (modelopt/torch/export/plugins/vllm_fakequant_megatron.py):
When exporting weights, if module.weight_quantizer exists and is enabled, fake quantization is applied to the weight under torch.no_grad() (temporarily moving quantizer buffers to a CUDA device if needed), the result is converted to the target dtype, and the weight is moved to CPU for export. Quantizer state items whose names include weight_quantizer are skipped; other quantizer state is still exported.
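A minimal sketch of the folding path described above (assuming a TensorQuantizer-like callable exposing an `is_enabled` flag; `DummyQuantizer` below is a stand-in, not the modelopt class):

```python
import types

import torch

def fold_weight_quantizer(module, export_dtype=torch.bfloat16):
    """Bake fake quantization into the exported weight, then move it
    to CPU. Falls back to the plain weight when no enabled quantizer
    is present."""
    weight = module.weight.detach()
    quantizer = getattr(module, "weight_quantizer", None)
    if quantizer is not None and getattr(quantizer, "is_enabled", False):
        # The real code lifts quantizer buffers to CUDA when available;
        # this sketch just runs wherever a device is available.
        device = "cuda" if torch.cuda.is_available() else weight.device
        with torch.no_grad():
            weight = quantizer(weight.to(device))
    return weight.to(export_dtype).cpu()

class DummyQuantizer:
    """Stand-in fake quantizer: snap weights to a 0.5-spaced grid."""
    is_enabled = True
    def __call__(self, w):
        return torch.round(w * 2) / 2

module = types.SimpleNamespace(
    weight=torch.tensor([0.3, -1.2]), weight_quantizer=DummyQuantizer()
)
folded = fold_weight_quantizer(module)
```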

Sequence Diagram(s)

sequenceDiagram
    participant Exporter
    participant Module
    participant Quantizer
    participant CPU
    participant CUDA
    Exporter->>Module: inspect weight, check weight_quantizer
    alt weight_quantizer exists and enabled
        Exporter->>Quantizer: move quantizer buffers -> CUDA (if weight on CPU and CUDA available)
        Exporter->>Quantizer: apply quantizer(weight) under no_grad
        Quantizer-->>Exporter: quantized weight
        Exporter->>Quantizer: restore quantizer buffers to original devices
        Exporter->>CPU: move quantized weight -> CPU and store for export
    else no enabled weight_quantizer
        Exporter->>CPU: move module.weight -> CPU and store for export
    end
    Exporter->>Exporter: emit quantizer state excluding items with "weight_quantizer" in name
Loading
sequenceDiagram
    participant Loader
    participant Checkpoint
    participant Model
    participant Module
    Loader->>Checkpoint: read checkpoint quantizer keys
    Loader->>Model: iterate model.named_modules()
    alt missing checkpoint key contains "weight_quantizer" and maps to module path
        Loader->>Module: call module.disable() if available
    else missing key unrelated or cannot map
        Loader-->>Loader: raise ValueError
    end
    Loader->>Model: continue loading remaining state
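In pure-Python form, the reload rule the diagram describes might look like this sketch (`ToyQuantizer` is a stand-in for a quantizer module with a disable() method; the real utility walks `model.named_modules()`):

```python
class ToyQuantizer:
    """Stand-in for a quantizer module exposing disable()."""
    def __init__(self):
        self.enabled = True
    def disable(self):
        self.enabled = False

def handle_missing_keys(missing_keys, named_modules):
    """Missing weight_quantizer state means the weights were folded at
    export: map each key to its module path and disable the module.
    Any other missing quantizer key is a real error."""
    wq_paths = set()
    for key in missing_keys:
        if "weight_quantizer" not in key:
            raise ValueError(f"Missing quantizer state key: {key}")
        # e.g. "layers.0.weight_quantizer._amax" -> "layers.0.weight_quantizer"
        prefix, _, _ = key.rpartition("weight_quantizer")
        wq_paths.add(prefix + "weight_quantizer")
    for name, module in named_modules.items():
        if name in wq_paths and hasattr(module, "disable"):
            module.disable()

quantizer = ToyQuantizer()
handle_missing_keys(
    ["layers.0.weight_quantizer._amax"],
    {"layers.0.weight_quantizer": quantizer},
)
```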

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'vLLM fakequant fold weight_quantizer for megatron export' directly and specifically describes the main change: implementing weight_quantizer folding during the Megatron-to-vLLM fake-quantization export process, which is the core objective addressed by changes across all three modified files.
Security Anti-Patterns ✅ Passed All security-sensitive code patterns adhere to SECURITY.md guidelines: torch.load() uses weights_only=True, trust_remote_code defaults to False and is configurable, no unsafe patterns detected.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Contributor

github-actions bot commented Apr 13, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-16 16:44 UTC

@codecov

codecov bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.56%. Comparing base (07ae8e7) to head (5b7eaf0).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...pt/torch/export/plugins/vllm_fakequant_megatron.py 88.88% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1246      +/-   ##
==========================================
+ Coverage   75.61%   76.56%   +0.94%     
==========================================
  Files         459      459              
  Lines       48597    48613      +16     
==========================================
+ Hits        36747    37220     +473     
+ Misses      11850    11393     -457     
Flag Coverage Δ
examples 41.38% <0.00%> (+11.54%) ⬆️
gpu 59.96% <88.88%> (-0.54%) ⬇️
unit 52.20% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
examples/vllm_serve/vllm_reload_utils.py (2)

462-472: Consider filtering checkpoint count symmetrically for clearer comparison.

The comparison at line 467 compares checkpoint_quant_count (all quantizer keys) against model_non_wq_quant_count (excluding weight_quantizer). This works correctly when the checkpoint has no weight_quantizer keys (expected behavior), but the asymmetry could cause a spurious warning if a checkpoint unexpectedly contains weight_quantizer entries.

For symmetry and clarity, consider filtering the checkpoint count the same way:

♻️ Suggested change for symmetric comparison
-    checkpoint_quant_count = len(checkpoint_quant_keys)
-    
-    # Ensure counts match (excluding weight quantizer keys, which may be absent when weights
-    # were pre-folded at export)
-    model_non_wq_quant_count = sum(1 for k in model_quant_keys if "weight_quantizer" not in k)
-    if checkpoint_quant_count != model_non_wq_quant_count:
+    # Ensure counts match (excluding weight quantizer keys, which may be absent when weights
+    # were pre-folded at export)
+    checkpoint_non_wq_quant_count = sum(
+        1 for k in checkpoint_quant_keys if "weight_quantizer" not in k
+    )
+    model_non_wq_quant_count = sum(1 for k in model_quant_keys if "weight_quantizer" not in k)
+    if checkpoint_non_wq_quant_count != model_non_wq_quant_count:
         warnings.warn(
-            f"Mismatch in quantizer state key counts: checkpoint has {checkpoint_quant_count} "
+            f"Mismatch in quantizer state key counts: checkpoint has {checkpoint_non_wq_quant_count} "
             f"quant keys but model has {model_non_wq_quant_count} non-weight quantizer state keys. "
             f"This can happen if the model is using PP."
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/vllm_serve/vllm_reload_utils.py` around lines 462 - 472, The
comparison currently uses checkpoint_quant_count (len(checkpoint_quant_keys)) vs
model_non_wq_quant_count (sum over model_quant_keys excluding
"weight_quantizer"), which is asymmetric; change the checkpoint side to filter
out weight quantizer keys the same way (e.g., compute
checkpoint_non_wq_quant_count = sum(1 for k in checkpoint_quant_keys if
"weight_quantizer" not in k) and compare that to model_non_wq_quant_count) so
the warning only fires when non-weight quantizer key counts truly differ; update
references to checkpoint_quant_count in the surrounding logic to use the new
filtered count.

463-463: Remove trailing whitespace.

Line 463 appears to have trailing whitespace. Ruff will flag this.

🧹 Fix
-    
+
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/vllm_serve/vllm_reload_utils.py` at line 463, Remove the trailing
whitespace flagged by Ruff in the vllm_reload_utils.py module by deleting the
extra spaces at the end of the blank line around the reload utilities area (the
stray whitespace on that empty line); ensure the line contains no trailing
spaces so the file passes linting.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@examples/vllm_serve/vllm_reload_utils.py`:
- Around line 462-472: The comparison currently uses checkpoint_quant_count
(len(checkpoint_quant_keys)) vs model_non_wq_quant_count (sum over
model_quant_keys excluding "weight_quantizer"), which is asymmetric; change the
checkpoint side to filter out weight quantizer keys the same way (e.g., compute
checkpoint_non_wq_quant_count = sum(1 for k in checkpoint_quant_keys if
"weight_quantizer" not in k) and compare that to model_non_wq_quant_count) so
the warning only fires when non-weight quantizer key counts truly differ; update
references to checkpoint_quant_count in the surrounding logic to use the new
filtered count.
- Line 463: Remove the trailing whitespace flagged by Ruff in the
vllm_reload_utils.py module by deleting the extra spaces at the end of the blank
line around the reload utilities area (the stray whitespace on that empty line);
ensure the line contains no trailing spaces so the file passes linting.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 4d77ebdd-19dd-4f19-b343-71baeb996154

📥 Commits

Reviewing files that changed from the base of the PR and between b53afa6 and 8c0af46.

📒 Files selected for processing (1)
  • examples/vllm_serve/vllm_reload_utils.py

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/vllm_serve/vllm_reload_utils.py`:
- Around line 440-463: The loop that looks for weight_quantizer module suffixes
can silently ignore keys like "model.weight_quantizer_v2._amax" because wq_i may
be None; update the logic in the block processing model_quant_keys (the code
that builds missing_wq_module_paths using wq_i) to not silently skip when wq_i
is None: either raise a ValueError naming the offending key or emit a clear
warning via logging (include the key and explain it didn’t match the expected
"weight_quantizer" module suffix), then proceed to add valid module paths to
missing_wq_module_paths and continue to the later loop that calls
module.disable() on modules found by model.named_modules().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 4625d7a6-f374-4790-ab93-d358d6aa3a27

📥 Commits

Reviewing files that changed from the base of the PR and between 8c0af46 and 6016cfe.

📒 Files selected for processing (2)
  • examples/vllm_serve/vllm_reload_utils.py
  • modelopt/torch/export/plugins/vllm_fakequant_megatron.py

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@kinjalpatel27 kinjalpatel27 marked this pull request as ready for review April 14, 2026 23:42
@kinjalpatel27 kinjalpatel27 requested review from a team as code owners April 14, 2026 23:42
@kinjalpatel27 kinjalpatel27 requested a review from cjluo-nv April 14, 2026 23:42
@kinjalpatel27 kinjalpatel27 changed the title from "Kinjal/vllm fq megatron weight fold" to "vLLM fakequant fold weight_quantizer for megatron export" Apr 14, 2026
@kinjalpatel27 kinjalpatel27 requested a review from jenchen13 April 15, 2026 20:19
Contributor

@meenchen meenchen left a comment


1. Correctness — Stricter Missing Key Handling May Break PP

[QUESTION] Removed count-mismatch warning may regress pipeline parallelism
The old code warned when checkpoint and model quantizer key counts didn't match (e.g., PP splits) and proceeded. The new code raises ValueError for any missing non-weight-quantizer key. If PP causes missing input_quantizer keys in the checkpoint (model has more layers than this rank's checkpoint), this would now error where it previously warned. Has this been tested with PP > 1?

2. Correctness — module.disable() without type check

[SUGGESTION] No isinstance guard before calling disable()

for name, module in model.named_modules():
    if name in missing_wq_module_paths and hasattr(module, "disable"):
        module.disable()

The hasattr(module, "disable") check is a reasonable duck-type guard, but if any non-TensorQuantizer module at that path happens to have a disable() method, it would be called unintentionally. Consider adding isinstance(module, TensorQuantizer) for safety.
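A sketch of the suggested guard (`TensorQuantizerStandIn` stands in for modelopt's TensorQuantizer class; the real import path is omitted here on purpose):

```python
class TensorQuantizerStandIn:
    """Toy stand-in for the real TensorQuantizer class."""
    def __init__(self):
        self.enabled = True
    def disable(self):
        self.enabled = False

class UnrelatedModule:
    """A module that happens to expose disable() but must not be touched."""
    def disable(self):
        raise RuntimeError("disabled the wrong module")

def disable_missing_wq(named_modules, missing_paths):
    # isinstance guard instead of bare hasattr duck typing
    for name, module in named_modules.items():
        if name in missing_paths and isinstance(module, TensorQuantizerStandIn):
            module.disable()

wq = TensorQuantizerStandIn()
disable_missing_wq(
    {"a.weight_quantizer": wq, "b.weight_quantizer": UnrelatedModule()},
    {"a.weight_quantizer", "b.weight_quantizer"},
)
```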

3. Design — Device juggling in _get_quantized_state

[SUGGESTION] Temporary CUDA lift for quantizer buffers is fragile
When weight is on CPU, buffers are moved to CUDA, the quantizer runs, then buffers are restored. If the quantizer forward internally creates new buffers or registers state, those won't be moved back. The finally block only restores buffers captured before the forward. Consider a comment noting this assumption, or use quantizer.to(quant_device) / quantizer.to(orig_device) for a cleaner round-trip.
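The suggested round-trip could be sketched like this (assuming the quantizer is an `nn.Module` whose buffers move with `.to()`; `ClampQuantizer` is illustrative, not the modelopt implementation):

```python
import torch

class ClampQuantizer(torch.nn.Module):
    """Toy quantizer whose state lives in a registered buffer, so
    Module.to() moves it together with any buffers created later."""
    def __init__(self):
        super().__init__()
        self.register_buffer("amax", torch.tensor(1.0))
    def forward(self, w):
        return w.clamp(-self.amax, self.amax)

def run_quantizer_on_device(quantizer, weight, device):
    """Move the whole quantizer module for the forward, then restore
    it, instead of juggling individual buffers."""
    orig_device = weight.device
    try:
        quantizer.to(device)
        with torch.no_grad():
            return quantizer(weight.to(device)).to(orig_device)
    finally:
        quantizer.to(orig_device)

out = run_quantizer_on_device(ClampQuantizer(), torch.tensor([0.5, 2.0]), "cpu")
```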

@realAsma realAsma requested a review from mxinO April 15, 2026 23:34
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@kinjalpatel27
Contributor Author

  1. Correctness — Stricter Missing Key Handling May Break PP
    [QUESTION] Removed count-mismatch warning may regress pipeline parallelism
    The old code warned when checkpoint and model quantizer key counts didn't match (e.g., PP splits) and proceeded. The new code raises ValueError for any missing non-weight-quantizer key. If PP causes missing input_quantizer keys in the checkpoint (model has more layers than this rank's checkpoint), this would now error where it previously warned. Has this been tested with PP > 1?

Addressed it in 3e2ec17. Verified it works with PP>1

  2. Correctness — module.disable() without type check
    [SUGGESTION] No isinstance guard before calling disable()

for name, module in model.named_modules():
    if name in missing_wq_module_paths and hasattr(module, "disable"):
        module.disable()
The hasattr(module, "disable") check is a reasonable duck-type guard, but if any non-TensorQuantizer module at that path happens to have a disable() method, it would be called unintentionally. Consider adding isinstance(module, TensorQuantizer) for safety.

Addressed it in 3e2ec17

  3. Design — Device juggling in _get_quantized_state
    [SUGGESTION] Temporary CUDA lift for quantizer buffers is fragile
    When weight is on CPU, buffers are moved to CUDA, the quantizer runs, then buffers are restored. If the quantizer forward internally creates new buffers or registers state, those won't be moved back. The finally block only restores buffers captured before the forward. Consider a comment noting this assumption, or use quantizer.to(quant_device) / quantizer.to(orig_device) for a cleaner round-trip.

Addressed it in 3e2ec17

@kinjalpatel27 kinjalpatel27 requested a review from meenchen April 16, 2026 00:36
kinjalpatel27 and others added 2 commits April 16, 2026 01:02
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@mxinO
Contributor

mxinO commented Apr 16, 2026

The loading-logic changes seem mainly to be about disabling weight quantizers, which is a common case. A cleaner way may be to directly add an env var LOADING_FOLDED_CKPT in fakequant_worker.py and, when it is set, append a {"quantizer_name": "*weight_quantizer", "enable": False} rule to the quant config, so all this handling can be avoided.
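The suggestion might look like the following sketch (the env var name comes from the comment; the config layout shown is illustrative, not the exact modelopt schema):

```python
import os

def apply_folded_ckpt_override(quant_cfg: dict) -> dict:
    """When LOADING_FOLDED_CKPT is set, append a wildcard rule that
    disables every weight quantizer before reload."""
    if os.environ.get("LOADING_FOLDED_CKPT") == "1":
        quant_cfg = dict(quant_cfg)  # avoid mutating the caller's config
        quant_cfg["*weight_quantizer"] = {"enable": False}
    return quant_cfg

os.environ["LOADING_FOLDED_CKPT"] = "1"
cfg = apply_folded_ckpt_override({"*input_quantizer": {"num_bits": 8}})
```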

@kinjalpatel27
Contributor Author

The loading logic changes seem mainly for disabling weight quantizers, disabling weight quantizers seem an often used case, a cleaner way may be directly add an env LOADING_FOLDED_CKPT in fakequant_worker.py, and append a {"quantizer_name": "*weight_quantizer", "enable": False} to the quant config if set, so all these handling can be avoided.
I think adding _union_quantizer_keys_across_ranks is helpful regardless, since it avoids the warning when PP > 1. The main difference is that weight_quantizer is now disabled if its keys are missing from the checkpoint. I wanted to avoid one more environment variable that the user has to manage. Eventually, we will want to add support such that the user only has to pass the checkpoint directory (and maybe the quantizer state file) and we can reload the exported checkpoint.

@kinjalpatel27 kinjalpatel27 merged commit f238d93 into main Apr 16, 2026
57 of 61 checks passed
@kinjalpatel27 kinjalpatel27 deleted the kinjal/vllm_fq_megatron_weight_fold branch April 16, 2026 16:42