Add post-quantization validation for MoE models to PTQ skill
PTQ can silently skip MoE expert quantization when config patterns
(*mlp*, *block_sparse_moe*) don't match the model's naming convention
(e.g., Gemma4 uses layers.N.experts.* instead of mlp.experts.*).
This causes deployment failures downstream when vLLM/SGLang tries to
load unquantized experts as quantized.
Add Step 5 validation to detect this:
- Compare exported weight names against scale params and exclude list
- Flag weights with no scales that aren't in exclude_modules
- Reference the deployment "quant/unquant layer confusion" pattern
Also add MoE expert verification to unsupported-models.md debugging tips.
Learned from: Gemma4-26B-A4B NVFP4 PTQ succeeded but experts were
BF16, causing vLLM FusedMoE shape mismatch at deployment time.
Fix tracked in PR #1219.
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
`ls -lh <output_path>/`
Report the path and size to the user.
### Post-quantization validation
Verify the exported checkpoint's quantization pattern matches the recipe used. The quantization config patterns may silently miss layers if the model uses non-standard naming — this only surfaces later as deployment failures.
**What to check**: The recipe defines which layers should be quantized. For example:
| Layer type | Config pattern | Expected status | How to verify |
|---|---|---|---|
| VLM projector | `multi_modal_projector.*` | usually excluded | Verify in exclude list |
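When a layer is intentionally unquantized, the exclude list in the exported config is what records that decision for deployment frameworks. A hypothetical fragment of `hf_quant_config.json` (the exact field layout can vary between ModelOpt versions, so treat this shape as an assumption):

```json
{
  "quantization": {
    "quant_algo": "NVFP4",
    "exclude_modules": ["lm_head", "multi_modal_projector*"]
  }
}
```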
**If warnings appear**: either the layers should have been quantized (fix the config pattern and re-run PTQ), or they are intentionally unquantized (add them to `exclude_modules` in the checkpoint's `hf_quant_config.json` and `config.json` to prevent deployment failures).
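The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not ModelOpt's actual validation script: the scale-suffix names (`weight_scale`, `weight_scale_2`, `input_scale`) and glob-style exclude matching are assumptions about the export format.

```python
from fnmatch import fnmatch


def find_silently_skipped(weight_names, exclude_modules):
    """Flag modules whose weight has no quantization scales and is not excluded."""
    names = set(weight_names)
    flagged = []
    for name in names:
        if not name.endswith(".weight"):
            continue
        module = name[: -len(".weight")]
        # A quantized layer exports scale params alongside its weight (assumed suffixes).
        has_scales = any(
            f"{module}.{suffix}" in names
            for suffix in ("weight_scale", "weight_scale_2", "input_scale")
        )
        excluded = any(fnmatch(module, pat) for pat in exclude_modules)
        if not has_scales and not excluded:
            flagged.append(module)
    return sorted(flagged)


# Example: a Gemma-style expert weight that an '*mlp*' pattern would miss.
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.q_proj.weight_scale",
    "model.layers.0.experts.3.up_proj.weight",  # no scales exported
    "lm_head.weight",
]
print(find_silently_skipped(names, exclude_modules=["lm_head*"]))
# -> ['model.layers.0.experts.3.up_proj']
```

Every flagged module is either a config-pattern bug (re-run PTQ) or a missing `exclude_modules` entry.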
## Key API Rules
- `mtq.register()` classes **must** define `_setup()` and call it from `__init__`
- **Check quantizer summary**: `mtq.print_quant_summary(model)` shows which quantizers are enabled/disabled
- **Inspect dtypes**: After loading, iterate `model.named_parameters()` and check for unexpected FP8 tensors
- **Watch for silent disabling**: A misconfigured wildcard pattern can silently disable quantizers — always verify the summary
- **Validate quantization pattern after export**: Run the validation script from SKILL.md Step 5 on the exported checkpoint. It checks every linear layer is either quantized (has scale params) or explicitly excluded. Layers that are neither were silently skipped — common for models with non-standard naming (e.g., Gemma4 `experts.*` missed by `*mlp*` patterns). This causes deployment failures when the framework tries to load BF16 weights as quantized
- **Read pip errors carefully**: `ResolutionImpossible` means dependency conflict (try `--no-deps`), NOT network failure. Check for `Connection refused`/`Name resolution failed` before concluding network is down
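The dtype inspection above can be sketched without any framework dependency. With a real model you would feed it `((n, str(p.dtype)) for n, p in model.named_parameters())`; the dtype strings below are illustrative, and the majority heuristic is a rough sketch rather than ModelOpt's method.

```python
from collections import Counter


def flag_odd_dtypes(named_params):
    """Return params whose dtype differs from the checkpoint-wide majority —
    a quick way to surface layers a quantization pass silently left behind."""
    pairs = list(named_params)
    majority = Counter(dtype for _, dtype in pairs).most_common(1)[0][0]
    return [(name, dtype) for name, dtype in pairs if dtype != majority]


params = [
    ("layers.0.q_proj.weight", "float8_e4m3fn"),
    ("layers.0.k_proj.weight", "float8_e4m3fn"),
    ("layers.0.experts.0.up_proj.weight", "bfloat16"),  # expert skipped by PTQ
]
print(flag_odd_dtypes(params))
# -> [('layers.0.experts.0.up_proj.weight', 'bfloat16')]
```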