Commit b14ed54
Add post-quantization validation for MoE models to PTQ skill
PTQ can silently skip MoE expert quantization when config patterns
(*mlp*, *block_sparse_moe*) don't match the model's naming convention
(e.g., Gemma4 uses layers.N.experts.* instead of mlp.experts.*).
This causes downstream deployment failures when vLLM/SGLang tries to
load the unquantized experts as if they were quantized.
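The mismatch is easy to reproduce in isolation. Below is a minimal sketch, assuming
shell-style wildcard (fnmatch) matching of module names; the module names shown are
illustrative examples of the two layouts, not taken from actual checkpoints:

```python
# Minimal sketch, assuming fnmatch-style wildcard matching of module names.
# Module names below are illustrative examples of the two layouts.
import fnmatch

expert_patterns = ["*mlp*", "*block_sparse_moe*"]

mixtral_style = "model.layers.0.block_sparse_moe.experts.0.w1"
flat_expert_style = "model.layers.0.experts.0.gate_proj"  # no "mlp"/"block_sparse_moe" substring

print(any(fnmatch.fnmatch(mixtral_style, p) for p in expert_patterns))      # True  -> experts quantized
print(any(fnmatch.fnmatch(flat_expert_style, p) for p in expert_patterns))  # False -> experts silently skipped
```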
Add Step 5 validation to detect this (sketched after the list below):
- Compare exported weight names against scale params and exclude list
- Flag weights with no scales that aren't in exclude_modules
- Reference the deployment "quant/unquant layer confusion" pattern
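A minimal sketch of that check, assuming a safetensors export in which each quantized
module has a sibling tensor whose name contains "scale" (e.g. `weight_scale`). The
`find_unquantized_weights` helper and the exclude patterns are illustrative, not the
skill's actual implementation:

```python
# Minimal sketch: flag exported weights that have no matching scale tensor and
# are not covered by exclude_modules patterns (i.e. likely skipped by PTQ).
import fnmatch
from safetensors import safe_open

def find_unquantized_weights(safetensor_files, exclude_patterns):
    # Collect all tensor names from the exported checkpoint shards.
    names = set()
    for path in safetensor_files:
        with safe_open(path, framework="pt") as f:
            names.update(f.keys())

    suspects = []
    for name in names:
        if not name.endswith(".weight"):
            continue
        module = name[: -len(".weight")]
        # A quantized module is expected to export a scale tensor alongside its weight.
        has_scale = any(k.startswith(module) and "scale" in k for k in names)
        excluded = any(fnmatch.fnmatch(module, p) for p in exclude_patterns)
        if not has_scale and not excluded:
            suspects.append(module)  # quantization silently skipped this module
    return suspects

# Example usage: flag MoE experts that stayed BF16 after an NVFP4 PTQ export.
# leftovers = find_unquantized_weights(
#     ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"],
#     exclude_patterns=["*lm_head*", "*embed_tokens*"],
# )
```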
Also add MoE expert verification to unsupported-models.md debugging tips.
Learned from: Gemma4-26B-A4B NVFP4 PTQ succeeded, but the experts remained
BF16, causing a vLLM FusedMoE shape mismatch at deployment time.
Fix tracked in PR #1219.
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
2 files changed (+33, −0)