
Commit b14ed54

Add post-quantization validation for MoE models to PTQ skill
PTQ can silently skip MoE expert quantization when config patterns (*mlp*, *block_sparse_moe*) don't match the model's naming convention (e.g., Gemma4 uses layers.N.experts.* instead of mlp.experts.*). This causes deployment failures downstream when vLLM/SGLang tries to load unquantized experts as quantized.

Add Step 5 validation to detect this:
- Compare exported weight names against scale params and exclude list
- Flag weights with no scales that aren't in exclude_modules
- Reference the deployment "quant/unquant layer confusion" pattern

Also add MoE expert verification to unsupported-models.md debugging tips.

Learned from: Gemma4-26B-A4B NVFP4 PTQ succeeded but experts were BF16, causing vLLM FusedMoE shape mismatch at deployment time. Fix tracked in PR #1219.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent da41d79 commit b14ed54

File tree

2 files changed: +33 -0 lines changed

.claude/skills/ptq/SKILL.md

Lines changed: 32 additions & 0 deletions
@@ -113,6 +113,38 @@ ls -lh <output_path>/

Report the path and size to the user.

### Post-quantization validation (MoE models)

For MoE models, verify that expert layers were actually quantized. The quantization config patterns (`*mlp*`, `*block_sparse_moe*`) may not match all model architectures — e.g., Gemma4 uses `layers.N.experts.*`, which is missed by these patterns, leaving experts unquantized without any warning.
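A minimal sketch of why glob-style patterns can miss experts. The weight names below are hypothetical, and real quantization configs may apply patterns with logic other than `fnmatch`; this only illustrates the mismatch:

```python
from fnmatch import fnmatch

# Hypothetical weight names: one Mixtral-style, one with experts nested
# directly under the layer (the Gemma4-style naming described above).
names = [
    "model.layers.0.block_sparse_moe.experts.0.w1.weight",
    "model.layers.0.experts.0.gate_proj.weight",
]
patterns = ["*mlp*", "*block_sparse_moe*"]

for name in names:
    hit = any(fnmatch(name, p) for p in patterns)
    # The second name matches neither pattern, so it would be skipped.
    print(f"{name}: {'matched' if hit else 'MISSED'}")
```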
**Check 1**: Compare exported weight names against the `hf_quant_config.json` exclude list. If large parameter groups (especially expert/MoE weights) have no corresponding scale params (`weight_scale`, `input_scale`) AND are not in `exclude_modules`, they were silently skipped:

```bash
python3 -c "
import json
idx = json.load(open('<output>/model.safetensors.index.json'))
cfg = json.load(open('<output>/hf_quant_config.json'))
excludes = cfg['quantization']['exclude_modules']

# Find weights without scales (potential unquantized layers)
all_keys = set(idx['weight_map'].keys())
base_weights = {k for k in all_keys if not any(s in k for s in ['scale', 'norm', 'layernorm', 'embed'])}

for w in sorted(base_weights):
    w_base = w.rsplit('.', 1)[0]  # strip .weight suffix
    if not any(s in all_keys for s in [f'{w_base}.weight_scale', f'{w_base}.input_scale', f'{w}.weight_scale']):
        if not any(p.replace('*', '') in w for p in excludes):
            print(f'WARNING: {w} has no scale params and is not in exclude list')
"
```
If warnings appear for expert/MoE layers, the quantization patterns missed them. Fix by either:
- Re-running PTQ with a custom config adding the missing pattern (e.g., `"*.experts.*"`)
- If the model already has a fix PR (check `modelopt/torch/quantization/plugins/huggingface.py`), updating ModelOpt and re-running
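The logic of Check 1 can be sketched as a self-contained function on toy data (the function name and the toy index below are illustrative, not part of the skill):

```python
def find_unquantized(weight_map_keys, exclude_modules):
    """Flag base weights that have no scale params and are not excluded.

    Mirrors the Check 1 shell one-liner, but on in-memory data.
    """
    skip = ('scale', 'norm', 'layernorm', 'embed')
    keys = set(weight_map_keys)
    flagged = []
    for w in sorted(k for k in keys if not any(s in k for s in skip)):
        base = w.rsplit('.', 1)[0]  # strip .weight suffix
        if any(c in keys for c in (f'{base}.weight_scale',
                                   f'{base}.input_scale',
                                   f'{w}.weight_scale')):
            continue  # quantized: scale param present
        if any(p.replace('*', '') in w for p in exclude_modules):
            continue  # intentionally unquantized
        flagged.append(w)
    return flagged

# Toy index: attention is quantized, the expert weight has no scales.
toy_keys = [
    'model.layers.0.self_attn.q_proj.weight',
    'model.layers.0.self_attn.q_proj.weight_scale',
    'model.layers.0.experts.0.gate_proj.weight',
    'lm_head.weight',
]
print(find_unquantized(toy_keys, ['lm_head']))
# → ['model.layers.0.experts.0.gate_proj.weight']
```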
**Check 2**: For models with fused 3D expert tensors, verify the export added them to `exclude_modules` if they weren't quantized. Missing exclude entries cause deployment failures (vLLM/SGLang try to load them as quantized). See `deployment/references/unsupported-models.md` for the "quantized/unquantized layer confusion" pattern.
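A minimal illustration of this failure mode, assuming exclude patterns are glob-style (the fused tensor name and the `is_excluded` helper are hypothetical):

```python
from fnmatch import fnmatch

def is_excluded(name, exclude_modules):
    """Return True if a weight name is covered by an exclude pattern,
    i.e. the deployment loader should treat it as unquantized."""
    return any(fnmatch(name, p) or fnmatch(name, p + '*')
               for p in exclude_modules)

# Hypothetical fused 3D expert tensor left in BF16 by the exporter:
fused = 'model.layers.0.experts.gate_up_proj'
print(is_excluded(fused, ['lm_head', '*experts*']))  # → True: loaded as BF16
print(is_excluded(fused, ['lm_head']))               # → False: loader tries to
                                                     # read it as quantized
```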
## Key API Rules

- `mtq.register()` classes **must** define `_setup()` and call it from `__init__`

.claude/skills/ptq/references/unsupported-models.md

Lines changed: 1 addition & 0 deletions
@@ -347,4 +347,5 @@ tokenizer.save_pretrained(output_path)

- **Check quantizer summary**: `mtq.print_quant_summary(model)` shows which quantizers are enabled/disabled
- **Inspect dtypes**: After loading, iterate `model.named_parameters()` and check for unexpected FP8 tensors
- **Watch for silent disabling**: A misconfigured wildcard pattern can silently disable quantizers — always verify the summary
- **Verify MoE experts are quantized**: For MoE models, check whether the exported checkpoint has scale params for expert weights. If experts use non-standard naming (e.g., Gemma4's `layers.N.experts.*` instead of `mlp.experts.*` or `block_sparse_moe.*`), the quantization config patterns may silently miss them. Compare checkpoint weight names against the `hf_quant_config.json` exclude list — if experts have no scales and aren't excluded, they were skipped. This causes deployment failures (vLLM/SGLang try to load them as quantized). Fix by adding the missing pattern to the config or checking for a ModelOpt plugin for the model (e.g., `modelopt/torch/quantization/plugins/huggingface.py`)
- **Read pip errors carefully**: `ResolutionImpossible` means dependency conflict (try `--no-deps`), NOT network failure. Check for `Connection refused`/`Name resolution failed` before concluding network is down
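The dtype-inspection tip above can be sketched as follows. Mock parameters stand in for a real model here; with a loaded HF model you would pass `model.named_parameters()` instead, and the dtype strings shown are illustrative:

```python
from collections import Counter
from types import SimpleNamespace

def dtype_summary(named_params):
    """Count parameter dtypes for expert vs. non-expert weights.

    After FP8/NVFP4 PTQ, experts still reporting bf16 indicate that the
    quantization patterns missed them.
    """
    counts = {'experts': Counter(), 'other': Counter()}
    for name, p in named_params:
        bucket = 'experts' if '.experts.' in name else 'other'
        counts[bucket][str(p.dtype)] += 1
    return counts

# Mock parameters standing in for model.named_parameters():
fake_params = [
    ('model.layers.0.self_attn.q_proj.weight', SimpleNamespace(dtype='float8_e4m3fn')),
    ('model.layers.0.experts.0.gate_proj.weight', SimpleNamespace(dtype='bfloat16')),
]
summary = dtype_summary(fake_params)
print(summary)  # experts left in bfloat16 while attention is FP8 → missed patterns
```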
