
Commit b14ed54

Add post-quantization validation for MoE models to PTQ skill
PTQ can silently skip MoE expert quantization when config patterns (*mlp*, *block_sparse_moe*) don't match the model's naming convention (e.g., Gemma4 uses layers.N.experts.* instead of mlp.experts.*). This causes deployment failures downstream when vLLM/SGLang tries to load unquantized experts as quantized.

Add Step 5 validation to detect this:
- Compare exported weight names against scale params and exclude list
- Flag weights with no scales that aren't in exclude_modules
- Reference the deployment "quant/unquant layer confusion" pattern

Also add MoE expert verification to unsupported-models.md debugging tips.

Learned from: Gemma4-26B-A4B NVFP4 PTQ succeeded but experts were BF16, causing vLLM FusedMoE shape mismatch at deployment time. Fix tracked in PR #1219.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent da41d79 commit b14ed54

File tree

2 files changed: +33 -0 lines changed

.claude/skills/ptq/SKILL.md

Lines changed: 32 additions & 0 deletions
@@ -113,6 +113,38 @@ ls -lh <output_path>/

Report the path and size to the user.

### Post-quantization validation (MoE models)

For MoE models, verify that expert layers were actually quantized. The quantization config patterns (`*mlp*`, `*block_sparse_moe*`) may not match all model architectures — e.g., Gemma4 uses `layers.N.experts.*`, which is missed by these patterns, leaving experts unquantized without any warning.
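A minimal sketch of why glob-style patterns can miss experts. The weight names below are hypothetical, and real quantization configs may apply patterns with logic other than `fnmatch`; this only illustrates the mismatch:

```python
from fnmatch import fnmatch

# Hypothetical weight names: one Mixtral-style, one with experts nested
# directly under the layer (the Gemma4-style naming described above).
names = [
    "model.layers.0.block_sparse_moe.experts.0.w1.weight",
    "model.layers.0.experts.0.gate_proj.weight",
]
patterns = ["*mlp*", "*block_sparse_moe*"]

for name in names:
    hit = any(fnmatch(name, p) for p in patterns)
    # The second name matches neither pattern, so it would be skipped.
    print(f"{name}: {'matched' if hit else 'MISSED'}")
```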
**Check 1**: Compare exported weight names against the `hf_quant_config.json` exclude list. If large parameter groups (especially expert/MoE weights) have no corresponding scale params (`weight_scale`, `input_scale`) AND are not in `exclude_modules`, they were silently skipped:

```bash
python3 -c "
import json
idx = json.load(open('<output>/model.safetensors.index.json'))
cfg = json.load(open('<output>/hf_quant_config.json'))
excludes = cfg['quantization']['exclude_modules']

# Find weights without scales (potential unquantized layers)
all_keys = set(idx['weight_map'].keys())
base_weights = {k for k in all_keys if not any(s in k for s in ['scale', 'norm', 'layernorm', 'embed'])}

for w in sorted(base_weights):
    w_base = w.rsplit('.', 1)[0]  # strip .weight suffix
    if not any(s in all_keys for s in [f'{w_base}.weight_scale', f'{w_base}.input_scale', f'{w}.weight_scale']):
        if not any(p.replace('*', '') in w for p in excludes):
            print(f'WARNING: {w} has no scale params and is not in exclude list')
"
```
If warnings appear for expert/MoE layers, the quantization patterns missed them. Fix by either:
- Re-running PTQ with a custom config adding the missing pattern (e.g., `"*.experts.*"`)
- If the model already has a fix PR (check `modelopt/torch/quantization/plugins/huggingface.py`), updating ModelOpt and re-running
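The logic of Check 1 can be sketched as a self-contained function on toy data (the function name and the toy index below are illustrative, not part of the skill):

```python
def find_unquantized(weight_map_keys, exclude_modules):
    """Flag base weights that have no scale params and are not excluded.

    Mirrors the Check 1 shell one-liner, but on in-memory data.
    """
    skip = ('scale', 'norm', 'layernorm', 'embed')
    keys = set(weight_map_keys)
    flagged = []
    for w in sorted(k for k in keys if not any(s in k for s in skip)):
        base = w.rsplit('.', 1)[0]  # strip .weight suffix
        if any(c in keys for c in (f'{base}.weight_scale',
                                   f'{base}.input_scale',
                                   f'{w}.weight_scale')):
            continue  # quantized: scale param present
        if any(p.replace('*', '') in w for p in exclude_modules):
            continue  # intentionally unquantized
        flagged.append(w)
    return flagged

# Toy index: attention is quantized, the expert weight has no scales.
toy_keys = [
    'model.layers.0.self_attn.q_proj.weight',
    'model.layers.0.self_attn.q_proj.weight_scale',
    'model.layers.0.experts.0.gate_proj.weight',
    'lm_head.weight',
]
print(find_unquantized(toy_keys, ['lm_head']))
# → ['model.layers.0.experts.0.gate_proj.weight']
```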
**Check 2**: For models with fused 3D expert tensors, verify the export added them to `exclude_modules` if they weren't quantized. Missing exclude entries cause deployment failures (vLLM/SGLang try to load them as quantized). See `deployment/references/unsupported-models.md` for the "quantized/unquantized layer confusion" pattern.
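A minimal illustration of this failure mode, assuming exclude patterns are glob-style (the fused tensor name and the `is_excluded` helper are hypothetical):

```python
from fnmatch import fnmatch

def is_excluded(name, exclude_modules):
    """Return True if a weight name is covered by an exclude pattern,
    i.e. the deployment loader should treat it as unquantized."""
    return any(fnmatch(name, p) or fnmatch(name, p + '*')
               for p in exclude_modules)

# Hypothetical fused 3D expert tensor left in BF16 by the exporter:
fused = 'model.layers.0.experts.gate_up_proj'
print(is_excluded(fused, ['lm_head', '*experts*']))  # → True: loaded as BF16
print(is_excluded(fused, ['lm_head']))               # → False: loader tries to
                                                     # read it as quantized
```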
## Key API Rules

- `mtq.register()` classes **must** define `_setup()` and call it from `__init__`

.claude/skills/ptq/references/unsupported-models.md

Lines changed: 1 addition & 0 deletions
@@ -347,4 +347,5 @@ tokenizer.save_pretrained(output_path)

- **Check quantizer summary**: `mtq.print_quant_summary(model)` shows which quantizers are enabled/disabled
- **Inspect dtypes**: After loading, iterate `model.named_parameters()` and check for unexpected FP8 tensors
- **Watch for silent disabling**: A misconfigured wildcard pattern can silently disable quantizers — always verify the summary
- **Verify MoE experts are quantized**: For MoE models, check whether the exported checkpoint has scale params for expert weights. If experts use non-standard naming (e.g., Gemma4's `layers.N.experts.*` instead of `mlp.experts.*` or `block_sparse_moe.*`), the quantization config patterns may silently miss them. Compare checkpoint weight names against the `hf_quant_config.json` exclude list — if experts have no scales and aren't excluded, they were skipped. This causes deployment failures (vLLM/SGLang try to load them as quantized). Fix by adding the missing pattern to the config or checking for a ModelOpt plugin for the model (e.g., `modelopt/torch/quantization/plugins/huggingface.py`)
- **Read pip errors carefully**: `ResolutionImpossible` means dependency conflict (try `--no-deps`), NOT network failure. Check for `Connection refused`/`Name resolution failed` before concluding network is down
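The dtype-inspection tip above can be sketched as follows. Mock parameters stand in for a real model here; with a loaded HF model you would pass `model.named_parameters()` instead, and the dtype strings shown are illustrative:

```python
from collections import Counter
from types import SimpleNamespace

def dtype_summary(named_params):
    """Count parameter dtypes for expert vs. non-expert weights.

    After FP8/NVFP4 PTQ, experts still reporting bf16 indicate that the
    quantization patterns missed them.
    """
    counts = {'experts': Counter(), 'other': Counter()}
    for name, p in named_params:
        bucket = 'experts' if '.experts.' in name else 'other'
        counts[bucket][str(p.dtype)] += 1
    return counts

# Mock parameters standing in for model.named_parameters():
fake_params = [
    ('model.layers.0.self_attn.q_proj.weight', SimpleNamespace(dtype='float8_e4m3fn')),
    ('model.layers.0.experts.0.gate_proj.weight', SimpleNamespace(dtype='bfloat16')),
]
summary = dtype_summary(fake_params)
print(summary)  # experts left in bfloat16 while attention is FP8 → missed patterns
```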
