Commit 9fcb771

Add post-quantization validation for MoE models to PTQ skill

PTQ can silently skip MoE expert quantization when config patterns (*mlp*, *block_sparse_moe*) don't match the model's naming convention (e.g., Gemma4 uses layers.N.experts.* instead of mlp.experts.*). This causes deployment failures downstream when vLLM/SGLang tries to load unquantized experts as quantized.

Add Step 5 validation to detect this:

- Compare exported weight names against scale params and the exclude list
- Flag weights with no scales that aren't in exclude_modules
- Reference the deployment "quant/unquant layer confusion" pattern

Also add MoE expert verification to unsupported-models.md debugging tips.

Learned from: Gemma4-26B-A4B NVFP4 PTQ succeeded but experts were BF16, causing a vLLM FusedMoE shape mismatch at deployment time. Fix tracked in PR #1219.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent e8775a6 commit 9fcb771

File tree

2 files changed: +74 −0 lines changed

.claude/skills/ptq/SKILL.md

Lines changed: 73 additions & 0 deletions

@@ -113,6 +113,79 @@ ls -lh <output_path>/

Report the path and size to the user.

### Post-quantization validation

Verify that the exported checkpoint's quantization pattern matches the recipe used. The quantization config patterns can silently miss layers when a model uses non-standard module naming; the mismatch only surfaces later, as deployment failures.

**What to check**: the recipe defines which layers should be quantized. For example:

- `nvfp4_mlp_only`: all MLP layers (including MoE experts) quantized, attention layers excluded
- `nvfp4_default`: all linear layers quantized
- `fp8`: all linear layers quantized to FP8
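The validation script in this step reads `exclude_modules` from the checkpoint's `hf_quant_config.json`. As a rough sketch of the relevant part of that file (the `quant_algo` value and the exact exclude patterns below are illustrative, not taken from a real export):

```json
{
  "quantization": {
    "quant_algo": "NVFP4",
    "exclude_modules": ["lm_head", "*router*"]
  }
}
```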
124+
125+
**Run the validation script** against the exported checkpoint:
126+
127+
```bash
128+
python3 -c "
129+
import json, re, fnmatch
130+
131+
output = '<output_path>'
132+
idx = json.load(open(f'{output}/model.safetensors.index.json'))
133+
cfg = json.load(open(f'{output}/hf_quant_config.json'))
134+
excludes = cfg['quantization']['exclude_modules']
135+
136+
all_keys = set(idx['weight_map'].keys())
137+
# Identify linear weight params (skip norms, embeddings, scalars, scales)
138+
skip_suffixes = ('_scale', '_scale_2', 'layernorm', 'layer_norm', 'norm.weight', 'embed', 'scalar')
139+
linear_weights = sorted(k for k in all_keys
140+
if k.endswith('.weight') and not any(s in k.lower() for s in skip_suffixes))
141+
142+
# Check which have quantization scales
143+
quantized, excluded, unexpected = [], [], []
144+
for w in linear_weights:
145+
base = w.rsplit('.weight', 1)[0]
146+
has_scales = any(f'{base}.{s}' in all_keys for s in ['weight_scale', 'input_scale'])
147+
is_excluded = any(fnmatch.fnmatch(w, p) or fnmatch.fnmatch(base, p) for p in excludes)
148+
149+
if has_scales:
150+
quantized.append(w)
151+
elif is_excluded:
152+
excluded.append(w)
153+
else:
154+
unexpected.append(w)
155+
156+
print(f'Quantized layers: {len(quantized)}')
157+
print(f'Excluded layers (in exclude_modules): {len(excluded)}')
158+
if unexpected:
159+
print(f'\\nWARNING: {len(unexpected)} layers have NO scales and are NOT in exclude list:')
160+
# Group by module type for readability
161+
groups = {}
162+
for w in unexpected:
163+
parts = w.split('.')
164+
# Extract module type (e.g., 'self_attn', 'mlp', 'experts', 'router')
165+
module_type = next((p for p in parts if p in ('self_attn', 'mlp', 'experts', 'router', 'lm_head', 'embed_tokens')), 'other')
166+
groups.setdefault(module_type, []).append(w)
167+
for mtype, weights in sorted(groups.items()):
168+
print(f' {mtype}: {len(weights)} weights (e.g., {weights[0]})')
169+
print()
170+
print('These layers were silently skipped during quantization.')
171+
print('Likely cause: quantization config patterns did not match these module names.')
172+
print('This WILL cause deployment failures (framework loads them as quantized but they are BF16).')
173+
print('Fix: add missing patterns to the config, or add to exclude_modules if intentionally unquantized.')
174+
else:
175+
print('\\nAll layers are either quantized or explicitly excluded. Checkpoint is consistent.')
176+
"
177+
```
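To see what the scale-pairing check actually classifies, here is the same logic run on a tiny synthetic weight map (the module names are made up for illustration): a weight with a matching `weight_scale` counts as quantized, `lm_head` matches the exclude list, and an expert weight with neither is flagged.

```python
import fnmatch

# Synthetic weight map: one quantized layer, one excluded layer,
# and one MoE expert weight that was silently skipped.
all_keys = {
    'model.layers.0.self_attn.q_proj.weight',
    'model.layers.0.self_attn.q_proj.weight_scale',
    'model.layers.0.experts.3.gate_proj.weight',  # no scales anywhere
    'lm_head.weight',                             # unquantized on purpose
}
excludes = ['lm_head*']

quantized, excluded, unexpected = [], [], []
for w in sorted(k for k in all_keys if k.endswith('.weight')):
    base = w.rsplit('.weight', 1)[0]
    has_scales = any(f'{base}.{s}' in all_keys for s in ['weight_scale', 'input_scale'])
    is_excluded = any(fnmatch.fnmatch(w, p) or fnmatch.fnmatch(base, p) for p in excludes)
    if has_scales:
        quantized.append(w)
    elif is_excluded:
        excluded.append(w)
    else:
        unexpected.append(w)

print(quantized)   # the q_proj weight: has a weight_scale, so quantized
print(excluded)    # lm_head.weight: matches 'lm_head*' in excludes
print(unexpected)  # the expert weight: no scales, not excluded -> flagged
```

Only the `unexpected` bucket indicates a problem; a healthy checkpoint leaves it empty.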
178+
179+
**Common pattern gaps** (layers silently skipped):
180+
181+
| Model | Module path | Missed by | Fix |
182+
|-------|-------------|-----------|-----|
183+
| Gemma4 MoE | `layers.N.experts.*` | `*mlp*`, `*block_sparse_moe*` | Add `*.experts.*` (PR #1219) |
184+
| Custom MoE | `layers.N.moe_block.experts.*` | `*mlp*` | Add matching pattern |
185+
| VLM projector | `multi_modal_projector.*` | usually excluded | Verify in exclude list |
186+
187+
**If warnings appear**: either the layers should have been quantized (fix the config pattern and re-run PTQ), or they are intentionally unquantized (add them to `exclude_modules` in the checkpoint's `hf_quant_config.json` and `config.json` to prevent deployment failures).
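The Gemma4 row in the table can be reproduced directly with `fnmatch`, the same matcher the validation script uses (the module name below is illustrative of Gemma4-style naming, not copied from a real checkpoint):

```python
import fnmatch

# Gemma4-style expert path: note there is no 'mlp' segment in the name.
name = 'model.layers.0.experts.3.down_proj'

# The stock patterns never match, so the expert is silently skipped...
print(fnmatch.fnmatch(name, '*mlp*'))               # False
print(fnmatch.fnmatch(name, '*block_sparse_moe*'))  # False

# ...while an explicit experts pattern catches it.
print(fnmatch.fnmatch(name, '*.experts.*'))         # True
```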

## Key API Rules

- `mtq.register()` classes **must** define `_setup()` and call it from `__init__`

.claude/skills/ptq/references/unsupported-models.md

Lines changed: 1 addition & 0 deletions
@@ -347,4 +347,5 @@ tokenizer.save_pretrained(output_path)

- **Check quantizer summary**: `mtq.print_quant_summary(model)` shows which quantizers are enabled/disabled
- **Inspect dtypes**: after loading, iterate `model.named_parameters()` and check for unexpected FP8 tensors
- **Watch for silent disabling**: a misconfigured wildcard pattern can silently disable quantizers; always verify the summary
- **Validate the quantization pattern after export**: run the validation script from SKILL.md Step 5 on the exported checkpoint. It checks that every linear layer is either quantized (has scale params) or explicitly excluded. Layers that are neither were silently skipped, which is common for models with non-standard naming (e.g., Gemma4 `experts.*` missed by `*mlp*` patterns) and causes deployment failures when the framework tries to load BF16 weights as quantized
- **Read pip errors carefully**: `ResolutionImpossible` means a dependency conflict (try `--no-deps`), NOT a network failure. Check for `Connection refused`/`Name resolution failed` before concluding the network is down
