Commit 9fcb771

Add post-quantization validation for MoE models to PTQ skill

PTQ can silently skip MoE expert quantization when config patterns (*mlp*, *block_sparse_moe*) don't match the model's naming convention (e.g., Gemma4 uses layers.N.experts.* instead of mlp.experts.*). This causes deployment failures downstream when vLLM/SGLang tries to load unquantized experts as quantized.

Add Step 5 validation to detect this:

- Compare exported weight names against scale params and the exclude list
- Flag weights with no scales that aren't in exclude_modules
- Reference the deployment "quant/unquant layer confusion" pattern

Also add MoE expert verification to unsupported-models.md debugging tips.

Learned from: Gemma4-26B-A4B NVFP4 PTQ succeeded but experts were BF16, causing a vLLM FusedMoE shape mismatch at deployment time. Fix tracked in PR #1219.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent e8775a6 commit 9fcb771

File tree

2 files changed: +74 −0 lines changed

.claude/skills/ptq/SKILL.md

Lines changed: 73 additions & 0 deletions

@@ -113,6 +113,79 @@ ls -lh <output_path>/

Report the path and size to the user.

### Post-quantization validation

Verify that the exported checkpoint's quantization pattern matches the recipe used. The quantization config patterns can silently miss layers when a model uses non-standard module naming; the mismatch only surfaces later, as deployment failures.

**What to check**: the recipe defines which layers should be quantized. For example:

- `nvfp4_mlp_only`: all MLP layers (including MoE experts) quantized, attention layers excluded
- `nvfp4_default`: all linear layers quantized
- `fp8`: all linear layers quantized to FP8
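The validation script in this step reads `exclude_modules` from the checkpoint's `hf_quant_config.json`. As a rough sketch of the relevant part of that file (the `quant_algo` value and the exact exclude patterns below are illustrative, not taken from a real export):

```json
{
  "quantization": {
    "quant_algo": "NVFP4",
    "exclude_modules": ["lm_head", "*router*"]
  }
}
```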
124+
125+
**Run the validation script** against the exported checkpoint:
126+
127+
```bash
128+
python3 -c "
129+
import json, re, fnmatch
130+
131+
output = '<output_path>'
132+
idx = json.load(open(f'{output}/model.safetensors.index.json'))
133+
cfg = json.load(open(f'{output}/hf_quant_config.json'))
134+
excludes = cfg['quantization']['exclude_modules']
135+
136+
all_keys = set(idx['weight_map'].keys())
137+
# Identify linear weight params (skip norms, embeddings, scalars, scales)
138+
skip_suffixes = ('_scale', '_scale_2', 'layernorm', 'layer_norm', 'norm.weight', 'embed', 'scalar')
139+
linear_weights = sorted(k for k in all_keys
140+
if k.endswith('.weight') and not any(s in k.lower() for s in skip_suffixes))
141+
142+
# Check which have quantization scales
143+
quantized, excluded, unexpected = [], [], []
144+
for w in linear_weights:
145+
base = w.rsplit('.weight', 1)[0]
146+
has_scales = any(f'{base}.{s}' in all_keys for s in ['weight_scale', 'input_scale'])
147+
is_excluded = any(fnmatch.fnmatch(w, p) or fnmatch.fnmatch(base, p) for p in excludes)
148+
149+
if has_scales:
150+
quantized.append(w)
151+
elif is_excluded:
152+
excluded.append(w)
153+
else:
154+
unexpected.append(w)
155+
156+
print(f'Quantized layers: {len(quantized)}')
157+
print(f'Excluded layers (in exclude_modules): {len(excluded)}')
158+
if unexpected:
159+
print(f'\\nWARNING: {len(unexpected)} layers have NO scales and are NOT in exclude list:')
160+
# Group by module type for readability
161+
groups = {}
162+
for w in unexpected:
163+
parts = w.split('.')
164+
# Extract module type (e.g., 'self_attn', 'mlp', 'experts', 'router')
165+
module_type = next((p for p in parts if p in ('self_attn', 'mlp', 'experts', 'router', 'lm_head', 'embed_tokens')), 'other')
166+
groups.setdefault(module_type, []).append(w)
167+
for mtype, weights in sorted(groups.items()):
168+
print(f' {mtype}: {len(weights)} weights (e.g., {weights[0]})')
169+
print()
170+
print('These layers were silently skipped during quantization.')
171+
print('Likely cause: quantization config patterns did not match these module names.')
172+
print('This WILL cause deployment failures (framework loads them as quantized but they are BF16).')
173+
print('Fix: add missing patterns to the config, or add to exclude_modules if intentionally unquantized.')
174+
else:
175+
print('\\nAll layers are either quantized or explicitly excluded. Checkpoint is consistent.')
176+
"
177+
```
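To see what the scale-pairing check actually classifies, here is the same logic run on a tiny synthetic weight map (the module names are made up for illustration): a weight with a matching `weight_scale` counts as quantized, `lm_head` matches the exclude list, and an expert weight with neither is flagged.

```python
import fnmatch

# Synthetic weight map: one quantized layer, one excluded layer,
# and one MoE expert weight that was silently skipped.
all_keys = {
    'model.layers.0.self_attn.q_proj.weight',
    'model.layers.0.self_attn.q_proj.weight_scale',
    'model.layers.0.experts.3.gate_proj.weight',  # no scales anywhere
    'lm_head.weight',                             # unquantized on purpose
}
excludes = ['lm_head*']

quantized, excluded, unexpected = [], [], []
for w in sorted(k for k in all_keys if k.endswith('.weight')):
    base = w.rsplit('.weight', 1)[0]
    has_scales = any(f'{base}.{s}' in all_keys for s in ['weight_scale', 'input_scale'])
    is_excluded = any(fnmatch.fnmatch(w, p) or fnmatch.fnmatch(base, p) for p in excludes)
    if has_scales:
        quantized.append(w)
    elif is_excluded:
        excluded.append(w)
    else:
        unexpected.append(w)

print(quantized)   # the q_proj weight: has a weight_scale, so quantized
print(excluded)    # lm_head.weight: matches 'lm_head*' in excludes
print(unexpected)  # the expert weight: no scales, not excluded -> flagged
```

Only the `unexpected` bucket indicates a problem; a healthy checkpoint leaves it empty.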
178+
179+
**Common pattern gaps** (layers silently skipped):
180+
181+
| Model | Module path | Missed by | Fix |
182+
|-------|-------------|-----------|-----|
183+
| Gemma4 MoE | `layers.N.experts.*` | `*mlp*`, `*block_sparse_moe*` | Add `*.experts.*` (PR #1219) |
184+
| Custom MoE | `layers.N.moe_block.experts.*` | `*mlp*` | Add matching pattern |
185+
| VLM projector | `multi_modal_projector.*` | usually excluded | Verify in exclude list |
186+
187+
**If warnings appear**: either the layers should have been quantized (fix the config pattern and re-run PTQ), or they are intentionally unquantized (add them to `exclude_modules` in the checkpoint's `hf_quant_config.json` and `config.json` to prevent deployment failures).
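The Gemma4 row in the table can be reproduced directly with `fnmatch`, the same matcher the validation script uses (the module name below is illustrative of Gemma4-style naming, not copied from a real checkpoint):

```python
import fnmatch

# Gemma4-style expert path: note there is no 'mlp' segment in the name.
name = 'model.layers.0.experts.3.down_proj'

# The stock patterns never match, so the expert is silently skipped...
print(fnmatch.fnmatch(name, '*mlp*'))               # False
print(fnmatch.fnmatch(name, '*block_sparse_moe*'))  # False

# ...while an explicit experts pattern catches it.
print(fnmatch.fnmatch(name, '*.experts.*'))         # True
```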

## Key API Rules

- `mtq.register()` classes **must** define `_setup()` and call it from `__init__`

.claude/skills/ptq/references/unsupported-models.md

Lines changed: 1 addition & 0 deletions
@@ -347,4 +347,5 @@ tokenizer.save_pretrained(output_path)

- **Check quantizer summary**: `mtq.print_quant_summary(model)` shows which quantizers are enabled/disabled
- **Inspect dtypes**: after loading, iterate `model.named_parameters()` and check for unexpected FP8 tensors
- **Watch for silent disabling**: a misconfigured wildcard pattern can silently disable quantizers; always verify the summary
- **Validate the quantization pattern after export**: run the validation script from SKILL.md Step 5 on the exported checkpoint. It checks that every linear layer is either quantized (has scale params) or explicitly excluded. Layers that are neither were silently skipped, which is common for models with non-standard naming (e.g., Gemma4 `experts.*` missed by `*mlp*` patterns) and causes deployment failures when the framework tries to load BF16 weights as quantized
- **Read pip errors carefully**: `ResolutionImpossible` means a dependency conflict (try `--no-deps`), NOT a network failure. Check for `Connection refused`/`Name resolution failed` before concluding the network is down
