Commit 8bdcf91
[OMNIML-3707] Model-specific PTQ recipes bootstrap (#1506)
### What does this PR do?
Type of change: new feature
Replaces the hardcoded model-type branches in `examples/llm_ptq/` with
opt-in declarative **model-specific recipes** under
`modelopt_recipes/huggingface/<model_type>/ptq/`. Any adjustment
specific to a model type or instance must live in that model's recipe —
there is no implicit model-specific path anymore. Users select a model's
recipe with `--recipe huggingface/<model_type>/ptq/<recipe>`; users on
the plain `--qformat` path get only the generic numerics.
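A minimal sketch of the naming convention follows. It assumes that a `--recipe` value is resolved relative to the `modelopt_recipes/` root and that the loader appends `.yaml`, which the layout and usage in this PR suggest but do not spell out. `recipe_id` and `recipe_file` are hypothetical helpers, not part of `hf_ptq.py`. Instance-specific folders such as `step3p5/Step3.5-Flash/ptq/` add one more directory level and are not covered.

```python
from pathlib import Path

# Assumption: --recipe values are relative to modelopt_recipes/ and the loader
# appends ".yaml". Both helpers below are hypothetical illustrations.
RECIPES_ROOT = Path("modelopt_recipes")


def recipe_id(model_type: str, recipe: str) -> str:
    """Build a --recipe value from a HuggingFace model_type and a recipe name."""
    return f"huggingface/{model_type}/ptq/{recipe}"


def recipe_file(recipe_arg: str, root: Path = RECIPES_ROOT) -> Path:
    """Map a --recipe value to the YAML file it is assumed to name."""
    return root / f"{recipe_arg}.yaml"


# model_type is read from the checkpoint's config.json, e.g. {"model_type": "gemma", ...}
config = {"model_type": "gemma"}
arg = recipe_id(config["model_type"], "w4a8_awq-kv_fp8_cast")
print(arg)                          # huggingface/gemma/ptq/w4a8_awq-kv_fp8_cast
print(recipe_file(arg).as_posix())  # modelopt_recipes/huggingface/gemma/ptq/w4a8_awq-kv_fp8_cast.yaml
```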
What moved out of Python
(`examples/llm_ptq/example_utils.py::build_quant_cfg` and
`examples/llm_ptq/hf_ptq.py::mono_quantize`):
- **gemma / mpt** `w4a8_awq` → `awq_lite` with `alpha_step=1` (coarser
search to avoid TRT-LLM overflow).
- **gemma** `int8_sq` → SmoothQuant `alpha=0.5` (default `1.0` regresses
Gemma 7B).
- **phi4mm** → disable `*speech*`, `*audio*`, `*image*`, `*vision*`
(quantize only the language model).
- **Nemotron VL** → disable `*vision*`, `*image*`, `*radio*`,
`*visual*`, `*encoder*`, `*model_encoder*` (quantize only the decoder).
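To make the exclusion lists concrete, the sketch below checks which names a set of wildcard patterns would disable. It assumes the patterns use shell-style wildcards (`*` matches any run of characters, including dots) matched against full module or quantizer names. That is how the patterns read, but it is an assumption here. The module names are hypothetical.

```python
from fnmatch import fnmatchcase

# Exclusion patterns listed above, now carried by each model's recipe.
EXCLUDE = {
    "phi4mm": ["*speech*", "*audio*", "*image*", "*vision*"],
    "nemotron_vl": ["*vision*", "*image*", "*radio*", "*visual*", "*encoder*", "*model_encoder*"],
}


def is_excluded(name: str, patterns: list[str]) -> bool:
    """True if the name matches any shell-style wildcard pattern (case-sensitive)."""
    return any(fnmatchcase(name, p) for p in patterns)


# Hypothetical module names, for illustration only.
names = [
    "language_model.model.layers.0.self_attn.q_proj",  # decoder: stays quantized
    "vision_model.encoder.layers.0.mlp.fc1",           # matches *vision* and *encoder*
    "mlp1.radio_projection",                           # matches *radio*
]
for n in names:
    print(n, is_excluded(n, EXCLUDE["nemotron_vl"]))
```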
What stayed in Python:
- MTP dynamic layer exclusion in `hf_ptq.py` (depends on
runtime-detected layer indices).
- `is_nemotron_vl(full_model)` detection itself, which still drives the
VLM calibration loop and the post-quantize `full_model` update — only
the `quant_cfg` adjustment it triggered moved into the Nemotron VL
recipe.
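The MTP exclusion stays in Python because the layer indices are only known once a specific checkpoint's config is loaded, and a static YAML pattern cannot express that. The sketch below is a hypothetical illustration of that dependency, not the code in `hf_ptq.py`. It assumes MTP layers are appended after the regular decoder stack, reads the config keys `num_hidden_layers` and `num_nextn_predict_layers`, and emits a made-up `*layers.{i}.*` pattern shape.

```python
def mtp_layer_indices(config: dict) -> list[int]:
    """MTP layer indices, assuming they follow the regular decoder layers."""
    start = config["num_hidden_layers"]
    return list(range(start, start + config.get("num_nextn_predict_layers", 0)))


def mtp_exclusion_patterns(config: dict) -> list[str]:
    """Hypothetical wildcard patterns that would disable quantizers in MTP layers."""
    # The trailing "." keeps layer 61 from also matching layer 610.
    return [f"*layers.{i}.*" for i in mtp_layer_indices(config)]


# Illustrative values: a different checkpoint yields different patterns,
# which is why this cannot be a fixed list in a recipe.
print(mtp_exclusion_patterns({"num_hidden_layers": 61, "num_nextn_predict_layers": 1}))
# ['*layers.61.*']
```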
`multinode_ptq.py` shares the same `build_quant_cfg` call site and was
updated to match the new 2/3-arg signature. Multinode users on
`--qformat` get the generic numerics; multinode has no `--recipe`
plumbing yet, so model-specific recipes are reachable only via
`hf_ptq.py`.
Already-YAML recipes that were elsewhere in the tree are relocated into
the same `huggingface/<model_type>/ptq/` layout so all model-specific
recipes live under one convention:
- **Step3.5-Flash**: moved from
  `modelopt_recipes/huggingface/step3p5/Step3.5-Flash/` to
  `huggingface/step3p5/Step3.5-Flash/ptq/` to match the `<model>/ptq/`
  convention.
- **Qwen3.5 / Qwen3.6**: moved from
  `modelopt_recipes/models/Qwen3.5-Qwen3.6/w4a16.yaml` to per-`model_type`
  folders, anchored on the HuggingFace `model_type` (verified against
  transformers 5.8.1 + HF model hub `config.json` for `Qwen/Qwen3.6-27B`,
  `Qwen/Qwen3.6-35B-A3B`, `nvidia/Qwen3.5-397B-A17B-NVFP4`):
  - `huggingface/qwen3_5/ptq/w4a16_nvfp4-fp8_attn-kv_fp8_cast.yaml`:
    dense `qwen3_5`
  - `huggingface/qwen3_5_moe/ptq/w4a16_nvfp4-fp8_attn-kv_fp8_cast.yaml`:
    `qwen3_5_moe`
  - Both wrappers `$import` the shared `quant_cfg` snippet
    `huggingface/qwen3_5/ptq/w4a16_nvfp4-fp8_attn-kv_fp8_cast.quant_cfg.yaml`
    (one source of truth; the two model_types share the same hybrid
    linear-attention + softmax-attention architecture, so the rules apply
    identically).
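The "one source of truth" property can be illustrated with a toy resolver. The real resolution is done by the recipe loader (`load_recipe()`), whose exact `$import` semantics are not described here. `SNIPPETS`, `WRAPPERS`, and `resolve` below are hypothetical in-memory stand-ins for the YAML files, and the snippet entries are placeholders rather than the actual 33 rules.

```python
import copy

SNIPPET = "huggingface/qwen3_5/ptq/w4a16_nvfp4-fp8_attn-kv_fp8_cast.quant_cfg"

# Hypothetical stand-ins for the YAML files; entries are placeholders.
SNIPPETS = {SNIPPET: ["<rule 1>", "<rule 2>"]}
WRAPPERS = {
    "qwen3_5": {"quant_cfg": {"$import": SNIPPET}},
    "qwen3_5_moe": {"quant_cfg": {"$import": SNIPPET}},
}


def resolve(node):
    """Recursively replace {"$import": name} nodes with a copy of the named snippet."""
    if isinstance(node, dict):
        if set(node) == {"$import"}:
            return resolve(copy.deepcopy(SNIPPETS[node["$import"]]))
        return {k: resolve(v) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve(v) for v in node]
    return node


dense = resolve(WRAPPERS["qwen3_5"])["quant_cfg"]
moe = resolve(WRAPPERS["qwen3_5_moe"])["quant_cfg"]
print(dense == moe, len(dense))  # True 2
```

Because both wrappers point at the same snippet, editing the snippet changes both resolved configs at once; the identical-`quant_cfg` check in the Testing section verifies the same property against the real loader.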
Full recipe layout (`modelopt_recipes/huggingface/`):
```
gemma/ptq/{w4a8_awq,int8_sq}-kv_fp8_cast.yaml
mpt/ptq/w4a8_awq-kv_fp8_cast.yaml
phi4mm/ptq/{disabled_quantizers,nvfp4-kv_fp8_cast}.yaml
nemotron_vl/ptq/{disabled_quantizers,nvfp4-kv_fp8_cast}.yaml
qwen3_5/ptq/w4a16_nvfp4-fp8_attn-kv_fp8_cast{,.quant_cfg}.yaml
qwen3_5_moe/ptq/w4a16_nvfp4-fp8_attn-kv_fp8_cast.yaml
step3p5/Step3.5-Flash/ptq/nvfp4-mlp-only.yaml
```
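The layout above uses shell-style brace groups, so `{,.quant_cfg}` denotes two files (the recipe and its shared `quant_cfg` snippet). The sketch below expands those lines into the individual file list. It handles only single-level, comma-separated groups, which is all the layout uses, and is an illustration rather than a repository tool.

```python
import itertools
import re

LAYOUT = """\
gemma/ptq/{w4a8_awq,int8_sq}-kv_fp8_cast.yaml
mpt/ptq/w4a8_awq-kv_fp8_cast.yaml
phi4mm/ptq/{disabled_quantizers,nvfp4-kv_fp8_cast}.yaml
nemotron_vl/ptq/{disabled_quantizers,nvfp4-kv_fp8_cast}.yaml
qwen3_5/ptq/w4a16_nvfp4-fp8_attn-kv_fp8_cast{,.quant_cfg}.yaml
qwen3_5_moe/ptq/w4a16_nvfp4-fp8_attn-kv_fp8_cast.yaml
step3p5/Step3.5-Flash/ptq/nvfp4-mlp-only.yaml
"""


def expand(pattern: str) -> list[str]:
    """Expand single-level {a,b} groups the way a shell would."""
    # re.split with a capturing group alternates literal text and brace contents.
    parts = re.split(r"\{([^{}]*)\}", pattern)
    choices = [[p] if i % 2 == 0 else p.split(",") for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]


files = [f for line in LAYOUT.splitlines() if line for f in expand(line)]
print(len(files))  # 11
```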
All recipes ship with FP8 KV-cache cast (`kv_fp8_cast`). For phi4mm and
nemotron_vl, `disabled_quantizers.yaml` is a multi-document list unit
that `$import`s the standard `default_disabled_quantizers` exclusions
and appends the model-specific ones. Each recipe therefore imports a
single disabled-quantizer slot instead of layering two, and no YAML is
duplicated. Each `ptq/` folder has a `README.md` describing exactly what
is model-specific.
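The list-unit composition can be sketched as a flatten step: imported entries come first, model-specific patterns follow, and the recipe consumes the result as one slot. `UNITS`, `NEMOTRON_VL_DISABLED`, and `flatten` are hypothetical stand-ins; the default exclusions are shown as placeholders because their contents are not listed in this PR, and the real loader's merge behavior may differ.

```python
# Placeholder for the shared default exclusions (contents not reproduced here).
UNITS = {"default_disabled_quantizers": ["<default pattern 1>", "<default pattern 2>"]}

# nemotron_vl/ptq/disabled_quantizers, modeled as "import the defaults, then append".
NEMOTRON_VL_DISABLED = [
    {"$import": "default_disabled_quantizers"},
    "*vision*", "*image*", "*radio*", "*visual*", "*encoder*", "*model_encoder*",
]


def flatten(unit: list) -> list:
    """Inline imported list units in place, keeping the original order."""
    out = []
    for item in unit:
        if isinstance(item, dict) and "$import" in item:
            out.extend(flatten(UNITS[item["$import"]]))
        else:
            out.append(item)
    return out


slot = flatten(NEMOTRON_VL_DISABLED)
print(len(slot))  # 8
print(slot[:2])   # ['<default pattern 1>', '<default pattern 2>']
```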
### Usage
```bash
# Gemma W4A8 AWQ with the Gemma-specific algorithm tuning + FP8 KV cache:
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path google/gemma-7b \
--recipe huggingface/gemma/ptq/w4a8_awq-kv_fp8_cast \
--export_path ./out
# Nemotron VL with vision branches excluded automatically:
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path nvidia/<nemotron-vl-model> \
--recipe huggingface/nemotron_vl/ptq/nvfp4-kv_fp8_cast \
--export_path ./out
```
### Testing
- Pre-commit recipe validator
(`tools/precommit/check_modelopt_recipes.py`) loads every new recipe via
`load_recipe()` — passes for all new YAMLs (gemma/mpt/phi4mm/nemotron_vl
recipes + phi4mm/nemotron_vl `disabled_quantizers` snippets + qwen3_5 /
qwen3_5_moe recipe wrappers + the shared
`w4a16_nvfp4-fp8_attn-kv_fp8_cast.quant_cfg` snippet + Step3.5-Flash
relocation).
- For qwen3_5 / qwen3_5_moe specifically, `load_recipe(...)` on both
wrappers produces an identical 33-entry resolved `quant_cfg`, confirming
the shared snippet is the single source of truth.
- `yamlfmt` + `markdownlint` + `bandit` + license-insertion hooks all
pass.
- No tests reference the removed `build_quant_cfg(qformat, ...,
model_type, ...)` signature; the only call sites (`hf_ptq.py`,
`multinode_ptq.py`) were updated to the new 2/3-arg form.
### Before your PR is "*Ready for review*"
- Is this change backward compatible?: ❌ — users who relied on
**automatic** model-specific quant_cfg behavior via `--qformat`
(gemma/mpt AWQ, gemma SmoothQuant, phi4mm exclusions, Nemotron VL
exclusions) now need to pass `--recipe
huggingface/<model_type>/ptq/<recipe>` to apply the model's recipe. The
flag itself is unchanged; only the implicit behavior was removed.
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ❌ — relies on the existing
pre-commit recipe validator that loads each new YAML.
- Did you update Changelog?: ✅
- Did you get Claude approval on this PR?: ❌
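For the backward-compatibility note in the checklist above, the mapping from the old implicit behavior to the recipe that now carries it can be sketched as a lookup table. The table is derived only from the recipe layout in this PR and is keyed by recipe folder name; `MIGRATION` and `migrate_args` are hypothetical. Combinations without a shipped recipe fall back to `--qformat` with generic numerics. The recipes also set FP8 KV-cache cast, so check that this matches the KV-cache setting used before.

```python
# (recipe folder / model_type, old --qformat) -> --recipe carrying the old model-specific behavior.
MIGRATION = {
    ("gemma", "w4a8_awq"): "huggingface/gemma/ptq/w4a8_awq-kv_fp8_cast",
    ("gemma", "int8_sq"): "huggingface/gemma/ptq/int8_sq-kv_fp8_cast",
    ("mpt", "w4a8_awq"): "huggingface/mpt/ptq/w4a8_awq-kv_fp8_cast",
    ("phi4mm", "nvfp4"): "huggingface/phi4mm/ptq/nvfp4-kv_fp8_cast",
    ("nemotron_vl", "nvfp4"): "huggingface/nemotron_vl/ptq/nvfp4-kv_fp8_cast",
}


def migrate_args(model_type: str, qformat: str) -> list[str]:
    """Return hf_ptq.py args: a --recipe if one ships, else the generic --qformat."""
    recipe = MIGRATION.get((model_type, qformat))
    return ["--qformat", qformat] if recipe is None else ["--recipe", recipe]


print(migrate_args("gemma", "int8_sq"))  # ['--recipe', 'huggingface/gemma/ptq/int8_sq-kv_fp8_cast']
print(migrate_args("llama", "fp8"))      # ['--qformat', 'fp8']
```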
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
  * Added many model-specific PTQ recipes (Gemma, MPT, Nemotron VL,
    Phi-4-Multimodal, Qwen3.5, Qwen3.5-MoE) and support for AWQ block-size
    and MoE calibration ratio in quantization options.
* **Documentation**
  * Expanded READMEs and changelog to document recipe locations, layout,
    and how to opt into model-specific PTQ recipes.
* **Refactor**
  * Model-specific PTQ tweaks moved to opt-in recipes; default behavior
    uses generic numerics.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

1 parent 09bef05, commit 8bdcf91
23 files changed
Lines changed: 610 additions & 168 deletions
File tree
- docs/source/guides
- examples/llm_ptq
- modelopt_recipes
  - huggingface
    - gemma/ptq
    - mpt/ptq
    - nemotron_vl/ptq
    - phi4mm/ptq
    - qwen3_5_moe/ptq
    - qwen3_5/ptq
    - step3p5/Step3.5-Flash/ptq
  - models/Qwen3.5-Qwen3.6