Commit f26b9c3
skip generate option for large models and mxfp8 (#942)
## What does this PR do?

**Type of change:** New feature

**Overview:** Adds a `--skip_generate` flag to `hf_ptq.py` that skips the pre/post-quantization generation preview calls. These calls run `model.generate()`, which crashes for very large models (500B+) that are split across GPU and CPU via `device_map="auto"` (e.g., models with Mamba/Triton kernels that cannot handle CPU-offloaded tensors).

## Usage

```
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path /path/to/model \
    --export_path /path/to/output \
    --qformat mxfp8 \
    --trust_remote_code \
    --export_fmt hf \
    --batch_size 1 \
    --skip_generate \
    --kv_cache_qformat none
```

## Testing

Tested with a 500B-parameter NemotronH hybrid Mamba/attention model on 4x GB200 GPUs. Without `--skip_generate`, the script crashes at `model.generate()` because the Mamba Triton kernels fail on CPU-offloaded tensors. With `--skip_generate`, the generation preview is skipped and quantization proceeds normally.

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes/No
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No

## Additional Information

The `--skip_generate` flag sets `generated_ids_before_ptq = None` early, which also causes the post-quantization generate to be skipped via the existing `if generated_ids_before_ptq is None: pass` guard. Combined with `--batch_size 1` (to skip the `get_max_batch_size` forward-pass probe), this eliminates all forward passes that can crash for device-map-split models.

## Summary by CodeRabbit

* **New Features**
  * Introduced `--skip_generate` CLI option to skip pre-quantization text and image generation, reducing processing time for very large models. Useful when generation previews are computationally expensive.

---------

Signed-off-by: adithyare <adithyare@nvidia.com>
Signed-off-by: Adi Renduchintala <adithya.r@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
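The flag/guard interaction described above can be sketched as follows. This is a minimal, self-contained sketch, not the real `hf_ptq.py` code: `preview_generate` and `run` are illustrative stand-ins, and the placeholder token ids stand in for real `model.generate()` output; only the `--skip_generate` flag and the `generated_ids_before_ptq is None` guard come from the PR.

```python
import argparse


def preview_generate(skip_generate: bool):
    """Return preview token ids, or None when the preview is skipped.

    Stand-in for the pre-quantization model.generate() preview; the
    token list below is a placeholder for real generated ids.
    """
    if skip_generate:
        return None
    return [101, 2023, 102]  # placeholder token ids


def run(argv: list) -> str:
    parser = argparse.ArgumentParser()
    parser.add_argument("--skip_generate", default=False, action="store_true")
    args = parser.parse_args(argv)

    # Setting the preview to None early is the whole mechanism:
    generated_ids_before_ptq = preview_generate(args.skip_generate)

    # ... quantization/calibration would run here regardless of the flag ...

    # Existing guard: a None preview also skips the post-quantization generate.
    if generated_ids_before_ptq is None:
        return "post-quantization generate skipped"
    return "post-quantization generate ran"


if __name__ == "__main__":
    print(run(["--skip_generate"]))  # → post-quantization generate skipped
```

Because the same `None` value flows through both sites, one flag disables both preview passes without a second conditional.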
1 parent ba29ad7 commit f26b9c3

1 file changed

Lines changed: 13 additions & 2 deletions

File tree

examples/llm_ptq/hf_ptq.py

```diff
@@ -690,7 +690,9 @@ def pre_quantize(
         ][0:1]

     # Generate preview before quantization
-    if model_type == "deepseek":
+    if args.skip_generate:
+        generated_ids_before_ptq = None
+    elif model_type == "deepseek":
         # DeepSeek generation may go OOM, so we skip it
         generated_ids_before_ptq = None
     elif is_nemotron_vl_model and tokenizer is not None:
@@ -703,7 +705,6 @@ def pre_quantize(
             allow_fallback=False,
         )
     else:
-        # Standard generation for non-Nemotron VL models
         generated_ids_before_ptq = full_model.generate(preview_input_ids, max_new_tokens=100)
     if model_type == "gptoss" and args.qformat == "nvfp4_mlp_only":
         print("Applying nvfp4 quantization (MoE only) for gpt-oss")
@@ -1084,6 +1085,16 @@ def parse_args() -> argparse.Namespace:
         default=True,
         action=argparse.BooleanOptionalAction,
     )
+    parser.add_argument(
+        "--skip_generate",
+        help=(
+            "Skip pre/post-quantization preview calls that invoke model.generate(). "
+            "Note: this does not skip calibration or batch-size probing. "
+            "For very large models, pair with --batch_size 1 to avoid max-batch probing."
+        ),
+        default=False,
+        action="store_true",
+    )
     parser.add_argument(
         "--low_memory_mode",
         help=(
```
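For context on the two argparse styles visible in the last hunk — `store_true` for the new `--skip_generate` flag versus `argparse.BooleanOptionalAction` on the preceding option — here is a small sketch. The `--some_default_on_flag` name is hypothetical; only `--skip_generate` and its `store_true`/`default=False` setup come from the diff.

```python
import argparse

parser = argparse.ArgumentParser()
# store_true (used for --skip_generate): off by default, the flag switches it
# on; no negative form is generated.
parser.add_argument("--skip_generate", default=False, action="store_true")
# BooleanOptionalAction (Python 3.9+): also auto-generates a --no-... negative
# form, so a default-on option can be explicitly disabled. Flag name here is
# hypothetical.
parser.add_argument(
    "--some_default_on_flag", default=True, action=argparse.BooleanOptionalAction
)

args = parser.parse_args(["--skip_generate", "--no-some_default_on_flag"])
print(args.skip_generate, args.some_default_on_flag)  # → True False
```

`store_true` fits a one-way opt-in like `--skip_generate`, while `BooleanOptionalAction` suits options whose default is `True` and must be switch-off-able from the command line.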
