Merged
106 commits
6c038f9
Add modelopt/torch/_compress CODEOWNERS
kevalmorabia97 Oct 27, 2025
230cee1
Merge branch 'main' into feature/compress
kevalmorabia97 Oct 27, 2025
54c5f0f
Remove llm_ptq example tests from CICD
kevalmorabia97 Oct 27, 2025
9eeee25
E2E test for the experimental compress algorithm based on https://arx…
danielkorzekwa Oct 28, 2025
ad1d18e
Merge branch 'main' into feature/compress
kevalmorabia97 Oct 28, 2025
cef3655
Add convert_llama3_config_to_decilm_config + unit test (#465)
danielkorzekwa Oct 29, 2025
002b8b5
Implement nas.convert() api for the compress algorithm (#482)
danielkorzekwa Oct 31, 2025
1c12fd8
modelopt nas search() implementation for the compress algorithm (#490)
danielkorzekwa Nov 3, 2025
f7d547f
Add decilm modelling code (#505)
danielkorzekwa Nov 12, 2025
50a580c
Compress tutorial (PoC) (#492)
danielkorzekwa Nov 12, 2025
b121945
Add llama converter (no dependency on internal Nvidia code) - part 1/…
danielkorzekwa Nov 13, 2025
866e400
llama converter is self-contained now (no dependency on internal nvid…
danielkorzekwa Nov 14, 2025
0868f1c
Add integration test for attention pruning (#562)
danielkorzekwa Nov 14, 2025
69726cc
Merge branch 'main' into feature/compress
kevalmorabia97 Nov 15, 2025
07ca24d
Merge branch 'main' into feature/compress
kevalmorabia97 Nov 15, 2025
1dde209
Add score_pruning_activations (step 2/6) (#563)
danielkorzekwa Nov 18, 2025
2e559e7
Update README.md
kevalmorabia97 Nov 18, 2025
f10be0d
Add activation hooks used for pruning (#576)
danielkorzekwa Nov 20, 2025
194b532
Add sewing kit and utilities used for pruning scoring - pruning scori…
danielkorzekwa Nov 24, 2025
8c9cdd4
Add L2NormHook and use it in megatron.py (#599)
danielkorzekwa Nov 26, 2025
1f72466
Add pruning checkpoints for the compress algorithm (#607)
danielkorzekwa Nov 27, 2025
97fe7f0
Add build replacement library to the compress algorithm. (#616)
danielkorzekwa Dec 1, 2025
954103e
Add subblock stats to the compress algorithm (#623)
danielkorzekwa Dec 1, 2025
dcc425f
Add 1-block scoring to the compress algorithm (#625)
danielkorzekwa Dec 2, 2025
56d95de
Add checkpoint save/load to ForwardHook + add IterativeChannelContrib…
danielkorzekwa Dec 2, 2025
74aae83
Add MIP step to the compress algorithm (#627)
danielkorzekwa Dec 4, 2025
a1f63bc
Merge branch 'main' into feature/compress
kevalmorabia97 Dec 8, 2025
a99f503
Remove unused mip functions + fix multi-gpu test (#660)
kevalmorabia97 Dec 8, 2025
67489f4
Fix a bug in IterativeChannelContributionHook + tools for activation …
danielkorzekwa Dec 11, 2025
1d8bd20
Remove runtime.py and directly use torch dist utils + remove unused f…
kevalmorabia97 Dec 11, 2025
f7a0cb0
Use shared activation hooks component in the puzzle algorithm (#687)
danielkorzekwa Dec 17, 2025
db866d9
Clean up Puzzle Compress Tutorial (#711)
LianaMikael Dec 22, 2025
2e813bf
Two bug fixes: mix checkpointing and dtype (#718)
danielkorzekwa Dec 22, 2025
83ac3b1
Merge remote-tracking branch 'origin/main' into feature/compress
kevalmorabia97 Jan 13, 2026
0eecfc6
Fix test assertions for 2-gpu (#772)
kevalmorabia97 Jan 13, 2026
43b3cfa
Rename compress to puzzletron (#776)
kevalmorabia97 Jan 14, 2026
4c30bd5
Add NeMo Conversion Scripts to Puzzletron (#784)
LianaMikael Jan 15, 2026
96bb0ba
Merge branch 'main' into feature/compress
kevalmorabia97 Mar 3, 2026
8c84fee
[CI] Update to only run puzzletron tests
kevalmorabia97 Mar 3, 2026
5812777
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Mar 3, 2026
5f77c81
Pin torchprofile==0.0.4 to fix CI
kevalmorabia97 Mar 10, 2026
82df595
Add anymodel-core to feature/puzzletron (#974)
danielkorzekwa Mar 11, 2026
4dc9932
Draft: anymodel activation scoring (#989)
danielkorzekwa Mar 12, 2026
d358eb3
Draft: Merge anymodel pruning (#990)
danielkorzekwa Mar 12, 2026
8e827f3
Draft: Merging anymodel:build_library_and_stats (#993)
danielkorzekwa Mar 12, 2026
eb4b210
Draft: merge any model calc one block scores (#994)
danielkorzekwa Mar 12, 2026
8fe318d
Draft: merge any_model: mip_and_realize_models (#995)
danielkorzekwa Mar 13, 2026
2fbdf0e
Update uv.lock for nspect puzzletron scanning
kevalmorabia97 Mar 13, 2026
1b42f0b
Dkorzekwa/any model other models (#1007)
danielkorzekwa Mar 17, 2026
67999eb
Dkorzekwa/anymodel gptoss (#1020)
danielkorzekwa Mar 17, 2026
660dc17
Merge any_model tutorial (#1035)
danielkorzekwa Mar 19, 2026
01cba6a
Merge mbridge distillation for any_model (#1036)
danielkorzekwa Mar 20, 2026
2b6572c
MR branch for the remaining difference between dkorzekwa/any_model an…
danielkorzekwa Mar 20, 2026
110316a
Dkorzekwa/decilm hf code cleanup (#1071)
danielkorzekwa Mar 23, 2026
4190275
Dkorzekwa/decilm hf code cleanup 2 (#1073)
danielkorzekwa Mar 23, 2026
0708ca2
Dkorzekwa/anymodel subblock stats (#1085)
danielkorzekwa Mar 24, 2026
3193f30
Dkorzekwa/anymodel subblock stats nodecilm (#1102)
danielkorzekwa Mar 24, 2026
928036e
Dkorzekwa/decilm cleanup post subblockstats (#1103)
danielkorzekwa Mar 24, 2026
e508b76
code clean up (#1110)
danielkorzekwa Mar 24, 2026
f460d16
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Mar 25, 2026
2f55c73
Dkorzekwa/puzzletron use importance hooks from prune (#1115)
danielkorzekwa Mar 25, 2026
c5ec50b
Merge remote-tracking branch 'origin/main' into feature/puzzletron
kevalmorabia97 Mar 25, 2026
d257871
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Mar 30, 2026
7e15fdd
Revert CICD and other config changes
kevalmorabia97 Mar 30, 2026
d0209dc
Make Qwen and QwenVL descriptor generic so can be used for other vari…
kevalmorabia97 Mar 25, 2026
d987bad
Set strict=True in distill_hf export
kevalmorabia97 Mar 30, 2026
75651cc
add basic ruff fixes
kevalmorabia97 Mar 25, 2026
03118ce
Apply coderabbit suggestions
kevalmorabia97 Mar 30, 2026
2a170b9
Set weights_only=True in checkpoint_utils.py
kevalmorabia97 Mar 30, 2026
d6f8ddb
More fixes
kevalmorabia97 Mar 30, 2026
4621b65
reuse puzzletron tokenizer in other tests
kevalmorabia97 Mar 30, 2026
be4bd3a
disable puzzletron in coverage check as its covered in gpu tests only
kevalmorabia97 Mar 30, 2026
45426ca
Remove custom DistillationProvider and simplify mbridge distillation …
kevalmorabia97 Apr 1, 2026
5429d86
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Apr 1, 2026
41b8ca7
fix test
kevalmorabia97 Apr 2, 2026
33b9230
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Apr 7, 2026
25266b8
fix hydra config dtype resolution in puzzletron validation tools (#1202)
j-rausch Apr 8, 2026
fd5694d
Consolidate lm-eval scripts: merge AnyModel auto-detection into lm_ev…
j-rausch Apr 9, 2026
dedcad0
Merge remote-tracking branch 'origin/main' into feature/puzzletron
kevalmorabia97 Apr 10, 2026
d0cdbfd
minor cleanup
kevalmorabia97 Apr 10, 2026
ee01ace
Move block_config out of deci_lm_hf_code folder
kevalmorabia97 Apr 10, 2026
7ce3332
Fix critical bugs flagged by codeRabbit in PR #1121
kevalmorabia97 Apr 10, 2026
c7700a9
Fix critical and major bugs flagged by codeRabbit in PR #1121
kevalmorabia97 Apr 10, 2026
9f3cc2d
Fix minor bugs flagged by codeRabbit in PR #1121
kevalmorabia97 Apr 10, 2026
66fccd2
fix decoder_layer_cls failure on trust_remote_code models (#1222)
j-rausch Apr 10, 2026
7053c61
fix puzzletron container test path; add NeMo setup docs (#1231)
j-rausch Apr 10, 2026
05c6d3b
Add MoE/Nemotron fixes to support Transformers 5.5
kevalmorabia97 Apr 10, 2026
0d6eb7e
Update changelog
kevalmorabia97 Apr 10, 2026
ac8397b
Refactor puzzletron imports: relative imports, public API, logger fix
kevalmorabia97 Apr 10, 2026
977d60a
Merge remote-tracking branch 'origin/main' into feature/puzzletron
kevalmorabia97 Apr 10, 2026
01a4e55
Fix custom tiny tokenizer
kevalmorabia97 Apr 11, 2026
f361ca6
Fix `RuntimeError: pidfd_getfd: Operation not permitted`
kevalmorabia97 Apr 11, 2026
a7eedf8
Add __all__ for modules
kevalmorabia97 Apr 11, 2026
ed5fd68
Fix test_puzzletron assertions for transformers v5.5
kevalmorabia97 Apr 11, 2026
547e76d
Fix doc building
kevalmorabia97 Apr 11, 2026
62070ae
Fix Qwen2.5 test assertion as per CI machine
kevalmorabia97 Apr 13, 2026
38d9522
Address coderabbit comments
kevalmorabia97 Apr 13, 2026
6395b1e
copy custom modeling files to pruned checkpoint dirs (#1245)
j-rausch Apr 13, 2026
d88dfcb
consolidate mbridge distillation: merge distill_hf.py into distill.py…
j-rausch Apr 13, 2026
06eaf74
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Apr 13, 2026
3f41819
Address minor coderabbit comments
kevalmorabia97 Apr 13, 2026
ad8cf9a
fix lm-eval version conflict in puzzletron requirements (#1257)
j-rausch Apr 15, 2026
5e4c43e
fix hybrid model subblock param counting: all FFN sizes reported iden…
j-rausch Apr 15, 2026
2345af7
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Apr 15, 2026
e0bb89d
Fix test
kevalmorabia97 Apr 15, 2026
47a612e
Fix test path
kevalmorabia97 Apr 15, 2026
2 changes: 2 additions & 0 deletions .github/CODEOWNERS
@@ -24,6 +24,7 @@ modelopt/torch/nas @NVIDIA/modelopt-torch-nas-prune-codeowners
modelopt/torch/opt @NVIDIA/modelopt-torch-opt-codeowners
modelopt/torch/peft @NVIDIA/modelopt-torch-peft-codeowners
modelopt/torch/prune @NVIDIA/modelopt-torch-nas-prune-codeowners
modelopt/torch/puzzletron @NVIDIA/modelopt-torch-puzzletron-codeowners
modelopt/torch/quantization @NVIDIA/modelopt-torch-quantization-codeowners
modelopt/torch/sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
modelopt/torch/speculative @NVIDIA/modelopt-torch-speculative-codeowners
@@ -49,6 +50,7 @@ modelopt_recipes @NVIDIA/modelopt-recipes-codeowners
/examples/model_hub @NVIDIA/modelopt-examples-model_hub-codeowners
/examples/onnx_ptq @NVIDIA/modelopt-onnx-codeowners
/examples/pruning @NVIDIA/modelopt-torch-nas-prune-codeowners
/examples/puzzletron @NVIDIA/modelopt-torch-puzzletron-codeowners
/examples/specdec_bench @NVIDIA/modelopt-torch-speculative-codeowners
/examples/speculative_decoding @NVIDIA/modelopt-torch-speculative-codeowners
/examples/torch_onnx @NVIDIA/modelopt-onnx-codeowners
3 changes: 2 additions & 1 deletion .github/workflows/_example_tests_runner.yml
@@ -48,6 +48,7 @@ jobs:
- name: Install dependencies
run: |
# use `python -m pip` instead of `pip` to avoid conflicts with system pip for nemo containers
pip uninstall -y nvidia-modelopt
python -m pip install ".${{ inputs.pip_install_extras }}"

if [[ "${{ inputs.example }}" == *"diffusers"* ]]; then
@@ -64,7 +65,7 @@
COVERAGE_FILE: ${{ github.workspace }}/.coverage
run: |
echo "Running tests for: ${{ inputs.example }}"
pytest tests/examples/${{ inputs.example }} --cov
python -m pytest tests/examples/${{ inputs.example }} --cov
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v5
with:
4 changes: 2 additions & 2 deletions .github/workflows/example_tests.yml
@@ -132,7 +132,7 @@ jobs:
docker_image: "nvcr.io/nvidia/nemo:26.02"
example: ${{ matrix.example }}
timeout_minutes: 30
pip_install_extras: "[hf,dev-test]"
pip_install_extras: "[hf,puzzletron,dev-test]"
runner: linux-amd64-gpu-rtxpro6000-latest-1

nemo-non-pr:
@@ -144,7 +144,7 @@
docker_image: "nvcr.io/nvidia/nemo:26.02"
example: ${{ matrix.example }}
timeout_minutes: 30
pip_install_extras: "[hf,dev-test]"
pip_install_extras: "[hf,puzzletron,dev-test]"
runner: linux-amd64-gpu-rtxpro6000-latest-2

##### ONNX/TensorRT Example Tests #####
2 changes: 1 addition & 1 deletion .github/workflows/gpu_tests.yml
@@ -68,7 +68,7 @@ jobs:
matrix:
include:
- example: gpu
timeout: 45
timeout: 60
container_image: pytorch:26.01-py3
# tests/gpu/_extensions/test_onnx_extensions.py fails for newer containers until https://github.com/tbenthompson/cppimport/pull/98
- example: gpu-regression
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -94,6 +94,7 @@ repos:
modelopt/onnx/quantization/ort_patching.py|
modelopt/torch/_deploy/utils/onnx_utils.py|
modelopt/torch/export/transformer_engine.py|
modelopt/torch/puzzletron/anymodel/models/gpt_oss/gpt_oss_pruned_to_mxfp4.py|
modelopt/torch/quantization/export_onnx.py|
modelopt/torch/quantization/plugins/attention.py|
modelopt/torch/sparsity/attention_sparsity/methods/vsa_utils.py|
1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -7,6 +7,7 @@ Changelog
**New Features**

- Support full Transformer Engine spec for Minitron pruning (``mcore_minitron``). Now we no longer need to use the custom ModelOpt spec. Note that this does not affect the usage of the pruning workflow, but it makes pruning slightly faster and may result in a slightly different pruned model because of different kernels and numerics.
- Add Puzzletron - a new algorithm for heterogeneous pruning of LLM and VLM models. See `examples/puzzletron/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/puzzletron>`_ for more details.
- Added an iterator interface using CalibrationDataReader in the ONNX quantization workflow.
- Add N:M sparse softmax support to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
- Add skip-softmax skipping to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
9 changes: 9 additions & 0 deletions docs/source/conf.py
@@ -31,6 +31,7 @@
# import sys
# sys.path.insert(0, os.path.abspath('.'))

import contextlib
import os
import sys

@@ -44,6 +45,14 @@
sys.path.insert(0, os.path.abspath("../../"))
sys.path.append(os.path.abspath("./_ext"))

# Pre-import modelopt.torch so it is cached in sys.modules before Sphinx applies
# autodoc_mock_imports. Mocking triton/tensorrt_llm at the Sphinx level can break
# transitive imports (transformers, transformer_engine, …) and cause modelopt.torch
# to fail inside autosummary. Importing here — while the real packages are still on
# sys.path — avoids that problem entirely.
with contextlib.suppress(Exception):
import modelopt.torch # noqa: F401

# -- Project information -----------------------------------------------------

project = "Model Optimizer" # pylint: disable=C0103
16 changes: 16 additions & 0 deletions examples/llm_eval/README.md
@@ -40,6 +40,22 @@ accelerate launch --multi_gpu --num_processes <num_copies_of_your_model> \
--batch_size 4
```

### Heterogeneous Pruned Checkpoints (Puzzletron)

Heterogeneous pruned checkpoints produced by Puzzletron are automatically detected and loaded with the appropriate model patcher. No additional flags are needed beyond specifying the checkpoint path:

```sh
python lm_eval_hf.py --model hf \
--model_args pretrained=path/to/anymodel/checkpoint,dtype=bfloat16,parallelize=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 4
```

For a quick smoke test, add `--limit 10`.

> **Note:** Requires the `puzzletron` extra to be installed (`pip install -e ".[puzzletron]"`).

### Quantized (simulated)

- For simulated quantization with any of the default quantization formats:
53 changes: 51 additions & 2 deletions examples/llm_eval/lm_eval_hf.py
@@ -36,11 +36,19 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import contextlib
import warnings

import datasets
import lm_eval
from lm_eval import utils
from lm_eval.__main__ import cli_evaluate, parse_eval_args, setup_parser

if not lm_eval.__version__.startswith("0.4.8"):
warnings.warn(
f"lm_eval_hf.py is tested with lm-eval 0.4.8; found {lm_eval.__version__}. "
"Later versions may have incompatible API changes."
)
from lm_eval.api.model import T
from lm_eval.models.huggingface import HFLM
from quantization_utils import quantize_model
@@ -50,9 +58,29 @@
from modelopt.torch.quantization.utils import is_quantized
from modelopt.torch.sparsity.attention_sparsity.conversion import is_attn_sparsified

try:
import modelopt.torch.puzzletron as mtpz

_ANYMODEL_AVAILABLE = True
except ImportError:
_ANYMODEL_AVAILABLE = False


def _anymodel_patcher_context(pretrained, trust_remote_code=False):
"""Return a deci_x_patcher context if *pretrained* is a Puzzletron checkpoint, else a no-op."""
if not _ANYMODEL_AVAILABLE or not pretrained:
return contextlib.nullcontext()
try:
descriptor = mtpz.anymodel.resolve_descriptor_from_pretrained(
pretrained, trust_remote_code=trust_remote_code
)
except (ValueError, AttributeError):
return contextlib.nullcontext()
return mtpz.anymodel.deci_x_patcher(model_descriptor=descriptor)


def create_from_arg_obj(cls: type[T], arg_dict: dict, additional_config: dict | None = None) -> T:
"""Overrides the HFLM.create_from_arg_obj"""
"""Override HFLM.create_from_arg_obj to add quantization, sparsity, and Puzzletron support."""

quant_cfg = arg_dict.pop("quant_cfg", None)
auto_quantize_bits = arg_dict.pop("auto_quantize_bits", None)
@@ -72,7 +100,10 @@ def create_from_arg_obj(cls: type[T], arg_dict: dict, additional_config: dict |
# Enable automatic save/load of modelopt state huggingface checkpointing
mto.enable_huggingface_checkpointing()

model_obj = cls(**arg_dict, **additional_config)
with _anymodel_patcher_context(
arg_dict.get("pretrained"), arg_dict.get("trust_remote_code", False)
):
model_obj = cls(**arg_dict, **additional_config)
model_obj.tokenizer.padding_side = "left"
if is_quantized(model_obj.model):
# return if model is already quantized
@@ -109,10 +140,28 @@ def create_from_arg_obj(cls: type[T], arg_dict: dict, additional_config: dict |
return model_obj


def create_from_arg_string(
cls: type[T], arg_string: str, additional_config: dict | None = None
) -> T:
"""Override HFLM.create_from_arg_string to support Puzzletron checkpoints."""
args = utils.simple_parse_args_string(arg_string)
additional_config = {} if additional_config is None else additional_config
args2 = {k: v for k, v in additional_config.items() if v is not None}

mto.enable_huggingface_checkpointing()

with _anymodel_patcher_context(args.get("pretrained"), args.get("trust_remote_code", False)):
model_obj = cls(**args, **args2)

return model_obj


HFLM.create_from_arg_obj = classmethod(create_from_arg_obj)
HFLM.create_from_arg_string = classmethod(create_from_arg_string)


def setup_parser_with_modelopt_args():
"""Extend the lm-eval argument parser with ModelOpt quantization and sparsity options."""
parser = setup_parser()
parser.add_argument(
"--quant_cfg",
28 changes: 24 additions & 4 deletions examples/megatron_bridge/README.md
@@ -46,6 +46,9 @@ Note that the default dataset for pruning and quantization is [`nemotron-post-tr
hf auth login --token <your token>
```

> [!WARNING]
> Use `python -m pip` instead of `pip` to avoid conflicts with the system-wide installed packages in the NeMo containers.

## Pruning

This section shows how to prune a HuggingFace model using Minitron algorithm in Megatron-Bridge framework. Checkout other available pruning algorithms, supported frameworks and models, and general pruning getting-started in the [pruning README](../pruning/README.md).
@@ -92,7 +95,7 @@ This section shows how to distill a student model from a teacher model in the Me

This can be used stand-alone or after [Pruning](#pruning) / [Post-Training Quantization](#post-training-quantization) to recover accuracy of the model by distilling from the original model (teacher).

The [distill.py](distill.py) script loads student and teacher models from HuggingFace checkpoints and saves the distilled model to `<output_dir>/checkpoints` in Megatron distributed checkpoint format.
The [distill.py](distill.py) script supports both standard HuggingFace checkpoints and [Puzzletron AnyModel](../puzzletron/README.md) checkpoints as student/teacher inputs. Just pass the checkpoint path via `--student_hf_path` / `--teacher_hf_path`. The distilled model is saved to `<output_dir>/checkpoints` in Megatron distributed checkpoint format.

### Data Preparation

@@ -158,9 +161,22 @@

To run the distillation script on a Slurm cluster for multi-node training, you just need use `python` instead of `torchrun` and set the number of nodes using `#SBATCH --nodes=<num_nodes>` clause in your Slurm script.

### Convert Megatron checkpoint to Hugging Face format
### Converting to Hugging Face format (optional)

The distilled checkpoint is saved in Megatron distributed format. If you need a HuggingFace checkpoint, there are two ways to convert it:

**Inline** -- add `--hf_export_path` and `--student_hf_model` to the `distill.py` command to automatically convert the final checkpoint after distillation:

```bash
torchrun --nnodes 1 --nproc_per_node 8 distill.py \
... \
--hf_export_path /path/to/save/distilled_hf_ckpt \
--student_hf_model Qwen/Qwen3-4B
```

`--student_hf_model` should match the base architecture of the student (it is used as a template for export). For non-Puzzletron (i.e. standard) models, it should be the same as `--student_hf_path`.

To convert the Megatron checkpoint from last iteration (or any intermediate iteration) to Hugging Face format, you need the pruned model config (`--output_hf_path` from `prune_minitron.py` script) and the distilled megatron checkpoint dir (`<distill_output_dir>/checkpoints/iter_<iter_number>`) to run the following command:
**Separate conversion** -- convert any saved iteration using the Megatron-Bridge conversion script:

```bash
uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py export \
@@ -169,7 +185,11 @@ uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py ex
--hf-path <path_to_save_distilled_hf_ckpt>
```

For more details, you can refer to the checkpoint conversion scripts in the [Megatron-Bridge README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).
For more details, see the [Megatron-Bridge conversion README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).

### Distillation Results

See [results/puzzletron.md](results/puzzletron.md) for MMLU results demonstrating knowledge distillation on Puzzletron-compressed student models.

## Post-Training Quantization
