diff --git a/.claude/skills/debug/SKILL.md b/.claude/skills/debug/SKILL.md
Lines changed: 33 additions & 0 deletions
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
Lines changed: 2 additions & 0 deletions
diff --git a/.github/workflows/_example_tests_runner.yml b/.github/workflows/_example_tests_runner.yml
Lines changed: 2 additions & 1 deletion
diff --git a/.github/workflows/example_tests.yml b/.github/workflows/example_tests.yml
Lines changed: 2 additions & 2 deletions
diff --git a/.github/workflows/gpu_tests.yml b/.github/workflows/gpu_tests.yml
Lines changed: 1 addition & 1 deletion
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
Lines changed: 1 addition & 0 deletions
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
Lines changed: 2 additions & 0 deletions
diff --git a/docs/source/conf.py b/docs/source/conf.py
Lines changed: 9 additions & 0 deletions
diff --git a/docs/source/guides/9_autotune.rst b/docs/source/guides/9_autotune.rst
Lines changed: 25 additions & 0 deletions
diff --git a/examples/llm_eval/README.md b/examples/llm_eval/README.md
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,33 @@
+---
+name: debug
+description: Run commands inside a remote Docker container via the file-based command relay (tools/debugger). Use when the user says "run in Docker", "run on GPU", "debug remotely", "run test in container", "check nvidia-smi", "run pytest in Docker", or needs to execute any command inside a Docker container that shares the repo filesystem. Requires the user to have started server.sh inside the container first.
+---
+
+# Remote Docker Debugger
+
+Execute commands inside a Docker container from the host using the file-based command relay.
+
+**Read `tools/debugger/CLAUDE.md` for full usage details** — it has the protocol and examples.
+
+## Quick Reference
+
+```bash
+# Check connection
+bash tools/debugger/client.sh status
+
+# Connect to server (user must start server.sh in Docker first)
+bash tools/debugger/client.sh handshake
+
+# Run a command
+bash tools/debugger/client.sh run "<command>"
+
+# Long-running command (default timeout is 600s)
+bash tools/debugger/client.sh --timeout 1800 run "<command>"
+
+# Cancel the currently running command
+bash tools/debugger/client.sh cancel
+
+# Reconnect after server restart
+bash tools/debugger/client.sh flush
+bash tools/debugger/client.sh handshake
+```
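
For scripted use, the quick-reference commands in the hunk above can be assembled programmatically. This is only a sketch: `relay_argv` is a hypothetical helper, not part of `tools/debugger`, and it relies solely on the `run` subcommand and the `--timeout` flag shown in the skill file. It builds the argument list without executing anything.

```python
import shlex

CLIENT = "tools/debugger/client.sh"  # path taken from the skill file above


def relay_argv(command, timeout=None):
    """Build the argv for `client.sh run`, optionally with --timeout (seconds)."""
    argv = ["bash", CLIENT]
    if timeout is not None:
        # The quick reference places --timeout before the `run` subcommand.
        argv += ["--timeout", str(timeout)]
    argv += ["run", command]
    return argv


# The remote command stays a single argv element, so it needs no extra shell quoting.
print(shlex.join(relay_argv("python -m pytest tests/gpu -x", timeout=1800)))
```

The resulting list could be passed to `subprocess.run(argv, check=True)` once `server.sh` is running inside the container.
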
@@ -24,6 +24,7 @@ modelopt/torch/nas @NVIDIA/modelopt-torch-nas-prune-codeowners
 modelopt/torch/opt @NVIDIA/modelopt-torch-opt-codeowners
 modelopt/torch/peft @NVIDIA/modelopt-torch-peft-codeowners
 modelopt/torch/prune @NVIDIA/modelopt-torch-nas-prune-codeowners
+modelopt/torch/puzzletron @NVIDIA/modelopt-torch-puzzletron-codeowners
 modelopt/torch/quantization @NVIDIA/modelopt-torch-quantization-codeowners
 modelopt/torch/sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
 modelopt/torch/speculative @NVIDIA/modelopt-torch-speculative-codeowners
@@ -49,6 +50,7 @@ modelopt_recipes @NVIDIA/modelopt-recipes-codeowners
 /examples/model_hub @NVIDIA/modelopt-examples-model_hub-codeowners
 /examples/onnx_ptq @NVIDIA/modelopt-onnx-codeowners
 /examples/pruning @NVIDIA/modelopt-torch-nas-prune-codeowners
+/examples/puzzletron @NVIDIA/modelopt-torch-puzzletron-codeowners
 /examples/specdec_bench @NVIDIA/modelopt-torch-speculative-codeowners
 /examples/speculative_decoding @NVIDIA/modelopt-torch-speculative-codeowners
 /examples/torch_onnx @NVIDIA/modelopt-onnx-codeowners
 
@@ -48,6 +48,7 @@ jobs:
       - name: Install dependencies
         run: |
           # use `python -m pip` instead of `pip` to avoid conflicts with system pip for nemo containers
+          pip uninstall -y nvidia-modelopt
           python -m pip install ".${{ inputs.pip_install_extras }}"
 
           if [[ "${{ inputs.example }}" == *"diffusers"* ]]; then
@@ -64,7 +65,7 @@ jobs:
           COVERAGE_FILE: ${{ github.workspace }}/.coverage
         run: |
           echo "Running tests for: ${{ inputs.example }}"
-          pytest tests/examples/${{ inputs.example }} --cov
+          python -m pytest tests/examples/${{ inputs.example }} --cov
       - name: Upload coverage to Codecov
         uses: codecov/codecov-action@v5
         with:
 
@@ -132,7 +132,7 @@ jobs:
       docker_image: "nvcr.io/nvidia/nemo:26.02"
       example: ${{ matrix.example }}
       timeout_minutes: 30
-      pip_install_extras: "[hf,dev-test]"
+      pip_install_extras: "[hf,puzzletron,dev-test]"
       runner: linux-amd64-gpu-rtxpro6000-latest-1
 
   nemo-non-pr:
@@ -144,7 +144,7 @@ jobs:
       docker_image: "nvcr.io/nvidia/nemo:26.02"
       example: ${{ matrix.example }}
       timeout_minutes: 30
-      pip_install_extras: "[hf,dev-test]"
+      pip_install_extras: "[hf,puzzletron,dev-test]"
       runner: linux-amd64-gpu-rtxpro6000-latest-2
 
   ##### ONNX/TensorRT Example Tests #####
 
@@ -68,7 +68,7 @@ jobs:
       matrix:
         include:
           - example: gpu
-            timeout: 45
+            timeout: 60
             container_image: pytorch:26.01-py3
             # tests/gpu/_extensions/test_onnx_extensions.py fails for newer containers until https://github.com/tbenthompson/cppimport/pull/98
           - example: gpu-regression
 
@@ -94,6 +94,7 @@ repos:
               modelopt/onnx/quantization/ort_patching.py|
               modelopt/torch/_deploy/utils/onnx_utils.py|
               modelopt/torch/export/transformer_engine.py|
+              modelopt/torch/puzzletron/anymodel/models/gpt_oss/gpt_oss_pruned_to_mxfp4.py|
               modelopt/torch/quantization/export_onnx.py|
               modelopt/torch/quantization/plugins/attention.py|
               modelopt/torch/sparsity/attention_sparsity/methods/vsa_utils.py|
 
@@ -7,6 +7,7 @@ Changelog
 **New Features**
 
 - Support full Transformer Engine spec for Minitron pruning (``mcore_minitron``). Now we no longer need to use custom ModelOpt spec. Note that this does not affect the usage of the pruning workflow but makes pruning slightly faster and may result in slightly different pruned model because of different kernel and numerics.
+- Add Puzzletron, a new algorithm for heterogeneous pruning of LLMs and VLMs. See `examples/puzzletron/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/puzzletron>`_ for more details.
 - Added iterator interface using CalibrationDataReader in ONNX quantization workflow.
 - Add N:M sparse softmax support to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
 - Add skip-softmax skipping to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
@@ -22,6 +23,7 @@ Changelog
 **Bug Fixes**
 
 - Fix Minitron pruning (``mcore_minitron``) for MoE models. Importance estimation hooks were incorrectly registered for MoE modules and NAS step was hanging before this.
+- Fix the minimum TRT version required for remote autotuning in ONNX Autotune (10.15+ instead of 10.16+), and check the ``trtexec`` version instead of the TRT Python API version when using the ``trtexec`` backend.
 
 **Misc**
 
 
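The 10.15+ requirement in the bug-fix entry above amounts to a version gate. The sketch below shows one way such a gate could be written; `major_minor` and `supports_remote_autotuning` are hypothetical names, not ModelOpt functions, and obtaining the version string from `trtexec` or the TRT Python API is deliberately left out.

```python
import re

MIN_REMOTE_AUTOTUNE = (10, 15)  # minimum TensorRT (major, minor) per the changelog entry


def major_minor(version):
    """Extract (major, minor) from a dotted version string such as '10.15.1'."""
    match = re.match(r"\s*(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_remote_autotuning(version):
    # Compare integer tuples; comparing the raw strings would rank "10.9" above "10.15".
    return major_minor(version) >= MIN_REMOTE_AUTOTUNE


for candidate in ("10.9.0", "10.15.1", "11.0"):
    print(candidate, supports_remote_autotuning(candidate))
```

Comparing parsed integers rather than strings matters here because minor versions have reached two digits.
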
@@ -31,6 +31,7 @@
 # import sys
 # sys.path.insert(0, os.path.abspath('.'))
 
+import contextlib
 import os
 import sys
 
@@ -44,6 +45,14 @@
 sys.path.insert(0, os.path.abspath("../../"))
 sys.path.append(os.path.abspath("./_ext"))
 
+# Pre-import modelopt.torch so it is cached in sys.modules before Sphinx applies
+# autodoc_mock_imports.  Mocking triton/tensorrt_llm at the Sphinx level can break
+# transitive imports (transformers, transformer_engine, …) and cause modelopt.torch
+# to fail inside autosummary.  Importing here — while the real packages are still on
+# sys.path — avoids that problem entirely.
+with contextlib.suppress(Exception):
+    import modelopt.torch  # noqa: F401
+
 # -- Project information -----------------------------------------------------
 
 project = "Model Optimizer"  # pylint: disable=C0103
 
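The pre-import in the `conf.py` hunk above relies on Python serving later imports from the `sys.modules` cache. The sketch below illustrates that behaviour with standard-library stand-ins: `json` plays the role of `modelopt.torch`, and swapping `json.decoder` in `sys.modules` is an assumed approximation of mocking a dependency, not Sphinx's actual mechanism. `preimport` is a hypothetical helper.

```python
import contextlib
import importlib
import sys
from unittest import mock


def preimport(name):
    """Import `name` if possible and swallow any failure, as the conf.py hunk does."""
    with contextlib.suppress(Exception):
        importlib.import_module(name)
    return name in sys.modules


# `json` stands in for modelopt.torch; `json.decoder` for a dependency mocked later.
assert preimport("json")
real_json = sys.modules["json"]

with mock.patch.dict(sys.modules, {"json.decoder": mock.MagicMock()}):
    # A fresh `import json` is answered from the cache, so the module that was
    # imported against the real dependency is reused unchanged.
    import json

    assert json is real_json
    assert json.loads("[1, 2]") == [1, 2]

# A module that cannot be imported is skipped silently instead of raising.
assert not preimport("no_such_module_for_this_sketch")
```

Suppressing `Exception` keeps a docs build alive when the package is unavailable, at the cost of hiding the reason the import failed.
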
@@ -221,6 +221,31 @@ If the model uses custom TensorRT operations, provide the plugin libraries:
        --output_dir ./results \
        --plugin_libraries /path/to/plugin1.so /path/to/plugin2.so
 
+Remote Autotuning
+-----------------------
+
+TensorRT 10.15+ supports remote autotuning in safety mode (``--safe``), which allows TensorRT's optimization process to be offloaded to remote hardware. This is useful when optimizing models for different target GPUs without having direct access to them.
+
+To use remote autotuning during Q/DQ placement optimization, run with ``trtexec`` and pass extra args:
+
+.. code-block:: bash
+
+   python -m modelopt.onnx.quantization.autotune \
+       --onnx_path model.onnx \
+       --output_dir ./model_remote_autotuned \
+       --schemes_per_region 50 \
+       --use_trtexec \
+       --trtexec_benchmark_args "--remoteAutoTuningConfig=\"<remote autotuning config>\" --safe --skipInference"
+
+**Requirements:**
+
+* TensorRT 10.15 or later
+* Valid remote autotuning configuration
+* ``--use_trtexec`` must be set (benchmarking uses ``trtexec`` instead of the TensorRT Python API)
+* ``--safe --skipInference`` must be enabled via ``--trtexec_benchmark_args``
+
+Replace ``<remote autotuning config>`` with an actual remote autotuning configuration string (see ``trtexec --help`` for more details). Other TensorRT benchmark options (e.g. ``--timing_cache``, ``--warmup_runs``, ``--timing_runs``, ``--plugin_libraries``) are also available; run ``python -m modelopt.onnx.quantization.autotune --help`` for details.
+
 Low-Level API Usage
 ===================
 
 
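The remote autotuning command documented in the hunk above nests one quoted argument inside another, which is easy to get wrong in a shell. As a sketch under stated assumptions, the same invocation can be built as an argument list in Python; `build_autotune_command` is a hypothetical helper, the flags are copied from the documented example, and the config string remains a placeholder.

```python
import shlex


def build_autotune_command(onnx_path, output_dir, remote_config, schemes_per_region=50):
    """Assemble the documented autotune invocation as an argv list (nothing is executed)."""
    # Everything destined for trtexec travels as ONE argument value.
    trtexec_args = f'--remoteAutoTuningConfig="{remote_config}" --safe --skipInference'
    return [
        "python", "-m", "modelopt.onnx.quantization.autotune",
        "--onnx_path", onnx_path,
        "--output_dir", output_dir,
        "--schemes_per_region", str(schemes_per_region),
        "--use_trtexec",
        "--trtexec_benchmark_args", trtexec_args,
    ]


cmd = build_autotune_command(
    "model.onnx", "./model_remote_autotuned", "<remote autotuning config>"
)
print(shlex.join(cmd))  # shell-safe rendering of the argv list
```

Passing a list to `subprocess.run` sidesteps the escaped inner quotes that the shell form requires.
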
@@ -40,6 +40,22 @@ accelerate launch --multi_gpu --num_processes <num_copies_of_your_model> \
     --batch_size 4
 ```
 
+### Heterogeneous Pruned Checkpoints (Puzzletron)
+
+Heterogeneous pruned checkpoints produced by Puzzletron are automatically detected and loaded with the appropriate model patcher. No additional flags are needed beyond specifying the checkpoint path:
+
+```sh
+python lm_eval_hf.py --model hf \
+    --model_args pretrained=path/to/anymodel/checkpoint,dtype=bfloat16,parallelize=True \
+    --tasks mmlu \
+    --num_fewshot 5 \
+    --batch_size 4
+```
+
+For a quick smoke test, add `--limit 10`.
+
+> **Note:** Requires the `puzzletron` extra to be installed (`pip install -e ".[puzzletron]"`).
+
 ### Quantized (simulated)
 
 - For simulated quantization with any of the default quantization formats: