Commit 79b5f2a

Merge branch 'main' into jingyux/diffusion-skip-softmax
2 parents 5ab4ebb + 361f7e3 commit 79b5f2a

246 files changed

Lines changed: 25259 additions & 201 deletions

.claude/skills/debug/SKILL.md

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+---
+name: debug
+description: Run commands inside a remote Docker container via the file-based command relay (tools/debugger). Use when the user says "run in Docker", "run on GPU", "debug remotely", "run test in container", "check nvidia-smi", "run pytest in Docker", or needs to execute any command inside a Docker container that shares the repo filesystem. Requires the user to have started server.sh inside the container first.
+---
+
+# Remote Docker Debugger
+
+Execute commands inside a Docker container from the host using the file-based command relay.
+
+**Read `tools/debugger/CLAUDE.md` for full usage details** — it has the protocol and examples.
+
+## Quick Reference
+
+```bash
+# Check connection
+bash tools/debugger/client.sh status
+
+# Connect to server (user must start server.sh in Docker first)
+bash tools/debugger/client.sh handshake
+
+# Run a command
+bash tools/debugger/client.sh run "<command>"
+
+# Long-running command (default timeout is 600s)
+bash tools/debugger/client.sh --timeout 1800 run "<command>"
+
+# Cancel the currently running command
+bash tools/debugger/client.sh cancel
+
+# Reconnect after server restart
+bash tools/debugger/client.sh flush
+bash tools/debugger/client.sh handshake
+```
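The quick-reference commands in the skill file could be driven from a script roughly as follows. This is only a sketch: `build_client_argv` and `run_remote` are hypothetical helpers, not part of `tools/debugger`, and they do nothing beyond assembling the same `client.sh` invocations listed in the skill file. It assumes the script runs from the repository root and that `server.sh` is already running in the container.

```python
from __future__ import annotations

import subprocess

CLIENT = "tools/debugger/client.sh"  # path taken from the skill file


def build_client_argv(action: str, command: str | None = None,
                      timeout: int | None = None) -> list[str]:
    """Assemble a client.sh invocation (hypothetical helper, for illustration)."""
    argv = ["bash", CLIENT]
    if timeout is not None:
        # In the quick reference, --timeout comes before the subcommand.
        argv += ["--timeout", str(timeout)]
    argv.append(action)
    if command is not None:
        argv.append(command)
    return argv


def run_remote(command: str, timeout: int | None = None) -> subprocess.CompletedProcess:
    """Run one command through the relay; requires a prior successful handshake."""
    return subprocess.run(build_client_argv("run", command, timeout), check=False)
```

For example, `build_client_argv("run", "nvidia-smi", timeout=1800)` yields the same argument list as the "long-running command" line of the quick reference.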

.github/CODEOWNERS

Lines changed: 2 additions & 0 deletions
@@ -24,6 +24,7 @@ modelopt/torch/nas @NVIDIA/modelopt-torch-nas-prune-codeowners
 modelopt/torch/opt @NVIDIA/modelopt-torch-opt-codeowners
 modelopt/torch/peft @NVIDIA/modelopt-torch-peft-codeowners
 modelopt/torch/prune @NVIDIA/modelopt-torch-nas-prune-codeowners
+modelopt/torch/puzzletron @NVIDIA/modelopt-torch-puzzletron-codeowners
 modelopt/torch/quantization @NVIDIA/modelopt-torch-quantization-codeowners
 modelopt/torch/sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
 modelopt/torch/speculative @NVIDIA/modelopt-torch-speculative-codeowners
@@ -49,6 +50,7 @@ modelopt_recipes @NVIDIA/modelopt-recipes-codeowners
 /examples/model_hub @NVIDIA/modelopt-examples-model_hub-codeowners
 /examples/onnx_ptq @NVIDIA/modelopt-onnx-codeowners
 /examples/pruning @NVIDIA/modelopt-torch-nas-prune-codeowners
+/examples/puzzletron @NVIDIA/modelopt-torch-puzzletron-codeowners
 /examples/specdec_bench @NVIDIA/modelopt-torch-speculative-codeowners
 /examples/speculative_decoding @NVIDIA/modelopt-torch-speculative-codeowners
 /examples/torch_onnx @NVIDIA/modelopt-onnx-codeowners

.github/workflows/_example_tests_runner.yml

Lines changed: 2 additions & 1 deletion
@@ -48,6 +48,7 @@ jobs:
 - name: Install dependencies
 run: |
 # use `python -m pip` instead of `pip` to avoid conflicts with system pip for nemo containers
+pip uninstall -y nvidia-modelopt
 python -m pip install ".${{ inputs.pip_install_extras }}"
 
 if [[ "${{ inputs.example }}" == *"diffusers"* ]]; then
@@ -64,7 +65,7 @@ jobs:
 COVERAGE_FILE: ${{ github.workspace }}/.coverage
 run: |
 echo "Running tests for: ${{ inputs.example }}"
-pytest tests/examples/${{ inputs.example }} --cov
+python -m pytest tests/examples/${{ inputs.example }} --cov
 - name: Upload coverage to Codecov
 uses: codecov/codecov-action@v5
 with:

.github/workflows/example_tests.yml

Lines changed: 2 additions & 2 deletions
@@ -132,7 +132,7 @@ jobs:
 docker_image: "nvcr.io/nvidia/nemo:26.02"
 example: ${{ matrix.example }}
 timeout_minutes: 30
-pip_install_extras: "[hf,dev-test]"
+pip_install_extras: "[hf,puzzletron,dev-test]"
 runner: linux-amd64-gpu-rtxpro6000-latest-1
 
 nemo-non-pr:
@@ -144,7 +144,7 @@ jobs:
 docker_image: "nvcr.io/nvidia/nemo:26.02"
 example: ${{ matrix.example }}
 timeout_minutes: 30
-pip_install_extras: "[hf,dev-test]"
+pip_install_extras: "[hf,puzzletron,dev-test]"
 runner: linux-amd64-gpu-rtxpro6000-latest-2
 
 ##### ONNX/TensorRT Example Tests #####

.github/workflows/gpu_tests.yml

Lines changed: 1 addition & 1 deletion
@@ -68,7 +68,7 @@ jobs:
 matrix:
 include:
 - example: gpu
-timeout: 45
+timeout: 60
 container_image: pytorch:26.01-py3
 # tests/gpu/_extensions/test_onnx_extensions.py fails for newer containers until https://github.com/tbenthompson/cppimport/pull/98
 - example: gpu-regression

.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -94,6 +94,7 @@ repos:
 modelopt/onnx/quantization/ort_patching.py|
 modelopt/torch/_deploy/utils/onnx_utils.py|
 modelopt/torch/export/transformer_engine.py|
+modelopt/torch/puzzletron/anymodel/models/gpt_oss/gpt_oss_pruned_to_mxfp4.py|
 modelopt/torch/quantization/export_onnx.py|
 modelopt/torch/quantization/plugins/attention.py|
 modelopt/torch/sparsity/attention_sparsity/methods/vsa_utils.py|

CHANGELOG.rst

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,7 @@ Changelog
 **New Features**
 
 - Support full Transformer Engine spec for Minitron pruning (``mcore_minitron``). Now we no longer need to use custom ModelOpt spec. Note that this does not affect the usage of the pruning workflow but makes pruning slightly faster and may result in slightly different pruned model because of different kernel and numerics.
+- Add Puzzletron - a new algorithm for heterogeneous pruning of LLM and VLM models. See `examples/puzzletron/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/puzzletron>`_ for more details.
 - Added iterator interface using CalibrationDataReader in ONNX quantization workflow.
 - Add N:M sparse softmax support to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
 - Add skip-softmax skipping to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
@@ -22,6 +23,7 @@ Changelog
 **Bug Fixes**
 
 - Fix Minitron pruning (``mcore_minitron``) for MoE models. Importance estimation hooks were incorrectly registered for MoE modules and NAS step was hanging before this.
+- Fix TRT support for remote autotuning in ONNX Autotune from 10.16+ to 10.15+ and fix TRT versioning check to the ``trtexec`` version instead of the TRT Python API when using ``trtexec`` backend.
 
 **Misc**
 

docs/source/conf.py

Lines changed: 9 additions & 0 deletions
@@ -31,6 +31,7 @@
 # import sys
 # sys.path.insert(0, os.path.abspath('.'))
 
+import contextlib
 import os
 import sys
 
@@ -44,6 +45,14 @@
 sys.path.insert(0, os.path.abspath("../../"))
 sys.path.append(os.path.abspath("./_ext"))
 
+# Pre-import modelopt.torch so it is cached in sys.modules before Sphinx applies
+# autodoc_mock_imports. Mocking triton/tensorrt_llm at the Sphinx level can break
+# transitive imports (transformers, transformer_engine, …) and cause modelopt.torch
+# to fail inside autosummary. Importing here — while the real packages are still on
+# sys.path — avoids that problem entirely.
+with contextlib.suppress(Exception):
+    import modelopt.torch  # noqa: F401
+
 # -- Project information -----------------------------------------------------
 
 project = "Model Optimizer"  # pylint: disable=C0103
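The pre-import in `conf.py` relies on a general property of Python's import system: `sys.modules` is consulted before any finder on `sys.meta_path`, so a module that is already cached is not re-resolved by a mocking layer installed later. The sketch below illustrates that property only. `MockFinder`, `demo_pkg_real`, and `demo_pkg_missing` are made-up names for this example, and the toy finder is a stand-in for a mocking layer, not Sphinx's actual implementation.

```python
import importlib.abc
import importlib.machinery
import sys
import types


class MockFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Toy finder that serves an empty stand-in module for the given names."""

    def __init__(self, names):
        self.names = set(names)

    def find_spec(self, fullname, path=None, target=None):
        if fullname in self.names:
            return importlib.machinery.ModuleSpec(fullname, self)
        return None

    def create_module(self, spec):
        module = types.ModuleType(spec.name)
        module.is_mock = True
        return module

    def exec_module(self, module):
        pass  # nothing to execute for a mock


# Stand-in for a real package that was imported before mocking started.
real = types.ModuleType("demo_pkg_real")
real.is_mock = False
sys.modules["demo_pkg_real"] = real

# Install the mocking layer afterwards, covering both names.
sys.meta_path.insert(0, MockFinder({"demo_pkg_real", "demo_pkg_missing"}))

import demo_pkg_real     # served from sys.modules; the finder is never asked
import demo_pkg_missing  # not cached, so the toy finder supplies a mock

print(demo_pkg_real.is_mock, demo_pkg_missing.is_mock)
```

The cached module keeps its real contents while the uncached one is mocked, which is the effect the `conf.py` comment describes for `modelopt.torch`.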

docs/source/guides/9_autotune.rst

Lines changed: 25 additions & 0 deletions
@@ -221,6 +221,31 @@ If the model uses custom TensorRT operations, provide the plugin libraries:
         --output_dir ./results \
         --plugin_libraries /path/to/plugin1.so /path/to/plugin2.so
 
+Remote Autotuning
+-----------------------
+
+TensorRT 10.15+ supports remote autotuning in safety mode (``--safe``), which allows TensorRT's optimization process to be offloaded to a remote hardware. This is useful when optimizing models for different target GPUs without having direct access to them.
+
+To use remote autotuning during Q/DQ placement optimization, run with ``trtexec`` and pass extra args:
+
+.. code-block:: bash
+
+    python -m modelopt.onnx.quantization.autotune \
+        --onnx_path model.onnx \
+        --output_dir ./model_remote_autotuned \
+        --schemes_per_region 50 \
+        --use_trtexec \
+        --trtexec_benchmark_args "--remoteAutoTuningConfig=\"<remote autotuning config>\" --safe --skipInference"
+
+**Requirements:**
+
+* TensorRT 10.15 or later
+* Valid remote autotuning configuration
+* ``--use_trtexec`` must be set (benchmarking uses ``trtexec`` instead of the TensorRT Python API)
+* ``--safe --skipInference`` must be enabled via ``--trtexec_benchmark_args``
+
+Replace ``<remote autotuning config>`` with an actual remote autotuning configuration string (see ``trtexec --help`` for more details). Other TensorRT benchmark options (e.g. ``--timing_cache``, ``--warmup_runs``, ``--timing_runs``, ``--plugin_libraries``) are also available; run ``--help`` for details.
+
 Low-Level API Usage
 ===================
 
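When the remote-autotuning command from the guide is launched from a script, the nested quoting around `--trtexec_benchmark_args` is easy to get wrong. The sketch below builds the argument list instead of a shell string. `build_autotune_argv` is a hypothetical helper written for this illustration; it uses only the flags shown in the guide, and the configuration string stays a placeholder.

```python
import shlex


def build_autotune_argv(onnx_path, output_dir, remote_config, schemes_per_region=50):
    """Assemble the remote-autotuning command from the guide (illustrative helper)."""
    # All trtexec flags travel as ONE string behind --trtexec_benchmark_args.
    trtexec_args = f'--remoteAutoTuningConfig="{remote_config}" --safe --skipInference'
    return [
        "python", "-m", "modelopt.onnx.quantization.autotune",
        "--onnx_path", onnx_path,
        "--output_dir", output_dir,
        "--schemes_per_region", str(schemes_per_region),
        "--use_trtexec",
        "--trtexec_benchmark_args", trtexec_args,
    ]


argv = build_autotune_argv(
    "model.onnx", "./model_remote_autotuned", "<remote autotuning config>"
)
# shlex.join renders a copy-pasteable shell line with the quoting handled.
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run` avoids shell quoting altogether; `shlex.join` is only used here to display an equivalent shell command.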

examples/llm_eval/README.md

Lines changed: 16 additions & 0 deletions
@@ -40,6 +40,22 @@ accelerate launch --multi_gpu --num_processes <num_copies_of_your_model> \
 --batch_size 4
 ```
 
+### Heterogeneous Pruned Checkpoints (Puzzletron)
+
+Heterogeneous pruned checkpoints produced by Puzzletron are automatically detected and loaded with the appropriate model patcher. No additional flags are needed beyond specifying the checkpoint path:
+
+```sh
+python lm_eval_hf.py --model hf \
+--model_args pretrained=path/to/anymodel/checkpoint,dtype=bfloat16,parallelize=True \
+--tasks mmlu \
+--num_fewshot 5 \
+--batch_size 4
+```
+
+For a quick smoke test, add `--limit 10`.
+
+> **Note:** Requires the `puzzletron` extra to be installed (`pip install -e ".[puzzletron]"`).
+
 ### Quantized (simulated)
 
 - For simulated quantization with any of the default quantization formats:
