
Commit c50a346

Merge branch 'main' into fix-example-script
Signed-off-by: bkartal-dev <bkartal@nvidia.com>
2 parents f8ee452 + 714dbd2

92 files changed: 3558 additions & 2437 deletions


.dockerignore

Lines changed: 0 additions & 1 deletion
```diff
@@ -64,4 +64,3 @@ venv/
 **.pkl
 **.pickle
 **.tar.gz
-**.nemo
```

.github/CODEOWNERS

Lines changed: 0 additions & 1 deletion
```diff
@@ -46,7 +46,6 @@ modelopt/torch/utils @NVIDIA/modelopt-torch-utils-codeowners
 /examples/llm_sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
 /examples/megatron_bridge @NVIDIA/modelopt-examples-megatron-codeowners
 /examples/model_hub @NVIDIA/modelopt-examples-model_hub-codeowners
-/examples/nemo_run @NVIDIA/modelopt-examples-megatron-codeowners
 /examples/onnx_ptq @NVIDIA/modelopt-onnx-codeowners
 /examples/pruning @NVIDIA/modelopt-torch-nas-prune-codeowners
 /examples/specdec_bench @NVIDIA/modelopt-torch-speculative-codeowners
```
Lines changed: 47 additions & 0 deletions
```diff
@@ -0,0 +1,47 @@
+name: Delete Outdated PR Branches
+
+on:
+  schedule:
+    - cron: "0 9 * * 1" # Every Monday at 9:00 UTC
+  workflow_dispatch: # On-demand
+
+permissions:
+  contents: write
+  pull-requests: read
+
+jobs:
+  delete-outdated-pr-branches:
+    runs-on: ubuntu-latest
+    timeout-minutes: 15
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          fetch-depth: 0
+      - name: Delete branches for closed/merged PRs
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          REPO="${{ github.repository }}"
+          DELETED=0
+          SKIPPED=0
+
+          # List all remote branches matching pull-request/<num>
+          git fetch --prune origin
+          for branch in $(git branch -r | grep -oP 'origin/pull-request/\K[0-9]+' | sort -un); do
+            FULL_BRANCH="pull-request/${branch}"
+            STATE=$(gh pr view "$branch" --repo "$REPO" --json state --jq '.state' 2>/dev/null || echo "")
+
+            if [ "$STATE" = "CLOSED" ] || [ "$STATE" = "MERGED" ]; then
+              echo "Deleting branch '${FULL_BRANCH}' (PR #${branch} is ${STATE})"
+              git push origin --delete "$FULL_BRANCH" && DELETED=$((DELETED + 1)) || true
+            elif [ "$STATE" = "OPEN" ]; then
+              echo "Skipping branch '${FULL_BRANCH}' (PR #${branch} is still OPEN)"
+              SKIPPED=$((SKIPPED + 1))
+            else
+              echo "Skipping branch '${FULL_BRANCH}' (could not determine PR #${branch} state)"
+              SKIPPED=$((SKIPPED + 1))
+            fi
+          done
+
+          echo ""
+          echo "Done. Deleted: ${DELETED}, Skipped: ${SKIPPED}"
```
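The cleanup loop above hinges on one extraction step: pulling PR numbers out of `origin/pull-request/<num>` branch names before querying their state. A minimal sketch of that pipeline, run against a canned branch listing rather than a live repository (the sample branch names are hypothetical; only the grep/sort pipeline mirrors the workflow, and `grep -oP` requires GNU grep):

```shell
# Canned stand-in for `git branch -r` output; branch names are made up.
SAMPLE="  origin/main
  origin/pull-request/42
  origin/pull-request/7
  origin/pull-request/42
  origin/feature/x"

# Same extraction as the workflow: keep only the numeric suffix of
# pull-request/<num> branches, then dedupe and sort numerically.
NUMBERS=$(printf '%s\n' "$SAMPLE" | grep -oP 'origin/pull-request/\K[0-9]+' | sort -un)
echo "$NUMBERS"   # prints 7 then 42
```

The `sort -un` dedupe matters: each PR number is queried with `gh pr view` exactly once, even if stale duplicates linger in the remote listing.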

.github/workflows/pages.yml

Lines changed: 37 additions & 12 deletions
```diff
@@ -2,6 +2,7 @@ name: Docs
 
 on:
   pull_request:
+    types: [opened, synchronize, reopened, closed]
     branches: [main, release/*, feature/*]
   push:
     branches: [main]
@@ -14,11 +15,9 @@ concurrency:
   group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}
   cancel-in-progress: true
 
-# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
 permissions:
-  contents: read
-  pages: write
-  id-token: write
+  contents: write # push to gh-pages branch
+  pull-requests: write # post/update preview URL comment on PRs
 
 jobs:
   build-docs:
@@ -30,18 +29,44 @@
       - name: Build docs
        run: pip install tox && tox -e build-docs
       - name: Upload docs artifact
-        if: github.event_name == 'push'
-        uses: actions/upload-pages-artifact@v4
+        if: github.event_name == 'push' || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch'
+        uses: actions/upload-artifact@v4
         with:
+          name: docs-html
           path: docs/build/html
 
+  deploy-preview:
+    if: github.event_name == 'pull_request'
+    runs-on: ubuntu-latest
+    timeout-minutes: 30
+    # Per-PR concurrency without cancel-in-progress so 'closed' cleanup always runs
+    concurrency:
+      group: pr-preview-${{ github.event.pull_request.number }}
+    steps:
+      - uses: actions/checkout@v6
+      - uses: ./.github/actions/ubuntu-setup
+      - name: Build docs
+        if: github.event.action != 'closed'
+        run: pip install tox && tox -e build-docs
+      - name: Deploy / remove PR preview
+        uses: rossjrw/pr-preview-action@v1
+        with:
+          source-dir: docs/build/html
+
   deploy-gh-pages:
-    if: github.event_name == 'push'
+    if: github.event_name == 'push' || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch'
     needs: build-docs
     runs-on: ubuntu-latest
-    environment:
-      name: github-pages
-      url: ${{ steps.deployment.outputs.page_url }}
     steps:
-      - id: deployment
-        uses: actions/deploy-pages@v4
+      - uses: actions/checkout@v6
+      - name: Download docs artifact
+        uses: actions/download-artifact@v4
+        with:
+          name: docs-html
+          path: docs/build/html
+      - name: Deploy to GitHub Pages
+        uses: JamesIves/github-pages-deploy-action@v4
+        with:
+          folder: docs/build/html
+          # Preserve PR preview subdirectories deployed by the deploy-preview job
+          clean-exclude: pr-preview
```
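The `clean-exclude: pr-preview` setting is what keeps the two deploy jobs from clobbering each other: the main-site deploy removes stale files on the `gh-pages` branch but leaves the excluded directory untouched, so previews published by `deploy-preview` survive. A rough sketch of that clean-with-exclusion behavior (this is not the deploy action's actual implementation; all file paths are made up for the demo):

```shell
# Simulate a deployed site tree in a temp directory; paths are hypothetical.
site=$(mktemp -d)
mkdir -p "$site/pr-preview/pr-123" "$site/old-docs"
touch "$site/stale.html" "$site/pr-preview/pr-123/index.html"

# "Clean" step: delete every top-level entry except the excluded directory,
# mimicking what clean-exclude: pr-preview asks the deploy action to do.
for entry in "$site"/*; do
  [ "$(basename "$entry")" = "pr-preview" ] && continue
  rm -rf "$entry"
done

ls "$site"   # only pr-preview remains
```

Without the exclusion, the push-triggered deploy would wipe every open PR's preview directory each time main's docs are republished.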

.gitignore

Lines changed: 0 additions & 1 deletion
```diff
@@ -58,4 +58,3 @@ venv/
 **.pkl
 **.pickle
 **.tar.gz
-**.nemo
```

.gitlab/.gitlab-ci.yml

Lines changed: 0 additions & 13 deletions
This file was deleted.

.gitlab/tests.yml

Lines changed: 0 additions & 37 deletions
This file was deleted.

CHANGELOG.rst

Lines changed: 23 additions & 2 deletions
```diff
@@ -1,7 +1,23 @@
 NVIDIA Model Optimizer Changelog
 ================================
+0.44 (2026-04-xx)
 
-0.43 (2026-03-xx)
+**New Features**
+- Added iterator interface using CalibrationDataReader in ONNX quantization workflow.
+
+
+0.44 (2026-05-xx)
+^^^^^^^^^^^^^^^^^
+
+**New Features**
+
+- Support full Transformer Engine spec for Minitron pruning (``mcore_minitron``). Now we no longer need to use custom ModelOpt spec. Note that this does not affect the usage of the pruning workflow but makes pruning slightly faster and may result in slightly different pruned model because of different kernel and numerics.
+
+**Bug Fixes**
+
+- Fix Minitron pruning (``mcore_minitron``) for MoE models. Importance estimation hooks were incorrectly registered for MoE modules and NAS step was hanging before this.
+
+0.43 (2026-04-09)
 ^^^^^^^^^^^^^^^^^
 
 **Bug Fixes**
@@ -25,6 +41,7 @@ NVIDIA Model Optimizer Changelog
 - Enable PTQ workflow for Qwen3.5 MoE models.
 - Enable PTQ workflow for the Kimi-K2.5 model.
 - Add ``nvfp4_omlp_only`` quantization format for NVFP4 quantization. This is similar to ``nvfp4_mlp_only`` but also quantizes the output projection layer in attention.
+- Add ``nvfp4_experts_only`` quantization config that targets only MoE routed expert layers (excluding shared) with NVFP4 quantization.
 - ``pass_through_bwd`` in the quantization config is now default to True. Please set it to False if you want to use STE with zeroed outlier gradients for potentially better QAT accuracy.
 - Add :meth:`compute_quantization_mse <modelopt.torch.quantization.model_quant.compute_quantization_mse>` API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
 - **Autotune**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the Autotune guide in the documentation.
@@ -34,12 +51,16 @@ NVIDIA Model Optimizer Changelog
 - Add support for block-granular RHT for non-power-of-2 dimensions.
 - Replace modelopt FP8 QDQ nodes with native ONNX QDQ nodes.
 
+**Deprecations**
+
+- Remove deprecated NeMo-2.0 Framework references.
+
 **Misc**
 
 - Migrated project metadata from ``setup.py`` to a fully declarative ``pyproject.toml``.
 - Enable experimental Python 3.13 wheel support and unit tests in CI/CD.
 
-0.42 (2026-02-xx)
+0.42 (2026-03-10)
 ^^^^^^^^^^^^^^^^^
 
 **Bug Fixes**
```

README.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -92,9 +92,9 @@ more fine-grained control on installed dependencies or for alternative docker im
 | **Technique** | **Description** | **Examples** | **Docs** |
 | :------------: | :------------: | :------------: | :------------: |
 | Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! | \[[LLMs](./examples/llm_ptq/)\] \[[diffusers](./examples/diffusers/)\] \[[VLMs](./examples/vlm_ptq/)\] \[[onnx](./examples/onnx_ptq/)\] \[[windows](./examples/windows/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
-| Quantization Aware Training | Refine accuracy even further with a few training steps! | \[[NeMo](./examples/llm_qat#nemo-qatqad-simplified-flow-example)\] \[[Hugging Face](./examples/llm_qat/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
-| Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! | \[[PyTorch](./examples/pruning/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/3_pruning.html)\] |
-| Distillation | Reduce deployment model size by teaching small models to behave like larger models! | \[[NeMo](./examples/llm_distill#knowledge-distillation-kd-for-nvidia-nemo-models)\] \[[Hugging Face](./examples/llm_distill/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/4_distillation.html)\] |
+| Quantization Aware Training | Refine accuracy even further with a few training steps! | \[[Hugging Face](./examples/llm_qat/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
+| Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! | \[[General](./examples/pruning/)\] \[[Megatron-Bridge](./examples/megatron_bridge/README.md#pruning)\] | |
+| Distillation | Reduce deployment model size by teaching small models to behave like larger models! | \[[Megatron-Bridge](./examples/llm_distill/README.md#knowledge-distillation-kd-in-nvidia-megatron-bridge-framework)\] \[[Megatron-LM](./examples/llm_distill/README.md#knowledge-distillation-kd-in-nvidia-megatron-lm-framework)\] \[[Hugging Face](./examples/llm_distill/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/4_distillation.html)\] |
 | Speculative Decoding | Train draft modules to predict extra tokens during inference! | \[[Megatron](./examples/speculative_decoding#mlm-example)\] \[[Hugging Face](./examples/speculative_decoding/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/5_speculative_decoding.html)\] |
 | Sparsity | Efficiently compress your model by storing only its non-zero parameter values and their locations | \[[PyTorch](./examples/llm_sparsity/)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/6_sparsity.html)\] |
 
```

docs/source/deployment/1_tensorrt_llm.rst

Lines changed: 1 addition & 1 deletion
```diff
@@ -15,7 +15,7 @@ ModelOpt toolkit supports automatic conversion of ModelOpt exported LLM to the T
 
 This conversion is achieved by:
 
-#. Converting Huggingface, NeMo and ModelOpt exported checkpoints to the TensorRT-LLM checkpoint.
+#. Converting Huggingface, Megatron-Bridge and ModelOpt exported checkpoints to the TensorRT-LLM checkpoint.
 #. Building TensorRT-LLM engine from the TensorRT-LLM checkpoint.
 
 
```

0 commit comments
