Merged
106 commits
6c038f9
Add modelopt/torch/_compress CODEOWNERS
kevalmorabia97 Oct 27, 2025
230cee1
Merge branch 'main' into feature/compress
kevalmorabia97 Oct 27, 2025
54c5f0f
Remove llm_ptq example tests from CICD
kevalmorabia97 Oct 27, 2025
9eeee25
E2E test for the experimental compress algorithm based on https://arx…
danielkorzekwa Oct 28, 2025
ad1d18e
Merge branch 'main' into feature/compress
kevalmorabia97 Oct 28, 2025
cef3655
Add convert_llama3_config_to_decilm_config + unit test (#465)
danielkorzekwa Oct 29, 2025
002b8b5
Implement nas.convert() api for the compress algorithm (#482)
danielkorzekwa Oct 31, 2025
1c12fd8
modelopt nas search() implementation for the compress algorithm (#490)
danielkorzekwa Nov 3, 2025
f7d547f
Add decilm modelling code (#505)
danielkorzekwa Nov 12, 2025
50a580c
Compress tutorial (PoC) (#492)
danielkorzekwa Nov 12, 2025
b121945
Add llama converter (no dependency on internal Nvidia code) - part 1/…
danielkorzekwa Nov 13, 2025
866e400
llama converter is self-contained now (no dependency on internal nvid…
danielkorzekwa Nov 14, 2025
0868f1c
Add integration test for attention pruning (#562)
danielkorzekwa Nov 14, 2025
69726cc
Merge branch 'main' into feature/compress
kevalmorabia97 Nov 15, 2025
07ca24d
Merge branch 'main' into feature/compress
kevalmorabia97 Nov 15, 2025
1dde209
Add score_pruning_activations (step 2/6) (#563)
danielkorzekwa Nov 18, 2025
2e559e7
Update README.md
kevalmorabia97 Nov 18, 2025
f10be0d
Add activation hooks used for pruning (#576)
danielkorzekwa Nov 20, 2025
194b532
Add sewing kit and utilities used for pruning scoring - pruning scori…
danielkorzekwa Nov 24, 2025
8c9cdd4
Add L2NormHook and use it in megatron.py (#599)
danielkorzekwa Nov 26, 2025
1f72466
Add pruning checkpoints for the compress algorithm (#607)
danielkorzekwa Nov 27, 2025
97fe7f0
Add build replacement library to the compress algorithm. (#616)
danielkorzekwa Dec 1, 2025
954103e
Add subblock stats to the compress algorithm (#623)
danielkorzekwa Dec 1, 2025
dcc425f
Add 1-block scoring to the compress algorithm (#625)
danielkorzekwa Dec 2, 2025
56d95de
Add checkpoint save/load to ForwardHook + add IterativeChannelContrib…
danielkorzekwa Dec 2, 2025
74aae83
Add MIP step to the compress algorithm (#627)
danielkorzekwa Dec 4, 2025
a1f63bc
Merge branch 'main' into feature/compress
kevalmorabia97 Dec 8, 2025
a99f503
Remove unused mip functions + fix multi-gpu test (#660)
kevalmorabia97 Dec 8, 2025
67489f4
Fix a bug in IterativeChannelContributionHook + tools for activation …
danielkorzekwa Dec 11, 2025
1d8bd20
Remove runtime.py and directly use torch dist utils + remove unused f…
kevalmorabia97 Dec 11, 2025
f7a0cb0
Use shared activation hooks component in the puzzle algorithm (#687)
danielkorzekwa Dec 17, 2025
db866d9
Clean up Puzzle Compress Tutorial (#711)
LianaMikael Dec 22, 2025
2e813bf
Two bug fixes: mix checkpointing and dtype (#718)
danielkorzekwa Dec 22, 2025
83ac3b1
Merge remote-tracking branch 'origin/main' into feature/compress
kevalmorabia97 Jan 13, 2026
0eecfc6
Fix test assertions for 2-gpu (#772)
kevalmorabia97 Jan 13, 2026
43b3cfa
Rename compress to puzzletron (#776)
kevalmorabia97 Jan 14, 2026
4c30bd5
Add NeMo Conversion Scripts to Puzzletron (#784)
LianaMikael Jan 15, 2026
96bb0ba
Merge branch 'main' into feature/compress
kevalmorabia97 Mar 3, 2026
8c84fee
[CI] Update to only run puzzletron tests
kevalmorabia97 Mar 3, 2026
5812777
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Mar 3, 2026
5f77c81
Pin torchprofile==0.0.4 to fix CI
kevalmorabia97 Mar 10, 2026
82df595
Add anymodel-core to feature/puzzletron (#974)
danielkorzekwa Mar 11, 2026
4dc9932
Draft: anymodel activation scoring (#989)
danielkorzekwa Mar 12, 2026
d358eb3
Draft: Merge anymodel pruning (#990)
danielkorzekwa Mar 12, 2026
8e827f3
Draft: Merging anymodel:build_library_and_stats (#993)
danielkorzekwa Mar 12, 2026
eb4b210
Draft: merge any model calc one block scores (#994)
danielkorzekwa Mar 12, 2026
8fe318d
Draft: merge any_model: mip_and_realize_models (#995)
danielkorzekwa Mar 13, 2026
2fbdf0e
Update uv.lock for nspect puzzletron scanning
kevalmorabia97 Mar 13, 2026
1b42f0b
Dkorzekwa/any model other models (#1007)
danielkorzekwa Mar 17, 2026
67999eb
Dkorzekwa/anymodel gptoss (#1020)
danielkorzekwa Mar 17, 2026
660dc17
Merge any_model tutorial (#1035)
danielkorzekwa Mar 19, 2026
01cba6a
Merge mbridge distillation for any_model (#1036)
danielkorzekwa Mar 20, 2026
2b6572c
MR branch for the remaining difference between dkorzekwa/any_model an…
danielkorzekwa Mar 20, 2026
110316a
Dkorzekwa/decilm hf code cleanup (#1071)
danielkorzekwa Mar 23, 2026
4190275
Dkorzekwa/decilm hf code cleanup 2 (#1073)
danielkorzekwa Mar 23, 2026
0708ca2
Dkorzekwa/anymodel subblock stats (#1085)
danielkorzekwa Mar 24, 2026
3193f30
Dkorzekwa/anymodel subblock stats nodecilm (#1102)
danielkorzekwa Mar 24, 2026
928036e
Dkorzekwa/decilm cleanup post subblockstats (#1103)
danielkorzekwa Mar 24, 2026
e508b76
code clean up (#1110)
danielkorzekwa Mar 24, 2026
f460d16
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Mar 25, 2026
2f55c73
Dkorzekwa/puzzletron use importance hooks from prune (#1115)
danielkorzekwa Mar 25, 2026
c5ec50b
Merge remote-tracking branch 'origin/main' into feature/puzzletron
kevalmorabia97 Mar 25, 2026
d257871
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Mar 30, 2026
7e15fdd
Revert CICD and other config changes
kevalmorabia97 Mar 30, 2026
d0209dc
Make Qwen and QwenVL descriptor generic so can be used for other vari…
kevalmorabia97 Mar 25, 2026
d987bad
Set strict=True in distill_hf export
kevalmorabia97 Mar 30, 2026
75651cc
add basic ruff fixes
kevalmorabia97 Mar 25, 2026
03118ce
Apply coderabbit suggestions
kevalmorabia97 Mar 30, 2026
2a170b9
Set weights_only=True in checkpoint_utils.py
kevalmorabia97 Mar 30, 2026
d6f8ddb
More fixes
kevalmorabia97 Mar 30, 2026
4621b65
reuse puzzletron tokenizer in other tests
kevalmorabia97 Mar 30, 2026
be4bd3a
disable puzzletron in coverage check as its covered in gpu tests only
kevalmorabia97 Mar 30, 2026
45426ca
Remove custom DistillationProvider and simplify mbridge distillation …
kevalmorabia97 Apr 1, 2026
5429d86
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Apr 1, 2026
41b8ca7
fix test
kevalmorabia97 Apr 2, 2026
33b9230
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Apr 7, 2026
25266b8
fix hydra config dtype resolution in puzzletron validation tools (#1202)
j-rausch Apr 8, 2026
fd5694d
Consolidate lm-eval scripts: merge AnyModel auto-detection into lm_ev…
j-rausch Apr 9, 2026
dedcad0
Merge remote-tracking branch 'origin/main' into feature/puzzletron
kevalmorabia97 Apr 10, 2026
d0cdbfd
minor cleanup
kevalmorabia97 Apr 10, 2026
ee01ace
Move block_config out of deci_lm_hf_code folder
kevalmorabia97 Apr 10, 2026
7ce3332
Fix critical bugs flagged by codeRabbit in PR #1121
kevalmorabia97 Apr 10, 2026
c7700a9
Fix critical and major bugs flagged by codeRabbit in PR #1121
kevalmorabia97 Apr 10, 2026
9f3cc2d
Fix minor bugs flagged by codeRabbit in PR #1121
kevalmorabia97 Apr 10, 2026
66fccd2
fix decoder_layer_cls failure on trust_remote_code models (#1222)
j-rausch Apr 10, 2026
7053c61
fix puzzletron container test path; add NeMo setup docs (#1231)
j-rausch Apr 10, 2026
05c6d3b
Add MoE/Nemotron fixes to support Transformers 5.5
kevalmorabia97 Apr 10, 2026
0d6eb7e
Update changelog
kevalmorabia97 Apr 10, 2026
ac8397b
Refactor puzzletron imports: relative imports, public API, logger fix
kevalmorabia97 Apr 10, 2026
977d60a
Merge remote-tracking branch 'origin/main' into feature/puzzletron
kevalmorabia97 Apr 10, 2026
01a4e55
Fix custom tiny tokenizer
kevalmorabia97 Apr 11, 2026
f361ca6
Fix `RuntimeError: pidfd_getfd: Operation not permitted`
kevalmorabia97 Apr 11, 2026
a7eedf8
Add __all__ for modules
kevalmorabia97 Apr 11, 2026
ed5fd68
Fix test_puzzletron assertions for transformers v5.5
kevalmorabia97 Apr 11, 2026
547e76d
Fix doc building
kevalmorabia97 Apr 11, 2026
62070ae
Fix Qwen2.5 test assertion as per CI machine
kevalmorabia97 Apr 13, 2026
38d9522
Address coderabbit comments
kevalmorabia97 Apr 13, 2026
6395b1e
copy custom modeling files to pruned checkpoint dirs (#1245)
j-rausch Apr 13, 2026
d88dfcb
consolidate mbridge distillation: merge distill_hf.py into distill.py…
j-rausch Apr 13, 2026
06eaf74
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Apr 13, 2026
3f41819
Address minor coderabbit comments
kevalmorabia97 Apr 13, 2026
ad8cf9a
fix lm-eval version conflict in puzzletron requirements (#1257)
j-rausch Apr 15, 2026
5e4c43e
fix hybrid model subblock param counting: all FFN sizes reported iden…
j-rausch Apr 15, 2026
2345af7
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Apr 15, 2026
e0bb89d
Fix test
kevalmorabia97 Apr 15, 2026
47a612e
Fix test path
kevalmorabia97 Apr 15, 2026
2 changes: 2 additions & 0 deletions .github/CODEOWNERS
@@ -24,6 +24,7 @@ modelopt/torch/nas @NVIDIA/modelopt-torch-nas-prune-codeowners
modelopt/torch/opt @NVIDIA/modelopt-torch-opt-codeowners
modelopt/torch/peft @NVIDIA/modelopt-torch-peft-codeowners
modelopt/torch/prune @NVIDIA/modelopt-torch-nas-prune-codeowners
modelopt/torch/puzzletron @NVIDIA/modelopt-torch-puzzletron-codeowners
modelopt/torch/quantization @NVIDIA/modelopt-torch-quantization-codeowners
modelopt/torch/sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
modelopt/torch/speculative @NVIDIA/modelopt-torch-speculative-codeowners
@@ -49,6 +50,7 @@ modelopt_recipes @NVIDIA/modelopt-recipes-codeowners
/examples/model_hub @NVIDIA/modelopt-examples-model_hub-codeowners
/examples/onnx_ptq @NVIDIA/modelopt-onnx-codeowners
/examples/pruning @NVIDIA/modelopt-torch-nas-prune-codeowners
/examples/puzzletron @NVIDIA/modelopt-torch-puzzletron-codeowners
/examples/specdec_bench @NVIDIA/modelopt-torch-speculative-codeowners
/examples/speculative_decoding @NVIDIA/modelopt-torch-speculative-codeowners
/examples/torch_onnx @NVIDIA/modelopt-onnx-codeowners
3 changes: 2 additions & 1 deletion .github/workflows/_example_tests_runner.yml
@@ -48,6 +48,7 @@ jobs:
- name: Install dependencies
run: |
# use `python -m pip` instead of `pip` to avoid conflicts with system pip for nemo containers
pip uninstall -y nvidia-modelopt
python -m pip install ".${{ inputs.pip_install_extras }}"

if [[ "${{ inputs.example }}" == *"diffusers"* ]]; then
@@ -64,7 +65,7 @@
COVERAGE_FILE: ${{ github.workspace }}/.coverage
run: |
echo "Running tests for: ${{ inputs.example }}"
pytest tests/examples/${{ inputs.example }} --cov
python -m pytest tests/examples/${{ inputs.example }} --cov
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v5
with:
4 changes: 2 additions & 2 deletions .github/workflows/example_tests.yml
@@ -132,7 +132,7 @@ jobs:
docker_image: "nvcr.io/nvidia/nemo:26.02"
example: ${{ matrix.example }}
timeout_minutes: 30
pip_install_extras: "[hf,dev-test]"
pip_install_extras: "[hf,puzzletron,dev-test]"
runner: linux-amd64-gpu-rtxpro6000-latest-1

nemo-non-pr:
@@ -144,7 +144,7 @@
docker_image: "nvcr.io/nvidia/nemo:26.02"
example: ${{ matrix.example }}
timeout_minutes: 30
pip_install_extras: "[hf,dev-test]"
pip_install_extras: "[hf,puzzletron,dev-test]"
runner: linux-amd64-gpu-rtxpro6000-latest-2

##### ONNX/TensorRT Example Tests #####
2 changes: 1 addition & 1 deletion .github/workflows/gpu_tests.yml
@@ -68,7 +68,7 @@ jobs:
matrix:
include:
- example: gpu
timeout: 45
timeout: 60
container_image: pytorch:26.01-py3
# tests/gpu/_extensions/test_onnx_extensions.py fails for newer containers until https://github.com/tbenthompson/cppimport/pull/98
- example: gpu-regression
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -94,6 +94,7 @@ repos:
modelopt/onnx/quantization/ort_patching.py|
modelopt/torch/_deploy/utils/onnx_utils.py|
modelopt/torch/export/transformer_engine.py|
modelopt/torch/puzzletron/anymodel/models/gpt_oss/gpt_oss_pruned_to_mxfp4.py|
modelopt/torch/quantization/export_onnx.py|
modelopt/torch/quantization/plugins/attention.py|
modelopt/torch/sparsity/attention_sparsity/methods/vsa_utils.py|
1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -7,6 +7,7 @@ Changelog
**New Features**

- Support full Transformer Engine spec for Minitron pruning (``mcore_minitron``). Now we no longer need to use the custom ModelOpt spec. Note that this does not affect the usage of the pruning workflow, but it makes pruning slightly faster and may result in a slightly different pruned model because of different kernels and numerics.
- Add Puzzletron - a new algorithm for heterogeneous pruning of LLM and VLM models. See `examples/puzzletron/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/puzzletron>`_ for more details.
- Added an iterator interface using CalibrationDataReader in the ONNX quantization workflow.
- Add N:M sparse softmax support to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
- Add skip-softmax skipping to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
9 changes: 9 additions & 0 deletions docs/source/conf.py
@@ -31,6 +31,7 @@
# import sys
# sys.path.insert(0, os.path.abspath('.'))

import contextlib
import os
import sys

@@ -44,6 +45,14 @@
sys.path.insert(0, os.path.abspath("../../"))
sys.path.append(os.path.abspath("./_ext"))

# Pre-import modelopt.torch so it is cached in sys.modules before Sphinx applies
# autodoc_mock_imports. Mocking triton/tensorrt_llm at the Sphinx level can break
# transitive imports (transformers, transformer_engine, …) and cause modelopt.torch
# to fail inside autosummary. Importing here — while the real packages are still on
# sys.path — avoids that problem entirely.
with contextlib.suppress(Exception):
import modelopt.torch # noqa: F401

# -- Project information -----------------------------------------------------

project = "Model Optimizer" # pylint: disable=C0103
16 changes: 16 additions & 0 deletions examples/llm_eval/README.md
@@ -40,6 +40,22 @@ accelerate launch --multi_gpu --num_processes <num_copies_of_your_model> \
--batch_size 4
```

### Heterogeneous Pruned Checkpoints (Puzzletron)

Heterogeneous pruned checkpoints produced by Puzzletron are automatically detected and loaded with the appropriate model patcher. No additional flags are needed beyond specifying the checkpoint path:

```sh
python lm_eval_hf.py --model hf \
--model_args pretrained=path/to/anymodel/checkpoint,dtype=bfloat16,parallelize=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 4
```

For a quick smoke test, add `--limit 10`.

> **Note:** Requires the `puzzletron` extra to be installed (`pip install -e ".[puzzletron]"`).

### Quantized (simulated)

- For simulated quantization with any of the default quantization formats:
53 changes: 51 additions & 2 deletions examples/llm_eval/lm_eval_hf.py
@@ -36,11 +36,19 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import contextlib
import warnings

import datasets
import lm_eval
from lm_eval import utils
from lm_eval.__main__ import cli_evaluate, parse_eval_args, setup_parser

if not lm_eval.__version__.startswith("0.4.8"):
warnings.warn(
f"lm_eval_hf.py is tested with lm-eval 0.4.8; found {lm_eval.__version__}. "
"Later versions may have incompatible API changes."
)
from lm_eval.api.model import T
from lm_eval.models.huggingface import HFLM
from quantization_utils import quantize_model
@@ -50,9 +58,29 @@
from modelopt.torch.quantization.utils import is_quantized
from modelopt.torch.sparsity.attention_sparsity.conversion import is_attn_sparsified

try:
import modelopt.torch.puzzletron as mtpz

_ANYMODEL_AVAILABLE = True
except ImportError:
_ANYMODEL_AVAILABLE = False


def _anymodel_patcher_context(pretrained, trust_remote_code=False):
"""Return a deci_x_patcher context if *pretrained* is a Puzzletron checkpoint, else a no-op."""
if not _ANYMODEL_AVAILABLE or not pretrained:
return contextlib.nullcontext()
try:
descriptor = mtpz.anymodel.resolve_descriptor_from_pretrained(
pretrained, trust_remote_code=trust_remote_code
)
except (ValueError, AttributeError):
return contextlib.nullcontext()
return mtpz.anymodel.deci_x_patcher(model_descriptor=descriptor)


def create_from_arg_obj(cls: type[T], arg_dict: dict, additional_config: dict | None = None) -> T:
"""Overrides the HFLM.create_from_arg_obj"""
"""Override HFLM.create_from_arg_obj to add quantization, sparsity, and Puzzletron support."""

quant_cfg = arg_dict.pop("quant_cfg", None)
auto_quantize_bits = arg_dict.pop("auto_quantize_bits", None)
@@ -72,7 +100,10 @@ def create_from_arg_obj(cls: type[T], arg_dict: dict, additional_config: dict |
# Enable automatic save/load of modelopt state huggingface checkpointing
mto.enable_huggingface_checkpointing()

model_obj = cls(**arg_dict, **additional_config)
with _anymodel_patcher_context(
arg_dict.get("pretrained"), arg_dict.get("trust_remote_code", False)
):
model_obj = cls(**arg_dict, **additional_config)
model_obj.tokenizer.padding_side = "left"
if is_quantized(model_obj.model):
# return if model is already quantized
@@ -109,10 +140,28 @@ def create_from_arg_obj(cls: type[T], arg_dict: dict, additional_config: dict |
return model_obj


def create_from_arg_string(
cls: type[T], arg_string: str, additional_config: dict | None = None
) -> T:
"""Override HFLM.create_from_arg_string to support Puzzletron checkpoints."""
args = utils.simple_parse_args_string(arg_string)
additional_config = {} if additional_config is None else additional_config
args2 = {k: v for k, v in additional_config.items() if v is not None}

mto.enable_huggingface_checkpointing()

with _anymodel_patcher_context(args.get("pretrained"), args.get("trust_remote_code", False)):
model_obj = cls(**args, **args2)

return model_obj


HFLM.create_from_arg_obj = classmethod(create_from_arg_obj)
HFLM.create_from_arg_string = classmethod(create_from_arg_string)


def setup_parser_with_modelopt_args():
"""Extend the lm-eval argument parser with ModelOpt quantization and sparsity options."""
parser = setup_parser()
parser.add_argument(
"--quant_cfg",
28 changes: 24 additions & 4 deletions examples/megatron_bridge/README.md
@@ -46,6 +46,9 @@ Note that the default dataset for pruning and quantization is [`nemotron-post-tr
hf auth login --token <your token>
```

> [!WARNING]
> Use `python -m pip` instead of `pip` to avoid conflicts with the system-wide installed packages in the NeMo containers.

## Pruning

This section shows how to prune a HuggingFace model using Minitron algorithm in Megatron-Bridge framework. Checkout other available pruning algorithms, supported frameworks and models, and general pruning getting-started in the [pruning README](../pruning/README.md).
@@ -92,7 +95,7 @@ This section shows how to distill a student model from a teacher model in the Me

This can be used stand-alone or after [Pruning](#pruning) / [Post-Training Quantization](#post-training-quantization) to recover accuracy of the model by distilling from the original model (teacher).

The [distill.py](distill.py) script loads student and teacher models from HuggingFace checkpoints and saves the distilled model to `<output_dir>/checkpoints` in Megatron distributed checkpoint format.
The [distill.py](distill.py) script supports both standard HuggingFace checkpoints and [Puzzletron AnyModel](../puzzletron/README.md) checkpoints as student/teacher inputs. Just pass the checkpoint path via `--student_hf_path` / `--teacher_hf_path`. The distilled model is saved to `<output_dir>/checkpoints` in Megatron distributed checkpoint format.

### Data Preparation

@@ -158,9 +161,22 @@

To run the distillation script on a Slurm cluster for multi-node training, you just need use `python` instead of `torchrun` and set the number of nodes using `#SBATCH --nodes=<num_nodes>` clause in your Slurm script.

### Convert Megatron checkpoint to Hugging Face format
### Converting to Hugging Face format (optional)

The distilled checkpoint is saved in Megatron distributed format. If you need a HuggingFace checkpoint, there are two ways to convert it:

**Inline** -- add `--hf_export_path` and `--student_hf_model` to the `distill.py` command to automatically convert the final checkpoint after distillation:

```bash
torchrun --nnodes 1 --nproc_per_node 8 distill.py \
... \
--hf_export_path /path/to/save/distilled_hf_ckpt \
--student_hf_model Qwen/Qwen3-4B
```

`--student_hf_model` should match the base architecture of the student (it is used as a template for export). For non-Puzzletron (i.e. standard) models, it should be the same as `--student_hf_path`.

To convert the Megatron checkpoint from last iteration (or any intermediate iteration) to Hugging Face format, you need the pruned model config (`--output_hf_path` from `prune_minitron.py` script) and the distilled megatron checkpoint dir (`<distill_output_dir>/checkpoints/iter_<iter_number>`) to run the following command:
**Separate conversion** -- convert any saved iteration using the Megatron-Bridge conversion script:

```bash
uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py export \
@@ -169,7 +185,11 @@ uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py ex
--hf-path <path_to_save_distilled_hf_ckpt>
```

For more details, you can refer to the checkpoint conversion scripts in the [Megatron-Bridge README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).
For more details, see the [Megatron-Bridge conversion README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).

### Distillation Results

See [results/puzzletron.md](results/puzzletron.md) for MMLU results demonstrating knowledge distillation on Puzzletron-compressed student models.

## Post-Training Quantization
