Merged
106 commits
6c038f9
Add modelopt/torch/_compress CODEOWNERS
kevalmorabia97 Oct 27, 2025
230cee1
Merge branch 'main' into feature/compress
kevalmorabia97 Oct 27, 2025
54c5f0f
Remove llm_ptq example tests from CICD
kevalmorabia97 Oct 27, 2025
9eeee25
E2E test for the experimental compress algorithm based on https://arx…
danielkorzekwa Oct 28, 2025
ad1d18e
Merge branch 'main' into feature/compress
kevalmorabia97 Oct 28, 2025
cef3655
Add convert_llama3_config_to_decilm_config + unit test (#465)
danielkorzekwa Oct 29, 2025
002b8b5
Implement nas.convert() api for the compress algorithm (#482)
danielkorzekwa Oct 31, 2025
1c12fd8
modelopt nas search() implementation for the compress algorithm (#490)
danielkorzekwa Nov 3, 2025
f7d547f
Add decilm modelling code (#505)
danielkorzekwa Nov 12, 2025
50a580c
Compress tutorial (PoC) (#492)
danielkorzekwa Nov 12, 2025
b121945
Add llama converter (no dependency on internal Nvidia code) - part 1/…
danielkorzekwa Nov 13, 2025
866e400
llama converter is self-contained now (no dependency on internal nvid…
danielkorzekwa Nov 14, 2025
0868f1c
Add integration test for attention pruning (#562)
danielkorzekwa Nov 14, 2025
69726cc
Merge branch 'main' into feature/compress
kevalmorabia97 Nov 15, 2025
07ca24d
Merge branch 'main' into feature/compress
kevalmorabia97 Nov 15, 2025
1dde209
Add score_pruning_activations (step 2/6) (#563)
danielkorzekwa Nov 18, 2025
2e559e7
Update README.md
kevalmorabia97 Nov 18, 2025
f10be0d
Add activation hooks used for pruning (#576)
danielkorzekwa Nov 20, 2025
194b532
Add sewing kit and utilities used for pruning scoring - pruning scori…
danielkorzekwa Nov 24, 2025
8c9cdd4
Add L2NormHook and use it in megatron.py (#599)
danielkorzekwa Nov 26, 2025
1f72466
Add pruning checkpoints for the compress algorithm (#607)
danielkorzekwa Nov 27, 2025
97fe7f0
Add build replacement library to the compress algorithm. (#616)
danielkorzekwa Dec 1, 2025
954103e
Add subblock stats to the compress algorithm (#623)
danielkorzekwa Dec 1, 2025
dcc425f
Add 1-block scoring to the compress algorithm (#625)
danielkorzekwa Dec 2, 2025
56d95de
Add checkpoint save/load to ForwardHook + add IterativeChannelContrib…
danielkorzekwa Dec 2, 2025
74aae83
Add MIP step to the compress algorithm (#627)
danielkorzekwa Dec 4, 2025
a1f63bc
Merge branch 'main' into feature/compress
kevalmorabia97 Dec 8, 2025
a99f503
Remove unused mip functions + fix multi-gpu test (#660)
kevalmorabia97 Dec 8, 2025
67489f4
Fix a bug in IterativeChannelContributionHook + tools for activation …
danielkorzekwa Dec 11, 2025
1d8bd20
Remove runtime.py and directly use torch dist utils + remove unused f…
kevalmorabia97 Dec 11, 2025
f7a0cb0
Use shared activation hooks component in the puzzle algorithm (#687)
danielkorzekwa Dec 17, 2025
db866d9
Clean up Puzzle Compress Tutorial (#711)
LianaMikael Dec 22, 2025
2e813bf
Two bug fixes: mix checkpointing and dtype (#718)
danielkorzekwa Dec 22, 2025
83ac3b1
Merge remote-tracking branch 'origin/main' into feature/compress
kevalmorabia97 Jan 13, 2026
0eecfc6
Fix test assertions for 2-gpu (#772)
kevalmorabia97 Jan 13, 2026
43b3cfa
Rename compress to puzzletron (#776)
kevalmorabia97 Jan 14, 2026
4c30bd5
Add NeMo Conversion Scripts to Puzzletron (#784)
LianaMikael Jan 15, 2026
96bb0ba
Merge branch 'main' into feature/compress
kevalmorabia97 Mar 3, 2026
8c84fee
[CI] Update to only run puzzletron tests
kevalmorabia97 Mar 3, 2026
5812777
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Mar 3, 2026
5f77c81
Pin torchprofile==0.0.4 to fix CI
kevalmorabia97 Mar 10, 2026
82df595
Add anymodel-core to feature/puzzletron (#974)
danielkorzekwa Mar 11, 2026
4dc9932
Draft: anymodel activation scoring (#989)
danielkorzekwa Mar 12, 2026
d358eb3
Draft: Merge anymodel pruning (#990)
danielkorzekwa Mar 12, 2026
8e827f3
Draft: Merging anymodel:build_library_and_stats (#993)
danielkorzekwa Mar 12, 2026
eb4b210
Draft: merge any model calc one block scores (#994)
danielkorzekwa Mar 12, 2026
8fe318d
Draft: merge any_model: mip_and_realize_models (#995)
danielkorzekwa Mar 13, 2026
2fbdf0e
Update uv.lock for nspect puzzletron scanning
kevalmorabia97 Mar 13, 2026
1b42f0b
Dkorzekwa/any model other models (#1007)
danielkorzekwa Mar 17, 2026
67999eb
Dkorzekwa/anymodel gptoss (#1020)
danielkorzekwa Mar 17, 2026
660dc17
Merge any_model tutorial (#1035)
danielkorzekwa Mar 19, 2026
01cba6a
Merge mbridge distillation for any_model (#1036)
danielkorzekwa Mar 20, 2026
2b6572c
MR branch for the remaining difference between dkorzekwa/any_model an…
danielkorzekwa Mar 20, 2026
110316a
Dkorzekwa/decilm hf code cleanup (#1071)
danielkorzekwa Mar 23, 2026
4190275
Dkorzekwa/decilm hf code cleanup 2 (#1073)
danielkorzekwa Mar 23, 2026
0708ca2
Dkorzekwa/anymodel subblock stats (#1085)
danielkorzekwa Mar 24, 2026
3193f30
Dkorzekwa/anymodel subblock stats nodecilm (#1102)
danielkorzekwa Mar 24, 2026
928036e
Dkorzekwa/decilm cleanup post subblockstats (#1103)
danielkorzekwa Mar 24, 2026
e508b76
code clean up (#1110)
danielkorzekwa Mar 24, 2026
f460d16
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Mar 25, 2026
2f55c73
Dkorzekwa/puzzletron use importance hooks from prune (#1115)
danielkorzekwa Mar 25, 2026
c5ec50b
Merge remote-tracking branch 'origin/main' into feature/puzzletron
kevalmorabia97 Mar 25, 2026
d257871
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Mar 30, 2026
7e15fdd
Revert CICD and other config changes
kevalmorabia97 Mar 30, 2026
d0209dc
Make Qwen and QwenVL descriptor generic so can be used for other vari…
kevalmorabia97 Mar 25, 2026
d987bad
Set strict=True in distill_hf export
kevalmorabia97 Mar 30, 2026
75651cc
add basic ruff fixes
kevalmorabia97 Mar 25, 2026
03118ce
Apply coderabbit suggestions
kevalmorabia97 Mar 30, 2026
2a170b9
Set weights_only=True in checkpoint_utils.py
kevalmorabia97 Mar 30, 2026
d6f8ddb
More fixes
kevalmorabia97 Mar 30, 2026
4621b65
reuse puzzletron tokenizer in other tests
kevalmorabia97 Mar 30, 2026
be4bd3a
disable puzzletron in coverage check as its covered in gpu tests only
kevalmorabia97 Mar 30, 2026
45426ca
Remove custom DistillationProvider and simplify mbridge distillation …
kevalmorabia97 Apr 1, 2026
5429d86
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Apr 1, 2026
41b8ca7
fix test
kevalmorabia97 Apr 2, 2026
33b9230
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Apr 7, 2026
25266b8
fix hydra config dtype resolution in puzzletron validation tools (#1202)
j-rausch Apr 8, 2026
fd5694d
Consolidate lm-eval scripts: merge AnyModel auto-detection into lm_ev…
j-rausch Apr 9, 2026
dedcad0
Merge remote-tracking branch 'origin/main' into feature/puzzletron
kevalmorabia97 Apr 10, 2026
d0cdbfd
minor cleanup
kevalmorabia97 Apr 10, 2026
ee01ace
Move block_config out of deci_lm_hf_code folder
kevalmorabia97 Apr 10, 2026
7ce3332
Fix critical bugs flagged by codeRabbit in PR #1121
kevalmorabia97 Apr 10, 2026
c7700a9
Fix critical and major bugs flagged by codeRabbit in PR #1121
kevalmorabia97 Apr 10, 2026
9f3cc2d
Fix minor bugs flagged by codeRabbit in PR #1121
kevalmorabia97 Apr 10, 2026
66fccd2
fix decoder_layer_cls failure on trust_remote_code models (#1222)
j-rausch Apr 10, 2026
7053c61
fix puzzletron container test path; add NeMo setup docs (#1231)
j-rausch Apr 10, 2026
05c6d3b
Add MoE/Nemotron fixes to support Transformers 5.5
kevalmorabia97 Apr 10, 2026
0d6eb7e
Update changelog
kevalmorabia97 Apr 10, 2026
ac8397b
Refactor puzzletron imports: relative imports, public API, logger fix
kevalmorabia97 Apr 10, 2026
977d60a
Merge remote-tracking branch 'origin/main' into feature/puzzletron
kevalmorabia97 Apr 10, 2026
01a4e55
Fix custom tiny tokenizer
kevalmorabia97 Apr 11, 2026
f361ca6
Fix `RuntimeError: pidfd_getfd: Operation not permitted`
kevalmorabia97 Apr 11, 2026
a7eedf8
Add __all__ for modules
kevalmorabia97 Apr 11, 2026
ed5fd68
Fix test_puzzletron assertions for transformers v5.5
kevalmorabia97 Apr 11, 2026
547e76d
Fix doc building
kevalmorabia97 Apr 11, 2026
62070ae
Fix Qwen2.5 test assertion as per CI machine
kevalmorabia97 Apr 13, 2026
38d9522
Address coderabbit comments
kevalmorabia97 Apr 13, 2026
6395b1e
copy custom modeling files to pruned checkpoint dirs (#1245)
j-rausch Apr 13, 2026
d88dfcb
consolidate mbridge distillation: merge distill_hf.py into distill.py…
j-rausch Apr 13, 2026
06eaf74
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Apr 13, 2026
3f41819
Address minor coderabbit comments
kevalmorabia97 Apr 13, 2026
ad8cf9a
fix lm-eval version conflict in puzzletron requirements (#1257)
j-rausch Apr 15, 2026
5e4c43e
fix hybrid model subblock param counting: all FFN sizes reported iden…
j-rausch Apr 15, 2026
2345af7
Merge branch 'main' into feature/puzzletron
kevalmorabia97 Apr 15, 2026
e0bb89d
Fix test
kevalmorabia97 Apr 15, 2026
47a612e
Fix test path
kevalmorabia97 Apr 15, 2026
2 changes: 2 additions & 0 deletions .github/CODEOWNERS
@@ -24,6 +24,7 @@ modelopt/torch/nas @NVIDIA/modelopt-torch-nas-prune-codeowners
modelopt/torch/opt @NVIDIA/modelopt-torch-opt-codeowners
modelopt/torch/peft @NVIDIA/modelopt-torch-peft-codeowners
modelopt/torch/prune @NVIDIA/modelopt-torch-nas-prune-codeowners
modelopt/torch/puzzletron @NVIDIA/modelopt-torch-puzzletron-codeowners
modelopt/torch/quantization @NVIDIA/modelopt-torch-quantization-codeowners
modelopt/torch/sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
modelopt/torch/speculative @NVIDIA/modelopt-torch-speculative-codeowners
@@ -49,6 +50,7 @@ modelopt_recipes @NVIDIA/modelopt-recipes-codeowners
/examples/model_hub @NVIDIA/modelopt-examples-model_hub-codeowners
/examples/onnx_ptq @NVIDIA/modelopt-onnx-codeowners
/examples/pruning @NVIDIA/modelopt-torch-nas-prune-codeowners
/examples/puzzletron @NVIDIA/modelopt-torch-puzzletron-codeowners
/examples/specdec_bench @NVIDIA/modelopt-torch-speculative-codeowners
/examples/speculative_decoding @NVIDIA/modelopt-torch-speculative-codeowners
/examples/torch_onnx @NVIDIA/modelopt-onnx-codeowners
3 changes: 2 additions & 1 deletion .github/workflows/_example_tests_runner.yml
@@ -48,6 +48,7 @@ jobs:
- name: Install dependencies
run: |
# use `python -m pip` instead of `pip` to avoid conflicts with system pip for nemo containers
pip uninstall -y nvidia-modelopt
python -m pip install ".${{ inputs.pip_install_extras }}"

if [[ "${{ inputs.example }}" == *"diffusers"* ]]; then
@@ -64,7 +65,7 @@
COVERAGE_FILE: ${{ github.workspace }}/.coverage
run: |
echo "Running tests for: ${{ inputs.example }}"
pytest tests/examples/${{ inputs.example }} --cov
python -m pytest tests/examples/${{ inputs.example }} --cov
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v5
with:
6 changes: 3 additions & 3 deletions .github/workflows/example_tests.yml
@@ -125,14 +125,14 @@ jobs:
strategy: &nemo_strategy
fail-fast: false
matrix:
example: [megatron_bridge]
example: [megatron_bridge, puzzletron]
uses: ./.github/workflows/_example_tests_runner.yml
secrets: inherit
with:
docker_image: "nvcr.io/nvidia/nemo:26.02"
example: ${{ matrix.example }}
timeout_minutes: 30
pip_install_extras: "[hf,dev-test]"
pip_install_extras: "[hf,puzzletron,dev-test]"
runner: linux-amd64-gpu-rtxpro6000-latest-1

nemo-non-pr:
@@ -144,7 +144,7 @@
docker_image: "nvcr.io/nvidia/nemo:26.02"
example: ${{ matrix.example }}
timeout_minutes: 30
pip_install_extras: "[hf,dev-test]"
pip_install_extras: "[hf,puzzletron,dev-test]"
runner: linux-amd64-gpu-rtxpro6000-latest-2

##### ONNX/TensorRT Example Tests #####
2 changes: 1 addition & 1 deletion .github/workflows/gpu_tests.yml
@@ -63,7 +63,7 @@ jobs:
matrix:
include:
- example: gpu
timeout: 45
timeout: 60
container_image: pytorch:26.01-py3
# tests/gpu/_extensions/test_onnx_extensions.py fails for newer containers until https://github.com/tbenthompson/cppimport/pull/98
- example: gpu-megatron
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -88,6 +88,7 @@ repos:
modelopt/onnx/quantization/ort_patching.py|
modelopt/torch/_deploy/utils/onnx_utils.py|
modelopt/torch/export/transformer_engine.py|
modelopt/torch/puzzletron/anymodel/models/gpt_oss/gpt_oss_pruned_to_mxfp4.py|
modelopt/torch/quantization/export_onnx.py|
modelopt/torch/quantization/plugins/attention.py|
modelopt/torch/sparsity/attention_sparsity/methods/vsa_utils.py|
1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -7,6 +7,7 @@ Changelog
**New Features**

- Support full Transformer Engine spec for Minitron pruning (``mcore_minitron``). Now we no longer need to use custom ModelOpt spec. Note that this does not affect the usage of the pruning workflow but makes pruning slightly faster and may result in slightly different pruned model because of different kernel and numerics.
- Add Puzzletron - a new algorithm for heterogeneous pruning of LLM and VLM models. See `examples/puzzletron/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/puzzletron>`_ for more details.
- Added iterator interface using CalibrationDataReader in ONNX quantization workflow.
- Add N:M sparse softmax support to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
- Add skip-softmax skipping to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
9 changes: 9 additions & 0 deletions docs/source/conf.py
@@ -31,6 +31,7 @@
# import sys
# sys.path.insert(0, os.path.abspath('.'))

import contextlib
import os
import sys

@@ -44,6 +45,14 @@
sys.path.insert(0, os.path.abspath("../../"))
sys.path.append(os.path.abspath("./_ext"))

# Pre-import modelopt.torch so it is cached in sys.modules before Sphinx applies
# autodoc_mock_imports. Mocking triton/tensorrt_llm at the Sphinx level can break
# transitive imports (transformers, transformer_engine, …) and cause modelopt.torch
# to fail inside autosummary. Importing here — while the real packages are still on
# sys.path — avoids that problem entirely.
with contextlib.suppress(Exception):
import modelopt.torch # noqa: F401

# -- Project information -----------------------------------------------------

project = "Model Optimizer" # pylint: disable=C0103
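The guarded import added in `conf.py` above relies on `contextlib.suppress` swallowing any import failure so the docs build continues either way. A minimal, self-contained sketch of that pattern (the package name below is hypothetical and deliberately not installed):

```python
import contextlib
import sys

# Attempt an optional import; any failure (missing package, broken transitive
# imports) is swallowed so execution can continue regardless.
with contextlib.suppress(Exception):
    import nonexistent_package_xyz  # noqa: F401  # hypothetical, not installed

# Execution reaches here either way; a failed import leaves no sys.modules entry.
print("nonexistent_package_xyz" in sys.modules)  # → False
```

If the import succeeds, the module stays cached in `sys.modules`, which is exactly the behavior the `conf.py` comment depends on for the later autodoc mocking.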
16 changes: 16 additions & 0 deletions examples/llm_eval/README.md
@@ -40,6 +40,22 @@ accelerate launch --multi_gpu --num_processes <num_copies_of_your_model> \
--batch_size 4
```

### Heterogeneous Pruned Checkpoints (Puzzletron)

Heterogeneous pruned checkpoints produced by Puzzletron are automatically detected and loaded with the appropriate model patcher. No additional flags are needed beyond specifying the checkpoint path:

```sh
python lm_eval_hf.py --model hf \
--model_args pretrained=path/to/anymodel/checkpoint,dtype=bfloat16,parallelize=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 4
```

For a quick smoke test, add `--limit 10`.

> **Note:** Requires the `puzzletron` extra to be installed (`pip install -e ".[puzzletron]"`).

### Quantized (simulated)

- For simulated quantization with any of the default quantization formats:
46 changes: 44 additions & 2 deletions examples/llm_eval/lm_eval_hf.py
@@ -36,6 +36,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import contextlib
import warnings

import datasets
@@ -50,9 +51,29 @@
from modelopt.torch.quantization.utils import is_quantized
from modelopt.torch.sparsity.attention_sparsity.conversion import is_attn_sparsified

try:
import modelopt.torch.puzzletron as mtpz

_ANYMODEL_AVAILABLE = True
except ImportError:
_ANYMODEL_AVAILABLE = False


def _anymodel_patcher_context(pretrained, trust_remote_code=False):
"""Return a deci_x_patcher context if *pretrained* is a Puzzletron checkpoint, else a no-op."""
if not _ANYMODEL_AVAILABLE or not pretrained:
return contextlib.nullcontext()
try:
descriptor = mtpz.anymodel.resolve_descriptor_from_pretrained(
pretrained, trust_remote_code=trust_remote_code
)
except (ValueError, AttributeError):
return contextlib.nullcontext()
return mtpz.anymodel.deci_x_patcher(model_descriptor=descriptor)


def create_from_arg_obj(cls: type[T], arg_dict: dict, additional_config: dict | None = None) -> T:
"""Overrides the HFLM.create_from_arg_obj"""
"""Override HFLM.create_from_arg_obj to add quantization, sparsity, and Puzzletron support."""

quant_cfg = arg_dict.pop("quant_cfg", None)
auto_quantize_bits = arg_dict.pop("auto_quantize_bits", None)
@@ -72,7 +93,10 @@ def create_from_arg_obj(cls: type[T], arg_dict: dict, additional_config: dict |
# Enable automatic save/load of modelopt state huggingface checkpointing
mto.enable_huggingface_checkpointing()

model_obj = cls(**arg_dict, **additional_config)
with _anymodel_patcher_context(
arg_dict.get("pretrained"), arg_dict.get("trust_remote_code", False)
):
model_obj = cls(**arg_dict, **additional_config)
model_obj.tokenizer.padding_side = "left"
if is_quantized(model_obj.model):
# return if model is already quantized
@@ -109,10 +133,28 @@ def create_from_arg_obj(cls: type[T], arg_dict: dict, additional_config: dict |
return model_obj


def create_from_arg_string(
cls: type[T], arg_string: str, additional_config: dict | None = None
) -> T:
"""Override HFLM.create_from_arg_string to support Puzzletron checkpoints."""
args = utils.simple_parse_args_string(arg_string)
additional_config = {} if additional_config is None else additional_config
args2 = {k: v for k, v in additional_config.items() if v is not None}

mto.enable_huggingface_checkpointing()

with _anymodel_patcher_context(args.get("pretrained"), args.get("trust_remote_code", False)):
model_obj = cls(**args, **args2)

return model_obj


HFLM.create_from_arg_obj = classmethod(create_from_arg_obj)
HFLM.create_from_arg_string = classmethod(create_from_arg_string)


def setup_parser_with_modelopt_args():
"""Extend the lm-eval argument parser with ModelOpt quantization and sparsity options."""
parser = setup_parser()
parser.add_argument(
"--quant_cfg",
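The `_anymodel_patcher_context` helper in the diff above uses a common fallback idiom: the call site always enters a `with` block, and `contextlib.nullcontext()` keeps the unpatched path branch-free. A hedged, standalone sketch of that idiom (names here are illustrative, not the real Puzzletron API):

```python
import contextlib

events = []

@contextlib.contextmanager
def fake_patcher():
    # Stand-in for deci_x_patcher: record when patching is active.
    events.append("patch")
    try:
        yield
    finally:
        events.append("unpatch")

def maybe_patcher(is_puzzletron_checkpoint: bool):
    """Return the real patcher context when applicable, else a no-op."""
    if not is_puzzletron_checkpoint:
        return contextlib.nullcontext()
    return fake_patcher()

# The call site is identical for both cases.
with maybe_patcher(False):
    events.append("load plain model")
with maybe_patcher(True):
    events.append("load patched model")

print(events)  # → ['load plain model', 'patch', 'load patched model', 'unpatch']
```

Returning `nullcontext()` instead of `None` means callers never need an `if` around model construction, which is why the diff can wrap `cls(**arg_dict, **additional_config)` unconditionally.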
1 change: 1 addition & 0 deletions examples/pruning/README.md
@@ -7,6 +7,7 @@ Pruning can involve removal (prune) of Linear and Conv layers; and Transformer a
This section focuses on applying Model Optimizer's state-of-the-art complementary pruning modes to enable you to search for the best subnet architecture from your provided base model:

1. [Minitron](https://arxiv.org/pdf/2408.11796): A pruning method developed by NVIDIA Research for pruning GPT (and later extended to Mamba, MoE, and Hybrid Transformer Mamba) models in NVIDIA Megatron-LM (M-LM) or Megatron-Bridge (M-Bridge) framework. It uses the activation magnitudes to prune the embedding hidden size; mlp ffn hidden size; transformer attention heads; mamba heads and head dimension; MoE number of experts, ffn hidden size, and shared expert intermediate size; and number of layers of the model.
1. [Puzzletron](../puzzletron/README.md): An advanced pruning method by NVIDIA using Mixed Integer Programming (MIP) based NAS search algorithm.
1. FastNAS: A pruning method recommended for Computer Vision models. Given a pretrained model, FastNAS finds the subnet which maximizes the score function while meeting the given constraints.
1. GradNAS: A light-weight pruning method recommended for language models like Hugging Face BERT, GPT-J. It uses the gradient information to prune the model's linear layers and attention heads to meet the given constraints.

14 changes: 14 additions & 0 deletions examples/puzzletron/GPTOSS.md
@@ -0,0 +1,14 @@

## GptOss

In this release, the Puzzle algorithm supports only expert removal for `Gpt-Oss`.

This model ships as a quantized checkpoint, i.e. the MoE expert matrices are stored in the _MXFP4_ format.
During pruning, Puzzle uses the decompressed model (converted back to BF16) to compute statistics and scores, so conversion to the Puzzle format decompresses the model and stores it in BF16.
Once pruning is finished, i.e. the experts to remove have been identified, you may want to restore the _MXFP4_ format of the checkpoint.
For this, an additional script takes the original and pruned checkpoints and writes the pruned checkpoint back in _MXFP4_ format.

```bash
python -m modelopt.torch.puzzletron.anymodel.models.gpt_oss.gpt_oss_pruned_to_mxfp4 --student-path /workspaces/any_model_gpt_oss/mip/puzzle_solutions/stats_num_params_18014757184/solutions--checkpoints/solution_0/ --original-path /workspaces/source_model_checkpoints/openai_gpt-oss-20b/ --output-path /workspaces/any_model_gpt_oss/mip/puzzle_solutions/stats_num_params_18014757184/solutions--checkpoints/mxfp4-ckpt/ --num-layers 24
```