Commit 999c999

Add Nemotron-3-Nano-30B-A3B-BF16 e2e tutorial: Prune + Distill + Quantize + Nemo Evaluator + vLLM deployment (#1376)
### What does this PR do?

Type of change: example/tutorial

Add Nemotron-3-Nano-30B-A3B-BF16 e2e tutorial: Prune + Distill + Quantize + Nemo Evaluator + vLLM deployment

<img width="2079" height="1613" alt="image" src="https://github.com/user-attachments/assets/19b6ab82-7f01-45df-a0a5-d1c3282b384a" />

## Summary by CodeRabbit

* **New Features**
  * End-to-end Nemotron-3-Nano-30B tutorial: pruning, two‑phase distillation, FP8 PTQ, evaluation, and vLLM deployment; new ablations and long‑context analyses.
  * Distillation CLI: configurable seed and activation‑recomputation options.
* **Bug Fixes**
  * Preprocessing hardened to skip malformed JSONL and normalize tool‑call argument formats.
* **Documentation**
  * Many README/examples/evaluator docs updated (news list, tokenization guides, tutorials, configs, and deployment notes).
* **Tests**
  * Added test verifying preprocessing handles stringified tool‑call arguments.

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
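The preprocessing hardening summarized above (skipping malformed JSONL and handling stringified tool-call arguments) can be illustrated with a minimal sketch. This is not the code from the commit: the function names `iter_jsonl_records` and `normalize_tool_call_arguments` are hypothetical, the OpenAI-style `tool_calls[].function.arguments` layout is an assumption, and so is the direction of normalization (JSON string to dict).

```python
import json


def iter_jsonl_records(lines):
    """Yield parsed JSON objects, skipping blank or malformed lines (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line: skip instead of aborting the whole run
        if isinstance(record, dict):
            yield record


def normalize_tool_call_arguments(message):
    """Return a copy of `message` whose tool-call arguments are parsed objects.

    Assumes an OpenAI-style layout: message["tool_calls"][i]["function"]["arguments"].
    Arguments that are strings but not valid JSON are left untouched.
    """
    normalized = dict(message)
    calls = []
    for call in message.get("tool_calls") or []:
        call = dict(call)
        function = dict(call.get("function") or {})
        arguments = function.get("arguments")
        if isinstance(arguments, str):
            try:
                function["arguments"] = json.loads(arguments)
            except json.JSONDecodeError:
                pass
        call["function"] = function
        calls.append(call)
    if calls:
        normalized["tool_calls"] = calls
    return normalized
```

Both helpers return new objects rather than mutating their inputs, which keeps the sketch safe to call on records that are reused elsewhere.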
1 parent 5d0441a commit 999c999

17 files changed

Lines changed: 1280 additions & 93 deletions

CHANGELOG.rst

Lines changed: 1 addition & 0 deletions
@@ -28,6 +28,7 @@ Changelog
 - Add composable ``$import`` system for recipe YAML configs, enabling reusable config snippets referenced via ``{$import: name}`` markers. All built-in PTQ recipes converted to use imports with shared snippets under ``modelopt_recipes/configs/`` (numeric formats, quant_cfg building blocks, presets). See :ref:`composable-imports`.
 - Add offline DFlash speculative decoding training. Train the draft module from pre-computed base-model hidden states dumped by ``examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py``; base-model transformer layers are deleted after conversion to save memory. Controlled by the auto-derived ``dflash_offline`` flag on ``DFlashConfig`` (derived from ``data_args.offline_data_path``). The dump scripts now share ``collect_hidden_states/common.py`` for aux-layer selection (``--aux-layers eagle|dflash|<list>``) and optional assistant-token ``loss_mask`` for answer-only-loss training.
 - Add support for ``active_params`` (for MoE models) and ``memory_mb`` constraints in Minitron pruning on top of existing ``params`` constraint. You can also provide multiple constraints. See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>`_ for more details. The underlying utility functions ``mcore_param_count``, ``mcore_memory_footprint_mb``, and ``print_mcore_model_stats`` in ``modelopt.torch.nas.plugins.megatron_model_stats`` are also available for standalone use to compute parameter counts and memory footprints (weights + KV-cache + Mamba state) for any Megatron-Core model.
+- Add end-to-end tutorial for Minitron pruning + two-phase distillation (80B @ 8K + 20B @ 32K long-context = 100B tokens) + FP8 PTQ + vLLM deployment for Nemotron-3-Nano-30B-A3B-BF16 (MoE + Mamba-Transformer hybrid) → Pruned 22B/A3.0B active params, along with data blend preparation steps (with tool-calling data) and detailed pruning / data-blend / long-context ablations. See `examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/>`_ for details.
 - Add ``--cast_mxfp4_to_nvfp4`` flag to ``examples/llm_ptq/hf_ptq.py`` for closed-form, bit-exact MXFP4 → NVFP4 weight conversion. Supports the GPT-OSS family (``openai/gpt-oss-20b``, ``openai/gpt-oss-120b``). See `examples/llm_ptq/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq#mxfp4--nvfp4-cast-for-gpt-oss>`__ for usage.
 - DeepSeek PTQ (``examples/deepseek/ptq.py``) now defaults to native top-k calibration with post-hoc per-layer peer-max sync of expert ``input_quantizer.amax``; the all-experts path is preserved behind ``--calib_all_experts``.
 - Add NVFP4 W4A16 weight-only quantization (``w4a16_nvfp4``): FP4 weights with group_size=16, BF16 activations, no calibration forward pass required. Use ``mtq.W4A16_NVFP4_CFG`` or ``--qformat w4a16_nvfp4`` in ``hf_ptq.py``. vLLM deployment support is in progress.
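The two-phase distillation budget in the new changelog entry (80B tokens at 8K plus 20B tokens at 32K) can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes "8K" and "32K" mean 8192 and 32768 tokens per sequence, and the helper name `full_sequences` is hypothetical.

```python
# Phase definitions from the changelog entry. ASSUMPTION: "8K" and "32K" are
# taken to mean 8192 and 32768 tokens per training sequence.
PHASES = [
    ("phase 1 (8K)", 80_000_000_000, 8 * 1024),
    ("phase 2 (32K)", 20_000_000_000, 32 * 1024),
]


def full_sequences(token_budget: int, seq_length: int) -> int:
    """Number of complete sequences of `seq_length` tokens that fit in the budget."""
    return token_budget // seq_length


total_tokens = sum(tokens for _, tokens, _ in PHASES)

for name, tokens, seq_length in PHASES:
    count = full_sequences(tokens, seq_length)
    print(f"{name}: {count:,} sequences of {seq_length} tokens")
print(f"total: {total_tokens // 10**9}B tokens")
```

Under these assumptions, the long-context phase consumes a fifth of the tokens but far fewer sequences, since each sequence is four times longer.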

README.md

Lines changed: 4 additions & 6 deletions
@@ -26,12 +26,10 @@ Model Optimizer is also integrated with [NVIDIA Megatron-Bridge](https://github.

 ## Latest News

-- [2026/05/13] **Pruning & NAS News**
-  - [**Puzzletron**](./examples/puzzletron): A new algorithm for heterogeneous pruning & NAS of LLM and VLM models.
-  - [**End-to-end Minitron workflow**](./examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2): Pruning + distillation + quantization + evaluation + vLLM deployment for Nemotron-Nano-9B-v2 → pruned 7B, including data blend preparation and an ablation study.
-- Latest customer stories on compression:
-  - [Bielik.AI showcases an open European sovereign AI model at NVIDIA GTC](https://bielik.ai/en/nvidia-gtc-bielik-minitron-premiere/)
-  - [Domyn-Large: The journey of a European sovereign AI model for regulated industries](https://www.domyn.com/blog/domyn-large-the-journey-of-a-european-sovereign-ai-model-for-regulated-industries)
+- [2026/05/27] [**End-to-end Minitron workflow for Nemotron-3-Nano-30B-A3B**](./examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16): Pruning + two-phase distillation + FP8 quantization achieving 1.64× vLLM throughput and 2.6× memory reduction.
+- [2026/05/13] [**Puzzletron**](./examples/puzzletron): A new algorithm for heterogeneous pruning & NAS of LLM and VLM models.
+- [2026/04/15] Customer story: [Domyn compresses Colosseum-355B → 260B using ModelOpt's Minitron pruning + distillation](https://www.domyn.com/blog/domyn-large-the-journey-of-a-european-sovereign-ai-model-for-regulated-industries)
+- [2026/03/17] Customer story: [Bielik.AI builds Bielik Minitron 7B (33% smaller, 50% faster, 90% quality retained) using ModelOpt's Minitron pruning + distillation](https://bielik.ai/en/nvidia-gtc-bielik-minitron-premiere/)
 - [2026/03/11] Model Optimizer quantized Nemotron-3-Super checkpoints are available on Hugging Face for download: [FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8), [NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4). Learn more in the [Nemotron 3 Super release blog](https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/). Check out how to quantize Nemotron 3 models for deployment acceleration [here](./examples/llm_ptq/README.md)
 - [2026/03/11] [NeMo Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) now supports Nemotron-3-Super quantization (PTQ and QAT) and export workflows using the Model Optimizer library. See the [Quantization (PTQ and QAT) guide](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/super-v3/docs/models/llm/nemotron3-super.md#quantization-ptq-and-qat) for FP8/NVFP4 quantization and HF export instructions.
 - [2025/12/11] [BLOG: Top 5 AI Model Optimization Techniques for Faster, Smarter Inference](https://developer.nvidia.com/blog/top-5-ai-model-optimization-techniques-for-faster-smarter-inference/)

examples/dataset/MEGATRON_DATA_PREP.md

Lines changed: 46 additions & 3 deletions
@@ -97,8 +97,8 @@ Tokenization commands for all Nemotron Pre-Training and Post-Training datasets u
 Two parameters vary by model — set them before running the commands below:

 ```bash
-TOKENIZER=nvidia/NVIDIA-Nemotron-Nano-9B-v2 # HuggingFace tokenizer (or local path)
-OUTPUT_DIR=tokenized_nemotron_v2 # Output directory for tokenized files
+TOKENIZER=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 # HuggingFace tokenizer (or local path)
+OUTPUT_DIR=tokenized_nemotron_3 # Output directory for tokenized files
 ```

 > [!TIP]
@@ -154,13 +154,14 @@ python -m modelopt.torch.utils.plugins.megatron_preprocess_data \

 Datasets below are from the [Nemotron Post-Training v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3). All use `--reasoning_content inline` to preserve `<think>…</think>` traces. The collection contains many more datasets — if you care about benchmarks not covered here (e.g. multilingual, agentic/tool use, SWE, safety), pick the relevant datasets from the collection and tokenize them the same way.

-**[nvidia/Nemotron-Math-v2](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2)** — tokenize `high_part00` and `high_part01` separately:
+**[nvidia/Nemotron-Math-v2](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2)** — tokenize `high_part00` and `high_part01` separately. `--hf_streaming` is required because the messages contain extra fields (e.g. `tool_calls`) that cause Arrow type-cast errors in non-streaming mode when using tokenizers with complex chat templates (such as Nemotron v3):

 ```bash
 for SPLIT in high_part00 high_part01; do
     python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
         --hf_dataset nvidia/Nemotron-Math-v2 \
         --hf_split ${SPLIT} \
+        --hf_streaming \
         --json_keys messages \
         --tokenizer ${TOKENIZER} \
         --output_dir ${OUTPUT_DIR} \
@@ -170,6 +171,26 @@ for SPLIT in high_part00 high_part01; do
 done
 ```

+**[nvidia/Nemotron-SFT-Math-v3](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Math-v3)** — stored as raw JSONL on HuggingFace, download before tokenizing (more reliable than streaming for this dataset due to complex nested `tool_calls` fields):
+
+```bash
+hf download nvidia/Nemotron-SFT-Math-v3 \
+    --repo-type dataset \
+    --local-dir datasets/Nemotron-SFT-Math-v3/
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+    --jsonl_paths datasets/Nemotron-SFT-Math-v3/data/train.jsonl \
+    --json_keys messages \
+    --tokenizer ${TOKENIZER} \
+    --output_dir ${OUTPUT_DIR} \
+    --workers 96 \
+    --max_sequence_length 256_000 \
+    --reasoning_content inline
+
+# Rename to avoid generic file name
+mv ${OUTPUT_DIR}/train_messages.bin ${OUTPUT_DIR}/nvidia--Nemotron-SFT-Math-v3_default_train_messages.bin
+mv ${OUTPUT_DIR}/train_messages.idx ${OUTPUT_DIR}/nvidia--Nemotron-SFT-Math-v3_default_train_messages.idx
+```
+
 **[nvidia/Nemotron-SFT-Competitive-Programming-v2](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Competitive-Programming-v2)** — stored as raw JSONL on HuggingFace, download before tokenizing:

 ```bash
@@ -220,6 +241,26 @@ python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
     --reasoning_content inline
 ```

+**[nvidia/Nemotron-Agentic-v1](https://huggingface.co/datasets/nvidia/Nemotron-Agentic-v1)** `tool_calling.jsonl` (316K samples). Stored as raw JSONL on HuggingFace, download before tokenizing (more reliable than streaming for this dataset due to complex nested `tool_calls` / `tools` fields):
+
+```bash
+hf download nvidia/Nemotron-Agentic-v1 \
+    --repo-type dataset \
+    --local-dir datasets/Nemotron-Agentic-v1/
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+    --jsonl_paths datasets/Nemotron-Agentic-v1/data/tool_calling.jsonl \
+    --json_keys messages \
+    --tokenizer ${TOKENIZER} \
+    --output_dir ${OUTPUT_DIR} \
+    --workers 96 \
+    --max_sequence_length 256_000 \
+    --reasoning_content inline
+
+# Rename to avoid collision with potential future Nemotron-SFT-Agentic-v2 / tool_calling
+mv ${OUTPUT_DIR}/tool_calling_messages.bin ${OUTPUT_DIR}/nvidia--Nemotron-Agentic-v1_tool_calling_messages.bin
+mv ${OUTPUT_DIR}/tool_calling_messages.idx ${OUTPUT_DIR}/nvidia--Nemotron-Agentic-v1_tool_calling_messages.idx
+```
+
 ---

 ### Expected output
@@ -233,10 +274,12 @@ nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-MATH_train_text_max10000000.{bi
 nvidia--Nemotron-Post-Training-Dataset-v1_default_stem_messages_max5000000.{bin,idx}
 nvidia--Nemotron-Math-v2_default_high_part00_messages.{bin,idx}
 nvidia--Nemotron-Math-v2_default_high_part01_messages.{bin,idx}
+nvidia--Nemotron-SFT-Math-v3_default_train_messages.{bin,idx}
 competitive_programming_python_00_messages.{bin,idx}
 competitive_programming_cpp_00_messages.{bin,idx}
 MCQ_messages.{bin,idx}
 RQA_messages.{bin,idx}
 reasoning_off_messages.{bin,idx}
 reasoning_on_messages.{bin,idx}
+nvidia--Nemotron-Agentic-v1_tool_calling_messages.{bin,idx}
 ```
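After the tokenization and rename steps in this file, one way to confirm that both halves of each `.bin`/`.idx` pair exist is a small check like the sketch below. It is only an illustration: the helper name `missing_outputs` is hypothetical, and the prefix list covers just the two files added in this commit.

```python
# Prefixes of the two outputs added in this commit, as listed under "Expected output".
EXPECTED_PREFIXES = [
    "nvidia--Nemotron-SFT-Math-v3_default_train_messages",
    "nvidia--Nemotron-Agentic-v1_tool_calling_messages",
]


def missing_outputs(filenames, prefixes=EXPECTED_PREFIXES):
    """Return the expected .bin/.idx file names that are absent from `filenames`."""
    present = set(filenames)
    return [
        f"{prefix}.{ext}"
        for prefix in prefixes
        for ext in ("bin", "idx")
        if f"{prefix}.{ext}" not in present
    ]
```

A typical call would pass a directory listing, for example `missing_outputs(os.listdir("tokenized_nemotron_3"))`, where the directory name matches the `OUTPUT_DIR` value used in the commands; an empty result means every expected file is present.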

examples/megatron_bridge/README.md

Lines changed: 8 additions & 7 deletions
@@ -35,17 +35,11 @@ docker run \
     --rm -it \
     -v ${MODELOPT_DIR}:/opt/Model-Optimizer \
     -v ${MODELOPT_DIR}/modelopt:/opt/venv/lib/python3.12/site-packages/modelopt \
+    -v ${MODELOPT_DIR}/modelopt_recipes:/opt/venv/lib/python3.12/site-packages/modelopt_recipes \
     -w /opt/Model-Optimizer/examples/megatron_bridge \
     ${DOCKER_IMAGE} bash
 ```

-Once inside the container, you need to login with your HuggingFace token to download gated datasets / models.
-Note that the default dataset for pruning and quantization is [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2), which is gated.
-
-```bash
-hf auth login --token <your token>
-```
-
 > [!WARNING]
 > Use `python -m pip` instead of `pip` to avoid conflicts with the system-wide installed packages in the NeMo containers. You may also refer to this [doc](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/docker/common/README.md#installing-packages-inside-the-container) on how to correctly install packages in the NeMo containers without breaking existing torch installation.

@@ -55,6 +49,13 @@ Also install additional dependencies from the [requirements.txt](./requirements.
 python -m pip install -r requirements.txt
 ```

+You also need to login with your HuggingFace token to download gated datasets / models.
+Note that the default dataset for pruning and quantization is [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2), which is gated.
+
+```bash
+hf auth login --token <your token>
+```
+
 ## Pruning

 This section shows how to prune a HuggingFace model using Minitron algorithm in Megatron-Bridge framework. Checkout other available pruning algorithms, supported frameworks and models, and general pruning getting-started in the [pruning README](../pruning/README.md).

examples/megatron_bridge/distill.py

Lines changed: 42 additions & 4 deletions
@@ -56,8 +56,6 @@
 with contextlib.suppress(ModuleNotFoundError):
     import modelopt.torch.puzzletron.plugins.mbridge  # noqa: F401

-SEED = 1234
-

 def _patched_to_cfg_dict(self):
     """Patched DistillationProvider.to_cfg_dict method for heterogeneous teacher and student models.
@@ -117,6 +115,12 @@ def get_args():
     parser.add_argument("--etp_size", type=int, default=1, help="Expert tensor parallel size")

     # Dataset arguments
+    parser.add_argument(
+        "--seed",
+        type=int,
+        default=1234,
+        help="Random seed for data shuffling and RNG state",
+    )
     parser.add_argument(
         "--data_paths",
         nargs="+",
@@ -153,6 +157,34 @@ def get_args():
     parser.add_argument("--lr", type=float, default=1e-4, help="Peak learning rate")
     parser.add_argument("--min_lr", type=float, default=1e-5, help="Minimum learning rate")
     parser.add_argument("--lr_warmup_iters", type=int, default=50, help="Number of LR warmup steps")
+    parser.add_argument(
+        "--recompute_granularity",
+        type=str,
+        default=None,
+        choices=["selective", "full"],
+        help="Activation recomputation: omit (off), 'selective' (attn only), 'full' (whole layers)",
+    )
+    parser.add_argument(
+        "--recompute_method",
+        type=str,
+        default=None,
+        choices=["uniform", "block"],
+        help="Activation recomputation method (only used when --recompute_granularity=full)",
+    )
+    parser.add_argument(
+        "--recompute_num_layers",
+        type=int,
+        default=None,
+        help="Number of layers per recomputation chunk (only used when --recompute_granularity=full)",
+    )
+    parser.add_argument(
+        "--recompute_modules",
+        type=str,
+        nargs="+",
+        default=None,
+        help="Modules to recompute with --recompute_granularity=selective. Defaults to ['core_attn']. "
+        "Allowed: core_attn, mlp, moe, moe_act, layernorm, mla_up_proj, shared_experts.",
+    )
     parser.add_argument(
         "--eval_interval", type=int, default=100, help="Validate + checkpoint every <N> steps"
     )
@@ -219,6 +251,12 @@ def _build_model_provider(hf_path):
         provider.expert_model_parallel_size = args.ep_size
         provider.expert_tensor_parallel_size = args.etp_size
         provider.seq_length = args.seq_length
+        if args.recompute_granularity is not None:
+            provider.recompute_granularity = args.recompute_granularity
+            provider.recompute_method = args.recompute_method
+            provider.recompute_num_layers = args.recompute_num_layers
+            if args.recompute_modules is not None:
+                provider.recompute_modules = args.recompute_modules
         return provider

     # TODO: Support megatron-ckpt as an alternative to HF checkpoints (e.g. /path/to/ckpt/iter_0000000)
@@ -246,7 +284,7 @@ def _build_model_provider(hf_path):
     dataset_kwargs = {
         "seq_length": args.seq_length,
         "path_to_cache": args.data_path_to_cache,
-        "random_seed": SEED,
+        "random_seed": args.seed,
         "reset_attention_mask": False,
         "reset_position_ids": False,
         "eod_mask_loss": False,
@@ -308,7 +346,7 @@ def _build_model_provider(hf_path):
             async_save=True,
             fully_parallel_save=True,
         ),
-        rng=RNGConfig(seed=SEED),
+        rng=RNGConfig(seed=args.seed),
         mixed_precision="bf16_mixed",
     )
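The new activation-recomputation flags in `distill.py` flow from `argparse` onto the model provider. The sketch below reproduces that wiring in isolation so it can be run without Megatron-Bridge. It is a sketch under assumptions: `build_parser` and `apply_recompute_args` are hypothetical names, a plain `SimpleNamespace` stands in for the real provider object, and nesting the `recompute_modules` check inside the granularity check is an assumption, since indentation is not visible in this view of the diff.

```python
import argparse
from types import SimpleNamespace


def build_parser() -> argparse.ArgumentParser:
    """Parser holding only the four recompute flags added in this commit."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--recompute_granularity", type=str, default=None, choices=["selective", "full"])
    parser.add_argument("--recompute_method", type=str, default=None, choices=["uniform", "block"])
    parser.add_argument("--recompute_num_layers", type=int, default=None)
    parser.add_argument("--recompute_modules", type=str, nargs="+", default=None)
    return parser


def apply_recompute_args(provider, args):
    """Copy recompute settings onto `provider` only when recomputation is enabled.

    ASSUMPTION: recompute_modules is applied only if a granularity was given.
    """
    if args.recompute_granularity is not None:
        provider.recompute_granularity = args.recompute_granularity
        provider.recompute_method = args.recompute_method
        provider.recompute_num_layers = args.recompute_num_layers
        if args.recompute_modules is not None:
            provider.recompute_modules = args.recompute_modules
    return provider


if __name__ == "__main__":
    cli = ["--recompute_granularity", "selective", "--recompute_modules", "core_attn", "mlp"]
    provider = apply_recompute_args(SimpleNamespace(), build_parser().parse_args(cli))
    print(vars(provider))
```

Leaving every flag at its `None` default keeps the provider untouched, so the provider's own defaults apply when recomputation is not requested.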

examples/pruning/README.md

Lines changed: 2 additions & 1 deletion
@@ -294,7 +294,8 @@ After pruning, distillation is required to recover model accuracy. Below are rec

 End-to-end distillation results with Megatron-Bridge after Minitron and Puzzletron pruning:

-- **[Minitron — Nemotron-Nano-9B-v2](minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md)**: End-to-end tutorial of structured pruning for Nemotron-Nano-9B-v2 to 7B followed by knowledge distillation up to 80B tokens, quantization, and vLLM deployment. Achieves near-parity with the official 9B model across popular pretraining and reasoning benchmarks.
+- **[Minitron — Nemotron-3-Nano-30B-A3B-BF16](minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md)** *recommended — newer and most comprehensive*: End-to-end tutorial of structured pruning for Nemotron-3-Nano-30B-A3B-BF16 (31.6B/A3.6B) to 22B/A3.0B active parameters followed by two-phase knowledge distillation (80B tokens @ 8K seq length + 20B tokens @ 32K seq length = 100B tokens total), quantization, and vLLM deployment. Covers MoE + Mamba-Transformer hybrid, tool-calling data, and a long-context fine-tuning phase. Achieves near-parity with the official 30B model across popular pretraining and reasoning benchmarks while delivering up to 1.64× throughput speedup and 2.6× memory reduction when combined with FP8 quantization.
+- **[Minitron — Nemotron-Nano-9B-v2](minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md)**: Earlier end-to-end tutorial covering structured pruning of the dense Mamba-Transformer Nemotron-Nano-9B-v2 to 7B followed by knowledge distillation up to 80B tokens, quantization, and vLLM deployment. Simpler architecture, single-phase 8K seq length distillation, no tool-calling or long-context phase.
 - **[Puzzletron — Qwen3-8B and Llama-3.1-8B-Instruct](puzzletron/Llama-3.1-8B-Instruct.md)**: MIP-based compression followed by short distillation runs on WikiText-103. Shows MMLU recovery and illustrates the importance of using larger datasets to avoid overfitting.

 ## Resources
