
Commit 3350b0a

[OMNIML-3017] MLM QAD example (#682)
## What does this PR do?

**Type of change:** New example

**Overview:**

### Add QAD Training example for Megatron-LM

- File structure
  - `qad.sh` / `sbatch_qad.sh` - Training and SLURM submission scripts
  - `data_utils/` - Dataset download and preprocessing utilities
  - `configs/` - Configuration templates for Qwen3-30B-A3B (MoE) and Qwen3-8B (Dense)
- Key features
  - One-button dataset generation (OpenScience + Nemotron-v2)
  - Config-based training scripts that keep all tunable knobs in a single config file

## Usage

1. Generate the dataset:

   ```bash
   bash data_utils/generate_dataset.sh \
       --output-dir /path/to/datasets \
       --mlm-path /path/to/Megatron-LM \
       --tokenizer Qwen/Qwen3-30B-A3B-Instruct-2507
   ```

2. Create a config based on the templates.

3. Kick off training with SLURM:

   ```bash
   sbatch sbatch_qad.sh --config configs/my-experiment.conf
   ```

## Testing

QAD with Qwen3-30B-A3B-Instruct-2507 NVFP4 (all layers quantized):

| Benchmark | BF16 | NVFP4 (PTQ) | NVFP4 (QAD) |
|---------------|--------|-------------|-------------|
| GPQA | 0.549 | 0.4949 | 0.5202 |
| LiveCodeBench | 0.3987 | 0.37 | 0.3855 |
| SciCode | 0.325 | 0.276 | 0.3146 |
| AIME | 0.6049 | 0.55 | 0.5431 |

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes/No
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No

## Additional Information

---

Signed-off-by: Wei-Ming Chen <weimingc@login-eos01.eos.clusters.nvidia.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
1 parent 03dc386 commit 3350b0a

File tree

8 files changed: +1137 -0 lines

CHANGELOG.rst

Lines changed: 1 addition & 0 deletions

@@ -13,6 +13,7 @@ NVIDIA Model Optimizer Changelog (Linux)

  - Add support for subgraphs in ONNX autocast.
  - Add support for parallel draft heads in Eagle speculative decoding.
  - Add support to enable custom emulated quantization backend. See :meth:`register_quant_backend <modelopt.torch.quantization.nn.modules.tensor_quantizer.register_quant_backend>` for more details. See an example in ``tests/unit/torch/quantization/test_custom_backend.py``.
+ - Add ``examples/llm_qad`` for QAD training with Megatron-LM.

**Deprecations**

examples/llm_qad/README.md

Lines changed: 170 additions & 0 deletions

# QAD Training Scripts

Quantization-Aware Distillation (QAD) training scripts for language models using Megatron-LM. These scripts enable training quantized (e.g., NVFP4) student models with knowledge distillation from full-precision teacher models.

## Overview

| Script | Purpose |
|--------|---------|
| `qad.sh` | Main training script (run inside the container) |
| `sbatch_qad.sh` | SLURM batch submission wrapper |
| `configs/*.conf` | Model-specific configuration files |

## Requirements

### Clone Required Repositories

```bash
# Set your workspace directory
export WORKSPACE=/path/to/your/workspace

# Clone Megatron-LM (with ModelOpt integration)
git clone https://github.com/NVIDIA/Megatron-LM.git ${WORKSPACE}/Megatron-LM

# Clone Model-Optimizer
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git ${WORKSPACE}/Model-Optimizer
```

### Prepare Checkpoints

You need the following checkpoints before training:

1. **Student checkpoint**: Quantized (e.g., NVFP4) model in Megatron-LM format
2. **Teacher checkpoint**: Full-precision (BF16) model in Megatron-LM format
3. **Teacher config YAML**: Model architecture configuration

See [Megatron-LM ModelOpt examples](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt) for checkpoint conversion from HuggingFace format.

## Creating a Configuration

### Available Templates

| Config | Model | Type |
|--------|-------|------|
| `qwen3-30b-a3b-instruct-2507-moe_template.conf` | Qwen3-30B-A3B-Instruct | MoE |
| `qwen3-8b_template.conf` | Qwen3-8B | Dense |

### Create Your Config

1. Copy a template:

```bash
# For MoE models
cp configs/qwen3-30b-a3b-instruct-2507-moe_template.conf configs/my-experiment.conf

# For Dense models
cp configs/qwen3-8b_template.conf configs/my-experiment.conf
```

2. Fill in the required fields:

**Checkpoints** (required):

| Variable | Description |
|----------|-------------|
| `STUDENT_CKPT` | Path to quantized student MLM checkpoint |
| `TEACHER_CKPT` | Path to teacher MLM checkpoint |
| `TEACHER_MODEL_CONFIG` | Path to teacher YAML config (see below) |

**Paths** (required):

| Variable | Description |
|----------|-------------|
| `MLM_DIR` | Path to Megatron-LM directory |
| `BLEND_PATH` | Path to datablend JSON (from dataset generation) |

**Parallelism** (adjust for your hardware):

| Variable | Dense Model | MoE Model |
|----------|-------------|-----------|
| `IS_MOE` | `false` | `true` |
| `TP_SIZE` | `1` | `2` |
| `EP_SIZE` | `1` | `4` |
| `MBS` | `4` | `2` |
**Training** (tune as needed):

| Variable | Default | Description |
|----------|---------|-------------|
| `LR` | `1e-5` | Learning rate |
| `GBS` | `256` | Global batch size |
| `SAVE_INTERVAL` | `200` | Checkpoint interval |
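For concreteness, a hypothetical `configs/my-experiment.conf` for a dense model might look like the sketch below. Every path is a placeholder, and the variable names mirror the shipped templates:

```shell
#!/bin/bash
# Hypothetical example config -- all paths below are placeholders.

# Checkpoints (required)
export STUDENT_CKPT="/data/ckpts/qwen3-8b-nvfp4"   # quantized student (MLM format)
export TEACHER_CKPT="/data/ckpts/qwen3-8b-bf16"    # full-precision teacher (MLM format)
export TEACHER_MODEL_CONFIG="configs/Qwen3-8B-teacher.yaml"

# Paths (required)
export MLM_DIR="/workspace/Megatron-LM"
export BLEND_PATH="/data/datasets/datablend_combined.json"

# Parallelism (dense model)
export IS_MOE=false
export TP_SIZE=1
export EP_SIZE=1
export MBS=4

# Training
export LR="1e-5"
export GBS=256
export SAVE_INTERVAL=200
```

Sourcing the file and echoing a variable (`source configs/my-experiment.conf && echo $GBS`) is a quick sanity check before submitting.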
### Teacher Model Config (YAML)

Create a YAML file with the teacher model architecture (example: `configs/Qwen3-30B-A3B-teacher.yaml`):

```yaml
num_layers: 48
hidden_size: 2048
num_attention_heads: 32
num_query_groups: 4
kv_channels: 128
ffn_hidden_size: 6144
```

## Dataset Generation

Use the one-button script to generate the default datablend:

```bash
cd data_utils/

bash generate_dataset.sh \
    --output-dir /path/to/datasets \
    --mlm-path /path/to/Megatron-LM \
    --tokenizer <HF-model>  # e.g., Qwen/Qwen3-30B-A3B-Instruct-2507
```

**Requirements**: A HuggingFace token for `nvidia/Nemotron-Post-Training-Dataset-v2`. Log in first: `huggingface-cli login`

**Output**: Creates `datablend_combined.json` with the OpenScience + Nemotron-v2 datasets. Set `BLEND_PATH` in your config to point to this file.

## Quick Start

### SLURM Batch Submission (Recommended)

First, update the SLURM header in `sbatch_qad.sh` with your cluster settings:

- `--account=<your-account>`
- `--nodes`, `--gres=gpu`, `-t` as needed

```bash
# Submit a training job (override the account on the command line)
sbatch --account=<your-account> sbatch_qad.sh --config configs/my-experiment.conf

# With a HuggingFace token (for gated models)
sbatch --account=<your-account> sbatch_qad.sh --hf-token $HF_TOKEN --config configs/my-experiment.conf

# Adjust nodes and time
sbatch --account=<your-account> --nodes=4 -t 8:00:00 sbatch_qad.sh --config configs/my-experiment.conf
```

### Interactive Mode

```bash
# Get an interactive node
srun -A <account> --nodes=1 -p batch --mpi=pmix \
    --container-image=nvcr.io/nvidia/pytorch:25.06-py3 \
    --container-mounts="..." \
    -t 4:0:0 --pty bash

# Run training
bash qad.sh --config configs/qwen3-8b.conf
```

## Resuming Training

Training automatically resumes from checkpoints. To force a fresh start:

```bash
rm -rf /path/to/checkpoints/*/latest_checkpointed_iteration.txt
```
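Before forcing a fresh start, it can help to see where each run would resume from. A small sketch, assuming the Megatron-LM convention that `latest_checkpointed_iteration.txt` inside a checkpoint directory holds the last saved iteration (`CKPT_ROOT` is a placeholder for your `QAD_CHECKPOINT_ROOT`):

```shell
#!/usr/bin/env bash
# Print the iteration each run under a checkpoint root would resume from.
list_resume_points() {
  local root="$1"
  for tracker in "$root"/*/latest_checkpointed_iteration.txt; do
    [ -f "$tracker" ] || continue  # glob may match nothing
    echo "$(basename "$(dirname "$tracker")") -> iteration $(cat "$tracker")"
  done
}

# CKPT_ROOT is a placeholder; point it at your QAD_CHECKPOINT_ROOT.
list_resume_points "${CKPT_ROOT:-/path/to/checkpoints}"
```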
## Troubleshooting

### OOM Errors

- Reduce `MBS`
- Increase `EP_SIZE`, `TP_SIZE`, or `PP_SIZE`
- Add more nodes
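To reason about these knobs, it helps to remember how they multiply out. A back-of-envelope sketch, assuming the usual Megatron-style decomposition (`NODES` is hypothetical; the other numbers follow the MoE template; `EP_SIZE` shards MoE experts and does not enter this product):

```shell
#!/usr/bin/env bash
# Illustrative parallelism math only -- not part of the training scripts.
NODES=2       # hypothetical SLURM node count
NUM_GPUS=4    # GPUs per node, as in the MoE template
TP_SIZE=2     # tensor parallel
PP_SIZE=1     # pipeline parallel
MBS=2         # micro-batch size
GBS=64        # global batch size

TOTAL_GPUS=$(( NODES * NUM_GPUS ))
DP_SIZE=$(( TOTAL_GPUS / (TP_SIZE * PP_SIZE) ))  # data-parallel replicas
ACC_STEPS=$(( GBS / (MBS * DP_SIZE) ))           # gradient accumulation steps

echo "DP_SIZE=${DP_SIZE} ACC_STEPS=${ACC_STEPS}"
```

Raising `TP_SIZE` or `PP_SIZE` shrinks `DP_SIZE`, leaving less model state per GPU for the student + teacher pair at the cost of more accumulation steps; adding nodes grows `DP_SIZE` instead.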
examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf

Lines changed: 73 additions & 0 deletions

#!/bin/bash
########################################################
# QAD Configuration: Qwen3-30B-A3B Instruct (MoE)
# Mixture of Experts - requires more resources
#
# Usage:
#   sbatch sbatch_qad.sh --config configs/qwen3-30b-a3b-instruct-2507-moe_template.conf
########################################################

########################################################
# MODEL
########################################################
export STUDENT_MODEL="Qwen3-30B-A3B-Instruct-2507"
export TEACHER_MODEL="Qwen3-30B-A3B-Instruct-2507"
export TOKENIZER_MODEL="Qwen/Qwen3-30B-A3B-Instruct-2507"

########################################################
# CHECKPOINTS (REQUIRED)
########################################################
export STUDENT_CKPT=""          # Student MLM checkpoint path
export TEACHER_CKPT=""          # Teacher MLM checkpoint path
export TEACHER_MODEL_CONFIG=""  # Teacher MLM model config YAML file, e.g., configs/Qwen3-30B-A3B-teacher.yaml

########################################################
# TRAINING (REQUIRED - no defaults in qad.sh)
########################################################
export LR="5e-6"
export GBS=64
export MIN_LR="1e-8"
export LR_DECAY_STYLE="cosine"
export SAVE_INTERVAL=200
export LOG_INTERVAL=10
export DATASET_NAME="openscience_nemotron"  # used for logging
export TRAIN_SAMPLES=5120000

########################################################
# PARALLELISM
# Note: QAD loads both student + teacher models, so it requires more memory
########################################################
export TP_SIZE=2
export PP_SIZE=1
export MBS=2
export NUM_GPUS=4
export MASTER_PORT=29500

########################################################
# MOE
########################################################
export EP_SIZE=4
export IS_MOE=true

########################################################
# PATHS (REQUIRED - no defaults in qad.sh)
########################################################
export MLM_DIR=""              # path to Megatron-LM source directory
export MODELOPT_DIR=""         # path to Model-Optimizer source directory
export STUDENT_CONFIG_FILE=""  # path to student model args script, e.g., ${MLM_DIR}/examples/post_training/modelopt/conf/Qwen/Qwen3-30B-A3B.sh
export QAD_CHECKPOINT_ROOT=""  # path to store QAD checkpoints
export DATACACHE_DIR=""        # path to data cache directory

########################################################
# CONTAINER
########################################################
export CONTAINER_IMAGE=""    # path to container image, e.g., nvcr.io/nvidia/pytorch:25.06-py3
export CONTAINER_MOUNTS=""   # container mounts, e.g., "/lustre/fs1:/lustre/fs1"
export CONTAINER_WORKDIR=""  # container work directory, e.g., "<path-to-modelopt>/Model-Optimizer/examples/llm_qad"

########################################################
# DATASET
########################################################
# Generate with: bash data_utils/generate_dataset.sh --output-dir <path> --mlm-path <path> --tokenizer <model>
export BLEND_PATH=""  # path to datablend_combined.json from generate_dataset.sh
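Since the templates ship with the required fields empty, a quick pre-flight check before `sbatch` can save a failed job. A hypothetical helper, not part of the example itself; the variable names match the templates:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: fail fast if a required variable is empty.
check_required() {
  local missing=0
  for var in "$@"; do
    if [ -z "${!var}" ]; then          # indirect expansion: value of the named variable
      echo "ERROR: $var is not set" >&2
      missing=1
    fi
  done
  return $missing
}

# Usage: source your config, then validate before submitting, e.g.:
#   source configs/my-experiment.conf
#   check_required STUDENT_CKPT TEACHER_CKPT TEACHER_MODEL_CONFIG \
#                  MLM_DIR BLEND_PATH QAD_CHECKPOINT_ROOT || exit 1
```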
examples/llm_qad/configs/qwen3-8b_template.conf

Lines changed: 71 additions & 0 deletions

#!/bin/bash
########################################################
# QAD Configuration: Qwen3-8B (Dense Model)
#
# Usage:
#   sbatch sbatch_qad.sh --config configs/qwen3-8b_template.conf
########################################################

########################################################
# MODEL
########################################################
export STUDENT_MODEL="Qwen3-8B"
export TEACHER_MODEL="Qwen3-8B"
export TOKENIZER_MODEL="Qwen/Qwen3-8B"

########################################################
# CHECKPOINTS (REQUIRED)
########################################################
export STUDENT_CKPT=""          # Student MLM checkpoint path
export TEACHER_CKPT=""          # Teacher MLM checkpoint path
export TEACHER_MODEL_CONFIG=""  # Teacher MLM model config YAML file

########################################################
# TRAINING
########################################################
export LR="5e-6"
export GBS=64
export MIN_LR="1e-8"
export LR_DECAY_STYLE="cosine"
export SAVE_INTERVAL=200
export LOG_INTERVAL=10
export DATASET_NAME="openscience_nemotron"  # used for logging
export TRAIN_SAMPLES=5120000

########################################################
# PARALLELISM (Dense model - simpler settings)
########################################################
export TP_SIZE=1
export PP_SIZE=1
export MBS=4
export NUM_GPUS=8
export MASTER_PORT=29500

########################################################
# MOE
########################################################
export EP_SIZE=1
export IS_MOE=false

########################################################
# PATHS (REQUIRED)
########################################################
export MLM_DIR=""              # path to Megatron-LM source directory
export MODELOPT_DIR=""         # path to Model-Optimizer source directory
export STUDENT_CONFIG_FILE=""  # path to student model args script, e.g., ${MLM_DIR}/examples/post_training/modelopt/conf/Qwen/Qwen3-8B.sh
export QAD_CHECKPOINT_ROOT=""  # path to store QAD checkpoints
export DATACACHE_DIR=""        # path to data cache directory

########################################################
# CONTAINER
########################################################
export CONTAINER_IMAGE=""    # path to container image, e.g., nvcr.io/nvidia/pytorch:25.06-py3
export CONTAINER_MOUNTS=""   # container mounts, e.g., "/lustre/fs1:/lustre/fs1"
export CONTAINER_WORKDIR=""  # container work directory

########################################################
# DATASET
########################################################
# Generate with: bash data_utils/generate_dataset.sh --output-dir <path> --mlm-path <path> --tokenizer <model>
export BLEND_PATH=""  # path to datablend_combined.json from generate_dataset.sh
71+
