Commit de30e03

realAsma and claude committed
Refactor llm_qat example: YAML configs + ModelOptArgParser
Replace launch.sh + main.py with config-driven quantize.py and train.py using ModelOptArgParser (--config YAML defaults + CLI overrides).

Add a YAML-driven dataset blending system with streaming, distributed sharding, and tokenization caching. Introduce declarative configs/ for train, dataset, and accelerate, plus a pre-commit hook that regenerates ARGUMENTS.md.

Move the QAT memory workaround into QATTrainer, add resolve_quant_cfg_from_args(), a teacher-model loading utility for QAD, and the int4_blockwise_weight_only recipe. Default model is now Qwen3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
1 parent 8f96832 commit de30e03
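
The commit message describes `--config` YAML defaults combined with CLI overrides. The following is a minimal sketch of that precedence rule using only the standard-library `argparse`; it is not the `ModelOptArgParser` implementation. The function `parse_with_config`, the placeholder model name, and the plain dict standing in for a parsed YAML file are all assumptions made for illustration.

```python
import argparse


def parse_with_config(argv, config_values):
    """Parse argv, treating config_values as defaults that CLI flags override."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default=None)
    parser.add_argument("--model_name_or_path", default="meta-llama/Llama-2-7b-hf")
    parser.add_argument("--train_samples", type=int, default=20000)
    parser.add_argument("--calib_size", type=int, default=512)
    # Parser-level defaults replace the per-argument defaults declared above,
    # while any flag given explicitly in argv still takes priority.
    parser.set_defaults(**config_values)
    return parser.parse_args(argv)


# Hypothetical stand-in for the mapping a YAML loader would return.
config_values = {"model_name_or_path": "example/qwen3-model", "train_samples": 1000}

args = parse_with_config(["--train_samples", "50"], config_values)
print(args.model_name_or_path, args.train_samples, args.calib_size)
```

Here `train_samples` comes from the command line, `model_name_or_path` from the config mapping, and `calib_size` from the declared default, which is the layering the commit message implies.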

42 files changed

Lines changed: 2895 additions & 1028 deletions

.pre-commit-config.yaml

Lines changed: 16 additions & 1 deletion
@@ -109,7 +109,7 @@ repos:
 examples/llm_eval/lm_eval_hf.py|
 examples/llm_eval/mmlu.py|
 examples/llm_eval/modeling.py|
-examples/llm_qat/main.py|
+examples/llm_qat/train.py|
 examples/llm_sparsity/weight_sparsity/finetune.py|
 examples/specdec_bench/specdec_bench/models/specbench_medusa.py|
 examples/speculative_decoding/main.py|
@@ -137,6 +137,21 @@ repos:
         args: ["-c", "pyproject.toml", "-q"]
         additional_dependencies: ["bandit[toml]"]
 
+  - repo: local
+    hooks:
+      - id: generate-arguments-md
+        name: Regenerate examples/llm_qat/ARGUMENTS.md
+        entry: bash -c 'python examples/llm_qat/arguments.py --generate_docs examples/llm_qat/ARGUMENTS.md'
+        language: system
+        files: >-
+          (?x)^(
+            examples/llm_qat/arguments\.py|
+            modelopt/torch/distill/plugins/huggingface\.py|
+            modelopt/torch/opt/plugins/transformers\.py|
+            modelopt/torch/quantization/plugins/transformers_trainer\.py
+          )$
+        pass_filenames: false
+
   - repo: https://github.com/DavidAnson/markdownlint-cli2
     rev: v0.18.1
     hooks:
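
The `files` pattern of the new local hook is a verbose-mode regular expression listing four exact paths. As a rough check of which changed files would trigger regeneration of `ARGUMENTS.md`, the sketch below compiles the same pattern with Python's `re` module. The helper name `triggers_hook` is hypothetical, and the sketch assumes the pattern is applied to repository-relative paths.

```python
import re

# Same alternation as the hook's `files:` entry; (?x) makes whitespace
# and newlines inside the pattern insignificant.
HOOK_FILES = re.compile(r"""(?x)^(
    examples/llm_qat/arguments\.py|
    modelopt/torch/distill/plugins/huggingface\.py|
    modelopt/torch/opt/plugins/transformers\.py|
    modelopt/torch/quantization/plugins/transformers_trainer\.py
)$""")


def triggers_hook(path):
    """Return True if a changed path matches the hook's file filter."""
    return HOOK_FILES.search(path) is not None


for candidate in ("examples/llm_qat/arguments.py", "examples/llm_qat/train.py"):
    print(candidate, triggers_hook(candidate))
```

Because the pattern is anchored with `^` and `$`, only those four files match; editing `train.py` alone would not run the hook.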

CHANGELOG.rst

Lines changed: 4 additions & 0 deletions
@@ -42,6 +42,10 @@ Changelog
 - Add mixed-precision FP8 + NVFP4 export for Megatron-Core: per-layer ``quant_algo`` recorded under ``quantized_layers`` in ``hf_quant_config.json``, PP-aware ``kv_cache_dtype`` gather, fused-QKV exclude split into per-HF-name ``q/k/v_proj`` entries.
 - Add Nemotron-3-Super-120B-A12B PTQ recipes ``modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`` (MSE-mixed) and ``super-nvfp4-max-calib.yaml`` (max-calib mixed): NVFP4 W4A4 routed experts + FP8 per-tensor shared experts / Mamba in/out_proj + FP8 KV cache.
 - Add quantized ``nn.Embedding`` support. ``nn.Embedding`` is now registered in ``QuantModuleRegistry`` and exposes ``weight_quantizer`` (embedding table), ``output_quantizer`` (lookup activations), and a permanently disabled ``input_quantizer`` placeholder — embedding inputs are integer indices and cannot be fake-quantized, so direct ``enable*()`` calls raise. ``export_hf_checkpoint`` packs quantized embedding weights alongside Linear layers. Embedding quantizers are opt-in (``parent_class: nn.Embedding`` disabled by default).
+- Refactor ``llm_qat`` example with unified YAML-based configuration and flexible dataset blending.
+  ``ModelOptArgParser`` adds ``--config`` YAML support with CLI overrides and auto-generates ``ARGUMENTS.md`` from dataclass definitions.
+  Dataset blending (``configs/dataset/blend.yaml``) supports HuggingFace datasets, local JSON/JSONL/Parquet files, and weighted multi-source blends.
+  The legacy FSDP1 accelerate config is removed; ``llm_qat`` now documents FSDP2, DeepSpeed, and DDP backends.
 
 **Bug Fixes**
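
The changelog entry mentions weighted multi-source blends. One common way to realize such a blend is to pick the source of each sample at random in proportion to its weight; the sketch below illustrates that idea only. It is not the example's blending code and does not reflect the schema of `configs/dataset/blend.yaml`. The function `blend_sources`, the source names, and the choice to cycle through a source once it is exhausted are all assumptions.

```python
import itertools
import random


def blend_sources(sources, weights, num_samples, seed=42):
    """Draw num_samples items, choosing each item's source by relative weight."""
    rng = random.Random(seed)  # seeded so the blend is reproducible
    names = list(sources)
    # Assumption: restart a source from the beginning when it runs out.
    streams = {name: itertools.cycle(sources[name]) for name in names}
    probs = [weights[name] for name in names]
    blended = []
    for _ in range(num_samples):
        name = rng.choices(names, weights=probs, k=1)[0]
        blended.append((name, next(streams[name])))
    return blended


# Hypothetical sources: in-memory lists standing in for datasets.
sources = {"chat": ["c0", "c1", "c2"], "code": ["k0", "k1"]}
weights = {"chat": 3.0, "code": 1.0}
samples = blend_sources(sources, weights, num_samples=8)
print(samples)
```

With weights of 3 to 1, roughly three quarters of a large blend would come from the first source; the fixed seed plays the role that `--dataset_seed` appears to play in the argument reference.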

examples/llm_qad/README.md

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@
 
 Quantization-Aware Distillation (QAD) training scripts for language models using Megatron-LM. These scripts enable training quantized (e.g., NVFP4) student models with knowledge distillation from full-precision teacher models.
 
+> **Note:** For Hugging Face LLM QAD, see the [LLM QAT QAD section](../llm_qat/README.md#end-to-end-qad-example).
+
 ## Overview
 
 | Script | Purpose |

examples/llm_qat/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+.cache/
+.dataset_cache/

examples/llm_qat/ARGUMENTS.md

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+# Argument Reference
+
+<!-- Auto-generated — do not edit by hand. Regenerate with: python examples/llm_qat/arguments.py --generate_docs examples/llm_qat/ARGUMENTS.md -->
+
+## DistillArguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--distill` | `bool` | `False` | Enable training with knowledge distillation. |
+| `--teacher_model` | `str` | `None` | The name or path of the teacher model to use for distillation. |
+| `--criterion` | `str` | `"logits_loss"` | Distillation loss criterion. Currently only 'logits_loss' is supported. |
+
+## DataArguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--dataset_config` | `str` | `"configs/dataset/blend.yaml"` | Path to a dataset blend YAML config file. |
+| `--train_samples` | `int` | `20000` | Number of training samples to use. |
+| `--eval_samples` | `int` | `2000` | Number of evaluation samples to use. |
+| `--dataset_seed` | `int` | `42` | Random seed for dataset shuffling. |
+| `--dataset_cache_dir` | `str` | `".dataset_cache/tokenized"` | Directory for caching tokenized datasets. |
+| `--shuffle` | `bool` | `True` | Whether to shuffle dataset sources (reservoir sampling). |
+| `--shuffle_buffer` | `int` | `10000` | Buffer size for streaming shuffle. |
+| `--num_proc` | `int` | `16` | Number of CPU workers for tokenization. |
+
+## ModelArguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--model_name_or_path` | `str` | `"meta-llama/Llama-2-7b-hf"` | HuggingFace model name or local path to the base model to quantize/train. |
+| `--model_max_length` | `int` | `4096` | Maximum sequence length. Sequences will be right-padded (and possibly truncated). |
+
+## QuantizeArguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--recipe` | `str` | `None` | Path to a quantization recipe YAML file (built-in or custom). Built-in recipes can be specified by relative path, e.g. 'general/ptq/nvfp4_default-kv_fp8'. Replaces the deprecated --quant_cfg flag. |
+| `--quant_cfg` | `modelopt.torch.quantization.config.QuantizeConfig` | `None` | Deprecated: pre-quantize the model with a separate quantization step instead. Specify the quantization format for PTQ/QAT by name (e.g. NVFP4_DEFAULT_CFG). |
+| `--calib_size` | `int` | `512` | Specify the calibration size for quantization. The calibration dataset is used to setup the quantization scale parameters for PTQ/QAT. |
+| `--compress` | `bool` | `False` | Whether to compress the model weights after quantization for QLoRA. This is useful for reducing the model size. |
+| `--calib_batch_size` | `int` | `1` | Batch size for calibration data during quantization. |
+| `--output_dir` | `str` | `"quantized_model"` | Directory to save the quantized model checkpoint. |
+
+## TrainingArguments
+
+Extends [HuggingFace TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments). Only additional arguments are shown below.
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--cache_dir` | `str` | `None` | |
+| `--lora` | `bool` | `False` | Whether to add LoRA (Low-Rank Adaptation) adapter before training. When using real quantization, the LoRA adapter must be set, as quantized weights will be frozen during training. |
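
`ARGUMENTS.md` is described as auto-generated from dataclass definitions. A minimal sketch of how such a table could be derived with the standard-library `dataclasses` module follows. It is not the code in `examples/llm_qat/arguments.py`: the class `DemoQuantizeArguments` and the helpers `format_default` and `to_markdown` are hypothetical, and reading the description from a `help` key in the field metadata is an assumption borrowed from the usual Hugging Face argument-dataclass convention.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DemoQuantizeArguments:
    # Two fields copied from the table above, for illustration only.
    calib_batch_size: int = field(
        default=1,
        metadata={"help": "Batch size for calibration data during quantization."},
    )
    output_dir: str = field(
        default="quantized_model",
        metadata={"help": "Directory to save the quantized model checkpoint."},
    )


def format_default(value):
    """Quote string defaults, leave other values as-is, and wrap in backticks."""
    return f'`"{value}"`' if isinstance(value, str) else f"`{value}`"


def to_markdown(cls):
    """Render one section with a four-column argument table for a dataclass."""
    lines = [
        f"## {cls.__name__}",
        "",
        "| Argument | Type | Default | Description |",
        "|----------|------|---------|-------------|",
    ]
    for f in fields(cls):
        type_name = f.type.__name__ if isinstance(f.type, type) else str(f.type)
        help_text = f.metadata.get("help", "")
        lines.append(
            f"| `--{f.name}` | `{type_name}` | {format_default(f.default)} | {help_text} |"
        )
    return "\n".join(lines)


doc = to_markdown(DemoQuantizeArguments)
print(doc)
```

Keeping the help text on the field means the CLI help and the generated reference cannot drift apart, which is presumably why the pre-commit hook regenerates the file whenever the argument dataclasses change.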
