Add ModelOptHFTrainer and simplify KDTrainer distillation API
Add ModelOptHFTrainer with per-layer LR config, param freezing, Liger
fused loss, save dtype rewriting, and manual GC support. Refactor llm_qat
example configs and improve test coverage. Unify KDTrainer interface:
replace separate teacher_model + DistillArguments params with a single
distill_args dict/dataclass. Remove LMLogitsLoss wrapper in favor of
LogitsDistillationLoss directly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
CHANGELOG.rst (3 additions, 0 deletions)
@@ -6,6 +6,9 @@ Changelog
**New Features**
+
- Add model-agnostic `Liger kernel <https://github.com/linkedin/Liger-Kernel>`_ fused loss support in ``ModelOptHFTrainer`` for any HuggingFace causal LM, with distributed param gathering for FSDP2, DeepSpeed ZeRO-3, and DDP. This extends HuggingFace's built-in Liger integration, which supports only `a fixed set of model architectures <https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/transformers/monkey_patch.py>`_, FSDP, and CrossEntropy loss. ModelOpt additionally supports Liger fused KD loss (JSD) for knowledge distillation.
+
- Add ``ModelOptTrainerArguments`` to ``ModelOptHFTrainer`` with ``--trainable_params``, ``--frozen_params``, ``--lr_config``, ``--save_dtype``, and ``--manual_gc`` flags. Add per-parameter learning rate support via YAML config.
+
- Simplify ``KDTrainer`` for HuggingFace knowledge distillation: remove ``mtd.convert()`` class-swap in favor of explicit teacher forwarding with logit-level distillation support.
- Support full Transformer Engine spec for Minitron pruning (``mcore_minitron``). A custom ModelOpt spec is no longer needed. This does not affect usage of the pruning workflow, but it makes pruning slightly faster and may produce a slightly different pruned model because of different kernels and numerics.
- Add an iterator interface using ``CalibrationDataReader`` in the ONNX quantization workflow.
- Add N:M sparse softmax support to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
|`--trainable_params`|`list[str]`|`None`| Glob patterns (fnmatch) for parameters that should be trainable. All other parameters will be frozen. Mutually exclusive with frozen_params. |
+
|`--frozen_params`|`list[str]`|`None`| Glob patterns (fnmatch) for parameters that should be frozen. Mutually exclusive with trainable_params. |
+
|`--lr_config`|`str`|`None`| Path to a YAML file mapping fnmatch patterns to optimizer kwargs (e.g. lr, weight_decay). First matching pattern wins per parameter. See examples/llm_qat/configs/train/lr_config_example.yaml. |
+
|`--save_dtype`|`str`|`"bfloat16"`| Dtype string to write into the saved model's config.json (e.g. 'bfloat16', 'float16'). Defaults to 'bfloat16'. |
+
|`--manual_gc`|`bool`|`False`| Run `gc.collect()` before each training/prediction step to work around GPU memory leaks during QAT/distillation. |
+
|`--liger_ce_label_smoothing`|`float`|`0.0`| Label smoothing for Liger fused CE loss. Only used when --use_liger_kernel is enabled. |
|`--lora`|`bool`|`False`| Whether to add LoRA (Low-Rank Adaptation) adapter before training. When using real quantization, the LoRA adapter must be set, as quantized weights will be frozen during training. |
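
The pattern semantics described for `--trainable_params`, `--frozen_params`, and `--lr_config` can be illustrated with a small standalone sketch. This is not the `ModelOptHFTrainer` implementation: the helper names `select_trainable` and `resolve_optimizer_kwargs`, the parameter names, and the choice to merge matched kwargs over a set of defaults are all assumptions made for illustration. Only the fnmatch matching, the mutual exclusivity, and the "first matching pattern wins" rule come from the table above.

```python
from fnmatch import fnmatchcase


def select_trainable(names, trainable_params=None, frozen_params=None):
    """Return {param_name: requires_grad} from fnmatch glob patterns (hypothetical helper)."""
    if trainable_params and frozen_params:
        raise ValueError("trainable_params and frozen_params are mutually exclusive")
    if trainable_params:
        # Only matching parameters train; everything else is frozen.
        return {n: any(fnmatchcase(n, p) for p in trainable_params) for n in names}
    if frozen_params:
        # Matching parameters are frozen; everything else trains.
        return {n: not any(fnmatchcase(n, p) for p in frozen_params) for n in names}
    return {n: True for n in names}


def resolve_optimizer_kwargs(name, lr_config, defaults):
    """Return optimizer kwargs for one parameter; the first matching pattern wins."""
    for pattern, kwargs in lr_config.items():  # dicts preserve insertion order
        if fnmatchcase(name, pattern):
            # Assumption: matched kwargs override the defaults key by key.
            return {**defaults, **kwargs}
    return dict(defaults)


# Hypothetical parameter names in the style of a HuggingFace causal LM.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.1.mlp.up_proj.weight",
    "lm_head.weight",
]

flags = select_trainable(names, trainable_params=["*layers.1.*", "lm_head*"])

# Stand-in for the parsed YAML file: pattern -> optimizer kwargs, in file order.
lr_config = {
    "*lm_head*": {"lr": 1e-5},
    "*": {"lr": 1e-4, "weight_decay": 0.01},
}
defaults = {"lr": 5e-5}

for n in names:
    if flags[n]:
        print(n, resolve_optimizer_kwargs(n, lr_config, defaults))
```

Because the catch-all `"*"` pattern is listed last, `lm_head.weight` picks up the more specific entry. Reordering the entries would change which kwargs apply, which is the practical consequence of first-match-wins.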