diff --git a/CHANGELOG.rst b/CHANGELOG.rst
Lines changed: 3 additions & 0 deletions
diff --git a/examples/llm_distill/main.py b/examples/llm_distill/main.py
Lines changed: 2 additions & 8 deletions
diff --git a/examples/llm_qat/ARGUMENTS.md b/examples/llm_qat/ARGUMENTS.md
Lines changed: 12 additions & 4 deletions
diff --git a/examples/llm_qat/README.md b/examples/llm_qat/README.md
Lines changed: 8 additions & 7 deletions
diff --git a/examples/llm_qat/arguments.py b/examples/llm_qat/arguments.py
Lines changed: 20 additions & 5 deletions
diff --git a/examples/llm_qat/configs/train/finetune.yaml b/examples/llm_qat/configs/train/finetune.yaml
Lines changed: 2 additions & 2 deletions
diff --git a/examples/llm_qat/configs/train/lr_config_example.yaml b/examples/llm_qat/configs/train/lr_config_example.yaml
Lines changed: 39 additions & 0 deletions
diff --git a/examples/llm_qat/configs/train/qad_nvfp4.yaml b/examples/llm_qat/configs/train/qad_nvfp4.yaml
Lines changed: 4 additions & 2 deletions
diff --git a/examples/llm_qat/configs/train/qat_nvfp4.yaml b/examples/llm_qat/configs/train/qat_nvfp4.yaml
Lines changed: 4 additions & 2 deletions
diff --git a/examples/llm_qat/configs/train/qlora_nvfp4.yaml b/examples/llm_qat/configs/train/qlora_nvfp4.yaml
Lines changed: 3 additions & 1 deletion
@@ -6,6 +6,9 @@ Changelog
 
 **New Features**
 
+- Add model-agnostic `Liger kernel <https://github.com/linkedin/Liger-Kernel>`_ fused loss support in ``ModelOptHFTrainer`` for any HuggingFace causal LM, with distributed parameter gathering for FSDP2, DeepSpeed ZeRO-3, and DDP. This extends HuggingFace's built-in Liger integration, which is limited to `a fixed set of model architectures <https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/transformers/monkey_patch.py>`_, FSDP only, and CrossEntropy loss. ModelOpt additionally supports the Liger fused KD loss (JSD) for knowledge distillation.
+- Add ``ModelOptTrainerArguments`` to ``ModelOptHFTrainer`` with ``--trainable_params``, ``--frozen_params``, ``--lr_config``, ``--save_dtype``, and ``--manual_gc`` flags. Add per-parameter learning rate support via YAML config.
+- Simplify ``KDTrainer`` for HuggingFace knowledge distillation: remove ``mtd.convert()`` class-swap in favor of explicit teacher forwarding with logit-level distillation support.
 - Support full Transformer Engine spec for Minitron pruning (``mcore_minitron``). A custom ModelOpt spec is no longer needed. This does not change how the pruning workflow is used, but it makes pruning slightly faster and may produce a slightly different pruned model because of different kernels and numerics.
 - Added iterator interface using CalibrationDataReader in ONNX quantization workflow.
 - Add N:M sparse softmax support to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
 
@@ -27,7 +27,7 @@
 from trl import SFTTrainer
 
 import modelopt.torch.opt as mto
-from modelopt.torch.distill.plugins.huggingface import KDTrainer, LMLogitsLoss
+from modelopt.torch.distill.plugins.huggingface import KDTrainer
 
 logger = get_logger(__name__, log_level="INFO")
 
@@ -115,12 +115,6 @@ def train():
         model_args.teacher_name_or_path, dtype=torch.bfloat16 if training_args.bf16 else None
     )
 
-    # Distillation configuration
-    kd_config = {
-        "teacher_model": teacher_model,
-        "criterion": LMLogitsLoss(),
-    }
-
     # Fix problematic settings that trigger excessive warnings
     model.generation_config.temperature = None
     model.generation_config.top_p = None
@@ -129,7 +123,7 @@ def train():
     trainer = KDSFTTrainer(
         model,
         training_args,
-        distill_config=kd_config,
+        distill_args={"teacher_model": teacher_model},
         train_dataset=dset_train,
         eval_dataset=dset_eval,
         formatting_func=lambda sample: _format_smoltalk_chat_template(sample, tokenizer),
 
@@ -7,8 +7,10 @@ _Auto-generated — do not edit by hand._
 | Argument | Type | Default | Description |
 |----------|------|---------|-------------|
 | `--distill` | `bool` | `False` | Enable training with knowledge distillation. |
-| `--teacher_model` | `str` | `None` | The name or path of the teacher model to use for distillation. |
+| `--teacher_model` | `str` | `None` | The name or path of the teacher model. |
 | `--criterion` | `str` | `"logits_loss"` | Distillation loss criterion. Currently only 'logits_loss' is supported. |
+| `--temperature` | `float` | `1.0` | Softmax temperature for softening logits in KD loss. Used by both standard and Liger KD loss. |
+| `--liger_jsd_beta` | `float` | `0.0` | JSD beta coefficient in [0, 1]. 0=forward KL, 1=reverse KL. Only used when --use_liger_kernel is enabled. |
 
 ## DataArguments
 
@@ -27,8 +29,9 @@ _Auto-generated — do not edit by hand._
 
 | Argument | Type | Default | Description |
 |----------|------|---------|-------------|
-| `--model_name_or_path` | `str` | `"meta-llama/Llama-2-7b-hf"` |  |
-| `--model_max_length` | `int` | `4096` | Maximum sequence length. Sequences will be right padded (and possibly truncated). |
+| `--model_name_or_path` | `str` | `"meta-llama/Llama-2-7b-hf"` | HuggingFace model ID or local path to a pretrained model. |
+| `--model_max_length` | `int` | `8192` | Maximum sequence length. Sequences will be right padded (and possibly truncated). |
+| `--attn_implementation` | `str` | `None` | Attention implementation: 'flash_attention_2', 'flash_attention_3', 'sdpa', or 'eager'. |
 
 ## QuantizeArguments
 
@@ -46,5 +49,10 @@ Extends [HuggingFace TrainingArguments](https://huggingface.co/docs/transformers
 
 | Argument | Type | Default | Description |
 |----------|------|---------|-------------|
-| `--cache_dir` | `str` | `None` |  |
+| `--trainable_params` | `list[str]` | `None` | Glob patterns (fnmatch) for parameters that should be trainable. All other parameters will be frozen. Mutually exclusive with frozen_params. |
+| `--frozen_params` | `list[str]` | `None` | Glob patterns (fnmatch) for parameters that should be frozen. Mutually exclusive with trainable_params. |
+| `--lr_config` | `str` | `None` | Path to a YAML file mapping fnmatch patterns to optimizer kwargs (e.g. lr, weight_decay). First matching pattern wins per parameter. See examples/llm_qat/configs/train/lr_config_example.yaml. |
+| `--save_dtype` | `str` | `"bfloat16"` | Dtype string to write into the saved model's config.json (e.g. 'bfloat16', 'float16'). Defaults to 'bfloat16'. |
+| `--manual_gc` | `bool` | `False` | Run `gc.collect()` before each training/prediction step to work around GPU memory leaks during QAT/distillation. |
+| `--liger_ce_label_smoothing` | `float` | `0.0` | Label smoothing for Liger fused CE loss. Only used when --use_liger_kernel is enabled. |
 | `--lora` | `bool` | `False` | Whether to add LoRA (Low-Rank Adaptation) adapter before training. When using real quantization, the LoRA adapter must be set, as quantized weights will be frozen during training. |
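
The `--trainable_params` and `--frozen_params` rows above describe fnmatch glob selection with mutual exclusivity. The sketch below shows one way such a rule could be evaluated over parameter names. It is an illustration under stated assumptions, not the ModelOpt code: `select_trainable` is a hypothetical helper, the parameter names are made up in the style of a HuggingFace causal LM, and `fnmatchcase` is used so matching stays case-sensitive on every platform.

```python
from fnmatch import fnmatchcase


def select_trainable(param_names, trainable_params=None, frozen_params=None):
    """Return {name: requires_grad} for each parameter name (hypothetical helper)."""
    if trainable_params and frozen_params:
        raise ValueError("trainable_params and frozen_params are mutually exclusive")
    if trainable_params:
        # Only parameters matching a pattern stay trainable; the rest are frozen.
        return {
            name: any(fnmatchcase(name, pat) for pat in trainable_params)
            for name in param_names
        }
    if frozen_params:
        # Parameters matching a pattern are frozen; the rest stay trainable.
        return {
            name: not any(fnmatchcase(name, pat) for pat in frozen_params)
            for name in param_names
        }
    return {name: True for name in param_names}


# Made-up parameter names for illustration.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "lm_head.weight",
]

only_attn_and_head = select_trainable(names, trainable_params=["*self_attn*", "*lm_head*"])
freeze_embeddings = select_trainable(names, frozen_params=["*embed_tokens*"])
```

With a real model, the resulting flags would be applied by setting `requires_grad` on each entry of `model.named_parameters()`.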
@@ -140,22 +140,23 @@ trainer.train()
 trainer.save_model()
 ```
 
-`QADTrainer` extends `QATTrainer` with distillation:
+`QADTrainer` extends `QATTrainer` with distillation. Pass the teacher model and a `DistillArguments` instance:
 
 ```python
-from modelopt.torch.distill.plugins.huggingface import LMLogitsLoss
+from modelopt.torch.distill.plugins.huggingface import DistillArguments
 from modelopt.torch.quantization.plugins.transformers_trainer import QADTrainer
 
-distill_config = {
-    "teacher_model": teacher_model,
-    "criterion": LMLogitsLoss(),
-}
+distill_args = DistillArguments(
+    distill=True,
+    teacher_model="Qwen/Qwen3-8B",
+    criterion="logits_loss",
+)
 
 trainer = QADTrainer(
     model=model,            # pre-quantized model
     processing_class=tokenizer,
     args=training_args,
-    distill_config=distill_config,
+    distill_args=distill_args,
     **data_module,
 )
 trainer.train()
 
@@ -19,19 +19,31 @@
 
 import transformers
 
-from modelopt.torch.opt.plugins.transformers import ModelOptHFArguments
+from modelopt.torch.opt.plugins.transformers import ModelOptHFArguments, ModelOptTrainerArguments
 
 
 class ModelArguments(ModelOptHFArguments):
-    model_name_or_path: str = field(default="meta-llama/Llama-2-7b-hf")
+    model_name_or_path: str = field(
+        default="meta-llama/Llama-2-7b-hf",
+        metadata={"help": "HuggingFace model ID or local path to a pretrained model."},
+    )
     model_max_length: int = field(
-        default=4096,
+        default=8192,
         metadata={
             "help": (
                 "Maximum sequence length. Sequences will be right padded (and possibly truncated)."
             )
         },
     )
+    attn_implementation: str | None = field(
+        default=None,
+        metadata={
+            "help": (
+                "Attention implementation: 'flash_attention_2', 'flash_attention_3', "
+                "'sdpa', or 'eager'."
+            )
+        },
+    )
 
 
 class DataArguments(ModelOptHFArguments):
@@ -69,10 +81,13 @@ class DataArguments(ModelOptHFArguments):
     )
 
 
-class TrainingArguments(ModelOptHFArguments, transformers.TrainingArguments):
-    cache_dir: str | None = field(default=None)
+class TrainingArguments(ModelOptTrainerArguments, transformers.TrainingArguments):
     dataloader_drop_last: bool = field(default=True)
     bf16: bool = field(default=True)
+    use_liger_kernel: bool = field(
+        default=True,
+        metadata={"help": "Use Liger kernel for fused loss computation. Reduces memory usage."},
+    )
     lora: bool = field(
         default=False,
         metadata={
 
@@ -15,10 +15,10 @@ learning_rate: 1e-5
 per_device_train_batch_size: 2
 per_device_eval_batch_size: 2
 gradient_accumulation_steps: 2
-model_max_length: 4096
+model_max_length: 8192
 warmup_ratio: 0.05
 lr_scheduler_type: cosine
-gradient_checkpointing: true
+use_liger_kernel: true
 seed: 42
 
 # Evaluation
 
@@ -0,0 +1,39 @@
+# Per-parameter optimizer config example
+#
+# Maps fnmatch glob patterns to optimizer kwargs (lr, weight_decay, betas,
+# eps, etc.).  First matching pattern wins per parameter.  Parameters not
+# matching any pattern use the global values from the train config.
+#
+# Any keyword accepted by the optimizer constructor can be specified here.
+# Common kwargs for AdamW:
+#   lr           - learning rate
+#   weight_decay - L2 penalty (overrides the global --weight_decay)
+#   betas        - Adam momentum coefficients [beta1, beta2]
+#   eps          - term added to denominator for numerical stability
+#
+# Usage:
+#   --lr_config configs/train/lr_config_example.yaml
+#
+# Tip: use `model.named_parameters()` to find the exact parameter names
+# for your model.
+
+# Output head — lower LR, no weight decay
+"*lm_head*":
+  lr: 1e-5
+  weight_decay: 0.0
+
+# Attention layers — custom LR + more aggressive momentum
+"*self_attn*":
+  lr: 5e-5
+  betas: [0.9, 0.95]
+
+# MLP layers — custom LR + higher weight decay
+"*mlp*":
+  lr: 5e-5
+  weight_decay: 0.05
+
+# Embedding layers (often kept at a lower LR or frozen)
+"*embed_tokens*":
+  lr: 1e-6
+  weight_decay: 0.0
+  eps: 1e-7
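
The comments in this config state that the first matching pattern wins and that unmatched parameters fall back to the global values. A minimal sketch of that resolution rule is shown below, assuming the YAML has already been loaded into an ordered mapping. `resolve_optimizer_kwargs`, the `defaults` values, and the parameter names are hypothetical and are not taken from ModelOpt.

```python
from fnmatch import fnmatchcase

# The example config above, written as the dict a YAML loader would produce.
lr_config = {
    "*lm_head*": {"lr": 1e-5, "weight_decay": 0.0},
    "*self_attn*": {"lr": 5e-5, "betas": [0.9, 0.95]},
    "*mlp*": {"lr": 5e-5, "weight_decay": 0.05},
    "*embed_tokens*": {"lr": 1e-6, "weight_decay": 0.0, "eps": 1e-7},
}

# Assumed global values standing in for the train config.
defaults = {"lr": 1e-5, "weight_decay": 0.01}


def resolve_optimizer_kwargs(name, config, global_kwargs):
    """Return optimizer kwargs for one parameter name (hypothetical helper)."""
    # Dicts preserve insertion order, so patterns are tried top to bottom.
    for pattern, overrides in config.items():
        if fnmatchcase(name, pattern):
            # First match wins: later patterns are never consulted.
            return {**global_kwargs, **overrides}
    return dict(global_kwargs)


attn = resolve_optimizer_kwargs("model.layers.0.self_attn.q_proj.weight", lr_config, defaults)
norm = resolve_optimizer_kwargs("model.norm.weight", lr_config, defaults)
# A made-up name that matches both "*lm_head*" and "*mlp*".
both = resolve_optimizer_kwargs("lm_head.mlp_adapter.weight", lr_config, defaults)
```

Because only the first matching pattern applies, more specific patterns belong above more general ones in the YAML file.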
@@ -3,6 +3,7 @@
 # Model
 model_name_or_path:  # e.g., Qwen/Qwen3-8B
 output_dir:  # e.g., qwen3-8b-qad-nvfp4
+attn_implementation: flash_attention_2
 
 # Quantization
 recipe: general/ptq/nvfp4_default-fp8_kv
@@ -22,10 +23,11 @@ learning_rate: 1e-5
 per_device_train_batch_size: 2
 per_device_eval_batch_size: 2
 gradient_accumulation_steps: 2
-model_max_length: 4096
+model_max_length: 8192
 warmup_ratio: 0.05
 lr_scheduler_type: cosine
-gradient_checkpointing: true
+use_liger_kernel: true
+manual_gc: true
 seed: 42
 do_train: true
 do_eval: true
 
@@ -3,6 +3,7 @@
 # Model
 model_name_or_path:  # e.g., Qwen/Qwen3-8B
 output_dir:  # e.g., qwen3-8b-qat-nvfp4
+attn_implementation: flash_attention_2
 
 # Quantization
 recipe: general/ptq/nvfp4_default-fp8_kv
@@ -18,10 +19,11 @@ learning_rate: 1e-5
 per_device_train_batch_size: 2
 per_device_eval_batch_size: 2
 gradient_accumulation_steps: 2
-model_max_length: 4096
+model_max_length: 8192
 warmup_ratio: 0.05
 lr_scheduler_type: cosine
-gradient_checkpointing: true
+use_liger_kernel: true
+manual_gc: true
 seed: 42
 do_train: true
 do_eval: true
 
@@ -22,10 +22,12 @@ learning_rate: 1e-3
 per_device_train_batch_size: 2
 per_device_eval_batch_size: 2
 gradient_accumulation_steps: 2
-model_max_length: 4096
+model_max_length: 8192
 warmup_ratio: 0.05
 lr_scheduler_type: cosine
 gradient_checkpointing: true
+use_liger_kernel: true
+manual_gc: true
 seed: 42
 do_train: true
 do_eval: true