Commit 4a997bd
fix: MoE gate bias defaults and configurable gate_bias_update_factor (NVIDIA-NeMo#1768)
* fix: lower gate_bias_update_factor default to 1e-5 and fix MiniMax-M2 aux loss
Change default gate_bias_update_factor from 0.001 to 1e-5 across all MoE
models to be more conservative and avoid over-correction of expert routing.
For MiniMax-M2 specifically:
- Set aux_loss_coeff=0 explicitly since the model uses e_score_correction_bias
for load balancing, not auxiliary loss. The HF config class defaults
router_aux_loss_coef to 0.001 which is incorrect for this model.
- Make gate_bias_update_factor configurable via HF config attribute with
a default of 1e-5.
To configure gate_bias_update_factor for MiniMax-M2, set it on the HF config:
    config.gate_bias_update_factor = <value>
For other models, pass a custom MoEConfig with the desired value.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
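To make concrete what gate_bias_update_factor scales, here is a minimal sketch of a bias-based load-balancing step. It assumes a sign-based update in the style of auxiliary-loss-free balancing, where under-loaded experts get a slightly higher routing bias and over-loaded experts a slightly lower one. The function name `update_gate_bias` and the plain-list representation are hypothetical and are not taken from the NeMo AutoModel code, whose actual update rule may differ.

```python
def update_gate_bias(bias, tokens_per_expert, update_factor):
    """Nudge each expert's routing bias toward a balanced load.

    Hypothetical helper: assumes bias += update_factor * sign(mean_load - load).
    """
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    new_bias = []
    for b, load in zip(bias, tokens_per_expert):
        diff = mean_load - load
        # sign is +1 for under-loaded, -1 for over-loaded, 0 for balanced experts
        sign = (diff > 0) - (diff < 0)
        new_bias.append(b + update_factor * sign)
    return new_bias


# Three experts, the first under-loaded and the last over-loaded.
bias = update_gate_bias([0.0, 0.0, 0.0], tokens_per_expert=[10, 20, 30], update_factor=1e-3)
print(bias)
```

Under this assumption the factor is the fixed step size of each correction, so a smaller value moves the bias more slowly per training step.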
* fix: make gate_bias_update_factor configurable via model kwargs
- Pass **kwargs from MiniMaxM2ForCausalLM through to MiniMaxM2Model so
gate_bias_update_factor can be set via from_config(config, gate_bias_update_factor=...)
instead of relying on HF config getattr which MiniMaxM2Config drops.
- Update MiniMax-M2 test fixtures from 0.0 to 1e-5 to match new default.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
* feat: make gate_bias_update_factor configurable via kwargs for all MoE models
Plumb **kwargs from outer ForCausalLM classes through to inner model
constructors so gate_bias_update_factor can be set at model creation time:
    NeMoAutoModelForCausalLM.from_pretrained("model", gate_bias_update_factor=1e-4)
Each model preserves its original default (1e-5 for models with
e_score_correction_bias, 0.0 for others).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
* fix: revert kwargs configurability for models without e_score_correction_bias
Models that don't use e_score_correction_bias (qwen3_moe, qwen3_5_moe,
qwen3_next, qwen3_omni_moe, qwen3_vl_moe, gemma4_moe, nemotron_v3,
step3p5, gpt_oss) should keep gate_bias_update_factor=0.0 hardcoded
since configuring it has no effect without the bias buffer.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
* fix: restore gate_bias_update_factor default to 1e-3 to preserve behavior
Change default from 1e-5 back to 1e-3 (0.001) to match the original
codebase value and avoid changing training behavior.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
* fix: set train_gate=True for NemotronV3 MoE config
NemotronV3 has e_score_correction_bias in its checkpoint and needs
train_gate=True for proper gate bias updates during training.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
* fix: use explicit keyword arg for gate_bias_update_factor, add NemotronV3
- Replace kwargs.get("gate_bias_update_factor", ...) with an explicit
keyword-only argument on inner model __init__ signatures. This catches
typos at call time with TypeError instead of silently using the default.
- Outer ForCausalLM classes pop the kwarg and pass it explicitly.
- Add gate_bias_update_factor configurability to NemotronV3 (default 0.0)
since it has force_e_score_correction_bias=True.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
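The difference between reading the option with kwargs.get and declaring it as a keyword-only argument can be sketched as follows. Both classes are hypothetical stand-ins, not the real model classes; only the argument-handling pattern is illustrated.

```python
class InnerModelWithKwargsGet:
    """Earlier pattern: the option is looked up inside **kwargs."""

    def __init__(self, config, **kwargs):
        # A misspelled key is never looked up, so the default is used silently.
        self.gate_bias_update_factor = kwargs.get("gate_bias_update_factor", 0.0)


class InnerModelWithKeywordOnly:
    """Later pattern: the option is an explicit keyword-only argument."""

    def __init__(self, config, *, gate_bias_update_factor=0.0):
        self.gate_bias_update_factor = gate_bias_update_factor


# Note the typo: "facter" instead of "factor".
silent = InnerModelWithKwargsGet(None, gate_bias_update_facter=1e-4)
print(silent.gate_bias_update_factor)  # the intended 1e-4 was dropped

caught = False
try:
    InnerModelWithKeywordOnly(None, gate_bias_update_facter=1e-4)
except TypeError as exc:
    caught = True
    print(f"TypeError: {exc}")
```

With the keyword-only signature, Python itself rejects the unknown name at call time, so no extra validation code is needed in the constructor.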
* fix: ruff formatting and update NemotronV3 test fixtures for train_gate=True
- Run ruff format on 5 model files that failed CI linting.
- Update NemotronV3 test fixtures from train_gate=False to True to match
the model change.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
* feat: replace explicit gate_bias_update_factor with moe_overrides dict
Replace the per-model gate_bias_update_factor keyword argument with a
generic moe_overrides dict that merges into the default MoEConfig. This
scales cleanly for future MoE knobs without adding args to every model.
Usage:
    model = NeMoAutoModelForCausalLM.from_pretrained(
        "model_path",
        moe_overrides={"gate_bias_update_factor": 1e-4},
    )
Applied to all 16 MoE models. Each model's defaults are preserved.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
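A minimal sketch of how an overrides dict might merge into a model's default config, assuming the config is a dataclass. `MoEConfigSketch` and `build_moe_config` are hypothetical names, and the default values shown are placeholders rather than the defaults of any particular model; only the field names come from this commit message.

```python
from dataclasses import dataclass, replace


@dataclass
class MoEConfigSketch:
    # Placeholder defaults for illustration only.
    score_func: str = "sigmoid"
    aux_loss_coeff: float = 0.0
    gate_bias_update_factor: float = 1e-3
    train_gate: bool = True


def build_moe_config(default, moe_overrides=None):
    """Return the default config with any overrides applied on top."""
    if not moe_overrides:
        return default
    # dataclasses.replace copies the instance and swaps in the given fields;
    # an unknown field name raises TypeError instead of being ignored.
    return replace(default, **moe_overrides)


cfg = build_moe_config(MoEConfigSketch(), {"gate_bias_update_factor": 1e-4})
print(cfg)
```

Fields that are not named in the dict keep the model's defaults, so a new MoE knob only needs a new config field rather than a new constructor argument on every model.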
* fix: validate moe_config/moe_overrides mutual exclusivity and add type annotations
- Raise ValueError when both moe_config and moe_overrides are passed to
prevent silent discard of overrides.
- Add consistent `dict | None = None` type annotation for moe_overrides
across all 16 MoE models.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
* fix: forward moe_config from NemotronHForCausalLM to NemotronV3Model
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
* test: add tests for moe_overrides code path
Cover the shared moe_overrides logic via MiniMax-M2:
- overrides merge into default MoEConfig
- unspecified fields preserve defaults
- passing both moe_config and moe_overrides raises ValueError
- overrides flow through ForCausalLM to inner model
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
* docs: add MoE defaults table for all custom models
Table showing score_func, aux_loss_coeff, gate_bias_update_factor, and
e_score_correction_bias status for each MoE model.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
---------
Signed-off-by: hemildesai <hemild@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent aa9574c commit 4a997bd
File tree
31 files changed, +367 -78 lines changed
- nemo_automodel/components/models
  - deepseek_v32
  - deepseek_v3
  - gemma4_moe
  - glm4_moe_lite
  - glm4_moe
  - glm_moe_dsa
  - gpt_oss
  - minimax_m2
  - mistral4
  - nemotron_v3
  - qwen3_5_moe
  - qwen3_moe
  - qwen3_next
  - qwen3_omni_moe
  - qwen3_vl_moe
  - step3p5
- skills/model-onboarding
- tests
  - functional_tests/hf_peft
  - unit_tests/models
    - deepseek_v32
    - glm4_moe_lite
    - glm4_moe
    - glm_moe_dsa
    - kimivl
    - minimax_m2
    - mistral4
    - nemotron_v3