
Commit 4a997bd

hemildesai and claude authored and committed
fix: MoE gate bias defaults and configurable gate_bias_update_factor (NVIDIA-NeMo#1768)
* fix: lower gate_bias_update_factor default to 1e-5 and fix MiniMax-M2 aux loss

  Change the default gate_bias_update_factor from 0.001 to 1e-5 across all MoE models to be more conservative and avoid over-correcting expert routing.

  For MiniMax-M2 specifically:
  - Set aux_loss_coeff=0 explicitly, since the model uses e_score_correction_bias for load balancing, not an auxiliary loss. The HF config class defaults router_aux_loss_coef to 0.001, which is incorrect for this model.
  - Make gate_bias_update_factor configurable via an HF config attribute, with a default of 1e-5.

  To configure gate_bias_update_factor for MiniMax-M2, set it on the HF config: config.gate_bias_update_factor = <value>. For other models, pass a custom MoEConfig with the desired value.

* fix: make gate_bias_update_factor configurable via model kwargs

  - Pass **kwargs from MiniMaxM2ForCausalLM through to MiniMaxM2Model so gate_bias_update_factor can be set via from_config(config, gate_bias_update_factor=...) instead of relying on an HF config getattr, since MiniMaxM2Config drops the attribute.
  - Update the MiniMax-M2 test fixtures from 0.0 to 1e-5 to match the new default.

* feat: make gate_bias_update_factor configurable via kwargs for all MoE models

  Plumb **kwargs from the outer ForCausalLM classes through to the inner model constructors so gate_bias_update_factor can be set at model creation time:

      NeMoAutoModelForCausalLM.from_pretrained("model", gate_bias_update_factor=1e-4)

  Each model preserves its original default (1e-5 for models with e_score_correction_bias, 0.0 for the rest).

* fix: revert kwargs configurability for models without e_score_correction_bias

  Models that don't use e_score_correction_bias (qwen3_moe, qwen3_5_moe, qwen3_next, qwen3_omni_moe, qwen3_vl_moe, gemma4_moe, nemotron_v3, step3p5, gpt_oss) keep gate_bias_update_factor=0.0 hardcoded, since configuring it has no effect without the bias buffer.

* fix: restore the gate_bias_update_factor default to 1e-3 to preserve behavior

  Change the default from 1e-5 back to 1e-3 (0.001) to match the original codebase value and avoid changing training behavior.

* fix: set train_gate=True for the NemotronV3 MoE config

  NemotronV3 has e_score_correction_bias in its checkpoint and needs train_gate=True for proper gate-bias updates during training.

* fix: use an explicit keyword arg for gate_bias_update_factor; add NemotronV3

  - Replace kwargs.get("gate_bias_update_factor", ...) with an explicit keyword-only argument on the inner model __init__ signatures. This catches typos at call time with a TypeError instead of silently using the default.
  - The outer ForCausalLM classes pop the kwarg and pass it explicitly.
  - Add gate_bias_update_factor configurability to NemotronV3 (default 0.0), since it has force_e_score_correction_bias=True.

* fix: ruff formatting and update NemotronV3 test fixtures for train_gate=True

  - Run ruff format on the 5 model files that failed CI linting.
  - Update the NemotronV3 test fixtures from train_gate=False to True to match the model change.

* feat: replace explicit gate_bias_update_factor with a moe_overrides dict

  Replace the per-model gate_bias_update_factor keyword argument with a generic moe_overrides dict that merges into the default MoEConfig. This scales cleanly to future MoE knobs without adding arguments to every model. Usage:

      model = NeMoAutoModelForCausalLM.from_pretrained(
          "model_path",
          moe_overrides={"gate_bias_update_factor": 1e-4},
      )

  Applied to all 16 MoE models; each model's defaults are preserved.

* fix: validate moe_config/moe_overrides mutual exclusivity and add type annotations

  - Raise ValueError when both moe_config and moe_overrides are passed, to prevent silently discarding the overrides.
  - Add a consistent `dict | None = None` type annotation for moe_overrides across all 16 MoE models.

* fix: forward moe_config from NemotronHForCausalLM to NemotronV3Model

* test: add tests for the moe_overrides code path

  Cover the shared moe_overrides logic via MiniMax-M2:
  - overrides merge into the default MoEConfig
  - unspecified fields preserve their defaults
  - passing both moe_config and moe_overrides raises ValueError
  - overrides flow through ForCausalLM to the inner model

* docs: add a MoE defaults table for all custom models

  Add a table showing score_func, aux_loss_coeff, gate_bias_update_factor, and e_score_correction_bias status for each MoE model.

---------

Signed-off-by: hemildesai <hemild@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
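
For reference, a short usage sketch of the final moe_overrides API described above. The top-level import is an assumption about the package's public entry point; "model_path" is the placeholder id used in the commit message:

    from nemo_automodel import NeMoAutoModelForCausalLM  # import path assumed

    # moe_overrides merges into the model's default MoEConfig; any field not
    # listed here keeps its per-model default.
    model = NeMoAutoModelForCausalLM.from_pretrained(
        "model_path",  # placeholder checkpoint id, as in the commit message
        moe_overrides={"gate_bias_update_factor": 1e-4},
    )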
1 parent aa9574c commit 4a997bd

File tree: 31 files changed (+367, -78 lines)


nemo_automodel/components/models/deepseek_v3/model.py

Lines changed: 15 additions & 3 deletions
@@ -120,11 +120,14 @@ def __init__(
         backend: BackendConfig,
         *,
         moe_config: MoEConfig | None = None,
+        moe_overrides: dict | None = None,
     ):
         super().__init__()
         self.backend = backend
         self.config = config
-        self.moe_config = moe_config or MoEConfig(
+        if moe_config is not None and moe_overrides is not None:
+            raise ValueError("Cannot pass both moe_config and moe_overrides; use one or the other.")
+        moe_defaults = dict(
             dim=config.hidden_size,
             inter_dim=config.intermediate_size,
             moe_inter_dim=config.moe_intermediate_size,
@@ -134,12 +137,15 @@ def __init__(
             n_expert_groups=config.n_group,
             n_limited_groups=config.topk_group,
             train_gate=True,
-            gate_bias_update_factor=0.001,
+            gate_bias_update_factor=1e-3,
             score_func="sigmoid",
             route_scale=config.routed_scaling_factor,
             aux_loss_coeff=0,
             norm_topk_prob=config.norm_topk_prob,
         )
+        if moe_overrides:
+            moe_defaults.update(moe_overrides)
+        self.moe_config = moe_config or MoEConfig(**moe_defaults)
         self.embed_tokens = nn.Embedding(
             config.vocab_size, config.hidden_size, dtype=get_dtype(config.torch_dtype, torch.bfloat16)
         )
@@ -269,7 +275,13 @@ def __init__(
         super().__init__()
         self.config = config
         self.backend = backend or BackendConfig()
-        self.model = DeepseekV3Model(config, backend=self.backend, moe_config=moe_config)
+        moe_overrides = kwargs.pop("moe_overrides", None)
+        self.model = DeepseekV3Model(
+            config,
+            backend=self.backend,
+            moe_config=moe_config,
+            moe_overrides=moe_overrides,
+        )
         self.lm_head = initialize_linear_module(self.backend.linear, config.hidden_size, config.vocab_size, bias=False)
         if self.backend.enable_hf_state_dict_adapter:
             self.state_dict_adapter = DeepSeekV3StateDictAdapter(
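
The same defaults-plus-overrides merge repeats in every model below. A minimal, self-contained sketch of the semantics, using a hypothetical stand-in for MoEConfig rather than the library class: overrides replace matching defaults, and any field left out keeps its per-model value.

    from dataclasses import dataclass

    @dataclass
    class MoEConfigSketch:  # hypothetical stand-in for MoEConfig
        gate_bias_update_factor: float = 1e-3
        aux_loss_coeff: float = 0.0

    def build(moe_config=None, moe_overrides=None):
        # Same guard and merge as in DeepseekV3Model.__init__ above.
        if moe_config is not None and moe_overrides is not None:
            raise ValueError("Cannot pass both moe_config and moe_overrides; use one or the other.")
        moe_defaults = dict(gate_bias_update_factor=1e-3, aux_loss_coeff=0.0)
        if moe_overrides:
            moe_defaults.update(moe_overrides)  # overrides win over defaults
        return moe_config or MoEConfigSketch(**moe_defaults)

    cfg = build(moe_overrides={"gate_bias_update_factor": 1e-4})
    assert cfg.gate_bias_update_factor == 1e-4  # overridden
    assert cfg.aux_loss_coeff == 0.0  # unspecified field keeps its default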

nemo_automodel/components/models/deepseek_v32/model.py

Lines changed: 15 additions & 3 deletions
@@ -83,13 +83,16 @@ def __init__(
         backend: BackendConfig,
         *,
         moe_config: MoEConfig | None = None,
+        moe_overrides: dict | None = None,
     ):
         # Call grandparent __init__ to skip DeepseekV3Model's __init__
         nn.Module.__init__(self)

         self.backend = backend
         self.config = config
-        self.moe_config = moe_config or MoEConfig(
+        if moe_config is not None and moe_overrides is not None:
+            raise ValueError("Cannot pass both moe_config and moe_overrides; use one or the other.")
+        moe_defaults = dict(
             dim=config.hidden_size,
             inter_dim=config.intermediate_size,
             moe_inter_dim=config.moe_intermediate_size,
@@ -99,12 +102,15 @@ def __init__(
             n_expert_groups=config.n_group,
             n_limited_groups=config.topk_group,
             train_gate=True,
-            gate_bias_update_factor=0.001,
+            gate_bias_update_factor=1e-3,
             score_func="sigmoid",
             route_scale=config.routed_scaling_factor,
             aux_loss_coeff=0,
             norm_topk_prob=config.norm_topk_prob,
         )
+        if moe_overrides:
+            moe_defaults.update(moe_overrides)
+        self.moe_config = moe_config or MoEConfig(**moe_defaults)

         self.embed_tokens = nn.Embedding(
             config.vocab_size, config.hidden_size, dtype=get_dtype(config.torch_dtype, torch.bfloat16)
@@ -170,7 +176,13 @@ def __init__(
         self.config = config
         self.backend = backend or BackendConfig()
         # Use V3.2 Model instead of V3 Model
-        self.model = DeepseekV32Model(config, backend=self.backend, moe_config=moe_config)
+        moe_overrides = kwargs.pop("moe_overrides", None)
+        self.model = DeepseekV32Model(
+            config,
+            backend=self.backend,
+            moe_config=moe_config,
+            moe_overrides=moe_overrides,
+        )
         self.lm_head = initialize_linear_module(self.backend.linear, config.hidden_size, config.vocab_size, bias=False)
         if self.backend.enable_hf_state_dict_adapter:
             # Use V3.2 adapter instead of V3 adapter

nemo_automodel/components/models/gemma4_moe/model.py

Lines changed: 9 additions & 1 deletion
@@ -248,15 +248,18 @@ def __init__(
         backend: BackendConfig,
         *,
         moe_config: MoEConfig | None = None,
+        moe_overrides: dict | None = None,
     ):
         super().__init__()
         self.backend = backend
         self.config = config
+        if moe_config is not None and moe_overrides is not None:
+            raise ValueError("Cannot pass both moe_config and moe_overrides; use one or the other.")

         self.padding_idx = getattr(config, "pad_token_id", None)
         self.vocab_size = config.vocab_size

-        self.moe_config = moe_config or MoEConfig(
+        moe_defaults = dict(
             dim=config.hidden_size,
             inter_dim=config.intermediate_size,
             moe_inter_dim=config.expert_intermediate_size or getattr(config, "moe_intermediate_size", None),
@@ -274,6 +277,9 @@ def __init__(
             expert_activation="geglu",
             softmax_before_topk=False,
         )
+        if moe_overrides:
+            moe_defaults.update(moe_overrides)
+        self.moe_config = moe_config or MoEConfig(**moe_defaults)

         get_dtype(getattr(config, "torch_dtype", None), torch.bfloat16)
         self.embed_tokens = Gemma4TextScaledWordEmbedding(
@@ -452,11 +458,13 @@ def __init__(
             return

         # --- MoE path: replace the text model ---
+        moe_overrides = kwargs.pop("moe_overrides", None)
         self.model.__class__ = Gemma4MoEModel
         self.model.language_model = Gemma4MoETextModelBackend(
             text_config,
             backend=self.backend,
             moe_config=moe_config,
+            moe_overrides=moe_overrides,
         )

         # Expose moe_config for the MoE parallelizer assertion
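
The self.model.__class__ = Gemma4MoEModel line above rebinds an already-constructed instance to a subclass in place, keeping its attributes while changing method resolution. A toy sketch of the mechanism (illustrative classes, not the library's):

    class DenseSketch:
        def kind(self):
            return "dense"

    class MoESketch(DenseSketch):
        def kind(self):
            return "moe"

    m = DenseSketch()
    m.__class__ = MoESketch  # rebind the live instance; attributes are kept
    assert m.kind() == "moe"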

nemo_automodel/components/models/glm4_moe/model.py

Lines changed: 22 additions & 4 deletions
@@ -93,17 +93,26 @@ def init_weights(self, buffer_device: torch.device):


 class Glm4MoeModel(nn.Module):
-    def __init__(self, config: Glm4MoeConfig, backend: BackendConfig, *, moe_config: MoEConfig | None = None):
+    def __init__(
+        self,
+        config: Glm4MoeConfig,
+        backend: BackendConfig,
+        *,
+        moe_config: MoEConfig | None = None,
+        moe_overrides: dict | None = None,
+    ):
         super().__init__()
         self.backend = backend
         self.config = config
+        if moe_config is not None and moe_overrides is not None:
+            raise ValueError("Cannot pass both moe_config and moe_overrides; use one or the other.")

         # Map HF GLM4 MoE config -> our MoE wrapper
         # GLM4 MoE config fields:
         # - hidden_size, intermediate_size, moe_intermediate_size
         # - n_routed_experts, n_shared_experts, num_experts_per_tok
         # - n_group, topk_group, routed_scaling_factor, norm_topk_prob
-        self.moe_config = moe_config or MoEConfig(
+        moe_defaults = dict(
             dim=config.hidden_size,
             inter_dim=config.intermediate_size,
             moe_inter_dim=config.moe_intermediate_size,
@@ -113,7 +122,7 @@ def __init__(self, config: Glm4MoeConfig, backend: BackendConfig, *, moe_config:
             n_expert_groups=config.n_group,
             n_limited_groups=config.topk_group,
             train_gate=True,
-            gate_bias_update_factor=0.001,
+            gate_bias_update_factor=1e-3,
             score_func="sigmoid",  # GLM4 MoE uses sigmoid scoring with groups
             route_scale=config.routed_scaling_factor,
             aux_loss_coeff=0.0,  # GLM4 MoE doesn't use aux loss in the HF implementation
@@ -123,6 +132,9 @@ def __init__(self, config: Glm4MoeConfig, backend: BackendConfig, *, moe_config:
             expert_activation="swiglu",
             softmax_before_topk=False,  # GLM4 uses sigmoid, not softmax
         )
+        if moe_overrides:
+            moe_defaults.update(moe_overrides)
+        self.moe_config = moe_config or MoEConfig(**moe_defaults)

         self.embed_tokens = nn.Embedding(
             config.vocab_size, config.hidden_size, dtype=get_dtype(config.torch_dtype, torch.bfloat16)
@@ -238,7 +250,13 @@ def __init__(
         super().__init__()
         self.config = config
         self.backend = backend or BackendConfig()
-        self.model = Glm4MoeModel(config, backend=self.backend, moe_config=moe_config)
+        moe_overrides = kwargs.pop("moe_overrides", None)
+        self.model = Glm4MoeModel(
+            config,
+            backend=self.backend,
+            moe_config=moe_config,
+            moe_overrides=moe_overrides,
+        )
         self.lm_head = initialize_linear_module(self.backend.linear, config.hidden_size, config.vocab_size, bias=False)
         if self.backend.enable_hf_state_dict_adapter:
             self.state_dict_adapter = Glm4MoeStateDictAdapter(
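
The ForCausalLM change above is the same plumbing every outer wrapper in this commit gets: pop moe_overrides from **kwargs and forward it to the inner model as an explicit keyword. A self-contained sketch with toy stand-ins, not the library classes:

    class InnerModelSketch:  # stand-in for Glm4MoeModel
        def __init__(self, *, moe_overrides=None):
            moe_defaults = {"gate_bias_update_factor": 1e-3}
            if moe_overrides:
                moe_defaults.update(moe_overrides)
            self.moe = moe_defaults

    class OuterForCausalLMSketch:  # stand-in for the outer wrapper
        def __init__(self, **kwargs):
            # Pop before forwarding so the inner constructor receives an
            # explicit keyword argument rather than an opaque kwargs dict.
            moe_overrides = kwargs.pop("moe_overrides", None)
            self.model = InnerModelSketch(moe_overrides=moe_overrides)

    m = OuterForCausalLMSketch(moe_overrides={"gate_bias_update_factor": 1e-4})
    assert m.model.moe["gate_bias_update_factor"] == 1e-4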

nemo_automodel/components/models/glm4_moe_lite/model.py

Lines changed: 22 additions & 4 deletions
@@ -101,13 +101,22 @@ def init_weights(self, buffer_device: torch.device):


 class Glm4MoeLiteModel(nn.Module):
-    def __init__(self, config: Any, backend: BackendConfig, *, moe_config: MoEConfig | None = None):
+    def __init__(
+        self,
+        config: Any,
+        backend: BackendConfig,
+        *,
+        moe_config: MoEConfig | None = None,
+        moe_overrides: dict | None = None,
+    ):
         super().__init__()
         self.backend = backend
         self.config = config
+        if moe_config is not None and moe_overrides is not None:
+            raise ValueError("Cannot pass both moe_config and moe_overrides; use one or the other.")

         # Map config -> MoE wrapper (same as GLM4 MoE)
-        self.moe_config = moe_config or MoEConfig(
+        moe_defaults = dict(
             dim=config.hidden_size,
             inter_dim=config.intermediate_size,
             moe_inter_dim=config.moe_intermediate_size,
@@ -117,7 +126,7 @@ def __init__(self, config: Any, backend: BackendConfig, *, moe_config: MoEConfig
             n_expert_groups=config.n_group,
             n_limited_groups=config.topk_group,
             train_gate=True,
-            gate_bias_update_factor=0.001,
+            gate_bias_update_factor=1e-3,
             score_func="sigmoid",  # GLM4 MoE uses sigmoid scoring with groups
             route_scale=config.routed_scaling_factor,
             aux_loss_coeff=0.0,  # GLM4 MoE doesn't use aux loss in the HF implementation
@@ -127,6 +136,9 @@ def __init__(self, config: Any, backend: BackendConfig, *, moe_config: MoEConfig
             expert_activation="swiglu",
             softmax_before_topk=False,  # GLM4 uses sigmoid, not softmax
         )
+        if moe_overrides:
+            moe_defaults.update(moe_overrides)
+        self.moe_config = moe_config or MoEConfig(**moe_defaults)

         self.embed_tokens = nn.Embedding(
             config.vocab_size, config.hidden_size, dtype=get_dtype(config.torch_dtype, torch.bfloat16)
@@ -239,7 +251,13 @@ def __init__(
         super().__init__()
         self.config = config
         self.backend = backend or BackendConfig()
-        self.model = Glm4MoeLiteModel(config, backend=self.backend, moe_config=moe_config)
+        moe_overrides = kwargs.pop("moe_overrides", None)
+        self.model = Glm4MoeLiteModel(
+            config,
+            backend=self.backend,
+            moe_config=moe_config,
+            moe_overrides=moe_overrides,
+        )
         self.lm_head = initialize_linear_module(self.backend.linear, config.hidden_size, config.vocab_size, bias=False)
         if self.backend.enable_hf_state_dict_adapter:
             self.state_dict_adapter = Glm4MoeStateDictAdapter(

nemo_automodel/components/models/glm_moe_dsa/model.py

Lines changed: 22 additions & 4 deletions
@@ -94,12 +94,21 @@ def init_weights(self, buffer_device: torch.device):


 class GlmMoeDsaModel(nn.Module):
-    def __init__(self, config: GlmMoeDsaConfig, backend: BackendConfig, *, moe_config: MoEConfig | None = None):
+    def __init__(
+        self,
+        config: GlmMoeDsaConfig,
+        backend: BackendConfig,
+        *,
+        moe_config: MoEConfig | None = None,
+        moe_overrides: dict | None = None,
+    ):
         super().__init__()
         self.backend = backend
         self.config = config
+        if moe_config is not None and moe_overrides is not None:
+            raise ValueError("Cannot pass both moe_config and moe_overrides; use one or the other.")

-        self.moe_config = moe_config or MoEConfig(
+        moe_defaults = dict(
             dim=config.hidden_size,
             inter_dim=config.intermediate_size,
             moe_inter_dim=config.moe_intermediate_size,
@@ -109,7 +118,7 @@ def __init__(self, config: GlmMoeDsaConfig, backend: BackendConfig, *, moe_confi
             n_expert_groups=config.n_group,
             n_limited_groups=config.topk_group,
             train_gate=True,
-            gate_bias_update_factor=0.001,
+            gate_bias_update_factor=1e-3,
             score_func="sigmoid",
             route_scale=config.routed_scaling_factor,
             aux_loss_coeff=0.0,
@@ -119,6 +128,9 @@ def __init__(self, config: GlmMoeDsaConfig, backend: BackendConfig, *, moe_confi
             expert_activation="swiglu",
             softmax_before_topk=False,
         )
+        if moe_overrides:
+            moe_defaults.update(moe_overrides)
+        self.moe_config = moe_config or MoEConfig(**moe_defaults)

         self.embed_tokens = nn.Embedding(
             config.vocab_size, config.hidden_size, dtype=get_dtype(config.torch_dtype, torch.bfloat16)
@@ -227,7 +239,13 @@ def __init__(
         super().__init__()
         self.config = config
         self.backend = backend or BackendConfig()
-        self.model = GlmMoeDsaModel(config, backend=self.backend, moe_config=moe_config)
+        moe_overrides = kwargs.pop("moe_overrides", None)
+        self.model = GlmMoeDsaModel(
+            config,
+            backend=self.backend,
+            moe_config=moe_config,
+            moe_overrides=moe_overrides,
+        )
         self.lm_head = initialize_linear_module(self.backend.linear, config.hidden_size, config.vocab_size, bias=False)
         if self.backend.enable_hf_state_dict_adapter:
             self.state_dict_adapter = GlmMoeDsaStateDictAdapter(

nemo_automodel/components/models/gpt_oss/model.py

Lines changed: 16 additions & 3 deletions
@@ -88,12 +88,21 @@ def init_weights(self, buffer_device: torch.device):


 class GptOssModel(nn.Module):
-    def __init__(self, config: GptOssConfig, backend: BackendConfig, *, moe_config: MoEConfig | None = None):
+    def __init__(
+        self,
+        config: GptOssConfig,
+        backend: BackendConfig,
+        *,
+        moe_config: MoEConfig | None = None,
+        moe_overrides: dict | None = None,
+    ):
         super().__init__()
         self.backend = backend
         self.config = config
+        if moe_config is not None and moe_overrides is not None:
+            raise ValueError("Cannot pass both moe_config and moe_overrides; use one or the other.")
         # GPT-OSS is MoE everywhere; set shared experts to 0 to disable shared path in our MoE wrapper.
-        self.moe_config = moe_config or MoEConfig(
+        moe_defaults = dict(
             dim=config.hidden_size,
             inter_dim=config.intermediate_size,
             moe_inter_dim=config.intermediate_size,
@@ -114,6 +123,9 @@ def __init__(self, config: GptOssConfig, backend: BackendConfig, *, moe_config:
             activation_alpha=1.702,
             activation_limit=getattr(config, "swiglu_limit", 7.0),
         )
+        if moe_overrides:
+            moe_defaults.update(moe_overrides)
+        self.moe_config = moe_config or MoEConfig(**moe_defaults)

         self.embed_tokens = nn.Embedding(
             config.vocab_size, config.hidden_size, dtype=get_dtype(config.torch_dtype, torch.bfloat16)
@@ -223,7 +235,8 @@ def __init__(
         super().__init__()
         self.config = config
         self.backend = backend or BackendConfig(attn="flex")
-        self.model = GptOssModel(config, backend=self.backend, moe_config=moe_config)
+        moe_overrides = kwargs.pop("moe_overrides", None)
+        self.model = GptOssModel(config, backend=self.backend, moe_config=moe_config, moe_overrides=moe_overrides)
         self.lm_head = initialize_linear_module(self.backend.linear, config.hidden_size, config.vocab_size, bias=False)
         if self.backend.enable_hf_state_dict_adapter:
             self.state_dict_adapter = GPTOSSStateDictAdapter(
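
Finally, the guard repeated at the top of every constructor in this commit can be exercised in isolation. A hypothetical helper, not a library function, reproduces the behavior the new tests cover:

    def resolve(moe_config=None, moe_overrides=None):
        # Same mutual-exclusivity check as in each model __init__ above.
        if moe_config is not None and moe_overrides is not None:
            raise ValueError("Cannot pass both moe_config and moe_overrides; use one or the other.")
        return moe_config or dict(moe_overrides or {})

    try:
        resolve(moe_config=object(), moe_overrides={"aux_loss_coeff": 0.0})
    except ValueError as err:
        print(err)  # -> Cannot pass both moe_config and moe_overrides; ...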
