Commit 05c6d3b
Add MoE/Nemotron fixes to support Transformers 5.5
Tested with both transformers 4.57 and 5.5.
## Root cause
transformers 5.5 natively supports NemotronHForCausalLM (with `model.`
prefix), but all puzzletron checkpoints use the trust_remote_code class
(with `backbone.` prefix). Additionally, the native NemotronHConfig does
not recognize the `-` pattern character used by NemotronH v2 for MLP layers.
## Fixes
**trust_remote_code model class selection (4 places)**
For trust_remote_code models, always force
`AutoModelForCausalLM.from_config(trust_remote_code=True)` instead of the
native concrete class, which has a different module structure (`backbone.` vs
`model.` prefix). Applied in:
- `sharded_checkpoint_utils.py` create_sharded_model
- `init_child_from_parent.py` (fixes KeyError on backbone.layers.N.mixer.experts keys)
- `checkpoint_utils_hf.py` init_model_from_config (fixes AttributeError in
calc_subblock_params_and_memory)
- `tests/_test_utils/torch/puzzletron/utils.py` create_and_save_small_hf_model
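A minimal sketch of the selection rule described above, not the code from this commit. It assumes a trust_remote_code checkpoint can be recognized by an `auto_map` entry on its config; the helper names `uses_remote_code` and `build_model` are hypothetical.

```python
from types import SimpleNamespace


def uses_remote_code(config) -> bool:
    """Return True when the config maps AutoModelForCausalLM to custom code."""
    auto_map = getattr(config, "auto_map", None) or {}
    return "AutoModelForCausalLM" in auto_map


def build_model(config, native_cls):
    """Instantiate a model, forcing the remote-code class when one is declared."""
    if uses_remote_code(config):
        # Imported lazily so the detection helper works without transformers.
        from transformers import AutoModelForCausalLM

        # The remote class keeps the `backbone.` prefix used by the checkpoints.
        return AutoModelForCausalLM.from_config(config, trust_remote_code=True)
    # No remote code declared: the native concrete class is safe to use.
    return native_cls(config)


# Stand-in configs for illustration only; the auto_map value is made up.
remote_cfg = SimpleNamespace(
    auto_map={"AutoModelForCausalLM": "modeling_nemotron_h.NemotronHForCausalLM"}
)
native_cfg = SimpleNamespace()
```

With this shape, each of the four call sites asks one question (does the config declare remote code?) instead of trusting whichever concrete class the installed transformers version resolves.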
**NemotronH embedding key name (singular vs plural)**
`nemotron_h_model_descriptor.py` layer_name_predicates: make `s` optional
(`backbone\.embeddings?\.weight`) to match both the on-disk singular form
(`backbone.embedding.weight`) produced by transformers 5.5 revert_weight_conversion
and the in-memory plural form.
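The effect of the optional `s` can be checked in isolation. This sketch uses the regex quoted above; the `is_embedding_key` wrapper and the choice of `fullmatch` are assumptions, since the descriptor may apply the pattern differently.

```python
import re

# Pattern from the fix: the trailing "s" of "embeddings" is optional.
EMBEDDING_KEY = re.compile(r"backbone\.embeddings?\.weight")


def is_embedding_key(name: str) -> bool:
    """Return True for both the singular and plural embedding key names."""
    return EMBEDDING_KEY.fullmatch(name) is not None


on_disk = "backbone.embedding.weight"    # singular, after revert_weight_conversion
in_memory = "backbone.embeddings.weight"  # plural, as held by the loaded module
```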
**Test checkpoint save format**
`utils.py` create_and_save_small_hf_model:
- Use `save_pretrained(save_original_format=False)` to skip transformers 5.5
revert_weight_conversion, which would rename backbone.embeddings.weight ->
backbone.embedding.weight and cause load_and_shard_model key mismatches.
- Handle AttributeError from _tied_weights_keys being a list (trust_remote_code)
vs dict (transformers v5 expectation) by clearing it and retrying.
- Add `config.moe_latent_size = None` guard for native NemotronH config access.
- Download trust_remote_code .py files via snapshot_download for models with
auto_map, since save_pretrained does not copy them.
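The clear-and-retry handling for `_tied_weights_keys` could look roughly like the sketch below. It assumes "clearing" means setting the attribute to `None`, and it uses a stand-in `FakeRemoteModel` to reproduce the failure; neither is taken from the commit.

```python
def save_with_tied_keys_fallback(model, save_dir):
    """Save a model, retrying once if a list-typed _tied_weights_keys breaks it."""
    try:
        model.save_pretrained(save_dir, save_original_format=False)
    except AttributeError:
        # Only swallow the error when the known list-vs-dict mismatch applies.
        if not isinstance(getattr(model, "_tied_weights_keys", None), list):
            raise
        model._tied_weights_keys = None  # assumption: clearing means None
        model.save_pretrained(save_dir, save_original_format=False)


class FakeRemoteModel:
    """Stand-in that fails like a remote-code model with a list of tied keys."""

    _tied_weights_keys = ["lm_head.weight"]

    def __init__(self):
        self.saved_to = None

    def save_pretrained(self, save_dir, save_original_format=True):
        if isinstance(self._tied_weights_keys, list):
            raise AttributeError("'list' object has no attribute 'keys'")
        self.saved_to = save_dir
```

Re-raising when the attribute is not a list keeps unrelated `AttributeError`s visible instead of hiding them behind the retry.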
**NemotronH v2 tokenizer loading**
`validate_model.py` prepare_dataloader: when trust_remote_code is not
explicitly configured, auto-detect it from the descriptor (args.descriptor is
always set in puzzletron configs). Fixes NemotronH v2, where the native
`NemotronHConfig._pattern_to_list` only handles {M, E, *} but v2 uses `-` for
MLP layers.
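To illustrate why the native config rejects a v2 pattern, here is a toy parser restricted to a given character table. It is not the transformers implementation: the function name, the layer-type labels, and the example pattern are all assumptions made for illustration.

```python
# Characters the native parser is described as handling (labels are illustrative).
NATIVE_LAYER_TYPES = {"M": "mamba", "E": "moe", "*": "attention"}
# NemotronH v2 additionally uses "-" for MLP layers.
V2_LAYER_TYPES = {**NATIVE_LAYER_TYPES, "-": "mlp"}


def pattern_to_list(pattern: str, table: dict) -> list:
    """Translate a per-layer pattern string into a list of layer types."""
    unknown = sorted(set(pattern) - set(table))
    if unknown:
        raise ValueError(f"unsupported pattern characters: {unknown}")
    return [table[ch] for ch in pattern]


v2_pattern = "M-M*-"  # made-up pattern containing the v2-only "-" character
```

Parsing `v2_pattern` succeeds with `V2_LAYER_TYPES` and raises `ValueError` with `NATIVE_LAYER_TYPES`, which is the kind of failure that loading through the trust_remote_code config avoids.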
**Qwen3VL / transformers 5.x expert hook**
`expert_removal_hooks.py`:
- Gate returns (logits, aux_loss) tuple in transformers 5.x; unpack it.
- Use hidden_states.shape[-1] instead of self.moe.hidden_size (removed in v5).
- Version-branch the experts call: transformers 5.x uses grouped_mm signature
(hidden_flat, top_k_index, top_k_weights) vs 4.x loop-based
(hidden_3d, routing_weights_full, router_indices).
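The first and third bullets can be sketched without any model code. The helpers below are hypothetical, and the version check assumes a plain `major.minor.patch` string such as `transformers.__version__` usually provides.

```python
def router_logits(gate_output):
    """Return router logits whether the gate yields a tensor or a tuple."""
    if isinstance(gate_output, tuple):
        # transformers 5.x: (logits, aux_loss); keep only the logits.
        return gate_output[0]
    return gate_output  # transformers 4.x: logits returned directly


def is_transformers_v5(version: str) -> bool:
    """Return True for a 5.x or newer version string."""
    return int(version.split(".")[0]) >= 5


def expert_call_args(version: str, v5_args: tuple, v4_args: tuple) -> tuple:
    """Pick the argument tuple matching the installed experts signature."""
    # v5_args: (hidden_flat, top_k_index, top_k_weights)
    # v4_args: (hidden_3d, routing_weights_full, router_indices)
    return v5_args if is_transformers_v5(version) else v4_args
```

A hook would then call something like `experts(*expert_call_args(transformers.__version__, v5_args, v4_args))`, keeping the version branch in one place.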
**GPT-OSS attention_type**
`gpt_oss_model_descriptor.py`: use getattr(layer, "attention_type", None)
since the attribute was removed in transformers v5.4.
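The `getattr` fallback behaves as follows on layers with and without the attribute. The layer objects here are stand-ins and the value `"sliding_attention"` is illustrative only.

```python
from types import SimpleNamespace


def layer_attention_type(layer):
    """Read attention_type if present, otherwise return None instead of raising."""
    return getattr(layer, "attention_type", None)


old_layer = SimpleNamespace(attention_type="sliding_attention")  # attribute present
new_layer = SimpleNamespace()  # attribute absent, as on newer transformers
```

Callers then need to treat `None` as "unknown" rather than assuming a specific attention type.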
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
1 parent 7053c61 commit 05c6d3b
11 files changed
Lines changed: 134 additions & 30 deletions
File tree
- examples/puzzletron
- modelopt/torch
- prune/importance_hooks
- puzzletron
- anymodel/models
- gpt_oss
- nemotron_h
- tools
- bypassed_training
- tests
- _test_utils/torch/puzzletron
- gpu/torch/puzzletron/resources/configs/openai/gpt-oss-20b
- pruning