
Commit 05c6d3b

Add MoE/Nemotron fixes to support Transformers 5.5
Tested with both transformers 4.57 and 5.5.

## Root cause

transformers 5.5 natively supports NemotronHForCausalLM (with `model.` prefix), but all puzzletron checkpoints use the trust_remote_code class (with `backbone.` prefix). Additionally, the native NemotronHConfig does not recognize the `-` pattern character used by NemotronH v2 for MLP layers.

## Fixes

**trust_remote_code model class selection (4 places)**

For trust_remote_code models, always force `AutoModelForCausalLM.from_config(trust_remote_code=True)` instead of the native concrete class, which has a different module structure (`backbone.` vs `model.` prefix). Applied in:
- `sharded_checkpoint_utils.py` create_sharded_model
- `init_child_from_parent.py` (fixes KeyError on backbone.layers.N.mixer.experts keys)
- `checkpoint_utils_hf.py` init_model_from_config (fixes AttributeError in calc_subblock_params_and_memory)
- `tests/_test_utils/torch/puzzletron/utils.py` create_and_save_small_hf_model

**NemotronH embedding key name (singular vs plural)**

`nemotron_h_model_descriptor.py` layer_name_predicates: make the trailing `s` optional (`backbone\.embeddings?\.weight`) to match both the on-disk singular form (`backbone.embedding.weight`) produced by transformers 5.5 revert_weight_conversion and the in-memory plural form.

**Test checkpoint save format**

`utils.py` create_and_save_small_hf_model:
- Use `save_pretrained(save_original_format=False)` to skip transformers 5.5 revert_weight_conversion, which would rename backbone.embeddings.weight -> backbone.embedding.weight and cause load_and_shard_model key mismatches.
- Handle AttributeError from `_tied_weights_keys` being a list (trust_remote_code) vs a dict (transformers v5 expectation) by clearing it and retrying.
- Add a `config.moe_latent_size = None` guard for native NemotronH config access.
- Download trust_remote_code .py files via snapshot_download for models with auto_map, since save_pretrained does not copy them.

**NemotronH v2 tokenizer loading**

`validate_model.py` prepare_dataloader: auto-detect trust_remote_code from the descriptor (args.descriptor is always set in puzzletron configs) when it is not explicitly configured. Fixes NemotronH v2, where the native NemotronHConfig._pattern_to_list only handles {M, E, *} but v2 uses `-` for MLP layers.

**Qwen3VL / transformers 5.x expert hook**

`expert_removal_hooks.py`:
- The gate returns a (logits, aux_loss) tuple in transformers 5.x; unpack it.
- Use hidden_states.shape[-1] instead of self.moe.hidden_size (removed in v5).
- Version-branch the experts call: transformers 5.x uses the grouped_mm signature (hidden_flat, top_k_index, top_k_weights) vs the 4.x loop-based one (hidden_3d, routing_weights_full, router_indices).

**GPT-OSS attention_type**

`gpt_oss_model_descriptor.py`: guard the access with `hasattr(layer, "attention_type")`, since the attribute was removed in transformers v5.4.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
1 parent 7053c61 commit 05c6d3b

11 files changed

Lines changed: 134 additions & 30 deletions


Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 lm-eval==0.4.10
 math-verify
 ray
+# Likely works for transformers v5 also, but we need to test it
 transformers<5.0

modelopt/torch/prune/importance_hooks/expert_removal_hooks.py

Lines changed: 26 additions & 11 deletions
@@ -20,6 +20,8 @@
 from typing import TYPE_CHECKING

 import torch
+import transformers
+from packaging.version import Version
 from torch import nn

 from .base_hooks import ForwardHook
@@ -359,27 +361,40 @@ def get_router_logits_and_routed_experts(
         Based on Qwen3VLMoeSparseMoe forward pass.
         """
         orig_shape = hidden_states.shape
+        # Use hidden_states.shape[-1] instead of self.moe.hidden_size for transformers v5 compatibility
+        hidden_size = (
+            self.moe.hidden_size if hasattr(self.moe, "hidden_size") else hidden_states.shape[-1]
+        )

         # Flatten to (num_tokens, hidden_size) for processing
-        hidden_states_flat = hidden_states.reshape(-1, self.moe.hidden_size)
+        hidden_states_flat = hidden_states.reshape(-1, hidden_size)

         if router_logits is None:
             router_logits = self.moe.gate(hidden_states_flat)
+            # In transformers v5 the gate returns a (logits, aux_loss) tuple
+            if isinstance(router_logits, tuple):
+                router_logits = router_logits[0]

         routing_weights = torch.nn.functional.softmax(router_logits, dim=-1, dtype=torch.float)
-        routing_weights, router_indices = torch.topk(routing_weights, self.moe.top_k, dim=-1)
+        routing_weights, router_indices = torch.topk(
+            routing_weights, self.num_experts_per_tok, dim=-1
+        )
         routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)
         routing_weights = routing_weights.to(hidden_states_flat.dtype)
-        router_weights = torch.zeros_like(router_logits).scatter_(
-            1, router_indices, routing_weights
-        )
-
-        # Reshape hidden_states for moe.experts (expects 3D: batch, seq, hidden)
-        # router_weights and router_indices remain 2D (num_tokens, num_experts)
-        batch_size = orig_shape[0] if hidden_states.ndim == 3 else 1
-        hidden_states_3d = hidden_states_flat.reshape(batch_size, -1, self.moe.hidden_size)

-        routed_out = self.moe.experts(hidden_states_3d, router_weights, router_indices)
+        if Version(transformers.__version__) >= Version("5.0"):
+            # transformers 5.x: grouped_mm_experts_forward expects
+            # (hidden_states_flat 2D, top_k_index, top_k_weights)
+            routed_out = self.moe.experts(hidden_states_flat, router_indices, routing_weights)
+        else:
+            # transformers 4.x: loop-based experts expects
+            # (hidden_states_3d 3D, routing_weights_full, router_indices)
+            batch_size = orig_shape[0] if hidden_states.ndim == 3 else 1
+            hidden_states_3d = hidden_states_flat.reshape(batch_size, -1, hidden_size)
+            router_weights = torch.zeros(
+                router_logits.shape, dtype=routing_weights.dtype, device=router_logits.device
+            ).scatter_(1, router_indices, routing_weights)
+            routed_out = self.moe.experts(hidden_states_3d, router_weights, router_indices)

         # Return in same shape as input
         routed_out = routed_out.reshape(*orig_shape)
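
For reference, a minimal sketch (not part of the commit) of the routing math above, showing the tensor shapes each transformers branch hands to the experts module; all sizes here are made up:

```python
import torch

num_tokens, hidden_size, num_experts, top_k = 4, 8, 3, 2
hidden_flat = torch.randn(num_tokens, hidden_size)
router_logits = torch.randn(num_tokens, num_experts)

routing_weights = torch.nn.functional.softmax(router_logits, dim=-1, dtype=torch.float)
routing_weights, router_indices = torch.topk(routing_weights, top_k, dim=-1)
routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)

# transformers 5.x path: experts consume the 2D tokens plus (indices, weights) directly.
print(hidden_flat.shape, router_indices.shape, routing_weights.shape)
# torch.Size([4, 8]) torch.Size([4, 2]) torch.Size([4, 2])

# transformers 4.x path: scatter the top-k weights back to a dense (num_tokens, num_experts)
# matrix and reshape the hidden states to 3D (batch, seq, hidden).
router_weights_full = torch.zeros(num_tokens, num_experts).scatter_(1, router_indices, routing_weights)
hidden_3d = hidden_flat.reshape(1, -1, hidden_size)
print(router_weights_full.shape, hidden_3d.shape)
# torch.Size([4, 3]) torch.Size([1, 4, 8])
```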

modelopt/torch/puzzletron/anymodel/models/gpt_oss/gpt_oss_model_descriptor.py

Lines changed: 3 additions & 2 deletions
@@ -54,8 +54,9 @@ class GptOssModelDescriptor(ModelDescriptor):
     @classmethod
     def create_dummy_block(cls, original_layer: GptOssDecoderLayer, block_index: int) -> nn.Module:
         dummy_block = DummyBlock(block_index=block_index)
-        # Required by `GptOssModel.forward`.
-        dummy_block.attention_type = original_layer.attention_type
+        # Required by `GptOssModel.forward` in transformers<5.4
+        if hasattr(original_layer, "attention_type"):
+            dummy_block.attention_type = original_layer.attention_type
         return dummy_block

     @staticmethod

modelopt/torch/puzzletron/anymodel/models/nemotron_h/nemotron_h_model_descriptor.py

Lines changed: 1 addition & 1 deletion
@@ -200,7 +200,7 @@ def get_weight_groups(
     def layer_name_predicates(num_layers: int) -> Dict[str, re.Pattern]:
         layer_name_patterns = {
             "embeddings": re.compile(
-                r"^(model\.embed_tokens\.weight|backbone\.embeddings\.weight)$"
+                r"^(model\.embed_tokens\.weight|backbone\.embeddings?\.weight)$"
             ),
             "lm_head": re.compile(r"^(lm_head\.weight|backbone\.norm_f\.weight)$"),
         }
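
As a quick illustration (not part of the commit), the relaxed pattern accepts both key spellings while still anchoring on the full parameter name:

```python
import re

pattern = re.compile(r"^(model\.embed_tokens\.weight|backbone\.embeddings?\.weight)$")
assert pattern.match("backbone.embeddings.weight")  # in-memory plural form
assert pattern.match("backbone.embedding.weight")   # on-disk singular form (transformers 5.5)
assert not pattern.match("backbone.embeddings.weight.extra")  # still fully anchored
```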

modelopt/torch/puzzletron/tools/bypassed_training/init_child_from_parent.py

Lines changed: 11 additions & 5 deletions
@@ -39,7 +39,11 @@
     update_model_config,
 )
 from modelopt.torch.puzzletron.tools.checkpoint_utils import copy_tokenizer, load_state_dict
-from modelopt.torch.puzzletron.tools.checkpoint_utils_hf import _save_checkpoint, load_model_config
+from modelopt.torch.puzzletron.tools.checkpoint_utils_hf import (
+    _get_auto_class_for_trust_remote_code,
+    _save_checkpoint,
+    load_model_config,
+)
 from modelopt.torch.puzzletron.tools.logger import mprint
 from modelopt.torch.puzzletron.tools.sharded_checkpoint_utils import _get_model_class_from_config

@@ -126,12 +130,14 @@ def init_child_from_parent(
         model_descriptor=descriptor, block_configs=child_model_config.block_configs
     ):
         model_class = _get_model_class_from_config(child_model_config)
-        # AutoModelForCausalLM uses from_config(); concrete model classes use _from_config()
-        if model_class is AutoModelForCausalLM:
-            trust_remote_code = descriptor.requires_trust_remote_code()
-            child_model = model_class.from_config(
+        trust_remote_code = descriptor.requires_trust_remote_code()
+        if trust_remote_code:
+            auto_cls = _get_auto_class_for_trust_remote_code(child_model_config)
+            child_model = auto_cls.from_config(
                 child_model_config, trust_remote_code=trust_remote_code
             )
+        elif model_class is AutoModelForCausalLM:
+            child_model = AutoModelForCausalLM.from_config(child_model_config)
         else:
             child_model = model_class._from_config(child_model_config)

modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py

Lines changed: 33 additions & 3 deletions
@@ -133,6 +133,33 @@ def _get_model_class_from_config(config: PretrainedConfig) -> type:
     return AutoModelForCausalLM


+def _get_auto_class_for_trust_remote_code(config: PretrainedConfig) -> type:
+    """Pick the right Auto class for a trust_remote_code model by inspecting auto_map.
+
+    When a model requires trust_remote_code, the native transformers class resolved from
+    config.architectures must NOT be used directly — it may have a different module structure
+    than the trust_remote_code class (e.g. NemotronH: native uses ``model.`` prefix, but the
+    trust_remote_code class uses ``backbone.`` prefix, causing key mismatches throughout the
+    pipeline). Instead, we route through the appropriate Auto class so that from_config()
+    resolves the class via auto_map, picking up the correct trust_remote_code implementation.
+
+    Models declare which Auto class they support via config.auto_map. We walk a priority list
+    so that CausalLM models and VL models (AutoModelForConditionalGeneration or similar) are
+    both handled correctly.
+    """
+    auto_map = getattr(config, "auto_map", {})
+    priority = [
+        "AutoModelForCausalLM",
+        "AutoModelForConditionalGeneration",
+        "AutoModelForImageTextToText",
+        "AutoModel",
+    ]
+    for name in priority:
+        if name in auto_map and hasattr(transformers, name):
+            return getattr(transformers, name)
+    return AutoModelForCausalLM
+
+
 def init_model_from_config(
     config: PretrainedConfig,
     *,
@@ -145,10 +172,13 @@ def init_model_from_config(
         Pass True when loading configs that rely on custom modeling code from the checkpoint.
     """
     model_class = _get_model_class_from_config(config)
+    if trust_remote_code:
+        auto_cls = _get_auto_class_for_trust_remote_code(config)
+        return auto_cls.from_config(config, trust_remote_code=trust_remote_code, **kwargs)
     if model_class is AutoModelForCausalLM:
-        return model_class.from_config(config, trust_remote_code=trust_remote_code, **kwargs)
-    # Concrete model classes (e.g. GptOssForCausalLM): _from_config forwards kwargs to __init__,
-    # which does not accept trust_remote_code (only AutoModel uses it when loading custom code).
+        return AutoModelForCausalLM.from_config(config, **kwargs)
+    # Concrete model classes (e.g. GptOssForCausalLM, Qwen3VLMoeForConditionalGeneration):
+    # _from_config forwards kwargs to __init__, which does not accept trust_remote_code.
     return model_class._from_config(config, **kwargs)

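
For context, a minimal sketch (not from the commit) of how the auto_map priority walk resolves an Auto class; the SimpleNamespace config here is a stand-in for a real PretrainedConfig:

```python
import transformers
from types import SimpleNamespace

# Hypothetical config whose auto_map declares a trust_remote_code CausalLM class.
config = SimpleNamespace(auto_map={"AutoConfig": "cfg.MyConfig", "AutoModelForCausalLM": "mod.MyModel"})

priority = ["AutoModelForCausalLM", "AutoModelForConditionalGeneration", "AutoModelForImageTextToText", "AutoModel"]
for name in priority:
    if name in config.auto_map and hasattr(transformers, name):
        print(getattr(transformers, name))  # -> transformers AutoModelForCausalLM class
        break
```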

modelopt/torch/puzzletron/tools/sharded_checkpoint_utils.py

Lines changed: 9 additions & 6 deletions
@@ -43,6 +43,9 @@

 import modelopt.torch.utils.distributed as dist
 from modelopt.torch.puzzletron.tools.checkpoint_utils import load_model_config, load_state_dict
+from modelopt.torch.puzzletron.tools.checkpoint_utils_hf import (
+    _get_auto_class_for_trust_remote_code,
+)
 from modelopt.torch.puzzletron.tools.logger import mprint
 from modelopt.torch.puzzletron.utils.dummy_modules import (
     DummyBlock,
@@ -172,8 +175,6 @@ def load_and_shard_model(
         device=runtime.device,
     )

-    new_names = set(shard_state_dict.keys())
-    mprint(f"{new_names=}")
     # strict=False: allows missing lm_head.weight when tie_word_embeddings=True (e.g., Llama 3.2 3B)
     model_shard.load_state_dict(shard_state_dict, strict=False, assign=True)

@@ -239,10 +240,12 @@ def create_sharded_model(
     with EmptyInitOnDevice(device="meta", dtype=dtype):
         # Get model class from config.architectures (works for CausalLM, VL models, etc.)
         model_class = _get_model_class_from_config(model_config)
-        # AutoModelForCausalLM uses from_config(); concrete model classes use _from_config()
-        if model_class is AutoModelForCausalLM:
-            trust_remote_code = descriptor.requires_trust_remote_code()
-            model = model_class.from_config(model_config, trust_remote_code=trust_remote_code)
+        trust_remote_code = descriptor.requires_trust_remote_code()
+        if trust_remote_code:
+            auto_cls = _get_auto_class_for_trust_remote_code(model_config)
+            model = auto_cls.from_config(model_config, trust_remote_code=trust_remote_code)
+        elif model_class is AutoModelForCausalLM:
+            model = AutoModelForCausalLM.from_config(model_config)
         else:
             model = model_class._from_config(model_config)
     create_local_shard_(

modelopt/torch/puzzletron/tools/validate_model.py

Lines changed: 11 additions & 1 deletion
@@ -235,9 +235,19 @@ def prepare_dataloader(
     if tokenizer is None:
         tokenizer_name = getattr(args, "tokenizer_name", None)
         assert (tokenizer_name is not None) or (args.model_name_or_path is not None)
+        # Auto-detect trust_remote_code from the descriptor when not explicitly set.
+        # Required for models like NemotronH v2 whose configs use characters (e.g. '-') that
+        # the native transformers NemotronHConfig._pattern_to_list doesn't support.
+        trust_remote_code = getattr(args, "trust_remote_code", False)
+        if not trust_remote_code and getattr(args, "descriptor", None):
+            try:
+                descriptor_cls = ModelDescriptorFactory.get(args.descriptor)
+                trust_remote_code = descriptor_cls.requires_trust_remote_code()
+            except Exception:
+                pass
         tokenizer = AutoTokenizer.from_pretrained(
             tokenizer_name or args.model_name_or_path,
-            trust_remote_code=getattr(args, "trust_remote_code", False),
+            trust_remote_code=trust_remote_code,
         )

     val_dataloader = create_validation_dataloader(

tests/_test_utils/torch/puzzletron/utils.py

Lines changed: 31 additions & 1 deletion
@@ -19,6 +19,7 @@
 import torch
 from _test_utils.torch.transformers_models import get_tiny_tokenizer
 from datasets import Dataset, DatasetDict
+from huggingface_hub import snapshot_download
 from transformers import AutoConfig, AutoModelForCausalLM, PreTrainedTokenizerBase

 import modelopt.torch.utils.distributed as dist
@@ -135,6 +136,11 @@ def create_and_save_small_hf_model(
 ):
     config.pad_token_id = 0

+    # Ensure moe_latent_size is present: the native transformers NemotronH model (>=5.5)
+    # accesses config.moe_latent_size but older trust_remote_code configs don't define it.
+    if not hasattr(config, "moe_latent_size"):
+        config.moe_latent_size = None
+
     # Set seed for reproducible weight initialization
     torch.manual_seed(42)

@@ -167,14 +173,38 @@ def create_and_save_small_hf_model(
     else:
         os.environ.pop("CUDA_VISIBLE_DEVICES", None)

-    model.to(dtype=torch.bfloat16).save_pretrained(output_path)
+    model.to(dtype=torch.bfloat16)
+    # save_original_format=False: skip transformers' revert_weight_conversion so weights are saved
+    # with in-memory key names (e.g. backbone.embeddings.weight) rather than the on-disk "original"
+    # format (e.g. backbone.embedding.weight for NemotronH). This avoids key mismatches in
+    # load_and_shard_model which looks up shard keys from model.named_parameters().
+    try:
+        model.save_pretrained(output_path, save_original_format=False)
+    except AttributeError:
+        # Workaround: some trust_remote_code models define _tied_weights_keys in an older
+        # format (returning a list) that is incompatible with transformers v5, which
+        # expects _get_tied_weight_keys to return a dict. Clear tied weight keys and retry.
+        for submodule in model.modules():
+            if getattr(submodule, "_tied_weights_keys", None) is not None:
+                submodule._tied_weights_keys = None
+        model.save_pretrained(output_path, save_original_format=False)

     # Save tokenizer
     tokenizer.save_pretrained(output_path)

     # Save config
     config.save_pretrained(output_path)

+    # Download trust_remote_code .py files from HF hub into the checkpoint directory so that
+    # force_cache_dynamic_modules can resolve classes from the local path.
+    # save_pretrained only saves weights + config, not these .py files.
+    if hasattr(config, "auto_map") and isinstance(config.auto_map, dict):
+        snapshot_download(
+            repo_id=hf_model_name,
+            local_dir=output_path,
+            allow_patterns=["*.py"],
+        )
+

 def save_dummy_dataset(dataset_path: Path | str):
     """

tests/gpu/torch/puzzletron/resources/configs/openai/gpt-oss-20b/gpt-oss-20b.yaml

Lines changed: 2 additions & 0 deletions
@@ -44,6 +44,7 @@ scoring:

   eval_samples: 2
   micro_batch_size: 1
+  block_size: 512 # Toy model has max_position_embeddings=512; attention is O(batch*heads*seq^2)
   dataset_path: ${dataset_path}/valid
   seed: 42
   shuffle_seed: 444
@@ -97,6 +98,7 @@ realize_model:
   skip_validation: false # To enable validation of the model solution set `skip_validation` as False
   eval_samples: 2
   micro_batch_size: 1
+  block_size: 512 # Toy model has max_position_embeddings=512; attention is O(batch*heads*seq^2)
   dataset_path: ${dataset_path}/valid
   seed: 42
   shuffle_seed: 444
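
For context, a rough back-of-envelope (not part of the commit; the 8-head count is a made-up toy value) of why capping block_size at the toy model's max_position_embeddings keeps the attention-score term small:

```python
# Attention scores scale as batch * heads * seq^2 elements per layer.
batch, heads = 1, 8
for seq in (512, 8192):
    print(seq, batch * heads * seq * seq)
# 512 2097152
# 8192 536870912
```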

0 commit comments
