# AnyModel Guide

This guide explains how to add support for new models in the Puzzletron pipeline.

## Convert model

Convert a HuggingFace model to Puzzletron format.

Step 1: Create Model Descriptor

Extend `ModelDescriptor` and implement `layer_name_predicates()` to define regex patterns for grouping weights into subblocks (embeddings, lm_head, block_N_ffn, block_N_attention).

Key points:

- Find weight names on the model's HuggingFace page → click "Files info" to see the safetensors structure with all tensor names (example: [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct?show_file_info=model.safetensors.index.json))

See example: [llama_model_descriptor.py](models/llama/llama_model_descriptor.py)

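A minimal sketch of such a descriptor, assuming `layer_name_predicates()` returns a mapping from subblock name to a compiled regex over weight names (the import paths, return type, and regexes below are assumptions for a Llama-style layout; follow the llama example for the real contract):

```python
# Hypothetical sketch: import paths, the return type, and the regexes are
# assumptions for a Llama-style layout, not the actual Puzzletron API.
import re

from models.model_descriptor import ModelDescriptor, ModelDescriptorFactory  # assumed path


@ModelDescriptorFactory.register_decorator("my_model")
class MyModelDescriptor(ModelDescriptor):
    def layer_name_predicates(self):
        # Group raw weight names (from the safetensors index) into subblocks.
        return {
            "embeddings": re.compile(r"^model\.embed_tokens\."),
            "lm_head": re.compile(r"^lm_head\."),
            "block_N_attention": re.compile(r"^model\.layers\.(\d+)\.(self_attn|input_layernorm)\."),
            "block_N_ffn": re.compile(r"^model\.layers\.(\d+)\.(mlp|post_attention_layernorm)\."),
        }
```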

Step 2: Create Converter

Extend `Converter` and implement `create_block_configs_from_main_config()` to create per-layer BlockConfigs from the HuggingFace config.

Key points:

- Import the correct HuggingFace config class (e.g., `MistralConfig`, `LlamaConfig`, `Qwen2Config`). Find it in the transformers source: `github.com/huggingface/transformers/tree/main/src/transformers/models/<model_type>/configuration_<model_type>.py`

See example: [llama_converter.py](models/llama/llama_converter.py)

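A rough sketch of the converter for a dense Llama-style model (the import paths and the `BlockConfig` constructor arguments are illustrative assumptions; `llama_converter.py` shows the real contract):

```python
# Hypothetical sketch: import paths and the BlockConfig constructor
# arguments are assumptions, not the actual Puzzletron API.
from transformers import LlamaConfig

from decilm.deci_lm_hf_code.block_config import BlockConfig  # assumed path
from models.converter import Converter  # assumed path


class MyConverter(Converter):
    def create_block_configs_from_main_config(self, config: LlamaConfig):
        # One BlockConfig per decoder layer, seeded from the global HF config.
        return [
            BlockConfig(
                attention={"num_key_value_heads": config.num_key_value_heads},
                ffn={"intermediate_size": config.intermediate_size},
            )
            for _ in range(config.num_hidden_layers)
        ]
```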

Step 3: Create `models/<model_name>/__init__.py`

Export the descriptor and converter classes:

```python
from models.<model_name>.<model_name>_model_descriptor import MyModelDescriptor
from models.<model_name>.<model_name>_converter import MyConverter
```

Step 4: Register in `models/__init__.py`

Add an import to trigger factory registration:

```python
from models.<model_name> import *
```

## Usage

```python
from modelopt.torch.puzzletron.anymodel import convert_model

convert_model(
    input_dir="path/to/hf_checkpoint",
    output_dir="path/to/puzzletron_checkpoint",
    converter="model_name",
)
```

## Compress model

Run pruning and compression on a Puzzletron model.

Step 1: Implement ModelDescriptor methods for compression

Add to your `ModelDescriptor` (a skeleton sketch follows this list):

- `decoder_layer_cls()` - return the decoder layer class(es) to patch for heterogeneous config support
- `block_config_to_layer_overrides()` - map a BlockConfig to a layer override dict (see [details](#implementing-block_config_to_layer_overrides))
- `init_rotary_embedding()` - reinitialize rotary embeddings after model loading (see [details](#implementing-init_rotary_embedding))
- `input_embedding_name()` - return the name of the input embedding layer (see [details](#implementing-path-based-methods))
- `output_embedding_name()` - return the name of the output embedding layer (see [details](#implementing-path-based-methods))
- `layer_block_name()` - return the name pattern for decoder layers (see [details](#implementing-path-based-methods))
- `final_norm_name()` - return the name of the final normalization layer (see [details](#implementing-path-based-methods))
- `attn_no_op_post_init()` - replace attention sublayers with no-op modules
- `mlp_no_op_post_init()` - replace MLP sublayers with no-op modules

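A skeleton of how these can sit on the descriptor (signatures are assumptions and bodies are elided; see the detail sections below and the llama example for working implementations):

```python
# Hypothetical skeleton: method signatures are assumptions and bodies are
# elided; see the linked detail sections and the llama example.
class MyModelDescriptor(ModelDescriptor):
    def decoder_layer_cls(self):
        """Decoder layer class(es) to patch for heterogeneous config support."""
        ...

    def block_config_to_layer_overrides(self, block_config):
        """See 'Implementing block_config_to_layer_overrides' below."""
        ...

    def init_rotary_embedding(self, model):
        """See 'Implementing init_rotary_embedding' below."""
        ...

    def input_embedding_name(self):
        """See 'Implementing path-based methods' below."""
        ...

    # ...plus output_embedding_name(), layer_block_name(), final_norm_name(),
    # attn_no_op_post_init(), and mlp_no_op_post_init().
```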

Step 2: Create FFN Layer Descriptor

Extend `FFNIntermediateLayerDescriptor` to define model-specific paths for FFN pruning hooks (`down_proj_name`, `ffn_prefix_name`, `linear_weight_names`). Derive values from your model's weight names in `layer_name_predicates()`.

See example: [llama_model_descriptor.py](models/llama/llama_model_descriptor.py) → `LlamaFFNIntermediateLayerDescriptor`

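For a Llama-style MLP, the descriptor might reduce to the following (whether these are plain attributes, properties, or methods follows `LlamaFFNIntermediateLayerDescriptor`; the paths shown are assumptions):

```python
# Hypothetical sketch: the attribute form and the paths are assumptions
# for a Llama-style MLP; mirror LlamaFFNIntermediateLayerDescriptor.
class MyModelFFNIntermediateLayerDescriptor(FFNIntermediateLayerDescriptor):
    # Paths relative to a decoder block, matching the weight names used
    # in layer_name_predicates().
    down_proj_name = "mlp.down_proj"
    ffn_prefix_name = "mlp"
    linear_weight_names = ["gate_proj", "up_proj", "down_proj"]
```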

Step 3: Configure YAML files

Update the main model config YAML:

- Set `descriptor` to match the name used in `@ModelDescriptorFactory.register_decorator("your_model_name")`
- See example: [llama_3_1_8b_instruct.yaml](../../../../tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/llama_3_1_8b_instruct.yaml)

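For instance, a hypothetical excerpt (only the `descriptor` field is prescribed here; the linked example shows the full schema):

```yaml
# Hypothetical excerpt; remaining fields follow the linked example config.
descriptor: your_model_name
```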

Update pruning YAML files (`ffn_pruning.yaml`, `expert_pruning.yaml`, etc.):

- Set `pruning_mixin._target_` to the appropriate mixin class
- Set `layer_descriptor._target_` to your layer descriptor class
- Set `hook_class` to the activation hook for scoring
- Set `target_layer` in `activation_hooks_kwargs` to the layer name for hook attachment
- See examples in [configs/llama_3_1_8b_instruct/pruning/](../../../../tests/gpu/torch/puzzletron/resources/configs/llama_3_1_8b_instruct/pruning/)

## End-to-end example

See [test_puzzletron.py](../../../../tests/gpu/torch/puzzletron/test_puzzletron.py) for a complete example that runs both the convert and compression steps.

---

## Advanced Topics

## Pruning Configuration

### Pruning YAML Structure

Each pruning type has a YAML config with these key fields:

```yaml
pruning_mixin:
  _target_: pruning.<type>_pruning_mixin.<MixinClass>
  layer_descriptor:
    _target_: models.<model>.<descriptor_class>

hook_class: ${get_object:utils.activation_hooks.hooks.<HookClass>}
activation_hooks_kwargs:
  method: <method_name>
  target_layer: "<layer.name>"  # e.g., "mlp.down_proj", "self_attn.o_proj"
```

| Field | Description |
|-------|-------------|
| `pruning_mixin._target_` | Mixin class that orchestrates this pruning type |
| `layer_descriptor._target_` | Model-specific class defining layer paths for hooks |
| `hook_class` | Activation hook class for importance scoring |
| `target_layer` | Layer name (relative to decoder block) where hooks attach |

### Adding a New Hook Class

1. **Implement the hook** in `modelopt/torch/nas/plugins/megatron_hooks/base_hooks.py` (see the sketch after this list):
   - Extend an existing hook base class (e.g., `RemoveExpertsIndependentHook`)
   - Implement required methods (e.g., `get_router_logits_and_routed_experts`)

2. **Register the hook** in the appropriate pruning mixin's `supported_hooks()`:

   For FFN pruning (`pruning/ffn_intermediate_pruning_mixin.py`):

   ```python
   def supported_hooks(self) -> List[Type[ActivationsHook]]:
       return [IndependentChannelContributionHook, IterativeChannelContributionHook, YourNewHook]
   ```

   For expert removal (`pruning/expert_removal_pruning_mixin.py`):

   ```python
   def supported_hooks(self) -> List[Type[ActivationsHook]]:
       return [RankedChoiceVotingHook, ..., YourNewHook]
   ```

3. **Reference in YAML**:

   ```yaml
   hook_class: ${get_object:utils.activation_hooks.hooks.YourNewHook}
   ```

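A sketch of step 1, with the caveat that the required method's signature and the output unpacking below are assumptions (check the existing hooks in `base_hooks.py` for the actual interface):

```python
# Hypothetical sketch in base_hooks.py: the signature and the unpacking of
# the MoE forward output are assumptions; mirror the existing hooks.
class YourNewHook(RemoveExpertsIndependentHook):
    def get_router_logits_and_routed_experts(self, output):
        # Pull router logits and the selected expert indices out of the
        # MoE layer's forward output; the exact fields are model-specific.
        router_logits = output.router_logits
        routed_experts = router_logits.topk(k=8, dim=-1).indices  # top_k is model-specific
        return router_logits, routed_experts
```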

### Pruning Types Reference

| Type | Mixin | Example Hooks |
|------|-------|---------------|
| FFN intermediate | [`FFNIntermediatePruningMixIn`](../pruning/ffn_intermediate_pruning_mixin.py) | [`IterativeChannelContributionHook`](../../../nas/plugins/megatron_hooks/base_hooks.py), [`IndependentChannelContributionHook`](../../../nas/plugins/megatron_hooks/base_hooks.py) |
| Expert removal | [`ExpertRemovalPruningMixIn`](../pruning/expert_removal_pruning_mixin.py) | [`NemotronHRemoveExpertsIndependentHook`](../../../nas/plugins/megatron_hooks/base_hooks.py), [`Qwen3VLRemoveExpertsIndependentHook`](../../../nas/plugins/megatron_hooks/base_hooks.py) |
| KV heads | [`KVHeadsPruningMixIn`](../pruning/kv_heads_pruning_mixin.py) | [`IndependentKvHeadContributionHook`](../../../nas/plugins/megatron_hooks/base_hooks.py) |

## Implementing `block_config_to_layer_overrides`

This method maps Puzzletron's [`BlockConfig`](../decilm/deci_lm_hf_code/block_config.py) fields to HuggingFace config attribute names. Only override attributes that change during pruning:

| BlockConfig Field | HuggingFace Attribute (check `config.json`) |
|-------------------|---------------------------------------------|
| `attention.num_key_value_heads` | `num_key_value_heads` |
| `ffn.intermediate_size` | `intermediate_size` |
| `ffn.moe.num_local_experts` | `num_experts` or `n_routed_experts` (model-specific) |
| `ffn.moe.expert_intermediate_dim` | `moe_intermediate_size` |

**Tip**: Check the model's `config.json` for exact attribute names; they vary between models.

See examples: [qwen3_vl](models/qwen3_vl/qwen3_vl_model_descriptor.py), [nemotron_h](models/nemotron_h/nemotron_h_model_descriptor.py)

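As an illustration, a hedged sketch for a dense Llama-style model (the method signature and the `BlockConfig` field access are assumptions; the linked examples define the real contract):

```python
# Hypothetical sketch: the signature and return shape are assumptions;
# follow the qwen3_vl / nemotron_h examples for the real contract.
def block_config_to_layer_overrides(self, block_config):
    overrides = {}
    if block_config.attention is not None:
        # BlockConfig field -> HuggingFace config attribute (see table above).
        overrides["num_key_value_heads"] = block_config.attention.num_key_value_heads
    if block_config.ffn is not None:
        overrides["intermediate_size"] = block_config.ffn.intermediate_size
    return overrides
```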

---

## Implementing path-based methods

These methods return paths derived from the model's weight names:

- `input_embedding_name()`, `output_embedding_name()`, `layer_block_name()`, `final_norm_name()`

Find them on the model's HuggingFace page → "Files info" → safetensors structure (example: [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct?show_file_info=model.safetensors.index.json)).

See example: [llama_model_descriptor.py](models/llama/llama_model_descriptor.py)

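For a Llama-style weight layout these might look as follows (the paths are read off the safetensors index; the exact return convention, e.g. whether `layer_block_name()` carries a layer-index placeholder, follows the llama example):

```python
# Hypothetical sketch for a Llama-style layout; the return convention
# (e.g., a layer-index placeholder) follows the llama example.
def input_embedding_name(self) -> str:
    return "model.embed_tokens"

def output_embedding_name(self) -> str:
    return "lm_head"

def layer_block_name(self) -> str:
    return "model.layers"

def final_norm_name(self) -> str:
    return "model.norm"
```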

---

## Implementing `init_rotary_embedding`

Rotary embeddings are computed modules (not saved weights). After model sharding, they need re-initialization on the correct device/dtype.

Look in `github.com/huggingface/transformers/tree/main/src/transformers/models/<model_type>/modeling_<model_type>.py` for:

- `class.*Rotary` - the rotary embedding class name and constructor arguments
- `self.rotary_emb` - the attribute path

See example: [llama_model_descriptor.py](models/llama/llama_model_descriptor.py)

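For a Llama-style model this can reduce to rebuilding one module (the attribute path and the `LlamaRotaryEmbedding` constructor below follow recent transformers code and are assumptions; verify both against your `modeling_<model_type>.py` and the llama example):

```python
# Hypothetical sketch: the attribute path (model.model.rotary_emb) and the
# config-based constructor follow recent transformers Llama code; verify
# against your model_type before relying on either.
from transformers.models.llama.modeling_llama import LlamaRotaryEmbedding


def init_rotary_embedding(self, model):
    # Rotary embeddings are computed rather than loaded, so rebuild them
    # on the device the sharded model actually ended up on.
    model.model.rotary_emb = LlamaRotaryEmbedding(config=model.config, device=model.device)
```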