This guide explains how to add support for new models in the Puzzletron pipeline.
Convert a HuggingFace model to Puzzletron format.
Step 1: Create Model Descriptor
Extend ModelDescriptor and implement layer_name_predicates() to define regex patterns for grouping weights into subblocks (embeddings, lm_head, block_N_ffn, block_N_attention).
Key points:
- Find weight names on the model's HuggingFace page → click "Files info" to see the safetensors structure with all tensor names (example: Llama-3.1-8B-Instruct)
See example: llama_model_descriptor.py
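The grouping idea behind layer_name_predicates() can be sketched as follows, assuming Llama-style tensor names. The return shape (a dict of compiled regexes), the group_weight helper, and the choice of which layer norm goes into which subblock are all illustrative assumptions, not the actual ModelDescriptor interface; check llama_model_descriptor.py for the real signature.

```python
import re
from typing import Dict, Optional, Pattern


def layer_name_predicates(num_layers: int) -> Dict[str, Pattern[str]]:
    """Hypothetical stand-in: map each subblock name to a regex over weight names."""
    predicates: Dict[str, Pattern[str]] = {
        "embeddings": re.compile(r"^model\.embed_tokens\.weight$"),
        # Assumption: the final norm travels with lm_head in this sketch.
        "lm_head": re.compile(r"^(model\.norm\.weight|lm_head\.weight)$"),
    }
    for i in range(num_layers):
        # The trailing "\." keeps layer 1 from also matching layer 10, 11, ...
        predicates[f"block_{i}_ffn"] = re.compile(
            rf"^model\.layers\.{i}\.(post_attention_layernorm\.weight|mlp\..+)$"
        )
        predicates[f"block_{i}_attention"] = re.compile(
            rf"^model\.layers\.{i}\.(input_layernorm\.weight|self_attn\..+)$"
        )
    return predicates


def group_weight(name: str, predicates: Dict[str, Pattern[str]]) -> Optional[str]:
    """Return the first subblock whose regex matches the weight name, else None."""
    for group, pattern in predicates.items():
        if pattern.match(name):
            return group
    return None


predicates = layer_name_predicates(num_layers=2)
example_names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.input_layernorm.weight",
    "model.layers.1.mlp.down_proj.weight",
    "model.layers.1.post_attention_layernorm.weight",
    "model.norm.weight",
    "lm_head.weight",
]
grouping = {name: group_weight(name, predicates) for name in example_names}
```

A weight name that maps to None is a sign that a regex is missing or too narrow, which is worth checking against the full safetensors listing before running the conversion.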
Step 2: Create Converter
Extend Converter and implement create_block_configs_from_main_config() to create per-layer BlockConfigs from the HuggingFace config.
Key points:
- Import the correct HuggingFace config class (e.g., MistralConfig, LlamaConfig, Qwen2Config). Find it in the transformers source: github.com/huggingface/transformers/tree/main/src/transformers/models/<model_type>/configuration_<model_type>.py
See example: llama_converter.py
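A minimal sketch of what create_block_configs_from_main_config() might do for a homogeneous dense model. The dataclasses below are hypothetical stand-ins for Puzzletron's BlockConfig (only the attention.num_key_value_heads and ffn.intermediate_size fields are taken from this guide), and a SimpleNamespace stands in for the HuggingFace config object so the example runs without transformers installed.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import List


@dataclass
class AttentionConfig:  # hypothetical stand-in
    num_key_value_heads: int


@dataclass
class FFNConfig:  # hypothetical stand-in
    intermediate_size: int


@dataclass
class BlockConfig:  # hypothetical stand-in for Puzzletron's BlockConfig
    attention: AttentionConfig
    ffn: FFNConfig


def create_block_configs_from_main_config(config) -> List[BlockConfig]:
    """Build one BlockConfig per decoder layer from the model-level config."""
    return [
        BlockConfig(
            attention=AttentionConfig(num_key_value_heads=config.num_key_value_heads),
            ffn=FFNConfig(intermediate_size=config.intermediate_size),
        )
        for _ in range(config.num_hidden_layers)
    ]


# Stand-in for a LlamaConfig; values are those of Llama-3.1-8B's config.json.
hf_config = SimpleNamespace(
    num_hidden_layers=32,
    num_key_value_heads=8,
    intermediate_size=14336,
)
block_configs = create_block_configs_from_main_config(hf_config)
```

Every layer starts out identical here; pruning later replaces individual entries, which is why the configs are per layer rather than per model.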
Step 3: Create models/<model_name>/__init__.py
Export descriptor and converter classes:
from models.<model_name>.<model_name>_model_descriptor import MyModelDescriptor
from models.<model_name>.<model_name>_converter import MyConverter

Step 4: Register in models/__init__.py
Add import to trigger factory registration:
from models.<model_name> import *

from modelopt.torch.puzzletron.anymodel import convert_model
convert_model(
    input_dir="path/to/hf_checkpoint",
    output_dir="path/to/puzzletron_checkpoint",
    converter="model_name",
)

Run pruning and compression on a Puzzletron model.
Step 1: Implement ModelDescriptor methods for compression
Add to your ModelDescriptor:
- decoder_layer_cls() - return the decoder layer class(es) to patch for heterogeneous config support
- block_config_to_layer_overrides() - map BlockConfig to layer override dict (see details)
- init_rotary_embedding() - reinitialize rotary embeddings after model loading (see details)
- input_embedding_name() - return the name of the input embedding layer (see details)
- output_embedding_name() - return the name of the output embedding layer (see details)
- layer_block_name() - return the name pattern for decoder layers (see details)
- final_norm_name() - return the name of the final normalization layer (see details)
- attn_no_op_post_init() - replace attention sublayers with no-op modules
- mlp_no_op_post_init() - replace MLP sublayers with no-op modules
Step 2: Create FFN Layer Descriptor
Extend FFNIntermediateLayerDescriptor to define model-specific paths for FFN pruning hooks (down_proj_name, ffn_prefix_name, linear_weight_names). Derive values from your model's weight names in layer_name_predicates().
See example: llama_model_descriptor.py → LlamaFFNIntermediateLayerDescriptor
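As a rough illustration, the three fields could look like this for a model with Llama-style weight names. The class below is a standalone stand-in (the real one extends FFNIntermediateLayerDescriptor), and both the "{layer_idx}" placeholder format and the weight_keys helper are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MyFFNIntermediateLayerDescriptor:
    """Hypothetical stand-in; the real class extends FFNIntermediateLayerDescriptor."""

    # Path of the down projection, relative to the decoder block.
    down_proj_name: str = "mlp.down_proj"
    # Full prefix of the FFN module; the placeholder format is an assumption.
    ffn_prefix_name: str = "model.layers.{layer_idx}.mlp"
    # Linear layers whose weights change when the intermediate size is pruned.
    linear_weight_names: List[str] = field(
        default_factory=lambda: ["down_proj", "gate_proj", "up_proj"]
    )

    def weight_keys(self, layer_idx: int) -> List[str]:
        """Helper for this sketch only: expand the fields into full weight names."""
        prefix = self.ffn_prefix_name.format(layer_idx=layer_idx)
        return [f"{prefix}.{name}.weight" for name in self.linear_weight_names]


descriptor = MyFFNIntermediateLayerDescriptor()
keys_for_layer_3 = descriptor.weight_keys(3)
```

Comparing the expanded names with the tensor names in the checkpoint is a quick way to catch a wrong prefix before running pruning.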
Step 3: Configure YAML files
Update the main model config YAML:
- Set descriptor to match the name used in @ModelDescriptorFactory.register_decorator("your_model_name")
- See example: llama_3_1_8b_instruct.yaml
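Why the descriptor value has to match the registered name can be seen in a minimal decorator-registry sketch. This is not the actual ModelDescriptorFactory: the internal dict and the get() lookup method are hypothetical, and only the register_decorator name comes from this guide.

```python
from typing import Callable, Dict


class ModelDescriptorFactory:
    """Hypothetical minimal registry; the real factory may differ."""

    _registry: Dict[str, type] = {}

    @classmethod
    def register_decorator(cls, name: str) -> Callable[[type], type]:
        def decorator(descriptor_cls: type) -> type:
            # Registration happens at import time, when the class body is executed.
            cls._registry[name] = descriptor_cls
            return descriptor_cls

        return decorator

    @classmethod
    def get(cls, name: str) -> type:
        """Hypothetical lookup used when the YAML 'descriptor' value is resolved."""
        if name not in cls._registry:
            raise KeyError(
                f"Unknown descriptor {name!r}; registered: {sorted(cls._registry)}"
            )
        return cls._registry[name]


@ModelDescriptorFactory.register_decorator("your_model_name")
class MyModelDescriptor:
    pass


resolved = ModelDescriptorFactory.get("your_model_name")
```

Under this pattern a class is only known to the factory once its module has been imported, which is the reason for the import added to models/__init__.py in the conversion steps.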
Update pruning YAML files (ffn_pruning.yaml, expert_pruning.yaml, etc.):
- Set pruning_mixin._target_ to the appropriate mixin class
- Set layer_descriptor._target_ to your layer descriptor class
- Set hook_class to the activation hook for scoring
- Set target_layer in activation_hooks_kwargs to the layer name for hook attachment
- See examples in configs/llama_3_1_8b_instruct/pruning/
See test_puzzletron.py for a complete example that runs both convert and compression steps. For container setup and dependencies needed to run this test, see the Puzzletron README environment section.
Each pruning type has a YAML config with these key fields:
pruning_mixin:
  _target_: pruning.<type>_pruning_mixin.<MixinClass>
  layer_descriptor:
    _target_: models.<model>.<descriptor_class>

hook_class: ${get_object:utils.activation_hooks.hooks.<HookClass>}
activation_hooks_kwargs:
  method: <method_name>
  target_layer: "<layer.name>"  # e.g., "mlp.down_proj", "self_attn.o_proj"

| Field | Description |
|---|---|
| pruning_mixin._target_ | Mixin class that orchestrates this pruning type |
| layer_descriptor._target_ | Model-specific class defining layer paths for hooks |
| hook_class | Activation hook class for importance scoring |
| target_layer | Layer name (relative to decoder block) where hooks attach |
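The _target_ values are dotted paths to classes. Assuming they are resolved the way Hydra-style configs usually are (an assumption based on the key name, not confirmed by this guide), the lookup amounts to an import plus an attribute access. The resolve_target helper below is hypothetical, and a standard-library class stands in for a real mixin so the sketch runs anywhere.

```python
import importlib
from typing import Any


def resolve_target(dotted_path: str) -> Any:
    """Hypothetical helper: turn 'package.module.ClassName' into the class object."""
    module_path, _, attr_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected a dotted path, got {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# Stand-in config: a stdlib class replaces the real mixin path for this demo.
config = {"pruning_mixin": {"_target_": "collections.OrderedDict"}}
mixin_cls = resolve_target(config["pruning_mixin"]["_target_"])
```

A typo in the module part of the path fails with ModuleNotFoundError and a typo in the class name with AttributeError, which helps tell the two kinds of mistake apart.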
- Implement the hook under modelopt/torch/prune/importance_hooks/ (e.g., base_hooks.py for generic hooks, expert_removal_hooks.py for MoE expert removal):
  - Extend an existing hook base class (e.g., RemoveExpertsIndependentHook in expert_removal_hooks.py)
  - Implement required methods (e.g., get_router_logits_and_routed_experts)
- Register the hook in the appropriate pruning mixin's supported_hooks():

  For FFN pruning (pruning/ffn_intermediate_pruning_mixin.py):

  def supported_hooks(self) -> List[Type[ActivationsHook]]:
      return [IndependentChannelContributionHook, IterativeChannelContributionHook, YourNewHook]

  For expert removal (pruning/expert_removal_pruning_mixin.py):

  def supported_hooks(self) -> List[Type[ActivationsHook]]:
      return [RankedChoiceVotingHook, ..., YourNewHook]

- Reference in YAML:

  hook_class: ${get_object:utils.activation_hooks.hooks.YourNewHook}
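The registration step matters if the mixin checks the configured hook against supported_hooks(). The sketch below assumes such a check exists; the validate_hook method and the empty hook classes are hypothetical stand-ins, and only the class names and the supported_hooks() signature come from this guide.

```python
from typing import List, Type


class ActivationsHook:
    """Hypothetical stand-in for the real hook base class."""


class IndependentChannelContributionHook(ActivationsHook):
    pass


class IterativeChannelContributionHook(ActivationsHook):
    pass


class YourNewHook(ActivationsHook):
    pass


class UnregisteredHook(ActivationsHook):
    pass


class FFNIntermediatePruningMixIn:
    """Hypothetical stand-in showing only the hook bookkeeping."""

    def supported_hooks(self) -> List[Type[ActivationsHook]]:
        return [
            IndependentChannelContributionHook,
            IterativeChannelContributionHook,
            YourNewHook,
        ]

    def validate_hook(self, hook_class: Type[ActivationsHook]) -> Type[ActivationsHook]:
        """Assumed check: reject hook classes this pruning type does not list."""
        if hook_class not in self.supported_hooks():
            supported = [cls.__name__ for cls in self.supported_hooks()]
            raise ValueError(f"{hook_class.__name__} not supported; choose from {supported}")
        return hook_class


mixin = FFNIntermediatePruningMixIn()
accepted = mixin.validate_hook(YourNewHook)
```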
| Type | Mixin | Example Hooks |
|---|---|---|
| FFN intermediate | FFNIntermediatePruningMixIn | IterativeChannelContributionHook, IndependentChannelContributionHook |
| Expert removal | ExpertRemovalPruningMixIn | NemotronHRemoveExpertsIndependentHook, Qwen3VLRemoveExpertsIndependentHook |
| KV heads | KVHeadsPruningMixIn | IndependentKvHeadContributionHook |
block_config_to_layer_overrides() maps Puzzletron's BlockConfig fields to HuggingFace config attribute names. Override only the attributes that change during pruning:
| BlockConfig Field | HuggingFace Attribute (check config.json) |
|---|---|
| attention.num_key_value_heads | num_key_value_heads |
| ffn.intermediate_size | intermediate_size |
| ffn.moe.num_local_experts | num_experts or n_routed_experts (model-specific) |
| ffn.moe.expert_intermediate_dim | moe_intermediate_size |
Tip: Check the model's config.json for exact attribute names - they vary between models.
See examples: qwen3_vl, nemotron_h
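For a dense model, the mapping in the table can be sketched as below. The dataclasses are hypothetical stand-ins for Puzzletron's BlockConfig, and treating None as "not pruned, keep the model default" is an assumption of this sketch; the real method signature may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class AttentionConfig:  # hypothetical stand-in
    num_key_value_heads: Optional[int] = None


@dataclass
class FFNConfig:  # hypothetical stand-in
    intermediate_size: Optional[int] = None


@dataclass
class BlockConfig:  # hypothetical stand-in for Puzzletron's BlockConfig
    attention: AttentionConfig = field(default_factory=AttentionConfig)
    ffn: FFNConfig = field(default_factory=FFNConfig)


def block_config_to_layer_overrides(block_config: BlockConfig) -> Dict[str, Any]:
    """Translate BlockConfig fields into HuggingFace config attribute names."""
    overrides: Dict[str, Any] = {}
    # Only emit attributes that pruning actually changed for this layer.
    if block_config.attention.num_key_value_heads is not None:
        overrides["num_key_value_heads"] = block_config.attention.num_key_value_heads
    if block_config.ffn.intermediate_size is not None:
        overrides["intermediate_size"] = block_config.ffn.intermediate_size
    return overrides


pruned_layer = BlockConfig(
    attention=AttentionConfig(num_key_value_heads=4),
    ffn=FFNConfig(intermediate_size=8192),
)
ffn_only_layer = BlockConfig(ffn=FFNConfig(intermediate_size=4096))

pruned_overrides = block_config_to_layer_overrides(pruned_layer)
ffn_only_overrides = block_config_to_layer_overrides(ffn_only_layer)
```

For MoE models the keys on the right-hand side change (for example num_experts versus n_routed_experts), so the dictionary keys should be copied from the model's own config.json.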
These methods return paths derived from the model's weight names:
input_embedding_name(), output_embedding_name(), layer_block_name(), final_norm_name()
Find them on the model's HuggingFace page → "Files info" → safetensors structure (example: Llama-3.1-8B-Instruct).
See example: llama_model_descriptor.py
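For Llama-style checkpoints, whose tensors are named model.embed_tokens.weight, model.layers.N.*, model.norm.weight and lm_head.weight, the four methods might look like the sketch below. The static-method form and the integer index argument are assumptions; llama_model_descriptor.py shows the real signatures.

```python
from typing import List


class MyModelDescriptor:
    """Hypothetical sketch for Llama-style weight names; signatures are assumptions."""

    @staticmethod
    def input_embedding_name() -> str:
        return "model.embed_tokens"  # from "model.embed_tokens.weight"

    @staticmethod
    def output_embedding_name() -> str:
        return "lm_head"  # from "lm_head.weight"

    @staticmethod
    def layer_block_name(index: int) -> str:
        return f"model.layers.{index}"  # from "model.layers.<N>.<sublayer>.weight"

    @staticmethod
    def final_norm_name() -> str:
        return "model.norm"  # from "model.norm.weight"


def owned_by(module_path: str, weight_names: List[str]) -> List[str]:
    """Sketch-only helper: weights that live under the given module path."""
    return [name for name in weight_names if name.startswith(module_path + ".")]


weight_names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]
layer_0_weights = owned_by(MyModelDescriptor.layer_block_name(0), weight_names)
```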
Rotary embeddings are computed modules (not saved weights). After model sharding, they need re-initialization on the correct device/dtype.
Look in github.com/huggingface/transformers/tree/main/src/transformers/models/<model_type>/modeling_<model_type>.py for:
- class.*Rotary - the rotary embedding class name and constructor arguments
- self.rotary_emb - the attribute path
See example: llama_model_descriptor.py
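To see why rotary embeddings are absent from the checkpoint, note that their buffers are a pure function of config values. The sketch below computes the inverse frequencies of standard, unscaled RoPE in plain Python. It ignores any rope_scaling the model may apply and is not the init_rotary_embedding() implementation; the function name is hypothetical.

```python
from typing import List


def rope_inverse_frequencies(head_dim: int, rope_theta: float) -> List[float]:
    """Standard RoPE: one frequency per pair of channels, theta^(-i/head_dim)."""
    return [1.0 / (rope_theta ** (i / head_dim)) for i in range(0, head_dim, 2)]


# Example values: head_dim = hidden_size / num_attention_heads = 4096 / 32,
# rope_theta as found in Llama-3.1-8B's config.json.
inv_freq = rope_inverse_frequencies(head_dim=128, rope_theta=500000.0)
```

Because nothing here is learned, re-initialization only needs the rotary class from modeling_<model_type>.py, the model config, and the target device and dtype.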