Commit 86b8b2f

lucaslie authored and bmarimuthu-nv committed
[None][feat] Add AD custom model for MiniMax-M2 family (#243)
* [None][feat] Add AD custom model for MiniMax-M2 family

Replace the existing MiniMax-M2 MoE patch with a full custom model implementation using AD canonical ops. Covers both MiniMaxAI/MiniMax-M2 and MiniMaxAI/MiniMax-M2.5 (same architecture, model_type: minimax_m2).

Key architecture features:
- MoE with 256 experts, top-8, sigmoid routing + e_score_correction_bias
- GQA (48 Q heads, 8 KV heads, head_dim=128)
- Partial RoPE (rotary_dim=64 out of head_dim=128)
- Per-layer QK normalization (RMSNorm on full Q/K before reshape)
- FP8 block-wise quantized checkpoint

Canonical ops used: torch_rmsnorm, torch_rope_with_explicit_cos_sin, torch_attention (GQA-native, no repeat_kv), torch_moe.

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* [None][fix] Disable fuse_finegrained_fp8_moe for MiniMax-M2 in registry config

The trtllm fused MoE kernel fails with NVRTC compilation error for MiniMax-M2's MoE configuration (256 experts, block-wise FP8). Add transform disablement and torch-simple compile backend to the model registry config so --use-registry works out of the box.

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

---------

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
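The "sigmoid routing + e_score_correction_bias" feature in the commit message can be illustrated with a small pure-Python sketch. It is not the code from this commit. It assumes the common convention in which the correction bias only affects which experts are selected, while the mixing weights come from the unbiased sigmoid scores, renormalized over the selected experts. The function name `route_token` and the toy sizes (4 experts, top-2 instead of 256 experts, top-8) are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route_token(logits, correction_bias, top_k):
    """Pick top_k experts for one token and return (expert_ids, weights).

    Assumption: the bias is added only for expert selection; the returned
    weights use the unbiased sigmoid scores, renormalized to sum to 1.
    """
    scores = [sigmoid(z) for z in logits]
    biased = [s + b for s, b in zip(scores, correction_bias)]
    # Experts ranked by biased score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)
    expert_ids = ranked[:top_k]
    total = sum(scores[i] for i in expert_ids)
    weights = [scores[i] / total for i in expert_ids]
    return expert_ids, weights


# Toy example: 4 experts, top-2. The bias promotes expert 1 into the
# selection, but its weight is still based on its unbiased score.
logits = [2.0, 0.0, -1.0, 1.0]
bias = [0.0, 0.5, 0.0, 0.0]
expert_ids, weights = route_token(logits, bias, top_k=2)
unbiased_ids, _ = route_token(logits, [0.0] * 4, top_k=2)
print(expert_ids, weights, unbiased_ids)
```

In this toy case the bias swaps expert 3 out for expert 1, yet expert 0 keeps the larger weight because the weights ignore the bias.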
1 parent aa76fe5 commit 86b8b2f

6 files changed

Lines changed: 1191 additions & 79 deletions

Lines changed: 8 additions & 0 deletions
@@ -1,4 +1,12 @@
 # MiniMax-M2 - override model dtype and attention backend for AutoDeploy
 attn_backend: flashinfer
+# Disable fuse_finegrained_fp8_moe: the trtllm fused MoE kernel fails with
+# NVRTC compilation error for MiniMax-M2's MoE config (256 experts, block-wise FP8).
+# Use torch-simple compile backend since the Triton MoE fallback is not
+# CUDA-graph-capturable.
+compile_backend: torch-simple
+transforms:
+  fuse_finegrained_fp8_moe:
+    enabled: false
 model_kwargs:
   torch_dtype: bfloat16

examples/auto_deploy/model_registry/models.yaml

Lines changed: 1 addition & 1 deletion
@@ -229,7 +229,7 @@ models:
   yaml_extra: ['dashboard_default.yaml', 'world_size_8.yaml']
 # --- MiniMax-M2.5 (Feb 2026) ---
 - name: MiniMaxAI/MiniMax-M2.5
-  yaml_extra: ['dashboard_default.yaml', 'world_size_8.yaml']
+  yaml_extra: ['dashboard_default.yaml', 'world_size_8.yaml', 'minimax_m2.yaml']
 # --- MiMo-V2-Flash (Feb 2026) ---
 - name: XiaomiMiMo/MiMo-V2-Flash
   yaml_extra: ['dashboard_default.yaml', 'world_size_8.yaml']
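The registry entry lists 'minimax_m2.yaml' after two other config files. A rough sketch of how such layered configs could combine, assuming later files are deep-merged over earlier ones (this merge behavior is an assumption, not something the diff confirms). The `deep_merge` helper and the contents of `earlier_layers` are hypothetical placeholders; only the `minimax_m2` dictionary mirrors values from this commit.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` wins, merging nested dicts key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-in for whatever the earlier files contribute.
earlier_layers = {
    "compile_backend": "placeholder-backend",
    "transforms": {"hypothetical_transform": {"enabled": True}},
}

# Values taken from the MiniMax-M2 config in this commit.
minimax_m2 = {
    "attn_backend": "flashinfer",
    "compile_backend": "torch-simple",
    "transforms": {"fuse_finegrained_fp8_moe": {"enabled": False}},
    "model_kwargs": {"torch_dtype": "bfloat16"},
}

merged = deep_merge(earlier_layers, minimax_m2)
print(merged)
```

Under this assumption, the model-specific file overrides the compile backend and disables only the one transform, while unrelated settings from earlier layers survive.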

tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -14,6 +14,7 @@
 from .modeling_internlm3 import InternLM3ForCausalLM
 from .modeling_kimi_k2 import KimiK2ForCausalLM, KimiK25ForConditionalGeneration
 from .modeling_llama3 import Llama3ForCausalLM
+from .modeling_minimax_m2 import MiniMaxM2ForCausalLM
 from .modeling_mistral import MistralForCausalLM
 from .modeling_mistral3 import Mistral3ForConditionalGeneration, Mistral3TextForCausalLM
 from .modeling_nemotron_flash import NemotronFlashForCausalLM, NemotronFlashPreTrainedTokenizerFast
@@ -48,6 +49,7 @@
     "KimiK2ForCausalLM",
     "KimiK25ForConditionalGeneration",
     "Llama3ForCausalLM",
+    "MiniMaxM2ForCausalLM",
     "MistralForCausalLM",
     "Mistral3ForConditionalGeneration",
     "Mistral3TextForCausalLM",
