Commit 08a963c

[None][feat] Add AD custom model for MiniMax-M2 family (#243)
* [None][feat] Add AD custom model for MiniMax-M2 family

  Replace the existing MiniMax-M2 MoE patch with a full custom model implementation using AD canonical ops. Covers both MiniMaxAI/MiniMax-M2 and MiniMaxAI/MiniMax-M2.5 (same architecture, model_type: minimax_m2).

  Key architecture features:
  - MoE with 256 experts, top-8, sigmoid routing + e_score_correction_bias
  - GQA (48 Q heads, 8 KV heads, head_dim=128)
  - Partial RoPE (rotary_dim=64 out of head_dim=128)
  - Per-layer QK normalization (RMSNorm on full Q/K before reshape)
  - FP8 block-wise quantized checkpoint

  Canonical ops used: torch_rmsnorm, torch_rope_with_explicit_cos_sin, torch_attention (GQA-native, no repeat_kv), torch_moe.

  Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* [None][fix] Disable fuse_finegrained_fp8_moe for MiniMax-M2 in registry config

  The trtllm fused MoE kernel fails with NVRTC compilation error for MiniMax-M2's MoE configuration (256 experts, block-wise FP8). Add transform disablement and torch-simple compile backend to the model registry config so --use-registry works out of the box.

  Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

---------

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
1 parent fb2bf31 commit 08a963c

6 files changed

Lines changed: 1191 additions & 79 deletions


Lines changed: 8 additions & 0 deletions
@@ -1,4 +1,12 @@
 # MiniMax-M2 - override model dtype and attention backend for AutoDeploy
 attn_backend: flashinfer
+# Disable fuse_finegrained_fp8_moe: the trtllm fused MoE kernel fails with
+# NVRTC compilation error for MiniMax-M2's MoE config (256 experts, block-wise FP8).
+# Use torch-simple compile backend since the Triton MoE fallback is not
+# CUDA-graph-capturable.
+compile_backend: torch-simple
+transforms:
+  fuse_finegrained_fp8_moe:
+    enabled: false
 model_kwargs:
   torch_dtype: bfloat16

examples/auto_deploy/model_registry/models.yaml

Lines changed: 1 addition & 1 deletion
@@ -229,7 +229,7 @@ models:
   yaml_extra: ['dashboard_default.yaml', 'world_size_8.yaml']
 # --- MiniMax-M2.5 (Feb 2026) ---
 - name: MiniMaxAI/MiniMax-M2.5
-  yaml_extra: ['dashboard_default.yaml', 'world_size_8.yaml']
+  yaml_extra: ['dashboard_default.yaml', 'world_size_8.yaml', 'minimax_m2.yaml']
 # --- MiMo-V2-Flash (Feb 2026) ---
 - name: XiaomiMiMo/MiMo-V2-Flash
   yaml_extra: ['dashboard_default.yaml', 'world_size_8.yaml']
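The change above appends 'minimax_m2.yaml' to the yaml_extra list. One plausible reading is that the listed files are layered in order, with later files overriding earlier ones. The sketch below illustrates that idea with a recursive dictionary merge. The merge semantics, the `deep_merge` helper, and the contents of the `earlier` dictionary are all assumptions for illustration; only the `minimax_m2` dictionary mirrors values shown in this commit.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict in which `override` wins, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-in for settings from an earlier yaml_extra file.
# The real contents of those files are not shown in this commit.
earlier = {
    "compile_backend": "placeholder-backend",
    "transforms": {"some_other_transform": {"enabled": True}},
}

# Mirrors the MiniMax-M2 config fragment added in this commit.
minimax_m2 = {
    "attn_backend": "flashinfer",
    "compile_backend": "torch-simple",
    "transforms": {"fuse_finegrained_fp8_moe": {"enabled": False}},
    "model_kwargs": {"torch_dtype": "bfloat16"},
}

merged = deep_merge(earlier, minimax_m2)
```

Under this assumed layering, the MiniMax-M2 fragment replaces the compile backend and adds its transform entry, while unrelated transform settings from earlier files are preserved.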

tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,7 @@
 from .modeling_internlm3 import InternLM3ForCausalLM
 from .modeling_kimi_k2 import KimiK2ForCausalLM, KimiK25ForConditionalGeneration
 from .modeling_llama3 import Llama3ForCausalLM
+from .modeling_minimax_m2 import MiniMaxM2ForCausalLM
 from .modeling_mistral import MistralForCausalLM
 from .modeling_mistral3 import Mistral3ForConditionalGeneration, Mistral3TextForCausalLM
 from .modeling_nemotron_flash import NemotronFlashForCausalLM, NemotronFlashPreTrainedTokenizerFast
@@ -55,6 +56,7 @@
     "KimiK2ForCausalLM",
     "KimiK25ForConditionalGeneration",
     "Llama3ForCausalLM",
+    "MiniMaxM2ForCausalLM",
     "MistralForCausalLM",
     "Mistral3ForConditionalGeneration",
     "Mistral3TextForCausalLM",
