[None][feat] Add AD custom model for MiniMax-M2 family #243
lucaslie
left a comment
Please rebase and run again. I believe a fix has been merged by now that skips the fine-grained FP8 fusion on Hopper or earlier GPUs. Please confirm this and, if so, re-run build_and_run_ad and report back.
Replace the existing MiniMax-M2 MoE patch with a full custom model implementation using AD canonical ops. Covers both MiniMaxAI/MiniMax-M2 and MiniMaxAI/MiniMax-M2.5 (same architecture, model_type: minimax_m2).

Key architecture features:
- MoE with 256 experts, top-8, sigmoid routing + e_score_correction_bias
- GQA (48 Q heads, 8 KV heads, head_dim=128)
- Partial RoPE (rotary_dim=64 out of head_dim=128)
- Per-layer QK normalization (RMSNorm on full Q/K before reshape)
- FP8 block-wise quantized checkpoint

Canonical ops used: torch_rmsnorm, torch_rope_with_explicit_cos_sin, torch_attention (GQA-native, no repeat_kv), torch_moe.

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
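The routing described above (top-k, sigmoid routing + e_score_correction_bias) can be sketched for a single token in plain Python. This is a minimal illustration, not the code in modeling_minimax_m2.py: the function name `route_token` is hypothetical, and it assumes the bias only influences which experts are selected while the mixing weights come from the unbiased sigmoid scores, renormalized to sum to 1.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route_token(
    router_logits: List[float],
    e_score_correction_bias: List[float],
    top_k: int,
) -> Tuple[List[int], List[float]]:
    """Pick top_k experts for one token and return (expert ids, weights)."""
    # Sigmoid scores are computed independently per expert (no softmax coupling).
    scores = [sigmoid(z) for z in router_logits]
    # Assumption: the correction bias only affects which experts are selected.
    biased = [s + b for s, b in zip(scores, e_score_correction_bias)]
    chosen = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Assumption: weights use the unbiased scores, renormalized over the chosen experts.
    total = sum(scores[i] for i in chosen)
    weights = [scores[i] / total for i in chosen]
    return chosen, weights


# Toy example with 4 experts and top-2 (the model itself uses 256 experts, top-8).
ids, weights = route_token(
    router_logits=[2.0, 0.0, -1.0, 1.0],
    e_score_correction_bias=[0.0, 0.6, 0.0, 0.0],
    top_k=2,
)
```

In the toy example the bias promotes expert 1 into the selected set, yet its weight stays below that of expert 0 because the weights ignore the bias.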
[AGENT] Rebased onto latest [...] NVRTC issue is NOT fixed: the [...] I checked the diff between our branch and [...] Workaround run (full 62 layers) succeeded with coherent generation:

    python examples/auto_deploy/build_and_run_ad.py \
      --model MiniMaxAI/MiniMax-M2 \
      --use-registry \
      --args.transforms.fuse_finegrained_fp8_moe.enabled false \
      --args.compile-backend torch-simple

All 10 test prompts produce correct, coherent responses. The custom model code itself is working correctly; only the [...] Should we wait for the Hopper skip fix to be merged, or is this PR ready to merge as-is with the documented workaround?
Please incorporate the disablement of the transform into a custom config and correctly add it to the model registry. Then come back after re-running build_and_run_ad and post RAW LOGS FOR ALL PROMPTS + OUTPUTS PRODUCED BY THE MODEL
…ry config

The trtllm fused MoE kernel fails with an NVRTC compilation error for MiniMax-M2's MoE configuration (256 experts, block-wise FP8). Add transform disablement and torch-simple compile backend to the model registry config so --use-registry works out of the box.

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
[AGENT] Updated [...]

    python examples/auto_deploy/build_and_run_ad.py --model MiniMaxAI/MiniMax-M2 --use-registry

Result: SUCCESS: full 62 layers, 8x H100, ~281s transform time, ~42s inference for 10 prompts.

Registry config applied automatically:

    attn_backend: flashinfer
    compile_backend: torch-simple
    transforms:
      fuse_finegrained_fp8_moe:
        enabled: false
    model_kwargs:
      torch_dtype: bfloat16

Raw Logs: All Prompts + Outputs

[PROMPT 0] User: "How big is the universe?"
[PROMPT 1] User: "In simple words and a single sentence, explain the concept of gravity:"
[PROMPT 2] User: "How to fix slicing in golf?"
[PROMPT 3] User: "Where is the capital of Iceland?"
[PROMPT 4] User: "What are the three laws of thermodynamics?"
[PROMPT 5] User: "Summarize the plot of Romeo and Juliet in two sentences:"
[PROMPT 6] User: "Write a Python function that checks if a number is prime."
[PROMPT 7] User: "Explain the difference between a compiler and an interpreter:"
[PROMPT 8] User: "What causes the northern lights?"
[PROMPT 9] User: "What are the health benefits of drinking green tea?"

All 10 prompts produce coherent, on-topic, structured chain-of-thought reasoning.
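The registry config above encodes the same settings as the dotted command-line overrides used in the earlier workaround run. A minimal sketch of that correspondence follows; `set_dotted` is a hypothetical helper, not AutoDeploy's argument parser, and it assumes the `--args.` prefix is stripped and the hyphen in `compile-backend` is normalized to an underscore.

```python
from typing import Any, Dict


def set_dotted(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set config["a"]["b"]["c"] = value for dotted_key "a.b.c", creating dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


# Hypothetical stand-in for the parsed CLI overrides from the workaround command.
overrides = {
    "transforms.fuse_finegrained_fp8_moe.enabled": False,
    "compile_backend": "torch-simple",
}

# Remaining registry settings that were not passed on the command line.
config: Dict[str, Any] = {
    "attn_backend": "flashinfer",
    "model_kwargs": {"torch_dtype": "bfloat16"},
}
for key, value in overrides.items():
    set_dotted(config, key, value)
```

After the loop, `config` has the same nested shape as the YAML applied through --use-registry, which is why the second run needs no extra flags.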
* [None][feat] Add AD custom model for MiniMax-M2 family

  Replace the existing MiniMax-M2 MoE patch with a full custom model implementation using AD canonical ops. Covers both MiniMaxAI/MiniMax-M2 and MiniMaxAI/MiniMax-M2.5 (same architecture, model_type: minimax_m2).

  Key architecture features:
  - MoE with 256 experts, top-8, sigmoid routing + e_score_correction_bias
  - GQA (48 Q heads, 8 KV heads, head_dim=128)
  - Partial RoPE (rotary_dim=64 out of head_dim=128)
  - Per-layer QK normalization (RMSNorm on full Q/K before reshape)
  - FP8 block-wise quantized checkpoint

  Canonical ops used: torch_rmsnorm, torch_rope_with_explicit_cos_sin, torch_attention (GQA-native, no repeat_kv), torch_moe.

  Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* [None][fix] Disable fuse_finegrained_fp8_moe for MiniMax-M2 in registry config

  The trtllm fused MoE kernel fails with an NVRTC compilation error for MiniMax-M2's MoE configuration (256 experts, block-wise FP8). Add transform disablement and torch-simple compile backend to the model registry config so --use-registry works out of the box.

  Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

---------

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
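The partial RoPE feature (rotary_dim=64 out of head_dim=128) amounts to rotating only the leading slice of each head vector and passing the remainder through unchanged. The sketch below illustrates that split, apply, concat pattern on a plain Python list. It is not the torch_rope_with_explicit_cos_sin op: the function name `partial_rope`, the rotate-half pairing, and the frequency base of 10000 are assumptions made for illustration, and cos/sin are computed inline rather than passed in explicitly.

```python
import math
from typing import List


def partial_rope(
    x: List[float], position: int, rotary_dim: int, base: float = 10000.0
) -> List[float]:
    """Rotate the first rotary_dim entries of one head vector; pass the rest through."""
    # Split: only the leading rotary_dim entries receive positional rotation.
    rot, passthrough = x[:rotary_dim], x[rotary_dim:]
    half = rotary_dim // 2
    out = [0.0] * rotary_dim
    for i in range(half):
        # Assumption: rotate-half pairing (i, i + half) with frequency base ** (-2i / rotary_dim).
        angle = position * base ** (-2.0 * i / rotary_dim)
        cos, sin = math.cos(angle), math.sin(angle)
        a, b = rot[i], rot[i + half]
        # Apply: a 2D rotation of the pair (a, b) by the position-dependent angle.
        out[i] = a * cos - b * sin
        out[i + half] = b * cos + a * sin
    # Concat: rotated slice followed by the untouched remainder.
    return out + passthrough


head_dim, rotary_dim = 128, 64
q = [float(i % 7) for i in range(head_dim)]
q_rot = partial_rope(q, position=5, rotary_dim=rotary_dim)
```

Because each pair undergoes a pure rotation, the vector norm is preserved, and the last 64 entries of `q_rot` are identical to those of `q`.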
Summary
- [...] patches/minimax_m2.py) with a full custom model implementation using AD canonical ops
- [...] model_type: minimax_m2, MiniMaxM2ForCausalLM)
- [...] minimax_m2.yaml config to M2.5 entry

Architecture
- [...] e_score_correction_bias (noaux_tc)

Canonical ops used
- torch_rmsnorm: RMSNorm and QK norm
- torch_rope_with_explicit_cos_sin: Partial RoPE (split, apply, concat)
- torch_attention: GQA-native (no repeat_kv)
- torch_moe: Expert computation (sigmoid routing in plain PyTorch)

Files changed
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_minimax_m2.py
- tests/unittest/auto_deploy/singlegpu/models/test_minimax_m2_modeling.py
- tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py
- examples/auto_deploy/model_registry/models.yaml
- tensorrt_llm/_torch/auto_deploy/models/patches/minimax_m2.py

AD End-to-End Run
Status: SUCCESS with full 62 layers on 8x H100 GPUs.
Generation quality is coherent and correct across all 10 test prompts (factual questions, code generation, summarization).
Note: The FuseFineGrainedFP8Moe transform causes an NVRTC compilation failure on H100 with this model's MoE configuration (256 experts, block-wise FP8). This is a pre-existing backend infrastructure issue unrelated to the custom model code. Workaround used for the test run:

    python examples/auto_deploy/build_and_run_ad.py \
      --model MiniMaxAI/MiniMax-M2 \
      --use-registry \
      --args.transforms.fuse_finegrained_fp8_moe.enabled false \
      --args.compile-backend torch-simple

To reproduce with default settings (once NVRTC issue is resolved):
Test plan
Run unit tests:
Tests include:
🤖 Generated with Claude Code