Commit 2a18bd4

[None][feat] BREAKING: Add MiniMax-M3 PyTorch backend bring-up with API changes (#15292)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
Signed-off-by: WeiHaocheng <20514172+WeiHaocheng@users.noreply.github.com>
1 parent 4a8b7af commit 2a18bd4

30 files changed

Lines changed: 11125 additions & 76 deletions

docs/source/models/supported-models.md

Lines changed: 4 additions & 0 deletions
@@ -32,6 +32,7 @@ The following is a table of supported models for the PyTorch backend:
 | `LlamaForCausalLM` | Llama 3.1, Llama 3, Llama 2, LLaMA | `meta-llama/Meta-Llama-3.1-70B` |
 | `Llama4ForConditionalGeneration` | Llama 4 | `meta-llama/Llama-4-Scout-17B-16E-Instruct` |
 | `MiniMaxM2ForCausalLM` [^5] | MiniMax M2/M2.1/M2.7 | `MiniMaxAI/MiniMax-M2.7` |
+| `MiniMaxM3SparseForConditionalGeneration` [^11]| MiniMax-M3 | `MiniMaxAI/MiniMax-M3` |
 | `MistralForCausalLM` | Mistral | `mistralai/Mistral-7B-v0.1` |
 | `MixtralForCausalLM` | Mixtral | `mistralai/Mixtral-8x7B-v0.1` |
 | `MllamaForConditionalGeneration` | Llama 3.2 | `meta-llama/Llama-3.2-11B-Vision` |
@@ -72,6 +73,7 @@ Note: Support for other models may vary. Features marked "N/A" are not applicabl
 | `NemotronHForCausalLM` | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | Yes | Yes | Yes | N/A | Untested | Untested |
 | `Gemma4ForConditionalGeneration` | Untested | Yes | Untested | No | Yes | No | No | No | No | Yes | Untested | No | Yes | Untested | Untested |
 | `Step3p7ForConditionalGeneration`| Yes | Yes | Yes | Untested | Untested | Yes | No | No | No | Yes | Untested | Untested | Yes | Untested | Untested |
+| `MiniMaxM3SparseForConditionalGeneration` [^11] | Yes | Yes | Yes | Untested | Untested | No | No | No | No | Yes | Untested | No | N/A | Untested | Untested |

 [^1]: Chunked Prefill for MLA can only be enabled on SM100/SM103.
 [^2]: KV cache reuse for MLA can only be enabled on SM90/SM100/SM103 and in BF16/FP8 KV cache dtype.
@@ -82,6 +84,7 @@ Note: Support for other models may vary. Features marked "N/A" are not applicabl
 [^8]: Supports text and image inputs. The vision tower runs in BF16 even when the text decoder is quantized (FP8 block-scale or NVFP4). The text decoder is also usable standalone (text-only) via the `Step3p5ForCausalLM` architecture.
 [^9]: Audio modality only supported on E2B/E4B variants.
 [^10]: Audio requires a checkpoint with a `sound_config` and is supported only on the full (non-disaggregated) model path, not the EPD disaggregated path.
+[^11]: Supports text, image, and video inputs over the block-sparse attention path. The published MXFP8 checkpoint is dequantized on load so the runtime sees an effectively BF16 model. The text decoder is also usable standalone (text-only) via the `MiniMaxM3SparseForCausalLM` architecture. KV cache reuse and MTP are not supported on the sparse-attention path in this release.

 # Multimodal Feature Support Matrix (PyTorch Backend)

@@ -102,6 +105,7 @@ Note: Support for other models may vary. Features marked "N/A" are not applicabl
 | `Qwen3VLForConditionalGeneration` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | L + I + V |
 | `Qwen3VLMoeForConditionalGeneration` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | L + I + V |
 | `Step3p7ForConditionalGeneration` | Yes | Yes | Untested | Yes | Untested | Untested | Untested | Untested | L + I |
+| `MiniMaxM3SparseForConditionalGeneration` [^11] | Yes | Yes | Untested | Yes | Untested | No | Untested | Untested | L + I + V |

 Note:
 - L: Language
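
Reviewer note on footnote [^11]: the phrase "dequantized on load" can be pictured with the sketch below. It is not the loader added by this commit (that code is not shown in this excerpt). It assumes the checkpoint follows the OCP Microscaling layout, where each block of 32 elements shares one E8M0 power-of-two scale, and it assumes the FP8 elements have already been decoded to Python floats. The function name `dequantize_mx_blocks` is hypothetical, and the final rounding to BF16 is not modeled.

```python
from typing import List, Sequence

MX_BLOCK = 32     # elements per shared scale in the OCP MX formats
E8M0_BIAS = 127   # exponent bias of the E8M0 shared scale
E8M0_NAN = 255    # reserved E8M0 encoding for NaN


def dequantize_mx_blocks(elements: Sequence[float],
                         scales_e8m0: Sequence[int]) -> List[float]:
    """Expand block-scaled values into plain floats (hypothetical helper).

    `elements` are FP8 values already decoded to floats; `scales_e8m0`
    holds one biased power-of-two exponent per block of MX_BLOCK elements.
    """
    if len(elements) != len(scales_e8m0) * MX_BLOCK:
        raise ValueError("expected exactly MX_BLOCK elements per scale")
    out = []
    for i, value in enumerate(elements):
        biased = scales_e8m0[i // MX_BLOCK]
        if biased == E8M0_NAN:
            out.append(float("nan"))
            continue
        # The shared scale is 2 ** (biased exponent - bias).
        out.append(value * 2.0 ** (biased - E8M0_BIAS))
    return out


# Two blocks: the first has scale 2**0, the second has scale 2**2.
dequantized = dequantize_mx_blocks([1.0] * 32 + [-0.5] * 32, [127, 129])
```

After this step every weight is an ordinary floating-point value, which is why the footnote describes the runtime as seeing an effectively BF16 model.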

tensorrt_llm/_torch/attention_backend/sparse/__init__.py

Lines changed: 20 additions & 0 deletions
@@ -1,3 +1,13 @@
+# yapf: disable
+from .minimax_m3 import (MiniMaxM3SparseAttention,
+                         MiniMaxM3SparseAttentionMetadata,
+                         MiniMaxM3SparseConfig, MiniMaxM3SparseIndexCache,
+                         allocate_minimax_m3_static_buffers,
+                         build_runtime_metadata_from_kv_manager,
+                         get_minimax_m3_attention_backend_cls,
+                         get_minimax_m3_kv_cache_manager_cls,
+                         minimax_m3_sparse_decode, minimax_m3_sparse_prefill)
+# yapf: enable
 from .utils import (get_flashinfer_sparse_attn_attention_backend,
                     get_sparse_attn_kv_cache_manager,
                     get_trtllm_sparse_attn_attention_backend,
@@ -8,4 +18,14 @@
     "get_vanilla_sparse_attn_attention_backend",
     "get_trtllm_sparse_attn_attention_backend",
     "get_flashinfer_sparse_attn_attention_backend",
+    "MiniMaxM3SparseAttention",
+    "MiniMaxM3SparseAttentionMetadata",
+    "MiniMaxM3SparseConfig",
+    "MiniMaxM3SparseIndexCache",
+    "allocate_minimax_m3_static_buffers",
+    "build_runtime_metadata_from_kv_manager",
+    "get_minimax_m3_attention_backend_cls",
+    "get_minimax_m3_kv_cache_manager_cls",
+    "minimax_m3_sparse_decode",
+    "minimax_m3_sparse_prefill",
 ]
Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""MiniMax-M3 sparse attention package.
+
+Layered as:
+
+* :mod:`.kernels`       -- OpenAI Triton kernels (per-block max
+                           score, masked softmax for sparse GQA).
+* :mod:`.metadata`      -- ``MiniMaxM3SparseConfig`` /
+                           ``MiniMaxM3SparseAttentionMetadata``
+                           dataclasses, CUDA-graph-stable buffer
+                           allocator + builder, and the
+                           :class:`AttentionMetadata` subclass
+                           factory.
+* :mod:`.cache_manager` -- standalone side index cache used by tests
+                           and the :class:`KVCacheManagerV2`
+                           subclass factory.
+* :mod:`.backend`       -- the algorithm itself (vectorized
+                           paged-cache helpers, prefill / decode
+                           entry points, the thin
+                           :class:`MiniMaxM3SparseAttention`
+                           orchestrator) and the
+                           :class:`AttentionBackend` subclass
+                           factory.
+
+This package's public surface re-exports the names callers
+historically imported from ``...sparse.minimax_m3`` so external
+importers (the model code, ``sparse.utils``, focused tests) keep
+working unchanged.
+"""
+
+# Re-export the algorithm-internal helpers focused unit tests reach
+# into so the package preserves the surface the monolithic module
+# exposed. These are not part of ``__all__`` (still package-private)
+# but stay importable as ``from ...minimax_m3 import _write_main_kv_slots``.
+from .backend import (  # noqa: F401
+    MiniMaxM3SparseAttention,
+    _compute_index_attn_chunk_q,
+    _compute_sparse_gqa_chunk_q,
+    _gather_paged_batched,
+    _index_attention_and_select,
+    _write_main_kv_slots,
+    _write_main_kv_slots_to_pool,
+    get_minimax_m3_attention_backend_cls,
+    minimax_m3_sparse_decode,
+    minimax_m3_sparse_prefill,
+)
+from .cache_manager import (
+    MiniMaxM3KVCacheManagerV2,
+    MiniMaxM3SparseIndexCache,
+    get_minimax_m3_kv_cache_manager_cls,
+)
+from .metadata import (
+    MiniMaxM3SparseAttentionMetadata,
+    MiniMaxM3SparseConfig,
+    allocate_minimax_m3_static_buffers,
+    build_runtime_metadata_from_kv_manager,
+    get_minimax_m3_attention_metadata_cls,
+    replace_metadata,
+)
+
+__all__ = [
+    "MiniMaxM3KVCacheManagerV2",
+    "MiniMaxM3SparseAttention",
+    "MiniMaxM3SparseAttentionMetadata",
+    "MiniMaxM3SparseConfig",
+    "MiniMaxM3SparseIndexCache",
+    "allocate_minimax_m3_static_buffers",
+    "build_runtime_metadata_from_kv_manager",
+    "get_minimax_m3_attention_backend_cls",
+    "get_minimax_m3_attention_metadata_cls",
+    "get_minimax_m3_kv_cache_manager_cls",
+    "minimax_m3_sparse_decode",
+    "minimax_m3_sparse_prefill",
+    "replace_metadata",
+]
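
Reviewer note: the package docstring mentions a per-block max score and a masked softmax. One common way to combine the two is sketched below in plain Python, for a single query and a single head. This is a guess at the general shape of block-sparse attention, not the Triton kernels or the backend in this commit. The function names, the top-k selection rule, and the reuse of the main keys for scoring are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_blocks(q: Vector, keys: Sequence[Vector],
                  block_size: int, top_k: int) -> List[int]:
    """Score each key block by its max q.k and keep the top_k blocks."""
    if top_k < 1 or block_size < 1:
        raise ValueError("top_k and block_size must be positive")
    num_blocks = math.ceil(len(keys) / block_size)
    scores = [
        max(dot(q, k) for k in keys[b * block_size:(b + 1) * block_size])
        for b in range(num_blocks)
    ]
    ranked = sorted(range(num_blocks), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])


def sparse_attention(q: Vector, keys: Sequence[Vector],
                     values: Sequence[Vector],
                     block_size: int, top_k: int) -> List[float]:
    """Softmax attention restricted to the selected key blocks."""
    kept = select_blocks(q, keys, block_size, top_k)
    positions = [
        p for b in kept
        for p in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[p]) * scale for p in positions]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [
        sum(w * values[p][d] for w, p in zip(weights, positions)) / total
        for d in range(len(values[0]))
    ]


keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0], [0.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
kept_blocks = select_blocks([1.0, 0.0], keys, block_size=2, top_k=1)
output = sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=1)
```

With `top_k` set to the number of blocks, the sketch reduces to ordinary dense softmax attention, which makes it easy to compare the two paths in a test.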