Qwen3Next MTP for vLLM plugin mode #772
Conversation
Pull request overview
This PR adds support for running Qwen3Next MTP (multi-token prediction / EAGLE-style speculative decoding) under vLLM plugin mode, including draft-model construction, KV-cache indexing fixes, and attention/metadata handling for multi-token verification.
Changes:
- Register `Qwen3NextMTP` for vLLM plugin mode and add model-class routing to the ATOM vLLM wrapper.
- Teach the vLLM wrapper to detect draft-model construction, load draft weights correctly (`spec_decode=True`), and swap the global `atom_config` during `forward()` to keep layer lookups consistent across target/draft alternation (see the sketch below).
- Update plugin attention metadata and paged attention implementations to correctly handle multi-token decode layouts used by MTP/EAGLE.
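The config swap called out in the second bullet can be pictured as a save/restore around each forward call. A minimal sketch, assuming a setter counterpart to the `get_current_atom_config` getter that appears later in this PR (`set_current_atom_config` is a hypothetical name):

```python
# Minimal sketch, not the PR's actual implementation.
# set_current_atom_config is a hypothetical setter; only
# get_current_atom_config appears in the quoted diffs below.
def forward(self, *args, **kwargs):
    prev = get_current_atom_config()           # config active on entry
    set_current_atom_config(self.atom_config)  # make this model's layers resolvable
    try:
        return self.model(*args, **kwargs)
    finally:
        set_current_atom_config(prev)          # restore for the other model
```

Restoring the previous config in `finally` keeps target/draft layer lookups consistent even when vLLM alternates between the two models or a forward raises.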
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `atom/plugin/vllm/register.py` | Registers the Qwen3NextMTP architecture override for vLLM plugin mode. |
| `atom/plugin/vllm/model_wrapper.py` | Detects draft vs. target, routes the draft architecture, swaps/restores the global `atom_config` for forwards, and passes `spec_decode` into weight loading. |
| `atom/plugin/vllm/attention_backend/attention_gdn.py` | Fixes GDN attention output writeback for speculative decode and adjusts imports/code paths. |
| `atom/plugin/attention.py` | Adjusts attention metadata builder thresholds/logic for MTP/EAGLE multi-token verification and async spec-decode metadata. |
| `atom/plugin/attention_mha.py` | Updates paged-attention decode kernels and buffer sizing to support the MTP multi-token decode layout; fixes extend block-table slicing. |
| `atom/models/qwen3_next.py` | Adds an explicit `layer_num` for attention KV slot isolation in MTP, fixes the `speculative_config` fallback for vLLM, and exposes `embed_tokens` for sharing. |
| `atom/models/qwen3_next_mtp.py` | Implements the Qwen3Next MTP draft model with correct layer indexing, quant prefixing, and expert mapping for shared-expert fusion. |
| `atom/model_loader/loader.py` | Plumbs `spec_decode` through plugin-mode loading so draft models can load `mtp.*` weights and apply MTP remapping. |
```python
# `kernel_size-1 + num_spec` rows per slot and the extra row spilled
# into the page-adjacent ssm_state, corrupting layer 0's recurrent
# state. Pull the spec config from the vLLM config as a fallback.
if is_vllm() and self.speculative_config is None:
```
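A hedged sketch of what that fallback could look like, since the diff is truncated here; it assumes vLLM's `get_current_vllm_config()` accessor is reachable in plugin mode:

```python
# Sketch only; the actual diff is cut off above. Assumes vLLM's
# get_current_vllm_config() is importable in plugin mode.
if is_vllm() and self.speculative_config is None:
    from vllm.config import get_current_vllm_config

    vllm_cfg = get_current_vllm_config()
    self.speculative_config = getattr(vllm_cfg, "speculative_config", None)
```

With the spec config recovered, the conv state can be sized for `kernel_size - 1 + num_spec` rows per slot instead of spilling into the page-adjacent `ssm_state`.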
Force-pushed from `f38481f` to `3af7ccb`.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
`atom/plugin/vllm/model_wrapper.py:421`
- The draft-model detection in `load_weights` checks for `"Qwen3NextMTP"`, but other parts of the repo still use the architecture key `"Qwen3NextMTPModel"` for Qwen3-Next MTP. If the draft model's HF config reports `"Qwen3NextMTPModel"`, spec-decode-specific loading (`hf_config_override` / weight filtering) won't activate. Please align this set with the actual HF architecture string used for the draft model.
```python
is_mtp_draft_model = self.model_arch in {
    "DeepSeekMTPModel",
    "Qwen3NextMTP",
}
```
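One way to address this, sketched as a suggestion rather than the PR's actual change, is to keep both draft-arch spellings in one shared set:

```python
# Suggested sketch; the set name and third entry are assumptions.
_MTP_DRAFT_ARCHS = {
    "DeepSeekMTPModel",
    "Qwen3NextMTP",       # vLLM registry spelling
    "Qwen3NextMTPModel",  # ATOM standalone / HF config spelling
}
is_mtp_draft_model = self.model_arch in _MTP_DRAFT_ARCHS
```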
```python
main_model_arch = vllm_config.model_config.architectures[0]
model_arch = _select_model_arch(vllm_config)
self.is_mtp_draft_model = self.is_mtp and model_arch != main_model_arch
if self.is_mtp_draft_model:
    self.atom_config = get_current_atom_config()
else:
    self.atom_config = generate_atom_config_for_plugin_mode(vllm_config)
self.model_arch = model_arch
_prepare_env(atom_config=self.atom_config)
```
Hi @ganyi1996ppo |
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (2)
`atom/plugin/vllm/model_wrapper.py:196`
- `_expose_spec_decode_attrs()` is now only executed when `model_arch in _MTP_MASK_INPUT_ARCH` (currently only `DeepSeekMTPModel`). The new `Qwen3NextMTP` model has the same extra `.model` nesting and does not expose `embed_tokens`/`layers` on the outer module, so vLLM speculative-decoding weight/embedding sharing is likely to fail. Suggest calling `_expose_spec_decode_attrs()` for all MTP draft models that wrap an inner `.model` (and keeping `_adapt_mtp_layers_for_vllm()` gated separately if it's DeepSeek-specific), or adding `Qwen3NextMTP` to the relevant allowlist.
logger.info(f"Construct ATOM model {model_arch} for vLLM plugin mode")
self.model = model_cls(self.atom_config)
if model_arch in _MTP_MASK_INPUT_ARCH:
self._adapt_mtp_layers_for_vllm()
# Mirror nested attributes required by vLLM speculative decoding.
self._expose_spec_decode_attrs()
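A sketch of the suggested restructuring, reusing the hypothetical `_MTP_DRAFT_ARCHS` set from the earlier comment (names are assumptions, not the PR's code):

```python
# Suggested shape only: expose spec-decode attributes for every MTP
# draft model that wraps an inner .model, and keep the DeepSeek-specific
# layer adaptation behind its own gate.
if model_arch in _MTP_DRAFT_ARCHS:
    # Mirror nested attributes required by vLLM speculative decoding.
    self._expose_spec_decode_attrs()
if model_arch in _MTP_MASK_INPUT_ARCH:  # DeepSeek-specific
    self._adapt_mtp_layers_for_vllm()
```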
`atom/plugin/vllm/model_wrapper.py:422`
- Draft-model detection only checks `self.model_arch` against `{"DeepSeekMTPModel", "Qwen3NextMTP"}`. If the HF draft config still reports `Qwen3NextMTPModel` (as referenced elsewhere in the repo), this branch won't treat it as spec-decode, and `hf_config_override` won't be applied. Consider accepting both `Qwen3NextMTP` and `Qwen3NextMTPModel` here (and in `_ATOM_MODEL_CLASSES`) so both draft-arch spellings work.
```python
is_mtp_draft_model = self.model_arch in {
    "DeepSeekMTPModel",
    "Qwen3NextMTP",
}
```
```python
self.vllm_config = vllm_config
self.atom_config = generate_atom_config_for_plugin_mode(vllm_config)
self.is_mtp = False
speculative_config = getattr(vllm_config, "speculative_config", None)
if speculative_config is not None:
    spec_method = speculative_config.method
    self.is_mtp = spec_method == "mtp"

_prepare_env(atom_config=self.atom_config)
```

```python
main_model_arch = vllm_config.model_config.architectures[0]
model_arch = _select_model_arch(vllm_config)
self.is_mtp_draft_model = self.is_mtp and model_arch != main_model_arch
if self.is_mtp_draft_model:
    self.atom_config = get_current_atom_config()
else:
    self.atom_config = generate_atom_config_for_plugin_mode(vllm_config)
```
| "GlmMoeDsaForCausalLM": ATOM_MOE_CAUSAL_LM_MODEL_WRAPPER, | ||
| "DeepSeekMTPModel": ATOM_MOE_CAUSAL_LM_MODEL_WRAPPER, | ||
| "Qwen3NextForCausalLM": "atom.models.qwen3_next:Qwen3NextForCausalLMVllm", | ||
| "Qwen3NextMTP": ATOM_MOE_CAUSAL_LM_MODEL_WRAPPER, |
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
`recipes/atom_vllm/Qwen3.5.md:137`
- The "Key Environment Variables" list no longer includes `ATOM_DISABLE_VLLM_PLUGIN_ATTENTION=1`, but the earlier text still refers to three required variables. Please ensure this section stays consistent with the intended required/optional env var set for Qwen3.5.
## Key Environment Variables
- `ATOM_USE_CUSTOM_ALL_GATHER=0`: **Required** - disables custom all-gather for compatibility with Qwen3.5 model architecture
- `AITER_QUICK_REDUCE_QUANTIZATION=INT4`: **Performance optimization** - enables INT4 quantization for quick reduce operations
- **Benefit**: Significantly improves TTFT (Time To First Token) performance by reducing communication overhead during tensor parallelism all-reduce operations
```python
self._expose_spec_decode_attrs()

if model_arch in _MTP_MASK_INPUT_ARCH:
    self._adapt_mtp_layers_for_vllm()
```
**Important**: The following three environment variables are required for Qwen3.5:

- `ATOM_DISABLE_VLLM_PLUGIN_ATTENTION=1`: Disables the ATOM attention plugin to use vLLM's implementation for full attention layers (required because Qwen3.5 uses a hybrid architecture with both linear attention (GatedDeltaNet) and full attention layers)
- `ATOM_USE_CUSTOM_ALL_GATHER=0`: Disables custom all-gather for compatibility with the Qwen3.5 model architecture
- `AITER_QUICK_REDUCE_QUANTIZATION=INT4`: **Performance optimization** - enables INT4 quantization for quick reduce operations, which can significantly improve TTFT (Time To First Token) performance. **Note**: This optimization may introduce a risk of accuracy degradation. For accuracy-critical workloads, consider validating with your specific use case.
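For anyone driving vLLM from Python rather than a launch script, these variables must be set before engine construction; a minimal sketch (only the variable names and values come from the recipe above):

```python
import os

# Set before constructing the vLLM engine; values from the recipe.
os.environ["ATOM_USE_CUSTOM_ALL_GATHER"] = "0"
# Optional performance knob; the recipe notes a possible accuracy risk.
os.environ["AITER_QUICK_REDUCE_QUANTIZATION"] = "INT4"
```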
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
`atom/plugin/vllm/model_wrapper.py:432`
- `atom.config.SpeculativeConfig` does not expose `draft_model_config`, so `draft_model_config = getattr(self.atom_config.speculative_config, "draft_model_config", None)` will always be `None` and `hf_config_override` will not be applied for MTP draft-model weight loading. This can cause the draft model to load with the target model's HF config. Use `self.atom_config.speculative_config.draft_model_hf_config` (or fall back to `self.vllm_config.speculative_config.draft_model_config.hf_config`) when building `draft_hf_config`.
```python
is_mtp_draft_model = self.model_arch in {
    "DeepSeekMTPModel",
    "Qwen3NextMTP",
}
draft_hf_config = None
if is_mtp_draft_model:
    draft_model_config = getattr(
        getattr(self.atom_config, "speculative_config", None),
        "draft_model_config",
        None,
    )
    if draft_model_config is not None:
        draft_hf_config = getattr(
            draft_model_config, "hf_config", draft_model_config
        )
```
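A sketch of the correction the comment proposes, hedged because the merged diff is not shown here: read ATOM's `draft_model_hf_config` first, then fall back to vLLM's draft model config:

```python
# Suggested sketch, not the merged fix.
draft_hf_config = None
if is_mtp_draft_model:
    atom_spec = getattr(self.atom_config, "speculative_config", None)
    draft_hf_config = getattr(atom_spec, "draft_model_hf_config", None)
    if draft_hf_config is None:
        vllm_spec = getattr(self.vllm_config, "speculative_config", None)
        draft_model_config = getattr(vllm_spec, "draft_model_config", None)
        draft_hf_config = getattr(draft_model_config, "hf_config", None)
```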
```python
# Mirror nested attributes required by vLLM speculative decoding.
self._expose_spec_decode_attrs()
```
```python
def _build_atom_speculative_config_from_vllm(vllm_spec_config: Any):
    """Translate vLLM's SpeculativeConfig into ATOM's SpeculativeConfig.

    Reuses vLLM's already-loaded draft hf_config (skips a second disk fetch
    in ATOM SpeculativeConfig.__post_init__) but still runs ATOM's
    hf_config_override on it — so MTP model_type remap, n_routed_experts
    backfill (Qwen families), and architecture rewrite all land on the
    draft config in one place. Mirrors how standalone ATOM MTP exposes
    the draft hf_config via atom_config.speculative_config.

    The draft hf_config is deepcopied first because hf_config_override
    mutates `architectures` to ATOM's standalone naming (e.g.
    "Qwen3NextMTPModel"), which differs from vLLM's registry name
    ("Qwen3NextMTP"). Mutating in place would make vLLM's later draft
    architecture lookup fail.
    """
    if vllm_spec_config is None:
        return None

    from atom.config import SpeculativeConfig

    draft_model_config = getattr(vllm_spec_config, "draft_model_config", None)
    draft_hf_config = getattr(draft_model_config, "hf_config", None)
    if draft_hf_config is not None:
        draft_hf_config = copy.deepcopy(draft_hf_config)
    model_path = getattr(draft_model_config, "model", None) or getattr(
        vllm_spec_config, "model", None
    )

    return SpeculativeConfig(
        method=getattr(vllm_spec_config, "method", "") or "",
        model=model_path,
        num_speculative_tokens=getattr(
            vllm_spec_config, "num_speculative_tokens", None
        ),
        draft_model_hf_config=draft_hf_config,
    )
```
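How this helper would plausibly be wired in, sketched under the assumption that it runs during wrapper construction (the call site is not part of this excerpt):

```python
# Hypothetical call site; only the helper itself appears in the diff.
atom_spec = _build_atom_speculative_config_from_vllm(
    getattr(vllm_config, "speculative_config", None)
)
if atom_spec is not None:
    self.atom_config.speculative_config = atom_spec
```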
Motivation
server script:
verify script
result
Technical Details
Test Plan
Test Result
Submission Checklist