[EXPERIMENT][WIP][OpenVINO] Add NemotronH (Nemotron-3 Mamba2) support #1844
Closed
mlukasze wants to merge 1 commit into
Adds support for nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 (NemotronH) hybrid Mamba2 + Attention + MoE model export to OpenVINO IR.

Changes:
- _ov_ops.py: Add convert_recurrent_mamba2_cell() for Mamba2 SSM recurrence (arXiv:2405.21060) via ov::Loop with ModuleExtension
- input_generators.py: Add NemotronHDummyPastKeyValuesGenerator for conv_state + recurrent_state (mamba layers) + KV cache (attn layers)
- model_configs.py: Add NemotronHOpenVINOConfig registered for 'nemotron_h'
- model_patcher.py: Add NemotronHModelPatcher with Mamba2RecurrentCell, nemotron_h_mamba_mixer_forward, patched_nemotron_h_moe_forward
- utils.py: Add 'nemotron_h' to SSM_MODELS list
- modeling_decoder.py: Add NemotronH cache init (conv_state shape uses mamba_num_heads * mamba_head_dim, not mamba_expand * hidden_size)
- tests: Add nemotron_h to SUPPORTED_SSM_ARCHITECTURES

The exported IR uses ov::Loop for the Mamba2 recurrence. The FuseMamba2Loop pass in OV (PR #36412) will fuse the loop into ov::op::internal::Mamba2 at inference time (after that PR merges).

Resolves: omega issue huggingface#44
References: openvinotoolkit/openvino#36412, openvinotoolkit/openvino#36372
Based on: PR huggingface#1789 by openvino-agent
AI Assistance: yes - OMEGA agent (REQUIRES HUMAN REVIEW)
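
For readers unfamiliar with hybrid caches, the ordering of the dummy inputs (mamba-layer states first, then attention KV pairs) can be sketched as below. This is only an illustration, not the code of `NemotronHDummyPastKeyValuesGenerator`: the function name, the tensor names, and the example sizes are hypothetical, and the sketch assumes the 4-D shape `[B, mamba_num_heads, mamba_head_dim, ssm_state_size]` quoted in the PR description belongs to the recurrent state. Shapes the description does not state are left as `None`.

```python
from typing import List, Optional, Tuple

Shape = Tuple[int, ...]


def dummy_cache_layout(
    batch: int,
    num_mamba: int,
    num_attn: int,
    mamba_num_heads: int,
    mamba_head_dim: int,
    ssm_state_size: int,
) -> List[Tuple[str, Optional[Shape]]]:
    """Return (name, shape) pairs in the order
    [conv_state, recurrent_state] x num_mamba + [key, value] x num_attn.

    Names are hypothetical; None marks shapes not given in the PR text.
    """
    # Assumption: this 4-D shape is the recurrent (SSM) state of one mamba layer.
    recurrent_shape = (batch, mamba_num_heads, mamba_head_dim, ssm_state_size)
    layout: List[Tuple[str, Optional[Shape]]] = []
    for i in range(num_mamba):
        layout.append((f"conv_state.{i}", None))
        layout.append((f"recurrent_state.{i}", recurrent_shape))
    for i in range(num_attn):
        layout.append((f"key.{i}", None))
        layout.append((f"value.{i}", None))
    return layout


# Example with made-up sizes: 3 mamba layers and 1 attention layer.
for name, shape in dummy_cache_layout(2, 3, 1, 4, 8, 16):
    print(name, shape)
```

Keeping the two state families in separate contiguous runs mirrors the layout stated in the description; how the real generator names or interleaves the tensors may differ.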
What does this PR do?
Adds OpenVINO export support for NemotronH (`nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16`), a hybrid Mamba2 + Attention + MoE model (Mamba2 SSM: arXiv:2405.21060).

This PR is based on the work in PR #1789 by @openvino-agent.
Export command
optimum-cli export openvino -m nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
    --trust-remote-code NVIDIA-Nemotron-3-Nano-4B-BF16-ov

What's added
`_ov_ops.py`
- New `convert_recurrent_mamba2_cell()` conversion rule for `Mamba2RecurrentCellOp`:
  - Uses `ov::Loop`
  - `state_t = state_{t-1} * dA_t + dBx_t`, `y_t = sum(state_t * C_t, N)`
  - The loop is fused into `ov::op::internal::Mamba2` at inference time

`input_generators.py`
- New `NemotronHDummyPastKeyValuesGenerator`:
  - `[conv_state, recurrent_state] × num_mamba + [key, value] × num_attn`
  - `[B, mamba_num_heads, mamba_head_dim, ssm_state_size]`

`model_configs.py`
- New `NemotronHOpenVINOConfig` registered for `model_type='nemotron_h'`

`model_patcher.py`
- New `NemotronHModelPatcher` with:
  - `Mamba2RecurrentCell` torch module (ModuleExtension target)
  - `nemotron_h_mamba_mixer_forward()` – recurrent form of the Mamba2 mixer
  - `patched_nemotron_h_moe_forward()` – vectorized MoE block for traceable export
  - `NemotronHCacheWrap` – lightweight cache wrapper for hybrid state management

`utils.py`
- Added `"nemotron_h"` to `SSM_MODELS` list

`modeling_decoder.py`
- Added NemotronH branch in `OVModelWithMambaForCausalLM.__init__()` for correct cache shape derivation (uses `mamba_num_heads * mamba_head_dim` for conv/recurrent state shapes)

Architecture note
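
To make the recurrence quoted for `convert_recurrent_mamba2_cell()` concrete, here is a minimal pure-Python sketch of `state_t = state_{t-1} * dA_t + dBx_t` and `y_t = sum(state_t * C_t, N)` for a single head. It is a reference for the arithmetic only, not the PR's conversion rule or the `ov::Loop` body: the function name is hypothetical, and it assumes `dA_t` is one scalar per step, the state is a `[P][N]` matrix (head dim by state size), and `C_t` is a length-`N` vector.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def mamba2_recurrence(
    state: Matrix,
    dA: Sequence[float],
    dBx: Sequence[Matrix],
    C: Sequence[Sequence[float]],
) -> Tuple[List[List[float]], Matrix]:
    """Run the recurrence over T steps for one head.

    state: initial [P][N] state; dA: T scalars (assumed scalar decay);
    dBx: T matrices of shape [P][N]; C: T vectors of length N.
    Returns (outputs of shape [T][P], final state).
    """
    state = [row[:] for row in state]  # copy so the caller's state is untouched
    outputs: List[List[float]] = []
    for dA_t, dBx_t, C_t in zip(dA, dBx, C):
        # state_t = state_{t-1} * dA_t + dBx_t
        state = [
            [s * dA_t + b for s, b in zip(s_row, b_row)]
            for s_row, b_row in zip(state, dBx_t)
        ]
        # y_t = sum(state_t * C_t) over the state dimension N
        outputs.append([sum(s * c for s, c in zip(row, C_t)) for row in state])
    return outputs, state


# Two steps, P = 1, N = 2.
ys, final = mamba2_recurrence(
    state=[[0.0, 0.0]],
    dA=[0.5, 0.5],
    dBx=[[[1.0, 2.0]], [[1.0, 0.0]]],
    C=[[1.0, 1.0], [1.0, 1.0]],
)
print(ys)     # [[3.0], [2.5]]
print(final)  # [[1.5, 1.0]]
```

Because each step depends on the previous state, the computation is inherently sequential, which is why a loop construct is a natural fit for the exported graph.
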
The exported IR uses `ov::Loop` for the Mamba2 recurrence. The `FuseMamba2Loop` transformation in OpenVINO core (PR #36412) will fuse the loop into `ov::op::internal::Mamba2` at inference time, enabling optimized CPU/GPU execution.

Known limitations
- `fast_draft_strategy.cpp`)
- `Mamba2` op to be merged

Test model
The test model `optimum-intel-internal-testing/tiny-random-nemotron-h` needs to be created and pushed by an Intel team member with a CUDA-capable machine (NemotronH's Mamba2 mixer requires CUDA for init).

Related PRs
- `Mamba2` Core op + `FuseMamba2Loop` transformation
- `Mamba2`

Before submitting