[EXPERIMENT][WIP][OpenVINO] Add NemotronH (Nemotron-3 Mamba2) support#1844

Closed
mlukasze wants to merge 1 commit into huggingface:main from mlukasze:feat/mamba2-stateful-paged

mlukasze commented Jul 1, 2026

Contributor

⚠️ AUTOMATICALLY GENERATED BY OMEGA AGENT — REQUIRES HUMAN REVIEW ⚠️
This PR was created by an AI agent as part of automated model enablement.
A human maintainer must review and approve it before it can be considered for merge.
Do NOT merge without human review and sign-off.


What does this PR do?

Adds OpenVINO export support for NemotronH (nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16), a hybrid Mamba2 + Attention + MoE model (Mamba2 SSM: arXiv:2405.21060).

This PR is based on the work in PR #1789 by @openvino-agent.

Export command

optimum-cli export openvino -m nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
  --trust-remote-code NVIDIA-Nemotron-3-Nano-4B-BF16-ov

What's added

_ov_ops.py

New convert_recurrent_mamba2_cell() conversion rule for Mamba2RecurrentCellOp:

  • Converts the Mamba2 single-step SSM recurrence to ov::Loop
  • Implements: state_t = state_{t-1} * dA_t + dBx_t, y_t = sum(state_t * C_t, N)
  • The FuseMamba2Loop pass in OV (#36412) will fuse this loop into ov::op::internal::Mamba2 at inference time
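
For reviewers, the recurrence above can be sketched in plain Python for a single head. This is an illustrative reference only, not the code in convert_recurrent_mamba2_cell(). The function names are hypothetical, dA_t is assumed to be a scalar decay per step, state and dBx_t are assumed to be P x N nested lists, and N is assumed to be the state dimension that the sum reduces over.

```python
def mamba2_recurrent_step(state, dA, dBx, C):
    """One recurrence step for a single head (hypothetical helper).

    state, dBx: P x N nested lists; dA: scalar decay (assumption); C: length-N list.
    Returns (new_state, y) where y has length P.
    """
    # state_t = state_{t-1} * dA_t + dBx_t
    new_state = [
        [s * dA + u for s, u in zip(s_row, u_row)]
        for s_row, u_row in zip(state, dBx)
    ]
    # y_t = sum(state_t * C_t, N): reduce over the last (state) dimension
    y = [sum(s * c for s, c in zip(row, C)) for row in new_state]
    return new_state, y


def mamba2_scan(state, dA_seq, dBx_seq, C_seq):
    """Apply the step over a sequence, mirroring what a loop body would do."""
    ys = []
    for dA, dBx, C in zip(dA_seq, dBx_seq, C_seq):
        state, y = mamba2_recurrent_step(state, dA, dBx, C)
        ys.append(y)
    return state, ys


# Tiny example with P = 2, N = 2 and two time steps.
final_state, outputs = mamba2_scan(
    [[0.0, 0.0], [0.0, 0.0]],
    dA_seq=[0.5, 0.5],
    dBx_seq=[[[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]]],
    C_seq=[[1.0, 1.0], [1.0, 0.0]],
)
```

A reference like this could be used to compare per-step outputs against the exported loop on small random inputs.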

input_generators.py

New NemotronHDummyPastKeyValuesGenerator:

  • Generates a flat list of cache tensors: [conv_state, recurrent_state] × num_mamba + [key, value] × num_attn
  • Recurrent state shape: [B, mamba_num_heads, mamba_head_dim, ssm_state_size]
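
The ordering described above can be sketched as follows. This is a hypothetical helper, not the generator's actual code: the tensor names and the function signature are invented for illustration, and the conv_state and key/value shapes are left as caller-supplied parameters because the PR description does not state them. Only the recurrent state shape is taken from the description. The sketch also assumes a literal reading of the layout, with all Mamba layer pairs first and all attention layer pairs after them.

```python
def nemotron_h_cache_layout(
    batch,
    num_mamba,
    num_attn,
    mamba_num_heads,
    mamba_head_dim,
    ssm_state_size,
    conv_shape,  # assumption: supplied by the caller, not given in the PR text
    kv_shape,    # assumption: supplied by the caller, not given in the PR text
):
    """Return a flat list of (name, shape) pairs in the described order."""
    layout = []
    # [conv_state, recurrent_state] for each Mamba layer
    for i in range(num_mamba):
        layout.append((f"conv_state.{i}", tuple(conv_shape)))
        layout.append(
            (
                f"recurrent_state.{i}",
                (batch, mamba_num_heads, mamba_head_dim, ssm_state_size),
            )
        )
    # [key, value] for each attention layer
    for j in range(num_attn):
        layout.append((f"key.{j}", tuple(kv_shape)))
        layout.append((f"value.{j}", tuple(kv_shape)))
    return layout


# Illustrative values only; they do not come from the real model config.
layout = nemotron_h_cache_layout(
    batch=2, num_mamba=2, num_attn=1,
    mamba_num_heads=4, mamba_head_dim=8, ssm_state_size=16,
    conv_shape=(2, 32, 4), kv_shape=(2, 2, 0, 8),
)
```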

model_configs.py

New NemotronHOpenVINOConfig registered for model_type='nemotron_h'

model_patcher.py

New NemotronHModelPatcher with:

  • Mamba2RecurrentCell torch module (ModuleExtension target)
  • nemotron_h_mamba_mixer_forward() – recurrent form of the Mamba2 mixer
  • patched_nemotron_h_moe_forward() – vectorized MoE block for traceable export
  • NemotronHCacheWrap – lightweight cache wrapper for hybrid state management

utils.py

Added "nemotron_h" to the SSM_MODELS list

modeling_decoder.py

Added NemotronH branch in OVModelWithMambaForCausalLM.__init__() for correct cache shape derivation (uses mamba_num_heads * mamba_head_dim for conv/recurrent state shapes)
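
To illustrate why a dedicated branch matters, the sketch below contrasts the head-based derivation with the mamba_expand * hidden_size derivation that the commit message says is not used for NemotronH. The helper names are hypothetical and the config values are invented so that the two results differ; they are not taken from the real model config.

```python
from types import SimpleNamespace


def head_based_inner_dim(config):
    # Derivation used for NemotronH according to the PR description.
    return config.mamba_num_heads * config.mamba_head_dim


def expand_based_inner_dim(config):
    # Derivation the commit message says is not used for NemotronH.
    return config.mamba_expand * config.hidden_size


# Hypothetical values, chosen only so that the two derivations disagree.
cfg = SimpleNamespace(
    hidden_size=64,
    mamba_expand=2,
    mamba_num_heads=12,
    mamba_head_dim=8,
)

head_based = head_based_inner_dim(cfg)
expand_based = expand_based_inner_dim(cfg)
```

With a config like this one, the two derivations give different sizes, so cache tensors allocated from the expand-based value would not match the shapes the exported model expects.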

Architecture note

The exported IR uses ov::Loop for the Mamba2 recurrence. The FuseMamba2Loop transformation in OpenVINO core (PR #36412) will fuse the loop into ov::op::internal::Mamba2 at inference time, enabling optimized CPU/GPU execution.

Known limitations

  • Stateful speculative decoding is blocked for NemotronH (hard assertion in GenAI fast_draft_strategy.cpp)
  • Beam search is disabled for hybrid models (CVS-177964)
  • GPU support requires the OV Mamba2 op to be merged

Test model

The test model optimum-intel-internal-testing/tiny-random-nemotron-h needs to be created and pushed by an Intel team member with a CUDA-capable machine (NemotronH's Mamba2 mixer requires CUDA for init).

Related PRs

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Adds OpenVINO IR export support for nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16
(NemotronH), a hybrid Mamba2 + Attention + MoE model.

Changes:
- _ov_ops.py: Add convert_recurrent_mamba2_cell() for Mamba2 SSM
  recurrence (arXiv:2405.21060) via ov::Loop with ModuleExtension
- input_generators.py: Add NemotronHDummyPastKeyValuesGenerator
  for conv_state + recurrent_state (mamba layers) + KV cache (attn layers)
- model_configs.py: Add NemotronHOpenVINOConfig registered for 'nemotron_h'
- model_patcher.py: Add NemotronHModelPatcher with Mamba2RecurrentCell,
  nemotron_h_mamba_mixer_forward, patched_nemotron_h_moe_forward
- utils.py: Add 'nemotron_h' to SSM_MODELS list
- modeling_decoder.py: Add NemotronH cache init (conv_state shape uses
  mamba_num_heads * mamba_head_dim, not mamba_expand * hidden_size)
- tests: Add nemotron_h to SUPPORTED_SSM_ARCHITECTURES

The exported IR uses ov::Loop for the Mamba2 recurrence.
The FuseMamba2Loop pass in OV (PR #36412) will fuse the loop
into ov::op::internal::Mamba2 at inference time (after that PR merges).

Resolves: omega issue huggingface#44
References: openvinotoolkit/openvino#36412, openvinotoolkit/openvino#36372
Based on: PR huggingface#1789 by openvino-agent

AI Assistance: yes - OMEGA agent (REQUIRES HUMAN REVIEW)
rkazants commented Jul 1, 2026

Collaborator

#1789

rkazants closed this Jul 1, 2026