[EXPERIMENT][WIP][OpenVINO] Add NemotronH (Nemotron-3 Mamba2) support by mlukasze · Pull Request #1844 · huggingface/optimum-intel

mlukasze · 2026-07-01T07:52:57Z

⚠️ AUTOMATICALLY GENERATED BY OMEGA AGENT — REQUIRES HUMAN REVIEW ⚠️
This PR was created by an AI agent as part of automated model enablement.
A human maintainer must review and approve it before it can be considered for merge.
Do NOT merge without human review and sign-off.

What does this PR do?

Adds OpenVINO export support for NemotronH (nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16), a hybrid Mamba2 + Attention + MoE model (Mamba2 SSM: arXiv:2405.21060).

This PR is based on the work in PR #1789 by @openvino-agent.

Export command

optimum-cli export openvino -m nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
  --trust-remote-code NVIDIA-Nemotron-3-Nano-4B-BF16-ov

What's added

`_ov_ops.py`

New convert_recurrent_mamba2_cell() conversion rule for Mamba2RecurrentCellOp:

Converts the Mamba2 single-step SSM recurrence to ov::Loop
Implements: state_t = state_{t-1} * dA_t + dBx_t, y_t = sum(state_t * C_t, N)
The FuseMamba2Loop pass in OV (#36412) will fuse this loop into ov::op::internal::Mamba2 at inference time
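The recurrence above can be sketched in NumPy as a reference for what the exported loop computes. This is only an illustration under assumed shapes, not the code from `_ov_ops.py`: the function names, the time-major layout, and the broadcasting shapes of `dA`, `dBx`, and `C` are assumptions made for the example.

```python
import numpy as np


def mamba2_recurrent_step(state, dA_t, dBx_t, C_t):
    """One step of state_t = state_{t-1} * dA_t + dBx_t, y_t = sum(state_t * C_t, N).

    Assumed shapes (hypothetical, chosen for broadcasting):
      state, dBx_t: [B, H, D, N]
      dA_t:         [B, H, 1, 1]
      C_t:          [B, 1, 1, N]
    """
    new_state = state * dA_t + dBx_t        # [B, H, D, N]
    y_t = (new_state * C_t).sum(axis=-1)    # reduce over N -> [B, H, D]
    return new_state, y_t


def mamba2_recurrent_scan(state, dA, dBx, C):
    """Apply the single-step recurrence along a leading time axis."""
    outputs = []
    for t in range(dA.shape[0]):
        state, y_t = mamba2_recurrent_step(state, dA[t], dBx[t], C[t])
        outputs.append(y_t)
    return state, np.stack(outputs, axis=0)  # final state, [T, B, H, D]
```

The Python `for` loop plays the role that ov::Loop plays in the exported IR: the state is carried from one iteration to the next, and one output slice is produced per step.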

`input_generators.py`

New NemotronHDummyPastKeyValuesGenerator:

Generates flat list of cache tensors: [conv_state, recurrent_state] × num_mamba + [key, value] × num_attn
Recurrent state shape: [B, mamba_num_heads, mamba_head_dim, ssm_state_size]
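The flat cache layout can be illustrated with a small sketch. Only the ordering and the recurrent state shape come from the description above. The function name, the conv state shape `[B, mamba_num_heads * mamba_head_dim, conv_kernel]`, the KV shape `[B, num_kv_heads, past_len, attn_head_dim]`, and the zero fill are assumptions for illustration and may differ from NemotronHDummyPastKeyValuesGenerator.

```python
import numpy as np


def dummy_hybrid_cache(batch, num_mamba, num_attn, mamba_num_heads, mamba_head_dim,
                       ssm_state_size, conv_kernel, num_kv_heads, attn_head_dim,
                       past_len):
    """Build [conv_state, recurrent_state] x num_mamba + [key, value] x num_attn."""
    # Assumption: the conv state channel count is mamba_num_heads * mamba_head_dim.
    conv_shape = (batch, mamba_num_heads * mamba_head_dim, conv_kernel)
    # Recurrent state shape as stated in the PR description.
    recurrent_shape = (batch, mamba_num_heads, mamba_head_dim, ssm_state_size)
    # Assumption: a conventional KV-cache layout for the attention layers.
    kv_shape = (batch, num_kv_heads, past_len, attn_head_dim)

    cache = []
    for _ in range(num_mamba):
        cache.append(np.zeros(conv_shape, dtype=np.float32))
        cache.append(np.zeros(recurrent_shape, dtype=np.float32))
    for _ in range(num_attn):
        cache.append(np.zeros(kv_shape, dtype=np.float32))  # key
        cache.append(np.zeros(kv_shape, dtype=np.float32))  # value
    return cache
```

With this layout, all Mamba layer states come first in the list and the attention key/value pairs follow, so the total length is `2 * num_mamba + 2 * num_attn`.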

`model_configs.py`

New NemotronHOpenVINOConfig registered for model_type='nemotron_h'

`model_patcher.py`

New NemotronHModelPatcher with:

Mamba2RecurrentCell torch module (ModuleExtension target)
nemotron_h_mamba_mixer_forward() – recurrent form of the Mamba2 mixer
patched_nemotron_h_moe_forward() – vectorized MoE block for traceable export
NemotronHCacheWrap – lightweight cache wrapper for hybrid state management
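To illustrate what a "vectorized MoE block for traceable export" can look like, here is a minimal NumPy sketch. It is not the body of patched_nemotron_h_moe_forward(): the softmax router, the ReLU activation, the weight layout, and every name below are assumptions. The idea it shows is general: run every expert on every token and combine the results with a routing weight matrix that is zero for unselected experts, so the computation has no per-token branching or data-dependent indexing.

```python
import numpy as np


def vectorized_moe_forward(x, router_w, w_in, w_out, top_k):
    """Dense top-k MoE sketch.

    x: [T, H], router_w: [H, E], w_in: [E, H, F], w_out: [E, F, H].
    """
    # Router probabilities (softmax is a placeholder choice).
    logits = x @ router_w                                         # [T, E]
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)

    # Keep the top_k experts per token via a threshold; zero out the rest.
    # Exact ties at the threshold would keep more than top_k experts.
    kth_largest = np.sort(probs, axis=-1)[:, -top_k][:, None]     # [T, 1]
    weights = np.where(probs >= kth_largest, probs, 0.0)
    weights = weights / weights.sum(axis=-1, keepdims=True)       # [T, E]

    # Run all experts on all tokens (ReLU is a placeholder activation).
    hidden = np.maximum(np.einsum("th,ehf->etf", x, w_in), 0.0)   # [E, T, F]
    expert_out = np.einsum("etf,efh->eth", hidden, w_out)         # [E, T, H]

    # Weighted sum over experts; unselected experts contribute zero.
    return np.einsum("te,eth->th", weights, expert_out)           # [T, H]
```

The trade-off in this form is extra compute, since every expert processes every token, in exchange for a static graph that a tracer can capture in a single pass.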

`utils.py`

Added "nemotron_h" to SSM_MODELS list

`modeling_decoder.py`

Added NemotronH branch in OVModelWithMambaForCausalLM.__init__() for correct cache shape derivation (uses mamba_num_heads * mamba_head_dim for conv/recurrent state shapes)
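The shape derivation can be sketched as follows. The config field names are the ones mentioned in this PR, but the helper function and the example values are hypothetical and are not taken from the real model config or from `modeling_decoder.py`.

```python
from types import SimpleNamespace


def mamba_inner_size(config):
    """Return the inner size used for conv/recurrent state shapes (sketch)."""
    if config.model_type == "nemotron_h":
        # NemotronH branch: derive the size from the head layout.
        return config.mamba_num_heads * config.mamba_head_dim
    # Assumed fallback for other SSM models: expansion of the hidden size.
    return config.mamba_expand * config.hidden_size


# Hypothetical values, chosen only to show that the two formulas can disagree.
example = SimpleNamespace(model_type="nemotron_h", hidden_size=1024, mamba_expand=2,
                          mamba_num_heads=48, mamba_head_dim=64)
nemotron_h_size = mamba_inner_size(example)
```

With these illustrative values the NemotronH branch yields 48 * 64 = 3072, whereas `mamba_expand * hidden_size` would give 2048, which is why a dedicated branch is needed for correct cache shapes.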

Architecture note

The exported IR uses ov::Loop for the Mamba2 recurrence. Once it is merged, the FuseMamba2Loop transformation in OpenVINO core (PR #36412) will fuse the loop into ov::op::internal::Mamba2 at inference time, enabling optimized CPU/GPU execution.

Known limitations

Stateful speculative decoding is blocked for NemotronH (hard assertion in GenAI fast_draft_strategy.cpp)
Beam search is disabled for hybrid models (CVS-177964)
GPU support requires the OpenVINO Mamba2 op (openvinotoolkit/openvino#36412) to be merged

Test model

The test model optimum-intel-internal-testing/tiny-random-nemotron-h needs to be created and pushed by an Intel team member with a CUDA-capable machine (NemotronH's Mamba2 mixer requires CUDA for init).

Related PRs

[POC] [DO NOT REVIEW] Mamba2 Fusion openvinotoolkit/openvino#36412 — Mamba2 Core op + FuseMamba2Loop transformation
[Spec] Add specification for Internal Mamba2 Operation openvinotoolkit/openvino#36372 — RST spec for Mamba2
openvinotoolkit/omega#44 — OMEGA tracking issue

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Adds support for nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 (NemotronH) hybrid Mamba2 + Attention + MoE model export to OpenVINO IR.

Changes:
- _ov_ops.py: Add convert_recurrent_mamba2_cell() for Mamba2 SSM recurrence (arXiv:2405.21060) via ov::Loop with ModuleExtension
- input_generators.py: Add NemotronHDummyPastKeyValuesGenerator for conv_state + recurrent_state (mamba layers) + KV cache (attn layers)
- model_configs.py: Add NemotronHOpenVINOConfig registered for 'nemotron_h'
- model_patcher.py: Add NemotronHModelPatcher with Mamba2RecurrentCell, nemotron_h_mamba_mixer_forward, patched_nemotron_h_moe_forward
- utils.py: Add 'nemotron_h' to SSM_MODELS list
- modeling_decoder.py: Add NemotronH cache init (conv_state shape uses mamba_num_heads * mamba_head_dim, not mamba_expand * hidden_size)
- tests: Add nemotron_h to SUPPORTED_SSM_ARCHITECTURES

The exported IR uses ov::Loop for the Mamba2 recurrence. The FuseMamba2Loop pass in OV (PR #36412) will fuse the loop into ov::op::internal::Mamba2 at inference time (after that PR merges).

Resolves: omega issue huggingface#44
References: openvinotoolkit/openvino#36412, openvinotoolkit/openvino#36372
Based on: PR huggingface#1789 by openvino-agent
AI Assistance: yes - OMEGA agent (REQUIRES HUMAN REVIEW)

rkazants · 2026-07-01T07:55:54Z

#1789

rkazants closed this Jul 1, 2026

mlukasze wants to merge 1 commit into huggingface:main from mlukasze:feat/mamba2-stateful-paged
