mtmd : add Nemotron 3 Nano Omni support (parakeet) #22520

Open
danbev wants to merge 34 commits into ggml-org:master from danbev:nemotron-3-omni-mtmd-audio

Conversation

@danbev danbev commented Apr 29, 2026

Member

Overview

This commit adds support for the subsampling and encoder parts of the Nemotron 3 Nano Omni model.

Additional information

The Parakeet subsampling and encoder code was taken from parakeet.cpp, which is currently a pull request against whisper.cpp. I've tried to copy the code as closely as possible to hopefully enable easy patching between these two projects later.

Refs: ggml-org/whisper.cpp#3735


For testing, a converted model can be found here and can be run using the following command:

llama-mtmd-cli -hf danbev/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16-mtmd-GGUF --no-warmup --audio jfk.wav -p "Transcribe this audio clip, only the transcription and nothing else."

This commit adds support for the subsampling and encoder parts of
the Nemotron 3 Nano Omni model.

The Parakeet subsampling and encoder code was taken from parakeet.cpp,
which is currently a pull request against whisper.cpp. I've tried to copy
the code as closely as possible to hopefully enable easy patching between
these two projects later.

Refs: ggml-org/whisper.cpp#3735

@ngxson ngxson left a comment

Collaborator

looks good, I'm leaving some early-review comments

Comment thread convert_hf_to_gguf.py Outdated
Comment thread tools/mtmd/mtmd-audio.cpp Outdated
@github-actions github-actions Bot added examples python python script changes labels Apr 29, 2026
This commit removes the generation of the relative positional tensor in
the model conversion script and instead computes it in the encoder
graph. This is only done for the window of positions required for the
current audio sample.
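
As a rough illustration of computing the relative positional table only for the window an audio sample needs, the sketch below builds sinusoidal embeddings for the 2T-1 relative offsets of a T-frame input. It is written in Python for brevity and is not the PR's ggml graph code. The function name rel_pos_window, the interleaved sin/cos layout, the 10000 base, and the descending offset order are all assumptions borrowed from the common Transformer-XL/Conformer formulation, not details confirmed by this PR.

```python
import math


def rel_pos_window(n_frames: int, d_model: int) -> list:
    """Sinusoidal embeddings for relative offsets (n_frames-1) .. -(n_frames-1).

    Hypothetical helper: the layout (sin at even indices, cos at odd indices)
    and the 10000 base are assumptions, not taken from the PR.
    """
    if n_frames < 1:
        raise ValueError("n_frames must be positive")
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")

    # One inverse frequency per sin/cos pair.
    inv_freq = [
        math.exp(-math.log(10000.0) * (2 * i) / d_model)
        for i in range(d_model // 2)
    ]

    rows = []
    # Only the 2*n_frames - 1 offsets this input can produce are generated,
    # instead of a table sized for the longest supported input.
    for pos in range(n_frames - 1, -n_frames, -1):
        row = [0.0] * d_model
        for i, freq in enumerate(inv_freq):
            row[2 * i] = math.sin(pos * freq)
            row[2 * i + 1] = math.cos(pos * freq)
        rows.append(row)
    return rows
```

Under these assumptions, a 4-frame input needs 7 rows, and the middle row corresponds to offset 0. Building the table per input means the converted model file no longer has to carry a precomputed tensor sized for the maximum length.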
Comment thread tools/mtmd/clip.h Outdated
Comment thread tools/mtmd/mtmd-audio.cpp Outdated
danbev added 2 commits April 30, 2026 14:50
This commit adds a function, clip_get_model, to get access to the
clip_model. It also removes the two functions
clip_get_mel_filter_tensor and
clip_get_window_tensor(const struct clip_ctx * ctx); callers can now use
clip_get_model to access the model tensors they need.

@ngxson ngxson left a comment

Collaborator

looking good so far

Comment thread tools/mtmd/mtmd-audio.cpp Outdated
Comment thread tools/mtmd/clip.cpp Outdated
@danbev danbev self-assigned this Jun 2, 2026
danbev added 5 commits June 4, 2026 12:50
This commit updates the parakeet code in mtmd to reflect the latest
updates to parakeet.cpp in whisper.cpp.

A follow-up commit will address the currently hardcoded dw_pad and see
if we can add n_conv_kernel as a model metadata field.

This commit updates the model conversion to read the conv_kernel_size
field from the sound_config section of the model's config.json file.
It then uses this field instead of the hardcoded values in parakeet.cpp.
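
A minimal sketch of the lookup described above, assuming config.json is plain JSON with a top-level sound_config object. Only the key names sound_config and conv_kernel_size come from the commit message. The function names, the error handling, and the padding formula (symmetric "same" padding for an odd kernel) are illustrative assumptions, not the conversion script's actual code.

```python
import json
from typing import Optional


def read_conv_kernel_size(config_text: str, default: Optional[int] = None) -> int:
    """Return sound_config.conv_kernel_size from a config.json string.

    Hypothetical helper: falls back to `default` when the field is absent
    and raises KeyError when there is no value to use at all.
    """
    config = json.loads(config_text)
    sound_cfg = config.get("sound_config", {})
    kernel = sound_cfg.get("conv_kernel_size", default)
    if kernel is None:
        raise KeyError("sound_config.conv_kernel_size is missing")
    return int(kernel)


def same_padding(kernel_size: int) -> int:
    """Symmetric padding that preserves length for an odd kernel.

    Assumption: the depthwise convolution uses stride 1 and an odd kernel,
    so the padding on each side is (kernel_size - 1) // 2.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("expected a positive odd kernel size")
    return (kernel_size - 1) // 2
```

For example, read_conv_kernel_size('{"sound_config": {"conv_kernel_size": 9}}') yields 9 and same_padding(9) yields 4. The value 9 is made up for the example and is not claimed to be the model's real kernel size.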
@danbev danbev marked this pull request as ready for review June 18, 2026 03:41
@danbev danbev requested review from a team and CISC as code owners June 18, 2026 03:41
Comment thread conversion/nemotron.py Outdated
Comment thread tools/mtmd/models/parakeet.cpp Outdated
Comment thread tools/mtmd/models/parakeet.cpp Outdated
Comment thread tools/mtmd/models/parakeet.cpp Outdated

Collaborator

side question: is it possible to use the 4th dim of the input as batch dim? (provided that all inputs are the same size - no padding or masking is needed)

that may allow batching support in the future, but it's optional

Member Author

I'm not sure but I'll take a look 👍

Collaborator

correction: I meant 3rd dim; input tensor shape is: [nx, ny, n_batch]

Member Author

I've looked into this and it should be possible with some changes. I'd be happy to make these changes in a follow-up PR when we add batch support, and I'll also try to update parakeet.cpp then to keep both as aligned as possible.
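
To make the [nx, ny, n_batch] layout discussed in this thread concrete, here is a small sketch of packing equal-sized 2D inputs into one contiguous buffer with the batch as the third (slowest-varying) dimension. It assumes ggml's usual convention that the first dimension is the fastest-varying one. The helper names pack_batch and flat_index are hypothetical, and this is not the change proposed for the follow-up PR.

```python
def pack_batch(inputs):
    """Pack equal-sized 2D inputs (ny rows of nx values) into one flat buffer.

    Returns the flat buffer and the shape (nx, ny, n_batch). Inputs of
    different sizes are rejected, since this layout has no padding or mask.
    """
    if not inputs:
        raise ValueError("need at least one input")
    ny = len(inputs[0])
    nx = len(inputs[0][0])
    flat = []
    for sample in inputs:
        if len(sample) != ny or any(len(row) != nx for row in sample):
            raise ValueError("all inputs must have the same size")
        for row in sample:
            flat.extend(row)
    return flat, (nx, ny, len(inputs))


def flat_index(x, y, b, shape):
    """Offset of element (x, y) of batch item b, with x varying fastest."""
    nx, ny, _ = shape
    return x + nx * (y + ny * b)
```

With two 3x2 inputs, the shape becomes (3, 2, 2) and each batch item occupies its own contiguous block of nx * ny values, which is why equal input sizes are the precondition mentioned above.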

Comment thread tools/mtmd/clip.cpp Outdated
CISC commented Jun 18, 2026

Member

@danbev Glad midsommar! (Happy Midsummer!) :)

Comment thread tools/mtmd/clip.cpp Outdated
danbev commented Jun 19, 2026

Member Author

Glad midsommar! :)

@CISC Tack! (Thanks!) 😃

