mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) #22101
ReinforcedKnowledge wants to merge 10 commits into ggml-org:master
Conversation
Force-pushed from f4c14e1 to 7b313dc
@ReinforcedKnowledge Thanks for tackling this support! I'd been slowly working through Granite 3.3 Speech support, but had stalled out badly. I'll pull this down and give it a shot on both the new 4.0-based model and the older 3.2 and 3.3 models.
🤦 Nope, I'm wrong here! The 3.x speech models used the conditional adapter while the 3.x vision models did not. It appears that this swapped for 4.0 (I didn't realize speech had dropped the conditional adapter for the HF release).
Confirmed that this is working nicely for 4.0 with the embedded multilingual sample from the repo:

python convert_hf_to_gguf.py ~/models/ibm-granite/granite-4.0-1b-speech/ --outtype bf16
python convert_hf_to_gguf.py ~/models/ibm-granite/granite-4.0-1b-speech/ --outtype bf16 --mmproj
./build-rel/bin/llama-mtmd-cli -m ~/models/ibm-granite/granite-4.0-1b-speech/granite-4.0-1B-speech-BF16.gguf --mmproj ~/models/ibm-granite/granite-4.0-1b-speech/mmproj-granite-4.0-1b-speech-BF16.gguf --audio ~/models/ibm-granite/granite-4.0-1b-speech/multilingual_sample.wav -p "can you transcribe the speech into a written format?" --jinja --temp 0
This is also working nicely for 3.3-2b! Note that for that model, you do need the adapter (though interestingly it does seem to transcribe the English without the adapter before apparently translating the French to English).

Convert:
python convert_hf_to_gguf.py ~/models/granite-speech-3.3-2b/ --outtype bf16
python convert_hf_to_gguf.py ~/models/granite-speech-3.3-2b/ --outtype bf16 --mmproj
python convert_lora_to_gguf.py ~/models/granite-speech-3.3-2b/ --outtype bf16

Run with adapter:
./build-rel/bin/llama-mtmd-cli -m ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16.gguf --mmproj ~/models/granite-speech-3.3-2b/mmproj-granite-speech-3.3-2b-BF16.gguf --lora ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16-LoRA.gguf --audio ~/models/ibm-granite/granite-4.0-1b-speech/multilingual_sample.wav -p "can you transcribe the speech into a written format?" --jinja --temp 0

Run without adapter:
./build-rel/bin/llama-mtmd-cli -m ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16.gguf --mmproj ~/models/granite-speech-3.3-2b/mmproj-granite-speech-3.3-2b-BF16.gguf --audio ~/models/ibm-granite/granite-4.0-1b-speech/multilingual_sample.wav -p "can you transcribe the speech into a written format?" --jinja --temp 0
@gabe-l-hart If I understand correctly, the model contains specific adapters for audio / vision input, and the adapter is only activated during prompt processing of the corresponding modality input, right? IIRC there was also a discussion about having built-in LoRA adapters (currently adapters are loaded as separate files, which is not very convenient in terms of UX). I don't remember exactly where that discussion was, but it may be worth revisiting.
gabe-l-hart
left a comment
Thank you SO much for putting this together! It's been on my TODO list for a very long time and just hasn't made it to the top.
I've got a number of nitpicky questions about things that should maybe be hparams instead of being hard-coded, as well as a few structural questions for @ngxson about any future plans for further model-specific modularity in the codebase. The only concrete change request (besides the naming conventions from @ngxson) is that you update the base GraniteModel in convert_hf_to_gguf.py rather than introducing a special text model for Granite Speech.
    } break;
case PROJECTOR_TYPE_GRANITE_SPEECH:
    {
        hparams.audio_chunk_len = 0;
@ngxson I've been curious about these hard-coded values. These seem like properties of the model instance and not the model architecture and thus something that would make sense as hparam values in the GGUF for the specific model. Is there something I'm missing that explicitly links the projector architecture to these specific values? I know that the upstream transformers models hard-code them, but I would imagine it might make sense to proactively put them in the GGUF so that if in the future the architecture is reused with different values, we don't need a code-change and/or reconverted GGUFs to support it. The fields are already there in the internal clip_hparams (the ones being set here), so I think it would just be a matter of defining the string constants for conversion and then adding these as the default values in the convert_hf_to_gguf.py stack.
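To illustrate (a hypothetical sketch, not this PR's code — the key string "clip.audio.chunk_len" and the config field name are made up for the example; GGUFWriter.add_uint32 is the existing gguf-py call), the conversion side could look something like this inside the mmproj model class:

```python
# Hypothetical fragment for convert_hf_to_gguf.py: write the currently
# hard-coded audio projector value into the GGUF so clip.cpp can read it
# instead of defaulting in code.
def set_gguf_parameters(self):
    super().set_gguf_parameters()
    # fall back to the value the C++ side uses today if the HF config omits it
    chunk_len = self.hparams.get("audio_chunk_len", 0)
    self.gguf_writer.add_uint32("clip.audio.chunk_len", chunk_len)
```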
    }
    set_input_f32("pos_emb", pos_emb);
} break;
case PROJECTOR_TYPE_GRANITE_SPEECH:
Question for @ngxson: Is there any plan to break up clip.cpp so that this kind of model-specific code can live in a <model-name>.cpp file? Right now, it looks like the arch-specific files are only for graph building, but it seems like it could go a lot further to encode this sort of logic as well (this is probably a much bigger conversation that bleeds into the model-modularity conversation in the core as well).
    mtmd_audio_cache cache;
};

struct mtmd_audio_preprocessor_granite_speech : mtmd_audio_preprocessor {
Similar question to @ngxson about the modularity plans. This also seems ripe for isolation.
Right, that's the goal of these modular models. I was clearly a bit confused thinking that 4.0 speech had kept the adapter separate like 3.3 did; I know that 4.0 vision did keep them separate. The ultimate goal is a single running model with modality-specific adapters that toggle on/off automatically, allowing one model to serve all modalities without sacrificing text quality for text-only use. Now that we've got this working for the 3.3 model, I'll use that as a testbed for my modality-conditional-adapter branch.
That would be #13693
Right! Thanks for the reminder. I'll look back over that and make sure I haven't duplicated anything.
gabe-l-hart
left a comment
The code looks clean now! I like the removal of magic numbers. I did find that we need to double-register the model class in convert_hf_to_gguf.py to support lora adapter conversion for 3.3.
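For context, "double-register" here just means something along these lines (an illustrative fragment of convert_hf_to_gguf.py, not the actual diff; the second architecture string is the HF wrapper class name and is shown only as an example):

```python
# Hypothetical fragment: register the existing Granite text-model class under
# the speech wrapper's architecture name as well, so convert_lora_to_gguf.py
# can resolve adapters saved against that config.
@ModelBase.register("GraniteForCausalLM", "GraniteSpeechForConditionalGeneration")
class GraniteModel(LlamaModel):
    model_arch = gguf.MODEL_ARCH.GRANITE
```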
// audio
int32_t n_mel_bins = 0; // whisper preprocessor
int32_t proj_stack_factor = 0; // ultravox
int32_t audio_chunk_size = 0;
I definitely like having all of these as clip_hparams, but I'm curious to hear from @ngxson on whether it's better to go this route vs hard-coding until we have n >= 2 values.
@ngxson @CISC Can one of you release CI on this? I think all of the PR comments have been addressed. @ReinforcedKnowledge, the merge conflicts should be pretty easy to work through since they're just enum/name list conflicts.
Force-pushed from c09a6d9 to b4cc586
@gabe-l-hart yes totally, they were easy to go through. I also fixed the
gabe-l-hart
left a comment
One small vertical alignment NIT, but otherwise I think this looks great. I think the failing CI is not related to these changes.
I've re-verified that the 3.3 version with the LoRA converts and runs as expected.
Conformer encoder with Shaw relative position encoding, QFormer projector, log-mel spectrogram with frame stacking. Encoder uses GLU gating, folded batch norm, and SSM depthwise conv. QFormer compresses encoder output via windowed cross-attention (window=15, queries=3) into the LLM embedding space. Audio preprocessing: reflect-padded STFT, 80-bin mel filterbank, dynamic range compression, 2x frame stacking (80->160 mel). GGUF converter handles batch norm folding at export time, fused K/V split, and Conv1d weight reshaping. Tested against HF transformers reference: token-for-token match on 30s/60s audio clips with greedy decoding.
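For reference, the batch-norm folding mentioned above amounts to absorbing the normalization statistics into the preceding layer's weights and bias at export time. A self-contained numpy sketch of the arithmetic (a hypothetical helper, not this PR's converter code):

```python
import numpy as np

# Fold y = gamma * (x @ W.T + b - mean) / sqrt(var + eps) + beta
# into a single affine layer y = x @ W_folded.T + b_folded.
def fold_batch_norm(W, b, gamma, beta, mean, var, eps=1e-5):
    scale = gamma / np.sqrt(var + eps)       # per-output-channel scale
    W_folded = W * scale[:, None]            # scale each output channel's weights
    b_folded = (b - mean) * scale + beta     # absorb the shift into the bias
    return W_folded, b_folded
```

The same per-channel scale and shift applies when the normalized layer is a convolution, with the scale broadcast over the kernel dimension, which is why the folding can happen entirely at conversion time.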
Force-pushed from f046158 to 3051baa
Overview
Adds support for ibm-granite/granite-4.0-1b-speech.
Model-specific graph code lives in granite-speech.cpp. Tested with greedy decoding on 30s/60s/120s/180s/360s clips; token-for-token match against HF transformers (following the script on the model card) for 30s and 60s. Running the longer clips through HF was too heavy for me, but at 120s/180s there is noticeable degradation and at 360s it loops completely.
Test command:
ffmpeg -i input.wav -t 30 -ar 16000 -ac 1 test.wav
python convert_hf_to_gguf.py models/granite-4.0-1b-speech --outtype f16
python convert_hf_to_gguf.py models/granite-4.0-1b-speech --outtype f16 --mmproj
./build/bin/llama-mtmd-cli -m models/granite-4.0-1b-speech/granite-4.0-1B-speech-F16.gguf --mmproj models/granite-4.0-1b-speech/mmproj-granite-4.0-1b-speech-F16.gguf --audio test.wav -p "can you transcribe the speech into a written format?" --jinja --temp 0 -c 4096

Also tested the UI:
Uploading an audio file and using the prompt above produces the same transcription as the CLI.
Notes:
--jinja is required, and the prompt "can you transcribe the speech into a written format?" is taken from the model card.

Requirements
EDIT: Added the comment on testing the chat UI.