
Qualcomm AI Engine Direct - [Multimodal] granite-speech-3.3-2b#18740

Open
DannyYuyang-quic wants to merge 2 commits into pytorch:main from
CodeLinaro:dev1/danny/support_audio-language_models

Conversation

@DannyYuyang-quic
Contributor

Summary

  • Support granite-speech-3.3-2b
  • Extend Audio modality in QNNMultimodal AOT flow
  • Extend Audio modality in QNNMultimodal runner
  • Support encoder model sharding

Test plan

CI

python -m backends.qualcomm.tests.test_qnn_delegate TestExampleMultimodalityScript.test_static_asr --model_name granite_speech_3_3-2b build-android --executorch_root . -a . -m SM8750 -s ${SERIAL_NUM}

Script

python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m SM8750 --decoder_model granite_speech_3_3-2b --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "can you transcribe the speech into a written format?" --audio_path "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true"

Audio file: https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true
Prompt: "can you transcribe the speech into a written format?"
Result

I 00:00:16.333997 executorch:multimodal_runner.cpp:542] RSS after finishing text generation: 614.941406 MiB (0 if unsupported)
I 00:00:16.334231 executorch:stats.h:161] 	Prompt Tokens: 212    Generated Tokens: 201
I 00:00:16.334356 executorch:stats.h:167] 	Model Load Time:		1.460000 (seconds)
I 00:00:16.334419 executorch:stats.h:177] 	Total inference time:		14.871000 (seconds)		 Rate: 	13.516240 (tokens/second)
I 00:00:16.334480 executorch:stats.h:185] 		Prompt evaluation:	0.798000 (seconds)		 Rate: 	265.664160 (tokens/second)
I 00:00:16.334541 executorch:stats.h:196] 		Generated 201 tokens:	14.073000 (seconds)		 Rate: 	14.282669 (tokens/second)
I 00:00:16.334629 executorch:stats.h:204] 	Time to first generated token:	0.798000 (seconds)
I 00:00:16.334688 executorch:stats.h:211] 	Sampling time over 413 tokens:	0.479000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device

PyTorchObserver {"prefill_token_per_sec":265.664,"decode_token_per_sec":14.2827,"prompt_tokens":212,"generated_tokens":201,"model_load_start_ms":1744743525724,"model_load_end_ms":1744743527184,"inference_start_ms":1744743527186,"inference_end_ms":1744743542057,"prompt_eval_end_ms":1744743527984,"first_token_ms":1744743527984,"aggregate_sampling_time_ms":479,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
/data/local/tmp/yuyazhua/executorch/static_llm/outputs/outputs.txt: 1 file pulled. 0.9 MB/s (1170 bytes in 0.001s)
/data/local/tmp/yuyazhua/executorch/static_llm/outputs/inference_speed.txt: 1 file pulled. 0.0 MB/s (7 bytes in 0.002s)
[INFO 2026-04-08 00:22:11,849 llama.py:243] Device Inference Results[0]:
<|start_of_role|>system<|end_of_role|>You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>can you transcribe the speech into a written format?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>It appears you've provided a fragment of a sentence, possibly from a poem or text, and you're asking for a transcription or translation into written format. However, without the complete context or original text, it's challenging to accurately transcribe or translate it.

If we were to proceed with a hypothetical example, here's a possible continuation of the sentence in a written format:

"After his nap, Timothy leisurely stretched his foot, first one then the other, carefully selecting the choicest bits. Turning over the food, he methodically picked out the desired portions, meticulously choosing what was to be included in his meal."

This continuation assumes a narrative style, where Timothy is taking care of food preparation. The original sentence seems to be a playful or poetic exploration of a character's actions, possibly related to food preparation or a cooking process.<|end_of_text|>

cc: @abhinaykukkadapu, @cccclai, @haowhsu-quic

@pytorch-bot

pytorch-bot Bot commented Apr 7, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18740

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 2 New Failures, 4 Unrelated Failures

As of commit 9df0f46 with merge base e281726:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 7, 2026
@DannyYuyang-quic
Contributor Author

@pytorchbot label "release notes: qualcomm"

@pytorch-bot pytorch-bot Bot added the release notes: qualcomm Changes to the Qualcomm backend delegate label Apr 7, 2026
@DannyYuyang-quic DannyYuyang-quic changed the title Qualcomm AI Engine Direct - [Multimodal] granite-3.3-2b-instruct Qualcomm AI Engine Direct - [Multimodal] granite-speech-3.3-2b Apr 9, 2026
@DannyYuyang-quic
Contributor Author

Hi @abhinaykukkadapu,

This PR extends the Audio modality in QNNMultimodal.

Could you please take a look?
Thanks!

@abhinaykukkadapu
Contributor

@claude can you review this PR?

@claude

claude Bot commented Apr 20, 2026

Claude finished @abhinaykukkadapu's task in 3m 30s


Review: PR #18740 — Qualcomm AI Engine Direct - [Multimodal] granite-speech-3.3-2b

Overall Assessment: This is a well-structured PR that extends the QNN multimodal pipeline from vision-only to also support audio modality. The architecture cleanly mirrors the existing VLM support patterns. There are a few issues worth addressing before merging.


Architecture and Design

The PR follows the established multimodal architecture well:

  • AudioModalityConfig / AudioLanguageModel parallel the existing VisionModalityConfig / VisionLanguageModel
  • audio_chat_template.h mirrors vision_chat_template.h
  • The MultimodalEmbeddingMerger refactoring (from vision-specific to generic) is a good simplification

The ModelVersion variant (std::variant<VisionLanguageModel, AudioLanguageModel>) with modality_of() helper functions is a clean approach for dispatching based on model type.


Issues

1. Bare except in weight conversion

examples/models/granite_speech/convert_weights.py:40-42:

try:
    new_key = get_mapped_key(key, _GRANITE_TO_EXECUTORCH)
    converted_state_dict[new_key] = value
except:
    # only preserve parameters of text decoder
    pass

A bare except will silently swallow all exceptions, including KeyboardInterrupt, SystemExit, and actual bugs. This should be except KeyError: or, at minimum, except Exception:.
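A minimal sketch of the narrower handler, assuming get_mapped_key raises KeyError for unmapped names (the mapping entry and the stand-in helper below are illustrative, not the repo's actual tables):

```python
# Illustrative mapping entry; NOT the repo's real _GRANITE_TO_EXECUTORCH table.
_GRANITE_TO_EXECUTORCH = {"model.embed_tokens.weight": "tok_embeddings.weight"}


def get_mapped_key(key, mapping):
    # Stand-in for the real helper: raises KeyError when no rule matches.
    return mapping[key]


def convert(state_dict):
    converted = {}
    for key, value in state_dict.items():
        try:
            converted[get_mapped_key(key, _GRANITE_TO_EXECUTORCH)] = value
        except KeyError:
            # Only preserve parameters of the text decoder; skip everything else.
            pass
    return converted
```

Catching KeyError keeps the intended skip-unmapped-keys behavior while letting KeyboardInterrupt, SystemExit, and genuine bugs propagate.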

2. expand_audio_tokens is a no-op

runner/multimodal_runner/audio_chat_template.h:45-62

expand_audio_tokens() replaces specials.audio_token with audio_prompt, but audio_prompt is set to specials.audio_token (the same string). The function replaces each occurrence of <audio> with <audio>, which is effectively a no-op. If this is intentional scaffolding for future models that need wrapper tokens (as vision does with fake_token_around_image), it is worth adding a brief comment; otherwise this dead code should be removed to avoid confusion.

3. Typo in multimodal_embedding_merger.h

runner/multimodal_runner/multimodal_embedding_merger.h:49:

// merged embeddings are holded in this vector.

"holded" → "held"

4. load_audio and load_image defined in header with no inline

runner/multimodal_runner/utils.h:73 and utils.h:118

load_audio() and load_image() are non-inline, non-template function definitions in a header file. If this header is included in more than one translation unit, this will cause ODR violations / linker errors. These should be marked inline (like the other functions in the file) or moved to a .cpp file.

5. dispatch_inputs in chat_template.h not marked inline

runner/multimodal_runner/chat_template.h:177 — Same issue as above. dispatch_inputs() is a non-inline function defined in a header; it should be marked inline.

6. encoder_pte_size key rename may break existing VLM CI

In test_qnn_delegate.py:7086, the key changes from encoder_pte_size to vision_encoder_pte_size. This is consistent with the new f"{modality}_pte_size" pattern in llama.py:221, but make sure all existing VLM CI tests are updated to use vision_encoder_pte_size instead of encoder_pte_size. The test code here is updated, so this should be fine as long as no other consumers exist.


Minor Observations

  • Hardcoded audio feature dimensions: audio_encoder.py:150 has get_example_inputs returning a hardcoded shape (1, 844, 160). This is reasonable for a single model but could become fragile if multiple Granite Speech variants are added. Consider deriving these from the config.

  • GraniteSpeechCTCEncoderWrapper.forward vs get_example_inputs: The example input shape (1, 844, 160) should match what the encoder actually expects. Worth verifying this matches the input_features shape from AutoProcessor in dataset.py:94.

  • CustomBlip2QFormerSelfOutput identity workaround: The identity linear layer workaround (audio_encoder.py:108-110) is clearly documented as an HTP preparation workaround. This is reasonable but should be revisited when newer QNN SDK versions are available.

  • Comment in tokenizer.py:187: Says "Load image from user-specified path" but should say "Load audio from user-specified path" since this is the audio encoder branch.


Positive Aspects

  • The MultimodalEmbeddingMerger refactoring significantly reduces code duplication by replacing the previous vision-specific ImageTokEmbeddingMerger with a generic approach
  • Clean separation of audio and vision chat templates
  • The Modality enum + modality_of() dispatch pattern is well-designed for extensibility
  • Encoder model sharding support (encoder_config.py:97, num_sharding = 8) with proper pattern matching for layer-based splitting
  • Good test coverage with test_static_asr and test_qnn_conv1d_batch_norm
  • The quantization recipe for GraniteSpeech encoder uses HistogramObserver for activations and includes per-block quantization for conv layers, which is appropriate for the conformer architecture

Summary

The PR is in good shape overall. The main items to address are:

  1. ODR violation risk — mark load_audio, load_image, and dispatch_inputs as inline
  2. Bare except in convert_weights.py
  3. Minor: no-op expand_audio_tokens, typo "holded", and wrong "image" comment in the audio branch of tokenizer.py

        self.config = config

    def get_example_inputs(self):
        return (torch.randn((1, 844, 160), dtype=torch.float32),)
Contributor


@DannyYuyang-quic do we want to hardcode this for all audio encoder models?

Contributor Author


Thanks for pointing this out. In the latest commit, I parameterized the audio encoder input shape instead of hardcoding it.
The first dimension is the batch size, which we’re keeping at 1 for now. The second dimension comes from the wav preprocessing pipeline, and the last dimension is model‑specific.
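A rough sketch of that kind of parameterization, under the assumption that the shape components live on a config object (class and attribute names here are hypothetical, not the PR's actual API):

```python
class AudioEncoderConfig:
    """Hypothetical config holder: derives the example-input shape
    from config values instead of hardcoding (1, 844, 160)."""

    def __init__(self, num_frames, feat_dim, batch_size=1):
        self.batch_size = batch_size  # kept at 1 for now
        self.num_frames = num_frames  # determined by the wav preprocessing pipeline
        self.feat_dim = feat_dim      # model-specific feature dimension

    def example_input_shape(self):
        return (self.batch_size, self.num_frames, self.feat_dim)
```

For granite-speech-3.3-2b this reproduces the original shape: AudioEncoderConfig(844, 160).example_input_shape() gives (1, 844, 160).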

) or args.use_attention_sink is None, (
    "Multimodal models currently do not support attention sink feature."
)
is_multimodal or args.use_attention_sink is None
Contributor


This is from claude: The PR inverted the assertion. Text-only models (which historically supported attention-sink) now crash at export if you pass --use_attention_sink. Multimodal models (which the TODO right above says are not implemented) now silently export with whatever half-written code path exists.

Contributor Author


Thanks for catching this. The assertion logic was inverted. I’ve reverted the condition so text‑only models behave as before.
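The intended guard can be sketched as follows (the wrapper function is illustrative; the script asserts on is_multimodal and args.use_attention_sink inline):

```python
def check_attention_sink(is_multimodal, use_attention_sink):
    # Multimodal models must not enable attention sink;
    # text-only models may still pass --use_attention_sink as before.
    assert not is_multimodal or use_attention_sink is None, (
        "Multimodal models currently do not support attention sink feature."
    )
```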

qdq_intermediate_outputs = request.method_data[
    VISION_ENCODER
].calibration_data.qdq_intermediate_outputs
audio_turns = request.method_data[
Contributor


Should we need the self.apply_embedding check?

Contributor Author


No, we don’t need the self.apply_embedding check anymore. It’s redundant, because if tok_embedding isn’t provided in the _calibration function, it will go into the Multimodal model inference flow.

@abhinaykukkadapu
Contributor

abhinaykukkadapu commented Apr 20, 2026

@DannyYuyang-quic thanks for adding support for the speech model, sorry it took some time to get to this. One quick request (you can follow up later) can we add any audio multimodal tests in CI, if it takes too much time we can add it to trunk.yaml. Thanks!

I also imported it internally to see if ci is green. Will monitor and merge as soon as possible, can you please rebase.

@DannyYuyang-quic DannyYuyang-quic force-pushed the dev1/danny/support_audio-language_models branch from b01dff2 to eecdc18 Compare April 21, 2026 08:28
@DannyYuyang-quic
Contributor Author

> @DannyYuyang-quic thanks for adding support for the speech model, sorry it took some time to get to this. One quick request (you can follow up later) can we add any audio multimodal tests in CI, if it takes too much time we can add it to trunk.yaml. Thanks!
>
> I also imported it internally to see if ci is green. Will monitor and merge as soon as possible, can you please rebase.

Hi @abhinaykukkadapu, Thanks for the review! I really appreciate you taking the time, especially since this PR is on the larger side.

I think adding audio multimodal coverage to CI is a great suggestion.

The current speech model (2B) in this PR is quite heavy: AOT takes ~2 hours and runtime (x86 emulator) takes ~2.5 hours, which may be too slow.
I can try to find a smaller model that validates end-to-end in under an hour, or we could add this to trunk.yaml instead.

Summary:
 - Support granite-speech-3.3-2b
 - Extend Audio modality in QNNMultimodal AOT flow
 - Extend Audio modality in QNNMultimodal runner
 - Support encoder model sharding
@DannyYuyang-quic DannyYuyang-quic force-pushed the dev1/danny/support_audio-language_models branch from eecdc18 to ac51a36 Compare April 21, 2026 14:43
@meta-codesync
Contributor

meta-codesync Bot commented Apr 21, 2026

@abhinaykukkadapu has imported this pull request. If you are a Meta employee, you can view this in D101574849.

bias=bias,
)
self.batch_norm = torch.nn.BatchNorm1d(2048)

Contributor


@DannyYuyang-quic thanks for making the changes. One more issue: this test needs self.eval()

Contributor Author


Thanks for pointing this out! I just added it at L538

abhinaykukkadapu pushed a commit to abhinaykukkadapu/executorch that referenced this pull request Apr 22, 2026
…ch#18740)

Summary:
- Support granite-speech-3.3-2b
 - Extend Audio modality in QNNMultimodal AOT flow
 - Extend Audio modality in QNNMultimodal runner
 - Support encoder model sharding

Pull Request resolved: pytorch#18740

Test Plan: see the PR description above.

Differential Revision: D101574849

Pulled By: abhinaykukkadapu
 - Revert the attention_sink assertion for multimodal models
 - Parameterize the audio encoder input shape
 - Fix comment typos
 - Add self.eval() in Conv1dBn test
@DannyYuyang-quic DannyYuyang-quic force-pushed the dev1/danny/support_audio-language_models branch from ac51a36 to 9df0f46 Compare April 23, 2026 03:11
