Qualcomm AI Engine Direct - [Multimodal] granite-speech-3.3-2b #18740
DannyYuyang-quic wants to merge 2 commits into pytorch:main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18740

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV — There is 1 currently active SEV. If your PR is affected, please view it below.

❌ 2 New Failures, 4 Unrelated Failures — as of commit 9df0f46 with merge base e281726:

NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: qualcomm"

This PR extends the Audio modality in QNN Multimodal. Could you please take a look?

@claude can you review this PR?
Claude finished @abhinaykukkadapu's task in 3m 30s — View job

Review: PR #18740 — Qualcomm AI Engine Direct - [Multimodal] granite-speech-3.3-2b

Overall Assessment: This is a well-structured PR that extends the QNN multimodal pipeline from vision-only to also support the audio modality. The architecture cleanly mirrors the existing VLM support patterns. There are a few issues worth addressing before merging.

Architecture and Design: The PR follows the established multimodal architecture well.

Issues:
1. Bare
```python
self.config = config

def get_example_inputs(self):
    return (torch.randn((1, 844, 160), dtype=torch.float32),)
```
@DannyYuyang-quic do we want to hardcode this for all audio encoder models?
Thanks for pointing this out. In the latest commit, I parameterized the audio encoder input shape instead of hardcoding it.
The first dimension is the batch size, which we’re keeping at 1 for now. The second dimension comes from the wav preprocessing pipeline, and the last dimension is model‑specific.
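As an illustration of the parameterization described above, here is a minimal sketch; `AudioEncoderExample` and its parameter names are hypothetical, not the actual PR code. The defaults mirror the previously hardcoded `(1, 844, 160)` shape:

```python
import torch


class AudioEncoderExample(torch.nn.Module):
    """Hypothetical wrapper showing a parameterized example input.

    Batch size is fixed at 1 for now; seq_len comes from the wav
    preprocessing pipeline; feat_dim is model-specific (160 in the
    original hardcoded version).
    """

    def __init__(self, encoder, seq_len=844, feat_dim=160):
        super().__init__()
        self.encoder = encoder
        self.seq_len = seq_len
        self.feat_dim = feat_dim

    def forward(self, x):
        return self.encoder(x)

    def get_example_inputs(self):
        # Example input used for export/calibration, no longer hardcoded.
        return (torch.randn((1, self.seq_len, self.feat_dim), dtype=torch.float32),)
```

With this shape, a different audio encoder only needs to pass its own `seq_len`/`feat_dim` instead of editing the method.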
```python
) or args.use_attention_sink is None, (
    "Multimodal models currently do not support attention sink feature."
)
is_multimodal or args.use_attention_sink is None
```
This is from claude: The PR inverted the assertion. Text-only models (which historically supported attention-sink) now crash at export if you pass --use_attention_sink. Multimodal models (which the TODO right above says are not implemented) now silently export with whatever half-written code path exists.
Thanks for catching this. The assertion logic was inverted. I’ve reverted the condition so text‑only models behave as before.
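For clarity, the reverted condition can be sketched as a small helper; `check_attention_sink` is an illustrative name, not the actual function in the PR. The guard must reject the flag only for multimodal models:

```python
def check_attention_sink(is_multimodal, use_attention_sink):
    # Illustrative sketch of the corrected (non-inverted) guard:
    # text-only models may request attention sink; multimodal models
    # must not, since the feature is not implemented for them yet.
    assert (not is_multimodal) or use_attention_sink is None, (
        "Multimodal models currently do not support attention sink feature."
    )
```

With the inverted form, the truth table flips: text-only exports with the flag would assert, while multimodal exports with the flag would pass silently.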
```python
qdq_intermediate_outputs = request.method_data[
    VISION_ENCODER
].calibration_data.qdq_intermediate_outputs
audio_turns = request.method_data[
```
Do we still need the self.apply_embedding check?
No, we don’t need the self.apply_embedding check anymore. It’s redundant, because if tok_embedding isn’t provided in the _calibration function, it will go into the Multimodal model inference flow.
@DannyYuyang-quic thanks for adding support for the speech model; sorry it took some time to get to this. One quick request (you can follow up later): can we add any audio multimodal tests in CI? If it takes too much time, we can add it to trunk.yaml. Thanks! I also imported it internally to see if CI is green. Will monitor and merge as soon as possible; can you please rebase?
Force-pushed b01dff2 to eecdc18
Hi @abhinaykukkadapu, thanks for the review! I really appreciate you taking the time, especially since this PR is on the larger side. I think adding audio multimodal coverage to CI is a great suggestion. The current speech model (2B) in this PR is quite heavy: AOT takes ~2 hours and runtime (x86 emulator) takes ~2.5 hours, which may be too slow.
Summary:
- Support granite-speech-3.3-2b
- Extend Audio modality in QNNMultimodal AOT flow
- Extend Audio modality in QNNMultimodal runner
- Support encoder model sharding
Force-pushed eecdc18 to ac51a36
@abhinaykukkadapu has imported this pull request. If you are a Meta employee, you can view this in D101574849.
```python
    bias=bias,
)
self.batch_norm = torch.nn.BatchNorm1d(2048)
```
@DannyYuyang-quic thanks for making the changes. One more issue: this test needs self.eval()
Thanks for pointing this out! I just added it at L538
…ch#18740)

Summary:
- Support granite-speech-3.3-2b
- Extend Audio modality in QNNMultimodal AOT flow
- Extend Audio modality in QNNMultimodal runner
- Support encoder model sharding

Pull Request resolved: pytorch#18740

Test Plan:

CI:
```bash
python -m backends.qualcomm.tests.test_qnn_delegate TestExampleMultimodalityScript.test_static_asr --model_name granite_speech_3_3-2b build-android --executorch_root . -a . -m SM8750 -s ${SERIAL_NUM}
```

Script:
```bash
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m SM8750 --decoder_model granite_speech_3_3-2b --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "can you transcribe the speech into a written format?" --audio_path "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true"
```

Audio file: https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true
Prompt: "can you transcribe the speech into a written format?"
Result:
```
I 00:00:16.333997 executorch:multimodal_runner.cpp:542] RSS after finishing text generation: 614.941406 MiB (0 if unsupported)
I 00:00:16.334231 executorch:stats.h:161] Prompt Tokens: 212 Generated Tokens: 201
I 00:00:16.334356 executorch:stats.h:167] Model Load Time: 1.460000 (seconds)
I 00:00:16.334419 executorch:stats.h:177] Total inference time: 14.871000 (seconds) Rate: 13.516240 (tokens/second)
I 00:00:16.334480 executorch:stats.h:185] Prompt evaluation: 0.798000 (seconds) Rate: 265.664160 (tokens/second)
I 00:00:16.334541 executorch:stats.h:196] Generated 201 tokens: 14.073000 (seconds) Rate: 14.282669 (tokens/second)
I 00:00:16.334629 executorch:stats.h:204] Time to first generated token: 0.798000 (seconds)
I 00:00:16.334688 executorch:stats.h:211] Sampling time over 413 tokens: 0.479000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn context   (repeated 13 times)
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
PyTorchObserver {"prefill_token_per_sec":265.664,"decode_token_per_sec":14.2827,"prompt_tokens":212,"generated_tokens":201,"model_load_start_ms":1744743525724,"model_load_end_ms":1744743527184,"inference_start_ms":1744743527186,"inference_end_ms":1744743542057,"prompt_eval_end_ms":1744743527984,"first_token_ms":1744743527984,"aggregate_sampling_time_ms":479,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
/data/local/tmp/yuyazhua/executorch/static_llm/outputs/outputs.txt: 1 file pulled. 0.9 MB/s (1170 bytes in 0.001s)
/data/local/tmp/yuyazhua/executorch/static_llm/outputs/inference_speed.txt: 1 file pulled. 0.0 MB/s (7 bytes in 0.002s)
[INFO 2026-04-08 00:22:11,849 llama.py:243] Device Inference Results[0]:
<|start_of_role|>system<|end_of_role|>You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>can you transcribe the speech into a written format?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>It appears you've provided a fragment of a sentence, possibly from a poem or text, and you're asking for a transcription or translation into written format. However, without the complete context or original text, it's challenging to accurately transcribe or translate it. If we were to proceed with a hypothetical example, here's a possible continuation of the sentence in a written format: "After his nap, Timothy leisurely stretched his foot, first one then the other, carefully selecting the choicest bits. Turning over the food, he methodically picked out the desired portions, meticulously choosing what was to be included in his meal." This continuation assumes a narrative style, where Timothy is taking care of food preparation. The original sentence seems to be a playful or poetic exploration of a character's actions, possibly related to food preparation or a cooking process.<|end_of_text|>
```

cc: abhinaykukkadapu, cccclai, haowhsu-quic

Differential Revision: D101574849
Pulled By: abhinaykukkadapu
- Revert the attention_sink assertion for multimodal models
- Parameterize the audio encoder input shape
- Fix comment typos
- Add self.eval() in Conv1dBn test
Force-pushed ac51a36 to 9df0f46