
Qualcomm AI Engine Direct - [Multimodal] granite-speech-3.3-2b#18740

Open
DannyYuyang-quic wants to merge 2 commits into pytorch:main from
CodeLinaro:dev1/danny/support_audio-language_models

Conversation

@DannyYuyang-quic
Contributor

Summary

  • Support granite-speech-3.3-2b
  • Extend Audio modality in QNNMultimodal AOT flow
  • Extend Audio modality in QNNMultimodal runner
  • Support encoder model sharding

Test plan

CI

python -m backends.qualcomm.tests.test_qnn_delegate TestExampleMultimodalityScript.test_static_asr --model_name granite_speech_3_3-2b build-android --executorch_root . -a . -m SM8750 -s ${SERIAL_NUM}

Script

python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m SM8750 --decoder_model granite_speech_3_3-2b --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "can you transcribe the speech into a written format?" --audio_path "https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true"

Audio file: https://huggingface.co/ibm-granite/granite-speech-3.3-2b/resolve/main/10226_10111_000000.wav?download=true
Prompt: "can you transcribe the speech into a written format?"
Result

I 00:00:16.333997 executorch:multimodal_runner.cpp:542] RSS after finishing text generation: 614.941406 MiB (0 if unsupported)
I 00:00:16.334231 executorch:stats.h:161] 	Prompt Tokens: 212    Generated Tokens: 201
I 00:00:16.334356 executorch:stats.h:167] 	Model Load Time:		1.460000 (seconds)
I 00:00:16.334419 executorch:stats.h:177] 	Total inference time:		14.871000 (seconds)		 Rate: 	13.516240 (tokens/second)
I 00:00:16.334480 executorch:stats.h:185] 		Prompt evaluation:	0.798000 (seconds)		 Rate: 	265.664160 (tokens/second)
I 00:00:16.334541 executorch:stats.h:196] 		Generated 201 tokens:	14.073000 (seconds)		 Rate: 	14.282669 (tokens/second)
I 00:00:16.334629 executorch:stats.h:204] 	Time to first generated token:	0.798000 (seconds)
I 00:00:16.334688 executorch:stats.h:211] 	Sampling time over 413 tokens:	0.479000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device

PyTorchObserver {"prefill_token_per_sec":265.664,"decode_token_per_sec":14.2827,"prompt_tokens":212,"generated_tokens":201,"model_load_start_ms":1744743525724,"model_load_end_ms":1744743527184,"inference_start_ms":1744743527186,"inference_end_ms":1744743542057,"prompt_eval_end_ms":1744743527984,"first_token_ms":1744743527984,"aggregate_sampling_time_ms":479,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
/data/local/tmp/yuyazhua/executorch/static_llm/outputs/outputs.txt: 1 file pulled. 0.9 MB/s (1170 bytes in 0.001s)
/data/local/tmp/yuyazhua/executorch/static_llm/outputs/inference_speed.txt: 1 file pulled. 0.0 MB/s (7 bytes in 0.002s)
[INFO 2026-04-08 00:22:11,849 llama.py:243] Device Inference Results[0]:
<|start_of_role|>system<|end_of_role|>You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>can you transcribe the speech into a written format?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>It appears you've provided a fragment of a sentence, possibly from a poem or text, and you're asking for a transcription or translation into written format. However, without the complete context or original text, it's challenging to accurately transcribe or translate it.

If we were to proceed with a hypothetical example, here's a possible continuation of the sentence in a written format:

"After his nap, Timothy leisurely stretched his foot, first one then the other, carefully selecting the choicest bits. Turning over the food, he methodically picked out the desired portions, meticulously choosing what was to be included in his meal."

This continuation assumes a narrative style, where Timothy is taking care of food preparation. The original sentence seems to be a playful or poetic exploration of a character's actions, possibly related to food preparation or a cooking process.<|end_of_text|>

cc: @abhinaykukkadapu, @cccclai, @haowhsu-quic

@pytorch-bot

pytorch-bot Bot commented Apr 7, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18740

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 2 New Failures, 4 Unrelated Failures

As of commit 9df0f46 with merge base e281726:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 7, 2026
@DannyYuyang-quic
Contributor Author

@pytorchbot label "release notes: qualcomm"

@pytorch-bot pytorch-bot Bot added the release notes: qualcomm Changes to the Qualcomm backend delegate label Apr 7, 2026
@DannyYuyang-quic DannyYuyang-quic changed the title Qualcomm AI Engine Direct - [Multimodal] granite-3.3-2b-instruct Qualcomm AI Engine Direct - [Multimodal] granite-speech-3.3-2b Apr 9, 2026
@DannyYuyang-quic
Contributor Author

Hi @abhinaykukkadapu,

This PR extends the Audio modality in QNNMultimodal.

Could you please take a look?
Thanks!

@abhinaykukkadapu
Contributor

@claude can you review this PR?

@claude

claude Bot commented Apr 20, 2026

Claude finished @abhinaykukkadapu's task in 3m 30s


Review: PR #18740 — Qualcomm AI Engine Direct - [Multimodal] granite-speech-3.3-2b

Overall Assessment: This is a well-structured PR that extends the QNN multimodal pipeline from vision-only to also support audio modality. The architecture cleanly mirrors the existing VLM support patterns. There are a few issues worth addressing before merging.


Architecture and Design

The PR follows the established multimodal architecture well:

  • AudioModalityConfig / AudioLanguageModel parallel the existing VisionModalityConfig / VisionLanguageModel
  • audio_chat_template.h mirrors vision_chat_template.h
  • The MultimodalEmbeddingMerger refactoring (from vision-specific to generic) is a good simplification

The ModelVersion variant (std::variant<VisionLanguageModel, AudioLanguageModel>) with modality_of() helper functions is a clean approach for dispatching based on model type.


Issues

1. Bare except in weight conversion

examples/models/granite_speech/convert_weights.py:40-42:

try:
    new_key = get_mapped_key(key, _GRANITE_TO_EXECUTORCH)
    converted_state_dict[new_key] = value
except:
    # only preserve parameters of text decoder
    pass

A bare except will silently swallow all exceptions, including KeyboardInterrupt, SystemExit, and actual bugs. This should be except KeyError: or, at minimum, except Exception:.
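A minimal sketch of the narrower handler, assuming get_mapped_key raises KeyError for unmapped names (the mapping entry and the stand-in helper below are illustrative, not the repo's actual tables):

```python
# Illustrative mapping entry; NOT the repo's real _GRANITE_TO_EXECUTORCH table.
_GRANITE_TO_EXECUTORCH = {"model.embed_tokens.weight": "tok_embeddings.weight"}


def get_mapped_key(key, mapping):
    # Stand-in for the real helper: raises KeyError when no rule matches.
    return mapping[key]


def convert(state_dict):
    converted = {}
    for key, value in state_dict.items():
        try:
            converted[get_mapped_key(key, _GRANITE_TO_EXECUTORCH)] = value
        except KeyError:
            # Only preserve parameters of the text decoder; skip everything else.
            pass
    return converted
```

Catching KeyError keeps the intended skip-unmapped-keys behavior while letting KeyboardInterrupt, SystemExit, and genuine bugs propagate.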

2. expand_audio_tokens is a no-op

runner/multimodal_runner/audio_chat_template.h:45-62

expand_audio_tokens() replaces specials.audio_token with audio_prompt, but audio_prompt is set to specials.audio_token (the same string). The function replaces each occurrence of <audio> with <audio>, which is effectively a no-op. If this is intentional scaffolding for future models that need wrapper tokens (as vision does with fake_token_around_image), it is worth adding a brief comment; otherwise this dead code should be removed to avoid confusion.

3. Typo in multimodal_embedding_merger.h

runner/multimodal_runner/multimodal_embedding_merger.h:49:

// merged embeddings are holded in this vector.

"holded" → "held"

4. load_audio and load_image defined in header with no inline

runner/multimodal_runner/utils.h:73 and utils.h:118

load_audio() and load_image() are non-inline, non-template function definitions in a header file. If this header is included in more than one translation unit, this will cause ODR violations / linker errors. These should be marked inline (like the other functions in the file) or moved to a .cpp file.

5. dispatch_inputs in chat_template.h not marked inline

runner/multimodal_runner/chat_template.h:177 — Same issue as above. dispatch_inputs() is a non-inline function defined in a header; it should be marked inline.

6. encoder_pte_size key rename may break existing VLM CI

In test_qnn_delegate.py:7086, the key changes from encoder_pte_size to vision_encoder_pte_size. This is consistent with the new f"{modality}_pte_size" pattern in llama.py:221, but make sure all existing VLM CI tests are updated to use vision_encoder_pte_size instead of encoder_pte_size. The test code here is updated, so this should be fine as long as no other consumers exist.


Minor Observations

  • Hardcoded audio feature dimensions: audio_encoder.py:150 has get_example_inputs returning a hardcoded shape (1, 844, 160). This is reasonable for a single model but could become fragile if multiple Granite Speech variants are added. Consider deriving these from the config.

  • GraniteSpeechCTCEncoderWrapper.forward vs get_example_inputs: The example input shape (1, 844, 160) should match what the encoder actually expects. Worth verifying this matches the input_features shape from AutoProcessor in dataset.py:94.

  • CustomBlip2QFormerSelfOutput identity workaround: The identity linear layer workaround (audio_encoder.py:108-110) is clearly documented as an HTP preparation workaround. This is reasonable but should be revisited when newer QNN SDK versions are available.

  • Comment in tokenizer.py:187: Says "Load image from user-specified path" but should say "Load audio from user-specified path" since this is the audio encoder branch.


Positive Aspects

  • The MultimodalEmbeddingMerger refactoring significantly reduces code duplication by replacing the previous vision-specific ImageTokEmbeddingMerger with a generic approach
  • Clean separation of audio and vision chat templates
  • The Modality enum + modality_of() dispatch pattern is well-designed for extensibility
  • Encoder model sharding support (encoder_config.py:97, num_sharding = 8) with proper pattern matching for layer-based splitting
  • Good test coverage with test_static_asr and test_qnn_conv1d_batch_norm
  • The quantization recipe for GraniteSpeech encoder uses HistogramObserver for activations and includes per-block quantization for conv layers, which is appropriate for the conformer architecture

Summary

The PR is in good shape overall. The main items to address are:

  1. ODR violation risk — mark load_audio, load_image, and dispatch_inputs as inline
  2. Bare except in convert_weights.py
  3. Minor: no-op expand_audio_tokens, typo "holded", and wrong "image" comment in the audio branch of tokenizer.py

        self.config = config

    def get_example_inputs(self):
        return (torch.randn((1, 844, 160), dtype=torch.float32),)
Contributor


@DannyYuyang-quic do we want to hardcode this for all audio encoder models?

Contributor Author


Thanks for pointing this out. In the latest commit, I parameterized the audio encoder input shape instead of hardcoding it.
The first dimension is the batch size, which we’re keeping at 1 for now. The second dimension comes from the wav preprocessing pipeline, and the last dimension is model‑specific.
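A rough sketch of that kind of parameterization, under the assumption that the shape components live on a config object (class and attribute names here are hypothetical, not the PR's actual API):

```python
class AudioEncoderConfig:
    """Hypothetical config holder: derives the example-input shape
    from config values instead of hardcoding (1, 844, 160)."""

    def __init__(self, num_frames, feat_dim, batch_size=1):
        self.batch_size = batch_size  # kept at 1 for now
        self.num_frames = num_frames  # determined by the wav preprocessing pipeline
        self.feat_dim = feat_dim      # model-specific feature dimension

    def example_input_shape(self):
        return (self.batch_size, self.num_frames, self.feat_dim)
```

For granite-speech-3.3-2b this reproduces the original shape: AudioEncoderConfig(844, 160).example_input_shape() gives (1, 844, 160).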

) or args.use_attention_sink is None, (
    "Multimodal models currently do not support attention sink feature."
)
is_multimodal or args.use_attention_sink is None
Contributor


This is from claude: The PR inverted the assertion. Text-only models (which historically supported attention-sink) now crash at export if you pass --use_attention_sink. Multimodal models (which the TODO right above says are not implemented) now silently export with whatever half-written code path exists.

Contributor Author


Thanks for catching this. The assertion logic was inverted. I’ve reverted the condition so text‑only models behave as before.
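The intended guard can be sketched as follows (the wrapper function is illustrative; the script asserts on is_multimodal and args.use_attention_sink inline):

```python
def check_attention_sink(is_multimodal, use_attention_sink):
    # Multimodal models must not enable attention sink;
    # text-only models may still pass --use_attention_sink as before.
    assert not is_multimodal or use_attention_sink is None, (
        "Multimodal models currently do not support attention sink feature."
    )
```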

qdq_intermediate_outputs = request.method_data[
    VISION_ENCODER
].calibration_data.qdq_intermediate_outputs
audio_turns = request.method_data[
Contributor


Should we need the self.apply_embedding check?

Contributor Author


No, we don’t need the self.apply_embedding check anymore. It’s redundant, because if tok_embedding isn’t provided in the _calibration function, it will go into the Multimodal model inference flow.

@abhinaykukkadapu
Contributor

abhinaykukkadapu commented Apr 20, 2026

@DannyYuyang-quic thanks for adding support for the speech model, sorry it took some time to get to this. One quick request (you can follow up later) can we add any audio multimodal tests in CI, if it takes too much time we can add it to trunk.yaml. Thanks!

I also imported it internally to see if ci is green. Will monitor and merge as soon as possible, can you please rebase.

@DannyYuyang-quic DannyYuyang-quic force-pushed the dev1/danny/support_audio-language_models branch from b01dff2 to eecdc18 Compare April 21, 2026 08:28
@DannyYuyang-quic
Contributor Author

> @DannyYuyang-quic thanks for adding support for the speech model, sorry it took some time to get to this. One quick request (you can follow up later) can we add any audio multimodal tests in CI, if it takes too much time we can add it to trunk.yaml. Thanks!
>
> I also imported it internally to see if ci is green. Will monitor and merge as soon as possible, can you please rebase.

Hi @abhinaykukkadapu, Thanks for the review! I really appreciate you taking the time, especially since this PR is on the larger side.

I think adding audio multimodal coverage to CI is a great suggestion.

The current speech model (2B) in this PR is quite heavy: AOT takes ~2 hours and runtime (x86 emulator) takes ~2.5 hours, which may be too slow.
I can try to find a smaller model that validates end-to-end in under an hour, or we could add this to trunk.yaml instead.

Summary:
 - Support granite-speech-3.3-2b
 - Extend Audio modality in QNNMultimodal AOT flow
 - Extend Audio modality in QNNMultimodal runner
 - Support encoder model sharding
@DannyYuyang-quic DannyYuyang-quic force-pushed the dev1/danny/support_audio-language_models branch from eecdc18 to ac51a36 Compare April 21, 2026 14:43
@meta-codesync
Contributor

meta-codesync Bot commented Apr 21, 2026

@abhinaykukkadapu has imported this pull request. If you are a Meta employee, you can view this in D101574849.

bias=bias,
)
self.batch_norm = torch.nn.BatchNorm1d(2048)

Contributor


@DannyYuyang-quic thanks for making the changes. One more issue: this test needs self.eval()

Contributor Author


Thanks for pointing this out! I just added it at L538

abhinaykukkadapu pushed a commit to abhinaykukkadapu/executorch that referenced this pull request Apr 22, 2026
…ch#18740)

Summary:
- Support granite-speech-3.3-2b
 - Extend Audio modality in QNNMultimodal AOT flow
 - Extend Audio modality in QNNMultimodal runner
 - Support encoder model sharding

Pull Request resolved: pytorch#18740

Test Plan: see the PR description above.

Differential Revision: D101574849

Pulled By: abhinaykukkadapu
 - Revert the attention_sink assertion for multimodal models
 - Parameterize the audio encoder input shape
 - Fix comment typos
 - Add self.eval() in Conv1dBn test
@DannyYuyang-quic DannyYuyang-quic force-pushed the dev1/danny/support_audio-language_models branch from ac51a36 to 9df0f46 Compare April 23, 2026 03:11
