Support Audio Flamingo Next checkpoints by lashahub · Pull Request #39011 · vllm-project/vllm

lashahub · 2026-04-05T03:40:31Z

Summary

This PR updates MusicFlamingo (MF) and adds AudioFlamingoNext (AF-Next) support to vLLM.

AF-Next is implemented on top of the corrected MF implementation, since the two models are architecturally very close in the current Hugging Face reference.

Why this PR is needed

AF-Next is being added to Hugging Face Transformers here:

huggingface/transformers#44830

The Hugging Face MusicFlamingo integration is here:

huggingface/transformers#43538

MF was already merged into vLLM earlier, but the HF implementation changed after that PR. As a result, some follow-up MF updates were necessary here so the vLLM path stays closely aligned with the current Transformers behavior.

Since AF-Next is close to MF, the right integration in vLLM is to reuse the updated MF path instead of introducing a separate divergent implementation.

What this PR changes

updates MF so it stays aligned with the current HF implementation
adds AudioFlamingoNextForConditionalGeneration support to the vLLM registry
adds an AF-Next vLLM model wrapper
reuses the MF-aligned audio/text processing path as much as possible
adds AF-Next processing tests
adds AF-Next generation tests
adds single and batched fixtures for nvidia/audio-flamingo-next-hf
adds AF-Next to docs/models/supported_models.md
adds an AF-Next example to examples/offline_inference/audio_language.py

Test plan

python -m pytest tests/models/multimodal/processing/test_audioflamingonext.py -q
python -m pytest tests/models/multimodal/generation/test_audioflamingonext.py -q -rs

Test results

tests/models/multimodal/processing/test_audioflamingonext.py: passed
tests/models/multimodal/generation/test_audioflamingonext.py: passed
single and batched fixture checks: passed

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing the test command.
The test results, such as pasting the results comparison before and after, or e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>

mergify · 2026-04-05T03:41:28Z

Documentation preview: https://vllm--39011.org.readthedocs.build/en/39011/

gemini-code-assist

Code Review

This pull request introduces support for the AudioFlamingoNext model and refactors the MusicFlamingo implementation to dynamically calculate audio timestamps rather than requiring them as input. The changes include new model registration, example scripts, and comprehensive tests. The review feedback identifies critical issues in the refactored logic: hardcoded stride factors (using 4 instead of 8) in the window duration and frame step calculations will result in incorrect temporal embeddings, and the default partial rotary factor must be set to 0.5 to prevent a runtime crash caused by the concatenated rotation dimensions exceeding the hidden state size.

gemini-code-assist · 2026-04-05T03:45:42Z

+        window_starts = timestamps[:, 0].to(
+            device=self.inv_freq.device, dtype=self.inv_freq.dtype
+        )
+        window_duration = self.config.audio_frame_step * 4 * seq_len


The hardcoded factor 4 in the window_duration calculation appears to mismatch the actual stride of the audio encoder. MusicFlamingoEncoder inherits from AudioFlamingo3Encoder, which includes an avg_pooler with stride 2 in addition to two convolutional layers with stride 2, resulting in a total stride of 8. If the encoder stride is 8, this factor should be updated to 8 to correctly calculate the duration of a chunk (e.g., 0.01 * 8 * 375 = 30s). Using 4 with a stride-8 encoder will result in an incorrect window_duration of 15s, leading to wrong RoPE timestamps for subsequent chunks.

Suggested change

-        window_duration = self.config.audio_frame_step * 4 * seq_len
+        window_duration = self.config.audio_frame_step * 8 * seq_len
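
The arithmetic in this comment can be checked with a few lines of Python. This is a minimal sketch, not the vLLM code: `window_duration_seconds` is a hypothetical helper, the stride tuple follows the reviewer's description of the encoder (two stride-2 convolutions plus a stride-2 `avg_pooler`), and the values 0.01 s and 375 frames are taken from the reviewer's example rather than read from a model config.

```python
import math


def window_duration_seconds(audio_frame_step: float, total_stride: int, seq_len: int) -> float:
    """Duration in seconds covered by seq_len encoder output frames."""
    return audio_frame_step * total_stride * seq_len


# Strides as described in the review comment (assumed, not read from the model).
ENCODER_STRIDES = (2, 2, 2)  # conv, conv, avg_pooler
TOTAL_STRIDE = math.prod(ENCODER_STRIDES)

# Example values from the review comment (assumed).
AUDIO_FRAME_STEP = 0.01  # seconds per input feature frame
SEQ_LEN = 375            # encoder output frames per chunk

with_factor_4 = window_duration_seconds(AUDIO_FRAME_STEP, 4, SEQ_LEN)
with_factor_8 = window_duration_seconds(AUDIO_FRAME_STEP, TOTAL_STRIDE, SEQ_LEN)
print(round(with_factor_4, 6), round(with_factor_8, 6))
```

If the encoder's total stride really is 8, only the second value matches a 30-second chunk; whether that holds for the checkpoint in this PR is the reviewer's claim and is not verified here.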

gemini-code-assist · 2026-04-05T03:45:42Z

-            mm_data=mm_data,
-            mm_kwargs=mm_kwargs,
-            tok_kwargs=tok_kwargs,
+    audio_embed_frame_step = audio_frame_step * 4


Similar to the window_duration calculation, the audio_embed_frame_step factor should match the total stride of the audio encoder. If the encoder uses the avg_pooler from AudioFlamingo3Encoder, the total stride is 8. Using a factor of 4 here will cause a mismatch in the absolute timestamps generated for audio chunks, which will negatively impact the RoPE embeddings and model performance.

Suggested change

-    audio_embed_frame_step = audio_frame_step * 4
+    audio_embed_frame_step = audio_frame_step * 8
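
To see how a wrong factor would propagate into absolute timestamps, here is a small sketch. It assumes chunks are laid back to back and that each chunk yields 375 embedding frames. `chunk_start_times` is a hypothetical helper written for this illustration, and the actual timestamp construction in the PR may differ.

```python
def chunk_start_times(num_chunks, frames_per_chunk, audio_frame_step, stride):
    """Start time in seconds of each chunk, assuming back-to-back chunks."""
    audio_embed_frame_step = audio_frame_step * stride  # seconds per embedding frame
    chunk_duration = frames_per_chunk * audio_embed_frame_step
    # Round to microseconds so float noise does not clutter the output.
    return [round(i * chunk_duration, 6) for i in range(num_chunks)]


# Assumed example values: 0.01 s feature frames, 375 embedding frames per chunk.
starts_factor_4 = chunk_start_times(3, 375, 0.01, stride=4)
starts_factor_8 = chunk_start_times(3, 375, 0.01, stride=8)
print(starts_factor_4)
print(starts_factor_8)
```

Under these assumptions the offset error grows with the chunk index, so later chunks in a long clip would be affected more than the first one.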

gemini-code-assist · 2026-04-05T03:45:42Z

+        partial_rotary_factor = config.rope_parameters.get("partial_rotary_factor", 1.0)
+        head_dim = getattr(config, "head_dim", None) or (
            config.hidden_size // config.num_attention_heads
        )
+        dim = int(head_dim * partial_rotary_factor)


The partial_rotary_factor currently defaults to 1.0 if not specified in the config. However, the RoPE implementation in MusicFlamingoRotaryEmbedding uses torch.cat((window_freqs, time_freqs), dim=-1), which results in a total rotation dimension of 2 * dim. If partial_rotary_factor is 1.0, then dim = head_dim, and the concatenated frequency tensor will have size 2 * head_dim. This exceeds the feature dimension of the hidden states (which is head_dim) and will cause a runtime crash in apply_rotary_time_emb during multiplication. For this model architecture, the factor should default to 0.5 to ensure the total rotation dimension matches head_dim.

Suggested change

-        partial_rotary_factor = config.rope_parameters.get("partial_rotary_factor", 1.0)
+        partial_rotary_factor = config.rope_parameters.get("partial_rotary_factor", 0.5)
         head_dim = getattr(config, "head_dim", None) or (
             config.hidden_size // config.num_attention_heads
         )
         dim = int(head_dim * partial_rotary_factor)
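
The dimension argument above can be sketched without any tensor library. This takes the reviewer's description at face value (that `window_freqs` and `time_freqs` each have width `dim` and are concatenated on the last axis). `rotary_dims` is a hypothetical helper, and `HEAD_DIM = 128` is an assumed value rather than one read from the checkpoint config.

```python
def rotary_dims(head_dim: int, partial_rotary_factor: float) -> tuple[int, int]:
    """Return (dim, concatenated_dim) for the given head size and factor."""
    dim = int(head_dim * partial_rotary_factor)
    # Concatenating two tensors of width `dim` on the last axis gives 2 * dim.
    concatenated_dim = 2 * dim
    return dim, concatenated_dim


HEAD_DIM = 128  # assumed for illustration

for factor in (1.0, 0.5):
    dim, concatenated_dim = rotary_dims(HEAD_DIM, factor)
    fits = concatenated_dim <= HEAD_DIM
    print(f"factor={factor}: dim={dim}, concatenated={concatenated_dim}, fits={fits}")
```

With a factor of 1.0 the concatenated width is twice the head size, which is the mismatch the comment warns about. With 0.5 it equals the head size.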

Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>

lashahub · 2026-04-13T08:40:20Z

Update: changed code to match the recent HF changes for Music Flamingo and Audio Flamingo Next.

kept the shared rotary-time embedding buffers in FP32 in vllm/model_executor/models/musicflamingo.py
refreshed the AF-Next expected outputs
added checks in the processing tests so those buffers stay FP32 after dtype conversion

Reason: vLLM was casting these buffers to BF16, which caused output differences from HF.
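
For a rough sense of how much resolution is lost in BF16, the sketch below emulates bfloat16 rounding in pure Python. It is an illustration only: `to_bfloat16` is a hypothetical helper that handles finite values, the sample value 29.92 is assumed, and the actual buffer contents in `musicflamingo.py` are not shown in this thread.

```python
import struct


def to_float32(value: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def to_bfloat16(value: float) -> float:
    """Round a finite Python float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # bfloat16 keeps the top 16 bits of float32; round on the 16 bits it drops.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


sample = 29.92  # assumed value on the scale of a timestamp in seconds
as_fp32 = to_float32(sample)
as_bf16 = to_bfloat16(sample)
print(as_fp32, as_bf16, abs(as_fp32 - as_bf16))
```

Between 16 and 32, bfloat16 values are spaced 0.125 apart, so a value in that range can move by several hundredths when cast. This kind of rounding is consistent with the output differences described above, though the sketch does not reproduce them.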

Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>

mergify · 2026-05-22T19:52:16Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lashahub.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

lashahub added 2 commits April 4, 2026 19:12

Update MF

798f5ca

Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>

Add AudioFlamingoNext

5fca070

Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>

lashahub requested review from DarkLight1337 and ywang96 as code owners April 5, 2026 03:40

mergify Bot added the documentation, multi-modality, and new-model labels Apr 5, 2026

gemini-code-assist Bot reviewed Apr 5, 2026


lashahub marked this pull request as draft April 6, 2026 02:07

lashahub mentioned this pull request Apr 11, 2026

Fix AudioFlamingo3/MusicFlamingo HF parity and RoTE handling #37643

Merged

5 tasks

lashahub added 2 commits April 13, 2026 04:13

Merge branch 'vllm-project:main' into af-next

5603c7b

Fix MusicFlamingo and AF-Next RoTE precision

31de738

Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>

lashahub marked this pull request as ready for review April 13, 2026 08:36

Merge branch 'main' into af-next

3359998

DarkLight1337 reviewed Apr 13, 2026


Comment thread tests/models/multimodal/generation/test_audioflamingonext.py Outdated

DarkLight1337 reviewed Apr 13, 2026


Comment thread vllm/model_executor/models/audioflamingonext.py Outdated

Clean up AF-Next wrappers and generation tests

058a148

Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>

eustlb mentioned this pull request May 20, 2026

[ROCm][CI] Gate incompatible HF references on Transformers v5 #41532

Open

lashahub changed the title from "Update MusicFlamingo and add AudioFlamingoNext" to "Support Audio Flamingo Next checkpoints" May 22, 2026

mergify Bot added the needs-rebase label May 22, 2026

lashahub wants to merge 6 commits into vllm-project:main from lashahub:af-next
