Support Audio Flamingo Next checkpoints #39011

Open
lashahub wants to merge 6 commits into vllm-project:main from lashahub:af-next
@lashahub
Contributor

@lashahub lashahub commented Apr 5, 2026

Summary

This PR updates MusicFlamingo (MF) and adds AudioFlamingoNext (AF-Next) support to vLLM.

AF-Next is implemented on top of the corrected MF implementation, since the two models are architecturally very close in the current Hugging Face reference.

Why this PR is needed

AF-Next is being added to Hugging Face Transformers here:

huggingface/transformers#44830

The Hugging Face MusicFlamingo integration is here:

huggingface/transformers#43538

MF was already merged into vLLM earlier, but the HF implementation changed after that PR. As a result, some follow-up MF updates were necessary here so the vLLM path stays closely aligned with the current Transformers behavior.

Since AF-Next is close to MF, the right integration in vLLM is to reuse the updated MF path instead of introducing a separate divergent implementation.

What this PR changes

  • updates MF so it stays aligned with the current HF implementation
  • adds AudioFlamingoNextForConditionalGeneration support to the vLLM registry
  • adds an AF-Next vLLM model wrapper
  • reuses the MF-aligned audio/text processing path as much as possible
  • adds AF-Next processing tests
  • adds AF-Next generation tests
  • adds single and batched fixtures for nvidia/audio-flamingo-next-hf
  • adds AF-Next to docs/models/supported_models.md
  • adds an AF-Next example to examples/offline_inference/audio_language.py

Test plan

python -m pytest tests/models/multimodal/processing/test_audioflamingonext.py -q
python -m pytest tests/models/multimodal/generation/test_audioflamingonext.py -q -rs

Test results

  • tests/models/multimodal/processing/test_audioflamingonext.py: passed
  • tests/models/multimodal/generation/test_audioflamingonext.py: passed
  • single and batched fixture checks: passed

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

lashahub added 2 commits April 4, 2026 19:12
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
@mergify
Contributor

mergify Bot commented Apr 5, 2026

Documentation preview: https://vllm--39011.org.readthedocs.build/en/39011/

@mergify mergify Bot added the documentation, multi-modality, and new-model labels Apr 5, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for the AudioFlamingoNext model and refactors the MusicFlamingo implementation to dynamically calculate audio timestamps rather than requiring them as input. The changes include new model registration, example scripts, and comprehensive tests. The review feedback identifies critical issues in the refactored logic: hardcoded stride factors (using 4 instead of 8) in the window duration and frame step calculations will result in incorrect temporal embeddings, and the default partial rotary factor must be set to 0.5 to prevent a runtime crash caused by the concatenated rotation dimensions exceeding the hidden state size.

window_starts = timestamps[:, 0].to(
    device=self.inv_freq.device, dtype=self.inv_freq.dtype
)
window_duration = self.config.audio_frame_step * 4 * seq_len

critical

The hardcoded factor 4 in the window_duration calculation appears to mismatch the actual stride of the audio encoder. MusicFlamingoEncoder inherits from AudioFlamingo3Encoder, which includes an avg_pooler with stride 2 in addition to two convolutional layers with stride 2, resulting in a total stride of 8. If the encoder stride is 8, this factor should be updated to 8 to correctly calculate the duration of a chunk (e.g., 0.01 * 8 * 375 = 30s). Using 4 with a stride-8 encoder will result in an incorrect window_duration of 15s, leading to wrong RoPE timestamps for subsequent chunks.

Suggested change
- window_duration = self.config.audio_frame_step * 4 * seq_len
+ window_duration = self.config.audio_frame_step * 8 * seq_len
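
The arithmetic behind this comment can be sketched as follows. This is a minimal illustration that takes the reviewer's figures at face value (a frame step of 0.01 s, 375 embeddings per chunk, and stride-2 layers); the helper names `total_stride` and `window_duration` are hypothetical and are not part of vLLM or Transformers.

```python
import math


def total_stride(layer_strides):
    """Multiply the per-layer strides to get the encoder's overall downsampling."""
    return math.prod(layer_strides)


def window_duration(audio_frame_step, stride, seq_len):
    """Seconds covered by one chunk of seq_len encoder outputs."""
    return audio_frame_step * stride * seq_len


# Assumed values, copied from the review comment rather than read from a config.
AUDIO_FRAME_STEP = 0.01  # seconds per input frame
SEQ_LEN = 375            # encoder outputs per chunk

convs_only = total_stride([2, 2])        # two stride-2 convolutions
with_pooler = total_stride([2, 2, 2])    # plus a stride-2 average pooler

print(convs_only, window_duration(AUDIO_FRAME_STEP, convs_only, SEQ_LEN))
print(with_pooler, window_duration(AUDIO_FRAME_STEP, with_pooler, SEQ_LEN))
```

Whether 4 or 8 is correct depends on whether the pooled or unpooled sequence reaches the rotary embedding, which this sketch does not determine.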

mm_data=mm_data,
mm_kwargs=mm_kwargs,
tok_kwargs=tok_kwargs,
audio_embed_frame_step = audio_frame_step * 4

critical

Similar to the window_duration calculation, the audio_embed_frame_step factor should match the total stride of the audio encoder. If the encoder uses the avg_pooler from AudioFlamingo3Encoder, the total stride is 8. Using a factor of 4 here will cause a mismatch in the absolute timestamps generated for audio chunks, which will negatively impact the RoPE embeddings and model performance.

Suggested change
- audio_embed_frame_step = audio_frame_step * 4
+ audio_embed_frame_step = audio_frame_step * 8
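
To show how the factor propagates into absolute timestamps, here is a rough sketch. It assumes each chunk holds a fixed number of embedding frames and that chunk starts are consecutive multiples of the chunk length; the function `embed_frame_times` and its parameters are hypothetical and do not reproduce the processor code in this PR.

```python
import math


def embed_frame_times(chunk_index, frames_per_chunk, audio_frame_step, encoder_stride):
    """Absolute start time (seconds) of every embedding frame in one chunk."""
    embed_step = audio_frame_step * encoder_stride      # seconds per embedding frame
    chunk_start = chunk_index * frames_per_chunk * embed_step
    return [chunk_start + i * embed_step for i in range(frames_per_chunk)]


# Assumed values taken from the review discussion above.
times_stride_8 = embed_frame_times(1, 375, 0.01, 8)
times_stride_4 = embed_frame_times(1, 375, 0.01, 4)

# First frame of the second chunk under each assumption.
print(times_stride_8[0], times_stride_4[0])
```

Under these assumptions the second chunk starts at about 30 s with a factor of 8 and about 15 s with a factor of 4, so every later chunk would be shifted if the factor did not match the encoder.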

Comment on lines +104 to +108
partial_rotary_factor = config.rope_parameters.get("partial_rotary_factor", 1.0)
head_dim = getattr(config, "head_dim", None) or (
    config.hidden_size // config.num_attention_heads
)
dim = int(head_dim * partial_rotary_factor)

high

The partial_rotary_factor currently defaults to 1.0 if not specified in the config. However, the RoPE implementation in MusicFlamingoRotaryEmbedding uses torch.cat((window_freqs, time_freqs), dim=-1), which results in a total rotation dimension of 2 * dim. If partial_rotary_factor is 1.0, then dim = head_dim, and the concatenated frequency tensor will have size 2 * head_dim. This exceeds the feature dimension of the hidden states (which is head_dim) and will cause a runtime crash in apply_rotary_time_emb during multiplication. For this model architecture, the factor should default to 0.5 to ensure the total rotation dimension matches head_dim.

Suggested change
- partial_rotary_factor = config.rope_parameters.get("partial_rotary_factor", 1.0)
- head_dim = getattr(config, "head_dim", None) or (
-     config.hidden_size // config.num_attention_heads
- )
- dim = int(head_dim * partial_rotary_factor)
+ partial_rotary_factor = config.rope_parameters.get("partial_rotary_factor", 0.5)
+ head_dim = getattr(config, "head_dim", None) or (
+     config.hidden_size // config.num_attention_heads
+ )
+ dim = int(head_dim * partial_rotary_factor)
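
The dimension argument in this comment can be checked with plain arithmetic. The sketch below assumes, as the comment states, that two frequency tensors of width `dim` are concatenated along the last axis; the helper names and the `head_dim` of 64 are illustrative assumptions, not values read from the model config.

```python
def rotary_widths(head_dim, partial_rotary_factor):
    """Return (dim, concatenated_width) for a given head size and factor."""
    dim = int(head_dim * partial_rotary_factor)
    concatenated_width = 2 * dim  # window_freqs and time_freqs joined on the last axis
    return dim, concatenated_width


def fits_head_dim(head_dim, partial_rotary_factor):
    """True if the concatenated frequencies do not exceed the head dimension."""
    _, concatenated_width = rotary_widths(head_dim, partial_rotary_factor)
    return concatenated_width <= head_dim


HEAD_DIM = 64  # hypothetical head size, chosen only for illustration

for factor in (1.0, 0.5):
    print(factor, rotary_widths(HEAD_DIM, factor), fits_head_dim(HEAD_DIM, factor))
```

If the checkpoint config always sets `partial_rotary_factor` explicitly, the default only matters as a fallback, so the practical impact depends on the shipped configs.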

@lashahub lashahub marked this pull request as draft April 6, 2026 02:07
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
@lashahub lashahub marked this pull request as ready for review April 13, 2026 08:36
@lashahub
Contributor Author

Update: changed code to match the recent HF changes for Music Flamingo and Audio Flamingo Next.

  • kept the shared rotary-time embedding buffers in FP32 in vllm/model_executor/models/musicflamingo.py
  • refreshed the AF-Next expected outputs
  • added checks in the processing tests so those buffers stay FP32 after dtype conversion

Reason: vLLM was casting these buffers to BF16, which caused output differences from HF.
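
As a rough illustration of why BF16 buffers can change results, the sketch below emulates bfloat16 rounding in pure Python and applies it to two nearby time values. It is not vLLM or PyTorch code; `round_to_bf16`, `round_to_fp32`, and the sample timestamps are assumptions chosen for the example.

```python
import struct


def round_to_fp32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bf16(x: float) -> float:
    """Round a finite value to the nearest bfloat16 (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of the float32 pattern.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Two hypothetical timestamps (seconds) that differ by 40 ms.
t0, t1 = 30.00, 30.04

print(round_to_fp32(t0), round_to_fp32(t1))  # stay distinct in FP32
print(round_to_bf16(t0), round_to_bf16(t1))  # collapse to the same BF16 value
```

BF16 carries 8 bits of significand precision, so values between 16 and 32 are spaced 0.125 apart, which is coarser than the 40 ms gap used here.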

Comment thread tests/models/multimodal/generation/test_audioflamingonext.py Outdated
Comment thread vllm/model_executor/models/audioflamingonext.py Outdated
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
@lashahub lashahub changed the title from "Update MusicFlamingo and add AudioFlamingoNext" to "Support Audio Flamingo Next checkpoints" May 22, 2026
@mergify
Contributor

mergify Bot commented May 22, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lashahub.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 22, 2026