Support Audio Flamingo Next checkpoints #39011
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
Documentation preview: https://vllm--39011.org.readthedocs.build/en/39011/
Code Review
This pull request introduces support for the AudioFlamingoNext model and refactors the MusicFlamingo implementation to compute audio timestamps dynamically rather than requiring them as input. The changes include new model registration, example scripts, and comprehensive tests. The review feedback identifies two critical issues in the refactored logic. First, the hardcoded stride factor (4 instead of 8) in the window duration and frame step calculations will produce incorrect temporal embeddings. Second, the default partial rotary factor must be set to 0.5 to prevent a runtime crash caused by the concatenated rotation dimensions exceeding the hidden state size.
    window_starts = timestamps[:, 0].to(
        device=self.inv_freq.device, dtype=self.inv_freq.dtype
    )
    window_duration = self.config.audio_frame_step * 4 * seq_len
The hardcoded factor 4 in the window_duration calculation appears to mismatch the actual stride of the audio encoder. MusicFlamingoEncoder inherits from AudioFlamingo3Encoder, which includes an avg_pooler with stride 2 in addition to two convolutional layers with stride 2, resulting in a total stride of 8. If the encoder stride is 8, this factor should be updated to 8 to correctly calculate the duration of a chunk (e.g., 0.01 * 8 * 375 = 30s). Using 4 with a stride-8 encoder will result in an incorrect window_duration of 15s, leading to wrong RoPE timestamps for subsequent chunks.
    -window_duration = self.config.audio_frame_step * 4 * seq_len
    +window_duration = self.config.audio_frame_step * 8 * seq_len
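The arithmetic in this comment can be checked with a small sketch. The 0.01 s frame step and the 375 output frames per chunk are the comment's own example values; `window_duration` as a free function and its `total_stride` parameter are hypothetical stand-ins for the expression in the diff. The sketch only shows how the factor changes the result and does not verify which total stride the encoder actually has.

```python
import math


def window_duration(audio_frame_step: float, total_stride: int, seq_len: int) -> float:
    """Seconds of audio covered by `seq_len` encoder output frames.

    Hypothetical helper: mirrors `audio_frame_step * factor * seq_len` from the diff.
    """
    # Multiply the integers first so only one float multiplication is involved.
    return audio_frame_step * (total_stride * seq_len)


AUDIO_FRAME_STEP = 0.01  # seconds per input frame (example value from the comment)
SEQ_LEN = 375            # output frames per chunk (example value from the comment)

duration_stride_8 = window_duration(AUDIO_FRAME_STEP, 8, SEQ_LEN)
duration_stride_4 = window_duration(AUDIO_FRAME_STEP, 4, SEQ_LEN)

print(f"factor 8: {duration_stride_8:.2f} s per chunk")
print(f"factor 4: {duration_stride_4:.2f} s per chunk")
```

If the encoder really downsamples by 8, only the first value matches a 30 s chunk; with the factor 4 every chunk after the first would start 15 s too early.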
    mm_data=mm_data,
    mm_kwargs=mm_kwargs,
    tok_kwargs=tok_kwargs,
    audio_embed_frame_step = audio_frame_step * 4
Similar to the window_duration calculation, the audio_embed_frame_step factor should match the total stride of the audio encoder. If the encoder uses the avg_pooler from AudioFlamingo3Encoder, the total stride is 8. Using a factor of 4 here will cause a mismatch in the absolute timestamps generated for audio chunks, which will negatively impact the RoPE embeddings and model performance.
    -audio_embed_frame_step = audio_frame_step * 4
    +audio_embed_frame_step = audio_frame_step * 8
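To illustrate why the two factors have to agree, here is a minimal sketch of per-frame absolute timestamps across chunk boundaries. It assumes each chunk starts at `chunk_index * window_duration` and that frames inside a chunk are spaced by the embed frame step; `chunk_timestamps` and `boundary_gap` are hypothetical helpers, not functions from the PR, and the numeric values reuse the example from the earlier comment.

```python
import math


def chunk_timestamps(chunk_index, seq_len, audio_frame_step, embed_factor, window_factor):
    """Absolute timestamps (seconds) for every output frame of one chunk.

    `embed_factor` scales the spacing between frames; `window_factor` scales
    the chunk length. Both should equal the encoder's total stride.
    """
    embed_step = audio_frame_step * embed_factor
    window = audio_frame_step * (window_factor * seq_len)
    start = chunk_index * window
    return [start + i * embed_step for i in range(seq_len)]


def boundary_gap(seq_len, audio_frame_step, embed_factor, window_factor):
    """Time between the last frame of chunk 0 and the first frame of chunk 1."""
    first = chunk_timestamps(0, seq_len, audio_frame_step, embed_factor, window_factor)
    second = chunk_timestamps(1, seq_len, audio_frame_step, embed_factor, window_factor)
    return second[0] - first[-1]


# Matching factors: the gap at the boundary equals one embed step.
consistent_gap = boundary_gap(375, 0.01, embed_factor=8, window_factor=8)
# Mismatched factors: the timeline jumps at every chunk boundary.
mismatched_gap = boundary_gap(375, 0.01, embed_factor=4, window_factor=8)

print(f"consistent gap: {consistent_gap:.2f} s")
print(f"mismatched gap: {mismatched_gap:.2f} s")
```

With matching factors the timeline is continuous across chunks; with mismatched factors each boundary introduces a jump of roughly half a window.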
    partial_rotary_factor = config.rope_parameters.get("partial_rotary_factor", 1.0)
    head_dim = getattr(config, "head_dim", None) or (
        config.hidden_size // config.num_attention_heads
    )
    dim = int(head_dim * partial_rotary_factor)
The partial_rotary_factor currently defaults to 1.0 if not specified in the config. However, the RoPE implementation in MusicFlamingoRotaryEmbedding uses torch.cat((window_freqs, time_freqs), dim=-1), which results in a total rotation dimension of 2 * dim. If partial_rotary_factor is 1.0, then dim = head_dim, and the concatenated frequency tensor will have size 2 * head_dim. This exceeds the feature dimension of the hidden states (which is head_dim) and will cause a runtime crash in apply_rotary_time_emb during multiplication. For this model architecture, the factor should default to 0.5 to ensure the total rotation dimension matches head_dim.
    -partial_rotary_factor = config.rope_parameters.get("partial_rotary_factor", 1.0)
    +partial_rotary_factor = config.rope_parameters.get("partial_rotary_factor", 0.5)
     head_dim = getattr(config, "head_dim", None) or (
         config.hidden_size // config.num_attention_heads
     )
     dim = int(head_dim * partial_rotary_factor)
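The dimension argument in this comment can be sketched without any tensors. The snippet assumes, as the comment states, that the window and time frequencies each have `dim` entries and are concatenated along the last axis; `rotary_dims`, `fits_in_head`, and the head size of 128 are illustrative assumptions, not values taken from the model config.

```python
def rotary_dims(head_dim: int, partial_rotary_factor: float) -> tuple[int, int]:
    """Return (dim, concatenated_dim) for the rotary embedding.

    `dim` follows the expression in the diff; the concatenation of two
    frequency tensors of size `dim` yields `2 * dim` features.
    """
    dim = int(head_dim * partial_rotary_factor)
    return dim, 2 * dim


def fits_in_head(head_dim: int, partial_rotary_factor: float) -> bool:
    """True when the concatenated rotation does not exceed the head size."""
    _, concatenated = rotary_dims(head_dim, partial_rotary_factor)
    return concatenated <= head_dim


HEAD_DIM = 128  # hypothetical head size, chosen only for illustration

for factor in (1.0, 0.5):
    dim, concatenated = rotary_dims(HEAD_DIM, factor)
    status = "fits" if fits_in_head(HEAD_DIM, factor) else "exceeds head_dim"
    print(f"factor={factor}: dim={dim}, concatenated={concatenated} ({status})")
```

Under these assumptions a default of 1.0 doubles the rotation past the head size, while 0.5 makes the concatenated rotation cover the head exactly.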
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
Update: changed code to match the recent HF changes for Music Flamingo and Audio Flamingo Next.
Reason: vLLM was casting these buffers to BF16, which caused output differences from HF.
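As a rough illustration of why a BF16 cast can change outputs, the sketch below emulates float32 to bfloat16 rounding in plain Python. The excerpt does not say which buffers were affected, so the sample values (a frame step, a timestamp in seconds) are assumptions, and `to_bf16` is a hypothetical helper that handles ordinary finite values only.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a value to bfloat16 precision (round to nearest, ties to even).

    Emulation only: the value is first stored as float32, then the low
    16 bits are rounded away. NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a bias so that truncating the low 16 bits rounds to nearest even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]


samples = [0.01, 1.0, 30.08]  # assumed example values, not taken from the PR
for value in samples:
    approx = to_bf16(value)
    print(f"{value!r} -> {approx!r} (abs error {abs(approx - value):.6f})")
```

bfloat16 keeps only 8 significant bits, so values that are exact enough in float32 can shift noticeably once cast, which is consistent with small output differences from a float32 reference.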
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
This pull request has merge conflicts that must be resolved before it can be
Summary
This PR updates MusicFlamingo (MF) and adds AudioFlamingoNext (AF-Next) support to vLLM.
AF-Next is implemented on top of the corrected MF implementation, since the two models are architecturally very close in the current Hugging Face reference.
Why this PR is needed
AF-Next is being added to Hugging Face Transformers here:
huggingface/transformers#44830
The Hugging Face MusicFlamingo integration is here:
huggingface/transformers#43538
MF was already merged into vLLM earlier, but the HF implementation changed after that PR. As a result, some follow-up MF updates were necessary here so the vLLM path stays closely aligned with the current Transformers behavior.
Since AF-Next is close to MF, the right integration in vLLM is to reuse the updated MF path instead of introducing a separate divergent implementation.
What this PR changes
- Adds AudioFlamingoNextForConditionalGeneration support to the vLLM registry
- nvidia/audio-flamingo-next-hf
- docs/models/supported_models.md
- examples/offline_inference/audio_language.py
Test plan
Test results
- tests/models/multimodal/processing/test_audioflamingonext.py: passed
- tests/models/multimodal/generation/test_audioflamingonext.py: passed
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.