Fix AudioFlamingo3/MusicFlamingo HF parity and RoTE handling#37643
Conversation
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
|
Documentation preview: https://vllm--37643.org.readthedocs.build/en/37643/ |
There was a problem hiding this comment.
Code Review
This pull request introduces comprehensive support for the MusicFlamingo model, including its integration into the vLLM framework, updated documentation, and dedicated test suites. The changes involve adding MusicFlamingoForConditionalGeneration to the list of supported models, implementing its specific multimodal processing logic (including rotary embeddings and audio token handling), and creating new test fixtures and generation tests to validate its functionality. Additionally, the AudioFlamingo3 model's processing and testing were refined, with updates to expected outputs, dummy data generation, and the introduction of helper functions for output assertion, ensuring consistency and correctness across both audio models. The min_transformers_version for MusicFlamingo was also updated to reflect its requirements.
DarkLight1337
left a comment
There was a problem hiding this comment.
Thanks, left some initial comments
|
Thanks, I don't have any more concerns. Can you fix pre-commit though? |
I reran |
|
Sorry it still fails |
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
|
DCO passes now. |
|
Sorry for the confusion, let's see if it passes after I add ready label |
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
…oject#37643) Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
…oject#37643) Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
…oject#37643) Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
…oject#37643) Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com> Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
…oject#37643) Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
…oject#37643) Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
|
Hi there, just wondering: why does this model require minimum transformers 5.3.0? https://github.com/vllm-project/vllm/pull/37643/changes#diff-c2cd72327248d1c1aa3d4b29ec9e47314d9893bfeff94e927841cd640fac84c1R755 |
|
Thanks! There's a function call in a test that is missing |
…oject#37643) Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
…oject#37643) Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
…oject#37643) Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
Summary
This PR fixes and completes the AudioFlamingo3 (AF3) / MusicFlamingo (MF) implementation in vLLM so it aligns closely with the current Hugging Face reference behavior; AF3 was originally introduced in vLLM in #30539.
Revisits #35522.
Why this PR is needed
The current MF support in vLLM came from two earlier PRs:
ValueError: Following weights were not initialized from checkpoint: {'audio_tower.pos_emb.freqs'}#35522Those changes helped partially, but they did not fully implement MusicFlamingo correctly.
The core problem is that MusicFlamingo was still effectively routed through the AF3 path. That was enough to make the model appear supported, and later to make it load more cleanly, but not enough to match the actual HF processor/model semantics.
What was incomplete before
#32696
#32696 introduced a thin wrapper-style MusicFlamingo adapter, but it did not implement the MF-specific behavior that exists in the HF reference, including:
rote_timestampsrope_parameters/head_dimhandlingIt also added a dummy
audio_tower.pos_emb.freqscompatibility parameter to make MF checkpoints load, which was a workaround rather than a faithful implementation.#35522 / #35535
#35522 reported the loading failure:
That issue was real, but it was a symptom of the earlier wrapper design.
#35535 removed that immediate loading failure, which was useful, but it still did not implement the actual MF semantics. It improved loading, not full correctness.
What this PR changes
AudioFlamingo3
MusicFlamingo
rote_timestampscorrectlyrope_parametersandhead_dimValidation
This PR adds and runs the missing regression coverage:
nvidia/audio-flamingo-3-hfnvidia/music-flamingo-2601-hfUpstream alignment
MusicFlamingo is currently in the final stages of being merged into Transformers in
huggingface/transformers#43538:huggingface/transformers#43538
Small follow-up refinements may still be needed if upstream lands minor changes before final merge, but this PR is already very close to the current HF reference and is intended to be the correct baseline for AF3/MF support in vLLM.
Takeaway
#32696 and #35535 addressed parts of the problem, but they did not fully implement MusicFlamingo correctly.
This PR replaces that partial state with:
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.