Conversation

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
return_tensors=None if use_cb else "pt",
return_dict=True,
tokenize=True,
load_audio_from_video=modality == Modality.MULTIMODAL and has_video,
```
I managed to use it, but torchcodec is required for that. Otherwise, it falls back to the other lib and fails. Also, torchcodec + ffmpeg was a bit of a pain to install correctly.
Maybe we should force the user to install torchcodec when using this, no? cc @eustlb
Indeed, `load_audio` only works with torchcodec, as video containers are not supported by librosa. Agreed that we need to raise a clear error; let me open a PR for that.
```python
if load_audio_from_video and not is_torchcodec_available():
    raise ValueError(
        "Extracting audio from video requires `torchcodec`. Install it with: `pip install torchcodec`."
    )
```
I guess this can be removed; better to locate the error in `load_audio`.
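A minimal sketch of what moving the guard into `load_audio` itself could look like, so the error surfaces wherever audio-from-video decoding is attempted (hypothetical signature; the actual helper in transformers may differ):

```python
import importlib.util


def load_audio(path, load_from_video=False):
    """Hypothetical sketch: check torchcodec availability inside the
    loader itself, rather than at every call site."""
    if load_from_video and importlib.util.find_spec("torchcodec") is None:
        raise ImportError(
            "Extracting audio from video requires `torchcodec`. "
            "Install it with: `pip install torchcodec`."
        )
    # ... actual decoding would follow here (torchcodec for video
    # containers, librosa for plain audio files) ...
    ...
```

This keeps call sites clean and guarantees the same clear error message regardless of which code path triggers the extraction.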
```
All modalities extract text. VLM additionally handles ``image_url`` and ``video_url``.
MULTIMODAL handles all of the above plus ``input_audio`` and ``audio_url``.
For LLMs, the content parts are collapsed into a plain text string.
```
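As a rough illustration of the LLM path described above, collapsing structured content parts into a plain text string might look like this (hypothetical helper name; not the actual implementation):

```python
def collapse_to_text(content):
    """Hypothetical sketch: for LLMs, keep only the text parts of an
    OpenAI-style content list and join them into one plain string."""
    if isinstance(content, str):
        # Content may already be a plain string.
        return content
    return " ".join(
        part["text"] for part in content if part.get("type") == "text"
    )
```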
My issue with this is that ALMs are seen as a sub-category of omni models. That is the case for gemma4, but for other models we can use the ALM and VLM capabilities separately, and together. This makes even more sense knowing that audio + vision is an emergent capability: the model has not been trained on both.
Maybe we should have a
What does this PR do?
This PR adds `transformers serve` compatibility to multimodal models like Qwen Omni or gemma 4. We add support for audio in chat completion and responses through `input_audio`: the client needs to base64-encode the audio and send it as `input_audio`. For video, the OpenAI API doesn't natively support `video_url` as a content type, so we extended it so that we can still play with it. For simplicity, we also allow passing a URL for audio through `audio_url`.

Results (tested with `google/gemma-4-E2B-it` and `Qwen/Qwen2.5-Omni-3B`)
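For reference, a client-side sketch of the `input_audio` flow described above (hypothetical helper; field names follow the OpenAI-style content schema):

```python
import base64


def build_audio_message(wav_bytes, prompt):
    """Hypothetical sketch: base64-encode raw WAV bytes and wrap them in an
    OpenAI-style chat message with an `input_audio` content part."""
    encoded = base64.b64encode(wav_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "input_audio",
                "input_audio": {"data": encoded, "format": "wav"},
            },
        ],
    }
```

The resulting dict can be passed as a message in a chat-completion request to the server; for `audio_url` or `video_url`, the part would instead carry a URL rather than base64 data.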