
Multimodal serve support #45220

Open
SunMarc wants to merge 9 commits intomainfrom
audio-video-serve

Conversation

@SunMarc
Member

@SunMarc SunMarc commented Apr 3, 2026

What does this PR do?

This PR adds `transformers serve` compatibility for multimodal models such as Qwen Omni and Gemma 4. We add audio support to both the chat completions and responses APIs through `input_audio`: the client needs to base64-encode the audio and send it as an `input_audio` content part.

For video, the OpenAI API doesn't natively support `video_url` as a content type, so we extended the schema to allow it. For simplicity, we also allow passing a URL for audio through `audio_url`.
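The content-part shapes described above can be sketched as small helpers. This is a minimal illustration of the payload layout only (the helper names are mine, not part of the PR; the server-side handling may differ):

```python
import base64


def audio_part_from_bytes(audio_bytes: bytes, fmt: str) -> dict:
    # OpenAI-style input_audio part: the client base64-encodes the raw audio.
    return {
        "type": "input_audio",
        "input_audio": {"data": base64.b64encode(audio_bytes).decode(), "format": fmt},
    }


def audio_part_from_url(url: str) -> dict:
    # Convenience extension from this PR: pass a URL instead of raw bytes.
    return {"type": "audio_url", "audio_url": {"url": url}}


def video_part_from_url(url: str) -> dict:
    # video_url is not in the OpenAI spec; this PR extends the schema with it.
    return {"type": "video_url", "video_url": {"url": url}}


message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Transcribe this audio."},
        audio_part_from_bytes(b"\x00\x01", "mp3"),
    ],
}
```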

Results (tested with google/gemma-4-E2B-it and Qwen/Qwen2.5-Omni-3B)

import base64
import socket
import time

import httpx
from openai import OpenAI

from transformers.cli.serve import Serve

AUDIO_URL = "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama_first_45_secs.mp3"
VIDEO_URL = "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/concert.mp4"
MODEL = "google/gemma-4-E2B-it"

# Qwen Omni
# MODEL = "Qwen/Qwen2.5-Omni-3B"


def find_free_port():
    with socket.socket() as s:
        s.bind(("", 0))
        return s.getsockname()[1]


def start_serve():
    port = find_free_port()
    serve = Serve(port=port, non_blocking=True)
    for _ in range(30):
        try:
            if httpx.get(f"http://localhost:{port}/health", timeout=2).status_code == 200:
                return serve, port
        except Exception:
            pass
        time.sleep(1)
    raise RuntimeError("Server did not start in time")


serve, port = start_serve()
client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="unused")

audio_bytes = httpx.get(AUDIO_URL, follow_redirects=True).content
audio_b64 = base64.b64encode(audio_bytes).decode()

print("=== Audio via responses API ===")
resp = client.responses.create(
    model=MODEL,
    input=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this audio."},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "mp3"}},
            ],
        }
    ],
    stream=False,
    max_output_tokens=200,
)
print(resp.output[0].content[0].text)
print()

# --- Video with audio (responses API) ---
print("=== Video via responses API ===")
resp = client.responses.create(
    model=MODEL,
    input=[
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": VIDEO_URL}},
                {"type": "text", "text": "Transcribe the lyrics of the song being played in this video."},
            ],
        }
    ],
    stream=False,
    max_output_tokens=500,
)
print(resp.output[0].content[0].text)
print()

serve.kill_server()
print("Done!")
Audio via responses API
This week, I traveled to Chicago to deliver my final farewell address to the nation, following in the tradition of
presidents before me. It was an opportunity to say thank you. Whether we've seen eye-to-eye or rarely agreed at
all, my conversations with you, the American people, in living rooms and schools, at farms and on factory floors,
at diners, and on distant military outposts, all these conversations are what have kept me honest.

Video via responses API
(Song lyrics)

I don't care how straight
From neck to chest
We're in the same predicament
Another one wantin' is in the storm alone
I'm the one down below this
You don't wanna be my
I never thought you'd say
Of this nice sad place you've been
I don't want it my face
But I don't wanna die

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc SunMarc requested a review from LysandreJik April 3, 2026 14:57
return_tensors=None if use_cb else "pt",
return_dict=True,
tokenize=True,
load_audio_from_video=modality == Modality.MULTIMODAL and has_video,
Member Author

@SunMarc SunMarc Apr 3, 2026


I managed to use it, but torchcodec is required for that; otherwise it falls back to the other lib and fails. Also, torchcodec + ffmpeg was a bit of a pain to install correctly.
Maybe we should also force the user to install torchcodec when using this, no? cc @eustlb

Contributor


Indeed, load_audio only works with torchcodec, since video containers are not supported by librosa. Agreed that we need to raise a clear error; let me open a PR for that.


@SunMarc SunMarc requested a review from eustlb April 3, 2026 15:13
Contributor

@eustlb eustlb left a comment


@SunMarc what would you think of having an ALM modality, and differentiating:
ALM: audio + text
VLM: vision + text
MULTIMODAL: audio + vision + text

Comment on lines +122 to +125
if load_audio_from_video and not is_torchcodec_available():
raise ValueError(
"Extracting audio from video requires `torchcodec`. Install it with: `pip install torchcodec`."
)
Contributor


I guess this can be removed; it's better to locate the error in load_audio.
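Moving the check into the loader could look roughly like this. A minimal sketch only, with an illustrative signature (not the actual transformers `load_audio` API):

```python
def load_audio(source: str, *, from_video: bool = False):
    # Illustrative: raise the actionable error at the point where torchcodec
    # is actually needed, instead of making every caller pre-check it.
    if from_video:
        try:
            import torchcodec  # noqa: F401
        except ImportError as exc:
            raise ImportError(
                "Extracting audio from video requires `torchcodec`. "
                "Install it with: `pip install torchcodec`."
            ) from exc
    # ... decode the audio here ...
    return None
```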

Comment on lines +914 to +916
All modalities extract text. VLM additionally handles ``image_url`` and ``video_url``.
MULTIMODAL handles all of the above plus ``input_audio`` and ``audio_url``.
For LLMs, the content parts are collapsed into a plain text string.
Contributor


My issue with this is that ALMs are treated as a sub-category of omni models. That is the case for Gemma 4, but for other models we can use the ALM and VLM capabilities separately as well as together. This makes even more sense knowing that audio + vision is an emergent capability: the model has not been trained on both jointly.

@SunMarc
Member Author

SunMarc commented Apr 3, 2026

@SunMarc what would you think of having an ALM modality, and to differentiate:
ALM audio + text
VLM vision + text
MULTIMODEL audio + vision + text

Maybe we should have a MODEL_FOR_AUDIO_TEXT_MAPPING_NAMES mapping? I created MULTIMODAL since we already have MODEL_FOR_MULTIMODAL_LM_MAPPING_NAMES. There is also MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES, but that is more for encoder-decoder style models, so we can't really use it, no? cc @eustlb
