feat(vllm-omni): add multimodal benchmark suite and CI workflow by Yadan-Wei · Pull Request #6022 · aws/deep-learning-containers

Yadan-Wei · 2026-04-30T18:23:16Z

Purpose

Adds a multimodal benchmark suite for the vLLM-Omni DLC image, with per-modality async clients and a dispatch workflow that mirrors the
existing dispatch-vllm-benchmark.yml and leaves it untouched.

Problem: vLLM-Omni exposes four distinct modality endpoints (TTS, image, video, chat), and upstream vllm bench serve --omni only targets
/v1/chat/completions. We previously had no CI coverage for the other three, and no single reporter shape that handles all four.

Solution: four lightweight aiohttp-based clients that speak OpenAI-compatible routes directly on the container's port 8080, a dispatcher
wrapper that picks the right client by benchmark_type, and a workflow that runs them from the benchmark: section of
vllm-omni-model-tests.yml.
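
The dispatch step can be pictured as a lookup from benchmark_type to client script. The following is a minimal sketch, assuming the five type names listed in the test plan and that both TTS variants share tts_benchmark_client.py; the names `CLIENTS` and `client_for` are hypothetical and the PR's actual wrapper may be structured differently.

```python
# Hypothetical mapping from benchmark_type to client script.
# The script names come from the Clients table; the dict itself is illustrative.
CLIENTS = {
    "tts": "tts_benchmark_client.py",
    "tts-base": "tts_benchmark_client.py",  # same client, voice-cloning mode
    "image": "image_benchmark_client.py",
    "video": "video_benchmark_client.py",
    "chat": "chat_omni_benchmark_client.py",
}


def client_for(benchmark_type: str) -> str:
    """Return the client script for a benchmark_type, failing loudly on typos."""
    try:
        return CLIENTS[benchmark_type]
    except KeyError:
        raise ValueError(f"unknown benchmark_type: {benchmark_type!r}") from None
```

Failing on an unknown type, rather than falling back to a default client, keeps a misspelled benchmark_type in the YAML from silently benchmarking the wrong endpoint.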

Clients

| Script | Endpoint | Metrics |
| --- | --- | --- |
| `tts_benchmark_client.py` | `POST /v1/audio/speech` | TTFB, E2E, audio duration, RTF, req/s; supports CustomVoice and Base (voice cloning, downloads ref audio + transcript from S3) |
| `image_benchmark_client.py` | `POST /v1/images/generations` | E2E, images/s |
| `video_benchmark_client.py` | `POST /v1/videos` → poll | submit latency, server inference time, E2E, videos/s |
| `chat_omni_benchmark_client.py` | `POST /v1/chat/completions` (SSE) | TTFT, TPOT, ITL, E2E, req/s, output tokens/s |

The chat client uses streaming SSE (stream=true) because the most important user-perceived metric for chat is TTFT (Time To First Token),
which is only measurable when the server streams tokens incrementally. Metrics match what vllm bench serve --omni --backend openai-chat-omni reports so numbers are directly comparable.
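
To make the streaming metrics concrete, here is a minimal sketch of how TTFT, TPOT, and ITL can be derived from chunk arrival timestamps. It assumes one token per SSE chunk and the commonly used definitions (TPOT as decode time divided by the tokens after the first); `stream_metrics` and its field names are hypothetical and not taken from chat_omni_benchmark_client.py.

```python
def stream_metrics(start: float, arrivals: list[float]) -> dict[str, float]:
    """Latency metrics in ms from a request start time and chunk arrival times (seconds)."""
    if not arrivals:
        raise ValueError("stream produced no tokens")
    ttft = arrivals[0] - start
    e2e = arrivals[-1] - start
    # Inter-token latencies: gaps between consecutive arrivals.
    itl = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    # TPOT averages the decode phase over the tokens that follow the first one.
    tpot = (e2e - ttft) / (len(arrivals) - 1) if len(arrivals) > 1 else 0.0
    mean_itl = sum(itl) / len(itl) if itl else 0.0
    return {
        "ttft_ms": ttft * 1000,
        "tpot_ms": tpot * 1000,
        "mean_itl_ms": mean_itl * 1000,
        "e2e_ms": e2e * 1000,
    }
```

Without streaming, only the last timestamp is observable, so `ttft_ms` and the inter-token gaps cannot be computed and only `e2e_ms` remains.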

Why not use `vllm bench serve`

- vllm bench serve --omni only supports /v1/chat/completions; it can't benchmark TTS speech, image, or video endpoints
- It requires vllm-omni installed on the CI runner, which drags in torch/cuda deps we don't otherwise need
- Keeping all four clients in the same shape lets one dispatcher + reporter handle every modality

Benchmark matrix (5 models in `vllm-omni-model-tests.yml`)

| Model | Fleet | Type | Notes |
| --- | --- | --- | --- |
| qwen3-tts-1.7b-customvoice | `x86-g6xl-runner` | `tts` | preset voice |
| qwen3-tts-12hz-1.7b-base | `x86-g6xl-runner` | `tts-base` | voice cloning; reference WAV + transcript (`.wav` / `.txt`) at `s3://dlc-cicd-models/test-fixtures/audio/tts_ref_vivian.*` |
| flux2-klein-4b | `x86-g6xl-runner` | `image` | 512×512 |
| wan2.1-t2v-1.3b | `x86-g6exl-runner` | `video` | 480×320, 17 frames, 4 steps |
| qwen2.5-omni-3b | `x86-g6e12xl-runner` | `chat` | streaming SSE, requires 4 GPUs for the thinker/talker/code2wav pipeline |

Thresholds (min_rps, min_output_tps, max_p95_ttft_ms, etc.) are declared per model in benchmark_config and validated by the
dispatcher.
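
A minimal sketch of how such thresholds could be validated, assuming a naming convention in which a `min_` prefix is a lower bound and a `max_` prefix is an upper bound on the metric of the same name. The result keys (`rps`, `p95_e2e_ms`) and the function `check_thresholds` are assumptions for illustration, not the dispatcher's actual code.

```python
def check_thresholds(result: dict, thresholds: dict) -> list[str]:
    """Return a list of human-readable failures; an empty list means all thresholds pass."""
    failures = []
    for key, limit in thresholds.items():
        # Split "min_rps" into ("min", "rps"), "max_p95_e2e_ms" into ("max", "p95_e2e_ms").
        bound, _, metric = key.partition("_")
        if bound not in ("min", "max"):
            raise ValueError(f"threshold {key!r} must start with min_ or max_")
        value = result.get(metric)
        if value is None:
            failures.append(f"{metric} missing from result")
        elif bound == "min" and value < limit:
            failures.append(f"{metric}={value} below min {limit}")
        elif bound == "max" and value > limit:
            failures.append(f"{metric}={value} above max {limit}")
    return failures
```

Treating a missing metric as a failure, rather than skipping it, prevents a renamed field in a client's JSON summary from turning a threshold into a no-op.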

Note: CI runs on the fleet-appropriate hardware per model. The devbox sweeps below were collected on g6e.xlarge (L40S) as a reference
ceiling; CI numbers on smaller fleets (e.g. x86-g6xl-runner / L4) are expected to be lower.

Test Plan

- Benchmark clients validated end-to-end on a devbox (g6e.xlarge, L40S 46 GB) against vllm:omni-cuda-v1:
  - qwen3-tts-1.7b-customvoice: concurrency sweep c={1,4,8,16}, peak 4.63 req/s / 18× real-time audio
  - qwen3-tts-12hz-1.7b-base: concurrency sweep c={1,4,8,16}, peak 2.07 req/s / 6.83× real-time audio; saturates at c=8
  - flux2-klein-4b: concurrency sweep c={1,2,4}, flat 0.35 images/s (GPU-bound)
  - wan2.1-t2v-1.3b: concurrency sweep c={1,2}, flat 0.66 videos/s (GPU-bound)
- Reporter benchmark_report.py tested locally against real JSON artifacts; produces a valid markdown table for $GITHUB_STEP_SUMMARY
- Dispatcher smoke-tested with all five benchmark_type cases (tts, tts-base, image, video, chat)
- S3 fixtures s3://dlc-cicd-models/test-fixtures/audio/tts_ref_vivian.{wav,txt} uploaded and readable by the tts-base path
- pre-commit run passes on all changed files (ruff, mdformat, shfmt, actionlint, check-github-workflows, detect-aws-credentials, gitleaks, etc.)
- CI dispatch on all five models completed; all PASS. Thresholds in the YAML are calibrated against the numbers below.

Test Result

CI run (GitHub Actions, per-fleet CodeBuild runners)

| Model / Fleet | Status | Req/s | Throughput | p50 E2E (ms) | p95 E2E (ms) | p99 E2E (ms) | TTFT p95 (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `flux2-klein-4b` / `x86-g6xl-runner` | PASS | 0.098 | 0.098 img/s | 10183.45 | 10287.91 | 10287.91 | n/a |
| `qwen2.5-omni-3b` / `x86-g6e12xl-runner` | PASS | 0.028 | 158.36 tok/s | 70322.24 | 91547.04 | 98614.59 | 47.25 |
| `qwen3-tts-1.7b-customvoice` / `x86-g6xl-runner` | PASS | 1.678 | 6.029× RT | 2125.67 | 2752.81 | 2798.53 | n/a |
| `qwen3-tts-12hz-1.7b-base` / `x86-g6exl-runner` | PASS | 1.183 | 4.595× RT | 3193.46 | 3980.63 | 4280.94 | n/a |
| `wan2.1-t2v-1.3b` / `x86-g6exl-runner` | PASS | 0.664 | 0.664 vid/s | 1506.80 | 1507.50 | 1507.50 | n/a |

Devbox concurrency sweeps (`g6e.xlarge`, L40S 46 GB, `vllm:omni-cuda-v1`)

qwen3-tts-1.7b-customvoice c=1: 1.14 req/s, 4.08x RT, p95 1123 ms
qwen3-tts-1.7b-customvoice c=4: 2.10 req/s, 8.17x RT, p95 2387 ms
qwen3-tts-1.7b-customvoice c=8: 3.47 req/s, 13.00x RT, p95 3637 ms
qwen3-tts-1.7b-customvoice c=16: 4.63 req/s, 18.00x RT, p95 4237 ms

qwen3-tts-12hz-1.7b-base c=1: 1.03 req/s, 3.06x RT, p95 1201 ms
qwen3-tts-12hz-1.7b-base c=4: 1.86 req/s, 6.00x RT, p95 2532 ms
qwen3-tts-12hz-1.7b-base c=8: 2.07 req/s, 6.83x RT, p95 4297 ms (RTF ~1.02, saturated)

flux2-klein-4b c=1: 0.342 images/s, p95 2944 ms
flux2-klein-4b c=2: 0.347 images/s, p95 5863 ms (pure queue)

wan2.1-t2v-1.3b c=1: 0.664 videos/s, p95 1508 ms (server inference 1.38 s)
wan2.1-t2v-1.3b c=2: 0.662 videos/s, p95 3044 ms (pure queue)

Thresholds in the YAML are calibrated from the CI run above with ~20–25% margin for noise.
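
As a worked check of that margin, the tts-base figures quoted in the commit log (measured 1.272 RPS and 3666 ms p95, thresholds min_rps 1.0 and max_p95_e2e_ms 4500) can be plugged into a small helper. `margin_pct` is a hypothetical function written for this illustration; it expresses headroom as a percentage of the measured value.

```python
def margin_pct(measured: float, threshold: float) -> float:
    """Headroom between a measured value and its threshold, as a percentage of the measured value."""
    return abs(threshold - measured) / measured * 100


# tts-base figures from the commit log: measured value, then chosen threshold.
print(round(margin_pct(1.272, 1.0), 1))  # min_rps headroom -> 21.4
print(round(margin_pct(3666, 4500), 1))  # max_p95_e2e_ms headroom -> 22.7
```

Both values fall inside the stated 20 to 25% band.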

PR Checklist

- I ran pre-commit run --all-files locally before creating this PR.
- Full CI benchmark dispatch run: all 5 models PASS.

Adds lightweight async benchmark clients for vLLM-Omni's modality endpoints and a dispatch workflow that runs them in CI:
- TTS (/v1/audio/speech): CustomVoice + Base (voice cloning)
- Image (/v1/images/generations)
- Video (/v1/videos with async submit+poll)
- Chat (/v1/chat/completions, SSE streaming; reports TTFT / TPOT / ITL)

Each client emits a uniform JSON summary; the dispatcher script validates configurable thresholds and benchmark_report.py aggregates artifacts into a markdown table for $GITHUB_STEP_SUMMARY.

vllm-omni-model-tests.yml gains a benchmark: section alongside smoke-test:, driving a new dispatch-vllm-omni-benchmark.yml workflow. The existing dispatch-vllm-benchmark.yml (LLM) is untouched.

The chat client uses SSE streaming so that TTFT is measurable; without SSE only E2E is visible and scheduler/batching regressions are hidden.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

workflow_dispatch is only available for the default branch, so it cannot be used to test this workflow on the PR branch. Add a scoped pull_request trigger so the benchmark runs on PRs that touch the workflow, dispatcher, or benchmark clients. The pull_request block is marked TEMPORARY with a clear comment — remove it once the workflow has run successfully against a PR. IMAGE_URI falls back to the vllm:omni-cuda-v1 DLC image used in the baseline devbox runs when no workflow_dispatch input is provided. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

…el name

- Relax TTS/image benchmark thresholds to match g6.xlarge (L4 24GB) performance instead of g6e.xlarge (L40S) baselines
- Pass model name to chat benchmark client to fix 'model default does not exist' 404 error for qwen2.5-omni-3b

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

…e thresholds

- Use iter_any() chunk reading instead of line iteration to avoid aiohttp's 128KB line-size limit (omni models emit large audio frames)
- Further relax qwen3-tts-12hz-1.7b-base thresholds; performance on L4 is highly variable (0.075-0.244 RPS across runs)

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

Model produces corrupt output on L4 — Code2Wav receives malformed input_ids (length 1, not divisible by num_quantizers 16) due to ONNX/CUDA graph issues on Ada Lovelace. Restore original L40S thresholds. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

Omni model has fast token generation (157 tps) but slow 3-stage audio synthesis (thinker→talker→code2wav), making E2E ~92s per request. Adjust min_rps and max_p95_e2e_ms to reflect actual omni-chat behavior. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

Model intermittently produces malformed input_ids (length 1) on both L4 and L40S — this is a model/vllm-omni bug, not hardware-specific. Relax thresholds to tolerate the failure rate while still catching complete regressions. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

…script

The tts-base voice-cloning model requires ref_text to be the exact transcript of the reference audio. The wav says 'The quick brown fox jumps over the lazy dog near the riverbank at sunset.', not 'Hello, how are you?'; a mismatched transcript causes Code2Wav to produce malformed output.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

Now that ref_text matches the actual audio transcript, restore the original L40S thresholds (min_rps=1.3, min_audio_rtf_mult=4.5, max_p95_e2e_ms=3000). Added comments about the ref_text requirement and link to vllm-omni#3124. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

…coding

Download the .txt file alongside the .wav from S3 (same path, .wav→.txt) to get the exact transcript. Falls back to config ref_text if the .txt is not found. Prevents future inconsistency between audio and transcript.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
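
The transcript lookup described here can be sketched as two small functions. This is an illustration under stated assumptions: `fetch_text` is a hypothetical injected callable that returns the object's text or `None` when the key is missing (in CI it would wrap an S3 download), and the whitespace stripping reflects the later "strip trailing newline" commit.

```python
from typing import Callable, Optional


def transcript_key(wav_key: str) -> str:
    """Derive the transcript key from the audio key: same path, .wav -> .txt."""
    if not wav_key.endswith(".wav"):
        raise ValueError(f"expected a .wav key, got {wav_key!r}")
    return wav_key[: -len(".wav")] + ".txt"


def resolve_ref_text(
    wav_key: str,
    fetch_text: Callable[[str], Optional[str]],
    config_ref_text: Optional[str] = None,
) -> str:
    """Prefer the transcript stored next to the audio; fall back to the config value."""
    text = fetch_text(transcript_key(wav_key))
    if text is not None and text.strip():
        # Strip surrounding whitespace, including a trailing newline from the file.
        return text.strip()
    if config_ref_text:
        return config_ref_text
    raise ValueError(f"no transcript found for {wav_key!r} and no config fallback")
```

Passing the fetcher in as a parameter keeps the fallback logic testable without network access; a plain `dict.get` over an in-memory mapping is enough to exercise both paths.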

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

aws-deep-learning-containers-ci Bot added the authorized label Apr 30, 2026

Yadan Wei added 20 commits April 30, 2026 11:31

disable pr workflow

aac86cf

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

fix(benchmark): report threshold failures correctly in summary

f470404

Validation script now writes threshold_passed into the result JSON so the report step can distinguish threshold failures from request failures. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
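
A minimal sketch of how a report step could use that field to label rows. Only `threshold_passed` is named in the commit message; the `failed_requests` key, the status strings, and `row_status` itself are assumptions made for this illustration.

```python
def row_status(result: dict) -> str:
    """Classify one benchmark result for the summary table."""
    # Request failures take precedence: thresholds are meaningless if requests errored.
    if result.get("failed_requests", 0) > 0:
        return "FAIL (requests)"
    # An explicit False means validation ran and a threshold was missed.
    if result.get("threshold_passed") is False:
        return "FAIL (threshold)"
    return "PASS"
```

Checking `is False` rather than truthiness means a result JSON written before validation ran, where the field is absent, is not reported as a threshold failure.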

fix(benchmark): move TTFT to its own column in report table

4c01f48

TTFT p95 was concatenated into the p50 E2E column making it confusing. Give it a dedicated column instead. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

fix(benchmark): add ~25% margin to tts-base thresholds for CI noise

3ce0778

Model works correctly now (ref_text fix). Actual: 1.272 RPS, 3666ms p95. Add margin: min_rps 1.3→1.0, max_p95_e2e_ms 3000→4500. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

fix(benchmark): add ref_text fallback in config, add debug echo

801ace4

S3 .txt download may be failing silently. Add ref_text back to config as fallback and echo the resolved value for debugging. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

fix(benchmark): strip trailing newline from S3 transcript file

6624031

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

test gexl base model stable

8160ede

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

update base benchmark value

1e45de2

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

reenable pr workflow

14958dd

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

reenable pr workflow

7e7b75f

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

patch CVEs

67bdf54

Signed-off-by: Yadan Wei <yadanwei@amazon.com>

Yadan-Wei marked this pull request as ready for review May 1, 2026 23:11

Yadan-Wei enabled auto-merge (squash) May 1, 2026 23:44

Merge branch 'main' into omni-benchmark

4a2d707

sirutBuasai approved these changes May 5, 2026

Yadan-Wei merged commit 0a11bb6 into main May 5, 2026
36 checks passed

Yadan-Wei deleted the omni-benchmark branch May 5, 2026 20:29

Yadan-Wei merged 22 commits into main from omni-benchmark
