feat(vllm-omni): add multimodal benchmark suite and CI workflow #6022
Merged
Adds lightweight async benchmark clients for vLLM-Omni's modality endpoints and a dispatch workflow that runs them in CI:
- TTS (/v1/audio/speech): CustomVoice + Base (voice cloning)
- Image (/v1/images/generations)
- Video (/v1/videos with async submit+poll)
- Chat (/v1/chat/completions, SSE streaming; reports TTFT / TPOT / ITL)

Each client emits a uniform JSON summary; the dispatcher script validates configurable thresholds and benchmark_report.py aggregates artifacts into a markdown table for $GITHUB_STEP_SUMMARY.

vllm-omni-model-tests.yml gains a benchmark: section alongside smoke-test:, driving a new dispatch-vllm-omni-benchmark.yml workflow. The existing dispatch-vllm-benchmark.yml (LLM) is untouched.

The chat client uses SSE streaming so that TTFT is measurable; without SSE only E2E is visible and scheduler/batching regressions are hidden.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
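To illustrate why streaming matters here, the latency metrics named in this commit can be derived from per-chunk arrival timestamps. The sketch below is not the client's actual code: the function name, the one-token-per-chunk simplification, and the TPOT definition (decode time divided by the tokens after the first) are assumptions that follow common usage.

```python
from statistics import mean


def streaming_metrics(request_start: float, chunk_times: list[float]) -> dict:
    """Derive TTFT, TPOT, mean ITL, and E2E (all in ms) from chunk arrival times.

    Timestamps are in seconds. Assumes each streamed chunk carries one token.
    """
    if not chunk_times:
        raise ValueError("no chunks received; TTFT is undefined")
    ttft = chunk_times[0] - request_start
    e2e = chunk_times[-1] - request_start
    # Inter-token latencies: gaps between consecutive chunks.
    itl = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    # Time per output token excludes the first token, which includes prefill.
    tpot = (e2e - ttft) / (len(chunk_times) - 1) if len(chunk_times) > 1 else 0.0
    return {
        "ttft_ms": ttft * 1000,
        "tpot_ms": tpot * 1000,
        "mean_itl_ms": mean(itl) * 1000 if itl else 0.0,
        "e2e_ms": e2e * 1000,
    }
```

Without streaming, only the final timestamp is observable, so everything except `e2e_ms` collapses.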
added 20 commits
April 30, 2026 11:31
workflow_dispatch is only available for the default branch, so it cannot be used to test this workflow on the PR branch. Add a scoped pull_request trigger so the benchmark runs on PRs that touch the workflow, dispatcher, or benchmark clients. The pull_request block is marked TEMPORARY with a clear comment — remove it once the workflow has run successfully against a PR. IMAGE_URI falls back to the vllm:omni-cuda-v1 DLC image used in the baseline devbox runs when no workflow_dispatch input is provided. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…el name
- Relax TTS/image benchmark thresholds to match g6.xlarge (L4 24GB) performance instead of g6e.xlarge (L40S) baselines
- Pass model name to chat benchmark client to fix 'model default does not exist' 404 error for qwen2.5-omni-3b

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…e thresholds
- Use iter_any() chunk reading instead of line iteration to avoid aiohttp's 128KB line-size limit (omni models emit large audio frames)
- Further relax qwen3-tts-12hz-1.7b-base thresholds; performance on L4 is highly variable (0.075-0.244 RPS across runs)

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
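Reading raw chunks means the client has to reassemble SSE lines itself. A minimal sketch of that buffering follows, assuming OpenAI-style `data:` lines with a `[DONE]` terminator; the `SSEBuffer` class is hypothetical and not taken from the PR.

```python
import json


class SSEBuffer:
    """Reassemble SSE data events from arbitrarily sized byte chunks."""

    def __init__(self) -> None:
        self._buf = b""

    def feed(self, chunk: bytes) -> list[dict]:
        """Add a chunk and return the JSON payloads of any completed data lines."""
        self._buf += chunk
        events = []
        # Only complete lines are consumed; a trailing partial line stays buffered.
        while b"\n" in self._buf:
            line, self._buf = self._buf.split(b"\n", 1)
            line = line.strip()
            if not line.startswith(b"data:"):
                continue  # skip blank separators and comment lines
            payload = line[len(b"data:"):].strip()
            if payload == b"[DONE]":
                continue  # end-of-stream marker carries no JSON
            events.append(json.loads(payload))
        return events


# Usage inside an aiohttp response handler (not executed here):
#     buf = SSEBuffer()
#     async for chunk in resp.content.iter_any():
#         for event in buf.feed(chunk):
#             ...
```

Because the buffer grows without a per-line cap, a single large audio frame no longer trips the stream reader's line-length limit.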
Validation script now writes threshold_passed into the result JSON so the report step can distinguish threshold failures from request failures. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
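A hedged sketch of what such a validation step could look like. Only `threshold_passed`, `min_rps`, and `max_p95_e2e_ms` appear in this PR; the summary field names, the `threshold_violations` key, and the `min_`/`max_` prefix convention used to find the matching metric are assumptions.

```python
def validate_thresholds(summary: dict, thresholds: dict) -> dict:
    """Return a copy of summary annotated with threshold_passed.

    Assumed convention: a threshold key is "min_<metric>" (lower bound) or
    "max_<metric>" (upper bound), where <metric> is a key in the summary.
    """
    violations = []
    for key, limit in thresholds.items():
        bound, _, metric = key.partition("_")
        value = summary.get(metric)
        if value is None:
            violations.append(f"{metric}: missing from summary")
        elif bound == "min" and value < limit:
            violations.append(f"{metric}: {value} < min {limit}")
        elif bound == "max" and value > limit:
            violations.append(f"{metric}: {value} > max {limit}")
    return {
        **summary,
        "threshold_passed": not violations,
        "threshold_violations": violations,
    }
```

Keeping the verdict in the result JSON means the report step never has to re-read the threshold config.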
Model produces corrupt output on L4 — Code2Wav receives malformed input_ids (length 1, not divisible by num_quantizers 16) due to ONNX/CUDA graph issues on Ada Lovelace. Restore original L40S thresholds. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Omni model has fast token generation (157 tps) but slow 3-stage audio synthesis (thinker→talker→code2wav), making E2E ~92s per request. Adjust min_rps and max_p95_e2e_ms to reflect actual omni-chat behavior. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Model intermittently produces malformed input_ids (length 1) on both L4 and L40S — this is a model/vllm-omni bug, not hardware-specific. Relax thresholds to tolerate the failure rate while still catching complete regressions. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…script The tts-base voice-cloning model requires ref_text to be the exact transcript of the reference audio. The wav says 'The quick brown fox jumps over the lazy dog near the riverbank at sunset.' not 'Hello, how are you?' — mismatched transcript causes Code2Wav to produce malformed output. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Now that ref_text matches the actual audio transcript, restore the original L40S thresholds (min_rps=1.3, min_audio_rtf_mult=4.5, max_p95_e2e_ms=3000). Added comments about the ref_text requirement and link to vllm-omni#3124. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…coding Download .txt file alongside .wav from S3 (same path, .wav→.txt) to get the exact transcript. Falls back to config ref_text if .txt not found. Prevents future inconsistency between audio and transcript. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
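The lookup-with-fallback described here can be sketched as below. This is illustrative only: the function names are hypothetical, and the S3 download itself is left out so the example has no AWS dependency.

```python
from pathlib import Path, PurePosixPath


def transcript_key(wav_key: str) -> str:
    """Map an audio object key to its transcript key (same path, .wav -> .txt).

    Takes the object key, not a full s3:// URI, because PurePosixPath would
    collapse the double slash in the scheme.
    """
    return str(PurePosixPath(wav_key).with_suffix(".txt"))


def resolve_ref_text(local_txt: Path, config_ref_text: str) -> str:
    """Prefer the downloaded transcript; fall back to the configured ref_text."""
    if local_txt.is_file():
        text = local_txt.read_text(encoding="utf-8").strip()
        if text:
            return text
    return config_ref_text
```

Treating an empty transcript file the same as a missing one avoids sending a blank `ref_text` to the server.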
TTFT p95 was concatenated into the p50 E2E column making it confusing. Give it a dedicated column instead. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
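For illustration, a report renderer with a dedicated TTFT column might look like the sketch below. The column set and the result field names (`model`, `rps`, `p50_e2e_ms`, `p95_e2e_ms`, `p95_ttft_ms`) are assumptions; only `threshold_passed` is named in this PR.

```python
def _fmt_ms(value) -> str:
    """Format a millisecond value, or n/a for modalities that do not report it."""
    return "n/a" if value is None else f"{value:.0f}"


def render_report(results: list[dict]) -> str:
    """Render benchmark results as a markdown table, one row per model."""
    lines = [
        "| Model | Status | RPS | p50 E2E (ms) | p95 E2E (ms) | p95 TTFT (ms) |",
        "| --- | --- | --- | --- | --- | --- |",
    ]
    for r in results:
        status = "PASS" if r.get("threshold_passed") else "FAIL"
        lines.append(
            f"| {r['model']} | {status} | {r['rps']:.2f} "
            f"| {_fmt_ms(r.get('p50_e2e_ms'))} | {_fmt_ms(r.get('p95_e2e_ms'))} "
            f"| {_fmt_ms(r.get('p95_ttft_ms'))} |"
        )
    return "\n".join(lines) + "\n"


# In a workflow step the table would be appended to the file whose path is in
# the GITHUB_STEP_SUMMARY environment variable (not executed here).
```

Non-streaming modalities simply show n/a in the TTFT column instead of overloading another cell.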
Model works correctly now (ref_text fix). Actual: 1.272 RPS, 3666ms p95. Add margin: min_rps 1.3→1.0, max_p95_e2e_ms 3000→4500. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
S3 .txt download may be failing silently. Add ref_text back to config as fallback and echo the resolved value for debugging. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
sirutBuasai
approved these changes
May 5, 2026
Purpose
Adds a multimodal benchmark suite for the vLLM-Omni DLC image, with per-modality async clients and a dispatch workflow that parallels the existing dispatch-vllm-benchmark.yml but leaves it untouched.

Problem: vLLM-Omni exposes four distinct modality endpoints (TTS, image, video, chat), and upstream vllm bench serve --omni only targets /v1/chat/completions. We previously had no CI coverage for the other three, and no single report shape that handles all four.

Solution: four lightweight aiohttp-based clients that speak OpenAI-compatible routes directly on the container's port 8080, a dispatcher wrapper that picks the right client by benchmark_type, and a workflow that runs them from the benchmark: section of vllm-omni-model-tests.yml.

Clients

| Client | Endpoint |
| --- | --- |
| tts_benchmark_client.py | POST /v1/audio/speech |
| image_benchmark_client.py | POST /v1/images/generations |
| video_benchmark_client.py | POST /v1/videos, then poll |
| chat_omni_benchmark_client.py | POST /v1/chat/completions (SSE) |

The chat client uses streaming SSE (stream=true) because the most important user-perceived metric for chat is TTFT (Time To First Token), which is only measurable when the server streams tokens incrementally. Metrics match what vllm bench serve --omni --backend openai-chat-omni reports, so numbers are directly comparable.

Why not use vllm bench serve

- vllm bench serve --omni only supports /v1/chat/completions; it can't benchmark the TTS, image, or video endpoints.
- It requires vllm-omni installed on the CI runner, which drags in torch/cuda deps we don't otherwise need.

Benchmark matrix (5 models in vllm-omni-model-tests.yml)

| Fleet | benchmark_type | Notes |
| --- | --- | --- |
| x86-g6xl-runner | tts | |
| x86-g6xl-runner | tts-base | Reference audio and transcript (.wav/.txt) at s3://dlc-cicd-models/test-fixtures/audio/tts_ref_vivian.* |
| x86-g6xl-runner | image | |
| x86-g6exl-runner | video | |
| x86-g6e12xl-runner | chat | |

Thresholds (min_rps, min_output_tps, max_p95_ttft_ms, etc.) are declared per model in benchmark_config and validated by the dispatcher.
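The selection of a client by benchmark_type can be sketched as a lookup table. This is an assumption-laden illustration, not the PR's dispatcher (which may well be a shell script): the `--base-url` and `--model` flags are hypothetical, and mapping tts-base onto the TTS client is inferred from the PR listing CustomVoice and Base under the same endpoint.

```python
CLIENT_BY_TYPE = {
    "tts": "tts_benchmark_client.py",
    "tts-base": "tts_benchmark_client.py",  # assumption: same client, voice-cloning mode
    "image": "image_benchmark_client.py",
    "video": "video_benchmark_client.py",
    "chat": "chat_omni_benchmark_client.py",
}


def build_command(benchmark_type: str, model: str, port: int = 8080) -> list[str]:
    """Build the argv for the client matching benchmark_type.

    The CLI flags are hypothetical placeholders for whatever the clients accept.
    """
    try:
        script = CLIENT_BY_TYPE[benchmark_type]
    except KeyError:
        known = ", ".join(sorted(CLIENT_BY_TYPE))
        raise ValueError(
            f"unknown benchmark_type {benchmark_type!r}; expected one of: {known}"
        ) from None
    return [
        "python3", script,
        "--base-url", f"http://localhost:{port}",
        "--model", model,
    ]
```

Failing loudly on an unknown type keeps a typo in the YAML from silently skipping a benchmark.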
Note: CI runs on the fleet-appropriate hardware per model. The devbox sweeps below were collected on g6e.xlarge (L40S) as a reference ceiling; CI numbers on smaller fleets (e.g. x86-g6xl-runner / L4) are expected to be lower.

Test Plan

- Devbox sweeps (g6e.xlarge, L40S 46 GB) against vllm:omni-cuda-v1:
  - qwen3-tts-1.7b-customvoice: concurrency sweep c={1,4,8,16}, peak 4.63 req/s / 18× real-time audio
  - qwen3-tts-12hz-1.7b-base: concurrency sweep c={1,4,8,16}, peak 2.07 req/s / 6.83× real-time audio; saturates at c=8
  - flux2-klein-4b: concurrency sweep c={1,2,4}, flat 0.35 images/s (GPU-bound)
  - wan2.1-t2v-1.3b: concurrency sweep c={1,2}, flat 0.66 videos/s (GPU-bound)
- benchmark_report.py tested locally against real JSON artifacts; produces a valid markdown table for $GITHUB_STEP_SUMMARY
- benchmark_type cases (tts, tts-base, image, video, chat)
- s3://dlc-cicd-models/test-fixtures/audio/tts_ref_vivian.{wav,txt} uploaded and readable by the tts-base path
- pre-commit run passes on all changed files (ruff, mdformat, shfmt, actionlint, check-github-workflows, detect-aws-credentials, gitleaks, etc.)
- PASS. Thresholds in the YAML are calibrated against the numbers below.

Test Result
CI run (GitHub Actions, per-fleet CodeBuild runners)

- flux2-klein-4b / x86-g6xl-runner
- qwen2.5-omni-3b / x86-g6e12xl-runner
- qwen3-tts-1.7b-customvoice / x86-g6xl-runner
- qwen3-tts-12hz-1.7b-base / x86-g6exl-runner
- wan2.1-t2v-1.3b / x86-g6exl-runner

Devbox concurrency sweeps (g6e.xlarge, L40S 46 GB, vllm:omni-cuda-v1)

qwen3-tts-1.7b-customvoice c=1: 1.14 req/s, 4.08x RT, p95 1123 ms
qwen3-tts-1.7b-customvoice c=4: 2.10 req/s, 8.17x RT, p95 2387 ms
qwen3-tts-1.7b-customvoice c=8: 3.47 req/s, 13.00x RT, p95 3637 ms
qwen3-tts-1.7b-customvoice c=16: 4.63 req/s, 18.00x RT, p95 4237 ms
qwen3-tts-12hz-1.7b-base c=1: 1.03 req/s, 3.06x RT, p95 1201 ms
qwen3-tts-12hz-1.7b-base c=4: 1.86 req/s, 6.00x RT, p95 2532 ms
qwen3-tts-12hz-1.7b-base c=8: 2.07 req/s, 6.83x RT, p95 4297 ms (RTF ~1.02, saturated)
flux2-klein-4b c=1: 0.342 images/s, p95 2944 ms
flux2-klein-4b c=2: 0.347 images/s, p95 5863 ms (pure queue)
wan2.1-t2v-1.3b c=1: 0.664 videos/s, p95 1508 ms (server inference 1.38 s)
wan2.1-t2v-1.3b c=2: 0.662 videos/s, p95 3044 ms (pure queue)
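The "pure queue" annotations can be sanity-checked with Little's law (requests in flight = throughput × latency). The sketch below plugs in the image and video numbers above; it uses p95 in place of mean latency, which is an approximation that only holds when latency is steady, so treat the result as a rough check rather than part of the PR's tooling.

```python
def implied_concurrency(throughput_per_s: float, latency_ms: float) -> float:
    """Little's law: average requests in flight = throughput * latency."""
    return throughput_per_s * latency_ms / 1000.0


# (model, configured concurrency) -> (throughput per second, p95 latency in ms),
# copied from the devbox sweep lines above.
sweeps = {
    ("flux2-klein-4b", 1): (0.342, 2944),
    ("flux2-klein-4b", 2): (0.347, 5863),
    ("wan2.1-t2v-1.3b", 1): (0.664, 1508),
    ("wan2.1-t2v-1.3b", 2): (0.662, 3044),
}

for (model, c), (tput, p95_ms) in sweeps.items():
    in_flight = implied_concurrency(tput, p95_ms)
    print(f"{model} c={c}: implied in-flight = {in_flight:.2f}")
```

In each case the implied in-flight count lands close to the configured concurrency while throughput stays flat, which is what one would expect if the extra request is simply waiting behind the first.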
Thresholds in the YAML are calibrated from the CI run above with ~20–25% margin for noise.
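As a worked check of that margin, the sketch below uses the numbers from the commit history for the voice-cloning TTS model (observed 1.272 RPS and 3666 ms p95; thresholds min_rps=1.0 and max_p95_e2e_ms=4500). The helper function is illustrative and not part of the PR.

```python
def implied_margin(observed: float, threshold: float, higher_is_better: bool) -> float:
    """Fraction by which the observed value may degrade before the threshold trips."""
    if higher_is_better:
        # Lower bound (e.g. min_rps): room to fall.
        return 1.0 - threshold / observed
    # Upper bound (e.g. max_p95_e2e_ms): room to rise.
    return threshold / observed - 1.0


rps_margin = implied_margin(observed=1.272, threshold=1.0, higher_is_better=True)
p95_margin = implied_margin(observed=3666, threshold=4500, higher_is_better=False)
print(f"min_rps margin: {rps_margin:.1%}")
print(f"max_p95_e2e_ms margin: {p95_margin:.1%}")
```

Both margins fall inside the stated 20–25% band (roughly 21% and 23%).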
PR Checklist
pre-commit run --all-files locally before creating this PR.