feat(vllm-omni): add multimodal benchmark suite and CI workflow#6022

Merged: Yadan-Wei merged 22 commits into main from omni-benchmark on May 5, 2026
@Yadan-Wei Yadan-Wei commented Apr 30, 2026

Purpose

Adds a multimodal benchmark suite for the vLLM-Omni DLC image, with per-modality async clients and a dispatch workflow that mirrors the
existing dispatch-vllm-benchmark.yml and leaves it untouched.

Problem: vLLM-Omni exposes four distinct modality endpoints (TTS, image, video, chat), and upstream vllm bench serve --omni only targets
/v1/chat/completions. We previously had no CI coverage for the other three, and no single reporter shape that handles all four.

Solution: four lightweight aiohttp-based clients that speak OpenAI-compatible routes directly on the container's port 8080, a dispatcher
wrapper that picks the right client by benchmark_type, and a workflow that runs them from the benchmark: section of
vllm-omni-model-tests.yml.
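
The dispatch step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's wrapper (which may well be a shell script). Only the script names and the five benchmark_type values come from this PR; the function name `build_command` and the `--base-url` flag are assumptions.

```python
# Hypothetical dispatcher sketch: maps benchmark_type to a client script.
# Script names and benchmark_type values come from the PR description;
# build_command and the --base-url flag are illustrative assumptions.
import sys

CLIENT_BY_TYPE = {
    "tts": "tts_benchmark_client.py",
    "tts-base": "tts_benchmark_client.py",  # same client, voice-cloning mode
    "image": "image_benchmark_client.py",
    "video": "video_benchmark_client.py",
    "chat": "chat_omni_benchmark_client.py",
}


def build_command(benchmark_type: str, base_url: str = "http://localhost:8080") -> list[str]:
    """Return the argv that would launch the client for one benchmark_type."""
    try:
        script = CLIENT_BY_TYPE[benchmark_type]
    except KeyError:
        raise ValueError(f"unknown benchmark_type: {benchmark_type!r}") from None
    return [sys.executable, script, "--base-url", base_url]
```

Failing loudly on an unknown type keeps a typo in the workflow YAML from silently skipping a benchmark.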

Clients

| Script | Endpoint | Metrics |
|---|---|---|
| tts_benchmark_client.py | POST /v1/audio/speech | TTFB, E2E, audio duration, RTF, req/s; supports CustomVoice and Base (voice cloning, downloads ref audio + transcript from S3) |
| image_benchmark_client.py | POST /v1/images/generations | E2E, images/s |
| video_benchmark_client.py | POST /v1/videos → poll | submit latency, server inference time, E2E, videos/s |
| chat_omni_benchmark_client.py | POST /v1/chat/completions (SSE) | TTFT, TPOT, ITL, E2E, req/s, output tokens/s |

The chat client uses streaming SSE (stream=true) because the most important user-perceived metric for chat is TTFT (Time To First Token),
which is only measurable when the server streams tokens incrementally. Metrics match what vllm bench serve --omni --backend openai-chat-omni reports so numbers are directly comparable.
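
To make the streaming metrics concrete, here is a small sketch of how TTFT, ITL and TPOT can be derived from token arrival timestamps. It assumes the definitions commonly used by serving benchmarks (TPOT as decode time divided by the number of tokens after the first). The PR's client may differ in details, such as how chunks carrying several tokens are counted. The function name and dictionary keys are illustrative.

```python
# Sketch: derive TTFT / ITL / TPOT from token arrival times.
# Assumes common serving-benchmark definitions; not taken from the PR's code.
from statistics import mean


def streaming_metrics(start: float, arrivals: list[float]) -> dict[str, float]:
    """start: request send time; arrivals: timestamps of each streamed token."""
    if not arrivals:
        raise ValueError("no tokens received")
    ttft = arrivals[0] - start   # time to first token
    e2e = arrivals[-1] - start   # end-to-end latency
    # Inter-token latencies: gaps between consecutive tokens.
    itl = [b - a for a, b in zip(arrivals, arrivals[1:])]
    n = len(arrivals)
    return {
        "ttft_s": ttft,
        "e2e_s": e2e,
        "mean_itl_s": mean(itl) if itl else 0.0,
        # Time per output token, excluding the first token.
        "tpot_s": (e2e - ttft) / (n - 1) if n > 1 else 0.0,
    }
```

Without streaming, only `e2e_s` would be observable, which is why the chat client needs SSE.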

Why not use vllm bench serve

  1. vllm bench serve --omni only supports /v1/chat/completions; it can't benchmark TTS speech, image, or video endpoints
  2. It requires vllm-omni installed on the CI runner, which drags in torch/cuda deps we don't otherwise need
  3. Keeping all four clients in the same shape lets one dispatcher + reporter handle every modality

Benchmark matrix (5 models in vllm-omni-model-tests.yml)

| Model | Fleet | Type | Notes |
|---|---|---|---|
| qwen3-tts-1.7b-customvoice | x86-g6xl-runner | tts | preset voice |
| qwen3-tts-12hz-1.7b-base | x86-g6xl-runner | tts-base | voice cloning; reference WAV + transcript (.wav / .txt) at s3://dlc-cicd-models/test-fixtures/audio/tts_ref_vivian.*. |
| flux2-klein-4b | x86-g6xl-runner | image | 512×512 |
| wan2.1-t2v-1.3b | x86-g6exl-runner | video | 480×320, 17 frames, 4 steps |
| qwen2.5-omni-3b | x86-g6e12xl-runner | chat | streaming SSE, requires 4-GPU for thinker/talker/code2wav pipeline |

Thresholds (min_rps, min_output_tps, max_p95_ttft_ms, etc.) are declared per model in benchmark_config and validated by the
dispatcher.
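
A minimal sketch of what such a validation could look like. The threshold names (min_rps, max_p95_e2e_ms, and so on) follow the PR. The convention that a summary key equals the threshold name without its min_/max_ prefix is an assumption made for this example, as is the function name.

```python
# Sketch of threshold validation.
# Assumption: the summary key is the threshold name minus its min_/max_ prefix.
def check_thresholds(summary: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return a list of human-readable violations (empty list means pass)."""
    violations = []
    for name, limit in thresholds.items():
        kind, _, metric = name.partition("_")
        if kind not in ("min", "max") or metric not in summary:
            # Treat unknown or unmeasured thresholds as failures, not passes.
            violations.append(f"{name}: cannot evaluate")
            continue
        value = summary[metric]
        if kind == "min" and value < limit:
            violations.append(f"{name}: {value} < {limit}")
        elif kind == "max" and value > limit:
            violations.append(f"{name}: {value} > {limit}")
    return violations
```

For example, `check_thresholds({"rps": 1.183, "p95_e2e_ms": 3980.63}, {"min_rps": 1.0, "max_p95_e2e_ms": 4500})` returns an empty list, using the qwen3-tts-12hz-1.7b-base CI figures and the thresholds quoted in the commit messages of this PR.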

Note: CI runs on the fleet-appropriate hardware per model. The devbox sweeps below were collected on g6e.xlarge (L40S) as a reference
ceiling; CI numbers on smaller fleets (e.g. x86-g6xl-runner / L4) are expected to be lower.

Test Plan

  • Benchmark clients validated end-to-end on a devbox (g6e.xlarge, L40S 46 GB) against vllm:omni-cuda-v1:
    • qwen3-tts-1.7b-customvoice — concurrency sweep c={1,4,8,16}, peak 4.63 req/s / 18× real-time audio
    • qwen3-tts-12hz-1.7b-base — concurrency sweep c={1,4,8,16}, peak 2.07 req/s / 6.83× real-time audio; saturates at c=8
    • flux2-klein-4b — concurrency sweep c={1,2,4}, flat 0.35 images/s (GPU-bound)
    • wan2.1-t2v-1.3b — concurrency sweep c={1,2}, flat 0.66 videos/s (GPU-bound)
  • Reporter benchmark_report.py tested locally against real JSON artifacts — produces a valid markdown table for $GITHUB_STEP_SUMMARY
  • Dispatcher smoke-tested with all five benchmark_type cases (tts, tts-base, image, video, chat)
  • S3 fixtures s3://dlc-cicd-models/test-fixtures/audio/tts_ref_vivian.{wav,txt} uploaded and readable by the tts-base path
  • pre-commit run passes on all changed files (ruff, mdformat, shfmt, actionlint, check-github-workflows, detect-aws-credentials,
    gitleaks, etc.)
  • CI dispatch on all five models completed — all PASS. Thresholds in the YAML are calibrated against the numbers below.

Test Result

CI run (GitHub Actions, per-fleet CodeBuild runners)

| Model / Fleet | Status | Req/s | Throughput | p50 E2E (ms) | p95 E2E (ms) | p99 E2E (ms) | TTFT p95 (ms) |
|---|---|---|---|---|---|---|---|
| flux2-klein-4b / x86-g6xl-runner | PASS | 0.098 | 0.098 img/s | 10183.45 | 10287.91 | 10287.91 | |
| qwen2.5-omni-3b / x86-g6e12xl-runner | PASS | 0.028 | 158.36 tok/s | 70322.24 | 91547.04 | 98614.59 | 47.25 |
| qwen3-tts-1.7b-customvoice / x86-g6xl-runner | PASS | 1.678 | 6.029× RT | 2125.67 | 2752.81 | 2798.53 | |
| qwen3-tts-12hz-1.7b-base / x86-g6exl-runner | PASS | 1.183 | 4.595× RT | 3193.46 | 3980.63 | 4280.94 | |
| wan2.1-t2v-1.3b / x86-g6exl-runner | PASS | 0.664 | 0.664 vid/s | 1506.80 | 1507.50 | 1507.50 | |

Devbox concurrency sweeps (g6e.xlarge, L40S 46 GB, vllm:omni-cuda-v1)

qwen3-tts-1.7b-customvoice c=1: 1.14 req/s, 4.08x RT, p95 1123 ms
qwen3-tts-1.7b-customvoice c=4: 2.10 req/s, 8.17x RT, p95 2387 ms
qwen3-tts-1.7b-customvoice c=8: 3.47 req/s, 13.00x RT, p95 3637 ms
qwen3-tts-1.7b-customvoice c=16: 4.63 req/s, 18.00x RT, p95 4237 ms

qwen3-tts-12hz-1.7b-base c=1: 1.03 req/s, 3.06x RT, p95 1201 ms
qwen3-tts-12hz-1.7b-base c=4: 1.86 req/s, 6.00x RT, p95 2532 ms
qwen3-tts-12hz-1.7b-base c=8: 2.07 req/s, 6.83x RT, p95 4297 ms (RTF ~1.02, saturated)

flux2-klein-4b c=1: 0.342 images/s, p95 2944 ms
flux2-klein-4b c=2: 0.347 images/s, p95 5863 ms (pure queue)

wan2.1-t2v-1.3b c=1: 0.664 videos/s, p95 1508 ms (server inference 1.38 s)
wan2.1-t2v-1.3b c=2: 0.662 videos/s, p95 3044 ms (pure queue)

Thresholds in the YAML are calibrated from the CI run above with ~20–25% margin for noise.
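
As a worked check of that margin, the snippet below uses the figures quoted in one of this PR's commit messages for qwen3-tts-12hz-1.7b-base (observed 1.272 RPS and 3666 ms p95; thresholds min_rps 1.0 and max_p95_e2e_ms 4500). The helper name `margin` is illustrative.

```python
# Sketch: how much headroom a threshold leaves relative to an observed value.
def margin(observed: float, threshold: float) -> float:
    """Fractional distance between an observed metric and its threshold."""
    return abs(observed - threshold) / observed


rps_margin = margin(1.272, 1.0)      # min_rps: observed 1.272 req/s, threshold 1.0
p95_margin = margin(3666.0, 4500.0)  # max_p95_e2e_ms: observed 3666 ms, threshold 4500
print(f"{rps_margin:.1%} {p95_margin:.1%}")  # prints "21.4% 22.7%"
```

Both values fall inside the stated 20–25% band.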

PR Checklist

  • I ran pre-commit run --all-files locally before creating this PR.
  • Full CI benchmark dispatch run — all 5 models PASS.

Adds lightweight async benchmark clients for vLLM-Omni's modality endpoints
and a dispatch workflow that runs them in CI:

- TTS (/v1/audio/speech) — CustomVoice + Base (voice cloning)
- Image (/v1/images/generations)
- Video (/v1/videos with async submit+poll)
- Chat (/v1/chat/completions, SSE streaming — reports TTFT / TPOT / ITL)
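
The video flow's submit+poll pattern can be sketched as below. The I/O is injected as callables so the loop runs without a server. The `status` field and its `completed` / `failed` values are assumptions, since this PR does not show the /v1/videos response schema; all names are illustrative.

```python
# Sketch of an async submit+poll loop with injected I/O.
# Assumption: the status payload is a dict with a "status" key whose terminal
# values are "completed" and "failed". The real schema may differ.
import asyncio
import time
from typing import Awaitable, Callable


async def submit_and_poll(
    submit: Callable[[], Awaitable[str]],
    get_status: Callable[[str], Awaitable[dict]],
    interval_s: float = 0.5,
    timeout_s: float = 300.0,
) -> dict[str, float]:
    """Time the submit call and the full submit-to-completion span."""
    t0 = time.perf_counter()
    job_id = await submit()
    submit_latency = time.perf_counter() - t0
    while True:
        status = await get_status(job_id)
        if status.get("status") == "completed":
            break
        if status.get("status") == "failed":
            raise RuntimeError(f"job {job_id} failed")
        if time.perf_counter() - t0 > timeout_s:
            raise TimeoutError(f"job {job_id} did not finish in {timeout_s}s")
        await asyncio.sleep(interval_s)
    return {"submit_latency_s": submit_latency, "e2e_s": time.perf_counter() - t0}
```

Note that the measured E2E includes up to one polling interval of slack, so a short interval matters when the server-side inference time is around a second.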

Each client emits a uniform JSON summary; the dispatcher script validates
configurable thresholds and benchmark_report.py aggregates artifacts into a
markdown table for $GITHUB_STEP_SUMMARY.
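
The aggregation step might look roughly like this sketch. The summary keys (`model`, `status`, `rps`, `p95_e2e_ms`) and both function names are assumptions; the actual schema written by the clients is not shown in this PR.

```python
# Sketch of the aggregation step: JSON summaries in, one markdown table out.
# The summary keys used here are assumed, not taken from benchmark_report.py.
import json
from pathlib import Path


def load_summaries(artifact_dir: str) -> list[dict]:
    """Read every *.json file directly under artifact_dir."""
    return [json.loads(p.read_text()) for p in sorted(Path(artifact_dir).glob("*.json"))]


def render_report(summaries: list[dict]) -> str:
    """Render one table row per summary, sorted by model name."""
    header = "| Model | Status | Req/s | p95 E2E (ms) |\n|---|---|---|---|"
    rows = [
        f"| {s['model']} | {s['status']} | {s['rps']:.3f} | {s['p95_e2e_ms']:.2f} |"
        for s in sorted(summaries, key=lambda s: s["model"])
    ]
    return "\n".join([header, *rows])
```

In a workflow step, the rendered string would be appended to the file whose path is in the `GITHUB_STEP_SUMMARY` environment variable.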

vllm-omni-model-tests.yml gains a benchmark: section alongside smoke-test:,
driving a new dispatch-vllm-omni-benchmark.yml workflow. The existing
dispatch-vllm-benchmark.yml (LLM) is untouched.

The chat client uses SSE streaming so that TTFT is measurable; without SSE
only E2E is visible and scheduler/batching regressions are hidden.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Yadan Wei added 20 commits April 30, 2026 11:31
workflow_dispatch is only available for the default branch, so it cannot
be used to test this workflow on the PR branch. Add a scoped pull_request
trigger so the benchmark runs on PRs that touch the workflow, dispatcher,
or benchmark clients.

The pull_request block is marked TEMPORARY with a clear comment — remove
it once the workflow has run successfully against a PR. IMAGE_URI falls
back to the vllm:omni-cuda-v1 DLC image used in the baseline devbox
runs when no workflow_dispatch input is provided.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…el name

- Relax TTS/image benchmark thresholds to match g6.xlarge (L4 24GB)
  performance instead of g6e.xlarge (L40S) baselines
- Pass model name to chat benchmark client to fix 'model default does
  not exist' 404 error for qwen2.5-omni-3b

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…e thresholds

- Use iter_any() chunk reading instead of line iteration to avoid
  aiohttp's 128KB line-size limit (omni models emit large audio frames)
- Further relax qwen3-tts-12hz-1.7b-base thresholds — performance on
  L4 is highly variable (0.075-0.244 RPS across runs)
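
The chunk-based reading mentioned above could be implemented along these lines. This is a sketch, not the PR's code. It buffers arbitrary byte chunks (such as those yielded by aiohttp's `response.content.iter_any()`) and splits on newlines itself, so no single-line size limit applies. The class name is illustrative and only single-line `data:` fields are handled.

```python
# Sketch: incremental SSE parsing over arbitrary byte chunks.
# Only single-line `data:` fields are handled; other SSE fields are ignored.
class SSEBuffer:
    def __init__(self) -> None:
        self._buf = b""

    def feed(self, chunk: bytes) -> list[str]:
        """Add a chunk and return the payloads of all complete `data:` lines."""
        self._buf += chunk
        # Everything before the last newline is complete; the tail is kept
        # in the buffer until the rest of that line arrives.
        *lines, self._buf = self._buf.split(b"\n")
        payloads = []
        for line in lines:
            line = line.rstrip(b"\r")
            if line.startswith(b"data:"):
                payloads.append(line[5:].strip().decode("utf-8"))
        return payloads
```

A client loop would then call `feed(chunk)` for each chunk from `iter_any()` and record an arrival timestamp per returned payload.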

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Validation script now writes threshold_passed into the result JSON so
the report step can distinguish threshold failures from request failures.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Model produces corrupt output on L4 — Code2Wav receives malformed
input_ids (length 1, not divisible by num_quantizers 16) due to
ONNX/CUDA graph issues on Ada Lovelace. Restore original L40S
thresholds.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Omni model has fast token generation (157 tps) but slow 3-stage audio
synthesis (thinker→talker→code2wav), making E2E ~92s per request.
Adjust min_rps and max_p95_e2e_ms to reflect actual omni-chat behavior.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Model intermittently produces malformed input_ids (length 1) on both
L4 and L40S — this is a model/vllm-omni bug, not hardware-specific.
Relax thresholds to tolerate the failure rate while still catching
complete regressions.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…script

The tts-base voice-cloning model requires ref_text to be the exact
transcript of the reference audio. The wav says 'The quick brown fox
jumps over the lazy dog near the riverbank at sunset.' not 'Hello, how
are you?' — mismatched transcript causes Code2Wav to produce malformed
output.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Now that ref_text matches the actual audio transcript, restore the
original L40S thresholds (min_rps=1.3, min_audio_rtf_mult=4.5,
max_p95_e2e_ms=3000). Added comments about the ref_text requirement
and link to vllm-omni#3124.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…coding

Download .txt file alongside .wav from S3 (same path, .wav→.txt) to
get the exact transcript. Falls back to config ref_text if .txt not
found. Prevents future inconsistency between audio and transcript.
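
A sketch of that lookup with its fallback. `fetch_text` stands in for whatever S3 download the client performs and is hypothetical; it is assumed to return None when the object is missing. The function names are illustrative.

```python
# Sketch: resolve the transcript for a reference WAV, falling back to config.
# fetch_text is a hypothetical stand-in for the S3 download; it should return
# None when the .txt object does not exist.
from typing import Callable, Optional


def transcript_uri(wav_uri: str) -> str:
    """Same path as the audio, with the .wav suffix replaced by .txt."""
    if not wav_uri.endswith(".wav"):
        raise ValueError(f"expected a .wav URI, got {wav_uri!r}")
    return wav_uri[: -len(".wav")] + ".txt"


def resolve_ref_text(
    wav_uri: str,
    fetch_text: Callable[[str], Optional[str]],
    config_ref_text: Optional[str] = None,
) -> str:
    """Prefer the transcript stored next to the audio; else use the config value."""
    text = fetch_text(transcript_uri(wav_uri))
    if text and text.strip():
        return text.strip()
    if config_ref_text:
        return config_ref_text
    raise ValueError("no transcript found in S3 or config")
```

Raising when neither source yields a transcript avoids benchmarking with an empty ref_text, which per the earlier commit leads to malformed output.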

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
TTFT p95 was concatenated into the p50 E2E column making it confusing.
Give it a dedicated column instead.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Model works correctly now (ref_text fix). Actual: 1.272 RPS, 3666ms
p95. Add margin: min_rps 1.3→1.0, max_p95_e2e_ms 3000→4500.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
S3 .txt download may be failing silently. Add ref_text back to config
as fallback and echo the resolved value for debugging.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
@Yadan-Wei Yadan-Wei marked this pull request as ready for review May 1, 2026 23:11
@Yadan-Wei Yadan-Wei enabled auto-merge (squash) May 1, 2026 23:44
@Yadan-Wei Yadan-Wei merged commit 0a11bb6 into main May 5, 2026
36 checks passed
@Yadan-Wei Yadan-Wei deleted the omni-benchmark branch May 5, 2026 20:29