feat(vllm-omni): align chat-omni token counting + 3 new benchmarks + ICE workaround for qwen2.5-omni-3b by Yadan-Wei · Pull Request #6079 · aws/deep-learning-containers

Yadan-Wei · 2026-05-12T01:12:00Z

Summary

Comprehensive update to the vllm-omni benchmark suite for the 0.20.0 release. Six commits, four logical changes:

1. Threshold adjustments: loosen thresholds for qwen3-tts-12hz-1.7b-base (real upstream Code2Wav regression) and drop min_output_tps for qwen2.5-omni-3b (the chunk-count fallback was unreliable across versions).
2. Align chat-omni token counting with upstream: read metrics.num_tokens_out from each SSE chunk (vllm-omni engine counter, version-stable) instead of falling back to chunk count. Matches vllm_omni/benchmarks/patch/patch.py::async_request_openai_chat_omni_completions.
3. Expand the benchmark suite with three new models that reuse existing clients (cosyvoice3-0.5b, ernie-image-turbo, wan2.1-vace-1.3b), plus a new audio_generate_benchmark_client.py and audio-generate benchmark type for stable-audio-open-1.0 on the new /v1/audio/generate endpoint. All four are baselined to real numbers with ~25% CI margin.
4. Route qwen2.5-omni-3b around the x86-g6e12xl-runner ICE by adding a runner-scale-sets: group at the same level as codebuild-fleet: and a parallel benchmark-runner-scale workflow job that uses gpu-l40s-4gpu-runners (same 4× L40S hardware). This mirrors the SGLang/vLLM scale-set pattern.

Threshold changes

`qwen3-tts-12hz-1.7b-base` — temporary loosening (real upstream regression)

| Threshold | Before | After |
| --- | --- | --- |
| `min_rps` | 0.4 | 0.27 |
| `min_audio_rtf_mult` | 1.6 | 1.0 |
| `max_p95_e2e_ms` | 11000 | 17000 |

Root cause is upstream vllm-omni#3203 un-batching Code2Wav decode chunks. Fix is merged as vllm-omni#3485 post-0.20.0 — we'll re-tighten when the next omni point release is picked up.

`qwen2.5-omni-3b` — drop `min_output_tps`

The metric collapsed ~50× across releases (158 → 3) on identical config because it's measuring SSE chunk count, not real tokens. Server reports usage.completion_tokens=0 (verified via devbox SSE capture on 0.18.0). RPS / TTFT p95 / E2E p95 are stable and cover the user-facing SLO. After this PR, the client also reads metrics.num_tokens_out (engine counter, version-stable) so a future PR can re-introduce min_output_tps against a stable metric.

Four new entries baselined (2026-05-12, vllm-omni 0.20.0, ~25% CI margin)

| Model | Type | Fleet | Observed → threshold |
| --- | --- | --- | --- |
| `cosyvoice3-0.5b` | tts-base | x86-g6exl-runner | rps 0.348 → 0.26, rtf 2.119 → 1.6, p95 e2e 15639ms → 20000 |
| `stable-audio-open-1.0` | audio-generate (new) | x86-g6xl-runner | rps 0.141 → 0.10, rtf 0.706 → 0.5, p95 e2e 7167ms → 9500 |
| `ernie-image-turbo` | image | x86-g6exl-runner | images/s 0.067 → 0.05, p95 e2e 17573ms → 22000 |
| `wan2.1-vace-1.3b` | video | x86-g6exl-runner | videos/s 0.332 → 0.25, p95 e2e 3010ms → 4000 |

All 9 benchmark entries now have pass/fail thresholds.
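As a quick arithmetic check on the "~25% CI margin" wording, the sketch below recomputes the headroom between each observed value and its threshold using the numbers from the baselining data. The helper name `relative_margin` and the `min`/`max` labels are illustrative, not part of the benchmark scripts. Because the thresholds are rounded, the margins come out at roughly a quarter to a third rather than exactly 25%.

```python
# Observed values and thresholds copied from the baselining data in this PR.
# "min" thresholds are lower bounds (throughput); "max" are upper bounds (latency).
BASELINES = [
    ("cosyvoice3-0.5b", "rps", 0.348, 0.26, "min"),
    ("cosyvoice3-0.5b", "rtf", 2.119, 1.6, "min"),
    ("cosyvoice3-0.5b", "p95_e2e_ms", 15639, 20000, "max"),
    ("stable-audio-open-1.0", "rps", 0.141, 0.10, "min"),
    ("stable-audio-open-1.0", "rtf", 0.706, 0.5, "min"),
    ("stable-audio-open-1.0", "p95_e2e_ms", 7167, 9500, "max"),
    ("ernie-image-turbo", "images_per_s", 0.067, 0.05, "min"),
    ("ernie-image-turbo", "p95_e2e_ms", 17573, 22000, "max"),
    ("wan2.1-vace-1.3b", "videos_per_s", 0.332, 0.25, "min"),
    ("wan2.1-vace-1.3b", "p95_e2e_ms", 3010, 4000, "max"),
]


def relative_margin(observed: float, threshold: float, kind: str) -> float:
    """Return the headroom between an observed value and its threshold as a fraction."""
    if kind == "min":
        # How far the observed value may drop before the check fails.
        return 1.0 - threshold / observed
    if kind == "max":
        # How far the observed value may rise before the check fails.
        return threshold / observed - 1.0
    raise ValueError(f"unknown threshold kind: {kind}")


if __name__ == "__main__":
    for model, metric, observed, threshold, kind in BASELINES:
        margin = relative_margin(observed, threshold, kind)
        print(f"{model:24s} {metric:14s} {kind} margin {margin:6.1%}")
```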

Token-counting alignment with upstream

chat_omni_benchmark_client.py previously counted SSE chunks when usage.completion_tokens was 0 (which omni always reports). Chunk count depends on SSE batching and swung 50× between 0.18.0 and 0.20.0 on identical workloads.

New precedence chain — matches upstream vllm_omni/benchmarks/patch/patch.py:

1. metrics.num_tokens_out from each SSE chunk (vllm-omni engine counter, version-stable)
2. usage.completion_tokens (OpenAI standard, omni reports 0)
3. len(token_times) (chunk count fallback, last resort)
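The precedence chain can be sketched roughly as follows. This is not the code in chat_omni_benchmark_client.py; the function names are hypothetical, and two details are assumptions: that `metrics` is a top-level object on each chunk's JSON payload, and that `num_tokens_out` is cumulative, so the latest non-zero value wins. If the upstream counter is per-chunk, the values would need to be summed instead.

```python
import json
from typing import Iterable, Optional, Sequence, Tuple


def parse_sse_line(line: str) -> Optional[dict]:
    """Decode one 'data: {...}' SSE line; return None for blanks and [DONE]."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None
    return json.loads(payload)


def resolve_output_tokens(
    sse_lines: Iterable[str], token_times: Sequence[float]
) -> Tuple[int, str]:
    """Return (token_count, source) following the three-step precedence."""
    engine_tokens = 0
    usage_tokens = 0
    for line in sse_lines:
        chunk = parse_sse_line(line)
        if chunk is None:
            continue
        # Assumption: the engine counter sits under a top-level "metrics" key
        # and is cumulative, so the most recent non-zero value is kept.
        n = (chunk.get("metrics") or {}).get("num_tokens_out") or 0
        if n > 0:
            engine_tokens = n
        u = (chunk.get("usage") or {}).get("completion_tokens") or 0
        if u > 0:
            usage_tokens = u
    if engine_tokens > 0:
        return engine_tokens, "metrics.num_tokens_out"
    if usage_tokens > 0:
        return usage_tokens, "usage.completion_tokens"
    # Last resort: one entry in token_times per received chunk.
    return len(token_times), "chunk_count"
```

Returning the source alongside the count makes it visible in the results which step of the chain produced the number, which helps when comparing runs across releases.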

New benchmark client

audio_generate_benchmark_client.py (~250 LOC) targets /v1/audio/generate, the diffusion-based audio endpoint introduced in vllm-omni v0.20.0 (#1794). Same async machinery + WAV duration parsing + metric set as tts_benchmark_client.py (TTFB / E2E / RTF / RPS / audio_throughput_s_per_s), but disjoint request shape (audio_length, guidance_scale, num_inference_steps, seed, negative_prompt) so a separate client is cleaner than overloading TTS.
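For readers unfamiliar with the WAV duration step, a minimal sketch using only the Python standard library is shown below. It is not the client's actual parser. The function names are hypothetical, and the definition of the RTF multiplier as generated audio seconds divided by wall-clock seconds is an assumption, inferred from the fact that the threshold is a minimum (higher is better).

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of a complete in-memory WAV file in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
        return reader.getnframes() / reader.getframerate()


def audio_rtf_mult(audio_seconds: float, e2e_seconds: float) -> float:
    """Assumed definition: seconds of audio produced per second of wall time."""
    if e2e_seconds <= 0:
        raise ValueError("e2e_seconds must be positive")
    return audio_seconds / e2e_seconds


def make_silence_wav(seconds: float, sample_rate: int = 16000) -> bytes:
    """Build a mono 16-bit silent WAV, used here only to exercise the parser."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


if __name__ == "__main__":
    clip = make_silence_wav(2.0)
    duration = wav_duration_seconds(clip)
    print(f"duration={duration:.2f}s rtf_mult={audio_rtf_mult(duration, 4.0):.2f}")
```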

New audio-generate benchmark_type wired into the dispatcher; threshold validators reuse the tts/tts-base branch since metric names match.
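The reason one validator branch can serve all three audio types is that the threshold keys are the same. The sketch below illustrates the idea in Python, although the real validator lives in vllm_omni_benchmark_test.sh. The metric key names (`rps`, `audio_rtf_mult`, `p95_e2e_ms`) are assumed here to be the threshold names with the `min_`/`max_` prefix removed, and `check_thresholds` is a hypothetical helper.

```python
from typing import Dict, List

# Benchmark types that share the same threshold names
# (min_rps, min_audio_rtf_mult, max_p95_e2e_ms).
SHARED_AUDIO_TYPES = {"tts", "tts-base", "audio-generate"}


def check_thresholds(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return a list of human-readable failures; an empty list means PASS."""
    failures = []
    for key, limit in thresholds.items():
        if key.startswith("min_"):
            name = key[len("min_"):]
            if metrics[name] < limit:
                failures.append(f"{name}={metrics[name]} is below {key}={limit}")
        elif key.startswith("max_"):
            name = key[len("max_"):]
            if metrics[name] > limit:
                failures.append(f"{name}={metrics[name]} is above {key}={limit}")
        else:
            raise ValueError(f"threshold key must start with min_ or max_: {key}")
    return failures


if __name__ == "__main__":
    # stable-audio-open-1.0 numbers from the baselining run in this PR.
    observed = {"rps": 0.141, "audio_rtf_mult": 0.706, "p95_e2e_ms": 7167}
    limits = {"min_rps": 0.10, "min_audio_rtf_mult": 0.5, "max_p95_e2e_ms": 9500}
    assert "audio-generate" in SHARED_AUDIO_TYPES
    print(check_thresholds(observed, limits) or "PASS")
```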

ICE workaround

The x86-g6e12xl-runner CodeBuild fleet (4× L40S, 192 GB) has been hitting insufficient-capacity errors (ICE) in us-west-2 since 2026-05-12, blocking the qwen2.5-omni-3b benchmark.

Add benchmark.runner-scale-sets: group in the YAML config alongside codebuild-fleet:. Move qwen2.5-omni-3b there with runner_label: gpu-l40s-4gpu-runners (same 4× L40S hardware → no threshold changes).
Expand the runner-scale-sets comment block with all 9 available k8s scale-set labels.
Extend dispatch-vllm-omni-benchmark.yml's parser to emit both matrices.
Add a parallel benchmark-runner-scale job that uses runs-on: ${{ matrix.runner_label }} (no fleet: selector), copies the model and test scripts into the container with docker cp (the host Docker daemon cannot see the pod's /dlc-models), and runs the dispatcher inside the container via docker exec (the runner pod and the Docker host are in separate network namespaces, so a host-side localhost:8080 health check does not work).
benchmark-report waits for both benchmark jobs.
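The "emit both matrices" step can be pictured with the sketch below. It is not the parser in dispatch-vllm-omni-benchmark.yml. The config layout (a `benchmark` mapping holding two lists), the `model` and `fleet` keys, and the `build_matrices` helper are all assumptions made for illustration; only the group names and `runner_label` come from the PR text.

```python
import json
from typing import Dict, List, Tuple


def build_matrices(config: Dict) -> Tuple[Dict[str, List[dict]], Dict[str, List[dict]]]:
    """Split benchmark entries into a CodeBuild-fleet matrix and a scale-set matrix."""
    bench = config.get("benchmark", {})
    fleet_entries = [dict(entry) for entry in bench.get("codebuild-fleet", [])]
    scale_entries = [dict(entry) for entry in bench.get("runner-scale-sets", [])]
    for entry in scale_entries:
        # Scale-set jobs select their runner via runs-on, so the label is required.
        if "runner_label" not in entry:
            raise ValueError(f"runner-scale-sets entry lacks runner_label: {entry}")
    return {"include": fleet_entries}, {"include": scale_entries}


if __name__ == "__main__":
    # Hypothetical, already-parsed YAML content.
    config = {
        "benchmark": {
            "codebuild-fleet": [
                {"model": "cosyvoice3-0.5b", "fleet": "x86-g6exl-runner"},
            ],
            "runner-scale-sets": [
                {"model": "qwen2.5-omni-3b", "runner_label": "gpu-l40s-4gpu-runners"},
            ],
        }
    }
    fleet_matrix, scale_matrix = build_matrices(config)
    # Each matrix would be serialized to JSON and exposed as a separate job output.
    print(json.dumps(fleet_matrix))
    print(json.dumps(scale_matrix))
```

Keeping the two matrices separate lets each job keep its own runner selection logic, so the existing CodeBuild job does not need to know about scale-set labels.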

Changes

| File | Type | What |
| --- | --- | --- |
| `.github/config/model-tests/vllm-omni-model-tests.yml` | modified | threshold adjustments, 3 new entries, runner-scale-sets group, expanded fleet-label comments |
| `.github/workflows/dispatch-vllm-omni-benchmark.yml` | modified | parse both matrices, parallel benchmark-runner-scale job |
| `test/vllm-omni/scripts/benchmark/chat_omni_benchmark_client.py` | modified | read `metrics.num_tokens_out` per upstream pattern |
| `test/vllm-omni/scripts/benchmark/audio_generate_benchmark_client.py` | new | client for `/v1/audio/generate` |
| `test/vllm-omni/scripts/vllm_omni_benchmark_test.sh` | modified | dispatch new `audio-generate` type; threshold validator branch |
| `test/vllm-omni/scripts/benchmark/README.md` | modified | document new client + token-counting behavior |

Test Plan

pre-commit run passes (ruff, ruff-format, flowmark, gh-actions lint, signoff, etc.)
Source-code level verification of serving_chat.py SSE shape across 0.18.0/0.20.0.
Devbox SSE capture on 0.18.0 confirms server reports completion_tokens: 0 (benchmark client falls back to engine counter or chunk count).
CI benchmark workflow on 0.20.0 image: 9/9 benchmarks PASS (run 25769991290 — qwen2.5-omni-3b green on gpu-l40s-4gpu-runners after the runner-scale-sets fix; all 8 CodeBuild-fleet entries green with the new thresholds).
After 0.20.1 (or whichever omni release picks up vllm-omni#3485) is integrated, re-tighten qwen3-tts-12hz-1.7b-base thresholds in a follow-up PR.
Re-introduce min_output_tps for qwen2.5-omni-3b against metrics.num_tokens_out once a baseline run on the new code path is captured.

🤖 Generated with Claude Code

qwen3-tts-12hz-1.7b-base: temporarily loosen rps/audio_rtf_mult/p95_e2e to absorb the upstream Code2Wav decode-chunk un-batching regression from vllm-omni#3203. Fix is merged as vllm-omni#3485 post-0.20.0; will re-tighten when next omni point release is picked up.

qwen2.5-omni-3b: drop min_output_tps. The SSE event stream changed in 0.20.0 (vllm-omni#3082 delegation to upstream vllm OpenAI entrypoint) so the benchmark client now counts text tokens only (~95/req) instead of text + codec frames (~5656/req in 0.18.0). The metric is no longer comparable across versions; rps / ttft p95 / e2e p95 cover the user-facing SLO without ambiguity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>

Earlier comment cited vllm-omni#3082 as the SSE-format change that caused the metric to swing across versions. Source-code review of serving_chat.py at v0.18.0 vs v0.20.0 plus a devbox SSE capture showed:

- Both versions still place audio in delta.content per yield (no documented change to chat-completions SSE shape).
- Server reports usage.completion_tokens=0 in the streamed [DONE] block on 0.18.0; the benchmark client therefore falls back to len(token_times) (a chunk count).
- Under concurrent load the per-chunk emit pattern shifts between releases enough to swing the value by ~50x (158 -> 3) on identical config, even though RPS / TTFT / e2e are unchanged.

Replace the #3082 attribution with the verified explanation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>

…d benchmark suite

Token-counting fix for chat_omni_benchmark_client.py
----------------------------------------------------
The client now reads `metrics.num_tokens_out` from each SSE chunk (the vllm-omni engine-side counter), matching upstream vllm_omni/benchmarks/patch/patch.py::async_request_openai_chat_omni_completions. This is version-stable, unlike the previous fallbacks:

* usage.completion_tokens (OpenAI standard): omni reports 0
* len(token_times) (chunk count): swings ~50× between 0.18.0 and 0.20.0 due to SSE batching changes (158 -> 3 on identical config)

Both are kept as second/third-preference fallbacks. README and YAML comments updated to reflect the stable metric.

New benchmark entries (reuse existing clients)
----------------------------------------------
cosyvoice3-0.5b    tts-base  x86-g6exl-runner
ernie-image-turbo  image     x86-g6exl-runner
wan2.1-vace-1.3b   video     x86-g6exl-runner

All three have thresholds intentionally unset; baseline on first run and tighten with the standard ~25% CI margin.

New benchmark client for stable-audio-open-1.0
----------------------------------------------
audio_generate_benchmark_client.py targets /v1/audio/generate (new endpoint in vllm-omni v0.20.0 per vllm-project/vllm-omni#1794). Uses the same async machinery, WAV duration parser, and metric set as tts_benchmark_client.py (TTFB / E2E / RTF / RPS / audio_throughput), but the request shape is disjoint (audio_length, guidance_scale, num_inference_steps, seed, negative_prompt) so a separate client is cleaner than overloading the TTS one.

New `audio-generate` benchmark_type wired into the dispatcher; threshold validators reuse the tts/tts-base branch since metric names match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>

…g6e12xl ICE

Threshold baselining (2026-05-12, vllm-omni 0.20.0)
---------------------------------------------------
First-run numbers + ~25% CI margin applied to four previously-open entries:

cosyvoice3-0.5b        rps 0.348  rtf 2.119  p95 e2e 15639ms
  -> min_rps 0.26 / min_audio_rtf_mult 1.6 / max_p95_e2e_ms 20000
stable-audio-open-1.0  rps 0.141  rtf 0.706  p95 e2e 7167ms
  -> min_rps 0.10 / min_audio_rtf_mult 0.5 / max_p95_e2e_ms 9500
ernie-image-turbo      images/s 0.067  p95 e2e 17573ms
  -> min_images_per_s 0.05 / max_p95_e2e_ms 22000
wan2.1-vace-1.3b       videos/s 0.332  p95 e2e 3010ms
  -> min_videos_per_s 0.25 / max_p95_e2e_ms 4000

All 9 benchmark entries now carry pass/fail thresholds.

ICE workaround for qwen2.5-omni-3b
----------------------------------
The x86-g6e12xl-runner CodeBuild fleet (4x L40S 192 GB) has been ICE in us-west-2 since 2026-05-12, blocking the qwen2.5-omni-3b benchmark. Mirror the SGLang/vLLM pattern of supporting both CodeBuild fleets and k8s-backed runner-scale-sets:

- Add `benchmark.runner-scale-sets:` group in vllm-omni-model-tests.yml alongside `benchmark.codebuild-fleet:`. Move qwen2.5-omni-3b there with runner_label `gpu-l40s-4gpu-runners` (same 4x L40S hardware).
- Expand the runner-scale-sets comment block to list all 9 available k8s scale-set labels and their hardware mappings.
- Extend dispatch-vllm-omni-benchmark.yml's load-benchmarks parser to emit both matrices.
- Add a parallel `benchmark-runner-scale` job that uses `runs-on: ${{ matrix.runner_label }}` (no `fleet:` selector), pins GPU access to the pod's assigned UUIDs so parallel pods don't contend, and skips `docker rmi` since the host Docker daemon is shared across pods.
- `benchmark-report` now waits for both benchmark jobs.

Same hardware class (4x L40S 192 GB) so qwen2.5-omni-3b's existing thresholds (min_rps 0.02, max_p95_ttft_ms 1500, max_p95_e2e_ms 120000) do not need to change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>

The previous benchmark-runner-scale step bind-mounted /dlc-models from the runner pod into the container via `-v /dlc-models:/models`. On a k8s-backed scale-set, the host docker daemon resolves bind-mount paths on the host filesystem, not the pod filesystem, so the container sees an empty /models, vllm-omni's omni_snapshot_download falls through to its HuggingFace path, and crashes with:

huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/models/qwen2.5-omni-3b'

Mirror the SGLang runner-scale pattern: start the container with `--entrypoint /bin/bash` so it idles instead of immediately invoking `vllm serve` with a bad path, `docker cp` the model from the pod filesystem into the container, then launch the server via `docker exec -d`. The host-side health check on localhost:8080 still works because `-p 8080:8080` is unchanged.

CodeBuild fleet jobs are unaffected: they continue to bind-mount the runner's /dlc-models since the CodeBuild docker daemon is on the same host as the runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>

… host

Previous fix (2d61941) docker-cp'd the model into the container but still ran the dispatcher (vllm_omni_benchmark_test.sh) from the runner pod, so its `curl http://localhost:8080/health` polled the runner's loopback, not the container's. The runner pod and the docker host on a k8s scale-set are separate network namespaces, so even with `-p 8080:8080` the runner can't reach the published port. Result: 600s health-check timeout, exit 1, no server logs since the docker-exec'd server was still healthy inside the container.

Mirror the SGLang scale-set path end-to-end:

- docker cp test/vllm-omni/scripts into /workspace/scripts in the container at start time
- Launch `vllm serve --omni ...` via `docker exec -d` and redirect output to /workspace/server.log
- Run the dispatcher itself via `docker exec` so all networking (curl /health, the per-modality benchmark client) uses the container's own loopback. Results are written to /workspace/benchmark_results inside the container, then docker-cp'd out for upload-artifact.
- Drop the now-unused `-p 8080:8080` publish (host can't reach it anyway).

CodeBuild fleet jobs are unaffected: they continue to bind-mount and poll localhost:8080 since the runner and the docker daemon share the same host network there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>

aws-deep-learning-containers-ci Bot added the authorized label May 12, 2026

Yadan Wei added 5 commits May 12, 2026 00:31

Yadan-Wei changed the title ~~fix(vllm-omni): adjust benchmark thresholds for 0.20.0 SSE/Code2Wav changes~~ feat(vllm-omni): align chat-omni token counting + 3 new benchmarks + ICE workaround for qwen2.5-omni-3b May 13, 2026

sirutBuasai approved these changes May 13, 2026

Yadan-Wei merged commit a85bb6a into main May 13, 2026
66 of 70 checks passed

Yadan-Wei deleted the omni-benchmark-fix branch May 13, 2026 17:27

Yadan-Wei mentioned this pull request May 13, 2026

[Docs Update] vLLM-Omni 0.20.0 #6088

Merged

5 tasks

Yadan-Wei merged 6 commits into main from omni-benchmark-fix
