feat(vllm-omni): align chat-omni token counting + 3 new benchmarks + ICE workaround for qwen2.5-omni-3b #6079
Merged
qwen3-tts-12hz-1.7b-base: temporarily loosen rps/audio_rtf_mult/p95_e2e to absorb the upstream Code2Wav decode-chunk un-batching regression from vllm-omni#3203. Fix is merged as vllm-omni#3485 post-0.20.0; will re-tighten when next omni point release is picked up.

qwen2.5-omni-3b: drop min_output_tps. The SSE event stream changed in 0.20.0 (vllm-omni#3082 delegation to upstream vllm OpenAI entrypoint) so the benchmark client now counts text tokens only (~95/req) instead of text + codec frames (~5656/req in 0.18.0). The metric is no longer comparable across versions; rps / ttft p95 / e2e p95 cover the user-facing SLO without ambiguity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
added 5 commits
May 12, 2026 00:31
Earlier comment cited vllm-omni#3082 as the SSE-format change that caused the metric to swing across versions. Source-code review of serving_chat.py at v0.18.0 vs v0.20.0 plus a devbox SSE capture showed:
- Both versions still place audio in delta.content per yield (no documented change to chat-completions SSE shape).
- Server reports usage.completion_tokens=0 in the streamed [DONE] block on 0.18.0; the benchmark client therefore falls back to len(token_times) (a chunk count).
- Under concurrent load the per-chunk emit pattern shifts between releases enough to swing the value by ~50x (158 -> 3) on identical config, even though RPS / TTFT / e2e are unchanged.

Replace the #3082 attribution with the verified explanation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…d benchmark suite
Token-counting fix for chat_omni_benchmark_client.py
----------------------------------------------------
The client now reads `metrics.num_tokens_out` from each SSE chunk —
the vllm-omni engine-side counter — matching upstream
vllm_omni/benchmarks/patch/patch.py::async_request_openai_chat_omni_completions.
This is version-stable, unlike the previous fallbacks:
* usage.completion_tokens (OpenAI standard) — omni reports 0
* len(token_times) (chunk count) — swings ~50× between 0.18.0 and 0.20.0
due to SSE batching changes (158 -> 3 on identical config)
Both are kept as second/third-preference fallbacks. README and YAML
comments updated to reflect the stable metric.
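
The precedence described above can be sketched roughly as follows. This is an illustration only, not the client's actual code: the function name `resolve_output_tokens` is hypothetical, the nesting of `metrics.num_tokens_out` and `usage.completion_tokens` inside each parsed SSE chunk is assumed, and so is the choice to let the last non-zero engine counter win.

```python
from typing import Any, Optional


def resolve_output_tokens(chunks: list[dict[str, Any]], token_times: list[float]) -> int:
    """Pick an output-token count using the three-level precedence.

    `chunks` are already-parsed SSE JSON payloads (assumed layout);
    `token_times` holds one timestamp per received content chunk.
    """
    engine_count: Optional[int] = None
    usage_count: Optional[int] = None

    for chunk in chunks:
        metrics = chunk.get("metrics") or {}
        if metrics.get("num_tokens_out"):
            # Assumption: the engine counter is cumulative, so keep the latest value.
            engine_count = int(metrics["num_tokens_out"])
        usage = chunk.get("usage") or {}
        if usage.get("completion_tokens"):
            usage_count = int(usage["completion_tokens"])

    if engine_count:  # 1st preference: engine-side counter
        return engine_count
    if usage_count:  # 2nd preference: OpenAI-standard usage block
        return usage_count
    return len(token_times)  # last resort: chunk count
```

A zero in either counter is treated as "not reported", which is what lets the chain fall through when omni returns `completion_tokens=0`.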
New benchmark entries (reuse existing clients)
----------------------------------------------
  cosyvoice3-0.5b     tts-base  x86-g6exl-runner
  ernie-image-turbo   image     x86-g6exl-runner
  wan2.1-vace-1.3b    video     x86-g6exl-runner
All three have thresholds intentionally unset; baseline on first run
and tighten with the standard ~25% CI margin.
New benchmark client for stable-audio-open-1.0
----------------------------------------------
audio_generate_benchmark_client.py targets /v1/audio/generate (new
endpoint in vllm-omni v0.20.0 per vllm-project/vllm-omni#1794). Uses
the same async machinery, WAV duration parser, and metric set as
tts_benchmark_client.py (TTFB / E2E / RTF / RPS / audio_throughput),
but the request shape is disjoint (audio_length, guidance_scale,
num_inference_steps, seed, negative_prompt) so a separate client is
cleaner than overloading the TTS one. New `audio-generate`
benchmark_type wired into the dispatcher; threshold validators reuse
the tts/tts-base branch since metric names match.
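
As a rough illustration of the shared WAV-duration and RTF pieces mentioned above, here is a minimal sketch using only the standard library. The helper names are hypothetical, and the definition of `audio_rtf_mult` as seconds of audio produced per wall-clock second is an assumption based on it being a `min_` threshold; the real clients may compute it differently.

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an in-memory WAV payload from its header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def audio_rtf_mult(audio_seconds: float, wall_seconds: float) -> float:
    """Assumed definition: audio seconds generated per wall-clock second."""
    return audio_seconds / wall_seconds


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> bytes:
    """Build a mono 16-bit silent WAV, used here only to exercise the parser."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()


if __name__ == "__main__":
    payload = make_silent_wav(2.0)
    duration = wav_duration_seconds(payload)
    print(duration, audio_rtf_mult(duration, wall_seconds=4.0))
```

Header-based parsing assumes the response carries a complete WAV with a correct frame count; a streamed WAV with a placeholder length would need a different approach.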
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…g6e12xl ICE
Threshold baselining (2026-05-12, vllm-omni 0.20.0)
---------------------------------------------------
First-run numbers + ~25% CI margin applied to four previously-open entries:
  cosyvoice3-0.5b        rps 0.348  rtf 2.119  p95 e2e 15639ms
    -> min_rps 0.26 / min_audio_rtf_mult 1.6 / max_p95_e2e_ms 20000
  stable-audio-open-1.0  rps 0.141  rtf 0.706  p95 e2e 7167ms
    -> min_rps 0.10 / min_audio_rtf_mult 0.5 / max_p95_e2e_ms 9500
  ernie-image-turbo      images/s 0.067  p95 e2e 17573ms
    -> min_images_per_s 0.05 / max_p95_e2e_ms 22000
  wan2.1-vace-1.3b       videos/s 0.332  p95 e2e 3010ms
    -> min_videos_per_s 0.25 / max_p95_e2e_ms 4000
All 9 benchmark entries now carry pass/fail thresholds.
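
For reference, the margin arithmetic behind these thresholds can be reproduced with a short sketch. The helper names are hypothetical, and the exact 25% factor is an assumption: the commit says "~25%", and the committed values appear to be rounded by hand, so they will not match the raw output exactly.

```python
MARGIN = 0.25  # the "~25% CI margin" from the commit message


def floor_threshold(measured: float, margin: float = MARGIN) -> float:
    """Lower bound for higher-is-better metrics (rps, rtf multiplier)."""
    return measured * (1.0 - margin)


def ceiling_threshold(measured: float, margin: float = MARGIN) -> float:
    """Upper bound for lower-is-better metrics (latency)."""
    return measured * (1.0 + margin)


# First-run numbers for cosyvoice3-0.5b, taken from the list above.
measured = {"rps": 0.348, "audio_rtf_mult": 2.119, "p95_e2e_ms": 15639.0}

raw = {
    "min_rps": floor_threshold(measured["rps"]),
    "min_audio_rtf_mult": floor_threshold(measured["audio_rtf_mult"]),
    "max_p95_e2e_ms": ceiling_threshold(measured["p95_e2e_ms"]),
}

for name, value in raw.items():
    print(f"{name}: {value:.3f}")
```

The raw results (about 0.261, 1.589, and 19549 ms) sit close to the committed 0.26 / 1.6 / 20000.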
ICE workaround for qwen2.5-omni-3b
----------------------------------
The x86-g6e12xl-runner CodeBuild fleet (4x L40S 192 GB) has been hitting
ICE (insufficient capacity errors) in us-west-2 since 2026-05-12,
blocking the qwen2.5-omni-3b benchmark.
Mirror the SGLang/vLLM pattern of supporting both CodeBuild fleets and
k8s-backed runner-scale-sets:
- Add `benchmark.runner-scale-sets:` group in vllm-omni-model-tests.yml
alongside `benchmark.codebuild-fleet:`. Move qwen2.5-omni-3b there
with runner_label `gpu-l40s-4gpu-runners` (same 4x L40S hardware).
- Expand the runner-scale-sets comment block to list all 9 available
k8s scale-set labels and their hardware mappings.
- Extend dispatch-vllm-omni-benchmark.yml's load-benchmarks parser to
emit both matrices.
- Add a parallel `benchmark-runner-scale` job that uses
`runs-on: ${{ matrix.runner_label }}` (no `fleet:` selector), pins
GPU access to the pod's assigned UUIDs so parallel pods don't
contend, and skips `docker rmi` since the host Docker daemon is
shared across pods.
- `benchmark-report` now waits for both benchmark jobs.
Same hardware class (4x L40S 192 GB) so qwen2.5-omni-3b's existing
thresholds (min_rps 0.02, max_p95_ttft_ms 1500, max_p95_e2e_ms 120000)
do not need to change.
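
The two-matrix split performed by the load-benchmarks parser might look something like the sketch below. It is an assumption-laden illustration, not the workflow's real parser: the function name `build_matrices`, the output keys, and the per-entry fields other than `runner_label` are hypothetical, and the config is passed in as an already-parsed dict rather than read from YAML.

```python
from typing import Any


def build_matrices(benchmark_cfg: dict[str, Any]) -> dict[str, list[dict[str, Any]]]:
    """Split the `benchmark:` config into one job matrix per runner backend."""
    fleet_entries = list(benchmark_cfg.get("codebuild-fleet") or [])
    scale_entries = list(benchmark_cfg.get("runner-scale-sets") or [])

    # Scale-set jobs select their runner via `runs-on: <runner_label>`,
    # so every entry in that group needs a label.
    for entry in scale_entries:
        if "runner_label" not in entry:
            raise ValueError(f"runner-scale-sets entry missing runner_label: {entry}")

    return {"codebuild": fleet_entries, "runner_scale": scale_entries}


if __name__ == "__main__":
    cfg = {
        "codebuild-fleet": [
            {"name": "cosyvoice3-0.5b", "fleet": "x86-g6exl-runner"},
        ],
        "runner-scale-sets": [
            {"name": "qwen2.5-omni-3b", "runner_label": "gpu-l40s-4gpu-runners"},
        ],
    }
    print(build_matrices(cfg))
```

Keeping the two groups as separate matrices lets each workflow job apply its own setup (bind-mount for CodeBuild, `docker cp` for scale-sets) without per-entry branching.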
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The previous benchmark-runner-scale step bind-mounted /dlc-models from the runner pod into the container via `-v /dlc-models:/models`. On a k8s-backed scale-set, the host docker daemon resolves bind-mount paths on the host filesystem, not the pod filesystem, so the container sees an empty /models, vllm-omni's omni_snapshot_download falls through to its HuggingFace path, and crashes with:

    huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/models/qwen2.5-omni-3b'

Mirror the SGLang runner-scale pattern: start the container with `--entrypoint /bin/bash` so it idles instead of immediately invoking `vllm serve` with a bad path, `docker cp` the model from the pod filesystem into the container, then launch the server via `docker exec -d`. The host-side health check on localhost:8080 still works because `-p 8080:8080` is unchanged.

CodeBuild fleet jobs are unaffected: they continue to bind-mount the runner's /dlc-models since the CodeBuild docker daemon is on the same host as the runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
… host

Previous fix (2d61941) docker-cp'd the model into the container but still ran the dispatcher (vllm_omni_benchmark_test.sh) from the runner pod, so its `curl http://localhost:8080/health` polled the runner's loopback, not the container's. The runner pod and the docker host on a k8s scale-set are separate network namespaces, so even with `-p 8080:8080` the runner can't reach the published port. Result: 600s health-check timeout, exit 1, no server logs since the docker-exec'd server was still healthy inside the container.

Mirror the SGLang scale-set path end-to-end:
- docker cp test/vllm-omni/scripts into /workspace/scripts in the container at start time
- Launch `vllm serve --omni ...` via `docker exec -d` and redirect output to /workspace/server.log
- Run the dispatcher itself via `docker exec` so all networking (curl /health, the per-modality benchmark client) uses the container's own loopback. Results are written to /workspace/benchmark_results inside the container, then docker-cp'd out for upload-artifact.
- Drop the now-unused `-p 8080:8080` publish (host can't reach it anyway).

CodeBuild fleet jobs are unaffected: they continue to bind-mount and poll localhost:8080 since the runner and the docker daemon share the same host network there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
sirutBuasai
approved these changes
May 13, 2026
Summary
Comprehensive update to the vllm-omni benchmark suite for the 0.20.0 release. Six commits, four logical changes:

- Temporarily loosen thresholds for `qwen3-tts-12hz-1.7b-base` (real upstream Code2Wav regression) and drop `min_output_tps` for `qwen2.5-omni-3b` (chunk-count fallback was unreliable across versions).
- Read `metrics.num_tokens_out` from each SSE chunk (vllm-omni engine counter, version-stable) instead of falling back to chunk count. Matches `vllm_omni/benchmarks/patch/patch.py::async_request_openai_chat_omni_completions`.
- Three new benchmark entries (`cosyvoice3-0.5b`, `ernie-image-turbo`, `wan2.1-vace-1.3b`) plus a new `audio_generate_benchmark_client.py` + `audio-generate` benchmark type for `stable-audio-open-1.0` on the new `/v1/audio/generate` endpoint. All four baselined to real numbers with ~25% CI margin.
- Work around the `x86-g6e12xl-runner` ICE by adding a `runner-scale-sets:` group at peer level with `codebuild-fleet:` and a parallel `benchmark-runner-scale` workflow job that uses `gpu-l40s-4gpu-runners` (same 4× L40S hardware). Mirrors the SGLang/vLLM scale-set pattern.

Threshold changes
`qwen3-tts-12hz-1.7b-base`: temporary loosening (real upstream regression)

Affected thresholds: `min_rps`, `min_audio_rtf_mult`, `max_p95_e2e_ms`.

Root cause is upstream vllm-omni#3203 un-batching Code2Wav decode chunks. The fix is merged as vllm-omni#3485 post-0.20.0; we'll re-tighten when the next omni point release is picked up.
`qwen2.5-omni-3b`: drop `min_output_tps`

The metric collapsed ~50× across releases (158 → 3) on identical config because it measures SSE chunk count, not real tokens. The server reports `usage.completion_tokens=0` (verified via devbox SSE capture on 0.18.0). RPS / TTFT p95 / E2E p95 are stable and cover the user-facing SLO. After this PR, the client also reads `metrics.num_tokens_out` (engine counter, version-stable), so a future PR can re-introduce `min_output_tps` against a stable metric.

Four new entries baselined (2026-05-12, vllm-omni 0.20.0, ~25% CI margin)

- `cosyvoice3-0.5b`: min_rps 0.26 / min_audio_rtf_mult 1.6 / max_p95_e2e_ms 20000
- `stable-audio-open-1.0`: min_rps 0.10 / min_audio_rtf_mult 0.5 / max_p95_e2e_ms 9500
- `ernie-image-turbo`: min_images_per_s 0.05 / max_p95_e2e_ms 22000
- `wan2.1-vace-1.3b`: min_videos_per_s 0.25 / max_p95_e2e_ms 4000

All 9 benchmark entries now have pass/fail thresholds.
Token-counting alignment with upstream
`chat_omni_benchmark_client.py` previously counted SSE chunks when `usage.completion_tokens` was 0 (which omni always reports). Chunk count depends on SSE batching and swung 50× between 0.18.0 and 0.20.0 on identical workloads.

New precedence chain, matching upstream `vllm_omni/benchmarks/patch/patch.py`:

1. `metrics.num_tokens_out` from each SSE chunk (vllm-omni engine counter, version-stable)
2. `usage.completion_tokens` (OpenAI standard, omni reports 0)
3. `len(token_times)` (chunk count fallback, last resort)

New benchmark client
`audio_generate_benchmark_client.py` (~250 LOC) targets `/v1/audio/generate`, the diffusion-based audio endpoint introduced in vllm-omni v0.20.0 (#1794). It uses the same async machinery, WAV duration parsing, and metric set as `tts_benchmark_client.py` (TTFB / E2E / RTF / RPS / audio_throughput_s_per_s), but the request shape is disjoint (`audio_length`, `guidance_scale`, `num_inference_steps`, `seed`, `negative_prompt`), so a separate client is cleaner than overloading the TTS one.

New `audio-generate` benchmark_type wired into the dispatcher; threshold validators reuse the tts/tts-base branch since metric names match.

ICE workaround
The `x86-g6e12xl-runner` CodeBuild fleet (4× L40S 192 GB) has been ICE in us-west-2 since 2026-05-12, blocking the qwen2.5-omni-3b benchmark.

- Add a `benchmark.runner-scale-sets:` group in the YAML config alongside `codebuild-fleet:`. Move qwen2.5-omni-3b there with `runner_label: gpu-l40s-4gpu-runners` (same 4× L40S hardware → no threshold changes).
- Extend `dispatch-vllm-omni-benchmark.yml`'s parser to emit both matrices.
- Add a parallel `benchmark-runner-scale` job that uses `runs-on: ${{ matrix.runner_label }}` (no `fleet:` selector), `docker cp`s the model + test scripts into the container (host docker daemon doesn't see the pod's `/dlc-models`), and runs the dispatcher inside the container via `docker exec` (runner pod ↔ docker host are separate network namespaces, so a host-side `localhost:8080` health check doesn't work).
- `benchmark-report` waits for both benchmark jobs.

Changes
- `.github/config/model-tests/vllm-omni-model-tests.yml`
- `.github/workflows/dispatch-vllm-omni-benchmark.yml`
- `test/vllm-omni/scripts/benchmark/chat_omni_benchmark_client.py`: … `metrics.num_tokens_out` per upstream pattern
- `test/vllm-omni/scripts/benchmark/audio_generate_benchmark_client.py`: … `/v1/audio/generate`
- `test/vllm-omni/scripts/vllm_omni_benchmark_test.sh`: … `audio-generate` type; threshold validator branch
- `test/vllm-omni/scripts/benchmark/README.md`

Test Plan
- `pre-commit run` passes (ruff, ruff-format, flowmark, gh-actions lint, signoff, etc.)
- … `serving_chat.py` SSE shape across 0.18.0/0.20.0.
- … `completion_tokens: 0` (benchmark client falls back to engine counter or chunk count).
- … `gpu-l40s-4gpu-runners` after the runner-scale-sets fix; all 8 CodeBuild-fleet entries green with the new thresholds).
- Re-tighten `qwen3-tts-12hz-1.7b-base` thresholds in a follow-up PR.
- Re-introduce `min_output_tps` for `qwen2.5-omni-3b` against `metrics.num_tokens_out` once a baseline run on the new code path is captured.

🤖 Generated with Claude Code