
feat(vllm-omni): align chat-omni token counting + 3 new benchmarks + ICE workaround for qwen2.5-omni-3b#6079

Merged: Yadan-Wei merged 6 commits into main from omni-benchmark-fix on May 13, 2026

@Yadan-Wei Yadan-Wei commented May 12, 2026

Summary

Comprehensive update to the vllm-omni benchmark suite for the 0.20.0 release. Six commits, four logical changes:

  1. Adjust thresholds for qwen3-tts-12hz-1.7b-base (a real upstream Code2Wav regression) and drop min_output_tps for qwen2.5-omni-3b (the chunk-count fallback was unreliable across versions).
  2. Align chat-omni token counting with upstream — read metrics.num_tokens_out from each SSE chunk (vllm-omni engine counter, version-stable) instead of falling back to chunk count. Matches vllm_omni/benchmarks/patch/patch.py::async_request_openai_chat_omni_completions.
  3. Expand the benchmark suite with three new models reusing existing clients (cosyvoice3-0.5b, ernie-image-turbo, wan2.1-vace-1.3b) plus a new audio_generate_benchmark_client.py + audio-generate benchmark type for stable-audio-open-1.0 on the new /v1/audio/generate endpoint. All four baselined to real numbers with ~25% CI margin.
  4. Route qwen2.5-omni-3b around x86-g6e12xl-runner ICE by adding a runner-scale-sets: group at peer level with codebuild-fleet: and a parallel benchmark-runner-scale workflow job that uses gpu-l40s-4gpu-runners (same 4× L40S hardware). Mirrors the SGLang/vLLM scale-set pattern.

Threshold changes

qwen3-tts-12hz-1.7b-base — temporary loosening (real upstream regression)

| Threshold | Before | After |
| --- | --- | --- |
| min_rps | 0.4 | 0.27 |
| min_audio_rtf_mult | 1.6 | 1.0 |
| max_p95_e2e_ms | 11000 | 17000 |

Root cause is upstream vllm-omni#3203 un-batching Code2Wav decode chunks. Fix is merged as vllm-omni#3485 post-0.20.0 — we'll re-tighten when the next omni point release is picked up.
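
As a rough illustration of how min_/max_ thresholds like the ones above gate a run, here is a minimal sketch. It is not the validator in vllm_omni_benchmark_test.sh; the function name `check_thresholds`, the convention of stripping the `min_`/`max_` prefix to find the metric, and the sample results are all assumptions made for the example.

```python
from typing import Dict, List


def check_thresholds(observed: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return human-readable failures; an empty list means PASS.

    Assumption: a threshold named "min_rps" guards the metric "rps",
    and "max_p95_e2e_ms" guards "p95_e2e_ms".
    """
    failures: List[str] = []
    for name, limit in thresholds.items():
        kind, _, metric = name.partition("_")  # "min_rps" -> ("min", "_", "rps")
        if kind not in ("min", "max"):
            raise ValueError(f"unknown threshold kind: {name}")
        if metric not in observed:
            failures.append(f"{name}: metric '{metric}' missing from results")
            continue
        value = observed[metric]
        if kind == "min" and value < limit:
            failures.append(f"{name}: {value} is below {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}: {value} is above {limit}")
    return failures


# "After" thresholds for qwen3-tts-12hz-1.7b-base, as listed in this PR.
QWEN3_TTS_THRESHOLDS = {
    "min_rps": 0.27,
    "min_audio_rtf_mult": 1.0,
    "max_p95_e2e_ms": 17000,
}

# Hypothetical run results, for illustration only (not measured numbers).
sample_results = {"rps": 0.30, "audio_rtf_mult": 1.2, "p95_e2e_ms": 15000}

if __name__ == "__main__":
    problems = check_thresholds(sample_results, QWEN3_TTS_THRESHOLDS)
    print("PASS" if not problems else "FAIL: " + "; ".join(problems))
```

With the old min_rps of 0.4, the same hypothetical run would fail on rps alone, which is the kind of failure the temporary loosening is meant to absorb.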

qwen2.5-omni-3b — drop min_output_tps

The metric collapsed ~50× across releases (158 → 3) on identical config because it's measuring SSE chunk count, not real tokens. Server reports usage.completion_tokens=0 (verified via devbox SSE capture on 0.18.0). RPS / TTFT p95 / E2E p95 are stable and cover the user-facing SLO. After this PR, the client also reads metrics.num_tokens_out (engine counter, version-stable) so a future PR can re-introduce min_output_tps against a stable metric.

Four new entries baselined (2026-05-12, vllm-omni 0.20.0, ~25% CI margin)

| Model | Type | Fleet | Observed → threshold |
| --- | --- | --- | --- |
| cosyvoice3-0.5b | tts-base | x86-g6exl-runner | rps 0.348 → 0.26, rtf 2.119 → 1.6, p95 e2e 15639ms → 20000 |
| stable-audio-open-1.0 | audio-generate (new) | x86-g6xl-runner | rps 0.141 → 0.10, rtf 0.706 → 0.5, p95 e2e 7167ms → 9500 |
| ernie-image-turbo | image | x86-g6exl-runner | images/s 0.067 → 0.05, p95 e2e 17573ms → 22000 |
| wan2.1-vace-1.3b | video | x86-g6exl-runner | videos/s 0.332 → 0.25, p95 e2e 3010ms → 4000 |
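
To make the "~25% CI margin" concrete, the sketch below recomputes the effective margin for each observed/threshold pair listed above. The helper name `effective_margin` is an assumption for this example; the numbers are copied from the baselines in this PR.

```python
def effective_margin(observed: float, threshold: float, kind: str) -> float:
    """Fractional headroom between an observed value and its threshold.

    "min" thresholds sit below the observed value (throughput-style metrics);
    "max" thresholds sit above it (latency-style metrics).
    """
    if kind == "min":
        return 1.0 - threshold / observed
    if kind == "max":
        return threshold / observed - 1.0
    raise ValueError(f"unknown threshold kind: {kind}")


# (model, metric, kind, observed, threshold), values taken from this PR.
BASELINES = [
    ("cosyvoice3-0.5b", "rps", "min", 0.348, 0.26),
    ("cosyvoice3-0.5b", "rtf", "min", 2.119, 1.6),
    ("cosyvoice3-0.5b", "p95_e2e_ms", "max", 15639, 20000),
    ("stable-audio-open-1.0", "rps", "min", 0.141, 0.10),
    ("stable-audio-open-1.0", "rtf", "min", 0.706, 0.5),
    ("stable-audio-open-1.0", "p95_e2e_ms", "max", 7167, 9500),
    ("ernie-image-turbo", "images_per_s", "min", 0.067, 0.05),
    ("ernie-image-turbo", "p95_e2e_ms", "max", 17573, 22000),
    ("wan2.1-vace-1.3b", "videos_per_s", "min", 0.332, 0.25),
    ("wan2.1-vace-1.3b", "p95_e2e_ms", "max", 3010, 4000),
]

if __name__ == "__main__":
    for model, metric, kind, observed, threshold in BASELINES:
        margin = effective_margin(observed, threshold, kind)
        print(f"{model:24s} {metric:14s} {kind} margin {margin:6.1%}")
```

Because the thresholds are rounded to convenient values, the effective margins land between roughly 24% and 33% rather than at exactly 25%.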

All 9 benchmark entries now have pass/fail thresholds.

Token-counting alignment with upstream

chat_omni_benchmark_client.py previously counted SSE chunks when usage.completion_tokens was 0 (which omni always reports). Chunk count depends on SSE batching and swung 50× between 0.18.0 and 0.20.0 on identical workloads.

New precedence chain — matches upstream vllm_omni/benchmarks/patch/patch.py:

  1. metrics.num_tokens_out from each SSE chunk (vllm-omni engine counter, version-stable)
  2. usage.completion_tokens (OpenAI standard, omni reports 0)
  3. len(token_times) (chunk count fallback, last resort)
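
The precedence chain above can be sketched as follows. This is an illustration rather than the code in chat_omni_benchmark_client.py or upstream patch.py: the function names are hypothetical, and it assumes `metrics.num_tokens_out` is a cumulative counter so the last value seen wins (if it were a per-chunk delta, the values would need to be summed instead). A value of 0 is treated the same as a missing value at each level.

```python
import json
from typing import Iterable, List, Optional


def parse_sse_line(line: str) -> Optional[dict]:
    """Decode one 'data: {...}' SSE line; return None for anything else."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None
    return json.loads(payload)


def resolve_output_tokens(sse_lines: Iterable[str], token_times: List[float]) -> int:
    """Pick the output-token count using the three-level precedence chain."""
    engine_tokens = 0  # 1. metrics.num_tokens_out (engine counter)
    usage_tokens = 0   # 2. usage.completion_tokens (OpenAI standard)
    for line in sse_lines:
        chunk = parse_sse_line(line)
        if chunk is None:
            continue
        metrics = chunk.get("metrics") or {}
        if metrics.get("num_tokens_out") is not None:
            # Assumption: cumulative counter, so keep the most recent value.
            engine_tokens = int(metrics["num_tokens_out"])
        usage = chunk.get("usage") or {}
        if usage.get("completion_tokens"):
            usage_tokens = int(usage["completion_tokens"])
    if engine_tokens:
        return engine_tokens
    if usage_tokens:
        return usage_tokens
    return len(token_times)  # 3. chunk-count fallback, last resort


if __name__ == "__main__":
    stream = [
        'data: {"metrics": {"num_tokens_out": 40}}',
        'data: {"metrics": {"num_tokens_out": 95}, "usage": {"completion_tokens": 0}}',
        "data: [DONE]",
    ]
    print(resolve_output_tokens(stream, token_times=[0.1, 0.2, 0.3]))
```

Keeping the two older sources as fallbacks means the client still produces a number against servers that do not emit the `metrics` field, at the cost of that number being less comparable across versions.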

New benchmark client

audio_generate_benchmark_client.py (~250 LOC) targets /v1/audio/generate, the diffusion-based audio endpoint introduced in vllm-omni v0.20.0 (#1794). Same async machinery + WAV duration parsing + metric set as tts_benchmark_client.py (TTFB / E2E / RTF / RPS / audio_throughput_s_per_s), but disjoint request shape (audio_length, guidance_scale, num_inference_steps, seed, negative_prompt) so a separate client is cleaner than overloading TTS.

New audio-generate benchmark_type wired into the dispatcher; threshold validators reuse the tts/tts-base branch since metric names match.
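
For readers unfamiliar with the pieces shared with the TTS client, here is a minimal sketch of the request shape and the WAV-duration/RTF arithmetic. It is not the contents of audio_generate_benchmark_client.py. The five request fields are the ones named above; the `prompt` key, all default values, the unit of `audio_length`, and the definition of the RTF multiple as audio seconds per wall-clock second are assumptions.

```python
import io
import wave


def build_audio_generate_payload(prompt: str, audio_length: float, seed: int = 0) -> dict:
    """Request body for /v1/audio/generate (values are placeholders)."""
    return {
        "prompt": prompt,              # assumed key name, not confirmed by the PR
        "audio_length": audio_length,  # assumed to be seconds
        "guidance_scale": 7.0,         # placeholder value
        "num_inference_steps": 50,     # placeholder value
        "seed": seed,
        "negative_prompt": "",
    }


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of a WAV payload, computed from its header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def audio_rtf_mult(audio_seconds: float, e2e_seconds: float) -> float:
    """Assumed definition: seconds of audio produced per second of wall time."""
    return audio_seconds / e2e_seconds


def silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Build a mono 16-bit silent WAV in memory, used only for the demo."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


if __name__ == "__main__":
    payload = build_audio_generate_payload("rain on a tin roof", audio_length=5.0)
    print(sorted(payload))
    duration = wav_duration_seconds(silent_wav(2.0))
    print(duration, audio_rtf_mult(duration, e2e_seconds=4.0))
```

Under this assumed definition, a multiple below 1.0 means generation is slower than real time, which is consistent with the min_audio_rtf_mult thresholds being lower bounds.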

ICE workaround

The x86-g6e12xl-runner CodeBuild fleet (4× L40S 192 GB) has been hitting insufficient-capacity errors (ICE) in us-west-2 since 2026-05-12, blocking the qwen2.5-omni-3b benchmark.

  • Add benchmark.runner-scale-sets: group in the YAML config alongside codebuild-fleet:. Move qwen2.5-omni-3b there with runner_label: gpu-l40s-4gpu-runners (same 4× L40S hardware → no threshold changes).
  • Expand the runner-scale-sets comment block with all 9 available k8s scale-set labels.
  • Extend dispatch-vllm-omni-benchmark.yml's parser to emit both matrices.
  • Add a parallel benchmark-runner-scale job that uses runs-on: ${{ matrix.runner_label }} (no fleet: selector), copies the model and test scripts into the container with docker cp (the host docker daemon doesn't see the pod's /dlc-models), and runs the dispatcher inside the container via docker exec (the runner pod and the docker host are in separate network namespaces, so a host-side localhost:8080 health check doesn't work).
  • benchmark-report waits for both benchmark jobs.
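
The "emit both matrices" step can be pictured with the sketch below. It does not reproduce the parser in dispatch-vllm-omni-benchmark.yml: the config layout (two lists under `benchmark`), the entry keys `name`, `type`, and `fleet`, and the function name are assumptions; only the group names, `runner_label`, and the labels shown come from this PR. The config is an inline dict so the example runs without a YAML library.

```python
import json
from typing import List, Tuple


def build_matrices(config: dict) -> Tuple[List[dict], List[dict]]:
    """Split benchmark entries into a CodeBuild matrix and a scale-set matrix."""
    benchmark = config.get("benchmark") or {}
    fleet_matrix = list(benchmark.get("codebuild-fleet") or [])
    scale_matrix = list(benchmark.get("runner-scale-sets") or [])
    for entry in fleet_matrix:
        if "fleet" not in entry:
            raise ValueError(f"codebuild-fleet entry needs 'fleet': {entry}")
    for entry in scale_matrix:
        if "runner_label" not in entry:
            raise ValueError(f"runner-scale-sets entry needs 'runner_label': {entry}")
    return fleet_matrix, scale_matrix


# Hypothetical, trimmed-down config with one entry per group.
EXAMPLE_CONFIG = {
    "benchmark": {
        "codebuild-fleet": [
            {"name": "cosyvoice3-0.5b", "type": "tts-base", "fleet": "x86-g6exl-runner"},
        ],
        "runner-scale-sets": [
            {"name": "qwen2.5-omni-3b", "type": "chat-omni",
             "runner_label": "gpu-l40s-4gpu-runners"},
        ],
    }
}

if __name__ == "__main__":
    fleet, scale = build_matrices(EXAMPLE_CONFIG)
    # Each JSON document could feed a separate job's strategy.matrix.
    print(json.dumps({"include": fleet}))
    print(json.dumps({"include": scale}))
```

Emitting two separate matrices lets each job keep its own `runs-on` shape, and an empty group simply yields an empty list for that job.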

Changes

| File | Type | What |
| --- | --- | --- |
| .github/config/model-tests/vllm-omni-model-tests.yml | modified | threshold adjustments, 3 new entries, runner-scale-sets group, expanded fleet-label comments |
| .github/workflows/dispatch-vllm-omni-benchmark.yml | modified | parse both matrices, parallel benchmark-runner-scale job |
| test/vllm-omni/scripts/benchmark/chat_omni_benchmark_client.py | modified | read metrics.num_tokens_out per upstream pattern |
| test/vllm-omni/scripts/benchmark/audio_generate_benchmark_client.py | new | client for /v1/audio/generate |
| test/vllm-omni/scripts/vllm_omni_benchmark_test.sh | modified | dispatch new audio-generate type; threshold validator branch |
| test/vllm-omni/scripts/benchmark/README.md | modified | document new client + token-counting behavior |

Test Plan

  • pre-commit run passes (ruff, ruff-format, flowmark, gh-actions lint, signoff, etc.)
  • Source-code level verification of serving_chat.py SSE shape across 0.18.0/0.20.0.
  • Devbox SSE capture on 0.18.0 confirms server reports completion_tokens: 0 (benchmark client falls back to engine counter or chunk count).
  • CI benchmark workflow on 0.20.0 image: 9/9 benchmarks PASS (run 25769991290 — qwen2.5-omni-3b green on gpu-l40s-4gpu-runners after the runner-scale-sets fix; all 8 CodeBuild-fleet entries green with the new thresholds).
  • After 0.20.1 (or whichever omni release picks up vllm-omni#3485) is integrated, re-tighten qwen3-tts-12hz-1.7b-base thresholds in a follow-up PR.
  • Re-introduce min_output_tps for qwen2.5-omni-3b against metrics.num_tokens_out once a baseline run on the new code path is captured.

🤖 Generated with Claude Code

qwen3-tts-12hz-1.7b-base: temporarily loosen rps/audio_rtf_mult/p95_e2e
to absorb the upstream Code2Wav decode-chunk un-batching regression
from vllm-omni#3203. Fix is merged as vllm-omni#3485 post-0.20.0; will
re-tighten when next omni point release is picked up.

qwen2.5-omni-3b: drop min_output_tps. The SSE event stream changed in
0.20.0 (vllm-omni#3082 delegation to upstream vllm OpenAI entrypoint)
so the benchmark client now counts text tokens only (~95/req) instead
of text + codec frames (~5656/req in 0.18.0). The metric is no longer
comparable across versions; rps / ttft p95 / e2e p95 cover the
user-facing SLO without ambiguity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Yadan Wei added 5 commits May 12, 2026 00:31
Earlier comment cited vllm-omni#3082 as the SSE-format change that
caused the metric to swing across versions. Source-code review of
serving_chat.py at v0.18.0 vs v0.20.0 plus a devbox SSE capture
showed:

- Both versions still place audio in delta.content per yield (no
  documented change to chat-completions SSE shape).
- Server reports usage.completion_tokens=0 in the streamed [DONE]
  block on 0.18.0; the benchmark client therefore falls back to
  len(token_times) (a chunk count).
- Under concurrent load the per-chunk emit pattern shifts between
  releases enough to swing the value by ~50x (158 -> 3) on identical
  config, even though RPS / TTFT / e2e are unchanged.

Replace the #3082 attribution with the verified explanation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…d benchmark suite

Token-counting fix for chat_omni_benchmark_client.py
----------------------------------------------------
The client now reads `metrics.num_tokens_out` from each SSE chunk —
the vllm-omni engine-side counter — matching upstream
vllm_omni/benchmarks/patch/patch.py::async_request_openai_chat_omni_completions.
This is version-stable, unlike the previous fallbacks:
  * usage.completion_tokens (OpenAI standard) — omni reports 0
  * len(token_times) (chunk count) — swings ~50× between 0.18.0 and 0.20.0
    due to SSE batching changes (158 -> 3 on identical config)
Both are kept as second/third-preference fallbacks. README and YAML
comments updated to reflect the stable metric.

New benchmark entries (reuse existing clients)
----------------------------------------------
  cosyvoice3-0.5b      tts-base    x86-g6exl-runner
  ernie-image-turbo    image       x86-g6exl-runner
  wan2.1-vace-1.3b     video       x86-g6exl-runner

All three have thresholds intentionally unset; baseline on first run
and tighten with the standard ~25% CI margin.

New benchmark client for stable-audio-open-1.0
----------------------------------------------
audio_generate_benchmark_client.py targets /v1/audio/generate (new
endpoint in vllm-omni v0.20.0 per vllm-project/vllm-omni#1794). Uses
the same async machinery, WAV duration parser, and metric set as
tts_benchmark_client.py (TTFB / E2E / RTF / RPS / audio_throughput),
but the request shape is disjoint (audio_length, guidance_scale,
num_inference_steps, seed, negative_prompt) so a separate client is
cleaner than overloading the TTS one. New `audio-generate`
benchmark_type wired into the dispatcher; threshold validators reuse
the tts/tts-base branch since metric names match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…g6e12xl ICE

Threshold baselining (2026-05-12, vllm-omni 0.20.0)
---------------------------------------------------
First-run numbers + ~25% CI margin applied to four previously-open entries:

  cosyvoice3-0.5b           rps 0.348  rtf 2.119  p95 e2e 15639ms
                             -> min_rps 0.26 / min_audio_rtf_mult 1.6 / max_p95_e2e_ms 20000
  stable-audio-open-1.0     rps 0.141  rtf 0.706  p95 e2e 7167ms
                             -> min_rps 0.10 / min_audio_rtf_mult 0.5 / max_p95_e2e_ms 9500
  ernie-image-turbo         images/s 0.067  p95 e2e 17573ms
                             -> min_images_per_s 0.05 / max_p95_e2e_ms 22000
  wan2.1-vace-1.3b          videos/s 0.332  p95 e2e 3010ms
                             -> min_videos_per_s 0.25 / max_p95_e2e_ms 4000

All 9 benchmark entries now carry pass/fail thresholds.

ICE workaround for qwen2.5-omni-3b
----------------------------------
The x86-g6e12xl-runner CodeBuild fleet (4x L40S 192 GB) has been ICE in
us-west-2 since 2026-05-12, blocking the qwen2.5-omni-3b benchmark.

Mirror the SGLang/vLLM pattern of supporting both CodeBuild fleets and
k8s-backed runner-scale-sets:

- Add `benchmark.runner-scale-sets:` group in vllm-omni-model-tests.yml
  alongside `benchmark.codebuild-fleet:`. Move qwen2.5-omni-3b there
  with runner_label `gpu-l40s-4gpu-runners` (same 4x L40S hardware).
- Expand the runner-scale-sets comment block to list all 9 available
  k8s scale-set labels and their hardware mappings.
- Extend dispatch-vllm-omni-benchmark.yml's load-benchmarks parser to
  emit both matrices.
- Add a parallel `benchmark-runner-scale` job that uses
  `runs-on: ${{ matrix.runner_label }}` (no `fleet:` selector), pins
  GPU access to the pod's assigned UUIDs so parallel pods don't
  contend, and skips `docker rmi` since the host Docker daemon is
  shared across pods.
- `benchmark-report` now waits for both benchmark jobs.

Same hardware class (4x L40S 192 GB) so qwen2.5-omni-3b's existing
thresholds (min_rps 0.02, max_p95_ttft_ms 1500, max_p95_e2e_ms 120000)
do not need to change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The previous benchmark-runner-scale step bind-mounted /dlc-models from
the runner pod into the container via `-v /dlc-models:/models`. On a
k8s-backed scale-set, the host docker daemon resolves bind-mount paths
on the host filesystem, not the pod filesystem — so the container sees
an empty /models, vllm-omni's omni_snapshot_download falls through to
its HuggingFace path, and crashes with:

  huggingface_hub.errors.HFValidationError:
  Repo id must be in the form 'repo_name' or 'namespace/repo_name':
  '/models/qwen2.5-omni-3b'

Mirror the SGLang runner-scale pattern: start the container with
`--entrypoint /bin/bash` so it idles instead of immediately invoking
`vllm serve` with a bad path, `docker cp` the model from the pod
filesystem into the container, then launch the server via
`docker exec -d`. The host-side health check on localhost:8080 still
works because `-p 8080:8080` is unchanged.

CodeBuild fleet jobs are unaffected — they continue to bind-mount the
runner's /dlc-models since the CodeBuild docker daemon is on the same
host as the runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
… host

Previous fix (2d61941) docker-cp'd the model into the container but still
ran the dispatcher (vllm_omni_benchmark_test.sh) from the runner pod, so
its `curl http://localhost:8080/health` polled the runner's loopback —
not the container's. The runner pod and the docker host on a k8s scale-set
are separate network namespaces, so even with `-p 8080:8080` the runner
can't reach the published port. Result: 600s health-check timeout, exit 1,
no server logs since the docker-exec'd server was still healthy inside the
container.

Mirror the SGLang scale-set path end-to-end:
- docker cp test/vllm-omni/scripts into /workspace/scripts in the
  container at start time
- Launch `vllm serve --omni ...` via `docker exec -d` and redirect output
  to /workspace/server.log
- Run the dispatcher itself via `docker exec` so all networking
  (curl /health, the per-modality benchmark client) uses the container's
  own loopback. Results are written to /workspace/benchmark_results inside
  the container, then docker-cp'd out for upload-artifact.
- Drop the now-unused `-p 8080:8080` publish (host can't reach it anyway).

CodeBuild fleet jobs are unaffected — they continue to bind-mount and
poll localhost:8080 since the runner and the docker daemon share the
same host network there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
@Yadan-Wei Yadan-Wei changed the title from "fix(vllm-omni): adjust benchmark thresholds for 0.20.0 SSE/Code2Wav changes" to "feat(vllm-omni): align chat-omni token counting + 3 new benchmarks + ICE workaround for qwen2.5-omni-3b" on May 13, 2026
@Yadan-Wei Yadan-Wei merged commit a85bb6a into main May 13, 2026
66 of 70 checks passed
@Yadan-Wei Yadan-Wei deleted the omni-benchmark-fix branch May 13, 2026 17:27
@Yadan-Wei Yadan-Wei mentioned this pull request May 13, 2026
5 tasks