[TRTLLM-12427][perf] Qwen2.5/3/3.5-VL Performance Optimization #11943
Apply the codebase's RST-literal style (``foo``) to inline code references in comments / docstrings on the branch-touched lines of Qwen2/3-VL model and test files; no logic change. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
…uff-format) Pure formatting cleanup -- no logic change. Brings the branch-touched Qwen2/3-VL Python, vision-encoder C++ kernel template, and modeling tests in line with the repo's pre-commit hooks (yapf, clang-format, ruff-format). Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
- Restore the original `torch_dtype` / `max_position_embeddings` propagation comment on `Qwen3VLVisionAttention.__init__`.
- Drop the redundant `if cos.dtype != q.dtype` / `if sin.dtype != q.dtype` guards in `Qwen2_5_VLVisionAttention.apply_rope`; `tensor.to(dtype=...)` already short-circuits when the dtype matches.
- Collapse the doubled backticks introduced earlier in this branch back to single backticks on the branch-touched lines, matching the reviewer-preferred style.
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Replace the misleading `pragma: no cover - flash_attn is part of the default deps` line on the flash_attn rotary import: flash_attn is only declared in `triton_backend/requirements.txt` and the multimodal extras, not the main `requirements.txt`, so the guarded import is genuinely the load-time fallback when flash_attn isn't installed. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
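The guarded import discussed in this commit follows the usual optional-dependency pattern. A minimal sketch, assuming a hypothetical helper named `optional_import` (not a name from this PR) and assuming the rotary kernel is imported from `flash_attn.layers.rotary`:

```python
import importlib


def optional_import(module_name, attr_name):
    """Return ``module_name.attr_name``, or None when the module is unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Load-time fallback: the dependency is optional, so its absence
        # is an expected state rather than an error.
        return None
    return getattr(module, attr_name, None)


# Assumed import path, for illustration only.
flash_apply_rotary = optional_import("flash_attn.layers.rotary", "apply_rotary_emb")
HAS_FLASH_ROTARY = flash_apply_rotary is not None
```

Callers can then branch on `HAS_FLASH_ROTARY` and use a pure-PyTorch rotary path when it is false, which is why the fallback branch is reachable and should not be excluded from coverage.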
Python 3's `/` is true division and always returns float, so `float(...) / float(...)` was redundant. Same effective values, less noise. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
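A short check of the division behaviour this commit relies on; the variable names are illustrative only:

```python
numerator, denominator = 7, 2

true_quotient = numerator / denominator                 # true division on ints
cast_quotient = float(numerator) / float(denominator)   # same value, redundant casts
floor_quotient = numerator // denominator               # floor division, for contrast

assert isinstance(true_quotient, float)
assert true_quotient == cast_quotient == 3.5
assert floor_quotient == 3
assert isinstance(4 / 2, float)  # even an exact quotient is returned as a float
```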
Extract the ``_build_temporal_block`` step as a classmethod hook on ``Qwen2VLInputProcessorBase`` so Qwen3-VL's only meaningful difference (per-frame timestamps vs. ``tokens_per_second`` scaling) can be expressed as a one-line override. ``Qwen3VLInputProcessorBase`` now subclasses ``Qwen2VLInputProcessorBase``, overrides ``__init__`` (dtype source), ``get_rope_index`` (``repeat_interleave`` of ``video_grid_thw`` before super), and ``_build_temporal_block`` (plain ``np.indices``). Drops ~95% of the duplicated tokenizer / processor / mrope / call logic and the matching unused imports. Also dispatch ``get_mrope_config``'s ``get_rope_index`` call via ``type(self)`` so the subclass override is actually used, and condense the ``bypass_processor_output_validation`` rationale comment. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
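The hook-plus-dispatch idea in this commit can be sketched in plain Python. Everything below is hypothetical: the class names, the list-based position layout, and the assumption that `get_rope_index` is a classmethod are stand-ins chosen for illustration, not the PR's actual code.

```python
class BaseProcessor:
    """Hypothetical stand-in for the base input processor."""

    tokens_per_second = 2

    @classmethod
    def _build_temporal_block(cls, num_frames, tokens_per_frame):
        # Base behaviour: scale each frame index by tokens_per_second.
        return [[frame * cls.tokens_per_second] * tokens_per_frame
                for frame in range(num_frames)]

    @classmethod
    def get_rope_index(cls, num_frames, tokens_per_frame):
        # Shared logic; only the temporal block differs between subclasses.
        block = cls._build_temporal_block(num_frames, tokens_per_frame)
        return [pos for row in block for pos in row]

    def get_mrope_config(self, num_frames, tokens_per_frame):
        # type(self) resolves to the concrete subclass, so overrides apply.
        return type(self).get_rope_index(num_frames, tokens_per_frame)

    def get_mrope_config_hardcoded(self, num_frames, tokens_per_frame):
        # Naming the base class explicitly bypasses any subclass override.
        return BaseProcessor.get_rope_index(num_frames, tokens_per_frame)


class TimestampProcessor(BaseProcessor):
    """Hypothetical subclass whose only difference is the temporal block."""

    @classmethod
    def _build_temporal_block(cls, num_frames, tokens_per_frame):
        # Plain frame indices, no tokens_per_second scaling.
        return [[frame] * tokens_per_frame for frame in range(num_frames)]


base_positions = BaseProcessor().get_mrope_config(3, 2)
sub_positions = TimestampProcessor().get_mrope_config(3, 2)
bypassed_positions = TimestampProcessor().get_mrope_config_hardcoded(3, 2)
```

Here `sub_positions` reflects the override, while `bypassed_positions` silently falls back to the base scaling. That is the failure mode a `type(self)` dispatch avoids.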

Summary
This PR reworks the Qwen2-VL / Qwen2.5-VL / Qwen3-VL PyTorch-backend implementations to cut host (CPU) overhead in the vision tower and input processors, enables piecewise CUDA graph for these models, and fixes several correctness issues around mRoPE and vision-block RoPE. The optimizations target the high-concurrency serving regime, where launch overhead and host↔device syncs dominate.
What changed
Vision tower / rotary embedding
- Precompute `cos`/`sin` into init-time buffers so the forward path no longer calls `.cos()`/`.sin()` per step.
Host-overhead reduction
- Add an `async_tensor_h2d` helper and route all Qwen2.5/3-VL H2D copies (vision `pos_ids`, `window_index`, `rope_position_ids`) through it.
- […] `maybe_pin_memory` when the input is already pinned.
CUDA graph
Refactor
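The init-time `cos`/`sin` buffering listed under "Vision tower / rotary embedding" can be illustrated without PyTorch. This is a minimal sketch under stated assumptions: `RotaryTable` is a hypothetical name, plain lists stand in for tensor buffers, and the rotary base of 10000 is an assumed default rather than a value taken from the PR.

```python
import math


class RotaryTable:
    """Hypothetical stand-in for a module that owns precomputed cos/sin buffers."""

    def __init__(self, head_dim, max_positions, base=10000.0):
        half = head_dim // 2
        inv_freq = [base ** (-2.0 * i / head_dim) for i in range(half)]
        # Trigonometric work happens once, at construction time.
        self.cos = [[math.cos(p * f) for f in inv_freq] for p in range(max_positions)]
        self.sin = [[math.sin(p * f) for f in inv_freq] for p in range(max_positions)]

    def lookup(self, position_ids):
        # Forward path: pure indexing, no cos()/sin() calls per step.
        return ([self.cos[p] for p in position_ids],
                [self.sin[p] for p in position_ids])


table = RotaryTable(head_dim=4, max_positions=8)
cos_rows, sin_rows = table.lookup([0, 3])
```

The trade-off is memory for the table in exchange for fewer operations on the per-step path, which matters most when launch overhead dominates.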
Performance
Model: `Qwen3VLForConditionalGeneration` (FP8 weights + KV cache, bf16 vision encoder). Hardware: H200 ×1. Workload: image+text, ISL=1000, OSL=1000, 512×512 image, KV block reuse off, 3-run mean. Comparison is against upstream `main` with identical serving config (`max_batch_size=256`, `max_num_tokens=8192`, `num_postprocess_workers=4`, `cuda_graph_config.enable_padding=true`, chunked prefill on).
System output-token throughput (tok/s)
Other metrics
Gains are concentrated at high concurrency, where the host-side savings (fewer CPU launches, async H2D, higher CUDA-graph hit ratio) and the lower-overhead mRoPE path matter most. Low-concurrency TTFT also improves substantially (c=1 nearly halved) from the text fast path and removed vision-encoder syncs.