[TRTLLM-12427][perf] Qwen2.5/3/3.5-VL Performance Optimization by yechank-nvidia · Pull Request #11943 · NVIDIA/TensorRT-LLM

yechank-nvidia · 2026-03-05T09:45:24Z

Summary

This PR reworks the Qwen2-VL / Qwen2.5-VL / Qwen3-VL PyTorch-backend implementations to cut host (CPU) overhead in the vision tower and input processors, enables piecewise CUDA graph for these models, and fixes several correctness issues around mRoPE and vision-block RoPE. The optimizations target the high-concurrency serving regime, where launch overhead and host↔device syncs dominate.

What changed

Vision tower / rotary embedding

- Drop the HF rotary dependency and memoize the frequency table; pre-compute cos/sin into init-time buffers so the forward path no longer calls .cos()/.sin() per step.
- Add an L2 per-tile GPU rotary cache for Qwen2.5/3-VL vision and annotate its measured GPU footprint.
- Remove dead code and a redundant batched pos-embed kernel from the vision tower.
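The memoize-then-look-up idea behind the first item can be sketched in plain Python. This is an illustrative sketch only, not the PR's code: the names `rotary_cos_sin` and `VisionRotary`, the tuple-based tables, and the base `theta=10000.0` are assumptions, and the real implementation keeps the tables in GPU buffers.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def rotary_cos_sin(max_positions: int, dim: int, theta: float = 10000.0):
    """Build cos/sin tables once per (max_positions, dim, theta) and memoize them."""
    # Standard RoPE inverse frequencies: theta^(-2i/dim) for i in [0, dim/2).
    inv_freq = [theta ** (-(2 * i) / dim) for i in range(dim // 2)]
    cos = tuple(tuple(math.cos(p * f) for f in inv_freq) for p in range(max_positions))
    sin = tuple(tuple(math.sin(p * f) for f in inv_freq) for p in range(max_positions))
    return cos, sin


class VisionRotary:
    """Hypothetical module: tables are fetched at init, forward only indexes them."""

    def __init__(self, max_positions: int, dim: int):
        self.cos, self.sin = rotary_cos_sin(max_positions, dim)

    def lookup(self, position: int):
        # No trigonometric calls on the per-step path, only a table read.
        return self.cos[position], self.sin[position]


rot_a = VisionRotary(max_positions=16, dim=8)
rot_b = VisionRotary(max_positions=16, dim=8)  # reuses the memoized tables
```

Two modules built with the same shape share one table, so the trigonometric work happens once rather than on every forward step.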

Host-overhead reduction

- Add an async_tensor_h2d helper and route all Qwen2.5/3-VL H2D copies (vision pos_ids, window_index, rope_position_ids) through it.
- Skip redundant pinning in maybe_pin_memory when the input is already pinned.
- Pre-allocate deepstack scratch and skip vision-encoder host syncs.
- Add a text-only fast path in the Qwen2.5/3-VL input processors so text-only requests avoid vision-path work.
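The text-only fast path in the last item amounts to an early return before any vision preprocessing. A minimal sketch, assuming a simplified request shape; `Request`, `process`, and the `vision_calls` counter are hypothetical stand-ins and do not reflect the actual input-processor interface:

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical request: a prompt plus zero or more images."""
    prompt: str
    images: list = field(default_factory=list)


def process(request: Request, vision_calls: list) -> dict:
    tokens = request.prompt.split()  # stand-in for real tokenization
    if not request.images:
        # Fast path: no image preprocessing, no vision position ids.
        return {"tokens": tokens, "pixel_values": None}
    # Slow path: record that vision-side work ran (stand-in for preprocessing).
    vision_calls.append(len(request.images))
    return {"tokens": tokens, "pixel_values": list(request.images)}


calls = []
text_only = process(Request("describe the weather"), calls)
with_image = process(Request("describe this", images=["img0"]), calls)
```

Only the second request touches the vision path, so `calls` ends up with a single entry.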

CUDA graph

Enable piecewise CUDA graph for LLM prefill on Qwen2/3-VL.

Refactor

Inherit the Qwen3-VL input processor from the Qwen2-VL base to remove duplication.

Performance

Model: Qwen3VLForConditionalGeneration (FP8 weights + KV cache, bf16 vision encoder). Hardware: H200 ×1. Workload: image+text, ISL=1000, OSL=1000, 512×512 image, KV block reuse off, 3-run mean. Comparison is against upstream main with identical serving config (max_batch_size=256, max_num_tokens=8192, num_postprocess_workers=4, cuda_graph_config.enable_padding=true, chunked prefill on).

System output-token throughput (tok/s)

| Concurrency | Upstream | This PR | Δ |
| --- | --- | --- | --- |
| 32 | 4735 | 4838 | +2.2% |
| 64 | 6948 | 7359 | +5.9% |
| 128 | 8553 | 9767 | +14.2% |
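The Δ column is the relative change of this PR over upstream. A short check that recomputes it from the table values (the helper name is illustrative):

```python
def relative_change_pct(baseline: float, new: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# concurrency -> (upstream tok/s, this PR tok/s), copied from the table
throughput = {32: (4735, 4838), 64: (6948, 7359), 128: (8553, 9767)}

deltas = {c: round(relative_change_pct(b, n), 1) for c, (b, n) in throughput.items()}
for concurrency, delta in deltas.items():
    print(f"c={concurrency}: {delta:+.1f}%")
```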

Other metrics

Per-user throughput (tok/s/user) — c=128: 77.5 → 89.8 (+15.9%)
TTFT (ms) — c=1: 147 → 81 (−45%); c=32: 815 → 690 (−15%); c=128: 1953 → 1804 (−8%)
ITL (ms/token) — c=128: 12.96 → 11.18 (−13.7%)
Request latency (ms) — c=128: 14901 → 12975 (−12.9%)
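These metrics can be cross-checked against each other. Assuming ITL is the mean gap between consecutive output tokens and that every request generates the full OSL of 1000 tokens, request latency should be roughly TTFT plus 999 inter-token gaps. The sketch below applies that assumed relationship to the c=128 numbers; it is a sanity check, not how the benchmark tool computes latency.

```python
OSL = 1000  # output sequence length from the workload description


def estimated_request_latency_ms(ttft_ms: float, itl_ms: float, output_tokens: int = OSL) -> float:
    """TTFT covers the first token; each remaining token costs one ITL."""
    return ttft_ms + (output_tokens - 1) * itl_ms


upstream_est = estimated_request_latency_ms(1953, 12.96)  # reported: 14901 ms
this_pr_est = estimated_request_latency_ms(1804, 11.18)   # reported: 12975 ms
print(f"upstream ~{upstream_est:.0f} ms, this PR ~{this_pr_est:.0f} ms")
```

Both estimates land within a few milliseconds of the reported request latencies, so the TTFT, ITL, and latency figures are mutually consistent under this assumption.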

Gains are concentrated at high concurrency, where the host-side savings (fewer CPU launches, async H2D, higher CUDA-graph hit ratio) and the lower-overhead mRoPE path matter most. Low-concurrency TTFT also improves substantially (c=1 nearly halved) from the text fast path and removed vision-encoder syncs.

Summary by CodeRabbit

Release Notes

Bug Fixes
- Fixed rotary cache index computation in attention kernels.
- Corrected request broadcasting logic for distributed inference configurations.
Performance
- Optimized multimodal model inference through vectorized RoPE position indexing and GPU memory operations.
- Added Triton kernel support for position embedding interpolation in vision models.
- Improved device transfer efficiency with async host-to-device operations and optimized tensor placement.
- Enhanced distributed tensor parallelism handling.
Tests
- Expanded test coverage for multimodal RoPE configurations and vision component equivalence validation.

yechank-nvidia · 2026-05-21T11:16:09Z

/bot run

tensorrt-cicd · 2026-05-21T11:21:57Z

PR_Github #49687 [ run ] triggered by Bot. Commit: 3489cf2 Link to invocation

tensorrt-cicd · 2026-05-21T11:35:15Z

PR_Github #49687 [ run ] completed with state FAILURE. Commit: 3489cf2
/LLM/main/L0_MergeRequest_PR pipeline #39294 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yechank-nvidia · 2026-05-26T05:25:44Z

/bot run

tensorrt-cicd · 2026-05-26T05:31:03Z

PR_Github #50284 [ run ] triggered by Bot. Commit: da59f84 Link to invocation

tensorrt-cicd · 2026-05-26T07:11:16Z

PR_Github #50284 [ run ] completed with state SUCCESS. Commit: da59f84
/LLM/main/L0_MergeRequest_PR pipeline #39813 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yechank-nvidia · 2026-05-27T04:04:02Z

/bot run

tensorrt-cicd · 2026-05-27T04:10:44Z

PR_Github #50450 [ run ] triggered by Bot. Commit: 657bb10 Link to invocation

tensorrt-cicd · 2026-05-27T09:16:15Z

PR_Github #50450 [ run ] completed with state SUCCESS. Commit: 657bb10
/LLM/main/L0_MergeRequest_PR pipeline #39969 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Apply the codebase's RST-literal style (``foo``) to inline code references in comments / docstrings on the branch-touched lines of Qwen2/3-VL model and test files; no logic change. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

…uff-format) Pure formatting cleanup -- no logic change. Brings the branch-touched Qwen2/3-VL Python, vision-encoder C++ kernel template, and modeling tests in line with the repo's pre-commit hooks (yapf, clang-format, ruff-format). Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

- Restore the original `torch_dtype` / `max_position_embeddings` propagation comment on `Qwen3VLVisionAttention.__init__`. - Drop the redundant `if cos.dtype != q.dtype` / `if sin.dtype != q.dtype` guards in `Qwen2_5_VLVisionAttention.apply_rope`; `tensor.to(dtype=...)` already short-circuits when the dtype matches. - Collapse the doubled backticks introduced earlier in this branch back to single backticks on the branch-touched lines, matching the reviewer-preferred style. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

Replace the misleading `pragma: no cover - flash_attn is part of the default deps` line on the flash_attn rotary import: flash_attn is only declared in `triton_backend/requirements.txt` and the multimodal extras, not the main `requirements.txt`, so the guarded import is genuinely the load-time fallback when flash_attn isn't installed. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
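A guarded import of this kind generally looks like the sketch below. It is a generic illustration, not the code on the branch: `optional_import` and `pick_rotary_backend` are hypothetical helpers, and the module name is supplied by the caller rather than hard-coded.

```python
import importlib


def optional_import(module_name: str):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:  # also covers ModuleNotFoundError
        return None


def pick_rotary_backend(module_name: str) -> str:
    """Choose the fused path only when the optional dependency loads."""
    return "fused" if optional_import(module_name) is not None else "fallback"


# A module that is certainly absent exercises the fallback branch.
print(pick_rotary_backend("definitely_missing_module_xyz"))
```

Because the fallback branch is reachable on any install without the optional package, excluding it from coverage would hide a real code path.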

Extract the ``_build_temporal_block`` step as a classmethod hook on ``Qwen2VLInputProcessorBase`` so Qwen3-VL's only meaningful difference (per-frame timestamps vs. ``tokens_per_second`` scaling) can be expressed as a one-line override. ``Qwen3VLInputProcessorBase`` now subclasses ``Qwen2VLInputProcessorBase``, overrides ``__init__`` (dtype source), ``get_rope_index`` (``repeat_interleave`` of ``video_grid_thw`` before super), and ``_build_temporal_block`` (plain ``np.indices``). Drops ~95% of the duplicated tokenizer / processor / mrope / call logic and the matching unused imports. Also dispatch ``get_mrope_config``'s ``get_rope_index`` call via ``type(self)`` so the subclass override is actually used, and condense the ``bypass_processor_output_validation`` rationale comment. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
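The hook-plus-dispatch pattern described here can be illustrated with a toy pair of classes. Everything below is a simplified stand-in: the class names, the list-based temporal indices, and the `tokens_per_second = 2.0` default are assumptions chosen for the example, not the real processor code.

```python
class BaseProcessor:
    """Stand-in for the base processor: temporal ids scaled by tokens_per_second."""
    tokens_per_second = 2.0

    @classmethod
    def _build_temporal_block(cls, num_frames: int, seconds_per_frame: float):
        return [int(t * seconds_per_frame * cls.tokens_per_second) for t in range(num_frames)]

    @classmethod
    def get_rope_index(cls, num_frames: int, seconds_per_frame: float = 1.0):
        return cls._build_temporal_block(num_frames, seconds_per_frame)

    def get_mrope_config(self, num_frames: int):
        # Dispatch through type(self) so subclass overrides are honored.
        return type(self).get_rope_index(num_frames)

    def get_mrope_config_hardcoded(self, num_frames: int):
        # Naming the base class explicitly bypasses any override.
        return BaseProcessor.get_rope_index(num_frames)


class TimestampProcessor(BaseProcessor):
    """Stand-in for the subclass: plain per-frame indices, no scaling."""

    @classmethod
    def _build_temporal_block(cls, num_frames: int, seconds_per_frame: float):
        return list(range(num_frames))


sub = TimestampProcessor()
print(sub.get_mrope_config(3))            # uses the override
print(sub.get_mrope_config_hardcoded(3))  # silently uses the base behavior
```

The second call shows the failure mode the commit guards against: a hard-coded base-class call makes the subclass's one-line override dead code.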

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

2ez4bz · 2026-06-12T17:31:07Z

/bot run

tensorrt-cicd · 2026-06-12T17:37:38Z

PR_Github #53929 [ run ] triggered by Bot. Commit: ea61e98 Link to invocation

tensorrt-cicd · 2026-06-13T03:24:47Z

PR_Github #53929 [ run ] completed with state FAILURE. Commit: ea61e98
/LLM/main/L0_MergeRequest_PR pipeline #43022 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

2ez4bz · 2026-06-13T05:55:15Z

/bot run

tensorrt-cicd · 2026-06-13T06:00:32Z

PR_Github #54018 [ run ] triggered by Bot. Commit: ea61e98 Link to invocation

tensorrt-cicd · 2026-06-13T06:50:50Z

PR_Github #54018 [ run ] completed with state FAILURE. Commit: ea61e98
/LLM/main/L0_MergeRequest_PR pipeline #43102 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yechank-nvidia · 2026-06-13T12:43:14Z

/bot run

tensorrt-cicd · 2026-06-13T12:49:26Z

PR_Github #54047 [ run ] triggered by Bot. Commit: ea61e98 Link to invocation

tensorrt-cicd · 2026-06-13T15:56:20Z

PR_Github #54047 [ run ] completed with state SUCCESS. Commit: ea61e98
/LLM/main/L0_MergeRequest_PR pipeline #43131 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

2ez4bz · 2026-06-14T06:02:55Z

yechank-nvidia self-assigned this Mar 5, 2026

yechank-nvidia added the Multimodal label (label for issues & PRs regarding Multimodal related objects) Mar 5, 2026

moraxu self-assigned this May 19, 2026

yechank-nvidia force-pushed the qwen3vl_opt branch from 87e587d to 3489cf2 Compare May 21, 2026 11:13

yechank-nvidia changed the title ~~[Draft][perf] Qwen3-VL Performance Optimization~~ [None][perf] Qwen3/3.5-VL Performance Optimization May 21, 2026

yechank-nvidia changed the title ~~[None][perf] Qwen3/3.5-VL Performance Optimization~~ [None][perf] Qwen2/2.5/3/3.5-VL Performance Optimization May 22, 2026

yechank-nvidia changed the title ~~[None][perf] Qwen2/2.5/3/3.5-VL Performance Optimization~~ [None][perf] Qwen2.5/3/3.5-VL Performance Optimization May 22, 2026

yechank-nvidia force-pushed the qwen3vl_opt branch from 3489cf2 to 124f649 Compare May 26, 2026 02:25

yechank-nvidia force-pushed the qwen3vl_opt branch from da59f84 to 657bb10 Compare May 27, 2026 01:15

yechank-nvidia marked this pull request as ready for review May 27, 2026 10:55

yechank-nvidia requested review from a team as code owners May 27, 2026 10:55

yechank-nvidia requested review from byshiue, moraxu, symphonylyh, tijyojwad and xxi-nv May 27, 2026 10:55

yechank-nvidia added 18 commits June 12, 2026 10:26

[None][style] Drop redundant float() casts on rotary scale factors

22fe228

Python 3's `/` is true division and always returns float, so `float(...) / float(...)` was redundant. Same effective values, less noise. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
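A quick illustration of the point for integer operands (the function name and values are made up for the example):

```python
def scale_factor(target_size: int, base_size: int) -> float:
    # No float() casts needed: `/` on two ints already yields a float.
    return target_size / base_size


exact = scale_factor(4, 2)     # divides evenly, still a float
inexact = scale_factor(7, 2)
floored = 7 // 2               # floor division is the operator that stays int
print(type(exact).__name__, exact, inexact, floored)
```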

[None][style] Apply ruff-format auto-fixes from pre-commit

c8fc3e0

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

[None][perf] cache Qwen VL MRoPE deltas

e307d6f

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

[None][fix] make Qwen VL flash-attn optional

2bf982e

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

[None][fix] support packed Qwen VL attention segments

e5cb831

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

[None][fix] avoid multimodal encoder meta init fallback

d8c38f1

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

[None][perf] Skip rebuilding provided attention cu seqlens

ba3ec46

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

[None][fix] Remove unused cuda graph import

f0a7053

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

[None][refactor] Add explicit Qwen VL LLM compile hook

b036594

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

[None][fix] Support Qwen VL image embedding inputs

1dea0dd

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

[None][test] Cover Qwen VL image embedding attach

2a9e003

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

[None][fix] reuse Qwen VL disagg prompt expansion for embeddings

9572dc6

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

[None][fix] grow Qwen3 VL vision position id buffer

ea61e98

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

2ez4bz force-pushed the qwen3vl_opt branch from b3e9766 to ea61e98 Compare June 12, 2026 17:26

yechank-nvidia merged commit 1283c6b into NVIDIA:main Jun 13, 2026
7 checks passed

9 participants