[TRTLLM-12427][perf] Qwen2.5/3/3.5-VL Performance Optimization #11943

Merged: yechank-nvidia merged 43 commits into NVIDIA:main from yechank-nvidia:qwen3vl_opt on Jun 13, 2026

Conversation

@yechank-nvidia

@yechank-nvidia yechank-nvidia commented Mar 5, 2026

Collaborator

Summary

This PR reworks the Qwen2-VL / Qwen2.5-VL / Qwen3-VL PyTorch-backend implementations to cut host (CPU) overhead in the vision tower and input processors, enables piecewise CUDA graph for these models, and fixes several correctness issues around mRoPE and vision-block RoPE. The optimizations target the high-concurrency serving regime, where launch overhead and host↔device syncs dominate.

What changed

Vision tower / rotary embedding

  • Drop the HF rotary dependency and memoize the frequency table; pre-compute cos/sin into init-time buffers so the forward path no longer calls .cos()/.sin() per step.
  • Add an L2 per-tile GPU rotary cache for Qwen2.5/3-VL vision and annotate its measured GPU footprint.
  • Remove dead code and a redundant batched pos-embed kernel from the vision tower.
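The memoize-and-precompute idea in the first bullet can be illustrated without any framework. This is a minimal sketch, not the PR's code: `inv_freq_table` and `RotaryTable` are hypothetical names, plain Python lists stand in for GPU buffers, and the base `theta=10000.0` is an assumed default.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def inv_freq_table(dim: int, theta: float = 10000.0) -> tuple:
    """Inverse frequencies theta^(-2i/dim); memoized per argument tuple."""
    return tuple(theta ** (-(2 * i) / dim) for i in range(dim // 2))


class RotaryTable:
    """Holds cos/sin for positions [0, max_pos), filled once at construction."""

    def __init__(self, dim: int, max_pos: int, theta: float = 10000.0):
        freqs = inv_freq_table(dim, theta)
        # All trigonometry happens here, at init time.
        self.cos = [[math.cos(p * f) for f in freqs] for p in range(max_pos)]
        self.sin = [[math.sin(p * f) for f in freqs] for p in range(max_pos)]

    def lookup(self, position: int):
        # Forward path: pure indexing, no cos()/sin() calls per step.
        return self.cos[position], self.sin[position]
```

In a tensor framework the two tables would typically be registered as module buffers so that they move with the model to the target device.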

Host-overhead reduction

  • Add an async_tensor_h2d helper and route all Qwen2.5/3-VL H2D copies (vision pos_ids, window_index, rope_position_ids) through it.
  • Skip redundant pinning in maybe_pin_memory when the input is already pinned.
  • Pre-allocate deepstack scratch and skip vision-encoder host syncs.
  • Add a text-only fast path in the Qwen2.5/3-VL input processors so text-only requests avoid vision-path work.
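The text-only fast path in the last bullet amounts to an early return before any vision work. A minimal sketch of that control flow, assuming a hypothetical `SketchInputProcessor` (not the TensorRT-LLM class) whose `_vision_path` stands in for image preprocessing and position-index computation:

```python
from typing import Optional


class SketchInputProcessor:
    """Hypothetical processor used only to illustrate the early return."""

    def __init__(self):
        self.vision_calls = 0  # counts how often the expensive path runs

    def _vision_path(self, token_ids: list, images: list) -> dict:
        # Placeholder for the vision-side work a multimodal request needs.
        self.vision_calls += 1
        return {"token_ids": token_ids, "num_images": len(images)}

    def __call__(self, token_ids: list, images: Optional[list] = None) -> dict:
        if not images:
            # Fast path: no multimodal data, so skip the vision path entirely.
            return {"token_ids": token_ids, "num_images": 0}
        return self._vision_path(token_ids, images)
```

The counter makes it easy to confirm in a unit test that a text-only request never reaches the vision path.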

CUDA graph

  • Enable piecewise CUDA graph for LLM prefill on Qwen2/3-VL.

Refactor

  • Inherit the Qwen3-VL input processor from the Qwen2-VL base to remove duplication.

Performance

Model: Qwen3VLForConditionalGeneration (FP8 weights + KV cache, bf16 vision encoder). Hardware: H200 ×1. Workload: image+text, ISL=1000, OSL=1000, 512×512 image, KV block reuse off, 3-run mean. Comparison is against upstream main with identical serving config (max_batch_size=256, max_num_tokens=8192, num_postprocess_workers=4, cuda_graph_config.enable_padding=true, chunked prefill on).

System output-token throughput (tok/s)

| concurrency | upstream | this PR | Δ |
|---|---|---|---|
| 32 | 4735 | 4838 | +2.2% |
| 64 | 6948 | 7359 | +5.9% |
| 128 | 8553 | 9767 | +14.2% |
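The Δ column is the relative change of this PR over upstream. A short check that recomputes it from the throughput figures above (the helper name `relative_change` is only for illustration):

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# concurrency -> (upstream tok/s, this PR tok/s), copied from the table
throughput = {32: (4735, 4838), 64: (6948, 7359), 128: (8553, 9767)}

deltas = {c: round(relative_change(b, a), 1) for c, (b, a) in throughput.items()}
print(deltas)  # {32: 2.2, 64: 5.9, 128: 14.2}
```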

Other metrics

  • Per-user throughput (tok/s/user) — c=128: 77.5 → 89.8 (+15.9%)
  • TTFT (ms) — c=1: 147 → 81 (−45%); c=32: 815 → 690 (−15%); c=128: 1953 → 1804 (−8%)
  • ITL (ms/token) — c=128: 12.96 → 11.18 (−13.7%)
  • Request latency (ms) — c=128: 14901 → 12975 (−12.9%)

Gains are concentrated at high concurrency, where the host-side savings (fewer CPU launches, async H2D, higher CUDA-graph hit ratio) and the lower-overhead mRoPE path matter most. Low-concurrency TTFT also improves substantially (c=1 nearly halved) from the text fast path and removed vision-encoder syncs.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed rotary cache index computation in attention kernels.
    • Corrected request broadcasting logic for distributed inference configurations.
  • Performance

    • Optimized multimodal model inference through vectorized RoPE position indexing and GPU memory operations.
    • Added Triton kernel support for position embedding interpolation in vision models.
    • Improved device transfer efficiency with async host-to-device operations and optimized tensor placement.
    • Enhanced distributed tensor parallelism handling.
  • Tests

    • Expanded test coverage for multimodal RoPE configurations and vision component equivalence validation.

@yechank-nvidia yechank-nvidia self-assigned this Mar 5, 2026
@yechank-nvidia yechank-nvidia added the Multimodal label (Label for issues & PRs regarding Multimodal related objects) Mar 5, 2026
@moraxu moraxu self-assigned this May 19, 2026
@yechank-nvidia yechank-nvidia changed the title from "[Draft][perf] Qwen3-VL Performance Optimization" to "[None][perf] Qwen3/3.5-VL Performance Optimization" May 21, 2026
@yechank-nvidia

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #49687 [ run ] triggered by Bot. Commit: 3489cf2 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #49687 [ run ] completed with state FAILURE. Commit: 3489cf2
/LLM/main/L0_MergeRequest_PR pipeline #39294 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yechank-nvidia yechank-nvidia changed the title from "[None][perf] Qwen3/3.5-VL Performance Optimization" to "[None][perf] Qwen2/2.5/3/3.5-VL Performance Optimization" May 22, 2026
@yechank-nvidia yechank-nvidia changed the title from "[None][perf] Qwen2/2.5/3/3.5-VL Performance Optimization" to "[None][perf] Qwen2.5/3/3.5-VL Performance Optimization" May 22, 2026
@yechank-nvidia

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #50284 [ run ] triggered by Bot. Commit: da59f84 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #50284 [ run ] completed with state SUCCESS. Commit: da59f84
/LLM/main/L0_MergeRequest_PR pipeline #39813 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yechank-nvidia

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #50450 [ run ] triggered by Bot. Commit: 657bb10 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #50450 [ run ] completed with state SUCCESS. Commit: 657bb10
/LLM/main/L0_MergeRequest_PR pipeline #39969 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yechank-nvidia yechank-nvidia marked this pull request as ready for review May 27, 2026 10:55
@yechank-nvidia yechank-nvidia requested review from a team as code owners May 27, 2026 10:55
Apply the codebase's RST-literal style (``foo``) to inline code
references in comments / docstrings on the branch-touched lines of
Qwen2/3-VL model and test files; no logic change.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
…uff-format)

Pure formatting cleanup -- no logic change. Brings the branch-touched
Qwen2/3-VL Python, vision-encoder C++ kernel template, and modeling
tests in line with the repo's pre-commit hooks (yapf, clang-format,
ruff-format).

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
- Restore the original `torch_dtype` / `max_position_embeddings`
  propagation comment on `Qwen3VLVisionAttention.__init__`.
- Drop the redundant `if cos.dtype != q.dtype` / `if sin.dtype != q.dtype`
  guards in `Qwen2_5_VLVisionAttention.apply_rope`; `tensor.to(dtype=...)`
  already short-circuits when the dtype matches.
- Collapse the doubled backticks introduced earlier in this branch back
  to single backticks on the branch-touched lines, matching the
  reviewer-preferred style.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Replace the misleading `pragma: no cover - flash_attn is part of the
default deps` line on the flash_attn rotary import: flash_attn is
only declared in `triton_backend/requirements.txt` and the multimodal
extras, not the main `requirements.txt`, so the guarded import is
genuinely the load-time fallback when flash_attn isn't installed.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Python 3's `/` is true division and always returns float, so
`float(...) / float(...)` was redundant. Same effective values, less
noise.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
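The point about true division in the commit message above is easy to confirm in isolation. A small sketch with hypothetical integer grid sizes (the variable names are not from the PR):

```python
grid_h, grid_w = 6, 3  # hypothetical integer operands

redundant = float(grid_h) / float(grid_w)  # explicit casts, as before the cleanup
simplified = grid_h / grid_w               # `/` already yields a float in Python 3
floored = grid_h // grid_w                 # `//` is the operator that keeps ints

print(redundant, simplified, floored)  # 2.0 2.0 2
```

Only `//` preserves the integer type, so dropping the casts around `/` leaves the resulting values unchanged for operands like these.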
Extract the ``_build_temporal_block`` step as a classmethod hook on
``Qwen2VLInputProcessorBase`` so Qwen3-VL's only meaningful difference
(per-frame timestamps vs. ``tokens_per_second`` scaling) can be expressed
as a one-line override. ``Qwen3VLInputProcessorBase`` now subclasses
``Qwen2VLInputProcessorBase``, overrides ``__init__`` (dtype source),
``get_rope_index`` (``repeat_interleave`` of ``video_grid_thw`` before
super), and ``_build_temporal_block`` (plain ``np.indices``). Drops
~95% of the duplicated tokenizer / processor / mrope / call logic and
the matching unused imports.

Also dispatch ``get_mrope_config``'s ``get_rope_index`` call via
``type(self)`` so the subclass override is actually used, and condense
the ``bypass_processor_output_validation`` rationale comment.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
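The hook-and-override structure described in the commit message above can be sketched in plain Python. The class names below are hypothetical stand-ins for the two input-processor bases, `range` stands in for the `np.indices` call, and the scaling factor is an assumed value; only the dispatch pattern is the point.

```python
class BaseProcessor:
    tokens_per_second = 2  # assumed scaling factor, for illustration only

    @classmethod
    def _build_temporal_block(cls, num_frames: int) -> list:
        # Base behaviour: temporal indices scaled by tokens_per_second.
        return [t * cls.tokens_per_second for t in range(num_frames)]

    @classmethod
    def get_rope_index(cls, num_frames: int) -> list:
        # Shared logic reaches the hook through cls, so overrides apply.
        return cls._build_temporal_block(num_frames)

    def get_mrope_config(self, num_frames: int) -> list:
        # Dispatch via type(self) instead of naming the base class, so a
        # subclass override of get_rope_index or the hook is actually used.
        return type(self).get_rope_index(num_frames)


class TimestampProcessor(BaseProcessor):
    @classmethod
    def _build_temporal_block(cls, num_frames: int) -> list:
        # The override: plain per-frame indices, no scaling.
        return list(range(num_frames))
```

Had `get_mrope_config` called `BaseProcessor.get_rope_index(...)` by name, `TimestampProcessor` would silently get the scaled indices, which is the failure mode the `type(self)` dispatch avoids.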
@2ez4bz

2ez4bz commented Jun 12, 2026

Collaborator

/bot run

@tensorrt-cicd

Collaborator

PR_Github #53929 [ run ] triggered by Bot. Commit: ea61e98 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53929 [ run ] completed with state FAILURE. Commit: ea61e98
/LLM/main/L0_MergeRequest_PR pipeline #43022 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@2ez4bz

2ez4bz commented Jun 13, 2026

Collaborator

/bot run

@tensorrt-cicd

Collaborator

PR_Github #54018 [ run ] triggered by Bot. Commit: ea61e98 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54018 [ run ] completed with state FAILURE. Commit: ea61e98
/LLM/main/L0_MergeRequest_PR pipeline #43102 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yechank-nvidia

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #54047 [ run ] triggered by Bot. Commit: ea61e98 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54047 [ run ] completed with state SUCCESS. Commit: ea61e98
/LLM/main/L0_MergeRequest_PR pipeline #43131 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

@yechank-nvidia yechank-nvidia merged commit 1283c6b into NVIDIA:main Jun 13, 2026
7 checks passed
@2ez4bz

2ez4bz commented Jun 14, 2026

Collaborator

first try

Labels

Multimodal Label for issues & PRs regarding Multimodal related objects

9 participants