[TRTLLM-13383][feat] Add support for Qwen3.5 VL Dense#15249
Conversation
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
|
/bot run |
|
PR_Github #53507 [ run ] triggered by Bot. Commit: |
📝 WalkthroughWalkthroughThis PR implements end-to-end Qwen3.5 multimodal vision-language (VL) model support for TensorRT-LLM. It introduces configuration normalization layers to adapt HuggingFace configs into Qwen3Next-compatible shapes, registers new VLM model classes and weight mappers, integrates VLM forward paths with speculative-decoding support, extends the multimodal test harness for hybrid KV-cache architectures, and provides comprehensive unit and integration test coverage for both dense and MoE variants. ChangesQwen3.5 VL Infrastructure and Tests
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_qwen3vl.py (1)
1286-1315:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve tokenizer IDs here, not the fused multimodal placeholder IDs.
orig_input_idson this path are already postprocessed multimodal ids. In this file,_postprocess()rewrites every image/video token toself.tllm_multimodal_token_id, and that placeholder is defined asvocab_size + 1. The new fallback inSpecDecOneEngineForCausalLM.forward()can therefore hand out out-of-vocab ids to draft models, and several draft paths inmodeling_speculative.pydoembed_tokens(input_ids)wheninputs_embedsis absent. Please thread the pre-_postprocess()tokenizer ids (or a text-only sequence with MM spans removed) into spec decoding instead of reusing the fused placeholder sequence.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/models/modeling_qwen3vl.py` around lines 1286 - 1315, orig_input_ids is being set from the already post-processed/fused sequence (after _postprocess and fuse_input_embeds) which yields multimodal placeholder IDs; instead capture and pass the original tokenizer IDs (or a text-only token sequence with MM spans removed) created before _postprocess into SpecDecOneEngineForCausalLM.forward. Locate where input_ids is first preprocessed (the variable produced before _postprocess/fuse_input_embeds) and replace the assignment orig_input_ids = input_ids with orig_input_ids = <pre-_postprocess_token_ids> (or construct a filtered text-only sequence), then pass that orig_input_ids through the self.llm.forward call so downstream SpecDecOneEngineForCausalLM.forward and modeling_speculative.py embedding paths receive valid in-vocab token IDs rather than the fused placeholder IDs.
🧹 Nitpick comments (2)
tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl_moe.py (2)
383-383: ⚡ Quick winDocument or derive the hardcoded dimension
3in expand().The
.expand(3, -1, 1)call hardcodes the first dimension as3, which corresponds to the number of mRoPE sections (temporal, height, width). Consider deriving this fromlen(self.hf_config.text_config.rope_parameters["mrope_section"])or adding a comment explaining that 3 is the fixed mRoPE dimension count for vision-language models.♻️ Option 1: Derive from config
+ num_mrope_sections = len(self.hf_config.text_config.rope_parameters["mrope_section"]) trtllm_inputs["position_ids"] = ( - (trtllm_inputs["position_ids"] + mrope_gen_position_ids).expand(3, -1, 1).cuda() + (trtllm_inputs["position_ids"] + mrope_gen_position_ids).expand(num_mrope_sections, -1, 1).cuda() )♻️ Option 2: Add explanatory comment
+ # mRoPE uses 3 dimensions: temporal (video frames), height, width trtllm_inputs["position_ids"] = ( (trtllm_inputs["position_ids"] + mrope_gen_position_ids).expand(3, -1, 1).cuda() )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl_moe.py` at line 383, The hardcoded first dimension 3 in the expand call on trtllm_inputs["position_ids"] + mrope_gen_position_ids should be derived or documented: replace the literal with the mRoPE section count by using len(self.hf_config.text_config.rope_parameters["mrope_section"]) (or assign that length to a local variable like mrope_sections_count) when calling .expand, or if the model guarantees a fixed 3-section mRoPE, add a concise comment next to the .expand call (referencing trtllm_inputs, mrope_gen_position_ids, and .expand) stating that 3 equals the number of mRoPE sections (temporal, height, width).
381-402: 💤 Low valueInconsistent device-movement pattern.
Line 381 uses
.to(self.device)while lines 383 and 402 use.cuda()directly. For consistency and flexibility (e.g., if self.device ever becomes configurable), prefer.to(self.device)throughout or document why direct.cuda()calls are used.♻️ Consistent device movement
- mrope_gen_position_ids = torch.cat(mrope_gen_position_ids, dim=-1).to(self.device) + mrope_gen_position_ids = torch.cat(mrope_gen_position_ids, dim=-1).to(self.device) trtllm_inputs["position_ids"] = ( - (trtllm_inputs["position_ids"] + mrope_gen_position_ids).expand(3, -1, 1).cuda() + (trtllm_inputs["position_ids"] + mrope_gen_position_ids).expand(3, -1, 1).to(self.device) ) gen_multimodal_params_list = [] for multimodal_param in multimodal_params_list: @@ -399,7 +399,7 @@ mrope_position_ids.append( multimodal_param.multimodal_data["mrope_config"]["mrope_position_ids"] ) - position_ids = torch.cat(mrope_position_ids, dim=-1).cuda() + position_ids = torch.cat(mrope_position_ids, dim=-1).to(self.device) trtllm_inputs["position_ids"] = position_ids🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl_moe.py` around lines 381 - 402, The test uses mixed device movement calls: replace direct .cuda() calls with .to(self.device) for consistency and configurability — update the lines setting trtllm_inputs["position_ids"] (currently using .cuda()) and the final position_ids = torch.cat(...).cuda() to use .to(self.device); ensure any tensors like mrope_gen_position_ids, trtllm_inputs["position_ids"], and the mrope_position_ids concatenation use .to(self.device) so device handling is uniform across mrope_gen_position_ids, trtllm_inputs, gen_multimodal_params_list, and position_ids.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/models/modeling_qwen3_5.py`:
- Around line 466-476: The load_weights method currently overwrites the injected
weight_mapper with a new Qwen3_5MoeHfWeightMapper; preserve the mapper passed
into load_weights (do not reassign weight_mapper) so the mapper provided by
ModelLoader.load / checkpoint_loader.get_initialized_weight_mapper is honored,
and remove the line instantiating Qwen3_5MoeHfWeightMapper; if this code truly
only supports HF-style mappers, replace the overwrite with an explicit type
check/assert (e.g. verify isinstance(weight_mapper, Qwen3_5MoeHfWeightMapper)
and raise a clear error) before calling self.llm.load_weights(filtered_weights,
weight_mapper, params_map=params_map).
- Around line 438-451: The VLM wrapper _Qwen3_5VLModel is not propagating the
inner Qwen3Next LM's get_model_defaults (so enable_block_reuse=False is lost);
implement an override of get_model_defaults on _Qwen3_5VLModel (or on
Qwen3VLModelBase if shared) that calls/merges the inner decoder's defaults (from
Qwen3_5ForCausalLM / Qwen3_5MoeForCausalLM or the Qwen3Next decoder class) with
the wrapper defaults and ensures enable_block_reuse is set to False when absent;
update ModelLoader.load_config_and_apply_defaults usage to rely on this merged
dict so the VLM path inherits the inner decoder's enable_block_reuse=False
default.
In `@tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py`:
- Around line 521-548: The TestQwen3_5_27B_VL integration test is missing the
Hopper gating and should be skipped on pre-Hopper systems; add the
`@skip_pre_hopper` decorator above the class TestQwen3_5_27B_VL (the class that
defines MODEL_NAME "Qwen/Qwen3.5-27B" and method test_auto_dtype) so the dense
Qwen3.5-VL variant won't run on unsupported SM versions.
In `@tests/unittest/_torch/modeling/test_modeling_multimodal.py`:
- Around line 582-624: The hybrid cache-manager builder
get_hybrid_kv_cache_manager always constructs kv_cache_config =
PyKvCacheConfig(max_tokens=...) and never sets enable_block_reuse from the test
scenario, so block-reuse cannot be toggled; update get_hybrid_kv_cache_manager
to accept or read the scenario.kv_cache_reuse flag (or a passed-in
kv_cache_reuse param) and pass it into PyKvCacheConfig(enable_block_reuse=...)
when constructing kv_cache_config, ensuring the test's
MultimodalScenario.kv_cache_reuse affects the PyKvCacheConfig used by
CppMambaHybridCacheManager; use the existing symbol names PyKvCacheConfig,
get_hybrid_kv_cache_manager, and MultimodalScenario.kv_cache_reuse to locate and
modify the code.
In `@tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl_moe.py`:
- Around line 1-2: Replace the SPDX-style two-line header (the lines starting
with "# SPDX-FileCopyrightText" and "# SPDX-License-Identifier") with the
repository's required full NVIDIA Apache 2.0 header block that includes the year
of latest modification and the full license text; update the top of
tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl_moe.py by removing the
SPDX lines and prepending the canonical multi-line NVIDIA Apache-2.0 header used
across the repo so the file header matches the coding guidelines.
In `@tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl.py`:
- Around line 349-352: The test moves position tensors to the current CUDA
device with .cuda(), causing potential device-mismatch; update the code that
constructs trtllm_inputs["position_ids"] to keep tensors on self.device by
replacing .cuda() with .to(self.device) (ensure torch.cat(...).to(self.device) /
the final .expand(...).to(self.device) as appropriate) for the block using
mrope_gen_position_ids and the similar occurrence around the other
trtllm_inputs["position_ids"] usage so all position_ids live on self.device
alongside the model and payloads.
- Around line 1-2: This file is missing the required NVIDIA copyright header
block at the top; prepend the full NVIDIA copyright header (the standard
multi-line header used across the repo for .py files) to the very top of this
new Python source (tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl.py)
so the SPDX lines remain but are preceded by the canonical NVIDIA header block.
- Around line 119-127: The test is missing an assertion that the loaded
dense-only config preserves an empty deepstack index list; update
test_qwen35_dense_vl_config_preserves_vlm_architecture to assert that
config.visual_config.deepstack_visual_indexes is an empty list (or use
getattr(config.visual_config, "deepstack_visual_indexes", []) == []) so the
dense-vs-MoE contract is enforced; locate the assertion block around
config.text_config checks and add this check referring to config and
config.visual_config.deepstack_visual_indexes.
---
Outside diff comments:
In `@tensorrt_llm/_torch/models/modeling_qwen3vl.py`:
- Around line 1286-1315: orig_input_ids is being set from the already
post-processed/fused sequence (after _postprocess and fuse_input_embeds) which
yields multimodal placeholder IDs; instead capture and pass the original
tokenizer IDs (or a text-only token sequence with MM spans removed) created
before _postprocess into SpecDecOneEngineForCausalLM.forward. Locate where
input_ids is first preprocessed (the variable produced before
_postprocess/fuse_input_embeds) and replace the assignment orig_input_ids =
input_ids with orig_input_ids = <pre-_postprocess_token_ids> (or construct a
filtered text-only sequence), then pass that orig_input_ids through the
self.llm.forward call so downstream SpecDecOneEngineForCausalLM.forward and
modeling_speculative.py embedding paths receive valid in-vocab token IDs rather
than the fused placeholder IDs.
---
Nitpick comments:
In `@tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl_moe.py`:
- Line 383: The hardcoded first dimension 3 in the expand call on
trtllm_inputs["position_ids"] + mrope_gen_position_ids should be derived or
documented: replace the literal with the mRoPE section count by using
len(self.hf_config.text_config.rope_parameters["mrope_section"]) (or assign that
length to a local variable like mrope_sections_count) when calling .expand, or
if the model guarantees a fixed 3-section mRoPE, add a concise comment next to
the .expand call (referencing trtllm_inputs, mrope_gen_position_ids, and
.expand) stating that 3 equals the number of mRoPE sections (temporal, height,
width).
- Around line 381-402: The test uses mixed device movement calls: replace direct
.cuda() calls with .to(self.device) for consistency and configurability — update
the lines setting trtllm_inputs["position_ids"] (currently using .cuda()) and
the final position_ids = torch.cat(...).cuda() to use .to(self.device); ensure
any tensors like mrope_gen_position_ids, trtllm_inputs["position_ids"], and the
mrope_position_ids concatenation use .to(self.device) so device handling is
uniform across mrope_gen_position_ids, trtllm_inputs,
gen_multimodal_params_list, and position_ids.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 37daafab-d6b6-4d29-b1f5-cb9518d5881f
📒 Files selected for processing (16)
docs/source/models/supported-models.mdtensorrt_llm/_torch/models/__init__.pytensorrt_llm/_torch/models/checkpoints/hf/qwen3_5_weight_mapper.pytensorrt_llm/_torch/models/modeling_qwen3_5.pytensorrt_llm/_torch/models/modeling_qwen3_next.pytensorrt_llm/_torch/models/modeling_qwen3vl.pytensorrt_llm/_torch/models/modeling_speculative.pytensorrt_llm/_torch/pyexecutor/config_utils.pytensorrt_llm/_torch/pyexecutor/model_loader.pytests/integration/defs/accuracy/references/mmmu.yamltests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.pytests/integration/test_lists/qa/llm_function_core.txttests/integration/test_lists/test-db/l0_h100.ymltests/unittest/_torch/modeling/test_modeling_multimodal.pytests/unittest/_torch/modeling/test_modeling_qwen3_5_vl.pytests/unittest/_torch/modeling/test_modeling_qwen3_5_vl_moe.py
|
PR_Github #53507 [ run ] completed with state
|
| while keeping `atol` at `0.4` to absorb single-logit tail outliers. | ||
| Same band as the MoE parity test. | ||
| """ | ||
| return 0.4, 0.1 |
There was a problem hiding this comment.
Wait wait wait, we have such high deviation from the transformers output?
There was a problem hiding this comment.
The multimodal harness default is 0.4, 0.4 (test_modeling_multimodal.py:213) - every other VLM test (Qwen2.5-VL, Qwen3-VL, Gemma3-VL, …) runs at that. We override to 0.4, 0.1, i.e. 4x tighter rtol, same atol. So this is the strictest band in the VLM suite, not a loosening one?
|
Thanks for the work — flagging the items I think should block merge:
|
|
Thanks for the work on this — a few items I'd suggest addressing before merge: 1. (Blocker) Base conflict. 2. (Blocker) 3. (Should-fix in this PR, not a follow-up) 4. (Need conclusion before merge) Transformers reference deviation. In Items 1–4 are merge gates from my side. Other CodeRabbit Major items (e.g. the |
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests
Description
Completes Qwen3.5 VL Dense model.
Temporarily rebased on top of #14599 for the MoE variant (the very first commit).
Test Coverage
Accuracy & unit tests
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.