[TRTLLM-13383][feat] Add support for Qwen3.5 VL Dense by moraxu · Pull Request #15249 · NVIDIA/TensorRT-LLM

moraxu · 2026-06-11T06:56:00Z

Summary by CodeRabbit

Release Notes

New Features
- Added support for Qwen3.5 vision-language models (both MoE and dense variants) enabling multimodal inference with image and video inputs.
- Enhanced speculative decoding to support vision-language model wrappers.
Documentation
- Updated model support matrix with Qwen3.5 multimodal model entries.
Tests
- Added integration accuracy tests for Qwen3.5 multimodal models with quantization support (FP8, bfloat16).
- Added comprehensive unit tests for model configuration, loading, and forward compatibility.

Description

Completes support for the Qwen3.5 VL Dense model.
Temporarily rebased on top of #14599, which covers the MoE variant (the very first commit).

Test Coverage

Accuracy & unit tests

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Michal Guzek <mguzek@nvidia.com>

moraxu · 2026-06-11T06:56:18Z

/bot run

tensorrt-cicd · 2026-06-11T07:01:54Z

PR_Github #53507 [ run ] triggered by Bot. Commit: 4bd5c5f Link to invocation

coderabbitai · 2026-06-11T07:10:56Z

Walkthrough

This PR implements end-to-end Qwen3.5 multimodal vision-language (VL) model support for TensorRT-LLM. It introduces configuration normalization layers to adapt HuggingFace configs into Qwen3Next-compatible shapes, registers new VLM model classes and weight mappers, integrates VLM forward paths with speculative-decoding support, extends the multimodal test harness for hybrid KV-cache architectures, and provides comprehensive unit and integration test coverage for both dense and MoE variants.

Changes

Qwen3.5 VL Infrastructure and Tests

| Layer / File(s) | Summary |
| --- | --- |
| Config Normalization & Dtype Resolution `tensorrt_llm/_torch/models/modeling_qwen3_5.py`, `tensorrt_llm/_torch/pyexecutor/config_utils.py`, `tensorrt_llm/_torch/pyexecutor/model_loader.py` | `Qwen35ConfigCompat` and helper functions normalize HF Qwen3.5 text/VLM configs into Qwen3Next-compatible shapes via RoPE flattening, quantization rewrites, MoE/dense alias materialization, and architecture patching. Centralized dtype resolvers (`resolve_hf_torch_dtype`, `resolve_mamba_ssm_cache_dtype`) handle `"auto"` values in Mamba and torch dtype fields, with fallback chains used by KV-cache and model-loader paths. |
| Model Export & Weight Mapper Registration `tensorrt_llm/_torch/models/__init__.py`, `tensorrt_llm/_torch/models/checkpoints/hf/qwen3_5_weight_mapper.py` | `Qwen3_5MoeVLModel` and `Qwen3_5VLModel` are exported from the public models API. Weight mapper registrations extend `Qwen3_5MoeHfWeightMapper` to handle both `Qwen3_5MoeForConditionalGeneration` and `Qwen3_5ForConditionalGeneration` HF model identifiers. |
| VLM Wrapper, Forward Integration & Speculative Decoding `tensorrt_llm/_torch/models/modeling_qwen3_5.py`, `tensorrt_llm/_torch/models/modeling_qwen3vl.py`, `tensorrt_llm/_torch/models/modeling_qwen3_next.py`, `tensorrt_llm/_torch/models/modeling_speculative.py` | `_Qwen3_5VLModel` wrapper implements vision encoder composition, multimodal placeholder metadata, and custom weight loading that filters visual weights and remaps `model.language_model.` to `model.`. VL forward path preserves `orig_input_ids` before `fuse_input_embeds` and forwards it to the LLM. Speculative decoding falls back to `kwargs["orig_input_ids"]` for VLM inputs. Qwen3VLModelBase gains Qwen3.5 architecture recognition and robust `head_dim` resolution. `Qwen3NextForCausalLM.load_weights` now accepts optional `params_map` and `allow_partial_loading` parameters. |
| Test Infrastructure: Hybrid KV-Cache & mRoPE Support `tests/unittest/_torch/modeling/test_modeling_multimodal.py` | Test harness detects hybrid model types and dispatches to `CppMambaHybridCacheManager` for linear-attention KV-cache initialization. New `_dummy_request_kwargs` hook enables subclasses to inject mRoPE-specific dummy request parameters. New `get_hybrid_kv_cache_manager` method extracts Mamba parameters and constructs hybrid cache config. `setUp` pre-initializes cache manager and attention metadata to `None` for clean lifecycle handling. |
| Unit Tests: Dense Qwen3.5-VL Parity & Config `tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl.py` | Synthetic dense VL config writer and unit tests verify architecture preservation, dtype resolution, model/mapper auto-resolution, and multimodal placeholder registration for `"qwen3_5"`. `TestQwen3_5VL` parity class normalizes dense configs, optionally loads HF weights, applies mRoPE position-id override for both generation and non-generation paths, and enumerates image/multiple-image/video modality sweep with hybrid KV-cache disabled. |
| Unit Tests: MoE Qwen3.5-VL Parity & Config `tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl_moe.py` | Synthetic MoE VL config writer and unit tests validate config preservation (RoPE/MoE/Mamba settings), dtype resolution (`"auto"` → `torch.float32`), TRT-LLM auto-resolution to `Qwen3_5MoeVLModel` with `Qwen3_5MoeHfWeightMapper`, and `"qwen3_5_moe"` placeholder metadata. `TestQwen3_5MoeVL` parity class deep-copies HF config, normalizes via `_normalize_qwen35_moe_vl_config`, loads weights with post-load hooks, and overrides position-ids using mRoPE deltas from multimodal parameters during both prefill and generation. |
| Integration Tests: MMMU Accuracy & Test Lists `tests/integration/defs/accuracy/references/mmmu.yaml`, `tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py`, `tests/integration/test_lists/qa/llm_function_core.txt`, `tests/integration/test_lists/test-db/l0_h100.yml` | MMMU reference entries added for `Qwen/Qwen3.5-27B` and expanded for `Qwen/Qwen3.5-35B-A3B` with default, bf16, and FP8 quantization variants. Integration test classes `TestQwen3_5_35B_A3B_VL` and `TestQwen3_5_27B_VL` configure MMMU sampling, reduced KV cache batch size (32), block-reuse disable, and model-specific test paths (quantization assertion for 35B-A3B). QA and H100 test lists updated to include `TestQwen3_5MoeVL::test_all` and `TestQwen3_5VL::test_all`. |
| Documentation: Supported Models Matrix `docs/source/models/supported-models.md` | `Qwen3_5MoeForConditionalGeneration` entry added to multimodal feature support matrix for PyTorch backend. |
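
The walkthrough says the VLM wrapper's weight loading filters out visual weights and remaps `model.language_model.` to `model.`. A minimal sketch of that idea in plain Python follows. The helper name `split_and_remap`, the `model.visual.` prefix, and the toy checkpoint keys are assumptions made for illustration and are not taken from the PR's code.

```python
from typing import Dict

# Assumed prefixes: LM_PREFIX comes from the walkthrough text, while
# VISUAL_PREFIX is a guess at how vision-encoder weights are named.
VISUAL_PREFIX = "model.visual."
LM_PREFIX = "model.language_model."


def split_and_remap(weights: Dict[str, object]) -> Dict[str, object]:
    """Drop vision-encoder weights and rename language-model keys.

    Keys under LM_PREFIX are rewritten to start with "model." so that a
    text-only decoder can consume them; other non-visual keys pass through.
    """
    remapped = {}
    for name, tensor in weights.items():
        if name.startswith(VISUAL_PREFIX):
            continue  # visual weights are handled by the vision encoder
        if name.startswith(LM_PREFIX):
            name = "model." + name[len(LM_PREFIX):]
        remapped[name] = tensor
    return remapped


# Toy checkpoint with made-up keys; values stand in for tensors.
checkpoint = {
    "model.visual.patch_embed.weight": "v0",
    "model.language_model.layers.0.mlp.weight": "w0",
    "lm_head.weight": "h0",
}
llm_weights = split_and_remap(checkpoint)
print(sorted(llm_weights))
```

In this sketch the decoder only ever sees `model.layers.0.mlp.weight` and `lm_head.weight`, which is the shape a text-only `load_weights` would presumably expect.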

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#15079: Registers "qwen3_5_moe" multimodal placeholder metadata alongside Qwen3.5 MoE VLM integration (same config/model-type paths).
NVIDIA/TensorRT-LLM#14465: Reverts the same Qwen3.5 VL MoE support (VLM classes, config normalization, mapper registrations, tests) across the same modules.
NVIDIA/TensorRT-LLM#14926: Adds spec_input_ids handling to speculative decoding for draft worker compatibility; this PR extends that pattern to VLM wrappers via orig_input_ids forwarding.

Suggested labels

api-compatible

Suggested reviewers

yechank-nvidia
nv-guomingz
venkywonka
dongxuy04
2ez4bz

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 43.59% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title clearly identifies the main change: adding support for Qwen3.5 VL Dense model. It includes the JIRA ticket, feature type, and a concise summary of the primary objective. |
| Description check | ✅ Passed | The PR description is minimal but functional. It states the main objective ('Completes Qwen3.5 VL Dense model'), mentions test coverage (accuracy & unit tests), and includes the completed checklist. However, it lacks detailed explanation of what changes were made and why. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/_torch/models/modeling_qwen3vl.py (1)
1286-1315: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve tokenizer IDs here, not the fused multimodal placeholder IDs.

orig_input_ids on this path are already postprocessed multimodal ids. In this file, _postprocess() rewrites every image/video token to self.tllm_multimodal_token_id, and that placeholder is defined as vocab_size + 1. The new fallback in SpecDecOneEngineForCausalLM.forward() can therefore hand out out-of-vocab ids to draft models, and several draft paths in modeling_speculative.py do embed_tokens(input_ids) when inputs_embeds is absent. Please thread the pre-_postprocess() tokenizer ids (or a text-only sequence with MM spans removed) into spec decoding instead of reusing the fused placeholder sequence.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/models/modeling_qwen3vl.py` around lines 1286 - 1315,
orig_input_ids is being set from the already post-processed/fused sequence
(after _postprocess and fuse_input_embeds) which yields multimodal placeholder
IDs; instead capture and pass the original tokenizer IDs (or a text-only token
sequence with MM spans removed) created before _postprocess into
SpecDecOneEngineForCausalLM.forward. Locate where input_ids is first
preprocessed (the variable produced before _postprocess/fuse_input_embeds) and
replace the assignment orig_input_ids = input_ids with orig_input_ids =
<pre-_postprocess_token_ids> (or construct a filtered text-only sequence), then
pass that orig_input_ids through the self.llm.forward call so downstream
SpecDecOneEngineForCausalLM.forward and modeling_speculative.py embedding paths
receive valid in-vocab token IDs rather than the fused placeholder IDs.
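
To make the concern above concrete, here is a toy sketch of why a placeholder defined as `vocab_size + 1` cannot be embedded, and why capturing the tokenizer IDs before postprocessing avoids the problem. The vocabulary size, the image token ID, and the helper functions are all made up for illustration; they only mimic the behavior the comment describes and are not the actual `_postprocess()` or `embed_tokens` implementations.

```python
VOCAB_SIZE = 8                        # toy vocabulary size (assumption)
MM_PLACEHOLDER_ID = VOCAB_SIZE + 1    # mirrors "vocab_size + 1" from the comment
IMAGE_TOKEN_ID = 5                    # hypothetical in-vocab image token


def postprocess(token_ids):
    """Rewrite every image token to the out-of-vocab placeholder."""
    return [MM_PLACEHOLDER_ID if t == IMAGE_TOKEN_ID else t for t in token_ids]


def embed_tokens(token_ids, table):
    """Toy embedding lookup that rejects IDs outside the table."""
    for t in token_ids:
        if not 0 <= t < len(table):
            raise IndexError(f"token id {t} is outside vocab of size {len(table)}")
    return [table[t] for t in token_ids]


table = [[float(i)] for i in range(VOCAB_SIZE)]
tokenizer_ids = [1, 5, 5, 2]          # IDs as produced by the tokenizer
orig_input_ids = list(tokenizer_ids)  # captured BEFORE postprocessing
fused_ids = postprocess(tokenizer_ids)

# A draft model embedding the preserved tokenizer IDs works fine.
draft_embeds = embed_tokens(orig_input_ids, table)

# Embedding the fused sequence hits the out-of-vocab placeholder.
try:
    embed_tokens(fused_ids, table)
    fused_lookup_failed = False
except IndexError:
    fused_lookup_failed = True
print(fused_ids, fused_lookup_failed)
```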

🧹 Nitpick comments (2)

tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl_moe.py (2)

383-383: ⚡ Quick win

Document or derive the hardcoded dimension 3 in expand().

The .expand(3, -1, 1) call hardcodes the first dimension as 3, which corresponds to the number of mRoPE sections (temporal, height, width). Consider deriving this from len(self.hf_config.text_config.rope_parameters["mrope_section"]) or adding a comment explaining that 3 is the fixed mRoPE dimension count for vision-language models.

♻️ Option 1: Derive from config

+        num_mrope_sections = len(self.hf_config.text_config.rope_parameters["mrope_section"])
         trtllm_inputs["position_ids"] = (
-            (trtllm_inputs["position_ids"] + mrope_gen_position_ids).expand(3, -1, 1).cuda()
+            (trtllm_inputs["position_ids"] + mrope_gen_position_ids).expand(num_mrope_sections, -1, 1).cuda()
         )

♻️ Option 2: Add explanatory comment

+        # mRoPE uses 3 dimensions: temporal (video frames), height, width
         trtllm_inputs["position_ids"] = (
             (trtllm_inputs["position_ids"] + mrope_gen_position_ids).expand(3, -1, 1).cuda()
         )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl_moe.py` at line 383,
The hardcoded first dimension 3 in the expand call on
trtllm_inputs["position_ids"] + mrope_gen_position_ids should be derived or
documented: replace the literal with the mRoPE section count by using
len(self.hf_config.text_config.rope_parameters["mrope_section"]) (or assign that
length to a local variable like mrope_sections_count) when calling .expand, or
if the model guarantees a fixed 3-section mRoPE, add a concise comment next to
the .expand call (referencing trtllm_inputs, mrope_gen_position_ids, and
.expand) stating that 3 equals the number of mRoPE sections (temporal, height,
width).

381-402: 💤 Low value

Inconsistent device-movement pattern.

Line 381 uses .to(self.device) while lines 383 and 402 use .cuda() directly. For consistency and flexibility (e.g., if self.device ever becomes configurable), prefer .to(self.device) throughout or document why direct .cuda() calls are used.

♻️ Consistent device movement

             mrope_gen_position_ids = torch.cat(mrope_gen_position_ids, dim=-1).to(self.device)
             trtllm_inputs["position_ids"] = (
-                (trtllm_inputs["position_ids"] + mrope_gen_position_ids).expand(3, -1, 1).cuda()
+                (trtllm_inputs["position_ids"] + mrope_gen_position_ids).expand(3, -1, 1).to(self.device)
             )
             gen_multimodal_params_list = []
             for multimodal_param in multimodal_params_list:
@@ -399,7 +399,7 @@
                 mrope_position_ids.append(
                     multimodal_param.multimodal_data["mrope_config"]["mrope_position_ids"]
                 )
-            position_ids = torch.cat(mrope_position_ids, dim=-1).cuda()
+            position_ids = torch.cat(mrope_position_ids, dim=-1).to(self.device)
             trtllm_inputs["position_ids"] = position_ids

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl_moe.py` around lines
381 - 402, The test uses mixed device movement calls: replace direct .cuda()
calls with .to(self.device) for consistency and configurability — update the
lines setting trtllm_inputs["position_ids"] (currently using .cuda()) and the
final position_ids = torch.cat(...).cuda() to use .to(self.device); ensure any
tensors like mrope_gen_position_ids, trtllm_inputs["position_ids"], and the
mrope_position_ids concatenation use .to(self.device) so device handling is
uniform across mrope_gen_position_ids, trtllm_inputs,
gen_multimodal_params_list, and position_ids.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/models/modeling_qwen3_5.py`:
- Around line 466-476: The load_weights method currently overwrites the injected
weight_mapper with a new Qwen3_5MoeHfWeightMapper; preserve the mapper passed
into load_weights (do not reassign weight_mapper) so the mapper provided by
ModelLoader.load / checkpoint_loader.get_initialized_weight_mapper is honored,
and remove the line instantiating Qwen3_5MoeHfWeightMapper; if this code truly
only supports HF-style mappers, replace the overwrite with an explicit type
check/assert (e.g. verify isinstance(weight_mapper, Qwen3_5MoeHfWeightMapper)
and raise a clear error) before calling self.llm.load_weights(filtered_weights,
weight_mapper, params_map=params_map).
- Around line 438-451: The VLM wrapper _Qwen3_5VLModel is not propagating the
inner Qwen3Next LM's get_model_defaults (so enable_block_reuse=False is lost);
implement an override of get_model_defaults on _Qwen3_5VLModel (or on
Qwen3VLModelBase if shared) that calls/merges the inner decoder's defaults (from
Qwen3_5ForCausalLM / Qwen3_5MoeForCausalLM or the Qwen3Next decoder class) with
the wrapper defaults and ensures enable_block_reuse is set to False when absent;
update ModelLoader.load_config_and_apply_defaults usage to rely on this merged
dict so the VLM path inherits the inner decoder's enable_block_reuse=False
default.

In `@tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py`:
- Around line 521-548: The TestQwen3_5_27B_VL integration test is missing the
Hopper gating and should be skipped on pre-Hopper systems; add the
`@skip_pre_hopper` decorator above the class TestQwen3_5_27B_VL (the class that
defines MODEL_NAME "Qwen/Qwen3.5-27B" and method test_auto_dtype) so the dense
Qwen3.5-VL variant won't run on unsupported SM versions.

In `@tests/unittest/_torch/modeling/test_modeling_multimodal.py`:
- Around line 582-624: The hybrid cache-manager builder
get_hybrid_kv_cache_manager always constructs kv_cache_config =
PyKvCacheConfig(max_tokens=...) and never sets enable_block_reuse from the test
scenario, so block-reuse cannot be toggled; update get_hybrid_kv_cache_manager
to accept or read the scenario.kv_cache_reuse flag (or a passed-in
kv_cache_reuse param) and pass it into PyKvCacheConfig(enable_block_reuse=...)
when constructing kv_cache_config, ensuring the test's
MultimodalScenario.kv_cache_reuse affects the PyKvCacheConfig used by
CppMambaHybridCacheManager; use the existing symbol names PyKvCacheConfig,
get_hybrid_kv_cache_manager, and MultimodalScenario.kv_cache_reuse to locate and
modify the code.

In `@tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl_moe.py`:
- Around line 1-2: Replace the SPDX-style two-line header (the lines starting
with "# SPDX-FileCopyrightText" and "# SPDX-License-Identifier") with the
repository's required full NVIDIA Apache 2.0 header block that includes the year
of latest modification and the full license text; update the top of
tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl_moe.py by removing the
SPDX lines and prepending the canonical multi-line NVIDIA Apache-2.0 header used
across the repo so the file header matches the coding guidelines.

In `@tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl.py`:
- Around line 349-352: The test moves position tensors to the current CUDA
device with .cuda(), causing potential device-mismatch; update the code that
constructs trtllm_inputs["position_ids"] to keep tensors on self.device by
replacing .cuda() with .to(self.device) (ensure torch.cat(...).to(self.device) /
the final .expand(...).to(self.device) as appropriate) for the block using
mrope_gen_position_ids and the similar occurrence around the other
trtllm_inputs["position_ids"] usage so all position_ids live on self.device
alongside the model and payloads.
- Around line 1-2: This file is missing the required NVIDIA copyright header
block at the top; prepend the full NVIDIA copyright header (the standard
multi-line header used across the repo for .py files) to the very top of this
new Python source (tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl.py)
so the SPDX lines remain but are preceded by the canonical NVIDIA header block.
- Around line 119-127: The test is missing an assertion that the loaded
dense-only config preserves an empty deepstack index list; update
test_qwen35_dense_vl_config_preserves_vlm_architecture to assert that
config.visual_config.deepstack_visual_indexes is an empty list (or use
getattr(config.visual_config, "deepstack_visual_indexes", []) == []) so the
dense-vs-MoE contract is enforced; locate the assertion block around
config.text_config checks and add this check referring to config and
config.visual_config.deepstack_visual_indexes.
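
The first inline comment above asks that `load_weights` honor the injected mapper, or at least check its type explicitly, instead of overwriting it. A rough sketch of that pattern follows, using stand-in classes: `BaseWeightMapper`, `HfStyleMapper`, `InnerLM`, and `VLMWrapper` are hypothetical names, and the real signatures in the PR may differ.

```python
class BaseWeightMapper:
    """Stand-in for a generic weight mapper (hypothetical)."""


class HfStyleMapper(BaseWeightMapper):
    """Stand-in for an HF-style mapper such as Qwen3_5MoeHfWeightMapper."""


class InnerLM:
    """Stand-in for the wrapped language model."""

    def load_weights(self, weights, weight_mapper, params_map=None):
        # Record which mapper was actually used so the caller can inspect it.
        self.loaded_with = weight_mapper


class VLMWrapper:
    def __init__(self):
        self.llm = InnerLM()

    def load_weights(self, weights, weight_mapper, params_map=None):
        # Honor the injected mapper; fail loudly rather than silently
        # replacing it with a freshly constructed one.
        if not isinstance(weight_mapper, HfStyleMapper):
            raise TypeError(
                f"expected an HF-style mapper, got {type(weight_mapper).__name__}"
            )
        self.llm.load_weights(weights, weight_mapper, params_map=params_map)


wrapper = VLMWrapper()
injected = HfStyleMapper()
wrapper.load_weights({}, injected)
print(wrapper.llm.loaded_with is injected)
```

The design choice here is that a mismatched mapper surfaces as a `TypeError` at load time, rather than as a mapper that the loader configured and the model then quietly ignored.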

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 37daafab-d6b6-4d29-b1f5-cb9518d5881f

📥 Commits

Reviewing files that changed from the base of the PR and between a622e30 and 4bd5c5f.

📒 Files selected for processing (16)

docs/source/models/supported-models.md
tensorrt_llm/_torch/models/__init__.py
tensorrt_llm/_torch/models/checkpoints/hf/qwen3_5_weight_mapper.py
tensorrt_llm/_torch/models/modeling_qwen3_5.py
tensorrt_llm/_torch/models/modeling_qwen3_next.py
tensorrt_llm/_torch/models/modeling_qwen3vl.py
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/pyexecutor/config_utils.py
tensorrt_llm/_torch/pyexecutor/model_loader.py
tests/integration/defs/accuracy/references/mmmu.yaml
tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py
tests/integration/test_lists/qa/llm_function_core.txt
tests/integration/test_lists/test-db/l0_h100.yml
tests/unittest/_torch/modeling/test_modeling_multimodal.py
tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl.py
tests/unittest/_torch/modeling/test_modeling_qwen3_5_vl_moe.py

tensorrt-cicd · 2026-06-11T12:02:15Z

PR_Github #53507 [ run ] completed with state SUCCESS. Commit: 4bd5c5f
/LLM/main/L0_MergeRequest_PR pipeline #42665 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

moraxu · 2026-06-11T21:03:58Z

@2ez4bz - feel free to review only the 2nd commit (4bd5c5f). I'll apply your comments and CodeRabbit's in one go.

2ez4bz · 2026-06-11T21:29:11Z

+        while keeping `atol` at `0.4` to absorb single-logit tail outliers.
+        Same band as the MoE parity test.
+        """
+        return 0.4, 0.1


Wait wait wait, we have such high deviation from the transformers output?

moraxu · 2026-06-12

The multimodal harness default is 0.4, 0.4 (test_modeling_multimodal.py:213), and every other VLM test (Qwen2.5-VL, Qwen3-VL, Gemma3-VL, …) runs at that. We override to 0.4, 0.1, i.e. 4x tighter rtol with the same atol. So this is the strictest band in the VLM suite, not a loosening one?
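
As a quick illustration of the point about the two bands, the sketch below compares them using the usual closeness rule `|actual - expected| <= atol + rtol * |expected|` (the rule documented for `torch.testing.assert_close`). Whether the harness applies exactly this rule is an assumption here, and the logit values are made up.

```python
def within_tolerance(actual, expected, atol, rtol):
    """Closeness check: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Made-up reference logit and a candidate that deviates by 2.0.
expected, actual = 10.0, 12.0

# Harness default band (atol=0.4, rtol=0.4): allowed deviation is 4.4.
default_band = within_tolerance(actual, expected, atol=0.4, rtol=0.4)

# Overridden band (atol=0.4, rtol=0.1): allowed deviation is 1.4.
tight_band = within_tolerance(actual, expected, atol=0.4, rtol=0.1)

print(default_band, tight_band)
```

The same deviation passes the default band and fails the overridden one, which is consistent with reading `0.4, 0.1` as the stricter setting.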

fredricz-20070104 · 2026-06-12T02:08:51Z

Thanks for the work — flagging the items I think should block merge:

- Base has conflicts (MERGEABLE=CONFLICTING). Needs a rebase / merge from main before this can land.
- TestQwen3_5_27B_VL is missing @skip_pre_hopper. The MoE variant has it; the Dense one was missed. Without the guard it will fail when QA scheduling lands it on pre-Hopper SKUs (L40S/H20).
- _Qwen3_5VLModel does not override get_model_defaults. ModelLoader.load_config_and_apply_defaults() resolves defaults against the outer class (Qwen3_5MoeVLModel / Qwen3_5VLModel), so the inner Qwen3NextForCausalLM's {"kv_cache_config": {"enable_block_reuse": False}} is not inherited on the VLM path. That means the VLM path silently runs with enable_block_reuse=True on a Mamba/SSM hybrid LM, which is unsupported. The code's NOTE acknowledges this and defers it to a follow-up; I'd suggest fixing it in this PR (a one-line override) rather than shipping a known-broken default.
- Self-flagged accuracy deviation vs. transformers reference in test_modeling_qwen3_5_vl.py ("Wait wait wait, we have such high deviation from the transformers output?"). This needs a conclusion before merge: is it a tolerance issue, or a real accuracy regression? If real, it should at minimum be tracked as a separate follow-up issue, not left as an open question in test code.

fredricz-20070104 · 2026-06-12T02:10:59Z

Thanks for the work on this — a few items I'd suggest addressing before merge:

1. (Blocker) Base conflict. mergeStateStatus is currently DIRTY / CONFLICTING. Needs a rebase/merge of main before this can land.

2. (Blocker) TestQwen3_5_27B_VL is missing @skip_pre_hopper. The MoE variant already has it; the Dense variant doesn't. The QA pool will dispatch this to L40S/H20 and fail. One-line fix — please add it in this PR.

3. (Should-fix in this PR, not a follow-up) _Qwen3_5VLModel does not override get_model_defaults. ModelLoader.load_config_and_apply_defaults() resolves defaults against the outer Qwen3_5MoeVLModel / Qwen3_5VLModel class — it does not inherit the inner Qwen3NextForCausalLM's {"kv_cache_config": {"enable_block_reuse": False}}. The NOTE in the code acknowledges this. The result is that the VLM path silently runs with enable_block_reuse=True on a Mamba/SSM hybrid LM, which is not supported and will produce a silent KV consistency issue for users. This is a one-line override and should be fixed in this PR rather than left as a follow-up.

4. (Need conclusion before merge) Transformers reference deviation. In test_modeling_qwen3_5_vl.py you left the comment "Wait wait wait, we have such high deviation from the transformers output?" — is this a tolerance issue or a real accuracy regression? Please resolve before merge; if it's a real regression, it should be tracked as a separate issue rather than merged with a loose tolerance.

Items 1–4 are merge gates from my side. Other CodeRabbit Major items (e.g. the Qwen3_5MoeHfWeightMapper override inside load_weights) can reasonably be addressed as JIRA follow-ups since you've already acknowledged them.
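
For item 3, one possible shape of the override is sketched below: the wrapper's `get_model_defaults` merges the inner decoder's defaults with its own, so `enable_block_reuse=False` survives on the VLM path. The class names, the classmethod signature, and the `max_tokens` wrapper default are hypothetical stand-ins, since the real `get_model_defaults` signature is not shown in this thread.

```python
import copy


def merge_defaults(inner, outer):
    """Recursively merge two nested dicts; values in `outer` win on conflict."""
    merged = copy.deepcopy(inner)
    for key, value in outer.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_defaults(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


class InnerHybridLM:
    """Stand-in for the inner Mamba/SSM hybrid decoder."""

    @classmethod
    def get_model_defaults(cls):
        return {"kv_cache_config": {"enable_block_reuse": False}}


class VLMWrapper:
    """Stand-in for the outer VLM class that the loader resolves against."""

    inner_cls = InnerHybridLM
    # Purely illustrative wrapper-level default.
    wrapper_defaults = {"kv_cache_config": {"max_tokens": 1024}}

    @classmethod
    def get_model_defaults(cls):
        # Start from the inner decoder's defaults, then layer wrapper ones on top.
        return merge_defaults(cls.inner_cls.get_model_defaults(), cls.wrapper_defaults)


defaults = VLMWrapper.get_model_defaults()
print(defaults)
```

A recursive merge is used instead of `dict.update` so that a wrapper-level `kv_cache_config` entry does not replace the whole nested dict and drop the inner `enable_block_reuse` setting.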

moraxu added 2 commits June 10, 2026 23:36

Original Qwen 3.5 VL MoE PR commit

a748436

Signed-off-by: Michal Guzek <mguzek@nvidia.com>

First draft

4bd5c5f

Signed-off-by: Michal Guzek <mguzek@nvidia.com>

moraxu requested review from a team as code owners June 11, 2026 06:56

moraxu requested review from 2ez4bz, dongxuy04, kaiyux, laikhtewari and tijyojwad June 11, 2026 06:56

github-actions Bot assigned moraxu Jun 11, 2026

coderabbitai Bot reviewed Jun 11, 2026

2ez4bz reviewed Jun 11, 2026

jieli-matrix approved these changes Jun 12, 2026

Comment thread tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py

Address review comments

ac0d02f

Signed-off-by: Michal Guzek <mguzek@nvidia.com>
