[#12699][feat] consolidate piecewise CUDA graph VLM updates #12852
nvchenghaoz wants to merge 6 commits into NVIDIA:main from
Conversation
Squash the piecewise CUDA graph VLM changes into a single commit so downstream branches can cherry-pick the complete update in one step. Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> Made-with: Cursor
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
📝 Walkthrough
This PR introduces support for piecewise CUDA graph compilation of nested submodules (such as inner language models in VLM wrapper architectures) by adding configurable dynamic tensor slicing dimensions, static buffer reuse mechanisms, and structured output reconstruction capabilities. The compiler now targets specific GraphModule instances within a module hierarchy rather than the entire module, passing the full model context to backends when compiling nested submodules.
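To make the "static buffer reuse" and "dynamic tensor slicing dimensions" ideas concrete, here is a minimal, hypothetical sketch (not the PR's actual implementation): inputs whose size varies along a dynamic dimension are copied into a `narrow()` view of a preallocated max-size buffer, so a captured CUDA graph can always replay against the same storage. The function and variable names are illustrative only.

```python
import torch

def copy_to_static_buffer(static_buf: torch.Tensor, src: torch.Tensor, dynamic_dim: int) -> torch.Tensor:
    # Take a view of the static buffer restricted to the current length
    # along the dynamic dimension, then copy the live input into it.
    view = static_buf.narrow(dynamic_dim, 0, src.size(dynamic_dim))
    view.copy_(src)
    return view

buf = torch.zeros(8, 16)  # preallocated for a max length of 8
x = torch.arange(48, dtype=torch.float32).reshape(3, 16)  # current length 3
out = copy_to_static_buffer(buf, x, dynamic_dim=0)
assert out.data_ptr() == buf.data_ptr()  # same storage, no reallocation
```

Because `narrow()` returns a view starting at offset 0, the graph's captured pointers stay valid while only the leading slice is refreshed each step.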
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/Compiler
    participant TCC as TorchCudagraphCompiler
    participant CM as CompileModel
    participant WM as Wrapper Model<br/>(full_model)
    participant IGM as Inner GraphModule
    participant KWC as _capture_inner_kwargs
    participant PCG as PiecewiseCapturedGraph
    User->>TCC: compile(full_model=wrapper_model)
    TCC->>WM: forward(args, **kwargs)
    WM->>IGM: forward(top_level_kwargs)
    KWC->>WM: Hook captures kwargs→IGM
    KWC-->>TCC: Returns inner_kwargs
    TCC->>PCG: __init__(out_spec=..., full_model=...)
    PCG->>PCG: _allocate_static_buffers(inner_kwargs)
    PCG->>PCG: Detect dynamic_dims per bucket
    TCC-->>User: Returns compiled_wrapper
```
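The "Hook captures kwargs→IGM" step above can be sketched with a forward pre-hook. This is a hypothetical toy model, not the PR's code: the wrapper, inner module, and hook names are assumptions used only to show the mechanism.

```python
import torch
import torch.nn as nn

class InnerLM(nn.Module):
    def forward(self, input_ids=None, inputs_embeds=None):
        return inputs_embeds if inputs_embeds is not None else input_ids

class VLMWrapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.language_model = InnerLM()

    def forward(self, input_ids):
        embeds = input_ids.float().unsqueeze(-1)  # stand-in for an embedding step
        return self.language_model(inputs_embeds=embeds)

captured = {}

def _capture_inner_kwargs(module, args, kwargs):
    # Record exactly the kwargs the wrapper passes to the inner module,
    # so static buffers can later be allocated to match them.
    captured.update(kwargs)

model = VLMWrapper()
handle = model.language_model.register_forward_pre_hook(
    _capture_inner_kwargs, with_kwargs=True
)
model(torch.tensor([1, 2, 3]))
handle.remove()
# captured now holds the inner_kwargs seen by the inner module
```

Capturing at the inner module's boundary (rather than the wrapper's) is what lets the compiler target a nested GraphModule while the wrapper still runs eagerly.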
```mermaid
sequenceDiagram
    participant RT as Runtime
    participant PCG as PiecewiseCapturedGraph
    participant SB as Static Buffers
    participant CG as CUDA Graph<br/>(Piecewise)
    RT->>PCG: forward(args, kwargs)
    PCG->>SB: _copy_to_static_buffers(kwargs)
    SB->>SB: Copy with narrow() at dynamic_dim
    PCG->>CG: Replay with sliced inputs
    CG-->>PCG: Output (flat tuple)
    PCG->>PCG: _reconstruct_output(out_spec)
    PCG-->>RT: Structured output
```
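The `_reconstruct_output(out_spec)` step in the runtime diagram follows the usual pytree pattern: flatten the structured output once at capture time to record a spec, then unflatten the graph's flat tensor tuple at replay time. A minimal sketch using `torch.utils._pytree` (the container below is illustrative, not the PR's actual output type):

```python
import torch
from torch.utils._pytree import tree_flatten, tree_unflatten

# Capture time: record the output structure as a spec.
structured = {"logits": torch.randn(2, 4), "aux": (torch.zeros(3),)}
flat, out_spec = tree_flatten(structured)

# Replay time: the captured graph yields only flat tensors...
replayed_flat = [t.clone() for t in flat]
# ...and tree_unflatten restores the structure the caller expects.
rebuilt = tree_unflatten(replayed_flat, out_spec)
assert set(rebuilt.keys()) == {"logits", "aux"}
```

Storing `out_spec` alongside the captured graph is what allows the piecewise graph to return a CUDA-graph-friendly flat tuple internally while callers still see the original structured output.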
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1)
152-153: Consider adding `strict=True` for length validation. Both lists are expected to have the same length by construction, but adding `strict=True` provides an extra safety check against mismatched output specs.

♻️ Suggested refinement

```diff
- for o_buffer, o in zip(self._out_buffer_flat, out_flat):
+ for o_buffer, o in zip(self._out_buffer_flat, out_flat, strict=True):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py` around lines 152-153, the zip over outputs in the loop pairing self._out_buffer_flat and out_flat should use strict=True to assert both sequences have the same length; update the loop in torch_cudagraph.py where you currently iterate "for o_buffer, o in zip(self._out_buffer_flat, out_flat):" to use zip(self._out_buffer_flat, out_flat, strict=True) so mismatched output specs raise an error immediately (locations to check: the symbols self._out_buffer_flat, out_flat, and the loop performing o_buffer.narrow(...).copy_(o)).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 35197296-fa46-4373-aa6e-b898ee524c1d
📒 Files selected for processing (3)
- tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
- tensorrt_llm/_torch/auto_deploy/transform/library/compile_model.py
- tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #42383 [ run ] triggered by Bot. Commit:
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #42391 [ run ] triggered by Bot. Commit:
PR_Github #42383 [ run ] completed with state
PR_Github #42391 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #42430 [ run ] triggered by Bot. Commit:
PR_Github #42430 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #42556 [ run ] triggered by Bot. Commit:
PR_Github #42556 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast
PR_Github #42594 [ run ] triggered by Bot. Commit:
PR_Github #42594 [ run ] completed with state
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast
PR_Github #42717 [ run ] triggered by Bot. Commit:
PR_Github #42717 [ run ] completed with state
Refine #12749, Fix #12699
Summary by CodeRabbit
- Chores
- Tests