
[#12699][feat] consolidate piecewise CUDA graph VLM updates#12852

Open
nvchenghaoz wants to merge 6 commits into NVIDIA:main from nv-auto-deploy:chenghao/piecewise_cg_vlm_0408

Conversation

@nvchenghaoz
Collaborator

@nvchenghaoz nvchenghaoz commented Apr 8, 2026

Squash the piecewise CUDA graph VLM changes into a single commit so downstream branches can cherry-pick the complete update in one step.

Refine #12749, Fix #12699

Summary by CodeRabbit

  • Chores

    • Enhanced CUDA graph capture to better handle dynamic batch dimensions and complex model structures.
    • Improved graph compilation targeting to support nested GraphModule compilation with proper context propagation.
    • Refined output reconstruction logic for piecewise captured graphs to maintain correct tensor shapes and structures.
  • Tests

    • Added comprehensive test coverage for dynamic dimension handling, output reconstruction, and static buffer management in compiled graphs.
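
The refined output reconstruction mentioned above can be sketched as a tiny unflatten pass over the flat tuple a captured graph returns. This is an illustrative stand-in only: the actual implementation uses pytree-style output specs, and the `unflatten` helper and None-leaf spec format here are hypothetical.

```python
def unflatten(flat, spec):
    """Rebuild a nested structure from a flat sequence of leaves.

    `spec` mirrors the desired structure, with None marking each leaf slot
    (a stand-in for the real pytree TreeSpec a compiled graph would carry).
    """
    it = iter(flat)

    def build(s):
        if s is None:
            return next(it)  # consume the next flat leaf
        return type(s)(build(child) for child in s)

    return build(spec)

# A piecewise graph replays to a flat tuple; the spec restores the shape
# the caller originally expected, e.g. (logits, [kv_a, kv_b]).
print(unflatten([1, 2, 3], (None, [None, None])))  # (1, [2, 3])
```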


Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Made-with: Cursor
@nvchenghaoz nvchenghaoz requested a review from a team as a code owner April 8, 2026 18:35
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
@coderabbitai
Contributor

coderabbitai bot commented Apr 8, 2026

📝 Walkthrough

This PR introduces support for piecewise CUDA graph compilation of nested submodules (such as inner language models in VLM wrapper architectures) by adding configurable dynamic tensor slicing dimensions, static buffer reuse mechanisms, and structured output reconstruction capabilities. The compiler now targets specific GraphModule instances within a module hierarchy rather than the entire module, passing the full model context to backends when compiling nested submodules.
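
As a rough sketch of the dynamic-dimension auto-detection described above, the logic reduces to a per-dimension comparison between each runtime input's shape and the maximum-shape capture buffer. The `detect_dynamic_dims` helper below is hypothetical; the real `CapturedGraph` operates on torch tensors and per-bucket capture state.

```python
def detect_dynamic_dims(runtime_shapes, buffer_shapes):
    """Map input index -> list of dims whose extent differs from the static buffer.

    Hypothetical helper: in practice the shapes come from torch.Tensor.shape,
    but the detection reduces to a per-dimension comparison like this.
    """
    dynamic = {}
    for i, (shape, ref) in enumerate(zip(runtime_shapes, buffer_shapes)):
        dims = [d for d in range(len(shape)) if shape[d] != ref[d]]
        if dims:
            dynamic[i] = dims
    return dynamic

# Input 0 varies along dim 0 (batch); input 1 is fully static.
print(detect_dynamic_dims([(3, 128, 64), (64,)], [(8, 128, 64), (64,)]))
# {0: [0]}
```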

Changes

Cohort / File(s) | Summary

**CUDA Graph Compilation Core**
`tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py`
Extended `CapturedGraph` to accept and auto-detect `dynamic_dims`, enabling configurable tensor slicing beyond dim 0. Updated `PiecewiseCapturedGraph` with an `out_spec` for structured output reconstruction, static input-buffer allocation and copying for kwargs with unstable addresses, and per-bucket dynamic-dimension detection. Enhanced `DualModeCapturedGraph` to proxy attributes to wrapped models, use `batch_info_host` semantics, and replace tuple slicing with the dimension-aware `_truncate_output`. Added a `_capture_inner_kwargs` helper to intercept kwargs passed to the inner GraphModule. Updated `TorchCudagraphCompiler.compile` to support wrapper-aware inner-GraphModule capture when `full_model` is provided.

**Compile Pipeline Integration**
`tensorrt_llm/_torch/auto_deploy/transform/library/compile_model.py`
Changed the compilation strategy to target one or more `torch.fx.GraphModule` instances discovered within the input module rather than the entire module. Collects top-level targets (the root GraphModule plus nested submodules), invokes backend compilation for each with `full_model` context for nested targets, and replaces compiled submodules via a dotted-path setter. Added the `_set_submodule` helper function.

**Test Coverage**
`tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py`
Added comprehensive tests covering capture-time dynamic-dimension detection, per-input slicing extents, output truncation with dynamic dimensions, piecewise-graph output reconstruction with unflatten recovery, static input-buffer mechanics with shape-stability handling, and compilation-target selection for nested GraphModule discovery.

Sequence Diagram(s)

sequenceDiagram
    participant User as User/Compiler
    participant TCC as TorchCudagraphCompiler
    participant CM as CompileModel
    participant WM as Wrapper Model<br/>(full_model)
    participant IGM as Inner GraphModule
    participant KWC as _capture_inner_kwargs
    participant PCG as PiecewiseCapturedGraph
    
    User->>TCC: compile(full_model=wrapper_model)
    TCC->>WM: forward(args, **kwargs)
    WM->>IGM: forward(top_level_kwargs)
    KWC->>WM: Hook captures kwargs→IGM
    KWC-->>TCC: Returns inner_kwargs
    TCC->>PCG: __init__(out_spec=..., full_model=...)
    PCG->>PCG: _allocate_static_buffers(inner_kwargs)
    PCG->>PCG: Detect dynamic_dims per bucket
    TCC-->>User: Returns compiled_wrapper
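
The kwargs-interception step in the capture flow above can be approximated by temporarily wrapping the inner module's `forward` during one wrapper call. This is a simplified, hypothetical stand-in for `_capture_inner_kwargs`; the class and attribute names below are illustrative.

```python
import contextlib

@contextlib.contextmanager
def capture_inner_kwargs(inner):
    """Record the kwargs the wrapper passes to `inner.forward` during one call."""
    captured = {}
    original = inner.forward

    def spy(*args, **kwargs):
        captured.update(kwargs)           # remember what the wrapper passed down
        return original(*args, **kwargs)  # preserve normal behavior

    inner.forward = spy
    try:
        yield captured
    finally:
        inner.forward = original  # always restore the real forward

class Inner:  # stand-in for the inner GraphModule
    def forward(self, **kwargs):
        return sum(kwargs.values())

class Wrapper:  # stand-in for the VLM wrapper (full_model)
    def __init__(self, inner):
        self.inner = inner
    def forward(self, x):
        return self.inner.forward(hidden=x, scale=2)

inner = Inner()
with capture_inner_kwargs(inner) as kwargs:
    Wrapper(inner).forward(40)
print(kwargs)  # {'hidden': 40, 'scale': 2}
```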
sequenceDiagram
    participant RT as Runtime
    participant PCG as PiecewiseCapturedGraph
    participant SB as Static Buffers
    participant CG as CUDA Graph<br/>(Piecewise)
    
    RT->>PCG: forward(args, kwargs)
    PCG->>SB: _copy_to_static_buffers(kwargs)
    SB->>SB: Copy with narrow() at dynamic_dim
    PCG->>CG: Replay with sliced inputs
    CG-->>PCG: Output (flat tuple)
    PCG->>PCG: _reconstruct_output(out_spec)
    PCG-->>RT: Structured output
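
The runtime path above, copying into fixed-address buffers and then truncating outputs along the dynamic dimension, reduces to the following pattern. This list-based sketch is illustrative; the real code uses `torch.Tensor.narrow().copy_()` on preallocated CUDA buffers.

```python
def copy_into_static(static_buf, runtime_data):
    """Copy runtime data into the prefix of a preallocated fixed-size buffer."""
    n = len(runtime_data)
    assert n <= len(static_buf), "runtime extent exceeds captured buffer"
    static_buf[:n] = runtime_data  # analog of narrow(0, 0, n).copy_()
    return n  # valid extent, used later to truncate outputs

def truncate_outputs(out_flat, extent, dynamic_indices):
    """Slice each dynamic output back to the actual runtime extent."""
    return tuple(
        o[:extent] if i in dynamic_indices else o
        for i, o in enumerate(out_flat)
    )

buf = [0] * 8                        # captured at max batch size 8
extent = copy_into_static(buf, [1, 2, 3])
out = truncate_outputs((buf, "static_meta"), extent, {0})
print(out)  # ([1, 2, 3], 'static_meta')
```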

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check ✅ Passed: The PR title '[#12699][feat] consolidate piecewise CUDA graph VLM updates' clearly and concisely describes the main change: consolidating the piecewise CUDA graph updates for VLM support into a single commit. It references the issue number, specifies the type, and accurately summarizes the primary objective.
  • Description check ✅ Passed: The PR description provides the essential context: it explains the purpose (squashing changes for downstream cherry-picking), references related PRs/issues (#12749, #12699), and includes a request for an automated summary. While brief, it contains sufficient information for understanding the PR's intent.
  • Linked Issues check ✅ Passed: The PR comprehensively addresses all objectives from issue #12699: enabling piecewise cudagraph capture for VLM subgraphs, handling input-argument differences, managing static buffer addresses, and supporting monolithic mode with dynamic dimensions. The code changes across all three files directly implement these requirements.
  • Out of Scope Changes check ✅ Passed: All code changes are directly aligned with the linked issue #12699 objectives. The modifications to torch_cudagraph.py, compile_model.py, and the test additions focus exclusively on supporting piecewise CUDA graph capture for VLMs without introducing unrelated functionality.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage; the check was skipped.


Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1)

152-153: Consider adding strict=True for length validation.

Both lists are expected to have the same length by construction, but adding strict=True provides an extra safety check against mismatched output specs.

♻️ Suggested refinement
-            for o_buffer, o in zip(self._out_buffer_flat, out_flat):
+            for o_buffer, o in zip(self._out_buffer_flat, out_flat, strict=True):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py` around
lines 152 - 153, The zip over outputs in the loop pairing self._out_buffer_flat
and out_flat should use strict=True to assert both sequences have the same
length; update the loop in torch_cudagraph.py where you currently iterate "for
o_buffer, o in zip(self._out_buffer_flat, out_flat):" to use
zip(self._out_buffer_flat, out_flat, strict=True) so mismatched output specs
raise an error immediately (locations to check: the symbols
self._out_buffer_flat, out_flat, and the loop performing
o_buffer.narrow(...).copy_(o)).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 35197296-fa46-4373-aa6e-b898ee524c1d

📥 Commits

Reviewing files that changed from the base of the PR and between 2fe39c1 and bd7342e.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/compile_model.py
  • tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py

@nvchenghaoz
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #42383 [ run ] triggered by Bot. Commit: 021821a Link to invocation

Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
@nvchenghaoz
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #42391 [ run ] triggered by Bot. Commit: 0248f7c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42383 [ run ] completed with state ABORTED. Commit: 021821a

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42391 [ run ] completed with state SUCCESS. Commit: 0248f7c
/LLM/main/L0_MergeRequest_PR pipeline #33167 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Collaborator

@taylor-yb-lee taylor-yb-lee left a comment


Tested, LGTM

@nvchenghaoz
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #42430 [ run ] triggered by Bot. Commit: 0248f7c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42430 [ run ] completed with state SUCCESS. Commit: 0248f7c
/LLM/main/L0_MergeRequest_PR pipeline #33201 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@nvchenghaoz
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #42556 [ run ] triggered by Bot. Commit: 5f2a4f3 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42556 [ run ] completed with state SUCCESS. Commit: 5f2a4f3
/LLM/main/L0_MergeRequest_PR pipeline #33293 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@nvchenghaoz
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #42594 [ run ] triggered by Bot. Commit: 5f2a4f3 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42594 [ run ] completed with state SUCCESS. Commit: 5f2a4f3
/LLM/main/L0_MergeRequest_PR pipeline #33321 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
@nvchenghaoz
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #42717 [ run ] triggered by Bot. Commit: 1cdc087 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42717 [ run ] completed with state SUCCESS. Commit: 1cdc087
/LLM/main/L0_MergeRequest_PR pipeline #33409 completed with status: 'SUCCESS'

CI Report

Link to invocation



Development

Successfully merging this pull request may close these issues.

[Feature]: [AutoDeploy] Support piecewise cudagraph for VLM models

3 participants