[#12699][feat] consolidate piecewise CUDA graph VLM updates #12852
nvchenghaoz wants to merge 6 commits into NVIDIA:main from
Conversation
Squash the piecewise CUDA graph VLM changes into a single commit so downstream branches can cherry-pick the complete update in one step. Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> Made-with: Cursor
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
📝 Walkthrough
This PR introduces support for piecewise CUDA graph compilation of nested submodules (such as inner language models in VLM wrapper architectures) by adding configurable dynamic tensor slicing dimensions, static buffer reuse mechanisms, and structured output reconstruction capabilities. The compiler now targets specific GraphModule instances within a module hierarchy rather than the entire module, passing the full model context to backends when compiling nested submodules.
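To make the "static buffer reuse" and "dynamic tensor slicing dimensions" ideas concrete, here is a minimal, hypothetical sketch (not the PR's actual implementation): inputs whose size varies along a dynamic dimension are copied into a `narrow()` view of a preallocated max-size buffer, so a captured CUDA graph can always replay against the same storage. The function and variable names are illustrative only.

```python
import torch

def copy_to_static_buffer(static_buf: torch.Tensor, src: torch.Tensor, dynamic_dim: int) -> torch.Tensor:
    # Take a view of the static buffer restricted to the current length
    # along the dynamic dimension, then copy the live input into it.
    view = static_buf.narrow(dynamic_dim, 0, src.size(dynamic_dim))
    view.copy_(src)
    return view

buf = torch.zeros(8, 16)  # preallocated for a max length of 8
x = torch.arange(48, dtype=torch.float32).reshape(3, 16)  # current length 3
out = copy_to_static_buffer(buf, x, dynamic_dim=0)
assert out.data_ptr() == buf.data_ptr()  # same storage, no reallocation
```

Because `narrow()` returns a view starting at offset 0, the graph's captured pointers stay valid while only the leading slice is refreshed each step.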
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/Compiler
    participant TCC as TorchCudagraphCompiler
    participant CM as CompileModel
    participant WM as Wrapper Model<br/>(full_model)
    participant IGM as Inner GraphModule
    participant KWC as _capture_inner_kwargs
    participant PCG as PiecewiseCapturedGraph
    User->>TCC: compile(full_model=wrapper_model)
    TCC->>WM: forward(args, **kwargs)
    WM->>IGM: forward(top_level_kwargs)
    KWC->>WM: Hook captures kwargs→IGM
    KWC-->>TCC: Returns inner_kwargs
    TCC->>PCG: __init__(out_spec=..., full_model=...)
    PCG->>PCG: _allocate_static_buffers(inner_kwargs)
    PCG->>PCG: Detect dynamic_dims per bucket
    TCC-->>User: Returns compiled_wrapper
```
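The "Hook captures kwargs→IGM" step above can be sketched with a forward pre-hook. This is a hypothetical toy model, not the PR's code: the wrapper, inner module, and hook names are assumptions used only to show the mechanism.

```python
import torch
import torch.nn as nn

class InnerLM(nn.Module):
    def forward(self, input_ids=None, inputs_embeds=None):
        return inputs_embeds if inputs_embeds is not None else input_ids

class VLMWrapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.language_model = InnerLM()

    def forward(self, input_ids):
        embeds = input_ids.float().unsqueeze(-1)  # stand-in for an embedding step
        return self.language_model(inputs_embeds=embeds)

captured = {}

def _capture_inner_kwargs(module, args, kwargs):
    # Record exactly the kwargs the wrapper passes to the inner module,
    # so static buffers can later be allocated to match them.
    captured.update(kwargs)

model = VLMWrapper()
handle = model.language_model.register_forward_pre_hook(
    _capture_inner_kwargs, with_kwargs=True
)
model(torch.tensor([1, 2, 3]))
handle.remove()
# captured now holds the inner_kwargs seen by the inner module
```

Capturing at the inner module's boundary (rather than the wrapper's) is what lets the compiler target a nested GraphModule while the wrapper still runs eagerly.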
```mermaid
sequenceDiagram
    participant RT as Runtime
    participant PCG as PiecewiseCapturedGraph
    participant SB as Static Buffers
    participant CG as CUDA Graph<br/>(Piecewise)
    RT->>PCG: forward(args, kwargs)
    PCG->>SB: _copy_to_static_buffers(kwargs)
    SB->>SB: Copy with narrow() at dynamic_dim
    PCG->>CG: Replay with sliced inputs
    CG-->>PCG: Output (flat tuple)
    PCG->>PCG: _reconstruct_output(out_spec)
    PCG-->>RT: Structured output
```
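The `_reconstruct_output(out_spec)` step in the runtime diagram follows the usual pytree pattern: flatten the structured output once at capture time to record a spec, then unflatten the graph's flat tensor tuple at replay time. A minimal sketch using `torch.utils._pytree` (the container below is illustrative, not the PR's actual output type):

```python
import torch
from torch.utils._pytree import tree_flatten, tree_unflatten

# Capture time: record the output structure as a spec.
structured = {"logits": torch.randn(2, 4), "aux": (torch.zeros(3),)}
flat, out_spec = tree_flatten(structured)

# Replay time: the captured graph yields only flat tensors...
replayed_flat = [t.clone() for t in flat]
# ...and tree_unflatten restores the structure the caller expects.
rebuilt = tree_unflatten(replayed_flat, out_spec)
assert set(rebuilt.keys()) == {"logits", "aux"}
```

Storing `out_spec` alongside the captured graph is what allows the piecewise graph to return a CUDA-graph-friendly flat tuple internally while callers still see the original structured output.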
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1)
152-153: Consider adding `strict=True` for length validation. Both lists are expected to have the same length by construction, but adding `strict=True` provides an extra safety check against mismatched output specs.

♻️ Suggested refinement

```diff
- for o_buffer, o in zip(self._out_buffer_flat, out_flat):
+ for o_buffer, o in zip(self._out_buffer_flat, out_flat, strict=True):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py` around lines 152-153, the zip over outputs in the loop pairing self._out_buffer_flat and out_flat should use strict=True to assert both sequences have the same length; update the loop in torch_cudagraph.py where you currently iterate "for o_buffer, o in zip(self._out_buffer_flat, out_flat):" to use zip(self._out_buffer_flat, out_flat, strict=True) so mismatched output specs raise an error immediately (locations to check: the symbols self._out_buffer_flat, out_flat, and the loop performing o_buffer.narrow(...).copy_(o)).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 35197296-fa46-4373-aa6e-b898ee524c1d
📒 Files selected for processing (3)
- tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
- tensorrt_llm/_torch/auto_deploy/transform/library/compile_model.py
- tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #42383 [ run ] triggered by Bot. Commit:
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #42391 [ run ] triggered by Bot. Commit:
PR_Github #42383 [ run ] completed with state
PR_Github #42391 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #42430 [ run ] triggered by Bot. Commit:
PR_Github #42430 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #42556 [ run ] triggered by Bot. Commit:
PR_Github #42556 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast
PR_Github #42594 [ run ] triggered by Bot. Commit:
PR_Github #42594 [ run ] completed with state
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast
PR_Github #42717 [ run ] triggered by Bot. Commit:
PR_Github #42717 [ run ] completed with state
Refine #12749, Fix #12699
Summary by CodeRabbit
- Chores
- Tests