[None][feat] VisualGen: enable CUDA graph capture with torch.compile by chang-l · Pull Request #15603 · NVIDIA/TensorRT-LLM

chang-l · 2026-06-24T20:55:42Z

Description

BasePipeline._setup_cuda_graphs() skipped CUDA graph capture entirely whenever
torch.compile was enabled, logging "CUDA graphs with torch.compile not yet
supported. Using torch.compile only." Since torch.compile is on by default for
VisualGen, opting into cuda_graph_config.enable=True alongside it silently did
nothing.

The two actually compose. The CUDA graph runner wraps the outer transformer
forward, while torch.compile compiles the inner transformer blocks (the
per-block path taken by every VisualGen transformer — WAN, FLUX/FLUX2, Cosmos3,
Qwen-Image, LTX-2). Graph capture happens during warmup(), after the runner's
own WARMUP_STEPS eager iterations have already triggered torch.compile's lazy
compilation — so the captured graph contains the optimized compiled kernels.
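
The ordering argument above can be illustrated with a toy model. This is a sketch only: it uses no torch or CUDA, every class and function name in it is hypothetical, and the `WARMUP_STEPS` value is assumed for illustration. It shows only why a capture taken after eager warmup iterations records the already-compiled path and none of the one-time compilation work.

```python
# Toy model only: no torch, no CUDA. All names are hypothetical and do not
# mirror TensorRT-LLM's implementation.

WARMUP_STEPS = 2  # assumed value, for illustration


class LazyCompiledBlock:
    """Stands in for a torch.compile'd block: compiles on its first call."""

    def __init__(self, name):
        self.name = name
        self.compiled = False

    def __call__(self, x, trace):
        if not self.compiled:
            # Lazy compilation is triggered by the first invocation.
            self.compiled = True
            trace.append(f"{self.name}:compile")
        trace.append(f"{self.name}:compiled_kernel")
        return x + 1


def make_forward(blocks):
    """Outer forward that calls each inner block in order."""

    def forward(x, trace):
        for block in blocks:
            x = block(x, trace)
        return x

    return forward


class ToyGraphRunner:
    """Stands in for a graph runner that wraps the outer forward."""

    def __init__(self, forward):
        self.forward = forward
        self.captured = None

    def warmup(self, x):
        # Eager iterations run first, so lazy compilation is finished ...
        for _ in range(WARMUP_STEPS):
            self.forward(x, [])
        # ... before a single forward is recorded as the "graph".
        trace = []
        self.forward(x, trace)
        self.captured = trace


blocks = [LazyCompiledBlock(f"block{i}") for i in range(3)]
runner = ToyGraphRunner(make_forward(blocks))
runner.warmup(0)
print(runner.captured)
```

In this model the recorded trace contains one compiled-kernel entry per block and no compile entries, which is the property the description relies on.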

This is exactly the pattern LTX2Pipeline already implements in its
_setup_cuda_graphs() override; this PR brings the base class in line by removing
the stale early-return.

CUDA graph remains opt-in — CudaGraphConfig.enable still defaults to False,
so the default torch.compile-only path is unchanged. Users now get both
optimizations together when they set cuda_graph_config.enable=True.
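
The behavior change can be summarized as a small decision function. The sketch below models only the observable outcome described above; `ToggleConfig`, `captures_graph_before`, and `captures_graph_after` are hypothetical names, not the actual TensorRT-LLM classes or the real body of `_setup_cuda_graphs()`.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class ToggleConfig:
    """Hypothetical stand-in for a config object with an `enable` flag."""

    enable: bool = False


def captures_graph_before(cuda_graph: ToggleConfig, torch_compile: ToggleConfig) -> bool:
    # Old behavior: enabling torch.compile skipped graph capture entirely.
    if torch_compile.enable:
        return False
    return cuda_graph.enable


def captures_graph_after(cuda_graph: ToggleConfig, torch_compile: ToggleConfig) -> bool:
    # New behavior: capture depends only on the CUDA graph opt-in.
    return cuda_graph.enable


for cg, tc in product([False, True], repeat=2):
    before = captures_graph_before(ToggleConfig(cg), ToggleConfig(tc))
    after = captures_graph_after(ToggleConfig(cg), ToggleConfig(tc))
    print(f"cuda_graph={cg!s:5} torch_compile={tc!s:5} before={before!s:5} after={after}")
```

Only the row with both flags enabled differs between the two functions, which matches the claim that the default compile-only path is unchanged.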

Test Coverage

- Existing VisualGen pipeline tests under tests/unittest/_torch/visual_gen/
  exercise the torch_compile and cuda_graph paths.
- LTX2Pipeline already ships the compile + CUDA-graph composition via its own
  _setup_cuda_graphs() override, validating the ordering this PR adopts in the
  base class.
- Verified end-to-end on a B200 (release:1.3.0rc19 container) with Qwen-Image
  and both cuda_graph_config.enable=True and torch_compile_config.enable=True
  set: the CUDA graph is captured over the torch.compiled blocks, and the
  generated image is byte-identical to the compile-only baseline (matching MD5,
  PSNR ∞). See the verification comment below for logs.

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why.
PR Follows TRT-LLM CODING GUIDELINES to the best of my knowledge.
No API changes (no api-compatible/api-breaking label needed).
No new dependencies.
No CODEOWNERS / tava diagram changes required.
Please check this after reviewing the above items as appropriate for this PR.


🤖 Generated with Claude Code

The base pipeline's _setup_cuda_graphs() skipped CUDA graph capture entirely whenever torch.compile was enabled, logging "CUDA graphs with torch.compile not yet supported." Because torch.compile defaults on, opting into cuda_graph alongside it silently did nothing.

The two compose: the CUDA graph runner wraps the outer transformer forward while torch.compile compiles the inner transformer blocks (the per-block path used by all VisualGen transformers). Graph capture runs during warmup, after the runner's own WARMUP_STEPS eager iterations have already triggered torch.compile's lazy compilation, so the captured graph holds the optimized compiled kernels.

LTX2Pipeline already overrides _setup_cuda_graphs() this way; this brings the base class in line. CUDA graph remains opt-in (CudaGraphConfig.enable defaults to False).

Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>

coderabbitai · 2026-06-24T20:58:26Z

Walkthrough

BasePipeline._setup_cuda_graphs now continues CUDA graph setup when torch_compile.enable is set. The docstring and log message were updated to describe and label the combined CUDA graph and torch.compile path.

Changes

CUDA graph setup with torch.compile

| Layer / File(s) | Summary |
| --- | --- |
| CUDA graph setup and log annotation `tensorrt_llm/_torch/visual_gen/pipeline.py` | The method no longer returns early when `torch_compile.enable` is set, and the CUDA graph runner wrapping log includes a compile note when that mode is enabled. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title matches the PR’s main change and follows the required [None][feat] format. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description includes the required sections, clearly explains the change and test coverage, and aligns with the repository template. |


chang-l · 2026-06-24T23:54:12Z

Verified on B200 (Qwen-Image)

Ran on a B200 in the nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc19 container with the real Qwen-Image checkpoint (512×512, 8 steps), applying this PR's change to the installed package. rc19's _setup_cuda_graphs is identical to this PR's base, so it's a faithful test.

Stock rc19 — both cuda_graph + torch_compile enabled: logs "CUDA graphs with torch.compile not yet supported. Using torch.compile only." → CUDA graph silently skipped (compile-only).

With this PR — both enabled:

CUDA graph runner: wrapping PipelineComponent.TRANSFORMER.forward
torch.compile: ...transformer_blocks (60 blocks, mode=default)
Capturing graph for key: (...) during warmup and at the generate shape

No errors / no illegal-memory access. The generated image is byte-identical to the compile-only baseline (matching MD5, PSNR ∞, max pixel diff 0) — CUDA-graph replay of the torch.compiled kernels is numerically exact.
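
A byte-identity check of the kind reported above (matching MD5, PSNR ∞, max pixel diff 0) could be scripted roughly as follows. This is a minimal sketch, not the script used for the verification: it operates on raw 8-bit pixel buffers, the helper names are hypothetical, and the buffers are synthetic placeholders, not real generated images.

```python
import hashlib
import math


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a raw byte buffer."""
    return hashlib.md5(data).hexdigest()


def max_pixel_diff(a: bytes, b: bytes) -> int:
    """Largest absolute per-byte difference between two equal-length buffers."""
    if len(a) != len(b) or not a:
        raise ValueError("buffers must be non-empty and the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def psnr(a: bytes, b: bytes, max_value: int = 255) -> float:
    """Peak signal-to-noise ratio in dB; infinite when the buffers match."""
    if len(a) != len(b) or not a:
        raise ValueError("buffers must be non-empty and the same length")
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)


# Synthetic stand-ins for the decoded pixels of two generated images.
baseline = bytes(range(256)) * 4
candidate = bytes(baseline)

print("md5 match:", md5_hex(baseline) == md5_hex(candidate))
print("max pixel diff:", max_pixel_diff(baseline, candidate))
print("psnr:", psnr(baseline, candidate))
```

Comparing decoded pixels as well as file hashes is a deliberate choice in this sketch: a hash mismatch alone would not say how large the difference is, while PSNR and max pixel diff do.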

This is the same ordering LTX2Pipeline._setup_cuda_graphs() already relies on.

chang-l · 2026-06-25T00:08:30Z

/bot run

tensorrt-cicd · 2026-06-25T00:15:10Z

PR_Github #55622 [ run ] triggered by Bot. Commit: 78bf3a8 Link to invocation

tensorrt-cicd · 2026-06-25T02:23:50Z

PR_Github #55622 [ run ] completed with state FAILURE. Commit: 78bf3a8
/LLM/main/L0_MergeRequest_PR pipeline #44538 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

chang-l · 2026-06-25T02:45:00Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T02:51:22Z

PR_Github #55675 [ run ] triggered by Bot. Commit: 78bf3a8 Link to invocation

tensorrt-cicd · 2026-06-25T10:09:33Z

PR_Github #55675 [ run ] completed with state SUCCESS. Commit: 78bf3a8
/LLM/main/L0_MergeRequest_PR pipeline #44581 completed with status: 'FAILURE'


chang-l · 2026-06-25T10:15:05Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T10:21:27Z

PR_Github #55759 [ run ] triggered by Bot. Commit: 78bf3a8 Link to invocation

chang-l requested a review from a team as a code owner June 24, 2026 20:55

github-actions Bot assigned chang-l Jun 24, 2026

chang-l requested review from NVShreyas and luyiyun1021 June 24, 2026 22:06

luyiyun1021 approved these changes Jun 25, 2026

chang-l wants to merge 1 commit into NVIDIA:main from chang-l:visgen-cudagraph-with-torch-compile


3 participants
