[None][feat] VisualGen: enable CUDA graph capture with torch.compile#15603

Open
chang-l wants to merge 1 commit into
NVIDIA:mainfrom
chang-l:visgen-cudagraph-with-torch-compile

chang-l commented Jun 24, 2026

Collaborator

Description

BasePipeline._setup_cuda_graphs() skipped CUDA graph capture entirely whenever
torch.compile was enabled, logging "CUDA graphs with torch.compile not yet
supported. Using torch.compile only." Since torch.compile is on by default for
VisualGen, opting into cuda_graph_config.enable=True alongside it silently did
nothing.

The two actually compose. The CUDA graph runner wraps the outer transformer
forward, while torch.compile compiles the inner transformer blocks (the
per-block path taken by every VisualGen transformer — WAN, FLUX/FLUX2, Cosmos3,
Qwen-Image, LTX-2). Graph capture happens during warmup(), after the runner's
own WARMUP_STEPS eager iterations have already triggered torch.compile's lazy
compilation — so the captured graph contains the optimized compiled kernels.
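
The ordering argument above can be illustrated with a toy model in plain Python. This is only a sketch of the sequencing, not the real runner: it uses no CUDA and no torch.compile, and `LazyCompiledBlock`, `ToyGraphRunner`, and the value of `WARMUP_STEPS` are hypothetical stand-ins invented for illustration.

```python
from typing import Callable, List, Optional

WARMUP_STEPS = 2  # assumed value; the PR names the constant but not its value


class LazyCompiledBlock:
    """Stand-in for a lazily compiled transformer block (hypothetical)."""

    def __init__(self) -> None:
        self.compiled = False

    def __call__(self, x: int, trace: List[str]) -> int:
        if not self.compiled:
            # The first call triggers compilation, as lazy compilation does.
            trace.append("compile")
            self.compiled = True
        trace.append("compiled_kernel")
        return x + 1


class ToyGraphRunner:
    """Stand-in for a graph runner wrapping the outer forward (hypothetical)."""

    def __init__(self, forward: Callable[[int, List[str]], int]) -> None:
        self._forward = forward
        self.captured: Optional[List[str]] = None

    def warmup(self, x: int) -> None:
        # Eager iterations run first and absorb the lazy compilation.
        for _ in range(WARMUP_STEPS):
            self._forward(x, [])
        # "Capture" is modeled as recording what one more forward launches.
        trace: List[str] = []
        self._forward(x, trace)
        self.captured = trace


blocks = [LazyCompiledBlock() for _ in range(3)]


def transformer_forward(x: int, trace: List[str]) -> int:
    # Outer forward: the unit the runner wraps; inner blocks are "compiled".
    for block in blocks:
        x = block(x, trace)
    return x


runner = ToyGraphRunner(transformer_forward)
runner.warmup(0)
captured = runner.captured
print(captured)
```

In this model the captured trace holds only already-compiled work, because every block was compiled during the eager warmup iterations that precede the capture.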

This is exactly the pattern LTX2Pipeline already implements in its
_setup_cuda_graphs() override; this PR brings the base class in line by removing
the stale early-return.

CUDA graph remains opt-in: CudaGraphConfig.enable still defaults to False,
so the default torch.compile-only path is unchanged. Users now get both
optimizations together when they set cuda_graph_config.enable=True.
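
For reviewers, the behavioral difference can be sketched as a small decision function. This is a simplified model, not the code in pipeline.py: `ToggleConfig`, `ToyPipelineConfig`, and both `should_wrap_*` functions are hypothetical names, and the real method does more than return a boolean.

```python
from dataclasses import dataclass, field


@dataclass
class ToggleConfig:
    enable: bool = False


@dataclass
class ToyPipelineConfig:
    # Defaults mirror the description: torch.compile on, CUDA graph opt-in.
    torch_compile: ToggleConfig = field(
        default_factory=lambda: ToggleConfig(enable=True))
    cuda_graph: ToggleConfig = field(default_factory=ToggleConfig)


def should_wrap_old(cfg: ToyPipelineConfig) -> bool:
    """Model of the old behavior: compile being enabled vetoes CUDA graphs."""
    if not cfg.cuda_graph.enable:
        return False
    if cfg.torch_compile.enable:
        return False  # the stale early return
    return True


def should_wrap_new(cfg: ToyPipelineConfig) -> bool:
    """Model of the new behavior: only the CUDA graph flag decides."""
    return cfg.cuda_graph.enable


default_cfg = ToyPipelineConfig()
both_cfg = ToyPipelineConfig(cuda_graph=ToggleConfig(enable=True))

print(should_wrap_old(default_cfg), should_wrap_new(default_cfg))
print(should_wrap_old(both_cfg), should_wrap_new(both_cfg))
```

With the defaults, both models leave the forward unwrapped, so the compile-only path is unchanged. Only the combination with both flags enabled differs.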

Test Coverage

  • Existing VisualGen pipeline tests under tests/unittest/_torch/visual_gen/
    exercise the torch_compile and cuda_graph paths.
  • LTX2Pipeline already ships the compile + CUDA-graph composition via its own
    _setup_cuda_graphs() override, validating the ordering this PR adopts in the
    base class.
  • Verified end-to-end on a B200 (release:1.3.0rc19 container) with Qwen-Image,
    both cuda_graph_config.enable=True and torch_compile_config.enable=True:
    the CUDA graph is captured over the torch.compiled blocks, and the generated
    image is byte-identical to the compile-only baseline (matching MD5, PSNR ∞).
    See the verification comment below for logs.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of my knowledge.

  • No API changes (no api-compatible/api-breaking label needed).

  • No new dependencies.

  • No CODEOWNERS / tava diagram changes required.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

🤖 Generated with Claude Code

The base pipeline's _setup_cuda_graphs() skipped CUDA graph capture
entirely whenever torch.compile was enabled, logging "CUDA graphs with
torch.compile not yet supported." Because torch.compile defaults on,
opting into cuda_graph alongside it silently did nothing.

The two compose: the CUDA graph runner wraps the outer transformer
forward while torch.compile compiles the inner transformer blocks (the
per-block path used by all VisualGen transformers). Graph capture runs
during warmup, after the runner's own WARMUP_STEPS eager iterations have
already triggered torch.compile's lazy compilation, so the captured
graph holds the optimized compiled kernels. LTX2Pipeline already
overrides _setup_cuda_graphs() this way; this brings the base class in
line.

CUDA graph remains opt-in (CudaGraphConfig.enable defaults to False).

Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
@chang-l chang-l requested a review from a team as a code owner June 24, 2026 20:55
coderabbitai Bot commented Jun 24, 2026

Contributor

Walkthrough

BasePipeline._setup_cuda_graphs now continues CUDA graph setup when torch_compile.enable is set. The docstring and log message were updated to describe and label the combined CUDA graph and torch.compile path.

Changes

CUDA graph setup with torch.compile

| Layer / File(s) | Summary |
|---|---|
| CUDA graph setup and log annotation (tensorrt_llm/_torch/visual_gen/pipeline.py) | The method no longer returns early when torch_compile.enable is set, and the CUDA graph runner wrapping log includes a compile note when that mode is enabled. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title matches the PR’s main change and follows the required [None][feat] format. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description includes the required sections, clearly explains the change and test coverage, and aligns with the repository template. |


@chang-l chang-l requested review from NVShreyas and luyiyun1021 June 24, 2026 22:06
chang-l commented Jun 24, 2026

Collaborator Author

Verified on B200 (Qwen-Image)

Ran on a B200 in the nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc19 container with the real Qwen-Image checkpoint (512×512, 8 steps), applying this PR's change to the installed package. rc19's _setup_cuda_graphs is identical to this PR's base, so it's a faithful test.

Stock rc19 — both cuda_graph + torch_compile enabled: logs "CUDA graphs with torch.compile not yet supported. Using torch.compile only." → CUDA graph silently skipped (compile-only).

With this PR — both enabled:

  • CUDA graph runner: wrapping PipelineComponent.TRANSFORMER.forward
  • torch.compile: ...transformer_blocks (60 blocks, mode=default)
  • Capturing graph for key: (...) during warmup and at the generate shape

No errors / no illegal-memory access. The generated image is byte-identical to the compile-only baseline (matching MD5, PSNR ∞, max pixel diff 0) — CUDA-graph replay of the torch.compiled kernels is numerically exact.

This is the same ordering LTX2Pipeline._setup_cuda_graphs() already relies on.
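
For anyone reproducing the comparison, the three metrics quoted above (MD5, PSNR, max pixel diff) can be computed on raw 8-bit pixel buffers with the standard library alone. This is a minimal sketch, not the script used for the run above; the function names and the synthetic buffers are illustrative assumptions.

```python
import hashlib
import math


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a raw byte buffer."""
    return hashlib.md5(data).hexdigest()


def max_abs_diff(a: bytes, b: bytes) -> int:
    """Largest per-byte absolute difference between two equal-length buffers."""
    if len(a) != len(b) or not a:
        raise ValueError("buffers must be non-empty and of equal length")
    return max(abs(x - y) for x, y in zip(a, b))


def psnr(a: bytes, b: bytes, peak: int = 255) -> float:
    """PSNR in dB for 8-bit data; infinite when the buffers are identical."""
    if len(a) != len(b) or not a:
        raise ValueError("buffers must be non-empty and of equal length")
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


# Synthetic stand-ins for decoded image bytes (not real model output).
baseline = bytes(range(256))
candidate = bytes(baseline)            # identical copy
perturbed = bytes([1]) + baseline[1:]  # first byte changed from 0 to 1

same_md5 = md5_hex(baseline) == md5_hex(candidate)
print(same_md5, psnr(baseline, candidate), max_abs_diff(baseline, candidate))
print(psnr(baseline, perturbed), max_abs_diff(baseline, perturbed))
```

Matching MD5 on the encoded file already implies the other two; PSNR and max diff are mainly useful when the files differ and the size of the deviation matters.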

chang-l commented Jun 25, 2026

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #55622 [ run ] triggered by Bot. Commit: 78bf3a8 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55622 [ run ] completed with state FAILURE. Commit: 78bf3a8
/LLM/main/L0_MergeRequest_PR pipeline #44538 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

chang-l commented Jun 25, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55675 [ run ] triggered by Bot. Commit: 78bf3a8 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55675 [ run ] completed with state SUCCESS. Commit: 78bf3a8
/LLM/main/L0_MergeRequest_PR pipeline #44581 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

chang-l commented Jun 25, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55759 [ run ] triggered by Bot. Commit: 78bf3a8 Link to invocation
