Skip to content

fix(pt): replace in-place nlist masking to enable CUDA graph capture#5433

Merged
njzjz merged 1 commit into
deepmodeling:masterfrom
wanghan-iapcm:fix-pt-cuda-graph-nlist-mask
May 7, 2026
Merged

fix(pt): replace in-place nlist masking to enable CUDA graph capture#5433
njzjz merged 1 commit into
deepmodeling:masterfrom
wanghan-iapcm:fix-pt-cuda-graph-nlist-mask

Conversation

@wanghan-iapcm
Copy link
Copy Markdown
Collaborator

@wanghan-iapcm wanghan-iapcm commented May 6, 2026

Summary

  • Replace nlist[nlist == -1] = 0 in DescrptBlockRepformers.forward (DPA-2) and DescrptBlockRepflows.forward (DPA-3, two occurrences) with nlist = torch.where(nlist == -1, 0, nlist).
  • TorchScript was lowering the indexed-assignment into an index_put_ whose value was torch.tensor(0, device=...) — a per-call scalar allocation that triggers a CPU↔GPU sync and is forbidden during CUDA stream capture (cudaErrorStreamCaptureUnsupported). This made forward_lower of DPA-2 / DPA-3 models non-capturable from the LAMMPS C++ plugin, blocking the ~8–12× CUDA-graph speedup reported in [Feature Request] torch.tensor() inside repformers.forward() prevents CUDA graph capture for DPA-2 models #5432.
  • aten::where with a Python scalar embeds the literal in the kernel without allocating a device tensor, so the resulting frozen IR is graph-capturable. The pattern matches what se_atten.py and se_t_tebd.py already use.

Notes

  • nlist / a_nlist were already rebound to local copies via torch.where(exclude_mask != 0, ...) earlier in both functions, so dropping the in-place mutation has no observable effect on callers.
  • Existing deployed .pth files won't benefit until re-exported (their TorchScript IR is frozen).
  • The reporter also flagged that the comm_dict ghost-atom path in forward_lower may still contain other CPU↔GPU sync points needed for full LAMMPS capture; that audit is out of scope here.

Closes #5432.

Test plan

  • pytest source/tests/pt/model/test_dpa3.py — passes (exercises torch.jit.script on the DPA-3 descriptor → covers patched repflows.py lines)
  • pytest source/tests/pt/model/test_dpa2.py — passes (numerical consistency of DPA-2 descriptor → covers patched repformers.py line)
  • Inline torch.jit.script(DescrptDPA2(...)) succeeds (verifies new torch.where(... == -1, 0, nlist) form scripts cleanly)
  • CI: full test_jit.py and consistency suites

Summary by CodeRabbit

  • Refactor
    • Optimized internal tensor operations in descriptor modules for improved performance and consistency.

`nlist[nlist == -1] = 0` in DescrptBlockRepformers.forward and
DescrptBlockRepflows.forward is lowered by TorchScript into an
`index_put_` whose value is built by `torch.tensor(0, device=...)` on
every forward pass. The per-call scalar allocation triggers a CPU->GPU
sync, which is forbidden during CUDA stream capture
(`cudaErrorStreamCaptureUnsupported`), making `forward_lower` of DPA-2
and DPA-3 models non-capturable from the LAMMPS C++ plugin.

Switch to `nlist = torch.where(nlist == -1, 0, nlist)`, matching the
pattern already used in se_atten.py and se_t_tebd.py. `aten::where`
with a Python scalar takes the literal through the kernel without
allocating a device tensor, so the resulting frozen IR is CUDA-graph
capturable. `nlist`/`a_nlist` were already rebound to local copies
earlier in both functions, so dropping the in-place mutation has no
observable effect on callers.

Closes deepmodeling#5432.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 6, 2026

📝 Walkthrough

Walkthrough

Padding sentinel handling in RepFlow and Repformers descriptor forward methods is refactored from in-place index assignment to vectorized torch.where operations to replace -1 padding indices with 0, addressing CUDA graph capture incompatibility.

Changes

Padding Sentinel Replacement in Descriptors

Layer / File(s) Summary
Core Implementation
deepmd/pt/model/descriptor/repflows.py, deepmd/pt/model/descriptor/repformers.py
Padding indices marked with -1 are now replaced with 0 using torch.where(nlist == -1, 0, nlist) instead of in-place assignment (nlist[nlist == -1] = 0) or torch.index_put_() with dynamically allocated tensors. Applied to both nlist and a_nlist in repflows, and nlist in repformers.

Sequence Diagram

sequenceDiagram
    participant Caller as DPA-2 Model
    participant Descriptor as Descriptor Forward
    participant OldPath as Old: torch.tensor() + index_put_()
    participant NewPath as New: torch.where()
    participant CUDA as CUDA Graph Capture

    Caller->>Descriptor: forward_lower()
    Descriptor->>NewPath: Process nlist with torch.where()
    NewPath->>NewPath: Replace -1 → 0 (no dynamic allocation)
    NewPath-->>CUDA: Compatible with stream capture
    CUDA-->>Caller: Graph cached & replayed

    Note over OldPath: Previous: torch.tensor(0, device='cuda')<br/>caused cudaErrorStreamCaptureUnsupported<br/>on every forward pass
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: replacing in-place nlist masking with torch.where to enable CUDA graph capture. It is concise, specific, and directly related to the primary objective of the changeset.
Linked Issues check ✅ Passed The PR replaces in-place indexed-assignment patterns with torch.where in both repformers.py and repflows.py as required by issue #5432, removing dynamic GPU scalar allocations to enable CUDA graph capture.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the CUDA graph capture issue by replacing in-place masking in descriptor modules; no unrelated modifications are present.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
deepmd/pt/model/descriptor/repflows.py (1)

619-637: Remaining per-call torch.tensor() allocations in the comm_dict path.

The border_op call sites (lines 627–636) still allocate new CPU tensors (torch.tensor(real_nloc, dtype=torch.int32, device="cpu") and torch.tensor(real_nall - real_nloc, ...)) on every forward pass. These live on CPU, so they don't produce cudaErrorStreamCaptureUnsupported today, but they are still per-call host allocations inside the training loop. If CUDA graph capture of the parallel/comm_dict path is ever needed, these will be the next sync point to address (e.g., pre-allocating a torch.zeros(1, dtype=torch.int32) buffer and filling it once, or passing Python ints directly if the op accepts them).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 2fd617c5-c738-4cdc-be39-55e791b6e3d9

📥 Commits

Reviewing files that changed from the base of the PR and between 0a481de and 0e3fcc6.

📒 Files selected for processing (2)
  • deepmd/pt/model/descriptor/repflows.py
  • deepmd/pt/model/descriptor/repformers.py

@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.47%. Comparing base (0a481de) to head (0e3fcc6).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5433   +/-   ##
=======================================
  Coverage   82.47%   82.47%           
=======================================
  Files         825      825           
  Lines       87721    87721           
  Branches     4206     4207    +1     
=======================================
  Hits        72344    72344           
- Misses      14093    14094    +1     
+ Partials     1284     1283    -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@njzjz-bot njzjz-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I checked the diff: this only replaces in-place boolean-index assignment on nlist/a_nlist with torch.where, which avoids mutating tensors during CUDA graph capture while preserving the effective masking behavior before downstream descriptor use. No compatibility concern from my side.

Note: I could not run local uv run ruff because the current uv dependency resolution fails on mutually pinned torch CPU/GPU extras in this environment, so this approval is based on diff inspection and the PR CI status.

Reviewed by OpenClaw 2026.4.22 (00bd2cf) (model: gpt-5.5).
Authored by OpenClaw (model: gpt-5.5)

@njzjz njzjz added this pull request to the merge queue May 7, 2026
@ganfisher
Copy link
Copy Markdown

I wonder whether this bug is related to #5087, since I encountered the same problem with DPA3

Merged via the queue into deepmodeling:master with commit 5bd0889 May 7, 2026
73 checks passed
@d1l-netizen
Copy link
Copy Markdown

Thank you for the suggested changes! I applied the patches to repformers.py and reflow.py,
re-froze the model with a fixed nsel for static tensor shapes, and ran a LAMMPS smoke test.

The CUDA graph model loads and runs correctly — libcudart.so.12 loads successfully and the run
completes without errors.

Performance on a small test system (~700 atoms, L40S GPU):

  • Baseline (eager): ~10.7 steps/s
  • CUDA graph: ~10.4 steps/s (~0.97×, essentially no change)

No speedup at this system size, which is expected — at ~700 atoms the DPA-2 attention kernels are
short enough that CUDA graph launch-overhead savings are negligible. I'll retest at a larger
system size (~5k–10k atoms) where the crossover should occur.

One note for others: verify nsel matches your actual neighbour count at your system density
before re-freezing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature Request] torch.tensor() inside repformers.forward() prevents CUDA graph capture for DPA-2 models

5 participants