fix(pt): replace in-place nlist masking to enable CUDA graph capture by wanghan-iapcm · Pull Request #5433 · deepmodeling/deepmd-kit

wanghan-iapcm · 2026-05-06T10:36:10Z

Summary

Replace nlist[nlist == -1] = 0 in DescrptBlockRepformers.forward (DPA-2) and DescrptBlockRepflows.forward (DPA-3, two occurrences) with nlist = torch.where(nlist == -1, 0, nlist).
TorchScript was lowering the indexed-assignment into an index_put_ whose value was torch.tensor(0, device=...) — a per-call scalar allocation that triggers a CPU↔GPU sync and is forbidden during CUDA stream capture (cudaErrorStreamCaptureUnsupported). This made forward_lower of DPA-2 / DPA-3 models non-capturable from the LAMMPS C++ plugin, blocking the ~8–12× CUDA-graph speedup reported in [Feature Request] torch.tensor() inside repformers.forward() prevents CUDA graph capture for DPA-2 models #5432.
aten::where with a Python scalar embeds the literal in the kernel without allocating a device tensor, so the resulting frozen IR is graph-capturable. The pattern matches what se_atten.py and se_t_tebd.py already use.

Notes

nlist / a_nlist were already rebound to local copies via torch.where(exclude_mask != 0, ...) earlier in both functions, so dropping the in-place mutation has no observable effect on callers.
Existing deployed .pth files won't benefit until re-exported (their TorchScript IR is frozen).
The reporter also flagged that the comm_dict ghost-atom path in forward_lower may still contain other CPU↔GPU sync points needed for full LAMMPS capture; that audit is out of scope here.

Closes #5432.

Test plan

pytest source/tests/pt/model/test_dpa3.py — passes (exercises torch.jit.script on the DPA-3 descriptor → covers patched repflows.py lines)
pytest source/tests/pt/model/test_dpa2.py — passes (numerical consistency of DPA-2 descriptor → covers patched repformers.py line)
Inline torch.jit.script(DescrptDPA2(...)) succeeds (verifies new torch.where(... == -1, 0, nlist) form scripts cleanly)
CI: full test_jit.py and consistency suites

Summary by CodeRabbit

Refactor
- Optimized internal tensor operations in descriptor modules for improved performance and consistency.

`nlist[nlist == -1] = 0` in DescrptBlockRepformers.forward and DescrptBlockRepflows.forward is lowered by TorchScript into an `index_put_` whose value is built by `torch.tensor(0, device=...)` on every forward pass. The per-call scalar allocation triggers a CPU->GPU sync, which is forbidden during CUDA stream capture (`cudaErrorStreamCaptureUnsupported`), making `forward_lower` of DPA-2 and DPA-3 models non-capturable from the LAMMPS C++ plugin. Switch to `nlist = torch.where(nlist == -1, 0, nlist)`, matching the pattern already used in se_atten.py and se_t_tebd.py. `aten::where` with a Python scalar takes the literal through the kernel without allocating a device tensor, so the resulting frozen IR is CUDA-graph capturable. `nlist`/`a_nlist` were already rebound to local copies earlier in both functions, so dropping the in-place mutation has no observable effect on callers. Closes deepmodeling#5432.

coderabbitai · 2026-05-06T10:43:50Z

📝 Walkthrough

Walkthrough

Padding sentinel handling in RepFlow and Repformers descriptor forward methods is refactored from in-place index assignment to vectorized torch.where operations to replace -1 padding indices with 0, addressing CUDA graph capture incompatibility.

Changes

Padding Sentinel Replacement in Descriptors

Layer / File(s)	Summary
Core Implementation `deepmd/pt/model/descriptor/repflows.py`, `deepmd/pt/model/descriptor/repformers.py`	Padding indices marked with -1 are now replaced with 0 using `torch.where(nlist == -1, 0, nlist)` instead of in-place assignment (`nlist[nlist == -1] = 0`) or `torch.index_put_()` with dynamically allocated tensors. Applied to both `nlist` and `a_nlist` in repflows, and `nlist` in repformers.

Sequence Diagram

sequenceDiagram
    participant Caller as DPA-2 Model
    participant Descriptor as Descriptor Forward
    participant OldPath as Old: torch.tensor() + index_put_()
    participant NewPath as New: torch.where()
    participant CUDA as CUDA Graph Capture

    Caller->>Descriptor: forward_lower()
    Descriptor->>NewPath: Process nlist with torch.where()
    NewPath->>NewPath: Replace -1 → 0 (no dynamic allocation)
    NewPath-->>CUDA: Compatible with stream capture
    CUDA-->>Caller: Graph cached & replayed

    Note over OldPath: Previous: torch.tensor(0, device='cuda')<br/>caused cudaErrorStreamCaptureUnsupported<br/>on every forward pass

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: replacing in-place nlist masking with torch.where to enable CUDA graph capture. It is concise, specific, and directly related to the primary objective of the changeset.
Linked Issues check	✅ Passed	The PR replaces in-place indexed-assignment patterns with torch.where in both repformers.py and repflows.py as required by issue `#5432`, removing dynamic GPU scalar allocations to enable CUDA graph capture.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to fixing the CUDA graph capture issue by replacing in-place masking in descriptor modules; no unrelated modifications are present.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

deepmd/pt/model/descriptor/repflows.py (1)

619-637: Remaining per-call torch.tensor() allocations in the comm_dict path.

The border_op call sites (lines 627–636) still allocate new CPU tensors (torch.tensor(real_nloc, dtype=torch.int32, device="cpu") and torch.tensor(real_nall - real_nloc, ...)) on every forward pass. These live on CPU, so they don't produce cudaErrorStreamCaptureUnsupported today, but they are still per-call host allocations inside the training loop. If CUDA graph capture of the parallel/comm_dict path is ever needed, these will be the next sync point to address (e.g., pre-allocating a torch.zeros(1, dtype=torch.int32) buffer and filling it once, or passing Python ints directly if the op accepts them).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 2fd617c5-c738-4cdc-be39-55e791b6e3d9

📥 Commits

Reviewing files that changed from the base of the PR and between 0a481de and 0e3fcc6.

📒 Files selected for processing (2)

deepmd/pt/model/descriptor/repflows.py
deepmd/pt/model/descriptor/repformers.py

codecov · 2026-05-06T11:31:27Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.47%. Comparing base (0a481de) to head (0e3fcc6).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #5433   +/-   ##
=======================================
  Coverage   82.47%   82.47%           
=======================================
  Files         825      825           
  Lines       87721    87721           
  Branches     4206     4207    +1     
=======================================
  Hits        72344    72344           
- Misses      14093    14094    +1     
+ Partials     1284     1283    -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

njzjz-bot

LGTM. I checked the diff: this only replaces in-place boolean-index assignment on nlist/a_nlist with torch.where, which avoids mutating tensors during CUDA graph capture while preserving the effective masking behavior before downstream descriptor use. No compatibility concern from my side.

Note: I could not run local uv run ruff because the current uv dependency resolution fails on mutually pinned torch CPU/GPU extras in this environment, so this approval is based on diff inspection and the PR CI status.

Reviewed by OpenClaw 2026.4.22 (00bd2cf) (model: gpt-5.5).
Authored by OpenClaw (model: gpt-5.5)

ganfisher · 2026-05-07T06:06:03Z

I wonder whether this bug is related to #5087, since I encountered the same problem with DPA3

d1l-netizen · 2026-05-07T15:34:26Z

Thank you for the suggested changes! I applied the patches to repformers.py and reflow.py,
re-froze the model with a fixed nsel for static tensor shapes, and ran a LAMMPS smoke test.

The CUDA graph model loads and runs correctly — libcudart.so.12 loads successfully and the run
completes without errors.

Performance on a small test system (~700 atoms, L40S GPU):

Baseline (eager): ~10.7 steps/s
CUDA graph: ~10.4 steps/s (~0.97×, essentially no change)

No speedup at this system size, which is expected — at ~700 atoms the DPA-2 attention kernels are
short enough that CUDA graph launch-overhead savings are negligible. I'll retest at a larger
system size (~5k–10k atoms) where the crossover should occur.

One note for others: verify nsel matches your actual neighbour count at your system density
before re-freezing.

github-actions Bot added the Python label May 6, 2026

dosubot Bot added the bug label May 6, 2026

wanghan-iapcm requested a review from njzjz May 6, 2026 10:37

wanghan-iapcm mentioned this pull request May 6, 2026

[Feature Request] torch.tensor() inside repformers.forward() prevents CUDA graph capture for DPA-2 models #5432

Closed

coderabbitai Bot reviewed May 6, 2026

View reviewed changes

njzjz-bot approved these changes May 6, 2026

View reviewed changes

njzjz approved these changes May 7, 2026

View reviewed changes

njzjz added this pull request to the merge queue May 7, 2026

Merged via the queue into deepmodeling:master with commit 5bd0889 May 7, 2026
73 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(pt): replace in-place nlist masking to enable CUDA graph capture#5433

fix(pt): replace in-place nlist masking to enable CUDA graph capture#5433
njzjz merged 1 commit into
deepmodeling:masterfrom
wanghan-iapcm:fix-pt-cuda-graph-nlist-mask

wanghan-iapcm commented May 6, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 6, 2026

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

codecov Bot commented May 6, 2026 •

edited

Loading

Uh oh!

njzjz-bot left a comment

Uh oh!

ganfisher commented May 7, 2026

Uh oh!

Uh oh!

d1l-netizen commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

wanghan-iapcm commented May 6, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Notes

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 6, 2026

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

njzjz-bot left a comment

Choose a reason for hiding this comment

Uh oh!

ganfisher commented May 7, 2026

Uh oh!

Uh oh!

d1l-netizen commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

wanghan-iapcm commented May 6, 2026 •

edited by coderabbitai Bot

Loading

codecov Bot commented May 6, 2026 •

edited

Loading