[Feature Request] torch.tensor() inside repformers.forward() prevents CUDA graph capture for DPA-2 models #5432

@d1l-netizen

Summary

CUDA graph capture of forward_lower fails for DPA-2 models because repformers.forward() calls torch.tensor(0, dtype=..., device=...) with a CUDA device on every forward pass. Building a GPU tensor from a Python scalar at runtime needs a fresh allocation and a host-to-device copy, which is not permitted while a stream is capturing (cudaErrorStreamCaptureUnsupported). As a result, CUDA graph optimisation, which removes per-kernel-launch overhead and could give an estimated ~8–12× speedup on the GPU inference portion, is unavailable for any DPA-2 model loaded via the LAMMPS C++ plugin.

Detailed Description

Exact error

When attempting to capture forward_lower on a dedicated non-default CUDA stream from DeepPotPT.cc:

RuntimeError: CUDA error: operation not permitted when stream is capturing
Search for `cudaErrorStreamCaptureUnsupported'

TorchScript traceback:
  File "repformers.py", line 151, in forward
    _13 = torch.tensor(0, dtype=ops.prim.dtype(nlist0),
                          device=ops.prim.device(nlist0))
          ~~~~~~~~~~~~ <--- HERE
    _14 = annotate(List[Optional[Tensor]], [_12])
    _15 = torch.index_put_(nlist0, _14, _13)

Root cause

In deepmd/pt/model/descriptor/repformers.py:

# Runs on every forward pass — allocates a new GPU tensor each call:
zero = torch.tensor(0, dtype=nlist.dtype, device=nlist.device)
nlist = torch.index_put_(nlist, [nlist == -1], zero)

torch.tensor(scalar, device='cuda') builds the value on the host, allocates a new GPU tensor, and copies the value across. That host-to-device transfer synchronises the CPU with the GPU, which is explicitly disallowed during CUDA graph capture. Because the call is baked into the frozen TorchScript IR of the .pth file, it cannot be patched from the C++ plugin side.


Proposed fix

Replace the dynamic allocation with a persistent module buffer registered at construction:

# In Repformers.__init__:
self.register_buffer('_zero_idx', torch.zeros(1, dtype=torch.long))

# In forward, replace:
#   zero = torch.tensor(0, dtype=nlist.dtype, device=nlist.device)
# with the pre-registered buffer (no host-to-device copy;
# .to() returns the buffer itself when the dtype already matches):
zero = self._zero_idx.to(dtype=nlist.dtype)
nlist = torch.index_put_(nlist, [nlist == -1], zero)

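register_buffer creates the tensor once at construction, and it moves to the GPU with the rest of the module, so forward no longer builds a tensor from a host scalar on every call. With that removed, the index_put_ call should no longer trip this capture error.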
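For illustration, here is a minimal sketch of the buffer approach on a toy module, next to a masked_fill variant that needs no helper tensor at all. The class name BufferZeroFill and the function fill_padding_masked are hypothetical and not part of deepmd-kit; persistent=False is my own assumption, chosen so the extra buffer stays out of state_dict. Boolean-mask index_put_ on CUDA may itself expand the mask through a synchronising nonzero in some PyTorch versions, so the masked_fill form may be worth testing under capture as well.

```python
import torch


class BufferZeroFill(torch.nn.Module):
    """Toy stand-in for the padding fix-up in Repformers.forward (hypothetical)."""

    def __init__(self) -> None:
        super().__init__()
        # Registered once; follows the module when it is moved to a device.
        # persistent=False (an assumption) keeps it out of state_dict.
        self.register_buffer(
            "_zero_idx", torch.zeros(1, dtype=torch.long), persistent=False
        )

    def forward(self, nlist: torch.Tensor) -> torch.Tensor:
        # No tensor is built from a host scalar here.
        zero = self._zero_idx.to(dtype=nlist.dtype)
        # Clone so the toy example does not mutate its input;
        # the original code updates nlist in place.
        return nlist.clone().index_put_((nlist == -1,), zero)


def fill_padding_masked(nlist: torch.Tensor) -> torch.Tensor:
    """Alternative: replace -1 padding with 0 using a scalar fill value."""
    return nlist.masked_fill(nlist == -1, 0)


if __name__ == "__main__":
    nlist = torch.tensor([[0, 2, -1], [1, -1, -1]])
    print(BufferZeroFill()(nlist))
    print(fill_padding_masked(nlist))
```

Both variants give the same result on CPU; whether either one captures cleanly inside the frozen DPA-2 model still needs to be confirmed on a GPU.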

The same pattern should be audited across all descriptor layers for any other torch.tensor(scalar, device='cuda') calls that run inside forward.


Impact

CUDA graphs are a standard PyTorch optimisation for repeated fixed-topology inference. Removing this incompatibility would allow:

  • Capture of forward_lower in DeepPotPT.cc on a dedicated stream
  • Replay with only a coordinate copy_ per step instead of full kernel re-launch
  • Estimated ~8–12× reduction in GPU inference time per MD step for message-passing models

Non-message-passing models (e.g. se_e2_a) do not have this issue and are already CUDA-graph-capturable.
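The capture-then-replay flow listed above can be sketched in Python with torch.cuda.CUDAGraph, following the warm-up pattern from the PyTorch CUDA graphs notes. This is only an analogue of what the C++ plugin would do: make_replayable is a hypothetical helper, fn stands in for forward_lower, and the sketch assumes fixed input shapes and no autograd inside fn. On a machine without CUDA it simply returns fn unchanged.

```python
import torch


def make_replayable(fn, example_input: torch.Tensor, warmup: int = 3):
    """Capture fn once and return a callable that replays the graph."""
    if not torch.cuda.is_available():
        return fn  # eager fallback so the sketch also runs on CPU

    static_in = example_input.clone().cuda()

    # Warm up on a side stream before capturing.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(warmup):
            fn(static_in)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = fn(static_in)

    def replay(new_input: torch.Tensor) -> torch.Tensor:
        # Only a copy into the static input buffer, then one graph launch.
        static_in.copy_(new_input)
        graph.replay()
        return static_out.clone()

    return replay


if __name__ == "__main__":
    step = make_replayable(lambda x: (x * 2.0).sum(dim=-1), torch.ones(4, 3))
    print(step(torch.full((4, 3), 2.0)))
```

Any torch.tensor(scalar, device='cuda') call inside fn would be expected to abort the capture step here in the same way as reported above.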


Additional note: comm_dict path

Even after the above fix, it would be worth verifying that the full forward_lower call with comm_dict (used for ghost atom coordination on single-GPU LAMMPS runs) is fully CUDA-graph-capturable, as it may contain additional CPU↔GPU synchronisation points.
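One possible way to look for such synchronisation points before attempting capture is PyTorch's sync debug mode (torch.cuda.set_sync_debug_mode). The sketch below is an assumption about how that check could be wired up, with hypothetical helper names forbid_cuda_syncs and probe. It only reports synchronisations that PyTorch itself instruments, so a clean run is a hint rather than proof that capture will succeed.

```python
import contextlib

import torch


@contextlib.contextmanager
def forbid_cuda_syncs():
    """Make instrumented CUDA synchronisations raise inside the block."""
    if not torch.cuda.is_available():
        yield False  # nothing to check on a CPU-only machine
        return
    previous = torch.cuda.get_sync_debug_mode()
    torch.cuda.set_sync_debug_mode("error")
    try:
        yield True
    finally:
        torch.cuda.set_sync_debug_mode(previous)


def probe(fn, *args):
    """Run fn under the sync check; return the error text, or None if clean."""
    with forbid_cuda_syncs():
        try:
            fn(*args)
        except RuntimeError as err:
            return str(err)
    return None


if __name__ == "__main__":
    print(probe(lambda x: x + 1, torch.zeros(3)))
```

Running probe on the eager Python model with representative forward_lower inputs, with and without comm_dict, should point at the first offending operation if one is reported.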

