Summary
CUDA graph capture of forward_lower fails for DPA-2 models because repformers.forward() calls torch.tensor(0, dtype=..., device='cuda') on every forward pass. Building a new GPU tensor from a host scalar at runtime is not permitted during CUDA graph capture (cudaErrorStreamCaptureUnsupported). As a result, CUDA graph optimisation, which removes per-kernel launch overhead and is estimated to give a ~8–12× speedup on the GPU inference portion, is unavailable for any DPA-2 model loaded via the LAMMPS C++ plugin.
Detailed Description
Exact error
When attempting to capture forward_lower on a dedicated non-default CUDA stream from DeepPotPT.cc:
RuntimeError: CUDA error: operation not permitted when stream is capturing
Search for `cudaErrorStreamCaptureUnsupported'
TorchScript traceback:
  File "repformers.py", line 151, in forward
    _13 = torch.tensor(0, dtype=ops.prim.dtype(nlist0),
                       device=ops.prim.device(nlist0))
          ~~~~~~~~~~~~ <--- HERE
    _14 = annotate(List[Optional[Tensor]], [_12])
    _15 = torch.index_put_(nlist0, _14, _13)
Root cause
In deepmd/pt/model/descriptor/repformers.py:
# Runs on every forward pass — allocates a new GPU tensor each call:
zero = torch.tensor(0, dtype=nlist.dtype, device=nlist.device)
nlist = torch.index_put_(nlist, [nlist == -1], zero)
torch.tensor(scalar, device='cuda') builds the value on the host and then copies it into a freshly allocated GPU tensor. That host-to-device transfer requires CPU↔GPU synchronisation, which is explicitly disallowed during CUDA graph capture. Because the call is baked into the frozen TorchScript IR of the .pth file, it cannot be patched from the C++ plugin side.
Proposed fix
Replace the dynamic allocation with a persistent module buffer registered at construction:
# In Repformers.__init__ (the buffer moves to the GPU together with the module):
self.register_buffer('_zero_idx', torch.zeros(1, dtype=torch.long))
# In forward, replace:
#   zero = torch.tensor(0, dtype=nlist.dtype, device=nlist.device)
# with:
zero = self._zero_idx.to(dtype=nlist.dtype)  # no host-to-device copy; a no-op if nlist is already torch.long
nlist = torch.index_put_(nlist, [nlist == -1], zero)
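register_buffer creates the tensor once at construction. It moves to the GPU with the module at model load and keeps a fixed address afterwards, so forward no longer builds a tensor from a host scalar. The index_put_ call is then expected to be CUDA-graph-capturable, which should be confirmed with an actual capture test.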
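A possible alternative, offered only as a sketch and not taken from the DeePMD-kit source, is to avoid the auxiliary tensor entirely with masked_fill or torch.where. The helper names below (mask_padding_original, mask_padding_masked_fill, mask_padding_where) are hypothetical, and the code assumes padding entries in nlist are marked with -1 as in the snippet above. It only checks on CPU that the three forms agree; none of them has been tested here under CUDA graph capture.

```python
import torch


def mask_padding_original(nlist: torch.Tensor) -> torch.Tensor:
    # Form quoted from repformers.py: builds a fresh scalar tensor on every call.
    # The clone keeps the caller's tensor untouched for this comparison.
    zero = torch.tensor(0, dtype=nlist.dtype, device=nlist.device)
    return torch.index_put_(nlist.clone(), [nlist == -1], zero)


def mask_padding_masked_fill(nlist: torch.Tensor) -> torch.Tensor:
    # The fill value is a Python scalar, so no extra tensor is created from host data.
    return nlist.masked_fill(nlist == -1, 0)


def mask_padding_where(nlist: torch.Tensor) -> torch.Tensor:
    # zeros_like allocates on the same device as nlist without a host-side value.
    return torch.where(nlist == -1, torch.zeros_like(nlist), nlist)


nlist = torch.tensor([[1, 2, -1], [-1, 0, 3]])
for fn in (mask_padding_original, mask_padding_masked_fill, mask_padding_where):
    print(fn.__name__, fn(nlist).tolist())
```

All three helpers return a new tensor, so the in-place semantics of the original index_put_ call would need to be checked against how nlist is used later in forward.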
The same pattern should be audited across all descriptor layers for any other torch.tensor(scalar, device='cuda') calls that run inside forward.
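One way to start such an audit is a small static scan. The sketch below uses only the Python standard library to list torch.tensor(...) calls that appear inside any function named forward. The function name find_tensor_calls_in_forward and the EXAMPLE source are made up for this illustration, and the scan is deliberately crude: it does not follow helper functions called from forward, and it misses aliases of torch.

```python
import ast


def find_tensor_calls_in_forward(source: str, filename: str = "<string>") -> list:
    """Return sorted line numbers of torch.tensor(...) calls inside functions named 'forward'."""
    hits = set()
    tree = ast.parse(source, filename=filename)
    for func in ast.walk(tree):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if func.name != "forward":
            continue
        for node in ast.walk(func):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "tensor"
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "torch"
            ):
                hits.add(node.lineno)
    return sorted(hits)


# Hypothetical module used only to exercise the scanner.
EXAMPLE = '''
import torch
class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = torch.tensor(1.0)  # construction time: not reported
    def forward(self, nlist):
        zero = torch.tensor(0, dtype=nlist.dtype, device=nlist.device)
        return torch.index_put_(nlist, [nlist == -1], zero)
'''

print(find_tensor_calls_in_forward(EXAMPLE))
```

Running the same function over each .py file under deepmd/pt/model/descriptor/ would give a first list of candidates to inspect by hand.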
Impact
CUDA graphs are a standard PyTorch optimisation for repeated fixed-topology inference. Removing this incompatibility would allow:
- Capture of forward_lower in DeepPotPT.cc on a dedicated stream
- Replay with only a coordinate copy_ per step instead of full kernel re-launch
- Estimated ~8–12× reduction in GPU inference time per MD step for message-passing models
Non-message-passing models (e.g. se_e2_a) do not have this issue and are already CUDA-graph-capturable.
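The capture-and-replay pattern listed above can be sketched in Python with torch.cuda.CUDAGraph. This is an illustration only, not the DeepPotPT.cc implementation: GraphedCallable is a hypothetical wrapper, the model is a stand-in torch.nn.Linear rather than forward_lower, fixed input shapes are assumed, and gradients are disabled for simplicity. When the example input is not on a CUDA device, the wrapper falls back to eager execution.

```python
import torch


class GraphedCallable:
    """Capture fn once, then replay it with only an input copy_ per call."""

    def __init__(self, fn, example_input: torch.Tensor, warmup: int = 3):
        self.fn = fn
        self.static_in = example_input.clone()  # fixed-address input buffer
        self.static_out = None
        self.graph = None
        if not example_input.is_cuda:
            return  # eager fallback: nothing to capture
        # Warm up on a side stream before capture so lazy initialisation is done.
        side = torch.cuda.Stream()
        side.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side), torch.no_grad():
            for _ in range(warmup):
                self.fn(self.static_in)
        torch.cuda.current_stream().wait_stream(side)
        # Capture a single call; torch.cuda.graph runs it on its own capture stream.
        self.graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(self.graph), torch.no_grad():
            self.static_out = self.fn(self.static_in)

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        if self.graph is None:
            with torch.no_grad():
                return self.fn(x)
        self.static_in.copy_(x)  # the only per-step work besides the replay
        self.graph.replay()
        return self.static_out.clone()


device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(4, 2).to(device)  # stand-in for the real model
graphed = GraphedCallable(model, torch.randn(3, 4, device=device))
print(graphed(torch.randn(3, 4, device=device)).shape)
```

Any operation inside fn that is not permitted during capture, such as the torch.tensor call discussed above, would make the torch.cuda.graph block raise.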
Additional note: comm_dict path
Even after the above fix, it would be worth verifying that the full forward_lower call with comm_dict (used for ghost atom coordination on single-GPU LAMMPS runs) is fully CUDA-graph-capturable, as it may contain additional CPU↔GPU synchronisation points.
References
- cudaErrorStreamCaptureUnsupported: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html
- torch.cuda.make_graphed_callables requirements: all operations inside the callable must be graph-compatible
Further Information, Files, and Links
No response