[Feature Request] `torch.tensor()` inside `repformers.forward()` prevents CUDA graph capture for DPA-2 models #5432

### Summary

CUDA graph capture of `forward_lower` fails for DPA-2 models because `repformers.forward()` calls `torch.tensor(0, dtype=..., device=nlist.device)` on every forward pass. The call builds a scalar on the host and copies it to the GPU each time, which is not permitted while a stream is capturing (`cudaErrorStreamCaptureUnsupported`). As a result, CUDA graph optimisation, which removes per-kernel launch overhead and is estimated to give a ~8–12× speedup on the GPU inference portion, is unavailable for any DPA-2 model loaded via the LAMMPS C++ plugin.

### Detailed Description

**Exact error**

When attempting to capture `forward_lower` on a dedicated non-default CUDA stream from `DeepPotPT.cc`:

```
RuntimeError: CUDA error: operation not permitted when stream is capturing
Search for `cudaErrorStreamCaptureUnsupported'

TorchScript traceback:
  File "repformers.py", line 151, in forward
    _13 = torch.tensor(0, dtype=ops.prim.dtype(nlist0),
                          device=ops.prim.device(nlist0))
          ~~~~~~~~~~~~ <--- HERE
    _14 = annotate(List[Optional[Tensor]], [_12])
    _15 = torch.index_put_(nlist0, _14, _13)
```

---

**Root cause**

In `deepmd/pt/model/descriptor/repformers.py`:

```python
# Runs on every forward pass: builds a scalar on the host and copies it to the GPU each call
zero = torch.tensor(0, dtype=nlist.dtype, device=nlist.device)
nlist = torch.index_put_(nlist, [nlist == -1], zero)
```

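`torch.tensor(scalar, device='cuda')` constructs the value on the host and then copies it to the GPU. That blocking host-to-device copy synchronises the stream, which is explicitly disallowed during CUDA graph capture. The allocation alone is unlikely to be the blocker, since PyTorch's caching allocator serves allocations made during capture from a graph-private memory pool. Because the call is baked into the frozen TorchScript IR of the `.pth` file, it cannot be patched from the C++ plugin side.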

---

**Proposed fix**

Replace the per-call tensor construction with a module buffer registered once at construction:

```python
# In Repformers.__init__:
self.register_buffer('_zero_idx', torch.zeros(1, dtype=torch.long))

# In forward, replace:
#   zero = torch.tensor(0, dtype=nlist.dtype, device=nlist.device)
# .to() returns the buffer itself when nlist is already int64; otherwise it is a device-side cast
zero = self._zero_idx.to(dtype=nlist.dtype)
nlist = torch.index_put_(nlist, [nlist == -1], zero)
```

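`register_buffer` creates the tensor once at construction; it moves to the GPU with the rest of the module and keeps a fixed address afterwards. With the host-to-device copy gone, the `index_put_` call should become CUDA-graph-capturable, though this is worth confirming with an actual capture, since boolean-mask indexing can introduce its own synchronisation in some code paths. Passing `persistent=False` to `register_buffer` would keep the new buffer out of `state_dict`, so existing checkpoints continue to load under strict key matching.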
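To check that the replacement does not change results, the two variants can be compared on a toy neighbour list. The following is a minimal sketch, not the actual `Repformers` code: `ZeroPadDemo` and its method names are hypothetical, `persistent=False` is an assumption about what the project would want, and the `masked_fill` variant is an additional option that avoids building a value tensor at all.

```python
import torch
from torch import nn


class ZeroPadDemo(nn.Module):
    """Hypothetical stand-in for the padding step in Repformers.forward."""

    def __init__(self) -> None:
        super().__init__()
        # Assumption: persistent=False keeps the buffer out of state_dict.
        self.register_buffer(
            "_zero_idx", torch.zeros(1, dtype=torch.long), persistent=False
        )

    def forward_original(self, nlist: torch.Tensor) -> torch.Tensor:
        # Pattern from the issue: the scalar is rebuilt on every call.
        zero = torch.tensor(0, dtype=nlist.dtype, device=nlist.device)
        return torch.index_put_(nlist.clone(), [nlist == -1], zero)

    def forward_buffer(self, nlist: torch.Tensor) -> torch.Tensor:
        # Proposed fix: reuse the buffer created in __init__.
        zero = self._zero_idx.to(dtype=nlist.dtype)
        return torch.index_put_(nlist.clone(), [nlist == -1], zero)

    def forward_masked_fill(self, nlist: torch.Tensor) -> torch.Tensor:
        # Alternative: the fill value is a Python scalar, so no tensor is built.
        return nlist.masked_fill(nlist == -1, 0)


demo = ZeroPadDemo()
nlist = torch.tensor([[0, 2, -1], [1, -1, -1]])
reference = demo.forward_original(nlist)
same_buffer = torch.equal(demo.forward_buffer(nlist), reference)
same_masked = torch.equal(demo.forward_masked_fill(nlist), reference)
```

The sketch clones `nlist` so the input stays untouched for the comparison; the real code modifies it in place. It only shows numerical equivalence on whatever device the tensors live on and does not by itself prove that any variant is capture-safe.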

The same pattern should be audited across all descriptor layers for any other `torch.tensor(scalar, device='cuda')` calls that run inside `forward`.
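Such an audit could start from a small static check. The sketch below uses only the standard library; the function name `find_tensor_ctor_in_forward` is hypothetical, and it assumes that the relevant methods have names starting with `forward` and that the call is spelled literally as `torch.tensor(...)`, so aliased imports or helper wrappers would be missed.

```python
import ast


def find_tensor_ctor_in_forward(source: str, filename: str = "<string>"):
    """Return (function name, line number) for each torch.tensor(...) call
    found inside a function whose name starts with 'forward'."""
    hits = []
    tree = ast.parse(source, filename=filename)
    for func in ast.walk(tree):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if not func.name.startswith("forward"):
            continue
        for node in ast.walk(func):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "tensor"
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "torch"
            ):
                hits.append((func.name, node.lineno))
    return hits


# Small illustrative input; a real audit would read each descriptor file instead.
SAMPLE = '''
import torch

class Demo:
    def __init__(self):
        self.c = torch.tensor(1)

    def forward(self, nlist):
        zero = torch.tensor(0, dtype=nlist.dtype, device=nlist.device)
        return nlist

    def forward_lower(self, x):
        return torch.zeros(1)
'''

hits = find_tensor_ctor_in_forward(SAMPLE)
```

Calls in `__init__` are ignored on purpose, since they run once at construction rather than on every step. A hit is only a candidate for review: whether a given call actually blocks capture still has to be confirmed by attempting the capture.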

---

**Impact**

CUDA graphs are a standard PyTorch optimisation for repeated fixed-topology inference. Removing this incompatibility would allow:

- Capture of `forward_lower` in `DeepPotPT.cc` on a dedicated stream
- Replay with only a coordinate `copy_` per step instead of full kernel re-launch
- Estimated ~8–12× reduction in GPU inference time per MD step for message-passing models

Non-message-passing models (e.g. `se_e2_a`) do not have this issue and are already CUDA-graph-capturable.
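For reference, the capture-and-replay pattern described in the list above looks roughly as follows in Python; the plugin would do the equivalent in C++. This is a sketch under stated assumptions: `toy_forward` is a hypothetical stand-in for `forward_lower`, `GraphedCall` is an illustrative wrapper rather than anything in DeePMD-kit, and the code falls back to a plain eager call when no GPU is present.

```python
import torch


def toy_forward(coord: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for forward_lower: fixed shapes, no host sync.
    return (coord * coord).sum(dim=-1)


class GraphedCall:
    """Capture fn once on a static input, then replay it for new values."""

    def __init__(self, fn, example: torch.Tensor) -> None:
        self.fn = fn
        self.static_in = example.clone()
        self.static_out = None
        self.graph = None
        if example.is_cuda:
            # Warm up on a side stream before capturing.
            side = torch.cuda.Stream()
            side.wait_stream(torch.cuda.current_stream())
            with torch.cuda.stream(side):
                for _ in range(3):
                    fn(self.static_in)
            torch.cuda.current_stream().wait_stream(side)
            self.graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(self.graph):
                self.static_out = fn(self.static_in)

    def __call__(self, coord: torch.Tensor) -> torch.Tensor:
        if self.graph is None:
            return self.fn(coord)  # no GPU: plain eager call
        self.static_in.copy_(coord)  # refresh the captured input in place
        self.graph.replay()  # relaunch all captured kernels at once
        return self.static_out.clone()


device = "cuda" if torch.cuda.is_available() else "cpu"
coords = torch.arange(6, dtype=torch.float32, device=device).reshape(2, 3)
call = GraphedCall(toy_forward, torch.zeros_like(coords))
energy = call(coords)
```

Replay reuses the captured input and output buffers, so the input must be refreshed with `copy_` and the output copied out before the next step overwrites it. Any operation inside the captured function that synchronises with the host, such as the `torch.tensor` call discussed in this issue, makes the capture step fail.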

---

**Additional note: `comm_dict` path**

Even after the above fix, it would be worth verifying that the full `forward_lower` call with `comm_dict` (used for ghost atom coordination on single-GPU LAMMPS runs) is fully CUDA-graph-capturable, as it may contain additional CPU↔GPU synchronisation points.

---

**References**

- PyTorch CUDA Graphs: https://pytorch.org/docs/stable/notes/cuda.html#cuda-graphs
- `cudaErrorStreamCaptureUnsupported`: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html
- `torch.cuda.make_graphed_callables` requirements: all operations inside the callable must be graph-compatible

### Further Information, Files, and Links

_No response_
