nvcc 13.1 ptxas miscompile on sm_90 drops createpolicy for cp.async cache_hint

We found a CUDA 13.1 ptxas codegen issue while running Marlin on an sm_90 GPU. The symptom is a runtime illegal-instruction trap in the `Marlin<256,1,8,8,4,8>` specialization, hit by the small-batch + groupsize-aware path.

### Symptom

```text
cudaErrorIllegalInstruction (error 716)
========= Illegal instruction
=========     at void Marlin<256,1,8,8,4,8>(...)+0x1180
```

This is not an illegal address or misalignment issue. The failing instruction is an `LDGSTS` generated from inline `cp.async`.

### Root cause

With CUDA Toolkit 13.1:

```text
nvcc V13.1.80
```

ptxas appears to silently drop:

```ptx
createpolicy.fractional.L2::evict_first.b64
```

while keeping dependent:

```ptx
cp.async.cg.shared.global.L2::cache_hint
```

In SASS, the affected `LDGSTS` instructions read an uninitialized uniform register as the L2 cache descriptor:

```text
30 x LDGSTS ... desc[UR1] ...
0 x definition/materialization of UR1
```

So the generated SASS contains `desc[UR1]` users but no corresponding `ULDC` / materialization of `UR1`. At runtime, the hardware traps while decoding that invalid descriptor.

The PTX still contains the expected `createpolicy` instructions, so the C++ -> PTX stage looks correct. The bug seems to happen in ptxas during PTX -> SASS lowering/optimization.

The sibling specialization:

```text
Marlin<256,1,8,8,4,-1>
```

does not show the same issue; its descriptor registers are properly materialized in SASS.

### Repro shape

A minimal Marlin-level repro is:

```python
import torch, marlin

torch.cuda.set_device(0)
dev = "cuda:0"

m, k, n, gs = 1, 4096, 4096, 128

A = torch.randn((m, k), dtype=torch.half, device=dev)
B = torch.randint(
    low=-(2**31),
    high=2**31,
    size=(k * n // 8,),
    dtype=torch.int,
    device=dev,
)
C = torch.zeros((m, n), dtype=torch.half, device=dev)
s = torch.zeros((k // gs, n), dtype=torch.half, device=dev)
workspace = torch.zeros(n // 128 * 16, dtype=torch.int, device=dev)

marlin.mul(A, B, C, s, workspace, -1, -1, -1)
torch.cuda.synchronize()
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

nvcc 13.1 ptxas miscompile on sm_90 drops createpolicy for cp.async cache_hint #44

Symptom

Root cause

Repro shape

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

nvcc 13.1 ptxas miscompile on sm_90 drops createpolicy for cp.async cache_hint #44

Description

Symptom

Root cause

Repro shape

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions