We found a CUDA 13.1 ptxas codegen issue while running Marlin on an sm_90 GPU. The symptom is a runtime illegal-instruction trap in the Marlin<256,1,8,8,4,8> specialization, hit by the small-batch + groupsize-aware path.
Symptom
cudaErrorIllegalInstruction (error 716)
========= Illegal instruction
========= at void Marlin<256,1,8,8,4,8>(...)+0x1180
This is not an illegal address or misalignment issue. The failing instruction is an LDGSTS generated from inline cp.async.
Root cause
With CUDA Toolkit 13.1:
ptxas appears to silently drop:
createpolicy.fractional.L2::evict_first.b64
while keeping dependent:
cp.async.cg.shared.global.L2::cache_hint
In SASS, the affected LDGSTS instructions read an uninitialized uniform register as the L2 cache descriptor:
30 x LDGSTS ... desc[UR1] ...
0 x definition/materialization of UR1
So the generated SASS contains desc[UR1] users but no corresponding ULDC / materialization of UR1. At runtime, the hardware traps while decoding that invalid descriptor.
The PTX still contains the expected createpolicy instructions, so the C++ -> PTX stage looks correct. The bug seems to happen in ptxas during PTX -> SASS lowering/optimization.
The sibling specialization:
does not show the same issue; its descriptor registers are properly materialized in SASS.
Repro shape
A minimal Marlin-level repro is:
import torch, marlin
torch.cuda.set_device(0)
dev = "cuda:0"
m, k, n, gs = 1, 4096, 4096, 128
A = torch.randn((m, k), dtype=torch.half, device=dev)
B = torch.randint(
low=-(2**31),
high=2**31,
size=(k * n // 8,),
dtype=torch.int,
device=dev,
)
C = torch.zeros((m, n), dtype=torch.half, device=dev)
s = torch.zeros((k // gs, n), dtype=torch.half, device=dev)
workspace = torch.zeros(n // 128 * 16, dtype=torch.int, device=dev)
marlin.mul(A, B, C, s, workspace, -1, -1, -1)
torch.cuda.synchronize()
We found a CUDA 13.1 ptxas codegen issue while running Marlin on an sm_90 GPU. The symptom is a runtime illegal-instruction trap in the
Marlin<256,1,8,8,4,8>specialization, hit by the small-batch + groupsize-aware path.Symptom
This is not an illegal address or misalignment issue. The failing instruction is an
LDGSTSgenerated from inlinecp.async.Root cause
With CUDA Toolkit 13.1:
ptxas appears to silently drop:
while keeping dependent:
In SASS, the affected
LDGSTSinstructions read an uninitialized uniform register as the L2 cache descriptor:So the generated SASS contains
desc[UR1]users but no correspondingULDC/ materialization ofUR1. At runtime, the hardware traps while decoding that invalid descriptor.The PTX still contains the expected
createpolicyinstructions, so the C++ -> PTX stage looks correct. The bug seems to happen in ptxas during PTX -> SASS lowering/optimization.The sibling specialization:
does not show the same issue; its descriptor registers are properly materialized in SASS.
Repro shape
A minimal Marlin-level repro is: