nvcc 13.1 ptxas miscompile on sm_90 drops createpolicy for cp.async cache_hint #44

@huangzhilin-hzl

Description

We found a CUDA 13.1 ptxas codegen issue while running Marlin on an sm_90 GPU. The symptom is a runtime illegal-instruction trap in the Marlin<256,1,8,8,4,8> specialization, hit by the small-batch + groupsize-aware path.

Symptom

cudaErrorIllegalInstruction (error 716)
========= Illegal instruction
=========     at void Marlin<256,1,8,8,4,8>(...)+0x1180

This is not an illegal-address or misalignment issue. The faulting instruction is an LDGSTS generated from an inline cp.async.

Root cause

With CUDA Toolkit 13.1:

nvcc V13.1.80

ptxas appears to silently drop:

createpolicy.fractional.L2::evict_first.b64

while keeping the dependent:

cp.async.cg.shared.global.L2::cache_hint

In SASS, the affected LDGSTS instructions read an uninitialized uniform register as the L2 cache descriptor:

30 x LDGSTS ... desc[UR1] ...
 0 x definitions/materializations of UR1

So the generated SASS contains desc[UR1] users but no corresponding ULDC / materialization of UR1. At runtime, the hardware traps while decoding that invalid descriptor.

The PTX still contains the expected createpolicy instructions, so the C++ -> PTX stage looks correct. The bug seems to happen in ptxas during PTX -> SASS lowering/optimization.
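For reference, the affected PTX pattern looks roughly like the sketch below (paraphrased from Marlin's cp.async helpers; the function name and byte count here are illustrative, not Marlin's exact code). The point is the data dependence: `createpolicy` writes the L2 policy into a `.b64` register that the following `cp.async` consumes as its cache-hint operand, so dropping the former while keeping the latter leaves the descriptor register undefined:

```cuda
#include <cstdint>

// Illustrative sketch of the createpolicy + cp.async pairing involved.
// createpolicy materializes an L2 eviction policy into register p; the
// cp.async then reads p as its L2::cache_hint operand. If ptxas deletes
// the createpolicy but keeps the cp.async, the descriptor register is
// never written -- matching the uninitialized desc[UR1] seen in SASS.
__device__ inline void cp_async_evict_first(void* smem_ptr,
                                            const void* glob_ptr) {
  const int BYTES = 16;  // illustrative transfer size
  uint32_t smem =
      static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  asm volatile(
      "{\n"
      "   .reg .b64 p;\n"
      "   createpolicy.fractional.L2::evict_first.b64 p, 1.0;\n"
      "   cp.async.cg.shared.global.L2::cache_hint [%0], [%1], %2, p;\n"
      "}\n" ::"r"(smem),
      "l"(glob_ptr), "n"(BYTES));
}
```

The missing materialization can be confirmed by dumping the SASS (e.g. `cuobjdump -sass` on the compiled object) and checking that `desc[UR1]` uses appear with no preceding write to `UR1`.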

The sibling specialization:

Marlin<256,1,8,8,4,-1>

does not show the same issue; its descriptor registers are properly materialized in SASS.

Repro shape

A minimal Marlin-level repro is:

import torch, marlin

torch.cuda.set_device(0)
dev = "cuda:0"

# m = 1 (small batch) with groupsize 128 selects the failing
# Marlin<256,1,8,8,4,8> specialization.
m, k, n, gs = 1, 4096, 4096, 128

A = torch.randn((m, k), dtype=torch.half, device=dev)
# Packed 4-bit weights: 8 values per int32, hence k * n // 8 entries.
B = torch.randint(
    low=-(2**31),
    high=2**31,
    size=(k * n // 8,),
    dtype=torch.int,
    device=dev,
)
C = torch.zeros((m, n), dtype=torch.half, device=dev)
s = torch.zeros((k // gs, n), dtype=torch.half, device=dev)  # per-group scales
workspace = torch.zeros(n // 128 * 16, dtype=torch.int, device=dev)

marlin.mul(A, B, C, s, workspace, -1, -1, -1)
torch.cuda.synchronize()  # surfaces cudaErrorIllegalInstruction (716)
