Which component has the problem?
CuTe DSL
Bug Report
In CuTe DSL 4.5.0, cute.domain_offset on a gmem tensor with a large dynamic stride product can compute the offset in i64 but truncate it to i32 before the final GEP. When the offset does not fit in i32, the resulting pointer is wrong.
The minimal runtime reproducer below is a single direct store through domain_offset. It passes on CuTe DSL 4.4.2 and fails on 4.5.0. I found this when updating CuTe DSL to 4.5.0 broke some tests in quack.
Reproducer
import torch, cutlass
import cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack

BIG, STRIDE = 2**30, 4

@cute.kernel
def k(t):
    t[0] = cutlass.Uint8(100)

@cute.jit
def launch(view):
    k(cute.domain_offset((BIG,), view)).launch(grid=[1, 1, 1], block=[1, 1, 1])

out = torch.zeros((BIG * STRIDE + 1,), device="cuda", dtype=torch.uint8)
view = torch.as_strided(out, (1,), (STRIDE,))
launch(from_dlpack(view).mark_layout_dynamic())
torch.cuda.synchronize()
print("cutlass", cutlass.__version__, "out[0]", int(out[0].cpu()), "target", int(out[BIG * STRIDE].cpu()))
assert int(out[BIG * STRIDE].cpu()) == 100
view has stride 4, so cute.domain_offset((2**30,), view) should point at out[2**30 * 4]. The kernel writes 100 there. On 4.5.0, the offset wraps to zero and the kernel writes out[0] instead.
Observed
CuTe DSL 4.4.2:
cutlass 4.4.2 out[0] 0 target 100
CuTe DSL 4.5.0:
cutlass 4.5.0 out[0] 100 target 0
AssertionError
Lowering issue
The relevant 4.5.0 host LLVM contains this pattern for cute.domain_offset / _cute_ir.add_offset:
%13 = llvm.mul %12, %5 : i64
%14 = llvm.trunc %13 : i64 to i32
%15 = llvm.sext %14 : i32 to i64
%16 = llvm.getelementptr %10[%15] : (!llvm.ptr<1>, i64) -> !llvm.ptr<1>, ...
The offset is computed as i64 but then narrowed to i32 before the GEP. In this reproducer, 2**30 * 4 == 2**32, so the i32 truncation wraps the pointer adjustment to zero.
Expected: keep the GEP index as i64, or perform the pointer adjustment using i64 byte-address arithmetic.
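For reference, the wraparound can be checked on the host without a GPU. The sketch below models the trunc/sext pair from the lowering with plain Python integers. The helper names are mine and purely illustrative; they are not part of CuTe DSL or LLVM.

```python
# Host-side model of the i64 -> i32 -> i64 round trip seen in the 4.5.0 lowering.
# All helper names here are hypothetical; nothing below calls CuTe DSL.

def trunc_i64_to_i32(x: int) -> int:
    """Keep only the low 32 bits, as llvm.trunc i64 -> i32 does."""
    return x & 0xFFFFFFFF

def sext_i32_to_i64(x: int) -> int:
    """Reinterpret a 32-bit pattern as signed and widen it, as llvm.sext does."""
    return x - (1 << 32) if x & 0x80000000 else x

def gep_index_with_truncation(coord: int, stride: int) -> int:
    """Index that reaches the GEP when the i64 product is narrowed to i32 first."""
    return sext_i32_to_i64(trunc_i64_to_i32(coord * stride))

BIG, STRIDE = 2**30, 4
full_index = BIG * STRIDE                              # index if the i64 product is kept
narrowed_index = gep_index_with_truncation(BIG, STRIDE)  # index after trunc + sext
print("i64 index:", full_index, "after trunc/sext:", narrowed_index)
```

Products between 2**31 and 2**32 would not wrap to zero in this model but would come out negative, for example gep_index_with_truncation(2**29, 4) gives -2**31, so the pointer would move backwards instead.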