
[BUG] CuTe DSL 4.5: cute.domain_offset truncates large dynamic offsets through i32 #3208

@tridao

Which component has the problem?

CuTe DSL

Bug Report

In CuTe DSL 4.5.0, cute.domain_offset on a gmem tensor with dynamic strides computes the offset in i64 but then truncates it to i32 before the final GEP. When the offset (coordinate times stride) does not fit in i32, the resulting pointer is wrong.

The minimal runtime reproducer below is a single direct store through domain_offset. It passes on CuTe DSL 4.4.2 and fails on 4.5.0. I found this when updating CuTe DSL to 4.5.0 broke some tests in quack.

Reproducer

import torch, cutlass
import cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack

BIG, STRIDE = 2**30, 4

@cute.kernel
def k(t):
    t[0] = cutlass.Uint8(100)

@cute.jit
def launch(view):
    k(cute.domain_offset((BIG,), view)).launch(grid=[1, 1, 1], block=[1, 1, 1])

out = torch.zeros((BIG * STRIDE + 1,), device="cuda", dtype=torch.uint8)
view = torch.as_strided(out, (1,), (STRIDE,))
launch(from_dlpack(view).mark_layout_dynamic())
torch.cuda.synchronize()

print("cutlass", cutlass.__version__, "out[0]", int(out[0].cpu()), "target", int(out[BIG * STRIDE].cpu()))
assert int(out[BIG * STRIDE].cpu()) == 100

view has stride 4, so cute.domain_offset((2**30,), view) should point at out[2**30 * 4]. The kernel writes 100 there. On 4.5.0, the offset wraps to zero and the kernel writes out[0] instead.
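For reference, the expected element offset can be checked on the host with plain Python. This is only a sketch of the arithmetic: linear_offset is a hypothetical helper (not a CuTe DSL API) that takes the inner product of a coordinate and a stride, and INT32_MAX is defined here for the comparison.

```python
# Hypothetical helper, not part of CuTe DSL: element offset of a coordinate
# under a stride, i.e. the inner product of the two tuples.
def linear_offset(coord, stride):
    return sum(c * s for c, s in zip(coord, stride))

BIG, STRIDE = 2**30, 4
INT32_MAX = 2**31 - 1

offset = linear_offset((BIG,), (STRIDE,))

# The offset the reproducer expects, and whether it still fits in a signed i32.
print("expected element offset:", offset)
print("fits in i32:", offset <= INT32_MAX)
```

Any offset above INT32_MAX needs a 64-bit index to be addressed correctly, which is the case this reproducer exercises.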

Observed

CuTe DSL 4.4.2:

cutlass 4.4.2 out[0] 0 target 100

CuTe DSL 4.5.0:

cutlass 4.5.0 out[0] 100 target 0
AssertionError

Lowering issue

The relevant 4.5.0 host LLVM contains this pattern for cute.domain_offset / _cute_ir.add_offset:

%13 = llvm.mul %12, %5 : i64
%14 = llvm.trunc %13 : i64 to i32
%15 = llvm.sext %14 : i32 to i64
%16 = llvm.getelementptr %10[%15] : (!llvm.ptr<1>, i64) -> !llvm.ptr<1>, ...

The offset is computed as i64 but then narrowed to i32 before the GEP. In this reproducer, 2**30 * 4 == 2**32, so the i32 truncation wraps the pointer adjustment to zero.
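The effect of the trunc / sext pair can be emulated in plain Python. This is a sketch of the integer arithmetic only, not of the actual lowering; trunc_sext_i32 is a hypothetical name introduced for illustration.

```python
# Hypothetical helper: emulate llvm.trunc (i64 -> i32) followed by
# llvm.sext (i32 -> i64) on a Python integer.
def trunc_sext_i32(x):
    low = x & 0xFFFFFFFF          # keep the low 32 bits (trunc)
    if low & 0x80000000:          # sign bit set: extend as a negative value (sext)
        return low - (1 << 32)
    return low

# Small offsets survive; offsets at or above 2**31 do not.
for off in (4, 2**31 - 1, 2**31, 2**30 * 4, 2**32 + 4):
    print(off, "->", trunc_sext_i32(off))
```

Offsets in [2**31, 2**32) become negative after the sign extension, so this pattern can also move the pointer backwards, not just wrap it to zero as in the reproducer.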

Expected: keep the GEP index as i64, or perform the pointer adjustment using i64 byte-address arithmetic.
